* [PATCH v5 0/5] PCI Shared Memory device
@ 2010-04-21 17:53 ` Cam Macdonell
  0 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-04-21 17:53 UTC (permalink / raw)
  To: kvm; +Cc: qemu-devel, Cam Macdonell

Latest patch series for the PCI shared memory device, which maps a host shared
memory object so that it can be shared between guests.

New in this series (v5):
    - fixed segfault for non-server case
    - code style fixes
    - removed limit on the number of guests
    - shared memory server is now in qemu.git/contrib
    - made irqfd/ioeventfd setup functions generic
    - removed interrupts when guest joined (let application handle it)

    v4:
    - moved to a single Doorbell register, using datamatch to trigger different
      VMs rather than one register per eventfd
    - removed writing of arbitrary values to eventfds; only values of 1 are now
      written, to ensure correct usage

Cam Macdonell (5):
  Device specification for shared memory PCI device
  Support adding a file to qemu's ram allocation
  Adds two new functions for assigning ioeventfd and irqfds.
  Inter-VM shared memory PCI device
  the stand-alone shared memory server for inter-VM shared memory

 Makefile.target                         |    3 +
 contrib/ivshmem-server/Makefile         |   16 +
 contrib/ivshmem-server/README           |   30 ++
 contrib/ivshmem-server/ivshmem_server.c |  339 ++++++++++++++
 contrib/ivshmem-server/send_scm.c       |  208 +++++++++
 contrib/ivshmem-server/send_scm.h       |   19 +
 cpu-common.h                            |    2 +
 docs/specs/ivshmem_device_spec.txt      |   91 ++++
 exec.c                                  |   36 ++
 hw/ivshmem.c                            |  728 +++++++++++++++++++++++++++++++
 kvm-all.c                               |   44 ++
 kvm.h                                   |   14 +
 qemu-char.c                             |    6 +
 qemu-char.h                             |    3 +
 qemu-doc.texi                           |   25 +
 15 files changed, 1564 insertions(+), 0 deletions(-)
 create mode 100644 contrib/ivshmem-server/Makefile
 create mode 100644 contrib/ivshmem-server/README
 create mode 100644 contrib/ivshmem-server/ivshmem_server.c
 create mode 100644 contrib/ivshmem-server/send_scm.c
 create mode 100644 contrib/ivshmem-server/send_scm.h
 create mode 100644 docs/specs/ivshmem_device_spec.txt
 create mode 100644 hw/ivshmem.c


* [PATCH v5 1/5] Device specification for shared memory PCI device
  2010-04-21 17:53 ` [Qemu-devel] " Cam Macdonell
@ 2010-04-21 17:53   ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-04-21 17:53 UTC (permalink / raw)
  To: kvm; +Cc: qemu-devel, Cam Macdonell

---
 docs/specs/ivshmem_device_spec.txt |   91 ++++++++++++++++++++++++++++++++++++
 1 files changed, 91 insertions(+), 0 deletions(-)
 create mode 100644 docs/specs/ivshmem_device_spec.txt

diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
new file mode 100644
index 0000000..b955b22
--- /dev/null
+++ b/docs/specs/ivshmem_device_spec.txt
@@ -0,0 +1,91 @@
+
+Device Specification for Inter-VM shared memory device
+------------------------------------------------------
+
+The Inter-VM shared memory device is designed to share a region of memory with
+userspace in multiple guests.  The memory region does not belong to any guest,
+but is a POSIX shared memory object on the host.  Optionally, the device may
+support sending interrupts to other guests sharing the same memory region.
+
+
+The Inter-VM PCI device
+-----------------------
+
+*BARs*
+
+The device supports three BARs.  BAR0 is a 1 Kbyte MMIO region containing the
+device registers.  BAR1 is used for MSI-X when it is enabled in the device.
+BAR2 is used to map the shared memory object from the host.  The size of BAR2
+is specified when the guest is started and must be a power of two.
+
+*Registers*
+
+The device currently supports four registers of 32 bits each.  The registers
+are used for synchronization between guests sharing the same memory object
+when interrupts are supported (this requires using the shared memory server).
+
+The server assigns each VM an ID number and sends this ID number to the QEMU
+process when the guest starts.
+
+enum ivshmem_registers {
+    IntrMask = 0,
+    IntrStatus = 4,
+    IVPosition = 8,
+    Doorbell = 12
+};
+
+The first two registers are the interrupt mask and status registers.  Mask and
+status are only used with pin-based interrupts.  They are unused with MSI
+interrupts.
+
+Status Register: The status register is set to 1 when an interrupt occurs.
+
+Mask Register: The mask register is bitwise ANDed with the interrupt status,
+and an interrupt is raised if the result is non-zero.  However, since 1 is the
+only value the status is ever set to, only the first bit of the mask has any
+effect.  Interrupts can therefore be masked by clearing the first bit and
+unmasked by setting the first bit to 1.
+
+IVPosition Register: The IVPosition register is read-only and reports the
+guest's ID number.
+
+Doorbell Register:  To interrupt another guest, a guest must write to the
+Doorbell register.  The doorbell register is 32 bits wide, logically divided
+into two 16-bit fields.  The high 16 bits are the guest ID to interrupt and
+the low 16 bits are the interrupt vector to trigger.  The semantics of the
+value written to the doorbell depend on whether the device is using MSI or a
+regular pin-based interrupt.  In short, MSI uses vectors while regular
+interrupts set the status register.
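+
+As an illustration (this snippet is not part of the specification; "regs" is a
+hypothetical pointer to the guest's mapping of BAR0, viewed as an array of
+32-bit registers), a guest driver could unmask interrupts and then interrupt
+vector 0 of the guest with ID 2 as follows:
+
+    regs[IntrMask / 4] = 1;               /* unmask (pin-based interrupts) */
+    regs[Doorbell / 4] = (2 << 16) | 0;   /* ring vector 0 of guest 2 */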
+
+Regular Interrupts
+
+If regular interrupts are used (because the guest does not support MSI, or the
+user specified not to use MSI on startup), then the value written to the lower
+16 bits of the Doorbell register is arbitrary; any write will trigger an
+interrupt in the destination guest.
+
+Message Signalled Interrupts
+
+An ivshmem device may support multiple MSI vectors.  If so, the lower 16 bits
+written to the Doorbell register must be between 0 and the maximum number of
+vectors the guest supports.  The lower 16 bits written to the doorbell specify
+the MSI vector that will be raised in the destination guest.  The number of
+MSI vectors is configurable, but it is fixed when the VM is started.
+
+The important thing to remember with MSI is that it is only a signal; no
+status is set (since MSI interrupts are not shared).  All information other
+than the interrupt itself should be communicated via the shared memory region.
+Devices supporting multiple MSI vectors can use different vectors to indicate
+that different events have occurred.  The semantics of interrupt vectors are
+left to the user's discretion.
+
+
+Usage in the Guest
+------------------
+
+The shared memory device is intended to be used with the provided UIO driver.
+Very little configuration is needed.  The guest should map BAR0 to access the
+registers (an array of 32-bit ints allows simple writing) and map BAR2 to
+access the shared memory region itself.  The size of the shared memory region
+is specified when the guest (or shared memory server) is started.  A guest may
+map the whole shared memory region or only part of it.
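+
+As an illustration (not part of this specification; the PCI bus address below
+is an example only), a userspace process in the guest can also map the region
+through PCI sysfs instead of the UIO driver:
+
+    int fd = open("/sys/bus/pci/devices/0000:00:04.0/resource2", O_RDWR);
+    void *shm = mmap(NULL, shm_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+                     fd, 0);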
-- 
1.6.2.5


* [PATCH v5 2/5] Support adding a file to qemu's ram allocation
  2010-04-21 17:53   ` [Qemu-devel] " Cam Macdonell
@ 2010-04-21 17:53     ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-04-21 17:53 UTC (permalink / raw)
  To: kvm; +Cc: qemu-devel, Cam Macdonell

This avoids the need to use qemu_ram_alloc and mmap with MAP_FIXED to map a
host file into guest RAM.  The new function mmaps the opened file at an address
chosen by the kernel and adds the memory to the ram blocks.

Usage is

qemu_ram_mmap(fd, size, MAP_SHARED, offset);
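
For example, the ivshmem device later in this series uses it roughly as follows
(illustrative fragment only; error handling omitted):

    s->shm_fd = fd;                        /* fd of the host shm object */
    s->ivshmem_offset = qemu_ram_mmap(s->shm_fd, s->ivshmem_size,
                                      MAP_SHARED, 0);
    pci_register_bar(&s->dev, 2, s->ivshmem_size,
                     PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);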
---
 cpu-common.h |    2 ++
 exec.c       |   36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/cpu-common.h b/cpu-common.h
index 49c7fb3..d7c7d3a 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -9,6 +9,7 @@
 
 #include "bswap.h"
 #include "qemu-queue.h"
+#include <sys/types.h>
 
 #if !defined(CONFIG_USER_ONLY)
 
@@ -32,6 +33,7 @@ static inline void cpu_register_physical_memory(target_phys_addr_t start_addr,
 }
 
 ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
+ram_addr_t qemu_ram_mmap(int fd, ram_addr_t addr, int flags, off_t offset);
 ram_addr_t qemu_ram_alloc(ram_addr_t);
 void qemu_ram_free(ram_addr_t addr);
 /* This should only be used for ram local to a device.  */
diff --git a/exec.c b/exec.c
index 467a0e7..702348e 100644
--- a/exec.c
+++ b/exec.c
@@ -2811,6 +2811,42 @@ static void *file_ram_alloc(ram_addr_t memory, const char *path)
 }
 #endif
 
+ram_addr_t qemu_ram_mmap(int fd, ram_addr_t size, int flags, off_t offset)
+{
+    RAMBlock *new_block;
+
+    size = TARGET_PAGE_ALIGN(size);
+    new_block = qemu_malloc(sizeof(*new_block));
+
+    /* map the file passed as a parameter to be this part of memory */
+    new_block->host = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, offset);
+
+    if (new_block->host == MAP_FAILED)
+        exit(1);
+
+#ifdef MADV_MERGEABLE
+    madvise(new_block->host, size, MADV_MERGEABLE);
+#endif
+
+    new_block->offset = last_ram_offset;
+    new_block->length = size;
+
+    new_block->next = ram_blocks;
+    ram_blocks = new_block;
+
+    phys_ram_dirty = qemu_realloc(phys_ram_dirty,
+        (last_ram_offset + size) >> TARGET_PAGE_BITS);
+    memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
+           0xff, size >> TARGET_PAGE_BITS);
+
+    last_ram_offset += size;
+
+    if (kvm_enabled())
+        kvm_setup_guest_memory(new_block->host, size);
+
+    return new_block->offset;
+}
+
 ram_addr_t qemu_ram_alloc(ram_addr_t size)
 {
     RAMBlock *new_block;
-- 
1.6.2.5


* [PATCH v5 3/5] Add functions for assigning ioeventfd and irqfds.
  2010-04-21 17:53     ` [Qemu-devel] " Cam Macdonell
@ 2010-04-21 17:53       ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-04-21 17:53 UTC (permalink / raw)
  To: kvm; +Cc: qemu-devel, Cam Macdonell

Generic functions to assign irqfds and ioeventfds.
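
The ivshmem device later in this series uses them roughly as follows
(illustrative fragment only):

    /* kick a peer: datamatch on (posn << 16) | vector at the Doorbell offset */
    kvm_set_ioeventfd_mmio_long(incoming_fd, s->mmio_addr + Doorbell,
                                (incoming_posn << 16) | guest_curr_max, 1);

    /* deliver our own eventfd directly to an MSI-X vector via irqfd */
    kvm_set_irqfd(s->eventfds[s->vm_id][vector], vector,
                  s->dev.msix_irq_entries[vector].gsi);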

---
 kvm-all.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 kvm.h     |   14 ++++++++++++++
 2 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index fb8d4b8..d5c7775 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1193,6 +1193,50 @@ int kvm_set_signal_mask(CPUState *env, const sigset_t *sigset)
 }
 
 #ifdef KVM_IOEVENTFD
+int kvm_set_irqfd(int fd, uint16_t vector, uint32_t gsi)
+{
+    struct kvm_irqfd call = { };
+    int r;
+
+    call.fd = fd;
+    call.gsi = gsi;
+
+    if (!kvm_enabled())
+        return -ENOSYS;
+    r = kvm_vm_ioctl(kvm_state, KVM_IRQFD, &call);
+
+    if (r < 0) {
+        return r;
+    }
+    return 0;
+}
+
+int kvm_set_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val, bool assign)
+{
+
+    int ret;
+    struct kvm_ioeventfd iofd;
+
+    iofd.datamatch = val;
+    iofd.addr = addr;
+    iofd.len = 4;
+    iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
+    iofd.fd = fd;
+
+    if (!kvm_enabled())
+        return -ENOSYS;
+    if (!assign)
+        iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
+
+    ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);
+
+    if (ret < 0) {
+        return ret;
+    }
+
+    return 0;
+}
+
 int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val, bool assign)
 {
     struct kvm_ioeventfd kick = {
diff --git a/kvm.h b/kvm.h
index c63e314..831d68f 100644
--- a/kvm.h
+++ b/kvm.h
@@ -174,9 +174,23 @@ static inline void cpu_synchronize_post_init(CPUState *env)
 }
 
 #if defined(KVM_IOEVENTFD) && defined(CONFIG_KVM)
+int kvm_set_irqfd(int fd, uint16_t vector, uint32_t gsi);
+int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool assign);
 int kvm_set_ioeventfd_pio_word(int fd, uint16_t adr, uint16_t val, bool assign);
 #else
 static inline
+int kvm_set_irqfd(int fd, uint16_t vector, uint32_t gsi)
+{
+    return -ENOSYS;
+}
+
+static inline
+int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool assign)
+{
+    return -ENOSYS;
+}
+
+static inline
 int kvm_set_ioeventfd_pio_word(int fd, uint16_t adr, uint16_t val, bool assign)
 {
     return -ENOSYS;
-- 
1.6.2.5


* [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-04-21 17:53       ` [Qemu-devel] " Cam Macdonell
@ 2010-04-21 17:53         ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-04-21 17:53 UTC (permalink / raw)
  To: kvm; +Cc: qemu-devel, Cam Macdonell

Support an inter-VM shared memory device that maps a shared-memory object as a
PCI device in the guest.  This patch also supports interrupts between guests by
communicating over a unix domain socket.  This patch applies to the qemu-kvm
repository.

    -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]

Interrupts are supported between multiple VMs by using a shared memory server,
which the guests connect to through a chardev socket.

    -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
                    [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
    -chardev socket,path=<path>,id=<id>

(the shared memory server is in qemu.git/contrib/ivshmem-server)

Sample programs and init scripts are in a git repo here:

    www.gitorious.org/nahanni
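
A concrete example invocation (the size, path and id values below are examples
only; the size must be a power of two):

    qemu -device ivshmem,size=256m,shm=ivshmem_region

or, when connecting to the shared memory server to get interrupt support:

    qemu -chardev socket,path=/tmp/ivshmem_socket,id=ivshm \
         -device ivshmem,size=256m,chardev=ivshm,msi=on,vectors=4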
---
 Makefile.target |    3 +
 hw/ivshmem.c    |  727 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 qemu-char.c     |    6 +
 qemu-char.h     |    3 +
 qemu-doc.texi   |   25 ++
 5 files changed, 764 insertions(+), 0 deletions(-)
 create mode 100644 hw/ivshmem.c

diff --git a/Makefile.target b/Makefile.target
index 1ffd802..bc9a681 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -199,6 +199,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
 obj-y += rtl8139.o
 obj-y += e1000.o
 
+# Inter-VM PCI shared memory
+obj-y += ivshmem.o
+
 # Hardware support
 obj-i386-y = pckbd.o dma.o
 obj-i386-y += vga.o
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
new file mode 100644
index 0000000..f8d8fdb
--- /dev/null
+++ b/hw/ivshmem.c
@@ -0,0 +1,727 @@
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *      Cam Macdonell <cam@cs.ualberta.ca>
+ *
+ * Based On: cirrus_vga.c and rtl8139.c
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/io.h>
+#include <sys/ioctl.h>
+#include <sys/eventfd.h>
+#include "hw.h"
+#include "console.h"
+#include "pc.h"
+#include "pci.h"
+#include "sysemu.h"
+
+#include "msix.h"
+#include "qemu-kvm.h"
+#include "libkvm.h"
+
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#define IVSHMEM_IRQFD   0
+#define IVSHMEM_MSI     1
+
+#define DEBUG_IVSHMEM
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...)        \
+    do {printf("IVSHMEM: " fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+typedef struct EventfdEntry {
+    PCIDevice *pdev;
+    int vector;
+} EventfdEntry;
+
+typedef struct IVShmemState {
+    PCIDevice dev;
+    uint32_t intrmask;
+    uint32_t intrstatus;
+    uint32_t doorbell;
+
+    CharDriverState * chr;
+    CharDriverState ** eventfd_chr;
+    int ivshmem_mmio_io_addr;
+
+    pcibus_t mmio_addr;
+    unsigned long ivshmem_offset;
+    uint64_t ivshmem_size; /* size of shared memory region */
+    int shm_fd; /* shared memory file descriptor */
+
+    int nr_allocated_vms;
+    /* array of eventfds for each guest */
+    int ** eventfds;
+    /* keep track of # of eventfds for each guest*/
+    int * eventfds_posn_count;
+
+    int nr_alloc_guests;
+    int vm_id;
+    int num_eventfds;
+    uint32_t vectors;
+    uint32_t features;
+    EventfdEntry *eventfd_table;
+
+    char * shmobj;
+    char * sizearg;
+} IVShmemState;
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+    IntrMask = 0,
+    IntrStatus = 4,
+    IVPosition = 8,
+    Doorbell = 12,
+};
+
+static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
+    return (ivs->features & (1 << feature));
+}
+
+static inline int is_power_of_two(int x) {
+    return (x & (x-1)) == 0;
+}
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+                    pcibus_t addr, pcibus_t size, int type)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
+
+    IVSHMEM_DPRINTF("addr = %u size = %u\n", (uint32_t)addr, (uint32_t)size);
+    cpu_register_physical_memory(addr, s->ivshmem_size, s->ivshmem_offset);
+
+}
+
+/* accessing registers - based on rtl8139 */
+static void ivshmem_update_irq(IVShmemState *s, int val)
+{
+    int isr;
+    isr = (s->intrstatus & s->intrmask) & 0xffffffff;
+
+    /* don't print ISR resets */
+    if (isr) {
+        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
+           isr ? 1 : 0, s->intrstatus, s->intrmask);
+    }
+
+    qemu_set_irq(s->dev.irq[0], (isr != 0));
+}
+
+static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
+
+    s->intrmask = val;
+
+    ivshmem_update_irq(s, val);
+}
+
+static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrmask;
+
+    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
+
+    return ret;
+}
+
+static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
+
+    s->intrstatus = val;
+
+    ivshmem_update_irq(s, val);
+    return;
+}
+
+static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrstatus;
+
+    /* reading ISR clears all interrupts */
+    s->intrstatus = 0;
+
+    ivshmem_update_irq(s, 0);
+
+    return ret;
+}
+
+static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val)
+{
+
+    IVSHMEM_DPRINTF("We shouldn't be writing words\n");
+}
+
+static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVShmemState *s = opaque;
+
+    u_int64_t write_one = 1;
+    u_int16_t dest = val >> 16;
+    u_int16_t vector = val & 0xff;
+
+    addr &= 0xfe;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ivshmem_IntrMask_write(s, val);
+            break;
+
+        case IntrStatus:
+            ivshmem_IntrStatus_write(s, val);
+            break;
+
+        case Doorbell:
+            /* check doorbell range */
+            if ((vector >= 0) && (vector < s->eventfds_posn_count[dest])) {
+                IVSHMEM_DPRINTF("Writing %ld to VM %d on vector %d\n", write_one, dest, vector);
+                if (write(s->eventfds[dest][vector], &(write_one), 8) != 8) {
+                    IVSHMEM_DPRINTF("error writing to eventfd\n");
+                }
+            }
+            break;
+        default:
+            IVSHMEM_DPRINTF("Invalid VM Doorbell VM %d\n", dest);
+    }
+}
+
+static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVSHMEM_DPRINTF("We shouldn't be writing bytes\n");
+}
+
+static uint32_t ivshmem_io_readw(void *opaque, uint8_t addr)
+{
+
+    IVSHMEM_DPRINTF("We shouldn't be reading words\n");
+    return 0;
+}
+
+static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr)
+{
+
+    IVShmemState *s = opaque;
+    uint32_t ret;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ret = ivshmem_IntrMask_read(s);
+            break;
+
+        case IntrStatus:
+            ret = ivshmem_IntrStatus_read(s);
+            break;
+
+        case IVPosition:
+            /* return my id in the ivshmem list */
+            ret = s->vm_id;
+            break;
+
+        default:
+            IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr);
+            ret = 0;
+    }
+
+    return ret;
+
+}
+
+static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr)
+{
+    IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
+
+    return 0;
+}
+
+static void ivshmem_mmio_writeb(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writeb(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writew(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writew(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writel(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writel(opaque, addr & 0xFF, val);
+}
+
+static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr)
+{
+    return ivshmem_io_readb(opaque, addr & 0xFF);
+}
+
+static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readw(opaque, addr & 0xFF);
+    return val;
+}
+
+static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readl(opaque, addr & 0xFF);
+    return val;
+}
+
+static CPUReadMemoryFunc *ivshmem_mmio_read[3] = {
+    ivshmem_mmio_readb,
+    ivshmem_mmio_readw,
+    ivshmem_mmio_readl,
+};
+
+static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = {
+    ivshmem_mmio_writeb,
+    ivshmem_mmio_writew,
+    ivshmem_mmio_writel,
+};
+
+static void ivshmem_receive(void *opaque, const uint8_t *buf, int size)
+{
+    IVShmemState *s = opaque;
+
+    ivshmem_IntrStatus_write(s, *buf);
+
+    IVSHMEM_DPRINTF("ivshmem_receive 0x%02x\n", *buf);
+}
+
+static int ivshmem_can_receive(void * opaque)
+{
+    return 8;
+}
+
+static void ivshmem_event(void *opaque, int event)
+{
+    IVSHMEM_DPRINTF("ivshmem_event %d\n", event);
+}
+
+static void fake_irqfd(void *opaque, const uint8_t *buf, int size) {
+
+    EventfdEntry *entry = opaque;
+    PCIDevice *pdev = entry->pdev;
+
+    IVSHMEM_DPRINTF("fake irqfd on vector %d\n", entry->vector);
+    msix_notify(pdev, entry->vector);
+}
+
+static CharDriverState* create_eventfd_chr_device(void * opaque, int eventfd,
+                                                                    int vector)
+{
+    /* create an event character device based on the passed eventfd */
+    IVShmemState *s = opaque;
+    CharDriverState * chr;
+
+    chr = qemu_chr_open_eventfd(eventfd);
+
+    if (chr == NULL) {
+        IVSHMEM_DPRINTF("creating eventfd for eventfd %d failed\n", eventfd);
+        exit(-1);
+    }
+
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        s->eventfd_table[vector].pdev = &s->dev;
+        s->eventfd_table[vector].vector = vector;
+
+        qemu_chr_add_handlers(chr, ivshmem_can_receive, fake_irqfd,
+                      ivshmem_event, &s->eventfd_table[vector]);
+    } else {
+        qemu_chr_add_handlers(chr, ivshmem_can_receive, ivshmem_receive,
+                      ivshmem_event, s);
+    }
+
+    return chr;
+
+}
+
+static int check_shm_size(IVShmemState *s, int shmemfd) {
+    /* check that the guest isn't going to try and map more memory than the
+     * card server allocated return -1 to indicate error */
+
+    struct stat buf;
+
+    fstat(shmemfd, &buf);
+
+    if (s->ivshmem_size > buf.st_size) {
+        fprintf(stderr, "IVSHMEM ERROR: Requested memory size greater");
+        fprintf(stderr, " than shared object size (%ld > %ld)\n",
+                                          s->ivshmem_size, buf.st_size);
+        return -1;
+    } else {
+        return 0;
+    }
+}
+
+static void create_shared_memory_BAR(IVShmemState *s, int fd) {
+
+    s->shm_fd = fd;
+
+    s->ivshmem_offset = qemu_ram_mmap(s->shm_fd, s->ivshmem_size,
+             MAP_SHARED, 0);
+
+    /* region for shared memory */
+    pci_register_bar(&s->dev, 2, s->ivshmem_size,
+                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
+}
+
+static void close_guest_eventfds(IVShmemState *s, int posn)
+{
+    int i, guest_curr_max;
+
+    guest_curr_max = s->eventfds_posn_count[posn];
+
+    for (i = 0; i < guest_curr_max; i++)
+        close(s->eventfds[posn][i]);
+
+    free(s->eventfds[posn]);
+    s->eventfds_posn_count[posn] = 0;
+}
+
+/* this function increases the dynamic storage needed to store data about
+ * other guests */
+static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
+
+    int j, old_nr_alloc;
+
+    old_nr_alloc = s->nr_alloc_guests;
+
+    while (s->nr_alloc_guests < new_min_size)
+        s->nr_alloc_guests = s->nr_alloc_guests * 2;
+
+    IVSHMEM_DPRINTF("bumping storage to %d guests\n", s->nr_alloc_guests);
+    s->eventfds = qemu_realloc(s->eventfds, s->nr_alloc_guests *
+                                                        sizeof(int *));
+    s->eventfds_posn_count = qemu_realloc(s->eventfds_posn_count,
+                                                    s->nr_alloc_guests *
+                                                        sizeof(int));
+    s->eventfd_table = qemu_realloc(s->eventfd_table, s->nr_alloc_guests *
+                                                    sizeof(EventfdEntry));
+
+    if ((s->eventfds == NULL) || (s->eventfds_posn_count == NULL) ||
+            (s->eventfd_table == NULL)) {
+        fprintf(stderr, "Allocation error - exiting\n");
+        exit(1);
+    }
+
+    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+        s->eventfd_chr = (CharDriverState **)qemu_realloc(s->eventfd_chr,
+                                    s->nr_alloc_guests * sizeof(void *));
+        if (s->eventfd_chr == NULL) {
+            fprintf(stderr, "Allocation error - exiting\n");
+            exit(1);
+        }
+    }
+
+    /* zero out new pointers */
+    for (j = old_nr_alloc; j < s->nr_alloc_guests; j++) {
+        s->eventfds[j] = NULL;
+    }
+}
+
+static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
+{
+    IVShmemState *s = opaque;
+    int incoming_fd, tmp_fd;
+    int guest_curr_max;
+    long incoming_posn;
+
+    memcpy(&incoming_posn, buf, sizeof(long));
+    /* pick off s->chr->msgfd and store it, posn should accompany msg */
+    tmp_fd = qemu_chr_get_msgfd(s->chr);
+    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
+
+    /* make sure we have enough space for this guest */
+    if (incoming_posn >= s->nr_alloc_guests) {
+        increase_dynamic_storage(s, incoming_posn);
+    }
+
+    if (tmp_fd == -1) {
+        /* if posn is positive and unseen before then this is our posn*/
+        if ((incoming_posn >= 0) && (s->eventfds[incoming_posn] == NULL)) {
+            /* receive our posn */
+            s->vm_id = incoming_posn;
+            return;
+        } else {
+            /* otherwise an fd == -1 means an existing guest has gone away */
+            IVSHMEM_DPRINTF("posn %ld has gone away\n", incoming_posn);
+            close_guest_eventfds(s, incoming_posn);
+            return;
+        }
+    }
+
+    /* because of the implementation of get_msgfd, we need a dup */
+    incoming_fd = dup(tmp_fd);
+
+    /* if the position is -1, then it's shared memory region fd */
+    if (incoming_posn == -1) {
+
+        s->num_eventfds = 0;
+
+        if (check_shm_size(s, incoming_fd) == -1) {
+            exit(-1);
+        }
+
+        /* creating a BAR in qemu_chr callback may be crazy */
+        create_shared_memory_BAR(s, incoming_fd);
+
+       return;
+    }
+
+    /* each guest has an array of eventfds, and we keep track of how many
+     * guests for each VM */
+    guest_curr_max = s->eventfds_posn_count[incoming_posn];
+    if (guest_curr_max == 0) {
+        /* one eventfd per MSI vector */
+        s->eventfds[incoming_posn] = (int *) qemu_malloc(s->vectors *
+                                                                sizeof(int));
+    }
+
+    /* this is an eventfd for a particular guest VM */
+    IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn, guest_curr_max,
+                                                                incoming_fd);
+    s->eventfds[incoming_posn][guest_curr_max] = incoming_fd;
+
+    /* increment count for particular guest */
+    s->eventfds_posn_count[incoming_posn]++;
+
+    /* ioeventfd and irqfd are enabled together,
+     * so the flag IRQFD refers to both */
+    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) && guest_curr_max >= 0) {
+        /* allocate ioeventfd for the new fd
+         * received for guest @ incoming_posn */
+        kvm_set_ioeventfd_mmio_long(incoming_fd, s->mmio_addr + Doorbell,
+                                (incoming_posn << 16) | guest_curr_max, 1);
+    }
+
+    /* keep track of the maximum VM ID */
+    if (incoming_posn > s->num_eventfds) {
+        s->num_eventfds = incoming_posn;
+    }
+
+    if (incoming_posn == s->vm_id) {
+        if (ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+            /* setup irqfd for this VM's eventfd */
+            int vector = guest_curr_max;
+            kvm_set_irqfd(s->eventfds[s->vm_id][guest_curr_max], vector,
+                                        s->dev.msix_irq_entries[vector].gsi);
+        } else {
+            /* initialize char device for callback
+             * if this is one of my eventfd */
+            s->eventfd_chr[guest_curr_max] = create_eventfd_chr_device(s,
+                s->eventfds[s->vm_id][guest_curr_max], guest_curr_max);
+        }
+    }
+
+    return;
+}
+
+static void ivshmem_reset(DeviceState *d)
+{
+    return;
+}
+
+static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
+                       pcibus_t addr, pcibus_t size, int type)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
+
+    s->mmio_addr = addr;
+    cpu_register_physical_memory(addr + 0, 0x400, s->ivshmem_mmio_io_addr);
+
+    /* now that our mmio region has been allocated, we can receive
+     * the file descriptors */
+    if (s->chr != NULL) {
+        qemu_chr_add_handlers(s->chr, ivshmem_can_receive, ivshmem_read,
+                     ivshmem_event, s);
+    }
+
+}
+
+static uint64_t ivshmem_get_size(IVShmemState * s) {
+
+    uint64_t value;
+    char *ptr;
+
+    value = strtoul(s->sizearg, &ptr, 10);
+    switch (*ptr) {
+        case 0: case 'M': case 'm':
+            value <<= 20;
+            break;
+        case 'G': case 'g':
+            value <<= 30;
+            break;
+        default:
+            fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg);
+            exit(1);
+    }
+
+    /* BARs must be a power of 2 */
+    if (!is_power_of_two(value)) {
+        fprintf(stderr, "ivshmem: size must be power of 2\n");
+        exit(1);
+    }
+
+    return value;
+
+}
+
+static int pci_ivshmem_init(PCIDevice *dev)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
+    uint8_t *pci_conf;
+    int i;
+
+    if (s->sizearg == NULL)
+        s->ivshmem_size = 4 << 20; /* 4 MB default */
+    else {
+        s->ivshmem_size = ivshmem_get_size(s);
+    }
+
+    /* IRQFD requires MSI */
+    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) &&
+        !ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n");
+        exit(1);
+    }
+
+    pci_conf = s->dev.config;
+    pci_conf[0x00] = 0xf4; /* Qumranet vendor ID 0x5002 */
+    pci_conf[0x01] = 0x1a;
+    pci_conf[0x02] = 0x10;
+    pci_conf[0x03] = 0x11;
+    pci_conf[0x04] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
+    pci_conf[0x0a] = 0x00; /* RAM controller */
+    pci_conf[0x0b] = 0x05;
+    pci_conf[0x0e] = 0x00; /* header_type */
+
+    s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read,
+                                    ivshmem_mmio_write, s);
+    /* region for registers*/
+    pci_register_bar(&s->dev, 0, 0x400,
+                           PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_mmio_map);
+
+    /* allocate the MSI-X vectors */
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+
+        if (!msix_init(&s->dev, s->vectors, 1, 0)) {
+            pci_register_bar(&s->dev, 1,
+                             msix_bar_size(&s->dev),
+                             PCI_BASE_ADDRESS_SPACE_MEMORY,
+                             msix_mmio_map);
+            IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
+        } else {
+            IVSHMEM_DPRINTF("msix initialization failed\n");
+        }
+
+        /* 'activate' the vectors */
+        for (i = 0; i < s->vectors; i++) {
+            msix_vector_use(&s->dev, i);
+        }
+    }
+
+    if ((s->chr != NULL) && (strncmp(s->chr->filename, "unix:", 5) == 0)) {
+        /* if we get a UNIX socket as the parameter we will talk
+         * to the ivshmem server later once the MMIO BAR is actually
+         * allocated (see ivshmem_mmio_map) */
+
+        IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
+                                                            s->chr->filename);
+
+        /* we allocate enough space for 16 guests and grow as needed */
+        s->nr_alloc_guests = 16;
+        s->vm_id = -1;
+
+        /* allocate/initialize space for interrupt handling */
+        s->eventfds = qemu_mallocz(s->nr_alloc_guests * sizeof(int *));
+        s->eventfd_table = qemu_mallocz(s->vectors * sizeof(EventfdEntry));
+        s->eventfds_posn_count = qemu_mallocz(s->nr_alloc_guests * sizeof(int));
+
+        pci_conf[PCI_INTERRUPT_PIN] = 1; /* we are going to support interrupts */
+
+        if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+            s->eventfd_chr = (CharDriverState **)qemu_malloc(s->nr_alloc_guests *
+                                                            sizeof(void *));
+        }
+
+    } else {
+        /* just map the file immediately, we're not using a server */
+        int fd;
+
+        if (s->shmobj == NULL) {
+            fprintf(stderr, "Must specify 'chardev' or 'shm' to ivshmem\n");
+        }
+
+        IVSHMEM_DPRINTF("using shm_open (shm object = %s)\n", s->shmobj);
+
+        /* try opening with O_EXCL and if it succeeds zero the memory
+         * by truncating to 0 */
+        if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR|O_EXCL,
+                        S_IRWXU|S_IRWXG|S_IRWXO)) > 0) {
+           /* truncate file to length PCI device's memory */
+            if (ftruncate(fd, s->ivshmem_size) != 0) {
+                fprintf(stderr, "kvm_ivshmem: could not truncate shared file\n");
+            }
+
+        } else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR,
+                        S_IRWXU|S_IRWXG|S_IRWXO)) < 0) {
+            fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
+            exit(-1);
+        }
+
+        create_shared_memory_BAR(s, fd);
+
+    }
+
+
+    return 0;
+}
+
+static int pci_ivshmem_uninit(PCIDevice *dev)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
+
+    cpu_unregister_io_memory(s->ivshmem_mmio_io_addr);
+
+    return 0;
+}
+
+static PCIDeviceInfo ivshmem_info = {
+    .qdev.name  = "ivshmem",
+    .qdev.size  = sizeof(IVShmemState),
+    .qdev.reset = ivshmem_reset,
+    .init       = pci_ivshmem_init,
+    .exit       = pci_ivshmem_uninit,
+    .qdev.props = (Property[]) {
+        DEFINE_PROP_CHR("chardev", IVShmemState, chr),
+        DEFINE_PROP_STRING("size", IVShmemState, sizearg),
+        DEFINE_PROP_UINT32("vectors", IVShmemState, vectors, 1),
+        DEFINE_PROP_BIT("irqfd", IVShmemState, features, IVSHMEM_IRQFD, false),
+        DEFINE_PROP_BIT("msi", IVShmemState, features, IVSHMEM_MSI, true),
+        DEFINE_PROP_STRING("shm", IVShmemState, shmobj),
+        DEFINE_PROP_END_OF_LIST(),
+    }
+};
+
+static void ivshmem_register_devices(void)
+{
+    pci_qdev_register(&ivshmem_info);
+}
+
+device_init(ivshmem_register_devices)
diff --git a/qemu-char.c b/qemu-char.c
index 048da3f..41cb8c7 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2076,6 +2076,12 @@ static void tcp_chr_read(void *opaque)
     }
 }
 
+CharDriverState *qemu_chr_open_eventfd(int eventfd){
+
+    return qemu_chr_open_fd(eventfd, eventfd);
+
+}
+
 static void tcp_chr_connect(void *opaque)
 {
     CharDriverState *chr = opaque;
diff --git a/qemu-char.h b/qemu-char.h
index 3a9427b..1571091 100644
--- a/qemu-char.h
+++ b/qemu-char.h
@@ -93,6 +93,9 @@ void qemu_chr_info_print(Monitor *mon, const QObject *ret_data);
 void qemu_chr_info(Monitor *mon, QObject **ret_data);
 CharDriverState *qemu_chr_find(const char *name);
 
+/* add an eventfd to the qemu devices that are polled */
+CharDriverState *qemu_chr_open_eventfd(int eventfd);
+
 extern int term_escape_char;
 
 /* async I/O support */
diff --git a/qemu-doc.texi b/qemu-doc.texi
index 6647b7b..2df4687 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -706,6 +706,31 @@ Using the @option{-net socket} option, it is possible to make VLANs
 that span several QEMU instances. See @ref{sec_invocation} to have a
 basic example.
 
+@section Other Devices
+
+@subsection Inter-VM Shared Memory device
+
+With KVM enabled on a Linux host, a shared memory device is available.  The
+device maps a POSIX shared memory region into the guest as a PCI device,
+enabling zero-copy communication between applications in different guests.
+The basic syntax is:
+
+@example
+qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
+@end example
+
+If desired, interrupts can be sent between guest VMs accessing the same shared
+memory region.  Interrupt support requires using a shared memory server and a
+chardev socket to connect to it.  The code for the shared memory server is in
+qemu.git/contrib/ivshmem-server.  An example invocation when using the shared
+memory server is:
+
+@example
+qemu -chardev socket,path=<path>,id=<id> \
+     -device ivshmem,size=<size in format accepted by -m>,chardev=<id>
+                    [,msi=on][,irqfd=on][,vectors=n]
+@end example
+
 @node direct_linux_boot
 @section Direct Linux Boot
 
-- 
1.6.2.5


+        exit(-1);
+    }
+
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        s->eventfd_table[vector].pdev = &s->dev;
+        s->eventfd_table[vector].vector = vector;
+
+        qemu_chr_add_handlers(chr, ivshmem_can_receive, fake_irqfd,
+                      ivshmem_event, &s->eventfd_table[vector]);
+    } else {
+        qemu_chr_add_handlers(chr, ivshmem_can_receive, ivshmem_receive,
+                      ivshmem_event, s);
+    }
+
+    return chr;
+
+}
+
+static int check_shm_size(IVShmemState *s, int shmemfd) {
+    /* check that the guest isn't going to try and map more memory than the
+     * shared memory object provides; return -1 to indicate an error */
+
+    struct stat buf;
+
+    fstat(shmemfd, &buf);
+
+    if (s->ivshmem_size > buf.st_size) {
+        fprintf(stderr, "IVSHMEM ERROR: Requested memory size greater");
+        fprintf(stderr, " than shared object size (%ld > %ld)\n",
+                                          s->ivshmem_size, buf.st_size);
+        return -1;
+    } else {
+        return 0;
+    }
+}
+
+static void create_shared_memory_BAR(IVShmemState *s, int fd) {
+
+    s->shm_fd = fd;
+
+    s->ivshmem_offset = qemu_ram_mmap(s->shm_fd, s->ivshmem_size,
+             MAP_SHARED, 0);
+
+    /* region for shared memory */
+    pci_register_bar(&s->dev, 2, s->ivshmem_size,
+                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
+}
+
+static void close_guest_eventfds(IVShmemState *s, int posn)
+{
+    int i, guest_curr_max;
+
+    guest_curr_max = s->eventfds_posn_count[posn];
+
+    for (i = 0; i < guest_curr_max; i++)
+        close(s->eventfds[posn][i]);
+
+    free(s->eventfds[posn]);
+    s->eventfds_posn_count[posn] = 0;
+}
+
+/* this function increases the dynamic storage needed to store data about
+ * other guests */
+static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
+
+    int j, old_nr_alloc;
+
+    old_nr_alloc = s->nr_alloc_guests;
+
+    while (s->nr_alloc_guests < new_min_size)
+        s->nr_alloc_guests = s->nr_alloc_guests * 2;
+
+    IVSHMEM_DPRINTF("bumping storage to %d guests\n", s->nr_alloc_guests);
+    s->eventfds = qemu_realloc(s->eventfds, s->nr_alloc_guests *
+                                                        sizeof(int *));
+    s->eventfds_posn_count = qemu_realloc(s->eventfds_posn_count,
+                                                    s->nr_alloc_guests *
+                                                        sizeof(int));
+    s->eventfd_table = qemu_realloc(s->eventfd_table, s->nr_alloc_guests *
+                                                    sizeof(EventfdEntry));
+
+    if ((s->eventfds == NULL) || (s->eventfds_posn_count == NULL) ||
+            (s->eventfd_table == NULL)) {
+        fprintf(stderr, "Allocation error - exiting\n");
+        exit(1);
+    }
+
+    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+        s->eventfd_chr = (CharDriverState **)qemu_realloc(s->eventfd_chr,
+                                    s->nr_alloc_guests * sizeof(void *));
+        if (s->eventfd_chr == NULL) {
+            fprintf(stderr, "Allocation error - exiting\n");
+            exit(1);
+        }
+    }
+
+    /* zero out new pointers */
+    for (j = old_nr_alloc; j < s->nr_alloc_guests; j++) {
+        s->eventfds[j] = NULL;
+    }
+}
+
+static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
+{
+    IVShmemState *s = opaque;
+    int incoming_fd, tmp_fd;
+    int guest_curr_max;
+    long incoming_posn;
+
+    memcpy(&incoming_posn, buf, sizeof(long));
+    /* pick off s->chr->msgfd and store it, posn should accompany msg */
+    tmp_fd = qemu_chr_get_msgfd(s->chr);
+    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
+
+    /* make sure we have enough space for this guest */
+    if (incoming_posn >= s->nr_alloc_guests) {
+        increase_dynamic_storage(s, incoming_posn);
+    }
+
+    if (tmp_fd == -1) {
+        /* if posn is positive and unseen before, then this is our posn */
+        if ((incoming_posn >= 0) && (s->eventfds[incoming_posn] == NULL)) {
+            /* receive our posn */
+            s->vm_id = incoming_posn;
+            return;
+        } else {
+            /* otherwise an fd == -1 means an existing guest has gone away */
+            IVSHMEM_DPRINTF("posn %ld has gone away\n", incoming_posn);
+            close_guest_eventfds(s, incoming_posn);
+            return;
+        }
+    }
+
+    /* because of the implementation of get_msgfd, we need a dup */
+    incoming_fd = dup(tmp_fd);
+
+    /* if the position is -1, then the fd is for the shared memory region */
+    if (incoming_posn == -1) {
+
+        s->num_eventfds = 0;
+
+        if (check_shm_size(s, incoming_fd) == -1) {
+            exit(-1);
+        }
+
+        /* creating a BAR in qemu_chr callback may be crazy */
+        create_shared_memory_BAR(s, incoming_fd);
+
+       return;
+    }
+
+    /* each guest has an array of eventfds, and we keep track of how many
+     * eventfds each guest has */
+    guest_curr_max = s->eventfds_posn_count[incoming_posn];
+    if (guest_curr_max == 0) {
+        /* one eventfd per MSI vector */
+        s->eventfds[incoming_posn] = (int *) qemu_malloc(s->vectors *
+                                                                sizeof(int));
+    }
+
+    /* this is an eventfd for a particular guest VM */
+    IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn, guest_curr_max,
+                                                                incoming_fd);
+    s->eventfds[incoming_posn][guest_curr_max] = incoming_fd;
+
+    /* increment count for particular guest */
+    s->eventfds_posn_count[incoming_posn]++;
+
+    /* ioeventfd and irqfd are enabled together,
+     * so the flag IRQFD refers to both */
+    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) && guest_curr_max >= 0) {
+        /* allocate ioeventfd for the new fd
+         * received for guest @ incoming_posn */
+        kvm_set_ioeventfd_mmio_long(incoming_fd, s->mmio_addr + Doorbell,
+                                (incoming_posn << 16) | guest_curr_max, 1);
+    }
+
+    /* keep track of the maximum VM ID */
+    if (incoming_posn > s->num_eventfds) {
+        s->num_eventfds = incoming_posn;
+    }
+
+    if (incoming_posn == s->vm_id) {
+        if (ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+            /* setup irqfd for this VM's eventfd */
+            int vector = guest_curr_max;
+            kvm_set_irqfd(s->eventfds[s->vm_id][guest_curr_max], vector,
+                                        s->dev.msix_irq_entries[vector].gsi);
+        } else {
+            /* initialize char device for callback
+             * if this is one of my eventfds */
+            s->eventfd_chr[guest_curr_max] = create_eventfd_chr_device(s,
+                s->eventfds[s->vm_id][guest_curr_max], guest_curr_max);
+        }
+    }
+
+    return;
+}
+
+static void ivshmem_reset(DeviceState *d)
+{
+    return;
+}
+
+static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
+                       pcibus_t addr, pcibus_t size, int type)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
+
+    s->mmio_addr = addr;
+    cpu_register_physical_memory(addr + 0, 0x400, s->ivshmem_mmio_io_addr);
+
+    /* now that our mmio region has been allocated, we can receive
+     * the file descriptors */
+    if (s->chr != NULL) {
+        qemu_chr_add_handlers(s->chr, ivshmem_can_receive, ivshmem_read,
+                     ivshmem_event, s);
+    }
+
+}
+
+static uint64_t ivshmem_get_size(IVShmemState * s) {
+
+    uint64_t value;
+    char *ptr;
+
+    value = strtoul(s->sizearg, &ptr, 10);
+    switch (*ptr) {
+        case 0: case 'M': case 'm':
+            value <<= 20;
+            break;
+        case 'G': case 'g':
+            value <<= 30;
+            break;
+        default:
+            fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg);
+            exit(1);
+    }
+
+    /* BARs must be a power of 2 */
+    if (!is_power_of_two(value)) {
+        fprintf(stderr, "ivshmem: size must be power of 2\n");
+        exit(1);
+    }
+
+    return value;
+
+}
+
+static int pci_ivshmem_init(PCIDevice *dev)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
+    uint8_t *pci_conf;
+    int i;
+
+    if (s->sizearg == NULL)
+        s->ivshmem_size = 4 << 20; /* 4 MB default */
+    else {
+        s->ivshmem_size = ivshmem_get_size(s);
+    }
+
+    /* IRQFD requires MSI */
+    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) &&
+        !ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n");
+        exit(1);
+    }
+
+    pci_conf = s->dev.config;
+    pci_conf[0x00] = 0xf4; /* Qumranet vendor ID 0x5002 */
+    pci_conf[0x01] = 0x1a;
+    pci_conf[0x02] = 0x10;
+    pci_conf[0x03] = 0x11;
+    pci_conf[0x04] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
+    pci_conf[0x0a] = 0x00; /* RAM controller */
+    pci_conf[0x0b] = 0x05;
+    pci_conf[0x0e] = 0x00; /* header_type */
+
+    s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read,
+                                    ivshmem_mmio_write, s);
+    /* region for registers*/
+    pci_register_bar(&s->dev, 0, 0x400,
+                           PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_mmio_map);
+
+    /* allocate the MSI-X vectors */
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+
+        if (!msix_init(&s->dev, s->vectors, 1, 0)) {
+            pci_register_bar(&s->dev, 1,
+                             msix_bar_size(&s->dev),
+                             PCI_BASE_ADDRESS_SPACE_MEMORY,
+                             msix_mmio_map);
+            IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
+        } else {
+            IVSHMEM_DPRINTF("msix initialization failed\n");
+        }
+
+        /* 'activate' the vectors */
+        for (i = 0; i < s->vectors; i++) {
+            msix_vector_use(&s->dev, i);
+        }
+    }
+
+    if ((s->chr != NULL) && (strncmp(s->chr->filename, "unix:", 5) == 0)) {
+        /* if we get a UNIX socket as the parameter we will talk
+         * to the ivshmem server later once the MMIO BAR is actually
+         * allocated (see ivshmem_mmio_map) */
+
+        IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
+                                                            s->chr->filename);
+
+        /* we allocate enough space for 16 guests and grow as needed */
+        s->nr_alloc_guests = 16;
+        s->vm_id = -1;
+
+        /* allocate/initialize space for interrupt handling */
+        s->eventfds = qemu_mallocz(s->nr_alloc_guests * sizeof(int *));
+        s->eventfd_table = qemu_mallocz(s->vectors * sizeof(EventfdEntry));
+        s->eventfds_posn_count = qemu_mallocz(s->nr_alloc_guests * sizeof(int));
+
+        pci_conf[PCI_INTERRUPT_PIN] = 1; /* we are going to support interrupts */
+
+        if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+            s->eventfd_chr = (CharDriverState **)qemu_malloc(s->nr_alloc_guests *
+                                                            sizeof(void *));
+        }
+
+    } else {
+        /* just map the file immediately, we're not using a server */
+        int fd;
+
+        if (s->shmobj == NULL) {
+            fprintf(stderr, "Must specify 'chardev' or 'shm' to ivshmem\n");
+            exit(-1);
+        }
+
+        IVSHMEM_DPRINTF("using shm_open (shm object = %s)\n", s->shmobj);
+
+        /* try opening with O_EXCL and if it succeeds zero the memory
+         * by truncating to 0 */
+        if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR|O_EXCL,
+                        S_IRWXU|S_IRWXG|S_IRWXO)) > 0) {
+            /* truncate the file to the length of the PCI device's memory */
+            if (ftruncate(fd, s->ivshmem_size) != 0) {
+                fprintf(stderr, "kvm_ivshmem: could not truncate shared file\n");
+            }
+
+        } else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR,
+                        S_IRWXU|S_IRWXG|S_IRWXO)) < 0) {
+            fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
+            exit(-1);
+        }
+
+        create_shared_memory_BAR(s, fd);
+
+    }
+
+
+    return 0;
+}
+
+static int pci_ivshmem_uninit(PCIDevice *dev)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
+
+    cpu_unregister_io_memory(s->ivshmem_mmio_io_addr);
+
+    return 0;
+}
+
+static PCIDeviceInfo ivshmem_info = {
+    .qdev.name  = "ivshmem",
+    .qdev.size  = sizeof(IVShmemState),
+    .qdev.reset = ivshmem_reset,
+    .init       = pci_ivshmem_init,
+    .exit       = pci_ivshmem_uninit,
+    .qdev.props = (Property[]) {
+        DEFINE_PROP_CHR("chardev", IVShmemState, chr),
+        DEFINE_PROP_STRING("size", IVShmemState, sizearg),
+        DEFINE_PROP_UINT32("vectors", IVShmemState, vectors, 1),
+        DEFINE_PROP_BIT("irqfd", IVShmemState, features, IVSHMEM_IRQFD, false),
+        DEFINE_PROP_BIT("msi", IVShmemState, features, IVSHMEM_MSI, true),
+        DEFINE_PROP_STRING("shm", IVShmemState, shmobj),
+        DEFINE_PROP_END_OF_LIST(),
+    }
+};
+
+static void ivshmem_register_devices(void)
+{
+    pci_qdev_register(&ivshmem_info);
+}
+
+device_init(ivshmem_register_devices)
diff --git a/qemu-char.c b/qemu-char.c
index 048da3f..41cb8c7 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2076,6 +2076,12 @@ static void tcp_chr_read(void *opaque)
     }
 }
 
+CharDriverState *qemu_chr_open_eventfd(int eventfd){
+
+    return qemu_chr_open_fd(eventfd, eventfd);
+
+}
+
 static void tcp_chr_connect(void *opaque)
 {
     CharDriverState *chr = opaque;
diff --git a/qemu-char.h b/qemu-char.h
index 3a9427b..1571091 100644
--- a/qemu-char.h
+++ b/qemu-char.h
@@ -93,6 +93,9 @@ void qemu_chr_info_print(Monitor *mon, const QObject *ret_data);
 void qemu_chr_info(Monitor *mon, QObject **ret_data);
 CharDriverState *qemu_chr_find(const char *name);
 
+/* add an eventfd to the qemu devices that are polled */
+CharDriverState *qemu_chr_open_eventfd(int eventfd);
+
 extern int term_escape_char;
 
 /* async I/O support */
diff --git a/qemu-doc.texi b/qemu-doc.texi
index 6647b7b..2df4687 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -706,6 +706,31 @@ Using the @option{-net socket} option, it is possible to make VLANs
 that span several QEMU instances. See @ref{sec_invocation} to have a
 basic example.
 
+@section Other Devices
+
+@subsection Inter-VM Shared Memory device
+
+With KVM enabled on a Linux host, a shared memory device is available.  It
+maps a POSIX shared memory object on the host into the guest as a PCI device,
+enabling zero-copy communication between guests at the application level.  The
+basic syntax is:
+
+@example
+qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
+@end example
+
+If desired, interrupts can be sent between guest VMs accessing the same shared
+memory region.  Interrupt support requires using a shared memory server and a
+chardev socket to connect to it.  The code for the shared memory server is in
+qemu.git/contrib/ivshmem-server.  Example syntax when using the shared memory
+server:
+
+@example
+qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
+                        [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
+qemu -chardev socket,path=<path>,id=<id>
+@end example
+
 @node direct_linux_boot
 @section Direct Linux Boot
 
-- 
1.6.2.5

^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v5 5/5] shared memory server for inter-VM shared memory
  2010-04-21 17:53         ` [Qemu-devel] " Cam Macdonell
@ 2010-04-21 18:00           ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-04-21 18:00 UTC (permalink / raw)
  To: kvm; +Cc: qemu-devel, Cam Macdonell

This code is a stand-alone server that passes file descriptors for the shared
memory region and the eventfds used to support interrupts between guests using
inter-VM shared memory.
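
For reference, the protocol implemented below is: on connect the server sends
the new client its position as a long with no file descriptor attached; every
following message carries a long plus one file descriptor passed via
SCM_RIGHTS, where position -1 accompanies the shared memory fd and any other
position accompanies an eventfd belonging to that guest.  A minimal client
sketch (illustrative only, not part of this patch; error handling is omitted
and the socket path is just the server default):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* receive one long, plus one fd if the message carries SCM_RIGHTS data */
static int recv_posn_fd(int sock, long *posn, int *fd)
{
    struct iovec iov = { .iov_base = posn, .iov_len = sizeof(*posn) };
    char control[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = control,
                          .msg_controllen = sizeof(control) };
    struct cmsghdr *cmsg;

    *fd = -1;
    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
            memcpy(fd, CMSG_DATA(cmsg), sizeof(int));
    }
    return 0;
}

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX,
                                .sun_path = "/tmp/ivshmem_socket" };
    int sock = socket(AF_UNIX, SOCK_STREAM, 0);
    long posn;
    int fd;

    connect(sock, (struct sockaddr *)&addr, sizeof(addr));

    recv_posn_fd(sock, &posn, &fd);        /* our own position, no fd */
    printf("my position is %ld\n", posn);

    while (recv_posn_fd(sock, &posn, &fd) == 0)
        printf("posn %ld carries fd %d\n", posn, fd);

    return 0;
}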
---
 contrib/ivshmem-server/Makefile         |   16 ++
 contrib/ivshmem-server/README           |   30 +++
 contrib/ivshmem-server/ivshmem_server.c |  339 +++++++++++++++++++++++++++++++
 contrib/ivshmem-server/send_scm.c       |  208 +++++++++++++++++++
 contrib/ivshmem-server/send_scm.h       |   19 ++
 5 files changed, 612 insertions(+), 0 deletions(-)
 create mode 100644 contrib/ivshmem-server/Makefile
 create mode 100644 contrib/ivshmem-server/README
 create mode 100644 contrib/ivshmem-server/ivshmem_server.c
 create mode 100644 contrib/ivshmem-server/send_scm.c
 create mode 100644 contrib/ivshmem-server/send_scm.h

diff --git a/contrib/ivshmem-server/Makefile b/contrib/ivshmem-server/Makefile
new file mode 100644
index 0000000..da40ffa
--- /dev/null
+++ b/contrib/ivshmem-server/Makefile
@@ -0,0 +1,16 @@
+CC = gcc
+CFLAGS = -O3 -Wall -Werror
+LIBS = -lrt
+
+# a very simple makefile to build the inter-VM shared memory server
+
+all: ivshmem_server
+
+.c.o:
+	$(CC) $(CFLAGS) -c $^ -o $@
+
+ivshmem_server: ivshmem_server.o send_scm.o
+	$(CC) $(CFLAGS) -o $@ $^ $(LIBS)
+
+clean:
+	rm -f *.o ivshmem_server
diff --git a/contrib/ivshmem-server/README b/contrib/ivshmem-server/README
new file mode 100644
index 0000000..b1fc2a2
--- /dev/null
+++ b/contrib/ivshmem-server/README
@@ -0,0 +1,30 @@
+Using the ivshmem shared memory server
+--------------------------------------
+
+This server is only supported on Linux.
+
+To use the shared memory server, first compile it.  Running 'make' should
+accomplish this.  An executable named 'ivshmem_server' will be built.
+
+To display the options, run:
+
+./ivshmem_server -h
+
+Options
+-------
+
+    -h  print help message
+
+    -p <path on host>
+        unix socket to listen on.  The qemu-kvm chardev needs to connect on
+        this socket. (default: '/tmp/ivshmem_socket')
+
+    -s <string>
+        name of the POSIX shared memory object to create (default: 'ivshmem')
+
+    -m <#>
+        size of the POSIX shared memory object in MB (default: 1)
+
+    -n <#>
+        number of eventfds for each guest.  This number must match the
+        'vectors' argument passed to the ivshmem device. (default: 1)
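+
+Example
+-------
+
+An illustrative session (the values here are examples only; the device
+'vectors' argument must match -n):
+
+./ivshmem_server -p /tmp/ivshmem_socket -s ivshmem -m 64 -n 4 &
+
+qemu -chardev socket,path=/tmp/ivshmem_socket,id=ivshm \
+     -device ivshmem,size=64M,chardev=ivshm,msi=on,irqfd=on,vectors=4 ...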
diff --git a/contrib/ivshmem-server/ivshmem_server.c b/contrib/ivshmem-server/ivshmem_server.c
new file mode 100644
index 0000000..2dbf76f
--- /dev/null
+++ b/contrib/ivshmem-server/ivshmem_server.c
@@ -0,0 +1,339 @@
+/*
+ * A stand-alone shared memory server for inter-VM shared memory for KVM
+ */
+
+#include <errno.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <sys/select.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "send_scm.h"
+
+#define DEFAULT_SOCK_PATH "/tmp/ivshmem_socket"
+#define DEFAULT_SHM_OBJ "ivshmem"
+
+#define DEBUG 1
+
+typedef struct server_state {
+    vmguest_t *live_vms;
+    int nr_allocated_vms;
+    int shm_size;
+    long live_count;
+    long total_count;
+    int shm_fd;
+    char * path;
+    char * shmobj;
+    int maxfd, conn_socket;
+    long msi_vectors;
+} server_state_t;
+
+void usage(char const *prg);
+int find_set(fd_set * readset, int max);
+void print_vec(server_state_t * s, const char * c);
+
+void add_new_guest(server_state_t * s);
+void parse_args(int argc, char **argv, server_state_t * s);
+int create_listening_socket(char * path);
+
+int main(int argc, char ** argv)
+{
+    fd_set readset;
+    server_state_t * s;
+
+    s = (server_state_t *)calloc(1, sizeof(server_state_t));
+
+    s->live_count = 0;
+    s->total_count = 0;
+    parse_args(argc, argv, s);
+
+    /* open shared memory file  */
+    if ((s->shm_fd = shm_open(s->shmobj, O_CREAT|O_RDWR, S_IRWXU)) < 0)
+    {
+        fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
+        exit(-1);
+    }
+
+    ftruncate(s->shm_fd, s->shm_size);
+
+    s->conn_socket = create_listening_socket(s->path);
+
+    s->maxfd = s->conn_socket;
+
+    for(;;) {
+        int ret, handle, i;
+        char buf[1024];
+
+        print_vec(s, "vm_sockets");
+
+        FD_ZERO(&readset);
+        /* add the listening socket and all live VM sockets to the read set */
+        FD_SET(s->conn_socket, &readset);
+        for (i = 0; i < s->total_count; i++) {
+            if (s->live_vms[i].alive != 0) {
+                FD_SET(s->live_vms[i].sockfd, &readset);
+            }
+        }
+
+        printf("\nWaiting (maxfd = %d)\n", s->maxfd);
+
+        ret = select(s->maxfd + 1, &readset, NULL, NULL, NULL);
+
+        if (ret == -1) {
+            perror("select()");
+        }
+
+        handle = find_set(&readset, s->maxfd + 1);
+        if (handle == -1) continue;
+
+        if (handle == s->conn_socket) {
+
+            printf("[NC] new connection\n");
+            FD_CLR(s->conn_socket, &readset);
+
+            /* total_count is the new guest's VM ID */
+            add_new_guest(s);
+
+            /* update the maximum file descriptor number */
+            s->maxfd = s->live_vms[s->total_count - 1].sockfd > s->maxfd ?
+                            s->live_vms[s->total_count - 1].sockfd : s->maxfd;
+
+            s->live_count++;
+            printf("Live_count is %ld\n", s->live_count);
+
+        } else {
+            /* then we have received a disconnection */
+            int recv_ret;
+            long i, j;
+            long deadposn = -1;
+
+            recv_ret = recv(handle, buf, 1, 0);
+
+            printf("[DC] recv returned %d\n", recv_ret);
+
+            /* find the dead VM in our list and mark it as dead */
+            for (i = 0; i < s->total_count; i++) {
+                if (s->live_vms[i].sockfd == handle) {
+                    deadposn = i;
+                    s->live_vms[i].alive = 0;
+                    close(s->live_vms[i].sockfd);
+
+                    for (j = 0; j < s->msi_vectors; j++) {
+                        close(s->live_vms[i].efd[j]);
+                    }
+
+                    free(s->live_vms[i].efd);
+                    s->live_vms[i].sockfd = -1;
+                    break;
+                }
+            }
+
+            for (j = 0; j < s->total_count; j++) {
+                /* update remaining clients that one client has left/died */
+                if (s->live_vms[j].alive) {
+                    printf("[UD] sending kill of fd[%ld] to %ld\n",
+                                                                deadposn, j);
+                    sendKill(s->live_vms[j].sockfd, deadposn, sizeof(deadposn));
+                }
+            }
+
+            s->live_count--;
+
+            /* close the socket for the departed VM */
+            close(handle);
+        }
+
+    }
+
+    return 0;
+}
+
+void add_new_guest(server_state_t * s) {
+
+    struct sockaddr_un remote;
+    socklen_t t = sizeof(remote);
+    long i, j;
+    int vm_sock;
+    long new_posn;
+    long neg1 = -1;
+
+    vm_sock = accept(s->conn_socket, (struct sockaddr *)&remote, &t);
+
+    if ( vm_sock == -1 ) {
+        perror("accept");
+        exit(1);
+    }
+
+    new_posn = s->total_count;
+
+    if (new_posn == s->nr_allocated_vms) {
+        printf("increasing vm slots\n");
+        s->nr_allocated_vms = s->nr_allocated_vms * 2;
+        if (s->nr_allocated_vms < 16)
+            s->nr_allocated_vms = 16;
+        s->live_vms = realloc(s->live_vms,
+                    s->nr_allocated_vms * sizeof(vmguest_t));
+
+        if (s->live_vms == NULL) {
+            fprintf(stderr, "realloc failed - quitting\n");
+            exit(-1);
+        }
+    }
+
+    s->live_vms[new_posn].posn = new_posn;
+    printf("[NC] Live_vms[%ld]\n", new_posn);
+    s->live_vms[new_posn].efd = (int *) malloc(s->msi_vectors * sizeof(int));
+    for (i = 0; i < s->msi_vectors; i++) {
+        s->live_vms[new_posn].efd[i] = eventfd(0, 0);
+        printf("\tefd[%ld] = %d\n", i, s->live_vms[new_posn].efd[i]);
+    }
+    s->live_vms[new_posn].sockfd = vm_sock;
+    s->live_vms[new_posn].alive = 1;
+
+
+    sendPosition(vm_sock, new_posn);
+    sendUpdate(vm_sock, neg1, sizeof(long), s->shm_fd);
+    printf("[NC] trying to send fds to new connection\n");
+    sendRights(vm_sock, new_posn, sizeof(new_posn), s->live_vms, s->msi_vectors);
+
+    printf("[NC] Connected (count = %ld).\n", new_posn);
+    for (i = 0; i < new_posn; i++) {
+        if (s->live_vms[i].alive) {
+            // ping all clients that a new client has joined
+            printf("[UD] sending fd[%ld] to %ld\n", new_posn, i);
+            for (j = 0; j < s->msi_vectors; j++) {
+                printf("\tefd[%ld] = [%d]", j, s->live_vms[new_posn].efd[j]);
+                sendUpdate(s->live_vms[i].sockfd, new_posn,
+                        sizeof(new_posn), s->live_vms[new_posn].efd[j]);
+            }
+            printf("\n");
+        }
+    }
+
+    s->total_count++;
+}
+
+int create_listening_socket(char * path) {
+
+    struct sockaddr_un local;
+    int len, conn_socket;
+
+    if ((conn_socket = socket(AF_UNIX, SOCK_STREAM, 0)) == -1) {
+        perror("socket");
+        exit(1);
+    }
+
+    local.sun_family = AF_UNIX;
+    strcpy(local.sun_path, path);
+    unlink(local.sun_path);
+    len = strlen(local.sun_path) + sizeof(local.sun_family);
+    if (bind(conn_socket, (struct sockaddr *)&local, len) == -1) {
+        perror("bind");
+        exit(1);
+    }
+
+    if (listen(conn_socket, 5) == -1) {
+        perror("listen");
+        exit(1);
+    }
+
+    return conn_socket;
+
+}
+
+void parse_args(int argc, char **argv, server_state_t * s) {
+
+    int c;
+
+    s->shm_size = 1024 * 1024; // default shm_size
+    s->path = NULL;
+    s->shmobj = NULL;
+    s->msi_vectors = 1;
+
+    while ((c = getopt(argc, argv, "hp:s:m:n:")) != -1) {
+
+        switch (c) {
+            // path to listening socket
+            case 'p':
+                s->path = optarg;
+                break;
+            // name of shared memory object
+            case 's':
+                s->shmobj = optarg;
+                break;
+            // size of shared memory object
+            case 'm':
+                s->shm_size = atol(optarg) * 1024 * 1024;
+                break;
+            case 'n':
+                s->msi_vectors = atol(optarg);
+                break;
+            case 'h':
+            default:
+                usage(argv[0]);
+                exit(1);
+        }
+    }
+
+    if (s->path == NULL) {
+        s->path = strdup(DEFAULT_SOCK_PATH);
+    }
+
+    printf("listening socket: %s\n", s->path);
+
+    if (s->shmobj == NULL) {
+        s->shmobj = strdup(DEFAULT_SHM_OBJ);
+    }
+
+    printf("shared object: %s\n", s->shmobj);
+    printf("shared object size: %d MB\n", s->shm_size);
+
+}
+
+void print_vec(server_state_t * s, const char * c) {
+
+    int i, j;
+
+#if DEBUG
+    printf("%s (%ld) = ", c, s->total_count);
+    for (i = 0; i < s->total_count; i++) {
+        if (s->live_vms[i].alive) {
+            for (j = 0; j < s->msi_vectors; j++) {
+                printf("[%d|%d] ", s->live_vms[i].sockfd, s->live_vms[i].efd[j]);
+            }
+        }
+    }
+    printf("\n");
+#endif
+
+}
+
+int find_set(fd_set * readset, int max) {
+
+    int i;
+
+    for (i = 1; i < max; i++) {
+        if (FD_ISSET(i, readset)) {
+            return i;
+        }
+    }
+
+    printf("nothing set\n");
+    return -1;
+
+}
+
+void usage(char const *prg) {
+    fprintf(stderr, "use: %s [-h] [-p <unix socket>] [-s <shm obj>]"
+                    " [-m <size in MB>] [-n <# of MSI vectors>]\n", prg);
+}
+
+
diff --git a/contrib/ivshmem-server/send_scm.c b/contrib/ivshmem-server/send_scm.c
new file mode 100644
index 0000000..b1bb4a3
--- /dev/null
+++ b/contrib/ivshmem-server/send_scm.c
@@ -0,0 +1,208 @@
+#include <stdint.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <sys/syscall.h>
+#include <sys/un.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <poll.h>
+#include "send_scm.h"
+
+#ifndef POLLRDHUP
+#define POLLRDHUP 0x2000
+#endif
+
+int readUpdate(int fd, long * posn, int * newfd)
+{
+    struct msghdr msg;
+    struct iovec iov[1];
+    struct cmsghdr *cmptr;
+    size_t len;
+    size_t msg_size = sizeof(int);
+    char control[CMSG_SPACE(msg_size)];
+
+    msg.msg_name = 0;
+    msg.msg_namelen = 0;
+    msg.msg_control = control;
+    msg.msg_controllen = sizeof(control);
+    msg.msg_flags = 0;
+    msg.msg_iov = iov;
+    msg.msg_iovlen = 1;
+
+    iov[0].iov_base = &posn;
+    iov[0].iov_len = sizeof(posn);
+
+    do {
+        len = recvmsg(fd, &msg, 0);
+    } while (len == (size_t) (-1) && (errno == EINTR || errno == EAGAIN));
+
+    printf("iov[0].buf is %ld\n", *((long *)iov[0].iov_base));
+    printf("len is %ld\n", len);
+    // TODO: Logging
+    if (len == (size_t) (-1)) {
+        perror("recvmsg()");
+        return -1;
+    }
+
+    if (msg.msg_controllen < sizeof(struct cmsghdr))
+        return *posn;
+
+    for (cmptr = CMSG_FIRSTHDR(&msg); cmptr != NULL;
+        cmptr = CMSG_NXTHDR(&msg, cmptr)) {
+        if (cmptr->cmsg_level != SOL_SOCKET ||
+            cmptr->cmsg_type != SCM_RIGHTS){
+                printf("continuing %ld\n", sizeof(size_t));
+                printf("read msg_size = %ld\n", msg_size);
+                if (cmptr->cmsg_len != sizeof(control))
+                    printf("not equal (%ld != %ld)\n",cmptr->cmsg_len,sizeof(control));
+                continue;
+        }
+
+        memcpy(newfd, CMSG_DATA(cmptr), sizeof(int));
+        printf("posn is %ld (fd = %d)\n", *posn, *newfd);
+        return 0;
+    }
+
+    fprintf(stderr, "bad data in packet\n");
+    return -1;
+}
+
+int readRights(int fd, long count, size_t count_len, int **fds, int msi_vectors)
+{
+    int j, newfd;
+
+    for (; ;){
+        long posn = 0;
+
+        readUpdate(fd, &posn, &newfd);
+        printf("reading posn %ld ", posn);
+        fds[posn] = (int *)malloc (msi_vectors * sizeof(int));
+        fds[posn][0] = newfd;
+        for (j = 1; j < msi_vectors; j++) {
+            readUpdate(fd, &posn, &newfd);
+            fds[posn][j] = newfd;
+            printf("%d.", fds[posn][j]);
+        }
+        printf("\n");
+
+        /* stop reading once i've read my own eventfds */
+        if (posn == count)
+            break;
+    }
+
+    return 0;
+}
+
+int sendKill(int fd, long const posn, size_t posn_len) {
+
+    struct cmsghdr *cmsg;
+    size_t msg_size = sizeof(int);
+    char control[CMSG_SPACE(msg_size)];
+    struct iovec iov[1];
+    size_t len;
+    struct msghdr msg = { 0, 0, iov, 1, control, sizeof control, 0 };
+
+    struct pollfd mypollfd;
+    int rv;
+
+    iov[0].iov_base = (void *) &posn;
+    iov[0].iov_len = posn_len;
+
+    // from cmsg(3)
+    cmsg = CMSG_FIRSTHDR(&msg);
+    cmsg->cmsg_level = SOL_SOCKET;
+    cmsg->cmsg_len = 0;
+    msg.msg_controllen = cmsg->cmsg_len;
+
+    printf("Killing posn %ld\n", posn);
+
+    // check if the fd is dead or not
+    mypollfd.fd = fd;
+    mypollfd.events = POLLRDHUP;
+    mypollfd.revents = 0;
+
+    rv = poll(&mypollfd, 1, 0);
+
+    printf("rv is %d\n", rv);
+
+    if (rv == 0) {
+        len = sendmsg(fd, &msg, 0);
+        if (len == (size_t) (-1)) {
+            perror("sendmsg()");
+            return -1;
+        }
+        return (len == posn_len);
+    } else {
+        printf("already dead\n");
+        return 0;
+    }
+}
+
+int sendUpdate(int fd, long posn, size_t posn_len, int sendfd)
+{
+
+    struct cmsghdr *cmsg;
+    size_t msg_size = sizeof(int);
+    char control[CMSG_SPACE(msg_size)];
+    struct iovec iov[1];
+    size_t len;
+    struct msghdr msg = { 0, 0, iov, 1, control, sizeof control, 0 };
+
+    iov[0].iov_base = (void *) (&posn);
+    iov[0].iov_len = posn_len;
+
+    // from cmsg(3)
+    cmsg = CMSG_FIRSTHDR(&msg);
+    cmsg->cmsg_level = SOL_SOCKET;
+    cmsg->cmsg_type = SCM_RIGHTS;
+    cmsg->cmsg_len = CMSG_LEN(msg_size);
+    msg.msg_controllen = cmsg->cmsg_len;
+
+    memcpy((CMSG_DATA(cmsg)), &sendfd, msg_size);
+
+    len = sendmsg(fd, &msg, 0);
+    if (len == (size_t) (-1)) {
+        perror("sendmsg()");
+        return -1;
+    }
+
+    return (len == posn_len);
+
+}
+
+int sendPosition(int fd, long const posn)
+{
+    int rv;
+
+    rv = send(fd, &posn, sizeof(long), 0);
+    if (rv != sizeof(long)) {
+        fprintf(stderr, "error sending posn\n");
+        return -1;
+    }
+
+    return 0;
+}
+
+int sendRights(int fd, long const count, size_t count_len, vmguest_t * Live_vms,
+                                                            long msi_vectors)
+{
+    /* updates about new guests are sent one at a time */
+
+    long i, j;
+
+    for (i = 0; i <= count; i++) {
+        if (Live_vms[i].alive) {
+            for (j = 0; j < msi_vectors; j++) {
+                sendUpdate(Live_vms[count].sockfd, i, sizeof(long),
+                                                        Live_vms[i].efd[j]);
+            }
+        }
+    }
+
+    return 0;
+
+}
diff --git a/contrib/ivshmem-server/send_scm.h b/contrib/ivshmem-server/send_scm.h
new file mode 100644
index 0000000..48c9a8d
--- /dev/null
+++ b/contrib/ivshmem-server/send_scm.h
@@ -0,0 +1,19 @@
+#ifndef SEND_SCM
+#define SEND_SCM
+
+struct vm_guest_conn {
+    int posn;
+    int sockfd;
+    int * efd;
+    int alive;
+};
+
+typedef struct vm_guest_conn vmguest_t;
+
+int readRights(int fd, long count, size_t count_len, int **fds, int msi_vectors);
+int sendRights(int fd, long const count, size_t count_len, vmguest_t *Live_vms, long msi_vectors);
+int readUpdate(int fd, long * posn, int * newfd);
+int sendUpdate(int fd, long const posn, size_t posn_len, int sendfd);
+int sendPosition(int fd, long const posn);
+int sendKill(int fd, long const posn, size_t posn_len);
+#endif
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v5 4/5] RESEND: Inter-VM shared memory PCI device
  2010-04-21 17:53         ` [Qemu-devel] " Cam Macdonell
@ 2010-05-05 16:57           ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-05 16:57 UTC (permalink / raw)
  To: kvm; +Cc: qemu-devel, Cam Macdonell

For completeness, just a one-line change to use the merged version of kvm_set_irqfd()
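
For reference, the guest-visible register interface is unchanged by this
resend: a guest interrupts a peer by writing to the Doorbell register at
offset 12 of the register BAR, with the destination VM id in the upper 16
bits of the value and the vector number in the low bits.  A sketch of such
an access from a guest-side driver (illustrative only; 'regs' is assumed to
point at the mapped register BAR):

#include <stdint.h>

/* offsets match the ivshmem_registers enum in hw/ivshmem.c */
enum { IntrMask = 0, IntrStatus = 4, IVPosition = 8, Doorbell = 12 };

/* read our own id, as assigned by the shared memory server */
static uint32_t ivshmem_my_id(volatile uint32_t *regs)
{
    return regs[IVPosition / 4];
}

/* ring MSI vector 'vec' of peer 'dest' */
static void ivshmem_ring(volatile uint32_t *regs, uint16_t dest, uint16_t vec)
{
    regs[Doorbell / 4] = ((uint32_t)dest << 16) | vec;
}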

---
 Makefile.target |    3 +
 hw/ivshmem.c    |  727 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 qemu-char.c     |    6 +
 qemu-char.h     |    3 +
 qemu-doc.texi   |   25 ++
 5 files changed, 764 insertions(+), 0 deletions(-)
 create mode 100644 hw/ivshmem.c

diff --git a/Makefile.target b/Makefile.target
index cb5ab2a..8a4cef3 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -202,6 +202,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
 obj-y += rtl8139.o
 obj-y += e1000.o
 
+# Inter-VM PCI shared memory
+obj-y += ivshmem.o
+
 # Hardware support
 obj-i386-y = pckbd.o dma.o
 obj-i386-y += vga.o
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
new file mode 100644
index 0000000..e73c543
--- /dev/null
+++ b/hw/ivshmem.c
@@ -0,0 +1,727 @@
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *      Cam Macdonell <cam@cs.ualberta.ca>
+ *
+ * Based On: cirrus_vga.c and rtl8139.c
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/io.h>
+#include <sys/ioctl.h>
+#include <sys/eventfd.h>
+#include "hw.h"
+#include "console.h"
+#include "pc.h"
+#include "pci.h"
+#include "sysemu.h"
+
+#include "msix.h"
+#include "qemu-kvm.h"
+#include "libkvm.h"
+
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#define IVSHMEM_IRQFD   0
+#define IVSHMEM_MSI     1
+
+#define DEBUG_IVSHMEM
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...)        \
+    do {printf("IVSHMEM: " fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+typedef struct EventfdEntry {
+    PCIDevice *pdev;
+    int vector;
+} EventfdEntry;
+
+typedef struct IVShmemState {
+    PCIDevice dev;
+    uint32_t intrmask;
+    uint32_t intrstatus;
+    uint32_t doorbell;
+
+    CharDriverState * chr;
+    CharDriverState ** eventfd_chr;
+    int ivshmem_mmio_io_addr;
+
+    pcibus_t mmio_addr;
+    unsigned long ivshmem_offset;
+    uint64_t ivshmem_size; /* size of shared memory region */
+    int shm_fd; /* shared memory file descriptor */
+
+    int nr_allocated_vms;
+    /* array of eventfds for each guest */
+    int ** eventfds;
+    /* keep track of # of eventfds for each guest*/
+    int * eventfds_posn_count;
+
+    int nr_alloc_guests;
+    int vm_id;
+    int num_eventfds;
+    uint32_t vectors;
+    uint32_t features;
+    EventfdEntry *eventfd_table;
+
+    char * shmobj;
+    char * sizearg;
+} IVShmemState;
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+    IntrMask = 0,
+    IntrStatus = 4,
+    IVPosition = 8,
+    Doorbell = 12,
+};
+
+static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
+    return (ivs->features & (1 << feature));
+}
+
+static inline int is_power_of_two(int x) {
+    return (x & (x-1)) == 0;
+}
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+                    pcibus_t addr, pcibus_t size, int type)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
+
+    IVSHMEM_DPRINTF("addr = %u size = %u\n", (uint32_t)addr, (uint32_t)size);
+    cpu_register_physical_memory(addr, s->ivshmem_size, s->ivshmem_offset);
+
+}
+
+/* accessing registers - based on rtl8139 */
+static void ivshmem_update_irq(IVShmemState *s, int val)
+{
+    int isr;
+    isr = (s->intrstatus & s->intrmask) & 0xffffffff;
+
+    /* don't print ISR resets */
+    if (isr) {
+        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
+           isr ? 1 : 0, s->intrstatus, s->intrmask);
+    }
+
+    qemu_set_irq(s->dev.irq[0], (isr != 0));
+}
+
+static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
+
+    s->intrmask = val;
+
+    ivshmem_update_irq(s, val);
+}
+
+static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrmask;
+
+    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
+
+    return ret;
+}
+
+static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
+
+    s->intrstatus = val;
+
+    ivshmem_update_irq(s, val);
+    return;
+}
+
+static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrstatus;
+
+    /* reading ISR clears all interrupts */
+    s->intrstatus = 0;
+
+    ivshmem_update_irq(s, 0);
+
+    return ret;
+}
+
+static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val)
+{
+
+    IVSHMEM_DPRINTF("We shouldn't be writing words\n");
+}
+
+static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVShmemState *s = opaque;
+
+    u_int64_t write_one = 1;
+    u_int16_t dest = val >> 16;
+    u_int16_t vector = val & 0xff;
+
+    addr &= 0xfe;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ivshmem_IntrMask_write(s, val);
+            break;
+
+        case IntrStatus:
+            ivshmem_IntrStatus_write(s, val);
+            break;
+
+        case Doorbell:
+            /* check doorbell range */
+            if ((vector >= 0) && (vector < s->eventfds_posn_count[dest])) {
+                IVSHMEM_DPRINTF("Writing %ld to VM %d on vector %d\n", write_one, dest, vector);
+                if (write(s->eventfds[dest][vector], &(write_one), 8) != 8) {
+                    IVSHMEM_DPRINTF("error writing to eventfd\n");
+                }
+            }
+            break;
+        default:
+            IVSHMEM_DPRINTF("Invalid VM Doorbell VM %d\n", dest);
+    }
+}
+
+static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVSHMEM_DPRINTF("We shouldn't be writing bytes\n");
+}
+
+static uint32_t ivshmem_io_readw(void *opaque, uint8_t addr)
+{
+
+    IVSHMEM_DPRINTF("We shouldn't be reading words\n");
+    return 0;
+}
+
+static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr)
+{
+
+    IVShmemState *s = opaque;
+    uint32_t ret;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ret = ivshmem_IntrMask_read(s);
+            break;
+
+        case IntrStatus:
+            ret = ivshmem_IntrStatus_read(s);
+            break;
+
+        case IVPosition:
+            /* return my id in the ivshmem list */
+            ret = s->vm_id;
+            break;
+
+        default:
+            IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr);
+            ret = 0;
+    }
+
+    return ret;
+
+}
+
+static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr)
+{
+    IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
+
+    return 0;
+}
+
+static void ivshmem_mmio_writeb(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writeb(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writew(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writew(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writel(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writel(opaque, addr & 0xFF, val);
+}
+
+static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr)
+{
+    return ivshmem_io_readb(opaque, addr & 0xFF);
+}
+
+static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readw(opaque, addr & 0xFF);
+    return val;
+}
+
+static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readl(opaque, addr & 0xFF);
+    return val;
+}
+
+static CPUReadMemoryFunc *ivshmem_mmio_read[3] = {
+    ivshmem_mmio_readb,
+    ivshmem_mmio_readw,
+    ivshmem_mmio_readl,
+};
+
+static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = {
+    ivshmem_mmio_writeb,
+    ivshmem_mmio_writew,
+    ivshmem_mmio_writel,
+};
+
+static void ivshmem_receive(void *opaque, const uint8_t *buf, int size)
+{
+    IVShmemState *s = opaque;
+
+    ivshmem_IntrStatus_write(s, *buf);
+
+    IVSHMEM_DPRINTF("ivshmem_receive 0x%02x\n", *buf);
+}
+
+static int ivshmem_can_receive(void * opaque)
+{
+    return 8;
+}
+
+static void ivshmem_event(void *opaque, int event)
+{
+    IVSHMEM_DPRINTF("ivshmem_event %d\n", event);
+}
+
+static void fake_irqfd(void *opaque, const uint8_t *buf, int size) {
+
+    EventfdEntry *entry = opaque;
+    PCIDevice *pdev = entry->pdev;
+
+    IVSHMEM_DPRINTF("fake irqfd on vector %d\n", entry->vector);
+    msix_notify(pdev, entry->vector);
+}
+
+static CharDriverState* create_eventfd_chr_device(void * opaque, int eventfd,
+                                                                    int vector)
+{
+    /* create a event character device based on the passed eventfd */
+    IVShmemState *s = opaque;
+    CharDriverState * chr;
+
+    chr = qemu_chr_open_eventfd(eventfd);
+
+    if (chr == NULL) {
+        IVSHMEM_DPRINTF("creating eventfd for eventfd %d failed\n", eventfd);
+        exit(-1);
+    }
+
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        s->eventfd_table[vector].pdev = &s->dev;
+        s->eventfd_table[vector].vector = vector;
+
+        qemu_chr_add_handlers(chr, ivshmem_can_receive, fake_irqfd,
+                      ivshmem_event, &s->eventfd_table[vector]);
+    } else {
+        qemu_chr_add_handlers(chr, ivshmem_can_receive, ivshmem_receive,
+                      ivshmem_event, s);
+    }
+
+    return chr;
+
+}
+
+static int check_shm_size(IVShmemState *s, int shmemfd) {
+    /* check that the guest isn't going to try and map more memory than the
+     * shared memory server allocated; return -1 to indicate an error */
+
+    struct stat buf;
+
+    fstat(shmemfd, &buf);
+
+    if (s->ivshmem_size > buf.st_size) {
+        fprintf(stderr, "IVSHMEM ERROR: Requested memory size greater");
+        fprintf(stderr, " than shared object size (%ld > %ld)\n",
+                                          s->ivshmem_size, buf.st_size);
+        return -1;
+    } else {
+        return 0;
+    }
+}
+
+static void create_shared_memory_BAR(IVShmemState *s, int fd) {
+
+    s->shm_fd = fd;
+
+    s->ivshmem_offset = qemu_ram_mmap(s->shm_fd, s->ivshmem_size,
+             MAP_SHARED, 0);
+
+    /* region for shared memory */
+    pci_register_bar(&s->dev, 2, s->ivshmem_size,
+                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
+}
+
+static void close_guest_eventfds(IVShmemState *s, int posn)
+{
+    int i, guest_curr_max;
+
+    guest_curr_max = s->eventfds_posn_count[posn];
+
+    for (i = 0; i < guest_curr_max; i++)
+        close(s->eventfds[posn][i]);
+
+    free(s->eventfds[posn]);
+    s->eventfds_posn_count[posn] = 0;
+}
+
+/* this function increases the dynamic storage needed to store data about
+ * other guests */
+static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
+
+    int j, old_nr_alloc;
+
+    old_nr_alloc = s->nr_alloc_guests;
+
+    while (s->nr_alloc_guests < new_min_size)
+        s->nr_alloc_guests = s->nr_alloc_guests * 2;
+
+    IVSHMEM_DPRINTF("bumping storage to %d guests\n", s->nr_alloc_guests);
+    s->eventfds = qemu_realloc(s->eventfds, s->nr_alloc_guests *
+                                                        sizeof(int *));
+    s->eventfds_posn_count = qemu_realloc(s->eventfds_posn_count,
+                                                    s->nr_alloc_guests *
+                                                        sizeof(int));
+    s->eventfd_table = qemu_realloc(s->eventfd_table, s->nr_alloc_guests *
+                                                    sizeof(EventfdEntry));
+
+    if ((s->eventfds == NULL) || (s->eventfds_posn_count == NULL) ||
+            (s->eventfd_table == NULL)) {
+        fprintf(stderr, "Allocation error - exiting\n");
+        exit(1);
+    }
+
+    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+        s->eventfd_chr = (CharDriverState **)qemu_realloc(s->eventfd_chr,
+                                    s->nr_alloc_guests * sizeof(void *));
+        if (s->eventfd_chr == NULL) {
+            fprintf(stderr, "Allocation error - exiting\n");
+            exit(1);
+        }
+    }
+
+    /* zero out new pointers */
+    for (j = old_nr_alloc; j < s->nr_alloc_guests; j++) {
+        s->eventfds[j] = NULL;
+    }
+}
+
+static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
+{
+    IVShmemState *s = opaque;
+    int incoming_fd, tmp_fd;
+    int guest_curr_max;
+    long incoming_posn;
+
+    memcpy(&incoming_posn, buf, sizeof(long));
+    /* pick off s->chr->msgfd and store it, posn should accompany msg */
+    tmp_fd = qemu_chr_get_msgfd(s->chr);
+    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
+
+    /* make sure we have enough space for this guest */
+    if (incoming_posn >= s->nr_alloc_guests) {
+        increase_dynamic_storage(s, incoming_posn);
+    }
+
+    if (tmp_fd == -1) {
+        /* if posn is positive and unseen before then this is our posn*/
+        if ((incoming_posn >= 0) && (s->eventfds[incoming_posn] == NULL)) {
+            /* receive our posn */
+            s->vm_id = incoming_posn;
+            return;
+        } else {
+            /* otherwise an fd == -1 means an existing guest has gone away */
+            IVSHMEM_DPRINTF("posn %ld has gone away\n", incoming_posn);
+            close_guest_eventfds(s, incoming_posn);
+            return;
+        }
+    }
+
+    /* because of the implementation of get_msgfd, we need a dup */
+    incoming_fd = dup(tmp_fd);
+
+    /* if the position is -1, then it's shared memory region fd */
+    if (incoming_posn == -1) {
+
+        s->num_eventfds = 0;
+
+        if (check_shm_size(s, incoming_fd) == -1) {
+            exit(-1);
+        }
+
+        /* creating a BAR in qemu_chr callback may be crazy */
+        create_shared_memory_BAR(s, incoming_fd);
+
+       return;
+    }
+
+    /* each guest has an array of eventfds, and we keep track of how many
+     * eventfds each guest has */
+    guest_curr_max = s->eventfds_posn_count[incoming_posn];
+    if (guest_curr_max == 0) {
+        /* one eventfd per MSI vector */
+        s->eventfds[incoming_posn] = (int *) qemu_malloc(s->vectors *
+                                                                sizeof(int));
+    }
+
+    /* this is an eventfd for a particular guest VM */
+    IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn, guest_curr_max,
+                                                                incoming_fd);
+    s->eventfds[incoming_posn][guest_curr_max] = incoming_fd;
+
+    /* increment count for particular guest */
+    s->eventfds_posn_count[incoming_posn]++;
+
+    /* ioeventfd and irqfd are enabled together,
+     * so the flag IRQFD refers to both */
+    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) && guest_curr_max >= 0) {
+        /* allocate ioeventfd for the new fd
+         * received for guest @ incoming_posn */
+        kvm_set_ioeventfd_mmio_long(incoming_fd, s->mmio_addr + Doorbell,
+                                (incoming_posn << 16) | guest_curr_max, 1);
+    }
+
+    /* keep track of the maximum VM ID */
+    if (incoming_posn > s->num_eventfds) {
+        s->num_eventfds = incoming_posn;
+    }
+
+    if (incoming_posn == s->vm_id) {
+        if (ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+            /* setup irqfd for this VM's eventfd */
+            int vector = guest_curr_max;
+            kvm_set_irqfd(s->dev.msix_irq_entries[vector].gsi,
+                                s->eventfds[s->vm_id][guest_curr_max], 1);
+        } else {
+            /* initialize char device for callback
+             * if this is one of my eventfds */
+            s->eventfd_chr[guest_curr_max] = create_eventfd_chr_device(s,
+                s->eventfds[s->vm_id][guest_curr_max], guest_curr_max);
+        }
+    }
+
+    return;
+}
+
+static void ivshmem_reset(DeviceState *d)
+{
+    return;
+}
+
+static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
+                       pcibus_t addr, pcibus_t size, int type)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
+
+    s->mmio_addr = addr;
+    cpu_register_physical_memory(addr + 0, 0x400, s->ivshmem_mmio_io_addr);
+
+    /* now that our mmio region has been allocated, we can receive
+     * the file descriptors */
+    if (s->chr != NULL) {
+        qemu_chr_add_handlers(s->chr, ivshmem_can_receive, ivshmem_read,
+                     ivshmem_event, s);
+    }
+
+}
+
+static uint64_t ivshmem_get_size(IVShmemState * s) {
+
+    uint64_t value;
+    char *ptr;
+
+    value = strtoul(s->sizearg, &ptr, 10);
+    switch (*ptr) {
+        case 0: case 'M': case 'm':
+            value <<= 20;
+            break;
+        case 'G': case 'g':
+            value <<= 30;
+            break;
+        default:
+            fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg);
+            exit(1);
+    }
+
+    /* BARs must be a power of 2 */
+    if (!is_power_of_two(value)) {
+        fprintf(stderr, "ivshmem: size must be power of 2\n");
+        exit(1);
+    }
+
+    return value;
+
+}
+
+static int pci_ivshmem_init(PCIDevice *dev)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
+    uint8_t *pci_conf;
+    int i;
+
+    if (s->sizearg == NULL)
+        s->ivshmem_size = 4 << 20; /* 4 MB default */
+    else {
+        s->ivshmem_size = ivshmem_get_size(s);
+    }
+
+    /* IRQFD requires MSI */
+    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) &&
+        !ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n");
+        exit(1);
+    }
+
+    pci_conf = s->dev.config;
+    pci_conf[0x00] = 0xf4; /* Qumranet vendor ID 0x5002 */
+    pci_conf[0x01] = 0x1a;
+    pci_conf[0x02] = 0x10;
+    pci_conf[0x03] = 0x11;
+    pci_conf[0x04] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
+    pci_conf[0x0a] = 0x00; /* RAM controller */
+    pci_conf[0x0b] = 0x05;
+    pci_conf[0x0e] = 0x00; /* header_type */
+
+    s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read,
+                                    ivshmem_mmio_write, s);
+    /* region for registers*/
+    pci_register_bar(&s->dev, 0, 0x400,
+                           PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_mmio_map);
+
+    /* allocate the MSI-X vectors */
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+
+        if (!msix_init(&s->dev, s->vectors, 1, 0)) {
+            pci_register_bar(&s->dev, 1,
+                             msix_bar_size(&s->dev),
+                             PCI_BASE_ADDRESS_SPACE_MEMORY,
+                             msix_mmio_map);
+            IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
+        } else {
+            IVSHMEM_DPRINTF("msix initialization failed\n");
+        }
+
+        /* 'activate' the vectors */
+        for (i = 0; i < s->vectors; i++) {
+            msix_vector_use(&s->dev, i);
+        }
+    }
+
+    if ((s->chr != NULL) && (strncmp(s->chr->filename, "unix:", 5) == 0)) {
+        /* if we get a UNIX socket as the parameter we will talk
+         * to the ivshmem server later once the MMIO BAR is actually
+         * allocated (see ivshmem_mmio_map) */
+
+        IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
+                                                            s->chr->filename);
+
+        /* we allocate enough space for 16 guests and grow as needed */
+        s->nr_alloc_guests = 16;
+        s->vm_id = -1;
+
+        /* allocate/initialize space for interrupt handling */
+        s->eventfds = qemu_mallocz(s->nr_alloc_guests * sizeof(int *));
+        s->eventfd_table = qemu_mallocz(s->vectors * sizeof(EventfdEntry));
+        s->eventfds_posn_count = qemu_mallocz(s->nr_alloc_guests * sizeof(int));
+
+        pci_conf[PCI_INTERRUPT_PIN] = 1; /* we are going to support interrupts */
+
+        if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+            s->eventfd_chr = (CharDriverState **)qemu_malloc(s->nr_alloc_guests *
+                                                            sizeof(void *));
+        }
+
+    } else {
+        /* just map the file immediately, we're not using a server */
+        int fd;
+
+        if (s->shmobj == NULL) {
+            fprintf(stderr, "Must specify 'chardev' or 'shm' to ivshmem\n");
+        }
+
+        IVSHMEM_DPRINTF("using shm_open (shm object = %s)\n", s->shmobj);
+
+        /* try opening with O_EXCL; if it succeeds the object is new, so
+         * size it (and implicitly zero it) with ftruncate */
+        if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR|O_EXCL,
+                        S_IRWXU|S_IRWXG|S_IRWXO)) > 0) {
+           /* truncate file to length PCI device's memory */
+            if (ftruncate(fd, s->ivshmem_size) != 0) {
+                fprintf(stderr, "kvm_ivshmem: could not truncate shared file\n");
+            }
+
+        } else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR,
+                        S_IRWXU|S_IRWXG|S_IRWXO)) < 0) {
+            fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
+            exit(-1);
+        }
+
+        create_shared_memory_BAR(s, fd);
+
+    }
+
+
+    return 0;
+}
+
+static int pci_ivshmem_uninit(PCIDevice *dev)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
+
+    cpu_unregister_io_memory(s->ivshmem_mmio_io_addr);
+
+    return 0;
+}
+
+static PCIDeviceInfo ivshmem_info = {
+    .qdev.name  = "ivshmem",
+    .qdev.size  = sizeof(IVShmemState),
+    .qdev.reset = ivshmem_reset,
+    .init       = pci_ivshmem_init,
+    .exit       = pci_ivshmem_uninit,
+    .qdev.props = (Property[]) {
+        DEFINE_PROP_CHR("chardev", IVShmemState, chr),
+        DEFINE_PROP_STRING("size", IVShmemState, sizearg),
+        DEFINE_PROP_UINT32("vectors", IVShmemState, vectors, 1),
+        DEFINE_PROP_BIT("irqfd", IVShmemState, features, IVSHMEM_IRQFD, false),
+        DEFINE_PROP_BIT("msi", IVShmemState, features, IVSHMEM_MSI, true),
+        DEFINE_PROP_STRING("shm", IVShmemState, shmobj),
+        DEFINE_PROP_END_OF_LIST(),
+    }
+};
+
+static void ivshmem_register_devices(void)
+{
+    pci_qdev_register(&ivshmem_info);
+}
+
+device_init(ivshmem_register_devices)
diff --git a/qemu-char.c b/qemu-char.c
index ac65a1c..b2e50d0 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2093,6 +2093,12 @@ static void tcp_chr_read(void *opaque)
     }
 }
 
+CharDriverState *qemu_chr_open_eventfd(int eventfd){
+
+    return qemu_chr_open_fd(eventfd, eventfd);
+
+}
+
 static void tcp_chr_connect(void *opaque)
 {
     CharDriverState *chr = opaque;
diff --git a/qemu-char.h b/qemu-char.h
index e3a0783..6ea01ba 100644
--- a/qemu-char.h
+++ b/qemu-char.h
@@ -94,6 +94,9 @@ void qemu_chr_info_print(Monitor *mon, const QObject *ret_data);
 void qemu_chr_info(Monitor *mon, QObject **ret_data);
 CharDriverState *qemu_chr_find(const char *name);
 
+/* add an eventfd to the qemu devices that are polled */
+CharDriverState *qemu_chr_open_eventfd(int eventfd);
+
 extern int term_escape_char;
 
 /* async I/O support */
diff --git a/qemu-doc.texi b/qemu-doc.texi
index 6647b7b..2df4687 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -706,6 +706,31 @@ Using the @option{-net socket} option, it is possible to make VLANs
 that span several QEMU instances. See @ref{sec_invocation} to have a
 basic example.
 
+@section Other Devices
+
+@subsection Inter-VM Shared Memory device
+
+With KVM enabled on a Linux host, a shared memory device is available.  The
+device maps a POSIX shared memory region into the guest as a PCI device,
+enabling zero-copy communication between guests at the application level.
+The basic syntax is:
+
+@example
+qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
+@end example
+
+If desired, interrupts can be sent between guest VMs accessing the same shared
+memory region.  Interrupt support requires using a shared memory server and
+using a chardev socket to connect to it.  The code for the shared memory server
+is qemu.git/contrib/ivshmem-server.  An example syntax when using the shared
+memory server is:
+
+@example
+qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
+                        [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
+qemu -chardev socket,path=<path>,id=<id>
+@end example
+
 @node direct_linux_boot
 @section Direct Linux Boot
 
-- 
1.6.3.2.198.g6096d


^ permalink raw reply related	[flat|nested] 102+ messages in thread
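To make the Doorbell encoding in ivshmem_io_writel() concrete, here is a
minimal guest-side sketch.  It assumes the 0x400-byte register BAR (BAR 0)
has already been mapped into the application's address space (for example
through the device's sysfs resource file or a small guest driver); the helper
names and the mapping step are illustrative and not part of the patch.

/* Hypothetical guest-side helpers for the ivshmem register BAR.
 * Register layout, from the patch: IntrMask = 0, IntrStatus = 4,
 * IVPosition = 8, Doorbell = 12.  A 32-bit Doorbell write encodes the
 * destination VM id in the upper 16 bits and the MSI vector number in the
 * low 8 bits. */
#include <stdint.h>

enum {
    IVSHMEM_INTRMASK   = 0,
    IVSHMEM_INTRSTATUS = 4,
    IVSHMEM_IVPOSITION = 8,
    IVSHMEM_DOORBELL   = 12,
};

/* regs points at the start of the mapped register BAR */
static uint32_t ivshmem_my_id(volatile uint32_t *regs)
{
    /* IVPosition reports this guest's id in the ivshmem list */
    return regs[IVSHMEM_IVPOSITION / 4];
}

static void ivshmem_ring(volatile uint32_t *regs, uint16_t dest_vm,
                         uint8_t vector)
{
    regs[IVSHMEM_DOORBELL / 4] = ((uint32_t)dest_vm << 16) | vector;
}

The device silently drops doorbells whose vector falls outside the
destination's registered eventfd range, as the range check in the Doorbell
case of ivshmem_io_writel() shows.  Following the syntax documented in the
qemu-doc.texi hunk above, an interrupt-capable invocation might look like
(socket path and size are illustrative values):

    qemu -chardev socket,path=/tmp/ivshmem_socket,id=ivshm \
         -device ivshmem,chardev=ivshm,size=1G,msi=on,irqfd=on,vectors=4

with contrib/ivshmem-server listening on the same socket path.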


* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-04-21 17:53         ` [Qemu-devel] " Cam Macdonell
@ 2010-05-06 17:32           ` Anthony Liguori
  -1 siblings, 0 replies; 102+ messages in thread
From: Anthony Liguori @ 2010-05-06 17:32 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: kvm, qemu-devel

On 04/21/2010 12:53 PM, Cam Macdonell wrote:
> Support an inter-vm shared memory device that maps a shared-memory object as a
> PCI device in the guest.  This patch also supports interrupts between guest by
> communicating over a unix domain socket.  This patch applies to the qemu-kvm
> repository.
>
>      -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>
> Interrupts are supported between multiple VMs by using a shared memory server
> by using a chardev socket.
>
>      -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>                      [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
>      -chardev socket,path=<path>,id=<id>
>
> (shared memory server is qemu.git/contrib/ivshmem-server)
>
> Sample programs and init scripts are in a git repo here:
>
>      www.gitorious.org/nahanni
> ---
>   Makefile.target |    3 +
>   hw/ivshmem.c    |  727 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   qemu-char.c     |    6 +
>   qemu-char.h     |    3 +
>   qemu-doc.texi   |   25 ++
>   5 files changed, 764 insertions(+), 0 deletions(-)
>   create mode 100644 hw/ivshmem.c
>
> diff --git a/Makefile.target b/Makefile.target
> index 1ffd802..bc9a681 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -199,6 +199,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
>   obj-y += rtl8139.o
>   obj-y += e1000.o
>
> +# Inter-VM PCI shared memory
> +obj-y += ivshmem.o
> +
>   # Hardware support
>   obj-i386-y = pckbd.o dma.o
>   obj-i386-y += vga.o
> diff --git a/hw/ivshmem.c b/hw/ivshmem.c
> new file mode 100644
> index 0000000..f8d8fdb
> --- /dev/null
> +++ b/hw/ivshmem.c
> @@ -0,0 +1,727 @@
> +/*
> + * Inter-VM Shared Memory PCI device.
> + *
> + * Author:
> + *      Cam Macdonell<cam@cs.ualberta.ca>
> + *
> + * Based On: cirrus_vga.c and rtl8139.c
> + *
> + * This code is licensed under the GNU GPL v2.
> + */
> +#include<sys/mman.h>
> +#include<sys/types.h>
> +#include<sys/socket.h>
> +#include<sys/io.h>
> +#include<sys/ioctl.h>
> +#include<sys/eventfd.h>
>    

This will break the Windows build, along with any non-Linux Unix or any Linux
host old enough to not have eventfd support.

If it's based on cirrus_vga.c and rtl8139.c, then it ought to carry the 
respective copyrights, no?

Regards,

Anthony Liguori
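
One possible direction for the portability concern above, sketched purely for
illustration (the configure symbol below is an assumption made for this
example, not something established in this thread), is to compile the
eventfd-dependent pieces conditionally:

/* Illustrative sketch only: CONFIG_EVENTFD stands in for a configure-time
 * test for eventfd(2); the symbol name is assumed, not taken from this
 * thread. */
#ifdef CONFIG_EVENTFD
#include <sys/eventfd.h>
#define IVSHMEM_HAVE_EVENTFD 1
#else
#define IVSHMEM_HAVE_EVENTFD 0
#endif

The interrupt/doorbell paths would then be gated on IVSHMEM_HAVE_EVENTFD, or
the object could be made conditional in Makefile.target instead of obj-y so
that non-Linux targets skip the device entirely.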

> +#include "hw.h"
> +#include "console.h"
> +#include "pc.h"
> +#include "pci.h"
> +#include "sysemu.h"
> +
> +#include "msix.h"
> +#include "qemu-kvm.h"
> +#include "libkvm.h"
> +
> +#include<sys/eventfd.h>
> +#include<sys/mman.h>
> +#include<sys/socket.h>
> +#include<sys/ioctl.h>
> +
> +#define IVSHMEM_IRQFD   0
> +#define IVSHMEM_MSI     1
> +
> +#define DEBUG_IVSHMEM
> +#ifdef DEBUG_IVSHMEM
> +#define IVSHMEM_DPRINTF(fmt, args...)        \
> +    do {printf("IVSHMEM: " fmt, ##args); } while (0)
> +#else
> +#define IVSHMEM_DPRINTF(fmt, args...)
> +#endif
> +
> +typedef struct EventfdEntry {
> +    PCIDevice *pdev;
> +    int vector;
> +} EventfdEntry;
> +
> +typedef struct IVShmemState {
> +    PCIDevice dev;
> +    uint32_t intrmask;
> +    uint32_t intrstatus;
> +    uint32_t doorbell;
> +
> +    CharDriverState * chr;
> +    CharDriverState ** eventfd_chr;
> +    int ivshmem_mmio_io_addr;
> +
> +    pcibus_t mmio_addr;
> +    unsigned long ivshmem_offset;
> +    uint64_t ivshmem_size; /* size of shared memory region */
> +    int shm_fd; /* shared memory file descriptor */
> +
> +    int nr_allocated_vms;
> +    /* array of eventfds for each guest */
> +    int ** eventfds;
> +    /* keep track of # of eventfds for each guest*/
> +    int * eventfds_posn_count;
> +
> +    int nr_alloc_guests;
> +    int vm_id;
> +    int num_eventfds;
> +    uint32_t vectors;
> +    uint32_t features;
> +    EventfdEntry *eventfd_table;
> +
> +    char * shmobj;
> +    char * sizearg;
> +} IVShmemState;
> +
> +/* registers for the Inter-VM shared memory device */
> +enum ivshmem_registers {
> +    IntrMask = 0,
> +    IntrStatus = 4,
> +    IVPosition = 8,
> +    Doorbell = 12,
> +};
> +
> +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
> +    return (ivs->features&  (1<<  feature));
> +}
> +
> +static inline int is_power_of_two(int x) {
> +    return (x&  (x-1)) == 0;
> +}
> +
> +static void ivshmem_map(PCIDevice *pci_dev, int region_num,
> +                    pcibus_t addr, pcibus_t size, int type)
> +{
> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
> +
> +    IVSHMEM_DPRINTF("addr = %u size = %u\n", (uint32_t)addr, (uint32_t)size);
> +    cpu_register_physical_memory(addr, s->ivshmem_size, s->ivshmem_offset);
> +
> +}
> +
> +/* accessing registers - based on rtl8139 */
> +static void ivshmem_update_irq(IVShmemState *s, int val)
> +{
> +    int isr;
> +    isr = (s->intrstatus&  s->intrmask)&  0xffffffff;
> +
> +    /* don't print ISR resets */
> +    if (isr) {
> +        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
> +           isr ? 1 : 0, s->intrstatus, s->intrmask);
> +    }
> +
> +    qemu_set_irq(s->dev.irq[0], (isr != 0));
> +}
> +
> +static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
> +{
> +    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
> +
> +    s->intrmask = val;
> +
> +    ivshmem_update_irq(s, val);
> +}
> +
> +static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
> +{
> +    uint32_t ret = s->intrmask;
> +
> +    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
> +
> +    return ret;
> +}
> +
> +static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
> +{
> +    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
> +
> +    s->intrstatus = val;
> +
> +    ivshmem_update_irq(s, val);
> +    return;
> +}
> +
> +static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
> +{
> +    uint32_t ret = s->intrstatus;
> +
> +    /* reading ISR clears all interrupts */
> +    s->intrstatus = 0;
> +
> +    ivshmem_update_irq(s, 0);
> +
> +    return ret;
> +}
> +
> +static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val)
> +{
> +
> +    IVSHMEM_DPRINTF("We shouldn't be writing words\n");
> +}
> +
> +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
> +{
> +    IVShmemState *s = opaque;
> +
> +    u_int64_t write_one = 1;
> +    u_int16_t dest = val>>  16;
> +    u_int16_t vector = val&  0xff;
> +
> +    addr&= 0xfe;
> +
> +    switch (addr)
> +    {
> +        case IntrMask:
> +            ivshmem_IntrMask_write(s, val);
> +            break;
> +
> +        case IntrStatus:
> +            ivshmem_IntrStatus_write(s, val);
> +            break;
> +
> +        case Doorbell:
> +            /* check doorbell range */
> +            if ((vector>= 0)&&  (vector<  s->eventfds_posn_count[dest])) {
> +                IVSHMEM_DPRINTF("Writing %ld to VM %d on vector %d\n", write_one, dest, vector);
> +                if (write(s->eventfds[dest][vector],&(write_one), 8) != 8) {
> +                    IVSHMEM_DPRINTF("error writing to eventfd\n");
> +                }
> +            }
> +            break;
> +        default:
> +            IVSHMEM_DPRINTF("Invalid VM Doorbell VM %d\n", dest);
> +    }
> +}
> +
> +static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val)
> +{
> +    IVSHMEM_DPRINTF("We shouldn't be writing bytes\n");
> +}
> +
> +static uint32_t ivshmem_io_readw(void *opaque, uint8_t addr)
> +{
> +
> +    IVSHMEM_DPRINTF("We shouldn't be reading words\n");
> +    return 0;
> +}
> +
> +static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr)
> +{
> +
> +    IVShmemState *s = opaque;
> +    uint32_t ret;
> +
> +    switch (addr)
> +    {
> +        case IntrMask:
> +            ret = ivshmem_IntrMask_read(s);
> +            break;
> +
> +        case IntrStatus:
> +            ret = ivshmem_IntrStatus_read(s);
> +            break;
> +
> +        case IVPosition:
> +            /* return my id in the ivshmem list */
> +            ret = s->vm_id;
> +            break;
> +
> +        default:
> +            IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr);
> +            ret = 0;
> +    }
> +
> +    return ret;
> +
> +}
> +
> +static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr)
> +{
> +    IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
> +
> +    return 0;
> +}
> +
> +static void ivshmem_mmio_writeb(void *opaque,
> +                                target_phys_addr_t addr, uint32_t val)
> +{
> +    ivshmem_io_writeb(opaque, addr & 0xFF, val);
> +}
> +
> +static void ivshmem_mmio_writew(void *opaque,
> +                                target_phys_addr_t addr, uint32_t val)
> +{
> +    ivshmem_io_writew(opaque, addr & 0xFF, val);
> +}
> +
> +static void ivshmem_mmio_writel(void *opaque,
> +                                target_phys_addr_t addr, uint32_t val)
> +{
> +    ivshmem_io_writel(opaque, addr & 0xFF, val);
> +}
> +
> +static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr)
> +{
> +    return ivshmem_io_readb(opaque, addr & 0xFF);
> +}
> +
> +static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr)
> +{
> +    uint32_t val = ivshmem_io_readw(opaque, addr & 0xFF);
> +    return val;
> +}
> +
> +static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr)
> +{
> +    uint32_t val = ivshmem_io_readl(opaque, addr & 0xFF);
> +    return val;
> +}
> +
> +static CPUReadMemoryFunc *ivshmem_mmio_read[3] = {
> +    ivshmem_mmio_readb,
> +    ivshmem_mmio_readw,
> +    ivshmem_mmio_readl,
> +};
> +
> +static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = {
> +    ivshmem_mmio_writeb,
> +    ivshmem_mmio_writew,
> +    ivshmem_mmio_writel,
> +};
> +
> +static void ivshmem_receive(void *opaque, const uint8_t *buf, int size)
> +{
> +    IVShmemState *s = opaque;
> +
> +    ivshmem_IntrStatus_write(s, *buf);
> +
> +    IVSHMEM_DPRINTF("ivshmem_receive 0x%02x\n", *buf);
> +}
> +
> +static int ivshmem_can_receive(void * opaque)
> +{
> +    return 8;
> +}
> +
> +static void ivshmem_event(void *opaque, int event)
> +{
> +    IVSHMEM_DPRINTF("ivshmem_event %d\n", event);
> +}
> +
> +static void fake_irqfd(void *opaque, const uint8_t *buf, int size) {
> +
> +    EventfdEntry *entry = opaque;
> +    PCIDevice *pdev = entry->pdev;
> +
> +    IVSHMEM_DPRINTF("fake irqfd on vector %d\n", entry->vector);
> +    msix_notify(pdev, entry->vector);
> +}
> +
> +static CharDriverState* create_eventfd_chr_device(void * opaque, int eventfd,
> +                                                                    int vector)
> +{
> +    /* create an event character device based on the passed eventfd */
> +    IVShmemState *s = opaque;
> +    CharDriverState * chr;
> +
> +    chr = qemu_chr_open_eventfd(eventfd);
> +
> +    if (chr == NULL) {
> +        IVSHMEM_DPRINTF("creating eventfd for eventfd %d failed\n", eventfd);
> +        exit(-1);
> +    }
> +
> +    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
> +        s->eventfd_table[vector].pdev = &s->dev;
> +        s->eventfd_table[vector].vector = vector;
> +
> +        qemu_chr_add_handlers(chr, ivshmem_can_receive, fake_irqfd,
> +                      ivshmem_event, &s->eventfd_table[vector]);
> +    } else {
> +        qemu_chr_add_handlers(chr, ivshmem_can_receive, ivshmem_receive,
> +                      ivshmem_event, s);
> +    }
> +
> +    return chr;
> +
> +}
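
Since the interrupt plumbing in the non-irqfd case is spread over several
callbacks, here is the path the helper above wires up, restated as a comment;
it only describes code already in this patch and adds nothing new:

    /*
     * peer VM writes Doorbell
     *   -> the peer's QEMU write()s 1 into our eventfd
     *   -> our eventfd doubles as a chardev, so its read handler fires:
     *        MSI enabled:  fake_irqfd()      -> msix_notify(vector)
     *        MSI disabled: ivshmem_receive() -> IntrStatus write -> INTx
     */
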
> +
> +static int check_shm_size(IVShmemState *s, int shmemfd) {
> +    /* check that the guest isn't going to try to map more memory than the
> +     * shared memory server allocated; return -1 to indicate an error */
> +
> +    struct stat buf;
> +
> +    fstat(shmemfd, &buf);
> +
> +    if (s->ivshmem_size > buf.st_size) {
> +        fprintf(stderr, "IVSHMEM ERROR: Requested memory size greater");
> +        fprintf(stderr, " than shared object size (%ld > %ld)\n",
> +                                          s->ivshmem_size, buf.st_size);
> +        return -1;
> +    } else {
> +        return 0;
> +    }
> +}
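
The fd checked here comes from the shared memory server, so the server has to
size the object before handing it out.  A hedged host-side sketch of creating
such an object (the name "/ivshmem" and the 4 MB size are arbitrary examples;
the real server lives in contrib/ivshmem-server):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int make_shm_object(void)
    {
        int fd = shm_open("/ivshmem", O_CREAT | O_RDWR, 0660);

        if (fd < 0) {
            return -1;
        }
        /* size it to (at least) the BAR size so check_shm_size() passes */
        if (ftruncate(fd, 4 << 20) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }
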
> +
> +static void create_shared_memory_BAR(IVShmemState *s, int fd) {
> +
> +    s->shm_fd = fd;
> +
> +    s->ivshmem_offset = qemu_ram_mmap(s->shm_fd, s->ivshmem_size,
> +             MAP_SHARED, 0);
> +
> +    /* region for shared memory */
> +    pci_register_bar(&s->dev, 2, s->ivshmem_size,
> +                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
> +}
> +
> +static void close_guest_eventfds(IVShmemState *s, int posn)
> +{
> +    int i, guest_curr_max;
> +
> +    guest_curr_max = s->eventfds_posn_count[posn];
> +
> +    for (i = 0; i < guest_curr_max; i++)
> +        close(s->eventfds[posn][i]);
> +
> +    free(s->eventfds[posn]);
> +    s->eventfds_posn_count[posn] = 0;
> +}
> +
> +/* this function increases the dynamic storage needed to store data about
> + * other guests */
> +static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
> +
> +    int j, old_nr_alloc;
> +
> +    old_nr_alloc = s->nr_alloc_guests;
> +
> +    while (s->nr_alloc_guests < new_min_size)
> +        s->nr_alloc_guests = s->nr_alloc_guests * 2;
> +
> +    IVSHMEM_DPRINTF("bumping storage to %d guests\n", s->nr_alloc_guests);
> +    s->eventfds = qemu_realloc(s->eventfds, s->nr_alloc_guests *
> +                                                        sizeof(int *));
> +    s->eventfds_posn_count = qemu_realloc(s->eventfds_posn_count,
> +                                                    s->nr_alloc_guests *
> +                                                        sizeof(int));
> +    s->eventfd_table = qemu_realloc(s->eventfd_table, s->nr_alloc_guests *
> +                                                    sizeof(EventfdEntry));
> +
> +    if ((s->eventfds == NULL) || (s->eventfds_posn_count == NULL) ||
> +            (s->eventfd_table == NULL)) {
> +        fprintf(stderr, "Allocation error - exiting\n");
> +        exit(1);
> +    }
> +
> +    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
> +        s->eventfd_chr = (CharDriverState **)qemu_realloc(s->eventfd_chr,
> +                                    s->nr_alloc_guests * sizeof(void *));
> +        if (s->eventfd_chr == NULL) {
> +            fprintf(stderr, "Allocation error - exiting\n");
> +            exit(1);
> +        }
> +    }
> +
> +    /* zero out new pointers */
> +    for (j = old_nr_alloc; j < s->nr_alloc_guests; j++) {
> +        s->eventfds[j] = NULL;
> +    }
> +}
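
To make the growth policy concrete with an arbitrary example: the device
starts out with room for 16 guests, so a message about guest 40 doubles the
arrays twice:

    /* nr_alloc_guests = 16, new_min_size = 40:  16 -> 32 -> 64,
     * so the three arrays are reallocated to hold 64 entries */
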
> +
> +static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
> +{
> +    IVShmemState *s = opaque;
> +    int incoming_fd, tmp_fd;
> +    int guest_curr_max;
> +    long incoming_posn;
> +
> +    memcpy(&incoming_posn, buf, sizeof(long));
> +    /* pick off s->chr->msgfd and store it, posn should accompany msg */
> +    tmp_fd = qemu_chr_get_msgfd(s->chr);
> +    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
> +
> +    /* make sure we have enough space for this guest */
> +    if (incoming_posn >= s->nr_alloc_guests) {
> +        increase_dynamic_storage(s, incoming_posn);
> +    }
> +
> +    if (tmp_fd == -1) {
> +        /* if posn is positive and unseen before then this is our posn */
> +        if ((incoming_posn >= 0) && (s->eventfds[incoming_posn] == NULL)) {
> +            /* receive our posn */
> +            s->vm_id = incoming_posn;
> +            return;
> +        } else {
> +            /* otherwise an fd == -1 means an existing guest has gone away */
> +            IVSHMEM_DPRINTF("posn %ld has gone away\n", incoming_posn);
> +            close_guest_eventfds(s, incoming_posn);
> +            return;
> +        }
> +    }
> +
> +    /* because of the implementation of get_msgfd, we need a dup */
> +    incoming_fd = dup(tmp_fd);
> +
> +    /* if the position is -1, then it's the shared memory region's fd */
> +    if (incoming_posn == -1) {
> +
> +        s->num_eventfds = 0;
> +
> +        if (check_shm_size(s, incoming_fd) == -1) {
> +            exit(-1);
> +        }
> +
> +        /* creating a BAR in qemu_chr callback may be crazy */
> +        create_shared_memory_BAR(s, incoming_fd);
> +
> +       return;
> +    }
> +
> +    /* each guest has an array of eventfds, and we keep track of how many
> +     * eventfds each guest has */
> +    guest_curr_max = s->eventfds_posn_count[incoming_posn];
> +    if (guest_curr_max == 0) {
> +        /* one eventfd per MSI vector */
> +        s->eventfds[incoming_posn] = (int *) qemu_malloc(s->vectors *
> +                                                                sizeof(int));
> +    }
> +
> +    /* this is an eventfd for a particular guest VM */
> +    IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn, guest_curr_max,
> +                                                                incoming_fd);
> +    s->eventfds[incoming_posn][guest_curr_max] = incoming_fd;
> +
> +    /* increment count for particular guest */
> +    s->eventfds_posn_count[incoming_posn]++;
> +
> +    /* ioeventfd and irqfd are enabled together,
> +     * so the flag IRQFD refers to both */
> +    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) && guest_curr_max >= 0) {
> +        /* allocate ioeventfd for the new fd
> +         * received for guest @ incoming_posn */
> +        kvm_set_ioeventfd_mmio_long(incoming_fd, s->mmio_addr + Doorbell,
> +                                (incoming_posn << 16) | guest_curr_max, 1);
> +    }
> +
> +    /* keep track of the maximum VM ID */
> +    if (incoming_posn > s->num_eventfds) {
> +        s->num_eventfds = incoming_posn;
> +    }
> +
> +    if (incoming_posn == s->vm_id) {
> +        if (ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
> +            /* setup irqfd for this VM's eventfd */
> +            int vector = guest_curr_max;
> +            kvm_set_irqfd(s->eventfds[s->vm_id][guest_curr_max], vector,
> +                                        s->dev.msix_irq_entries[vector].gsi);
> +        } else {
> +            /* initialize char device for callback
> +             * if this is one of my eventfd */
> +            s->eventfd_chr[guest_curr_max] = create_eventfd_chr_device(s,
> +                s->eventfds[s->vm_id][guest_curr_max], guest_curr_max);
> +        }
> +    }
> +
> +    return;
> +}
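
ivshmem_read() above is the entire client side of the server protocol, so a
compact restatement may help.  Every message from ivshmem-server is one long
(the position/VM ID), optionally accompanied by a file descriptor passed via
SCM_RIGHTS; the sketch below just names the three cases the code handles and
adds nothing new:

    /*
     * posn == -1, fd valid  : fd is the shared memory object itself;
     *                         check its size, mmap it, register BAR 2
     * posn >= 0,  fd == -1  : first occurrence -> posn is this VM's own ID;
     *                         later occurrences -> guest 'posn' went away,
     *                         so close its eventfds
     * posn >= 0,  fd valid  : one more eventfd (one per MSI vector) owned
     *                         by guest 'posn'; if posn is our own ID, hook
     *                         it up via irqfd or an eventfd chardev
     */
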
> +
> +static void ivshmem_reset(DeviceState *d)
> +{
> +    return;
> +}
> +
> +static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
> +                       pcibus_t addr, pcibus_t size, int type)
> +{
> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
> +
> +    s->mmio_addr = addr;
> +    cpu_register_physical_memory(addr + 0, 0x400, s->ivshmem_mmio_io_addr);
> +
> +    /* now that our mmio region has been allocated, we can receive
> +     * the file descriptors */
> +    if (s->chr != NULL) {
> +        qemu_chr_add_handlers(s->chr, ivshmem_can_receive, ivshmem_read,
> +                     ivshmem_event, s);
> +    }
> +
> +}
> +
> +static uint64_t ivshmem_get_size(IVShmemState * s) {
> +
> +    uint64_t value;
> +    char *ptr;
> +
> +    value = strtoul(s->sizearg, &ptr, 10);
> +    switch (*ptr) {
> +        case 0: case 'M': case 'm':
> +            value <<= 20;
> +            break;
> +        case 'G': case 'g':
> +            value <<= 30;
> +            break;
> +        default:
> +            fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg);
> +            exit(1);
> +    }
> +
> +    /* BARs must be a power of 2 */
> +    if (!is_power_of_two(value)) {
> +        fprintf(stderr, "ivshmem: size must be power of 2\n");
> +        exit(1);
> +    }
> +
> +    return value;
> +
> +}
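
A few arbitrary example values for the parser above:

    /* "32"    -> 32 MB (no suffix is treated as MB)
     * "512m"  -> 512 MB
     * "1G"    -> 1 GB
     * "100M"  -> rejected, 100 MB is not a power of two */
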
> +
> +static int pci_ivshmem_init(PCIDevice *dev)
> +{
> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
> +    uint8_t *pci_conf;
> +    int i;
> +
> +    if (s->sizearg == NULL)
> +        s->ivshmem_size = 4 << 20; /* 4 MB default */
> +    else {
> +        s->ivshmem_size = ivshmem_get_size(s);
> +    }
> +
> +    /* IRQFD requires MSI */
> +    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) &&
> +        !ivshmem_has_feature(s, IVSHMEM_MSI)) {
> +        fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n");
> +        exit(1);
> +    }
> +
> +    pci_conf = s->dev.config;
> +    pci_conf[0x00] = 0xf4; /* Qumranet vendor ID 0x5002 */
> +    pci_conf[0x01] = 0x1a;
> +    pci_conf[0x02] = 0x10;
> +    pci_conf[0x03] = 0x11;
> +    pci_conf[0x04] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
> +    pci_conf[0x0a] = 0x00; /* RAM controller */
> +    pci_conf[0x0b] = 0x05;
> +    pci_conf[0x0e] = 0x00; /* header_type */
> +
> +    s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read,
> +                                    ivshmem_mmio_write, s);
> +    /* region for registers*/
> +    pci_register_bar(&s->dev, 0, 0x400,
> +                           PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_mmio_map);
> +
> +    /* allocate the MSI-X vectors */
> +    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
> +
> +        if (!msix_init(&s->dev, s->vectors, 1, 0)) {
> +            pci_register_bar(&s->dev, 1,
> +                             msix_bar_size(&s->dev),
> +                             PCI_BASE_ADDRESS_SPACE_MEMORY,
> +                             msix_mmio_map);
> +            IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
> +        } else {
> +            IVSHMEM_DPRINTF("msix initialization failed\n");
> +        }
> +
> +        /* 'activate' the vectors */
> +        for (i = 0; i < s->vectors; i++) {
> +            msix_vector_use(&s->dev, i);
> +        }
> +    }
> +
> +    if ((s->chr != NULL) && (strncmp(s->chr->filename, "unix:", 5) == 0)) {
> +        /* if we get a UNIX socket as the parameter we will talk
> +         * to the ivshmem server later once the MMIO BAR is actually
> +         * allocated (see ivshmem_mmio_map) */
> +
> +        IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
> +                                                            s->chr->filename);
> +
> +        /* we allocate enough space for 16 guests and grow as needed */
> +        s->nr_alloc_guests = 16;
> +        s->vm_id = -1;
> +
> +        /* allocate/initialize space for interrupt handling */
> +        s->eventfds = qemu_mallocz(s->nr_alloc_guests * sizeof(int *));
> +        s->eventfd_table = qemu_mallocz(s->vectors * sizeof(EventfdEntry));
> +        s->eventfds_posn_count = qemu_mallocz(s->nr_alloc_guests * sizeof(int));
> +
> +        pci_conf[PCI_INTERRUPT_PIN] = 1; /* we are going to support interrupts */
> +
> +        if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
> +            s->eventfd_chr = (CharDriverState **)qemu_malloc(s->nr_alloc_guests *
> +                                                            sizeof(void *));
> +        }
> +
> +    } else {
> +        /* just map the file immediately, we're not using a server */
> +        int fd;
> +
> +        if (s->shmobj == NULL) {
> +            fprintf(stderr, "Must specify 'chardev' or 'shm' to ivshmem\n");
> +        }
> +
> +        IVSHMEM_DPRINTF("using shm_open (shm object = %s)\n", s->shmobj);
> +
> +        /* try opening with O_EXCL; if the object is newly created, size it
> +         * with ftruncate so it starts out zeroed */
> +        if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR|O_EXCL,
> +                        S_IRWXU|S_IRWXG|S_IRWXO)) > 0) {
> +            /* truncate file to the length of the PCI device's memory */
> +            if (ftruncate(fd, s->ivshmem_size) != 0) {
> +                fprintf(stderr, "kvm_ivshmem: could not truncate shared file\n");
> +            }
> +
> +        } else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR,
> +                        S_IRWXU|S_IRWXG|S_IRWXO)) < 0) {
> +            fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
> +            exit(-1);
> +        }
> +
> +        create_shared_memory_BAR(s, fd);
> +
> +    }
> +
> +
> +    return 0;
> +}
> +
> +static int pci_ivshmem_uninit(PCIDevice *dev)
> +{
> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
> +
> +    cpu_unregister_io_memory(s->ivshmem_mmio_io_addr);
> +
> +    return 0;
> +}
> +
> +static PCIDeviceInfo ivshmem_info = {
> +    .qdev.name  = "ivshmem",
> +    .qdev.size  = sizeof(IVShmemState),
> +    .qdev.reset = ivshmem_reset,
> +    .init       = pci_ivshmem_init,
> +    .exit       = pci_ivshmem_uninit,
> +    .qdev.props = (Property[]) {
> +        DEFINE_PROP_CHR("chardev", IVShmemState, chr),
> +        DEFINE_PROP_STRING("size", IVShmemState, sizearg),
> +        DEFINE_PROP_UINT32("vectors", IVShmemState, vectors, 1),
> +        DEFINE_PROP_BIT("irqfd", IVShmemState, features, IVSHMEM_IRQFD, false),
> +        DEFINE_PROP_BIT("msi", IVShmemState, features, IVSHMEM_MSI, true),
> +        DEFINE_PROP_STRING("shm", IVShmemState, shmobj),
> +        DEFINE_PROP_END_OF_LIST(),
> +    }
> +};
> +
> +static void ivshmem_register_devices(void)
> +{
> +    pci_qdev_register(&ivshmem_info);
> +}
> +
> +device_init(ivshmem_register_devices)
> diff --git a/qemu-char.c b/qemu-char.c
> index 048da3f..41cb8c7 100644
> --- a/qemu-char.c
> +++ b/qemu-char.c
> @@ -2076,6 +2076,12 @@ static void tcp_chr_read(void *opaque)
>       }
>   }
>
> +CharDriverState *qemu_chr_open_eventfd(int eventfd){
> +
> +    return qemu_chr_open_fd(eventfd, eventfd);
> +
> +}
> +
>   static void tcp_chr_connect(void *opaque)
>   {
>       CharDriverState *chr = opaque;
> diff --git a/qemu-char.h b/qemu-char.h
> index 3a9427b..1571091 100644
> --- a/qemu-char.h
> +++ b/qemu-char.h
> @@ -93,6 +93,9 @@ void qemu_chr_info_print(Monitor *mon, const QObject *ret_data);
>   void qemu_chr_info(Monitor *mon, QObject **ret_data);
>   CharDriverState *qemu_chr_find(const char *name);
>
> +/* add an eventfd to the qemu devices that are polled */
> +CharDriverState *qemu_chr_open_eventfd(int eventfd);
> +
>   extern int term_escape_char;
>
>   /* async I/O support */
> diff --git a/qemu-doc.texi b/qemu-doc.texi
> index 6647b7b..2df4687 100644
> --- a/qemu-doc.texi
> +++ b/qemu-doc.texi
> @@ -706,6 +706,31 @@ Using the @option{-net socket} option, it is possible to make VLANs
>   that span several QEMU instances. See @ref{sec_invocation} to have a
>   basic example.
>
> +@section Other Devices
> +
> +@subsection Inter-VM Shared Memory device
> +
> +With KVM enabled on a Linux host, a shared memory device is available.  It
> +maps a POSIX shared memory region on the host into the guest as a PCI device,
> +enabling zero-copy communication between applications in different guests.
> +The basic syntax is:
> +
> +@example
> +qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
> +@end example
> +
> +If desired, interrupts can be sent between guest VMs accessing the same shared
> +memory region.  Interrupt support requires using a shared memory server and
> +using a chardev socket to connect to it.  The code for the shared memory server
> +is qemu.git/contrib/ivshmem-server.  An example syntax when using the shared
> +memory server is:
> +
> +@example
> +qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
> +                        [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
> +qemu -chardev socket,path=<path>,id=<id>
> +@end example
> +
>   @node direct_linux_boot
>   @section Direct Linux Boot
>
>    
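
The documentation above stops at the QEMU command line.  On the guest side,
the simplest way to get at the shared memory without a dedicated driver is to
mmap BAR 2 of the device through sysfs.  The PCI address below is purely an
assumption (look it up with lspci in the real guest), and the 4 MB length
matches the device's default size:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const char *res = "/sys/bus/pci/devices/0000:00:04.0/resource2";
        int fd = open(res, O_RDWR);
        void *mem;

        if (fd < 0) {
            perror("open");
            return 1;
        }
        mem = mmap(NULL, 4 << 20, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        /* from here on the region behaves like ordinary shared memory */
        ((volatile unsigned char *)mem)[0] = 42;
        munmap(mem, 4 << 20);
        close(fd);
        return 0;
    }
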


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [Qemu-devel] Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
@ 2010-05-06 17:32           ` Anthony Liguori
  0 siblings, 0 replies; 102+ messages in thread
From: Anthony Liguori @ 2010-05-06 17:32 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: qemu-devel, kvm

On 04/21/2010 12:53 PM, Cam Macdonell wrote:
> Support an inter-vm shared memory device that maps a shared-memory object as a
> PCI device in the guest.  This patch also supports interrupts between guests by
> communicating over a unix domain socket.  This patch applies to the qemu-kvm
> repository.
>
>      -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>
> Interrupts are supported between multiple VMs by using a shared memory server
> by using a chardev socket.
>
>      -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>                      [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
>      -chardev socket,path=<path>,id=<id>
>
> (shared memory server is qemu.git/contrib/ivshmem-server)
>
> Sample programs and init scripts are in a git repo here:
>
>      www.gitorious.org/nahanni
> ---
>   Makefile.target |    3 +
>   hw/ivshmem.c    |  727 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   qemu-char.c     |    6 +
>   qemu-char.h     |    3 +
>   qemu-doc.texi   |   25 ++
>   5 files changed, 764 insertions(+), 0 deletions(-)
>   create mode 100644 hw/ivshmem.c
>
> diff --git a/Makefile.target b/Makefile.target
> index 1ffd802..bc9a681 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -199,6 +199,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
>   obj-y += rtl8139.o
>   obj-y += e1000.o
>
> +# Inter-VM PCI shared memory
> +obj-y += ivshmem.o
> +
>   # Hardware support
>   obj-i386-y = pckbd.o dma.o
>   obj-i386-y += vga.o
> diff --git a/hw/ivshmem.c b/hw/ivshmem.c
> new file mode 100644
> index 0000000..f8d8fdb
> --- /dev/null
> +++ b/hw/ivshmem.c
> @@ -0,0 +1,727 @@
> +/*
> + * Inter-VM Shared Memory PCI device.
> + *
> + * Author:
> + *      Cam Macdonell<cam@cs.ualberta.ca>
> + *
> + * Based On: cirrus_vga.c and rtl8139.c
> + *
> + * This code is licensed under the GNU GPL v2.
> + */
> +#include <sys/mman.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <sys/io.h>
> +#include <sys/ioctl.h>
> +#include <sys/eventfd.h>
>    

This will break the Windows build, along with any non-Linux Unix or any Linux
old enough to not have eventfd support.

If it's based on cirrus_vga.c and rtl8139.c, then it ought to carry the 
respective copyrights, no?

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-06 17:32           ` [Qemu-devel] " Anthony Liguori
@ 2010-05-06 17:59             ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-06 17:59 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: kvm, qemu-devel

On Thu, May 6, 2010 at 11:32 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 04/21/2010 12:53 PM, Cam Macdonell wrote:
>>
>> Support an inter-vm shared memory device that maps a shared-memory object
>> as a
>> PCI device in the guest.  This patch also supports interrupts between
>> guests by
>> communicating over a unix domain socket.  This patch applies to the
>> qemu-kvm
>> repository.
>>
>>     -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>
>> Interrupts are supported between multiple VMs by using a shared memory
>> server
>> by using a chardev socket.
>>
>>     -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>                     [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
>>     -chardev socket,path=<path>,id=<id>
>>
>> (shared memory server is qemu.git/contrib/ivshmem-server)
>>
>> Sample programs and init scripts are in a git repo here:
>>
>>     www.gitorious.org/nahanni
>> ---
>>  Makefile.target |    3 +
>>  hw/ivshmem.c    |  727
>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  qemu-char.c     |    6 +
>>  qemu-char.h     |    3 +
>>  qemu-doc.texi   |   25 ++
>>  5 files changed, 764 insertions(+), 0 deletions(-)
>>  create mode 100644 hw/ivshmem.c
>>
>> diff --git a/Makefile.target b/Makefile.target
>> index 1ffd802..bc9a681 100644
>> --- a/Makefile.target
>> +++ b/Makefile.target
>> @@ -199,6 +199,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
>>  obj-y += rtl8139.o
>>  obj-y += e1000.o
>>
>> +# Inter-VM PCI shared memory
>> +obj-y += ivshmem.o
>> +
>>  # Hardware support
>>  obj-i386-y = pckbd.o dma.o
>>  obj-i386-y += vga.o
>> diff --git a/hw/ivshmem.c b/hw/ivshmem.c
>> new file mode 100644
>> index 0000000..f8d8fdb
>> --- /dev/null
>> +++ b/hw/ivshmem.c
>> @@ -0,0 +1,727 @@
>> +/*
>> + * Inter-VM Shared Memory PCI device.
>> + *
>> + * Author:
>> + *      Cam Macdonell<cam@cs.ualberta.ca>
>> + *
>> + * Based On: cirrus_vga.c and rtl8139.c
>> + *
>> + * This code is licensed under the GNU GPL v2.
>> + */
>> +#include <sys/mman.h>
>> +#include <sys/types.h>
>> +#include <sys/socket.h>
>> +#include <sys/io.h>
>> +#include <sys/ioctl.h>
>> +#include <sys/eventfd.h>
>>
>
> This will break the Windows build, along with any non-Linux Unix or any Linux
> old enough to not have eventfd support.

I'll wrap it with

#ifdef CONFIG_EVENTFD
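
For reference, a minimal sketch of that guard, assuming configure defines
CONFIG_EVENTFD the same way it does for other optional host features:

    #ifdef CONFIG_EVENTFD
    #include <sys/eventfd.h>
    #endif
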

> If it's based on cirrus_vga.c and rtl8139.c, then it ought to carry the
> respective copyrights, no?

Sure, I can add those

Cam

>
> Regards,
>
> Anthony Liguori
>
>> +#include "hw.h"
>> +#include "console.h"
>> +#include "pc.h"
>> +#include "pci.h"
>> +#include "sysemu.h"
>> +
>> +#include "msix.h"
>> +#include "qemu-kvm.h"
>> +#include "libkvm.h"
>> +
>> +#include<sys/eventfd.h>
>> +#include<sys/mman.h>
>> +#include<sys/socket.h>
>> +#include<sys/ioctl.h>
>> +
>> +#define IVSHMEM_IRQFD   0
>> +#define IVSHMEM_MSI     1
>> +
>> +#define DEBUG_IVSHMEM
>> +#ifdef DEBUG_IVSHMEM
>> +#define IVSHMEM_DPRINTF(fmt, args...)        \
>> +    do {printf("IVSHMEM: " fmt, ##args); } while (0)
>> +#else
>> +#define IVSHMEM_DPRINTF(fmt, args...)
>> +#endif
>> +
>> +typedef struct EventfdEntry {
>> +    PCIDevice *pdev;
>> +    int vector;
>> +} EventfdEntry;
>> +
>> +typedef struct IVShmemState {
>> +    PCIDevice dev;
>> +    uint32_t intrmask;
>> +    uint32_t intrstatus;
>> +    uint32_t doorbell;
>> +
>> +    CharDriverState * chr;
>> +    CharDriverState ** eventfd_chr;
>> +    int ivshmem_mmio_io_addr;
>> +
>> +    pcibus_t mmio_addr;
>> +    unsigned long ivshmem_offset;
>> +    uint64_t ivshmem_size; /* size of shared memory region */
>> +    int shm_fd; /* shared memory file descriptor */
>> +
>> +    int nr_allocated_vms;
>> +    /* array of eventfds for each guest */
>> +    int ** eventfds;
>> +    /* keep track of # of eventfds for each guest*/
>> +    int * eventfds_posn_count;
>> +
>> +    int nr_alloc_guests;
>> +    int vm_id;
>> +    int num_eventfds;
>> +    uint32_t vectors;
>> +    uint32_t features;
>> +    EventfdEntry *eventfd_table;
>> +
>> +    char * shmobj;
>> +    char * sizearg;
>> +} IVShmemState;
>> +
>> +/* registers for the Inter-VM shared memory device */
>> +enum ivshmem_registers {
>> +    IntrMask = 0,
>> +    IntrStatus = 4,
>> +    IVPosition = 8,
>> +    Doorbell = 12,
>> +};
>> +
>> +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int
>> feature) {
>> +    return (ivs->features&  (1<<  feature));
>> +}
>> +
>> +static inline int is_power_of_two(int x) {
>> +    return (x&  (x-1)) == 0;
>> +}
>> +
>> +static void ivshmem_map(PCIDevice *pci_dev, int region_num,
>> +                    pcibus_t addr, pcibus_t size, int type)
>> +{
>> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
>> +
>> +    IVSHMEM_DPRINTF("addr = %u size = %u\n", (uint32_t)addr,
>> (uint32_t)size);
>> +    cpu_register_physical_memory(addr, s->ivshmem_size,
>> s->ivshmem_offset);
>> +
>> +}
>> +
>> +/* accessing registers - based on rtl8139 */
>> +static void ivshmem_update_irq(IVShmemState *s, int val)
>> +{
>> +    int isr;
>> +    isr = (s->intrstatus&  s->intrmask)&  0xffffffff;
>> +
>> +    /* don't print ISR resets */
>> +    if (isr) {
>> +        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
>> +           isr ? 1 : 0, s->intrstatus, s->intrmask);
>> +    }
>> +
>> +    qemu_set_irq(s->dev.irq[0], (isr != 0));
>> +}
>> +
>> +static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
>> +{
>> +    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
>> +
>> +    s->intrmask = val;
>> +
>> +    ivshmem_update_irq(s, val);
>> +}
>> +
>> +static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
>> +{
>> +    uint32_t ret = s->intrmask;
>> +
>> +    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
>> +{
>> +    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
>> +
>> +    s->intrstatus = val;
>> +
>> +    ivshmem_update_irq(s, val);
>> +    return;
>> +}
>> +
>> +static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
>> +{
>> +    uint32_t ret = s->intrstatus;
>> +
>> +    /* reading ISR clears all interrupts */
>> +    s->intrstatus = 0;
>> +
>> +    ivshmem_update_irq(s, 0);
>> +
>> +    return ret;
>> +}
>> +
>> +static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val)
>> +{
>> +
>> +    IVSHMEM_DPRINTF("We shouldn't be writing words\n");
>> +}
>> +
>> +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
>> +{
>> +    IVShmemState *s = opaque;
>> +
>> +    u_int64_t write_one = 1;
>> +    u_int16_t dest = val>>  16;
>> +    u_int16_t vector = val&  0xff;
>> +
>> +    addr&= 0xfe;
>> +
>> +    switch (addr)
>> +    {
>> +        case IntrMask:
>> +            ivshmem_IntrMask_write(s, val);
>> +            break;
>> +
>> +        case IntrStatus:
>> +            ivshmem_IntrStatus_write(s, val);
>> +            break;
>> +
>> +        case Doorbell:
>> +            /* check doorbell range */
>> +            if ((vector>= 0)&&  (vector<  s->eventfds_posn_count[dest]))
>> {
>> +                IVSHMEM_DPRINTF("Writing %ld to VM %d on vector %d\n",
>> write_one, dest, vector);
>> +                if (write(s->eventfds[dest][vector],&(write_one), 8) !=
>> 8) {
>> +                    IVSHMEM_DPRINTF("error writing to eventfd\n");
>> +                }
>> +            }
>> +            break;
>> +        default:
>> +            IVSHMEM_DPRINTF("Invalid VM Doorbell VM %d\n", dest);
>> +    }
>> +}
>> +
>> +static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val)
>> +{
>> +    IVSHMEM_DPRINTF("We shouldn't be writing bytes\n");
>> +}
>> +
>> +static uint32_t ivshmem_io_readw(void *opaque, uint8_t addr)
>> +{
>> +
>> +    IVSHMEM_DPRINTF("We shouldn't be reading words\n");
>> +    return 0;
>> +}
>> +
>> +static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr)
>> +{
>> +
>> +    IVShmemState *s = opaque;
>> +    uint32_t ret;
>> +
>> +    switch (addr)
>> +    {
>> +        case IntrMask:
>> +            ret = ivshmem_IntrMask_read(s);
>> +            break;
>> +
>> +        case IntrStatus:
>> +            ret = ivshmem_IntrStatus_read(s);
>> +            break;
>> +
>> +        case IVPosition:
>> +            /* return my id in the ivshmem list */
>> +            ret = s->vm_id;
>> +            break;
>> +
>> +        default:
>> +            IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr);
>> +            ret = 0;
>> +    }
>> +
>> +    return ret;
>> +
>> +}
>> +
>> +static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr)
>> +{
>> +    IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
>> +
>> +    return 0;
>> +}
>> +
>> +static void ivshmem_mmio_writeb(void *opaque,
>> +                                target_phys_addr_t addr, uint32_t val)
>> +{
>> +    ivshmem_io_writeb(opaque, addr&  0xFF, val);
>> +}
>> +
>> +static void ivshmem_mmio_writew(void *opaque,
>> +                                target_phys_addr_t addr, uint32_t val)
>> +{
>> +    ivshmem_io_writew(opaque, addr&  0xFF, val);
>> +}
>> +
>> +static void ivshmem_mmio_writel(void *opaque,
>> +                                target_phys_addr_t addr, uint32_t val)
>> +{
>> +    ivshmem_io_writel(opaque, addr&  0xFF, val);
>> +}
>> +
>> +static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr)
>> +{
>> +    return ivshmem_io_readb(opaque, addr&  0xFF);
>> +}
>> +
>> +static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr)
>> +{
>> +    uint32_t val = ivshmem_io_readw(opaque, addr&  0xFF);
>> +    return val;
>> +}
>> +
>> +static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr)
>> +{
>> +    uint32_t val = ivshmem_io_readl(opaque, addr&  0xFF);
>> +    return val;
>> +}
>> +
>> +static CPUReadMemoryFunc *ivshmem_mmio_read[3] = {
>> +    ivshmem_mmio_readb,
>> +    ivshmem_mmio_readw,
>> +    ivshmem_mmio_readl,
>> +};
>> +
>> +static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = {
>> +    ivshmem_mmio_writeb,
>> +    ivshmem_mmio_writew,
>> +    ivshmem_mmio_writel,
>> +};
>> +
>> +static void ivshmem_receive(void *opaque, const uint8_t *buf, int size)
>> +{
>> +    IVShmemState *s = opaque;
>> +
>> +    ivshmem_IntrStatus_write(s, *buf);
>> +
>> +    IVSHMEM_DPRINTF("ivshmem_receive 0x%02x\n", *buf);
>> +}
>> +
>> +static int ivshmem_can_receive(void * opaque)
>> +{
>> +    return 8;
>> +}
>> +
>> +static void ivshmem_event(void *opaque, int event)
>> +{
>> +    IVSHMEM_DPRINTF("ivshmem_event %d\n", event);
>> +}
>> +
>> +static void fake_irqfd(void *opaque, const uint8_t *buf, int size) {
>> +
>> +    EventfdEntry *entry = opaque;
>> +    PCIDevice *pdev = entry->pdev;
>> +
>> +    IVSHMEM_DPRINTF("fake irqfd on vector %d\n", entry->vector);
>> +    msix_notify(pdev, entry->vector);
>> +}
>> +
>> +static CharDriverState* create_eventfd_chr_device(void * opaque, int eventfd,
>> +                                                                    int vector)
>> +{
>> +    /* create an event character device based on the passed eventfd */
>> +    IVShmemState *s = opaque;
>> +    CharDriverState * chr;
>> +
>> +    chr = qemu_chr_open_eventfd(eventfd);
>> +
>> +    if (chr == NULL) {
>> +        IVSHMEM_DPRINTF("creating eventfd for eventfd %d failed\n", eventfd);
>> +        exit(-1);
>> +    }
>> +
>> +    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
>> +        s->eventfd_table[vector].pdev = &s->dev;
>> +        s->eventfd_table[vector].vector = vector;
>> +
>> +        qemu_chr_add_handlers(chr, ivshmem_can_receive, fake_irqfd,
>> +                      ivshmem_event, &s->eventfd_table[vector]);
>> +    } else {
>> +        qemu_chr_add_handlers(chr, ivshmem_can_receive, ivshmem_receive,
>> +                      ivshmem_event, s);
>> +    }
>> +
>> +    return chr;
>> +
>> +}
>> +
>> +static int check_shm_size(IVShmemState *s, int shmemfd) {
>> +    /* check that the guest isn't going to try and map more memory than the
>> +     * server allocated; return -1 to indicate error */
>> +
>> +    struct stat buf;
>> +
>> +    fstat(shmemfd, &buf);
>> +
>> +    if (s->ivshmem_size > buf.st_size) {
>> +        fprintf(stderr, "IVSHMEM ERROR: Requested memory size greater");
>> +        fprintf(stderr, " than shared object size (%ld > %ld)\n",
>> +                                          s->ivshmem_size, buf.st_size);
>> +        return -1;
>> +    } else {
>> +        return 0;
>> +    }
>> +}
>> +
>> +static void create_shared_memory_BAR(IVShmemState *s, int fd) {
>> +
>> +    s->shm_fd = fd;
>> +
>> +    s->ivshmem_offset = qemu_ram_mmap(s->shm_fd, s->ivshmem_size,
>> +             MAP_SHARED, 0);
>> +
>> +    /* region for shared memory */
>> +    pci_register_bar(&s->dev, 2, s->ivshmem_size,
>> +                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
>> +}
>> +
>> +static void close_guest_eventfds(IVShmemState *s, int posn)
>> +{
>> +    int i, guest_curr_max;
>> +
>> +    guest_curr_max = s->eventfds_posn_count[posn];
>> +
>> +    for (i = 0; i < guest_curr_max; i++)
>> +        close(s->eventfds[posn][i]);
>> +
>> +    free(s->eventfds[posn]);
>> +    s->eventfds_posn_count[posn] = 0;
>> +}
>> +
>> +/* this function increases the dynamic storage needed to store data about
>> + * other guests */
>> +static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
>> +
>> +    int j, old_nr_alloc;
>> +
>> +    old_nr_alloc = s->nr_alloc_guests;
>> +
>> +    while (s->nr_alloc_guests < new_min_size)
>> +        s->nr_alloc_guests = s->nr_alloc_guests * 2;
>> +
>> +    IVSHMEM_DPRINTF("bumping storage to %d guests\n", s->nr_alloc_guests);
>> +    s->eventfds = qemu_realloc(s->eventfds, s->nr_alloc_guests *
>> +                                                        sizeof(int *));
>> +    s->eventfds_posn_count = qemu_realloc(s->eventfds_posn_count,
>> +                                                    s->nr_alloc_guests *
>> +                                                        sizeof(int));
>> +    s->eventfd_table = qemu_realloc(s->eventfd_table, s->nr_alloc_guests *
>> +                                                    sizeof(EventfdEntry));
>> +
>> +    if ((s->eventfds == NULL) || (s->eventfds_posn_count == NULL) ||
>> +            (s->eventfd_table == NULL)) {
>> +        fprintf(stderr, "Allocation error - exiting\n");
>> +        exit(1);
>> +    }
>> +
>> +    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
>> +        s->eventfd_chr = (CharDriverState **)qemu_realloc(s->eventfd_chr,
>> +                                    s->nr_alloc_guests * sizeof(void *));
>> +        if (s->eventfd_chr == NULL) {
>> +            fprintf(stderr, "Allocation error - exiting\n");
>> +            exit(1);
>> +        }
>> +    }
>> +
>> +    /* zero out new pointers */
>> +    for (j = old_nr_alloc; j < s->nr_alloc_guests; j++) {
>> +        s->eventfds[j] = NULL;
>> +    }
>> +}
>> +
>> +static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
>> +{
>> +    IVShmemState *s = opaque;
>> +    int incoming_fd, tmp_fd;
>> +    int guest_curr_max;
>> +    long incoming_posn;
>> +
>> +    memcpy(&incoming_posn, buf, sizeof(long));
>> +    /* pick off s->chr->msgfd and store it, posn should accompany msg */
>> +    tmp_fd = qemu_chr_get_msgfd(s->chr);
>> +    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
>> +
>> +    /* make sure we have enough space for this guest */
>> +    if (incoming_posn >= s->nr_alloc_guests) {
>> +        increase_dynamic_storage(s, incoming_posn);
>> +    }
>> +
>> +    if (tmp_fd == -1) {
>> +        /* if posn is positive and unseen before then this is our posn*/
>> +        if ((incoming_posn >= 0) && (s->eventfds[incoming_posn] == NULL)) {
>> +            /* receive our posn */
>> +            s->vm_id = incoming_posn;
>> +            return;
>> +        } else {
>> +            /* otherwise an fd == -1 means an existing guest has gone
>> away */
>> +            IVSHMEM_DPRINTF("posn %ld has gone away\n", incoming_posn);
>> +            close_guest_eventfds(s, incoming_posn);
>> +            return;
>> +        }
>> +    }
>> +
>> +    /* because of the implementation of get_msgfd, we need a dup */
>> +    incoming_fd = dup(tmp_fd);
>> +
>> +    /* if the position is -1, then it's shared memory region fd */
>> +    if (incoming_posn == -1) {
>> +
>> +        s->num_eventfds = 0;
>> +
>> +        if (check_shm_size(s, incoming_fd) == -1) {
>> +            exit(-1);
>> +        }
>> +
>> +        /* creating a BAR in qemu_chr callback may be crazy */
>> +        create_shared_memory_BAR(s, incoming_fd);
>> +
>> +       return;
>> +    }
>> +
>> +    /* each guest has an array of eventfds, and we keep track of how many
>> +     * guests for each VM */
>> +    guest_curr_max = s->eventfds_posn_count[incoming_posn];
>> +    if (guest_curr_max == 0) {
>> +        /* one eventfd per MSI vector */
>> +        s->eventfds[incoming_posn] = (int *) qemu_malloc(s->vectors *
>> +                                                                sizeof(int));
>> +    }
>> +
>> +    /* this is an eventfd for a particular guest VM */
>> +    IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn, guest_curr_max,
>> +                                                                incoming_fd);
>> +    s->eventfds[incoming_posn][guest_curr_max] = incoming_fd;
>> +
>> +    /* increment count for particular guest */
>> +    s->eventfds_posn_count[incoming_posn]++;
>> +
>> +    /* ioeventfd and irqfd are enabled together,
>> +     * so the flag IRQFD refers to both */
>> +    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) && guest_curr_max >= 0) {
>> +        /* allocate ioeventfd for the new fd
>> +         * received for guest @ incoming_posn */
>> +        kvm_set_ioeventfd_mmio_long(incoming_fd, s->mmio_addr + Doorbell,
>> +                                (incoming_posn << 16) | guest_curr_max, 1);
>> +    }
>> +
>> +    /* keep track of the maximum VM ID */
>> +    if (incoming_posn > s->num_eventfds) {
>> +        s->num_eventfds = incoming_posn;
>> +    }
>> +
>> +    if (incoming_posn == s->vm_id) {
>> +        if (ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
>> +            /* setup irqfd for this VM's eventfd */
>> +            int vector = guest_curr_max;
>> +            kvm_set_irqfd(s->eventfds[s->vm_id][guest_curr_max], vector,
>> +                                     s->dev.msix_irq_entries[vector].gsi);
>> +        } else {
>> +            /* initialize char device for callback
>> +             * if this is one of my eventfd */
>> +            s->eventfd_chr[guest_curr_max] = create_eventfd_chr_device(s,
>> +                s->eventfds[s->vm_id][guest_curr_max], guest_curr_max);
>> +        }
>> +    }
>> +
>> +    return;
>> +}
>> +
>> +static void ivshmem_reset(DeviceState *d)
>> +{
>> +    return;
>> +}
>> +
>> +static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
>> +                       pcibus_t addr, pcibus_t size, int type)
>> +{
>> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
>> +
>> +    s->mmio_addr = addr;
>> +    cpu_register_physical_memory(addr + 0, 0x400, s->ivshmem_mmio_io_addr);
>> +
>> +    /* now that our mmio region has been allocated, we can receive
>> +     * the file descriptors */
>> +    if (s->chr != NULL) {
>> +        qemu_chr_add_handlers(s->chr, ivshmem_can_receive, ivshmem_read,
>> +                     ivshmem_event, s);
>> +    }
>> +
>> +}
>> +
>> +static uint64_t ivshmem_get_size(IVShmemState * s) {
>> +
>> +    uint64_t value;
>> +    char *ptr;
>> +
>> +    value = strtoul(s->sizearg, &ptr, 10);
>> +    switch (*ptr) {
>> +        case 0: case 'M': case 'm':
>> +            value <<= 20;
>> +            break;
>> +        case 'G': case 'g':
>> +            value <<= 30;
>> +            break;
>> +        default:
>> +            fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg);
>> +            exit(1);
>> +    }
>> +
>> +    /* BARs must be a power of 2 */
>> +    if (!is_power_of_two(value)) {
>> +        fprintf(stderr, "ivshmem: size must be power of 2\n");
>> +        exit(1);
>> +    }
>> +
>> +    return value;
>> +
>> +}
>> +
>> +static int pci_ivshmem_init(PCIDevice *dev)
>> +{
>> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
>> +    uint8_t *pci_conf;
>> +    int i;
>> +
>> +    if (s->sizearg == NULL)
>> +        s->ivshmem_size = 4 << 20; /* 4 MB default */
>> +    else {
>> +        s->ivshmem_size = ivshmem_get_size(s);
>> +    }
>> +
>> +    /* IRQFD requires MSI */
>> +    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) &&
>> +        !ivshmem_has_feature(s, IVSHMEM_MSI)) {
>> +        fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n");
>> +        exit(1);
>> +    }
>> +
>> +    pci_conf = s->dev.config;
>> +    pci_conf[0x00] = 0xf4; /* Qumranet vendor ID 0x5002 */
>> +    pci_conf[0x01] = 0x1a;
>> +    pci_conf[0x02] = 0x10;
>> +    pci_conf[0x03] = 0x11;
>> +    pci_conf[0x04] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
>> +    pci_conf[0x0a] = 0x00; /* RAM controller */
>> +    pci_conf[0x0b] = 0x05;
>> +    pci_conf[0x0e] = 0x00; /* header_type */
>> +
>> +    s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read,
>> +                                    ivshmem_mmio_write, s);
>> +    /* region for registers */
>> +    pci_register_bar(&s->dev, 0, 0x400,
>> +                           PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_mmio_map);
>> +
>> +    /* allocate the MSI-X vectors */
>> +    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
>> +
>> +        if (!msix_init(&s->dev, s->vectors, 1, 0)) {
>> +            pci_register_bar(&s->dev, 1,
>> +                             msix_bar_size(&s->dev),
>> +                             PCI_BASE_ADDRESS_SPACE_MEMORY,
>> +                             msix_mmio_map);
>> +            IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
>> +        } else {
>> +            IVSHMEM_DPRINTF("msix initialization failed\n");
>> +        }
>> +
>> +        /* 'activate' the vectors */
>> +        for (i = 0; i < s->vectors; i++) {
>> +            msix_vector_use(&s->dev, i);
>> +        }
>> +    }
>> +
>> +    if ((s->chr != NULL) && (strncmp(s->chr->filename, "unix:", 5) == 0)) {
>> +        /* if we get a UNIX socket as the parameter we will talk
>> +         * to the ivshmem server later once the MMIO BAR is actually
>> +         * allocated (see ivshmem_mmio_map) */
>> +
>> +        IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
>> +                                            s->chr->filename);
>> +
>> +        /* we allocate enough space for 16 guests and grow as needed */
>> +        s->nr_alloc_guests = 16;
>> +        s->vm_id = -1;
>> +
>> +        /* allocate/initialize space for interrupt handling */
>> +        s->eventfds = qemu_mallocz(s->nr_alloc_guests * sizeof(int *));
>> +        s->eventfd_table = qemu_mallocz(s->vectors * sizeof(EventfdEntry));
>> +        s->eventfds_posn_count = qemu_mallocz(s->nr_alloc_guests * sizeof(int));
>> +
>> +        pci_conf[PCI_INTERRUPT_PIN] = 1; /* we are going to support interrupts */
>> +
>> +        if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
>> +            s->eventfd_chr = (CharDriverState **)qemu_malloc(s->nr_alloc_guests *
>> +                                                            sizeof(void *));
>> +        }
>> +
>> +    } else {
>> +        /* just map the file immediately, we're not using a server */
>> +        int fd;
>> +
>> +        if (s->shmobj == NULL) {
>> +            fprintf(stderr, "Must specify 'chardev' or 'shm' to ivshmem\n");
>> +        }
>> +
>> +        IVSHMEM_DPRINTF("using shm_open (shm object = %s)\n", s->shmobj);
>> +
>> +        /* try opening with O_EXCL and if it succeeds zero the memory
>> +         * by truncating to 0 */
>> +        if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR|O_EXCL,
>> +                        S_IRWXU|S_IRWXG|S_IRWXO)) > 0) {
>> +           /* truncate file to length PCI device's memory */
>> +            if (ftruncate(fd, s->ivshmem_size) != 0) {
>> +                fprintf(stderr, "kvm_ivshmem: could not truncate shared file\n");
>> +            }
>> +
>> +        } else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR,
>> +                        S_IRWXU|S_IRWXG|S_IRWXO)) < 0) {
>> +            fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
>> +            exit(-1);
>> +        }
>> +
>> +        create_shared_memory_BAR(s, fd);
>> +
>> +    }
>> +
>> +
>> +    return 0;
>> +}
>> +
>> +static int pci_ivshmem_uninit(PCIDevice *dev)
>> +{
>> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
>> +
>> +    cpu_unregister_io_memory(s->ivshmem_mmio_io_addr);
>> +
>> +    return 0;
>> +}
>> +
>> +static PCIDeviceInfo ivshmem_info = {
>> +    .qdev.name  = "ivshmem",
>> +    .qdev.size  = sizeof(IVShmemState),
>> +    .qdev.reset = ivshmem_reset,
>> +    .init       = pci_ivshmem_init,
>> +    .exit       = pci_ivshmem_uninit,
>> +    .qdev.props = (Property[]) {
>> +        DEFINE_PROP_CHR("chardev", IVShmemState, chr),
>> +        DEFINE_PROP_STRING("size", IVShmemState, sizearg),
>> +        DEFINE_PROP_UINT32("vectors", IVShmemState, vectors, 1),
>> +        DEFINE_PROP_BIT("irqfd", IVShmemState, features, IVSHMEM_IRQFD,
>> false),
>> +        DEFINE_PROP_BIT("msi", IVShmemState, features, IVSHMEM_MSI,
>> true),
>> +        DEFINE_PROP_STRING("shm", IVShmemState, shmobj),
>> +        DEFINE_PROP_END_OF_LIST(),
>> +    }
>> +};
>> +
>> +static void ivshmem_register_devices(void)
>> +{
>> +    pci_qdev_register(&ivshmem_info);
>> +}
>> +
>> +device_init(ivshmem_register_devices)
>> diff --git a/qemu-char.c b/qemu-char.c
>> index 048da3f..41cb8c7 100644
>> --- a/qemu-char.c
>> +++ b/qemu-char.c
>> @@ -2076,6 +2076,12 @@ static void tcp_chr_read(void *opaque)
>>      }
>>  }
>>
>> +CharDriverState *qemu_chr_open_eventfd(int eventfd){
>> +
>> +    return qemu_chr_open_fd(eventfd, eventfd);
>> +
>> +}
>> +
>>  static void tcp_chr_connect(void *opaque)
>>  {
>>      CharDriverState *chr = opaque;
>> diff --git a/qemu-char.h b/qemu-char.h
>> index 3a9427b..1571091 100644
>> --- a/qemu-char.h
>> +++ b/qemu-char.h
>> @@ -93,6 +93,9 @@ void qemu_chr_info_print(Monitor *mon, const QObject *ret_data);
>>  void qemu_chr_info(Monitor *mon, QObject **ret_data);
>>  CharDriverState *qemu_chr_find(const char *name);
>>
>> +/* add an eventfd to the qemu devices that are polled */
>> +CharDriverState *qemu_chr_open_eventfd(int eventfd);
>> +
>>  extern int term_escape_char;
>>
>>  /* async I/O support */
>> diff --git a/qemu-doc.texi b/qemu-doc.texi
>> index 6647b7b..2df4687 100644
>> --- a/qemu-doc.texi
>> +++ b/qemu-doc.texi
>> @@ -706,6 +706,31 @@ Using the @option{-net socket} option, it is possible to make VLANs
>>  that span several QEMU instances. See @ref{sec_invocation} to have a
>>  basic example.
>>
>> +@section Other Devices
>> +
>> +@subsection Inter-VM Shared Memory device
>> +
>> +With KVM enabled on a Linux host, a shared memory device is available.  Guests
>> +map a POSIX shared memory region into the guest as a PCI device that enables
>> +zero-copy communication to the application level of the guests.  The basic
>> +syntax is:
>> +
>> +@example
>> +qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>> +@end example
>> +
>> +If desired, interrupts can be sent between guest VMs accessing the same shared
>> +memory region.  Interrupt support requires using a shared memory server and
>> +using a chardev socket to connect to it.  The code for the shared memory server
>> +is qemu.git/contrib/ivshmem-server.  An example syntax when using the shared
>> +memory server is:
>> +
>> +@example
>> +qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>> +                        [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
>> +qemu -chardev socket,path=<path>,id=<id>
>> +@end example
>> +
>>  @node direct_linux_boot
>>  @section Direct Linux Boot
>>
>>
>
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [Qemu-devel] Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
@ 2010-05-06 17:59             ` Cam Macdonell
  0 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-06 17:59 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, kvm

On Thu, May 6, 2010 at 11:32 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 04/21/2010 12:53 PM, Cam Macdonell wrote:
>>
>> Support an inter-vm shared memory device that maps a shared-memory object
>> as a PCI device in the guest.  This patch also supports interrupts between
>> guests by communicating over a unix domain socket.  This patch applies to
>> the qemu-kvm repository.
>>
>>     -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>
>> Interrupts are supported between multiple VMs by using a shared memory
>> server and a chardev socket.
>>
>>     -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>                     [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
>>     -chardev socket,path=<path>,id=<id>
>>
>> (shared memory server is qemu.git/contrib/ivshmem-server)
>>
>> Sample programs and init scripts are in a git repo here:
>>
>>     www.gitorious.org/nahanni
>> ---
>>  Makefile.target |    3 +
>>  hw/ivshmem.c    |  727 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  qemu-char.c     |    6 +
>>  qemu-char.h     |    3 +
>>  qemu-doc.texi   |   25 ++
>>  5 files changed, 764 insertions(+), 0 deletions(-)
>>  create mode 100644 hw/ivshmem.c
>>
>> diff --git a/Makefile.target b/Makefile.target
>> index 1ffd802..bc9a681 100644
>> --- a/Makefile.target
>> +++ b/Makefile.target
>> @@ -199,6 +199,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
>>  obj-y += rtl8139.o
>>  obj-y += e1000.o
>>
>> +# Inter-VM PCI shared memory
>> +obj-y += ivshmem.o
>> +
>>  # Hardware support
>>  obj-i386-y = pckbd.o dma.o
>>  obj-i386-y += vga.o
>> diff --git a/hw/ivshmem.c b/hw/ivshmem.c
>> new file mode 100644
>> index 0000000..f8d8fdb
>> --- /dev/null
>> +++ b/hw/ivshmem.c
>> @@ -0,0 +1,727 @@
>> +/*
>> + * Inter-VM Shared Memory PCI device.
>> + *
>> + * Author:
>> + *      Cam Macdonell <cam@cs.ualberta.ca>
>> + *
>> + * Based On: cirrus_vga.c and rtl8139.c
>> + *
>> + * This code is licensed under the GNU GPL v2.
>> + */
>> +#include <sys/mman.h>
>> +#include <sys/types.h>
>> +#include <sys/socket.h>
>> +#include <sys/io.h>
>> +#include <sys/ioctl.h>
>> +#include <sys/eventfd.h>
>>
>
> This will break the Windows build, along with any non-Linux Unix or any Linux
> old enough to not have eventfd support.

I'll wrap it with

#ifdef CONFIG_EVENTFD
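
i.e. roughly (sketch only; CONFIG_EVENTFD is assumed here to be the
configure-time symbol for eventfd support):

    #ifdef CONFIG_EVENTFD
    #include <sys/eventfd.h>
    #endif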

> If it's based on cirrus_vga.c and rtl8139.c, then it ought to carry the
> respective copyrights, no?

Sure, I can add those

Cam

>
> Regards,
>
> Anthony Liguori
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 2/5] Support adding a file to qemu's ram allocation
  2010-04-21 17:53     ` [Qemu-devel] " Cam Macdonell
@ 2010-05-10 10:39       ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-10 10:39 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: kvm, qemu-devel

On 04/21/2010 08:53 PM, Cam Macdonell wrote:
> This avoids the need to use qemu_ram_alloc and mmap with MAP_FIXED to map a
> host file into guest RAM.  This function mmaps the opened file anywhere and adds
> the memory to the ram blocks.
>
> Usage is
>
> qemu_ram_mmap(fd, size, MAP_SHARED, offset);
>    

Signoff?
>
> +ram_addr_t qemu_ram_mmap(int fd, ram_addr_t size, int flags, off_t offset)
> +{
> +    RAMBlock *new_block;
> +
> +    size = TARGET_PAGE_ALIGN(size);
> +    new_block = qemu_malloc(sizeof(*new_block));
> +
> +    /* map the file passed as a parameter to be this part of memory */
> +    new_block->host = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, offset);
> +
> +    if (new_block->host == MAP_FAILED)
> +        exit(1);
>    

Braces after if ()

> +    if (kvm_enabled())
> +        kvm_setup_guest_memory(new_block->host, size);
> +
>    

More braces.
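
I.e., in QEMU style both tests would take braces (sketch only):

    if (new_block->host == MAP_FAILED) {
        exit(1);
    }

    if (kvm_enabled()) {
        kvm_setup_guest_memory(new_block->host, size);
    }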

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 3/5] Add functions for assigning ioeventfd and irqfds.
  2010-04-21 17:53       ` [Qemu-devel] " Cam Macdonell
@ 2010-05-10 10:43         ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-10 10:43 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: kvm, qemu-devel

On 04/21/2010 08:53 PM, Cam Macdonell wrote:
> Generic functions to assign irqfds and ioeventfds.
>
>    

Signoff.

>   }
>
>   #ifdef KVM_IOEVENTFD
> +int kvm_set_irqfd(int fd, uint16_t vector, uint32_t gsi)
> +{
> +    struct kvm_irqfd call = { };
> +    int r;
> +
> +    call.fd = fd;
> +    call.gsi = gsi;
>    

> +
> +    if (!kvm_enabled())
> +        return -ENOSYS;
>    

Braces, here and elsewhere.

> +    r = kvm_vm_ioctl(kvm_state, KVM_IRQFD, &call);
> +
> +    if (r < 0) {
> +        return r;
>    

-errno

> +    }
> +    return 0;
> +}
> +
> +int kvm_set_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val, bool assign)
> +{
> +
> +    int ret;
> +    struct kvm_ioeventfd iofd;
> +
> +    iofd.datamatch = val;
> +    iofd.addr = addr;
> +    iofd.len = 4;
> +    iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
> +    iofd.fd = fd;
> +
> +    if (!kvm_enabled())
> +        return -ENOSYS;
> +    if (!assign)
> +        iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
>    

May be more usable to have separate assign and deassign functions (that 
can call into a single internal implementation).
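
Something along these lines (just a sketch; the function names and the
internal helper are illustrative, not part of the patch, and this assumes
kvm_vm_ioctl()/kvm_state as already used in this file and that errno is
still valid after a failed ioctl):

    static int kvm_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val,
                                       uint32_t extra_flags)
    {
        struct kvm_ioeventfd iofd = {
            .datamatch = val,
            .addr      = addr,
            .len       = 4,
            .fd        = fd,
            .flags     = KVM_IOEVENTFD_FLAG_DATAMATCH | extra_flags,
        };
        int ret;

        if (!kvm_enabled()) {
            return -ENOSYS;
        }

        ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);
        if (ret < 0) {
            /* report the failure as -errno, as suggested in this review */
            return -errno;
        }
        return 0;
    }

    int kvm_assign_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val)
    {
        return kvm_ioeventfd_mmio_long(fd, addr, val, 0);
    }

    int kvm_deassign_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val)
    {
        return kvm_ioeventfd_mmio_long(fd, addr, val,
                                       KVM_IOEVENTFD_FLAG_DEASSIGN);
    }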

> +
> +    ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);
> +
> +    if (ret < 0) {
> +        return ret;
>    

-errno

> +    }
> +
> +    return 0;
> +}
> +
>    

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-04-21 17:53         ` [Qemu-devel] " Cam Macdonell
@ 2010-05-10 11:59           ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-10 11:59 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: kvm, qemu-devel

On 04/21/2010 08:53 PM, Cam Macdonell wrote:
> Support an inter-vm shared memory device that maps a shared-memory object as a
> PCI device in the guest.  This patch also supports interrupts between guests by
> communicating over a unix domain socket.  This patch applies to the qemu-kvm
> repository.
>
>      -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>
> Interrupts are supported between multiple VMs by using a shared memory server
> and a chardev socket.
>
>      -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>                      [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
>      -chardev socket,path=<path>,id=<id>
>
> (shared memory server is qemu.git/contrib/ivshmem-server)
>
> Sample programs and init scripts are in a git repo here:
>
>
> +typedef struct EventfdEntry {
> +    PCIDevice *pdev;
> +    int vector;
> +} EventfdEntry;
> +
> +typedef struct IVShmemState {
> +    PCIDevice dev;
> +    uint32_t intrmask;
> +    uint32_t intrstatus;
> +    uint32_t doorbell;
> +
> +    CharDriverState * chr;
> +    CharDriverState ** eventfd_chr;
> +    int ivshmem_mmio_io_addr;
> +
> +    pcibus_t mmio_addr;
> +    unsigned long ivshmem_offset;
> +    uint64_t ivshmem_size; /* size of shared memory region */
> +    int shm_fd; /* shared memory file descriptor */
> +
> +    int nr_allocated_vms;
> +    /* array of eventfds for each guest */
> +    int ** eventfds;
> +    /* keep track of # of eventfds for each guest*/
> +    int * eventfds_posn_count;
>    

More readable:

   typedef struct Peer {
       int nb_eventfds;
       int *eventfds;
   } Peer;
   int nb_peers;
   Peer *peers;

Does eventfd_chr need to be there as well?

> +
> +    int nr_alloc_guests;
> +    int vm_id;
> +    int num_eventfds;
> +    uint32_t vectors;
> +    uint32_t features;
> +    EventfdEntry *eventfd_table;
> +
> +    char * shmobj;
> +    char * sizearg;
>    

Does this need to be part of the state?

> +} IVShmemState;
> +
> +/* registers for the Inter-VM shared memory device */
> +enum ivshmem_registers {
> +    IntrMask = 0,
> +    IntrStatus = 4,
> +    IVPosition = 8,
> +    Doorbell = 12,
> +};
> +
> +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
> +    return (ivs->features & (1 << feature));
> +}
> +
> +static inline int is_power_of_two(int x) {
> +    return (x & (x - 1)) == 0;
> +}
>    

argument needs to be uint64_t to avoid overflow with large BARs.  Return 
type can be bool.
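
Sketch of that change (illustrative only):

    static inline bool is_power_of_two(uint64_t x) {
        return (x & (x - 1)) == 0;
    }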

> +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
> +{
> +    IVShmemState *s = opaque;
> +
> +    u_int64_t write_one = 1;
> +    u_int16_t dest = val >> 16;
> +    u_int16_t vector = val & 0xff;
> +
> +    addr &= 0xfe;
>    

Why 0xfe?  Can understand 0xfc or 0xff.

> +
> +    switch (addr)
> +    {
> +        case IntrMask:
> +            ivshmem_IntrMask_write(s, val);
> +            break;
> +
> +        case IntrStatus:
> +            ivshmem_IntrStatus_write(s, val);
> +            break;
> +
> +        case Doorbell:
> +            /* check doorbell range */
> +            if ((vector >= 0) && (vector < s->eventfds_posn_count[dest])) {
>    

What if dest is too big?  We overflow s->eventfds_posn_count.
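
A sketch of the kind of range check meant here (bounding dest by the number of
allocated peers; the exact bound to use is an assumption, not part of the patch):

    case Doorbell:
        /* reject out-of-range targets before indexing the per-guest arrays */
        if ((dest < s->nr_alloc_guests) &&
            (vector < s->eventfds_posn_count[dest])) {
            /* write to s->eventfds[dest][vector] as before */
        }
        break;
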
> +
> +static void close_guest_eventfds(IVShmemState *s, int posn)
> +{
> +    int i, guest_curr_max;
> +
> +    guest_curr_max = s->eventfds_posn_count[posn];
> +
> +    for (i = 0; i < guest_curr_max; i++)
> +        close(s->eventfds[posn][i]);
> +
> +    free(s->eventfds[posn]);
>    

qemu_free().

> +/* this function increase the dynamic storage need to store data about other
> + * guests */
> +static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
> +
> +    int j, old_nr_alloc;
> +
> +    old_nr_alloc = s->nr_alloc_guests;
> +
> +    while (s->nr_alloc_guests < new_min_size)
> +        s->nr_alloc_guests = s->nr_alloc_guests * 2;
> +
> +    IVSHMEM_DPRINTF("bumping storage to %d guests\n", s->nr_alloc_guests);
> +    s->eventfds = qemu_realloc(s->eventfds, s->nr_alloc_guests *
> +                                                        sizeof(int *));
> +    s->eventfds_posn_count = qemu_realloc(s->eventfds_posn_count,
> +                                                    s->nr_alloc_guests *
> +                                                        sizeof(int));
> +    s->eventfd_table = qemu_realloc(s->eventfd_table, s->nr_alloc_guests *
> +                                                    sizeof(EventfdEntry));
> +
> +    if ((s->eventfds == NULL) || (s->eventfds_posn_count == NULL) ||
> +            (s->eventfd_table == NULL)) {
> +        fprintf(stderr, "Allocation error - exiting\n");
> +        exit(1);
> +    }
> +
> +    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
> +        s->eventfd_chr = (CharDriverState **)qemu_realloc(s->eventfd_chr,
> +                                    s->nr_alloc_guests * sizeof(void *));
> +        if (s->eventfd_chr == NULL) {
> +            fprintf(stderr, "Allocation error - exiting\n");
> +            exit(1);
> +        }
> +    }
> +
> +    /* zero out new pointers */
> +    for (j = old_nr_alloc; j<  s->nr_alloc_guests; j++) {
> +        s->eventfds[j] = NULL;
>    

eventfds_posn_count and eventfd_table want zeroing as well.
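
I.e. something along these lines (sketch):

   /* zero out all new entries, not just the eventfds pointers */
   for (j = old_nr_alloc; j < s->nr_alloc_guests; j++) {
       s->eventfds[j] = NULL;
       s->eventfds_posn_count[j] = 0;
       memset(&s->eventfd_table[j], 0, sizeof(EventfdEntry));
   }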

> +    }
> +}
> +
> +static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
> +{
> +    IVShmemState *s = opaque;
> +    int incoming_fd, tmp_fd;
> +    int guest_curr_max;
> +    long incoming_posn;
> +
> +    memcpy(&incoming_posn, buf, sizeof(long));
> +    /* pick off s->chr->msgfd and store it, posn should accompany msg */
> +    tmp_fd = qemu_chr_get_msgfd(s->chr);
> +    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
> +
> +    /* make sure we have enough space for this guest */
> +    if (incoming_posn>= s->nr_alloc_guests) {
> +        increase_dynamic_storage(s, incoming_posn);
> +    }
> +
> +    if (tmp_fd == -1) {
> +        /* if posn is positive and unseen before then this is our posn*/
> +        if ((incoming_posn>= 0)&&  (s->eventfds[incoming_posn] == NULL)) {
> +            /* receive our posn */
> +            s->vm_id = incoming_posn;
> +            return;
> +        } else {
> +            /* otherwise an fd == -1 means an existing guest has gone away */
> +            IVSHMEM_DPRINTF("posn %ld has gone away\n", incoming_posn);
> +            close_guest_eventfds(s, incoming_posn);
> +            return;
> +        }
> +    }
> +
> +    /* because of the implementation of get_msgfd, we need a dup */
> +    incoming_fd = dup(tmp_fd);
>    

Error check.
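
For example (sketch; whether this should be fatal is a separate question):

   incoming_fd = dup(tmp_fd);
   if (incoming_fd == -1) {
       fprintf(stderr, "ivshmem: could not dup() fd %d: %s\n",
               tmp_fd, strerror(errno));
       return;
   }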

> +
> +    /* if the position is -1, then it's shared memory region fd */
> +    if (incoming_posn == -1) {
> +
> +        s->num_eventfds = 0;
> +
> +        if (check_shm_size(s, incoming_fd) == -1) {
> +            exit(-1);
> +        }
> +
> +        /* creating a BAR in qemu_chr callback may be crazy */
> +        create_shared_memory_BAR(s, incoming_fd);
>    

It probably is... why can't you create it during initialization?


> +
> +       return;
> +    }
> +
> +    /* each guest has an array of eventfds, and we keep track of how many
> +     * guests for each VM */
> +    guest_curr_max = s->eventfds_posn_count[incoming_posn];
> +    if (guest_curr_max == 0) {
> +        /* one eventfd per MSI vector */
> +        s->eventfds[incoming_posn] = (int *) qemu_malloc(s->vectors *
> +                                                                sizeof(int));
> +    }
> +
> +    /* this is an eventfd for a particular guest VM */
> +    IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn, guest_curr_max,
> +                                                                incoming_fd);
> +    s->eventfds[incoming_posn][guest_curr_max] = incoming_fd;
> +
> +    /* increment count for particular guest */
> +    s->eventfds_posn_count[incoming_posn]++;
>    

Not sure I follow exactly, but perhaps this needs to be

     s->eventfds_posn_count[incoming_posn] = guest_curr_max + 1;

Oh, it is.

> +
> +        /* allocate/initialize space for interrupt handling */
> +        s->eventfds = qemu_mallocz(s->nr_alloc_guests * sizeof(int *));
> +        s->eventfd_table = qemu_mallocz(s->vectors * sizeof(EventfdEntry));
> +        s->eventfds_posn_count = qemu_mallocz(s->nr_alloc_guests * sizeof(int));
> +
> +        pci_conf[PCI_INTERRUPT_PIN] = 1; /* we are going to support interrupts */
>    

This is done by the guest BIOS.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 3/5] Add functions for assigning ioeventfd and irqfds.
  2010-05-10 10:43         ` [Qemu-devel] " Avi Kivity
@ 2010-05-10 15:13           ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-10 15:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, qemu-devel

On Mon, May 10, 2010 at 4:43 AM, Avi Kivity <avi@redhat.com> wrote:
> On 04/21/2010 08:53 PM, Cam Macdonell wrote:
>>
>> Generic functions to assign irqfds and ioeventfds.
>>
>>
>
> Signoff.
>
>>  }
>>
>>  #ifdef KVM_IOEVENTFD
>> +int kvm_set_irqfd(int fd, uint16_t vector, uint32_t gsi)
>> +{
>> +    struct kvm_irqfd call = { };
>> +    int r;
>> +
>> +    call.fd = fd;
>> +    call.gsi = gsi;
>>
>
>> +
>> +    if (!kvm_enabled())
>> +        return -ENOSYS;
>>
>
> Braces, here and elsewhere.

This function is unnecessary as Michael added one that does the same thing.

>
>> +    r = kvm_vm_ioctl(kvm_state, KVM_IRQFD,&call);
>> +
>> +    if (r<  0) {
>> +        return r;
>>
>
> -errno
>
>> +    }
>> +    return 0;
>> +}
>> +
>> +int kvm_set_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val, bool
>> assign)
>> +{
>> +
>> +    int ret;
>> +    struct kvm_ioeventfd iofd;
>> +
>> +    iofd.datamatch = val;
>> +    iofd.addr = addr;
>> +    iofd.len = 4;
>> +    iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
>> +    iofd.fd = fd;
>> +
>> +    if (!kvm_enabled())
>> +        return -ENOSYS;
>> +    if (!assign)
>> +        iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
>>
>
> May be more usable to have separate assign and deassign functions (that can
> call into a single internal implementation).

I believe the convention so far is to use the 'assign' flag as
Michael's patch and the PIO version kvm_set_ioeventfd_pio_word() do.
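
For reference, the split Avi suggests could sit on top of one internal
helper, roughly like this (a sketch only: the wrapper names are made up,
and kvm_vm_ioctl() is assumed to return -errno on failure, which would also
address the -errno comment above):

   static int kvm_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val,
                                      uint32_t extra_flags)
   {
       struct kvm_ioeventfd iofd = {
           .datamatch = val,
           .addr      = addr,
           .len       = 4,
           .fd        = fd,
           .flags     = KVM_IOEVENTFD_FLAG_DATAMATCH | extra_flags,
       };

       if (!kvm_enabled()) {
           return -ENOSYS;
       }
       return kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);
   }

   int kvm_assign_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val)
   {
       return kvm_ioeventfd_mmio_long(fd, addr, val, 0);
   }

   int kvm_deassign_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val)
   {
       return kvm_ioeventfd_mmio_long(fd, addr, val,
                                      KVM_IOEVENTFD_FLAG_DEASSIGN);
   }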

>
>> +
>> +    ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD,&iofd);
>> +
>> +    if (ret<  0) {
>> +        return ret;
>>
>
> -errno
>
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>
>
> --
> error compiling committee.c: too many arguments to function
>
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 3/5] Add functions for assigning ioeventfd and irqfds.
  2010-05-10 15:13           ` [Qemu-devel] " Cam Macdonell
@ 2010-05-10 15:17             ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-10 15:17 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: kvm, qemu-devel

On 05/10/2010 06:13 PM, Cam Macdonell wrote:
>
>>> +int kvm_set_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val, bool
>>> assign)
>>> +{
>>> +
>>> +    int ret;
>>> +    struct kvm_ioeventfd iofd;
>>> +
>>> +    iofd.datamatch = val;
>>> +    iofd.addr = addr;
>>> +    iofd.len = 4;
>>> +    iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
>>> +    iofd.fd = fd;
>>> +
>>> +    if (!kvm_enabled())
>>> +        return -ENOSYS;
>>> +    if (!assign)
>>> +        iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
>>>
>>>        
>> May be more usable to have separate assign and deassign functions (that can
>> call into a single internal implementation).
>>      
> I believe the convention so far is to use the 'assign' flag as
> Michael's patch and the PIO version kvm_set_ioeventfd_pio_word() do.
>    

I dislike bool arguments since they're hard to understand at the call 
site.  However if there's precedent we can stick to it and perhaps 
change it all later.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 11:59           ` [Qemu-devel] " Avi Kivity
@ 2010-05-10 15:22             ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-10 15:22 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, qemu-devel

On Mon, May 10, 2010 at 5:59 AM, Avi Kivity <avi@redhat.com> wrote:
> On 04/21/2010 08:53 PM, Cam Macdonell wrote:
>>
>> Support an inter-vm shared memory device that maps a shared-memory object
>> as a
>> PCI device in the guest.  This patch also supports interrupts between
>> guest by
>> communicating over a unix domain socket.  This patch applies to the
>> qemu-kvm
>> repository.
>>
>>     -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>
>> Interrupts are supported between multiple VMs by using a shared memory
>> server
>> by using a chardev socket.
>>
>>     -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>                     [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
>>     -chardev socket,path=<path>,id=<id>
>>
>> (shared memory server is qemu.git/contrib/ivshmem-server)
>>
>> Sample programs and init scripts are in a git repo here:
>>
>>
>> +typedef struct EventfdEntry {
>> +    PCIDevice *pdev;
>> +    int vector;
>> +} EventfdEntry;
>> +
>> +typedef struct IVShmemState {
>> +    PCIDevice dev;
>> +    uint32_t intrmask;
>> +    uint32_t intrstatus;
>> +    uint32_t doorbell;
>> +
>> +    CharDriverState * chr;
>> +    CharDriverState ** eventfd_chr;
>> +    int ivshmem_mmio_io_addr;
>> +
>> +    pcibus_t mmio_addr;
>> +    unsigned long ivshmem_offset;
>> +    uint64_t ivshmem_size; /* size of shared memory region */
>> +    int shm_fd; /* shared memory file descriptor */
>> +
>> +    int nr_allocated_vms;
>> +    /* array of eventfds for each guest */
>> +    int ** eventfds;
>> +    /* keep track of # of eventfds for each guest*/
>> +    int * eventfds_posn_count;
>>
>
> More readable:
>
>  typedef struct Peer {
>      int nb_eventfds;
>      int *eventfds;
>  } Peer;
>  int nb_peers;
>  Peer *peers;
>
> Does eventfd_chr need to be there as well?
>
>> +
>> +    int nr_alloc_guests;
>> +    int vm_id;
>> +    int num_eventfds;
>> +    uint32_t vectors;
>> +    uint32_t features;
>> +    EventfdEntry *eventfd_table;
>> +
>> +    char * shmobj;
>> +    char * sizearg;
>>
>
> Does this need to be part of the state?
>
>> +} IVShmemState;
>> +
>> +/* registers for the Inter-VM shared memory device */
>> +enum ivshmem_registers {
>> +    IntrMask = 0,
>> +    IntrStatus = 4,
>> +    IVPosition = 8,
>> +    Doorbell = 12,
>> +};
>> +
>> +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int
>> feature) {
>> +    return (ivs->features&  (1<<  feature));
>> +}
>> +
>> +static inline int is_power_of_two(int x) {
>> +    return (x&  (x-1)) == 0;
>> +}
>>
>
> argument needs to be uint64_t to avoid overflow with large BARs.  Return
> type can be bool.
>
>> +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
>> +{
>> +    IVShmemState *s = opaque;
>> +
>> +    u_int64_t write_one = 1;
>> +    u_int16_t dest = val>>  16;
>> +    u_int16_t vector = val&  0xff;
>> +
>> +    addr&= 0xfe;
>>
>
> Why 0xfe?  Can understand 0xfc or 0xff.
>
>> +
>> +    switch (addr)
>> +    {
>> +        case IntrMask:
>> +            ivshmem_IntrMask_write(s, val);
>> +            break;
>> +
>> +        case IntrStatus:
>> +            ivshmem_IntrStatus_write(s, val);
>> +            break;
>> +
>> +        case Doorbell:
>> +            /* check doorbell range */
>> +            if ((vector>= 0)&&  (vector<  s->eventfds_posn_count[dest]))
>> {
>>
>
> What if dest is too big?  We overflow s->eventfds_posn_count.
>>
>> +
>> +static void close_guest_eventfds(IVShmemState *s, int posn)
>> +{
>> +    int i, guest_curr_max;
>> +
>> +    guest_curr_max = s->eventfds_posn_count[posn];
>> +
>> +    for (i = 0; i<  guest_curr_max; i++)
>> +        close(s->eventfds[posn][i]);
>> +
>> +    free(s->eventfds[posn]);
>>
>
> qemu_free().
>
>> +/* this function increase the dynamic storage need to store data about
>> other
>> + * guests */
>> +static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
>> +
>> +    int j, old_nr_alloc;
>> +
>> +    old_nr_alloc = s->nr_alloc_guests;
>> +
>> +    while (s->nr_alloc_guests<  new_min_size)
>> +        s->nr_alloc_guests = s->nr_alloc_guests * 2;
>> +
>> +    IVSHMEM_DPRINTF("bumping storage to %d guests\n",
>> s->nr_alloc_guests);
>> +    s->eventfds = qemu_realloc(s->eventfds, s->nr_alloc_guests *
>> +                                                        sizeof(int *));
>> +    s->eventfds_posn_count = qemu_realloc(s->eventfds_posn_count,
>> +                                                    s->nr_alloc_guests *
>> +                                                        sizeof(int));
>> +    s->eventfd_table = qemu_realloc(s->eventfd_table, s->nr_alloc_guests
>> *
>> +
>>  sizeof(EventfdEntry));
>> +
>> +    if ((s->eventfds == NULL) || (s->eventfds_posn_count == NULL) ||
>> +            (s->eventfd_table == NULL)) {
>> +        fprintf(stderr, "Allocation error - exiting\n");
>> +        exit(1);
>> +    }
>> +
>> +    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
>> +        s->eventfd_chr = (CharDriverState **)qemu_realloc(s->eventfd_chr,
>> +                                    s->nr_alloc_guests * sizeof(void *));
>> +        if (s->eventfd_chr == NULL) {
>> +            fprintf(stderr, "Allocation error - exiting\n");
>> +            exit(1);
>> +        }
>> +    }
>> +
>> +    /* zero out new pointers */
>> +    for (j = old_nr_alloc; j<  s->nr_alloc_guests; j++) {
>> +        s->eventfds[j] = NULL;
>>
>
> eventfds_posn_count and eventfd_table want zeroing as well.
>
>> +    }
>> +}
>> +
>> +static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
>> +{
>> +    IVShmemState *s = opaque;
>> +    int incoming_fd, tmp_fd;
>> +    int guest_curr_max;
>> +    long incoming_posn;
>> +
>> +    memcpy(&incoming_posn, buf, sizeof(long));
>> +    /* pick off s->chr->msgfd and store it, posn should accompany msg */
>> +    tmp_fd = qemu_chr_get_msgfd(s->chr);
>> +    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
>> +
>> +    /* make sure we have enough space for this guest */
>> +    if (incoming_posn>= s->nr_alloc_guests) {
>> +        increase_dynamic_storage(s, incoming_posn);
>> +    }
>> +
>> +    if (tmp_fd == -1) {
>> +        /* if posn is positive and unseen before then this is our posn*/
>> +        if ((incoming_posn>= 0)&&  (s->eventfds[incoming_posn] == NULL))
>> {
>> +            /* receive our posn */
>> +            s->vm_id = incoming_posn;
>> +            return;
>> +        } else {
>> +            /* otherwise an fd == -1 means an existing guest has gone
>> away */
>> +            IVSHMEM_DPRINTF("posn %ld has gone away\n", incoming_posn);
>> +            close_guest_eventfds(s, incoming_posn);
>> +            return;
>> +        }
>> +    }
>> +
>> +    /* because of the implementation of get_msgfd, we need a dup */
>> +    incoming_fd = dup(tmp_fd);
>>
>
> Error check.
>
>> +
>> +    /* if the position is -1, then it's shared memory region fd */
>> +    if (incoming_posn == -1) {
>> +
>> +        s->num_eventfds = 0;
>> +
>> +        if (check_shm_size(s, incoming_fd) == -1) {
>> +            exit(-1);
>> +        }
>> +
>> +        /* creating a BAR in qemu_chr callback may be crazy */
>> +        create_shared_memory_BAR(s, incoming_fd);
>>
>
> It probably is... why can't you create it during initialization?

This is for the shared memory server implementation, so the fd for the
shared memory has to be received (over the qemu char device) from the
server before the BAR can be created via qemu_ram_mmap(), which adds
the necessary memory to the ram blocks.

Otherwise, if the BAR is allocated during initialization, I would have
to use MAP_FIXED to mmap the memory.  This is what I did before the
qemu_ram_mmap() function was added.
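
In code terms, that ordering is roughly the following (a sketch built from
the helpers in this series; names and arguments are approximate, not the
verbatim patch):

   static void create_shared_memory_BAR(IVShmemState *s, int fd)
   {
       s->shm_fd = fd;

       /* map the fd received from the server and add it to the ram blocks
        * (qemu_ram_mmap() is the helper added in patch 2/5) */
       s->ivshmem_offset = qemu_ram_mmap(fd, s->ivshmem_size, MAP_SHARED, 0);

       /* only now can BAR 2 be registered, backed by that ram block */
       pci_register_bar(&s->dev, 2, s->ivshmem_size,
                        PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
   }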

>
>
>> +
>> +       return;
>> +    }
>> +
>> +    /* each guest has an array of eventfds, and we keep track of how many
>> +     * guests for each VM */
>> +    guest_curr_max = s->eventfds_posn_count[incoming_posn];
>> +    if (guest_curr_max == 0) {
>> +        /* one eventfd per MSI vector */
>> +        s->eventfds[incoming_posn] = (int *) qemu_malloc(s->vectors *
>> +
>>  sizeof(int));
>> +    }
>> +
>> +    /* this is an eventfd for a particular guest VM */
>> +    IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn,
>> guest_curr_max,
>> +
>>  incoming_fd);
>> +    s->eventfds[incoming_posn][guest_curr_max] = incoming_fd;
>> +
>> +    /* increment count for particular guest */
>> +    s->eventfds_posn_count[incoming_posn]++;
>>
>
> Not sure I follow exactly, but perhaps this needs to be
>
>    s->eventfds_posn_count[incoming_posn] = guest_curr_max + 1;
>
> Oh, it is.
>
>> +
>> +        /* allocate/initialize space for interrupt handling */
>> +        s->eventfds = qemu_mallocz(s->nr_alloc_guests * sizeof(int *));
>> +        s->eventfd_table = qemu_mallocz(s->vectors *
>> sizeof(EventfdEntry));
>> +        s->eventfds_posn_count = qemu_mallocz(s->nr_alloc_guests *
>> sizeof(int));
>> +
>> +        pci_conf[PCI_INTERRUPT_PIN] = 1; /* we are going to support
>> interrupts */
>>
>
> This is done by the guest BIOS.
>
>
> --
> error compiling committee.c: too many arguments to function
>
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 15:22             ` [Qemu-devel] " Cam Macdonell
@ 2010-05-10 15:28               ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-10 15:28 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: kvm, qemu-devel, Anthony Liguori

On 05/10/2010 06:22 PM, Cam Macdonell wrote:
>
>>
>>> +
>>> +    /* if the position is -1, then it's shared memory region fd */
>>> +    if (incoming_posn == -1) {
>>> +
>>> +        s->num_eventfds = 0;
>>> +
>>> +        if (check_shm_size(s, incoming_fd) == -1) {
>>> +            exit(-1);
>>> +        }
>>> +
>>> +        /* creating a BAR in qemu_chr callback may be crazy */
>>> +        create_shared_memory_BAR(s, incoming_fd);
>>>
>>>        
>> It probably is... why can't you create it during initialization?
>>      
> This is for the shared memory server implementation, so the fd for the
> shared memory has to be received (over the qemu char device) from the
> server before the BAR can be created via qemu_ram_mmap() which adds
> the necessary memory
>
>    


We could do the handshake during initialization.  I'm worried that the 
device will appear without the BAR, and strange things will happen.  But 
the chardev API is probably not geared for passing data during init.

Anthony, any ideas?

> Otherwise, if the BAR is allocated during initialization, I would have
> to use MAP_FIXED to mmap the memory.  This is what I did before the
> qemu_ram_mmap() function was added.
>    

What would happen to any data written to the BAR before the 
handshake completed?  I think it would disappear.

So it's a good idea to make the initialization process atomic.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 2/5] Support adding a file to qemu's ram allocation
  2010-05-10 10:39       ` [Qemu-devel] " Avi Kivity
@ 2010-05-10 15:32         ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-10 15:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, qemu-devel@nongnu.org Developers, mtosatti

On Mon, May 10, 2010 at 4:39 AM, Avi Kivity <avi@redhat.com> wrote:
> On 04/21/2010 08:53 PM, Cam Macdonell wrote:
>>
>> This avoids the need of using qemu_ram_alloc and mmap with MAP_FIXED to
>> map a
>> host file into guest RAM.  This function mmaps the opened file anywhere
>> and adds
>> the memory to the ram blocks.
>>
>> Usage is
>>
>> qemu_ram_mmap(fd, size, MAP_SHARED, offset);
>>
>
> Signoff?
>>
>> +ram_addr_t qemu_ram_mmap(int fd, ram_addr_t size, int flags, off_t
>> offset)
>> +{
>> +    RAMBlock *new_block;
>> +
>> +    size = TARGET_PAGE_ALIGN(size);
>> +    new_block = qemu_malloc(sizeof(*new_block));
>> +
>> +    /* map the file passed as a parameter to be this part of memory */
>> +    new_block->host = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd,
>> offset);
>> +
>> +    if (new_block->host == MAP_FAILED)
>> +        exit(1);
>>
>
> Braces after if ()
>
>> +    if (kvm_enabled())
>> +        kvm_setup_guest_memory(new_block->host, size);
>> +
>>
>
> More braces.
>

This function is possibly made redundant by Marcelo's patch for qemu_ram_map

http://kerneltrap.org/mailarchive/linux-kvm/2010/4/26/6261299

qemu_ram_map isn't merged yet either, but I'm fine with either one.
Marcelo's requires the device to map the memory and then pass the
pointer to be added to the memory allocation, so it gives the device
full mapping control.  Alternatively, I could add the protection flag
to my function (I think that's all that is missing).

Let me know and I'll change my patch if necessary.
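
For concreteness, adding the protection flag would only touch the signature
and the mmap() call, roughly (sketch; it also folds in the braces requested
above, and the tail of the function is elided as in the quoted hunk):

   ram_addr_t qemu_ram_mmap(int fd, ram_addr_t size, int prot, int flags,
                            off_t offset)
   {
       RAMBlock *new_block;

       size = TARGET_PAGE_ALIGN(size);
       new_block = qemu_malloc(sizeof(*new_block));

       /* the only functional change: honour the caller's prot flags */
       new_block->host = mmap(0, size, prot, flags, fd, offset);
       if (new_block->host == MAP_FAILED) {
           exit(1);
       }

       if (kvm_enabled()) {
           kvm_setup_guest_memory(new_block->host, size);
       }
       /* ... remainder (adding new_block to the ram list) as in the patch */
   }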

> --
> error compiling committee.c: too many arguments to function
>
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 15:28               ` [Qemu-devel] " Avi Kivity
@ 2010-05-10 15:38                 ` Anthony Liguori
  -1 siblings, 0 replies; 102+ messages in thread
From: Anthony Liguori @ 2010-05-10 15:38 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Cam Macdonell, kvm, qemu-devel

On 05/10/2010 10:28 AM, Avi Kivity wrote:
> On 05/10/2010 06:22 PM, Cam Macdonell wrote:
>>
>>>
>>>> +
>>>> +    /* if the position is -1, then it's shared memory region fd */
>>>> +    if (incoming_posn == -1) {
>>>> +
>>>> +        s->num_eventfds = 0;
>>>> +
>>>> +        if (check_shm_size(s, incoming_fd) == -1) {
>>>> +            exit(-1);
>>>> +        }
>>>> +
>>>> +        /* creating a BAR in qemu_chr callback may be crazy */
>>>> +        create_shared_memory_BAR(s, incoming_fd);
>>>>
>>> It probably is... why can't you create it during initialization?
>> This is for the shared memory server implementation, so the fd for the
>> shared memory has to be received (over the qemu char device) from the
>> server before the BAR can be created via qemu_ram_mmap() which adds
>> the necessary memory
>>
>
>
> We could do the handshake during initialization.  I'm worried that the 
> device will appear without the BAR, and strange things will happen.  
> But the chardev API is probably not geared for passing data during init.
>
> Anthony, any ideas?

Why can't we create the BAR with just normal RAM and then change it to a 
mmap()'d fd after initialization?  This behavior would be 
important for live migration as it would let you quickly migrate 
preserving the memory contents without waiting for an external program 
to reconnect.

Regards,

Anthony Liguori

>> Otherwise, if the BAR is allocated during initialization, I would have
>> to use MAP_FIXED to mmap the memory.  This is what I did before the
>> qemu_ram_mmap() function was added.
>
> What would happen to any data written to the BAR before the 
> handshake completed?  I think it would disappear.

You don't have to do MAP_FIXED.  You can allocate a ram area and map 
that in when disconnected.  When you connect, you create another ram 
area and memcpy() the previous ram area to the new one.  You then map 
the second ram area in.
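
A rough sketch of that reconnect step (the switch_bar_backing() helper is a
placeholder, not an existing qemu API):

   static void ivshmem_reconnect(IVShmemState *s, int shm_fd,
                                 void *interim_area, size_t size)
   {
       void *shared = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                           shm_fd, 0);

       /* preserve whatever the guest wrote while disconnected */
       memcpy(shared, interim_area, size);

       /* then switch BAR 2's backing from interim_area to shared */
       switch_bar_backing(s, shared);   /* placeholder helper */
   }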

From the guest's perspective, it's totally transparent.  For the 
backend, I'd suggest having an explicit "initialized" ack or something 
so that it knows that the data is now mapped to the guest.

If you're doing just a ring queue in shared memory, it should allow 
disconnect/reconnect during live migration asynchronously to the actual 
qemu live migration.

Regards,

Anthony Liguori

> So it's a good idea to make the initialization process atomic.
>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 15:28               ` [Qemu-devel] " Avi Kivity
@ 2010-05-10 15:41                 ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-10 15:41 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, qemu-devel, Anthony Liguori

On Mon, May 10, 2010 at 9:28 AM, Avi Kivity <avi@redhat.com> wrote:
> On 05/10/2010 06:22 PM, Cam Macdonell wrote:
>>
>>>
>>>> +
>>>> +    /* if the position is -1, then it's shared memory region fd */
>>>> +    if (incoming_posn == -1) {
>>>> +
>>>> +        s->num_eventfds = 0;
>>>> +
>>>> +        if (check_shm_size(s, incoming_fd) == -1) {
>>>> +            exit(-1);
>>>> +        }
>>>> +
>>>> +        /* creating a BAR in qemu_chr callback may be crazy */
>>>> +        create_shared_memory_BAR(s, incoming_fd);
>>>>
>>>>
>>>
>>> It probably is... why can't you create it during initialization?
>>>
>>
>> This is for the shared memory server implementation, so the fd for the
>> shared memory has to be received (over the qemu char device) from the
>> server before the BAR can be created via qemu_ram_mmap() which adds
>> the necessary memory
>>
>>
>
>
> We could do the handshake during initialization.  I'm worried that the
> device will appear without the BAR, and strange things will happen.  But the
> chardev API is probably not geared for passing data during init.

More specifically, the challenge I've found is that there is no
function to tell a chardev to block and wait for the initialization
data.

>
> Anthony, any ideas?
>
>> Otherwise, if the BAR is allocated during initialization, I would have
>> to use MAP_FIXED to mmap the memory.  This is what I did before the
>> qemu_ram_mmap() function was added.
>>
>
> What would happen to any data written to the BAR before the handshake
> completed?  I think it would disappear.

But, the BAR isn't there until the handshake is completed.  Only after
receiving the shared memory fd does my device call pci_register_bar()
in the callback function.  So there may be a case with BAR2 (the
shared memory BAR) missing during initialization.  FWIW, I haven't
encountered this.

>
> So it's a good idea to make the initialization process atomic.
>
> --
> error compiling committee.c: too many arguments to function
>
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 15:38                 ` [Qemu-devel] " Anthony Liguori
@ 2010-05-10 16:20                   ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-10 16:20 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, kvm, qemu-devel

On Mon, May 10, 2010 at 9:38 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 05/10/2010 10:28 AM, Avi Kivity wrote:
>>
>> On 05/10/2010 06:22 PM, Cam Macdonell wrote:
>>>
>>>>
>>>>> +
>>>>> +    /* if the position is -1, then it's shared memory region fd */
>>>>> +    if (incoming_posn == -1) {
>>>>> +
>>>>> +        s->num_eventfds = 0;
>>>>> +
>>>>> +        if (check_shm_size(s, incoming_fd) == -1) {
>>>>> +            exit(-1);
>>>>> +        }
>>>>> +
>>>>> +        /* creating a BAR in qemu_chr callback may be crazy */
>>>>> +        create_shared_memory_BAR(s, incoming_fd);
>>>>>
>>>> It probably is... why can't you create it during initialization?
>>>
>>> This is for the shared memory server implementation, so the fd for the
>>> shared memory has to be received (over the qemu char device) from the
>>> server before the BAR can be created via qemu_ram_mmap() which adds
>>> the necessary memory
>>>
>>
>>
>> We could do the handshake during initialization.  I'm worried that the
>> device will appear without the BAR, and strange things will happen.  But the
>> chardev API is probably not geared for passing data during init.
>>
>> Anthony, any ideas?
>
> Why can't we create the BAR with just normal RAM and then change it to a
> mmap()'d fd after initialization?  This behavior would be important
> for live migration, as it would let you quickly migrate while preserving the
> memory contents without waiting for an external program to reconnect.
>
> Regards,
>
> Anthony Liguori
>
>>> Otherwise, if the BAR is allocated during initialization, I would have
>>> to use MAP_FIXED to mmap the memory.  This is what I did before the
>>> qemu_ram_mmap() function was added.
>>
>> What would happen to any data written to the BAR before the handshake
>> completed?  I think it would disappear.
>
> You don't have to do MAP_FIXED.  You can allocate a ram area and map that in
> when disconnected.  When you connect, you create another ram area and
> memcpy() the previous ram area to the new one.  You then map the second ram
> area in.

the memcpy() would overwrite the contents of the shared memory each
time a guest joins which would be dangerous.
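
For reference, a rough sketch of the scheme being proposed (hypothetical
helper, not code from this patch) makes the objection concrete: step 3 is
exactly where whatever peers have already written into the shared object
would be clobbered.

    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Hypothetical backend-side sketch of "swap the RAM area on connect".
     * 'disconnected_ram' backed the BAR while no server was attached;
     * 'shm_fd' is the fd later received from the shared memory server. */
    static void *connect_shared(void *disconnected_ram, size_t size, int shm_fd)
    {
        /* 1. Map the real shared memory object. */
        void *shared = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, shm_fd, 0);
        if (shared == MAP_FAILED) {
            return NULL;
        }
        /* 2. Preserve whatever the guest wrote while disconnected ... */
        /* 3. ... by copying it over the shared object -- this overwrites
         *    anything other peers may already have placed there. */
        memcpy(shared, disconnected_ram, size);
        /* 4. The guest-visible BAR would then be switched to 'shared'. */
        return shared;
    }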

>
> From the guest's perspective, it's totally transparent.  For the backend,
> I'd suggest having an explicit "initialized" ack or something so that it
> knows that the data is now mapped to the guest.

Yes, I think the ack is the way to go, so the guest has to be aware of
it.  Would setting a flag in the driver-specific config space be an
acceptable ack that the shared region is now mapped?
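
Purely as an illustration of that idea, the guest driver's attach path could
look something like the sketch below; the REG_ACK offset is made up here (the
posted patch only defines IntrMask, IntrStatus, IVPosition and Doorbell).

    #include <stdint.h>

    #define REG_ACK 16   /* hypothetical register offset in the BAR0 MMIO area */

    /* Guest-side sketch: tell the backend the shared region is now mapped. */
    static void ivshmem_ack_mapped(volatile uint32_t *mmio_regs)
    {
        mmio_regs[REG_ACK / 4] = 1;
    }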

Cam

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 15:41                 ` [Qemu-devel] " Cam Macdonell
@ 2010-05-10 16:40                   ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-10 16:40 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: kvm, qemu-devel, Anthony Liguori

On 05/10/2010 06:41 PM, Cam Macdonell wrote:
>
>> What would happen to any data written to the BAR before the handshake
>> completed?  I think it would disappear.
>>      
> But, the BAR isn't there until the handshake is completed.  Only after
> receiving the shared memory fd does my device call pci_register_bar()
> in the callback function.  So there may be a case with BAR2 (the
> shared memory BAR) missing during initialization.  FWIW, I haven't
> encountered this.
>    

Well, that violates PCI.  You can't have a PCI device with no BAR, then 
have a BAR appear.  It may work since the BAR is registered a lot faster 
than the BIOS is able to peek at it, but it's a race nevertheless.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 16:40                   ` [Qemu-devel] " Avi Kivity
@ 2010-05-10 16:48                     ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-10 16:48 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, qemu-devel, Anthony Liguori

On Mon, May 10, 2010 at 10:40 AM, Avi Kivity <avi@redhat.com> wrote:
> On 05/10/2010 06:41 PM, Cam Macdonell wrote:
>>
>>> What would happen to any data written to the BAR before the handshake
>>> completed?  I think it would disappear.
>>>
>>
>> But, the BAR isn't there until the handshake is completed.  Only after
>> receiving the shared memory fd does my device call pci_register_bar()
>> in the callback function.  So there may be a case with BAR2 (the
>> shared memory BAR) missing during initialization.  FWIW, I haven't
>> encountered this.
>>
>
> Well, that violates PCI.  You can't have a PCI device with no BAR, then have
> a BAR appear.  It may work since the BAR is registered a lot faster than the
> BIOS is able to peek at it, but it's a race nevertheless.

Agreed.  I'll get Anthony's idea up and running.  It seems that is the
way forward.

Cam

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 16:20                   ` [Qemu-devel] " Cam Macdonell
@ 2010-05-10 16:52                     ` Anthony Liguori
  -1 siblings, 0 replies; 102+ messages in thread
From: Anthony Liguori @ 2010-05-10 16:52 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: Avi Kivity, kvm, qemu-devel

On 05/10/2010 11:20 AM, Cam Macdonell wrote:
> On Mon, May 10, 2010 at 9:38 AM, Anthony Liguori<anthony@codemonkey.ws>  wrote:
>    
>> On 05/10/2010 10:28 AM, Avi Kivity wrote:
>>      
>>> On 05/10/2010 06:22 PM, Cam Macdonell wrote:
>>>        
>>>>          
>>>>>            
>>>>>> +
>>>>>> +    /* if the position is -1, then it's shared memory region fd */
>>>>>> +    if (incoming_posn == -1) {
>>>>>> +
>>>>>> +        s->num_eventfds = 0;
>>>>>> +
>>>>>> +        if (check_shm_size(s, incoming_fd) == -1) {
>>>>>> +            exit(-1);
>>>>>> +        }
>>>>>> +
>>>>>> +        /* creating a BAR in qemu_chr callback may be crazy */
>>>>>> +        create_shared_memory_BAR(s, incoming_fd);
>>>>>>
>>>>>>              
>>>>> It probably is... why can't you create it during initialization?
>>>>>            
>>>> This is for the shared memory server implementation, so the fd for the
>>>> shared memory has to be received (over the qemu char device) from the
>>>> server before the BAR can be created via qemu_ram_mmap() which adds
>>>> the necessary memory
>>>>
>>>>          
>>>
>>> We could do the handshake during initialization.  I'm worried that the
>>> device will appear without the BAR, and strange things will happen.  But the
>>> chardev API is probably not geared for passing data during init.
>>>
>>> Anthony, any ideas?
>>>        
>> Why can't we create the BAR with just normal RAM and then change it to a
>> mmap()'d fd after initialization?  This behavior would be important
>> for live migration, as it would let you quickly migrate while preserving the
>> memory contents without waiting for an external program to reconnect.
>>
>> Regards,
>>
>> Anthony Liguori
>>
>>      
>>>> Otherwise, if the BAR is allocated during initialization, I would have
>>>> to use MAP_FIXED to mmap the memory.  This is what I did before the
>>>> qemu_ram_mmap() function was added.
>>>>          
>>> What would happen to any data written to the BAR before the handshake
>>> completed?  I think it would disappear.
>>>        
>> You don't have to do MAP_FIXED.  You can allocate a ram area and map that in
>> when disconnected.  When you connect, you create another ram area and
>> memcpy() the previous ram area to the new one.  You then map the second ram
>> area in.
>>      
> the memcpy() would overwrite the contents of the shared memory each
> time a guest joins which would be dangerous.
>    

I think those are reasonable semantics, and it's really the only way to get 
guest-transparent reconnect.  The latter is pretty critical for 
guest-transparent live migration.

>>  From the guest's perspective, it's totally transparent.  For the backend,
>> I'd suggest having an explicit "initialized" ack or something so that it
>> knows that the data is now mapped to the guest.
>>      
> Yes, I think the ack is the way to go, so the guest has to be aware of
> it.  Would setting a flag in the driver-specific config space be an
> acceptable ack that the shared region is now mapped?
>    

You know it's mapped because it's mapped when the pci map function 
returns.  You don't need the guest to explicitly tell you.

Regards,

Anthony Liguori

> Cam
>    


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 15:38                 ` [Qemu-devel] " Anthony Liguori
@ 2010-05-10 16:59                   ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-10 16:59 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Cam Macdonell, kvm, qemu-devel

On 05/10/2010 06:38 PM, Anthony Liguori wrote:
>
>>> Otherwise, if the BAR is allocated during initialization, I would have
>>> to use MAP_FIXED to mmap the memory.  This is what I did before the
>>> qemu_ram_mmap() function was added.
>>
>> What would happen to any data written to the BAR before the 
>> handshake completed?  I think it would disappear.
>
> You don't have to do MAP_FIXED.  You can allocate a ram area and map 
> that in when disconnected.  When you connect, you create another ram 
> area and memcpy() the previous ram area to the new one.  You then map 
> the second ram area in.

But it's a shared memory area.  Other peers could have connected and 
written some data in.  The memcpy() would destroy their data.

>
> From the guest's perspective, it's totally transparent.  For the 
> backend, I'd suggest having an explicit "initialized" ack or something 
> so that it knows that the data is now mapped to the guest.

 From the peers' perspective, it's non-transparent :(

Also it doubles the transient memory requirement.

>
> If you're doing just a ring queue in shared memory, it should allow 
> disconnect/reconnect during live migration asynchronously to the 
> actual qemu live migration.
>

Live migration of guests using shared memory is interesting.  You'd need 
to freeze all peers on one node, disconnect, reconnect, and restart them 
on the other node.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 16:59                   ` [Qemu-devel] " Avi Kivity
@ 2010-05-10 17:25                     ` Anthony Liguori
  -1 siblings, 0 replies; 102+ messages in thread
From: Anthony Liguori @ 2010-05-10 17:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Cam Macdonell, kvm, qemu-devel

On 05/10/2010 11:59 AM, Avi Kivity wrote:
> On 05/10/2010 06:38 PM, Anthony Liguori wrote:
>>
>>>> Otherwise, if the BAR is allocated during initialization, I would have
>>>> to use MAP_FIXED to mmap the memory.  This is what I did before the
>>>> qemu_ram_mmap() function was added.
>>>
>>> What would happen to any data written to the BAR before the 
>>> handshake completed?  I think it would disappear.
>>
>> You don't have to do MAP_FIXED.  You can allocate a ram area and map 
>> that in when disconnected.  When you connect, you create another ram 
>> area and memcpy() the previous ram area to the new one.  You then map 
>> the second ram area in.
>
> But it's a shared memory area.  Other peers could have connected and 
> written some data in.  The memcpy() would destroy their data.

Why try to attempt to support multi-master shared memory?  What's the 
use-case?

Regards,

Anthony Liguori

>>
>> From the guest's perspective, it's totally transparent.  For the 
>> backend, I'd suggest having an explicit "initialized" ack or 
>> something so that it knows that the data is now mapped to the guest.
>
> From the peers' perspective, it's non-transparent :(
>
> Also it doubles the transient memory requirement.
>
>>
>> If you're doing just a ring queue in shared memory, it should allow 
>> disconnect/reconnect during live migration asynchronously to the 
>> actual qemu live migration.
>>
>
> Live migration of guests using shared memory is interesting.  You'd 
> need to freeze all peers on one node, disconnect, reconnect, and 
> restart them on the other node.
>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 17:25                     ` [Qemu-devel] " Anthony Liguori
@ 2010-05-10 17:43                       ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-10 17:43 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, kvm, qemu-devel

On Mon, May 10, 2010 at 11:25 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 05/10/2010 11:59 AM, Avi Kivity wrote:
>>
>> On 05/10/2010 06:38 PM, Anthony Liguori wrote:
>>>
>>>>> Otherwise, if the BAR is allocated during initialization, I would have
>>>>> to use MAP_FIXED to mmap the memory.  This is what I did before the
>>>>> qemu_ram_mmap() function was added.
>>>>
>>>> What would happen to any data written to the BAR before the
>>>> handshake completed?  I think it would disappear.
>>>
>>> You don't have to do MAP_FIXED.  You can allocate a ram area and map that
>>> in when disconnected.  When you connect, you create another ram area and
>>> memcpy() the previous ram area to the new one.  You then map the second ram
>>> area in.
>>
>> But it's a shared memory area.  Other peers could have connected and
>> written some data in.  The memcpy() would destroy their data.
>
> Why try to attempt to support multi-master shared memory?  What's the
> use-case?

I don't see it as multi-master, but rather that the latest guest to join
shouldn't have its contents take precedence.  In developing this
patch, my motivation has been to let the guests decide.  If the memcpy
is always done, even when no data has been written, a guest cannot join
without overwriting everything.

One use case we're looking at is running a map-reduce framework like
Hadoop or Phoenix across VMs.  However, if a workqueue is stored in
shared memory, or data transfers pass through it, the system can't
scale up the number of workers because each new guest would erase the
shared memory (and with it the workqueue or any in-progress data
transfer).

In cases where the latest guest to join wants to clear the memory, it
can do so without the automatic memcpy.  The guest can do a memset
once it knows the memory is attached.  My opinion is to leave it to
the guests and the application that is using the shared memory to
decide what to do on guest joins.
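
One way the applications themselves could implement that policy (a sketch,
assuming only a small header that all guests agree on; nothing here is part
of the device): the first guest stamps a magic value and initializes the
region, and later guests see the magic and leave the existing contents alone.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define SHM_MAGIC 0x49565348u          /* arbitrary example value */

    struct shm_header {
        uint32_t magic;
        uint32_t nr_workers;
        /* ... application data (workqueue, buffers, ...) follows ... */
    };

    /* Real code would serialize this check against other guests, e.g. with a
     * lock in the region or the doorbell protocol; omitted for brevity. */
    static void app_join(struct shm_header *hdr, size_t region_size)
    {
        if (hdr->magic != SHM_MAGIC) {
            memset(hdr, 0, region_size);   /* first guest: clear and set up */
            hdr->magic = SHM_MAGIC;
        }
        hdr->nr_workers++;                 /* joining guests keep the contents */
    }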

Cam

>
> Regards,
>
> Anthony Liguori
>
>>>
>>> From the guest's perspective, it's totally transparent.  For the backend,
>>> I'd suggest having an explicit "initialized" ack or something so that it
>>> knows that the data is now mapped to the guest.
>>
>> From the peers' perspective, it's non-transparent :(
>>
>> Also it doubles the transient memory requirement.
>>
>>>
>>> If you're doing just a ring queue in shared memory, it should allow
>>> disconnect/reconnect during live migration asynchronously to the actual qemu
>>> live migration.
>>>
>>
>> Live migration of guests using shared memory is interesting.  You'd need
>> to freeze all peers on one node, disconnect, reconnect, and restart them on
>> the other node.
>>
>
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 17:43                       ` [Qemu-devel] " Cam Macdonell
@ 2010-05-10 17:52                         ` Anthony Liguori
  -1 siblings, 0 replies; 102+ messages in thread
From: Anthony Liguori @ 2010-05-10 17:52 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: Avi Kivity, kvm, qemu-devel

On 05/10/2010 12:43 PM, Cam Macdonell wrote:
> On Mon, May 10, 2010 at 11:25 AM, Anthony Liguori<anthony@codemonkey.ws>  wrote:
>    
>> On 05/10/2010 11:59 AM, Avi Kivity wrote:
>>      
>>> On 05/10/2010 06:38 PM, Anthony Liguori wrote:
>>>        
>>>>          
>>>>>> Otherwise, if the BAR is allocated during initialization, I would have
>>>>>> to use MAP_FIXED to mmap the memory.  This is what I did before the
>>>>>> qemu_ram_mmap() function was added.
>>>>>>              
>>>>> What would happen to any data written to the BAR before the
>>>>> handshake completed?  I think it would disappear.
>>>>>            
>>>> You don't have to do MAP_FIXED.  You can allocate a ram area and map that
>>>> in when disconnected.  When you connect, you create another ram area and
>>>> memcpy() the previous ram area to the new one.  You then map the second ram
>>>> area in.
>>>>          
>>> But it's a shared memory area.  Other peers could have connected and
>>> written some data in.  The memcpy() would destroy their data.
>>>        
>> Why try to attempt to support multi-master shared memory?  What's the
>> use-case?
>>      
> I don't see it as multi-master, but that the latest guest to join
> shouldn't have their contents take precedence.  In developing this
> patch, my motivation has been to let the guests decide.  If the memcpy
> is always done, even when no data is written, a guest cannot join
> without overwriting everything.
>
> One use case we're looking at is having VMs using a map reduce
> framework like Hadoop or Phoenix running in VMs.  However, if a
> workqueue is stored or data transfer passes through shared memory, a
> system can't scale up the number of workers because each new guest
> will erase the shared memory (and the workqueue or in progress data
> transfer).
>    

(Replying again to list)

What data structure would you use?  For a lockless ring queue, you can 
only support a single producer and consumer.  To achieve bidirectional 
communication in virtio, we always use two queues.
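
To illustrate the constraint: a minimal single-producer/single-consumer ring
over a shared region (sketch using C11 atomics for brevity) needs no locks
precisely because only one side ever writes each index, which is why
bidirectional traffic takes two rings.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SLOTS 256u                 /* power of two */

    struct spsc_ring {
        _Atomic uint32_t head;              /* written only by the producer */
        _Atomic uint32_t tail;              /* written only by the consumer */
        uint64_t slots[RING_SLOTS];
    };

    static bool ring_push(struct spsc_ring *r, uint64_t v)
    {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SLOTS) {
            return false;                   /* full */
        }
        r->slots[head % RING_SLOTS] = v;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    static bool ring_pop(struct spsc_ring *r, uint64_t *v)
    {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

        if (head == tail) {
            return false;                   /* empty */
        }
        *v = r->slots[tail % RING_SLOTS];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }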

If you're adding additional queues to support other levels of 
communication, you can always use different areas of shared memory.

I guess this is the point behind the doorbell mechanism?

Regards,

Anthony Liguori

> In cases where the latest guest to join wants to clear the memory, it
> can do so without the automatic memcpy.  The guest can do a memset
> once it knows the memory is attached.  My opinion is to leave it to
> the guests and the application that is using the shared memory to
> decide what to do on guest joins.
>
> Cam
>
>    
>> Regards,
>>
>> Anthony Liguori
>>
>>      
>>>>  From the guest's perspective, it's totally transparent.  For the backend,
>>>> I'd suggest having an explicit "initialized" ack or something so that it
>>>> knows that the data is now mapped to the guest.
>>>>          
>>>  From the peers' perspective, it's non-transparent :(
>>>
>>> Also it doubles the transient memory requirement.
>>>
>>>        
>>>> If you're doing just a ring queue in shared memory, it should allow
>>>> disconnect/reconnect during live migration asynchronously to the actual qemu
>>>> live migration.
>>>>
>>>>          
>>> Live migration of guests using shared memory is interesting.  You'd need
>>> to freeze all peers on one node, disconnect, reconnect, and restart them on
>>> the other node.
>>>
>>>        
>>
>>      
>    

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 17:52                         ` [Qemu-devel] " Anthony Liguori
@ 2010-05-10 18:01                           ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-10 18:01 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, kvm, qemu-devel

On Mon, May 10, 2010 at 11:52 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 05/10/2010 12:43 PM, Cam Macdonell wrote:
>>
>> On Mon, May 10, 2010 at 11:25 AM, Anthony Liguori<anthony@codemonkey.ws>
>>  wrote:
>>
>>>
>>> On 05/10/2010 11:59 AM, Avi Kivity wrote:
>>>
>>>>
>>>> On 05/10/2010 06:38 PM, Anthony Liguori wrote:
>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>> Otherwise, if the BAR is allocated during initialization, I would
>>>>>>> have
>>>>>>> to use MAP_FIXED to mmap the memory.  This is what I did before the
>>>>>>> qemu_ram_mmap() function was added.
>>>>>>>
>>>>>>
>>>>>> What would happen to any data written to the BAR before the
>>>>>> handshake completed?  I think it would disappear.
>>>>>>
>>>>>
>>>>> You don't have to do MAP_FIXED.  You can allocate a ram area and map
>>>>> that
>>>>> in when disconnected.  When you connect, you create another ram area
>>>>> and
>>>>> memcpy() the previous ram area to the new one.  You then map the second
>>>>> ram
>>>>> area in.
>>>>>
>>>>
>>>> But it's a shared memory area.  Other peers could have connected and
>>>> written some data in.  The memcpy() would destroy their data.
>>>>
>>>
>>> Why try to attempt to support multi-master shared memory?  What's the
>>> use-case?
>>>
>>
>> I don't see it as multi-master, but that the latest guest to join
>> shouldn't have their contents take precedence.  In developing this
>> patch, my motivation has been to let the guests decide.  If the memcpy
>> is always done, even when no data is written, a guest cannot join
>> without overwriting everything.
>>
>> One use case we're looking at is having VMs using a map reduce
>> framework like Hadoop or Phoenix running in VMs.  However, if a
>> workqueue is stored or data transfer passes through shared memory, a
>> system can't scale up the number of workers because each new guest
>> will erase the shared memory (and the workqueue or in progress data
>> transfer).
>>
>
> (Replying again to list)

Sorry about that.

> What data structure would you use?  For a lockless ring queue, you can only
> support a single producer and consumer.  To achieve bidirectional
> communication in virtio, we always use two queues.

MCS locks can work with multiple producers and consumers, either with busy
waiting or by using the doorbell mechanism.
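
For reference, a sketch of an MCS queue lock (C11 atomics; note that the node
links here are pointers, so this is only valid within one address space -- a
cross-VM variant living in the shared region would have to use offsets
instead, and could sleep on a doorbell interrupt rather than spinning):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;
    };

    struct mcs_lock {
        _Atomic(struct mcs_node *) tail;
    };

    static void mcs_acquire(struct mcs_lock *l, struct mcs_node *me)
    {
        struct mcs_node *prev;

        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        prev = atomic_exchange(&l->tail, me);
        if (prev) {
            atomic_store(&prev->next, me);       /* queue behind predecessor */
            while (atomic_load(&me->locked)) {
                /* spin until the predecessor hands the lock over */
            }
        }
    }

    static void mcs_release(struct mcs_lock *l, struct mcs_node *me)
    {
        struct mcs_node *succ = atomic_load(&me->next);

        if (!succ) {
            struct mcs_node *expected = me;
            if (atomic_compare_exchange_strong(&l->tail, &expected, NULL)) {
                return;                          /* no waiters */
            }
            while (!(succ = atomic_load(&me->next))) {
                /* a successor is between its exchange and its link; wait */
            }
        }
        atomic_store(&succ->locked, false);      /* pass ownership */
    }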

>
> If you're adding additional queues to support other levels of communication,
> you can always use different areas of shared memory.

True, and my point is simply that the memcpy would wipe those all out.

>
> I guess this is the point behind the doorbell mechanism?

Yes.

>
> Regards,
>
> Anthony Liguori
>
>> In cases where the latest guest to join wants to clear the memory, it
>> can do so without the automatic memcpy.  The guest can do a memset
>> once it knows the memory is attached.  My opinion is to leave it to
>> the guests and the application that is using the shared memory to
>> decide what to do on guest joins.
>>
>> Cam
>>
>>
>>>
>>> Regards,
>>>
>>> Anthony Liguori
>>>
>>>
>>>>>
>>>>>  From the guest's perspective, it's totally transparent.  For the
>>>>> backend,
>>>>> I'd suggest having an explicit "initialized" ack or something so that
>>>>> it
>>>>> knows that the data is now mapped to the guest.
>>>>>
>>>>
>>>>  From the peers' perspective, it's non-transparent :(
>>>>
>>>> Also it doubles the transient memory requirement.
>>>>
>>>>
>>>>>
>>>>> If you're doing just a ring queue in shared memory, it should allow
>>>>> disconnect/reconnect during live migration asynchronously to the actual
>>>>> qemu
>>>>> live migration.
>>>>>
>>>>>
>>>>
>>>> Live migration of guests using shared memory is interesting.  You'd need
>>>> to freeze all peers on one node, disconnect, reconnect, and restart them
>>>> on
>>>> the other node.
>>>>
>>>>
>>>
>>>
>>
>>
>
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 11:59           ` [Qemu-devel] " Avi Kivity
@ 2010-05-10 23:17             ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-10 23:17 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, kvm

On Mon, May 10, 2010 at 5:59 AM, Avi Kivity <avi@redhat.com> wrote:
> On 04/21/2010 08:53 PM, Cam Macdonell wrote:
>>
>> Support an inter-vm shared memory device that maps a shared-memory object
>> as a
>> PCI device in the guest.  This patch also supports interrupts between
>> guests by
>> communicating over a unix domain socket.  This patch applies to the
>> qemu-kvm
>> repository.
>>
>>     -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>
>> Interrupts are supported between multiple VMs by using a shared memory
>> server
>> by using a chardev socket.
>>
>>     -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>                     [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
>>     -chardev socket,path=<path>,id=<id>
>>
>> (shared memory server is qemu.git/contrib/ivshmem-server)
>>
>> Sample programs and init scripts are in a git repo here:
>>
>>
>> +typedef struct EventfdEntry {
>> +    PCIDevice *pdev;
>> +    int vector;
>> +} EventfdEntry;
>> +
>> +typedef struct IVShmemState {
>> +    PCIDevice dev;
>> +    uint32_t intrmask;
>> +    uint32_t intrstatus;
>> +    uint32_t doorbell;
>> +
>> +    CharDriverState * chr;
>> +    CharDriverState ** eventfd_chr;
>> +    int ivshmem_mmio_io_addr;
>> +
>> +    pcibus_t mmio_addr;
>> +    unsigned long ivshmem_offset;
>> +    uint64_t ivshmem_size; /* size of shared memory region */
>> +    int shm_fd; /* shared memory file descriptor */
>> +
>> +    int nr_allocated_vms;
>> +    /* array of eventfds for each guest */
>> +    int ** eventfds;
>> +    /* keep track of # of eventfds for each guest*/
>> +    int * eventfds_posn_count;
>>
>
> More readable:
>
>  typedef struct Peer {
>      int nb_eventfds;
>      int *eventfds;
>  } Peer;
>  int nb_peers;
>  Peer *peers;
>
> Does eventfd_chr need to be there as well?

No, it does not.  eventfd_chr stores the character devices used for receiving
interrupts when irqfd is not available, so we only need them for this
guest, not for our peers.

I've switched over to the more readable naming you've suggested.

>
>> +
>> +    int nr_alloc_guests;
>> +    int vm_id;
>> +    int num_eventfds;
>> +    uint32_t vectors;
>> +    uint32_t features;
>> +    EventfdEntry *eventfd_table;
>> +
>> +    char * shmobj;
>> +    char * sizearg;
>>
>
> Does this need to be part of the state?

They are because they're passed in as qdev properties from the
command line, so I thought they needed to be in the state struct to be
assigned via DEFINE_PROP_...

>
>> +} IVShmemState;
>> +
>> +/* registers for the Inter-VM shared memory device */
>> +enum ivshmem_registers {
>> +    IntrMask = 0,
>> +    IntrStatus = 4,
>> +    IVPosition = 8,
>> +    Doorbell = 12,
>> +};
>> +
>> +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int
>> feature) {
>> +    return (ivs->features & (1 << feature));
>> +}
>> +
>> +static inline int is_power_of_two(int x) {
>> +    return (x & (x-1)) == 0;
>> +}
>>
>
> argument needs to be uint64_t to avoid overflow with large BARs.  Return
> type can be bool.
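
A version reflecting that suggestion might look like the following sketch
(note that, exactly like the original, it also accepts x == 0):

    #include <stdbool.h>
    #include <stdint.h>

    /* uint64_t argument so large BAR sizes don't overflow; bool return */
    static inline bool is_power_of_two(uint64_t x)
    {
        return (x & (x - 1)) == 0;
    }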
>
>> +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
>> +{
>> +    IVShmemState *s = opaque;
>> +
>> +    u_int64_t write_one = 1;
>> +    u_int16_t dest = val >> 16;
>> +    u_int16_t vector = val & 0xff;
>> +
>> +    addr &= 0xfe;
>>
>
> Why 0xfe?  Can understand 0xfc or 0xff.

I forgot to change it to 0xfc when the registers went from 16 to 32 bits.

>
>> +
>> +    switch (addr)
>> +    {
>> +        case IntrMask:
>> +            ivshmem_IntrMask_write(s, val);
>> +            break;
>> +
>> +        case IntrStatus:
>> +            ivshmem_IntrStatus_write(s, val);
>> +            break;
>> +
>> +        case Doorbell:
>> +            /* check doorbell range */
>> +            if ((vector >= 0) && (vector < s->eventfds_posn_count[dest]))
>> {
>>
>
> What if dest is too big?  We overflow s->eventfds_posn_count.

Added a check for that.
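
Roughly, the corrected doorbell path could then look like this sketch (using
the Peer naming suggested above; the struct stand-ins below are illustrative,
not the actual qemu device state):

    #include <stdint.h>
    #include <unistd.h>

    struct peer { int nb_eventfds; int *eventfds; };
    struct ivshmem { int nb_peers; struct peer *peers; };

    static void doorbell_write(struct ivshmem *s, uint32_t val)
    {
        uint16_t dest   = val >> 16;     /* high half selects the peer */
        uint16_t vector = val & 0xff;    /* low byte selects its vector */
        uint64_t one    = 1;             /* only the value 1 is ever written */

        if (dest >= s->nb_peers || vector >= s->peers[dest].nb_eventfds) {
            return;                      /* out-of-range doorbell: ignore it */
        }
        (void)write(s->peers[dest].eventfds[vector], &one, sizeof(one));
    }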

Thanks,
Cam

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 17:25                     ` [Qemu-devel] " Anthony Liguori
@ 2010-05-11  7:55                       ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-11  7:55 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Cam Macdonell, kvm, qemu-devel

On 05/10/2010 08:25 PM, Anthony Liguori wrote:
> On 05/10/2010 11:59 AM, Avi Kivity wrote:
>> On 05/10/2010 06:38 PM, Anthony Liguori wrote:
>>>
>>>>> Otherwise, if the BAR is allocated during initialization, I would 
>>>>> have
>>>>> to use MAP_FIXED to mmap the memory.  This is what I did before the
>>>>> qemu_ram_mmap() function was added.
>>>>
>>>> What would happen to any data written to the BAR before the the 
>>>> handshake completed?  I think it would disappear.
>>>
>>> You don't have to do MAP_FIXED.  You can allocate a ram area and map 
>>> that in when disconnected.  When you connect, you create another ram 
>>> area and memcpy() the previous ram area to the new one.  You then 
>>> map the second ram area in.
>>
>> But it's a shared memory area.  Other peers could have connected and 
>> written some data in.  The memcpy() would destroy their data.
>
> Why try to attempt to support multi-master shared memory?  What's the 
> use-case?

(presuming you mean multiple writers?)

This is a surprising take.  What's the use of a single master shared 
memory area?

Most uses of shared memory among processes or threads are multi-master.  
One use case can be a shared cache among the various guests.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 17:52                         ` [Qemu-devel] " Anthony Liguori
@ 2010-05-11  7:59                           ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-11  7:59 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Cam Macdonell, kvm, qemu-devel

On 05/10/2010 08:52 PM, Anthony Liguori wrote:
>>> Why try to attempt to support multi-master shared memory?  What's the
>>> use-case?
>> I don't see it as multi-master, but that the latest guest to join
>> shouldn't have their contents take precedence.  In developing this
>> patch, my motivation has been to let the guests decide.  If the memcpy
>> is always done, even when no data is written, a guest cannot join
>> without overwriting everything.
>>
>> One use case we're looking at is having VMs using a map reduce
>> framework like Hadoop or Phoenix running in VMs.  However, if a
>> workqueue is stored or data transfer passes through shared memory, a
>> system can't scale up the number of workers because each new guest
>> will erase the shared memory (and the workqueue or in progress data
>> transfer).
>
> (Replying again to list)
>
> What data structure would you use?  For a lockless ring queue, you can 
> only support a single producer and consumer.  To achieve bidirectional 
> communication in virtio, we always use two queues.

You don't have to use a lockless ring queue.  You can use locks 
(spinlocks without interrupt support, full mutexes with interrupts) and 
any data structure you like.  Say a hash table + LRU for a shared cache.
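
(To make that concrete -- a minimal guest-side spinlock over the shared
region using gcc's atomic builtins.  The choice of offset and layout is
entirely up to the applications; nothing here is ivshmem-specific:)

    /* 'lock' points into the mmap()ed ivshmem BAR, e.g. a word at offset 0 */
    static void shm_lock(volatile int *lock)
    {
        while (__sync_lock_test_and_set(lock, 1)) {
            while (*lock) {
                ;   /* spin; a pause/backoff could go here */
            }
        }
    }

    static void shm_unlock(volatile int *lock)
    {
        __sync_lock_release(lock);   /* stores 0 with release semantics */
    }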

>
> If you're adding additional queues to support other levels of 
> communication, you can always use different areas of shared memory.

You'll need O(n^2) shared memory areas (n=peer count, one area per pair 
of peers), and it is a lot less flexible than real shared memory.  
Consider using threading where the only communication among threads is 
a pipe (erlang?).


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 23:17             ` [Qemu-devel] " Cam Macdonell
@ 2010-05-11  8:03               ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-11  8:03 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: kvm, qemu-devel

On 05/11/2010 02:17 AM, Cam Macdonell wrote:
> On Mon, May 10, 2010 at 5:59 AM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> On 04/21/2010 08:53 PM, Cam Macdonell wrote:
>>      
>>> Support an inter-vm shared memory device that maps a shared-memory object
>>> as a
>>> PCI device in the guest.  This patch also supports interrupts between
>>> guest by
>>> communicating over a unix domain socket.  This patch applies to the
>>> qemu-kvm
>>> repository.
>>>
>>>      -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>>
>>> Interrupts are supported between multiple VMs by using a shared memory
>>> server
>>> by using a chardev socket.
>>>
>>>      -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>>                      [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
>>>      -chardev socket,path=<path>,id=<id>
>>>
>>> (shared memory server is qemu.git/contrib/ivshmem-server)
>>>
>>> Sample programs and init scripts are in a git repo here:
>>>
>>>
>>> +typedef struct EventfdEntry {
>>> +    PCIDevice *pdev;
>>> +    int vector;
>>> +} EventfdEntry;
>>> +
>>> +typedef struct IVShmemState {
>>> +    PCIDevice dev;
>>> +    uint32_t intrmask;
>>> +    uint32_t intrstatus;
>>> +    uint32_t doorbell;
>>> +
>>> +    CharDriverState * chr;
>>> +    CharDriverState ** eventfd_chr;
>>> +    int ivshmem_mmio_io_addr;
>>> +
>>> +    pcibus_t mmio_addr;
>>> +    unsigned long ivshmem_offset;
>>> +    uint64_t ivshmem_size; /* size of shared memory region */
>>> +    int shm_fd; /* shared memory file descriptor */
>>> +
>>> +    int nr_allocated_vms;
>>> +    /* array of eventfds for each guest */
>>> +    int ** eventfds;
>>> +    /* keep track of # of eventfds for each guest*/
>>> +    int * eventfds_posn_count;
>>>
>>>        
>> More readable:
>>
>>   typedef struct Peer {
>>       int nb_eventfds;
>>       int *eventfds;
>>   } Peer;
>>   int nb_peers;
>>   Peer *peers;
>>
>> Does eventfd_chr need to be there as well?
>>      
> No, it does not.  eventfd_chr stores the character devices used for
> receiving interrupts when irqfd is not available, so we only need them
> for this guest, not for our peers.
>    

Ok.

>> Does this need to be part of the state?
>>      
> They are because they're passed in as qdev properties from the
> command-line so I thought they needed to be in the state struct to be
> assigned via DEFINE_PROP_...
>    

Well I'm not q-ualified to comment on qdev, so I'm fine either way with 
this.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-11  7:59                           ` [Qemu-devel] " Avi Kivity
@ 2010-05-11 13:10                             ` Anthony Liguori
  -1 siblings, 0 replies; 102+ messages in thread
From: Anthony Liguori @ 2010-05-11 13:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Cam Macdonell, kvm, qemu-devel

On 05/11/2010 02:59 AM, Avi Kivity wrote:
>> (Replying again to list)
>>
>> What data structure would you use?  For a lockless ring queue, you 
>> can only support a single producer and consumer.  To achieve 
>> bidirectional communication in virtio, we always use two queues.
>
>
> You don't have to use a lockless ring queue.  You can use locks 
> (spinlocks without interrupt support, full mutexes with interrupts) 
> and any data structure you like.  Say a hash table + LRU for a shared 
> cache.

Yeah, the mailslot enables this.

I think the question boils down to whether we can support transparent 
peer connections and disconnections.  I think that's important in order 
to support transparent live migration.

If you have two peers that are disconnected and then connect to each 
other, there's simply no way to choose whose content gets preserved.  
It's necessary to designate one peer as a master in order to break the tie.

So this could simply involve an additional option to the shared memory 
driver: role=master|peer.  If role=master, when a new shared memory 
segment is mapped, the contents of the BAR ram is memcpy()'d to the 
shared memory segment.  In either case, the contents of the shared 
memory segment should be memcpy()'d to the BAR ram whenever the shared 
memory segment is disconnected.

I believe role=master should be the default because I think a relationship 
of master/slave is going to be much more common than peering.

>>
>> If you're adding additional queues to support other levels of 
>> communication, you can always use different areas of shared memory.
>
> You'll need O(n^2) shared memory areas (n=peer count), and it is a lot 
> less flexible that real shared memory.  Consider using threading where 
> the only communication among threads is a pipe (erlang?)

I can't think of a use of multiple peers via shared memory today with 
virtualization.  I know lots of master/slave uses of shared memory 
though.  I agree that it's useful to support from an academic 
perspective but I don't believe it's going to be the common use.

Regards,

Anthony Liguori



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-11 13:10                             ` [Qemu-devel] " Anthony Liguori
@ 2010-05-11 14:03                               ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-11 14:03 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Cam Macdonell, kvm, qemu-devel

On 05/11/2010 04:10 PM, Anthony Liguori wrote:
> On 05/11/2010 02:59 AM, Avi Kivity wrote:
>>> (Replying again to list)
>>>
>>> What data structure would you use?  For a lockless ring queue, you 
>>> can only support a single producer and consumer.  To achieve 
>>> bidirectional communication in virtio, we always use two queues.
>>
>>
>> You don't have to use a lockless ring queue.  You can use locks 
>> (spinlocks without interrupt support, full mutexes with interrupts) 
>> and any data structure you like.  Say a hash table + LRU for a shared 
>> cache.
>
> Yeah, the mailslot enables this.
>
> I think the question boils down to whether we can support transparent 
> peer connections and disconnections.  I think that's important in 
> order to support transparent live migration.
>
> If you have two peers that are disconnected and then connect to each 
> other, there's simply no way to choose who's content gets preserved.  
> It's necessary to designate one peer as a master in order to break the 
> tie.

The master is the shared memory area.  It's a completely separate entity 
that is represented by the backing file (or shared memory server handing 
out the fd to mmap).  It can exist independently of any guest.

>
> So this could simply involve an additional option to the shared memory 
> driver: role=master|peer.  If role=master, when a new shared memory 
> segment is mapped, the contents of the BAR ram is memcpy()'d to the 
> shared memory segment.  In either case, the contents of the shared 
> memory segment should be memcpy()'d to the BAR ram whenever the shared 
> memory segment is disconnected.

I don't understand why we need separate BAR ram and shared memory.  Have 
just shared memory, exposed by the BAR when connected.  When the PCI 
card is disconnected from shared memory, the BAR should discard writes 
and return all 1s for reads.

Having a temporary RAM area while disconnected doesn't serve a purpose 
(since it exists only for a short while) and increases the RAM load.
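
(In other words, something like the following in the BAR access path --
a sketch only, assuming hypothetical 'connected' and 'shm_ptr' (a
uint8_t * into the mapped segment) fields in the device state, and that
the BAR is trapped while disconnected:)

    static uint32_t ivshmem_bar_readl(void *opaque, target_phys_addr_t addr)
    {
        IVShmemState *s = opaque;

        if (!s->connected) {
            return 0xffffffff;            /* disconnected: reads return all 1s */
        }
        return *(uint32_t *)(s->shm_ptr + addr);
    }

    static void ivshmem_bar_writel(void *opaque, target_phys_addr_t addr,
                                   uint32_t val)
    {
        IVShmemState *s = opaque;

        if (!s->connected) {
            return;                       /* disconnected: writes are discarded */
        }
        *(uint32_t *)(s->shm_ptr + addr) = val;
    }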

> I believe role=master should be default because I think a relationship 
> of master/slave is going to be much more common than peering.

What if you have N guests?  What if the master disconnects?

>
>>>
>>> If you're adding additional queues to support other levels of 
>>> communication, you can always use different areas of shared memory.
>>
>> You'll need O(n^2) shared memory areas (n=peer count), and it is a 
>> lot less flexible that real shared memory.  Consider using threading 
>> where the only communication among threads is a pipe (erlang?)
>
> I can't think of a use of multiple peers via shared memory today with 
> virtualization.  I know lots of master/slave uses of shared memory 
> though.  I agree that it's useful to support from an academic 
> perspective but I don't believe it's going to be the common use.

Large shared cache.  That use case even survives live migration if you 
use lockless algorithms.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-11 14:03                               ` [Qemu-devel] " Avi Kivity
@ 2010-05-11 14:17                                 ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-11 14:17 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, kvm, qemu-devel

On Tue, May 11, 2010 at 8:03 AM, Avi Kivity <avi@redhat.com> wrote:
> On 05/11/2010 04:10 PM, Anthony Liguori wrote:
>>
>> On 05/11/2010 02:59 AM, Avi Kivity wrote:
>>>>
>>>> (Replying again to list)
>>>>
>>>> What data structure would you use?  For a lockless ring queue, you can
>>>> only support a single producer and consumer.  To achieve bidirectional
>>>> communication in virtio, we always use two queues.
>>>
>>>
>>> You don't have to use a lockless ring queue.  You can use locks
>>> (spinlocks without interrupt support, full mutexes with interrupts) and any
>>> data structure you like.  Say a hash table + LRU for a shared cache.
>>
>> Yeah, the mailslot enables this.
>>
>> I think the question boils down to whether we can support transparent peer
>> connections and disconnections.  I think that's important in order to
>> support transparent live migration.
>>
>> If you have two peers that are disconnected and then connect to each
>> other, there's simply no way to choose who's content gets preserved.  It's
>> necessary to designate one peer as a master in order to break the tie.
>
> The master is the shared memory area.  It's a completely separate entity
> that is represented by the backing file (or shared memory server handing out
> the fd to mmap).  It can exists independently of any guest.

I think the master/peer idea would be necessary if we were sharing
guest memory (sharing guest A's memory with guest B).  Then if the
master (guest A) dies, perhaps something needs to happen to preserve
the memory contents.  But since we're sharing host memory, the
applications in the guests can race to determine the master by
grabbing a lock at offset 0 or by using lowest VM ID.

Looking at it another way, it is the applications using shared memory
that may or may not need a master, the Qemu processes don't need the
concept of a master since the memory belongs to the host.
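
(For example, with the guests racing on a word at offset 0 of the shared
region -- a sketch using gcc atomics, nothing ivshmem-specific; the
convention of storing vm_id + 1 is just illustrative:)

    /* base points at the mmap()ed shared memory; word 0 is reserved for
     * the election.  0 means "no master yet"; vm_id + 1 is stored so that
     * VM 0 can also win. */
    static int try_become_master(volatile uint32_t *base, int vm_id)
    {
        uint32_t prev = __sync_val_compare_and_swap(&base[0], 0, vm_id + 1);

        /* we are master if the slot was free or already holds our id */
        return prev == 0 || prev == (uint32_t)(vm_id + 1);
    }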

>
>>
>> So this could simply involve an additional option to the shared memory
>> driver: role=master|peer.  If role=master, when a new shared memory segment
>> is mapped, the contents of the BAR ram is memcpy()'d to the shared memory
>> segment.  In either case, the contents of the shared memory segment should
>> be memcpy()'d to the BAR ram whenever the shared memory segment is
>> disconnected.
>
> I don't understand why we need separate BAR ram and shared memory.  Have
> just shared memory, exposed by the BAR when connected.  When the PCI card is
> disconnected from shared memory, the BAR should discard writes and return
> all 1s for reads.
>
> Having a temporary RAM area while disconnected doesn't serve a purpose
> (since it exists only for a short while) and increases the RAM load.

I agree with Avi here.  If a guest wants to view shared memory, then
it needs to stay connected.  I think being able to see the contents
(via the memcpy()) even though the guest is disconnected would be
confusing.

>
>> I believe role=master should be default because I think a relationship of
>> master/slave is going to be much more common than peering.
>
> What if you have N guests?  What if the master disconnects?
>
>>
>>>>
>>>> If you're adding additional queues to support other levels of
>>>> communication, you can always use different areas of shared memory.
>>>
>>> You'll need O(n^2) shared memory areas (n=peer count), and it is a lot
>>> less flexible that real shared memory.  Consider using threading where the
>>> only communication among threads is a pipe (erlang?)
>>
>> I can't think of a use of multiple peers via shared memory today with
>> virtualization.  I know lots of master/slave uses of shared memory though.
>>  I agree that it's useful to support from an academic perspective but I
>> don't believe it's going to be the common use.
>
> Large shared cache.  That use case even survives live migration if you use
> lockless algorithms.
>
> --
> error compiling committee.c: too many arguments to function
>
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-11 14:17                                 ` [Qemu-devel] " Cam Macdonell
@ 2010-05-11 14:53                                   ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-11 14:53 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: Anthony Liguori, kvm, qemu-devel

On 05/11/2010 05:17 PM, Cam Macdonell wrote:
>
>> The master is the shared memory area.  It's a completely separate entity
>> that is represented by the backing file (or shared memory server handing out
>> the fd to mmap).  It can exists independently of any guest.
>>      
> I think the master/peer idea would be necessary if we were sharing
> guest memory (sharing guest A's memory with guest B).  Then if the
> master (guest A) dies, perhaps something needs to happen to preserve
> the memory contents.

Definitely.  But we aren't...

>    But since we're sharing host memory, the
> applications in the guests can race to determine the master by
> grabbing a lock at offset 0 or by using lowest VM ID.
>
> Looking at it another way, it is the applications using shared memory
> that may or may not need a master, the Qemu processes don't need the
> concept of a master since the memory belongs to the host.
>    

Exactly.  Furthermore, even in a master/slave relationship, there will 
be different masters for different sub-areas; it would be a pity to 
expose all this in the hardware abstraction.  This way we have an 
external device, and PCI HBAs which connect to it - just like a 
multi-tailed SCSI disk.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-11 14:53                                   ` [Qemu-devel] " Avi Kivity
@ 2010-05-11 15:51                                     ` Anthony Liguori
  -1 siblings, 0 replies; 102+ messages in thread
From: Anthony Liguori @ 2010-05-11 15:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Cam Macdonell, kvm, qemu-devel

On 05/11/2010 09:53 AM, Avi Kivity wrote:
> On 05/11/2010 05:17 PM, Cam Macdonell wrote:
>>
>>> The master is the shared memory area.  It's a completely separate 
>>> entity
>>> that is represented by the backing file (or shared memory server 
>>> handing out
>>> the fd to mmap).  It can exists independently of any guest.
>> I think the master/peer idea would be necessary if we were sharing
>> guest memory (sharing guest A's memory with guest B).  Then if the
>> master (guest A) dies, perhaps something needs to happen to preserve
>> the memory contents.
>
> Definitely.  But we aren't...

Then transparent live migration is impossible.  IMHO, that's a 
fundamental mistake that we will regret down the road.

>>    But since we're sharing host memory, the
>> applications in the guests can race to determine the master by
>> grabbing a lock at offset 0 or by using lowest VM ID.
>>
>> Looking at it another way, it is the applications using shared memory
>> that may or may not need a master, the Qemu processes don't need the
>> concept of a master since the memory belongs to the host.
>
> Exactly.  Furthermore, even in a master/slave relationship, there will 
> be different masters for different sub-areas, it would be a pity to 
> expose all this in the hardware abstraction.  This way we have an 
> external device, and PCI HBAs which connect to it - just like a 
> multi-tailed SCSI disk.

To support transparent live migration, it's necessary to do two things:

1) Preserve the memory contents of the PCI BAR after disconnected from a 
shared memory segment
2) Synchronize any changes made to the PCI BAR with the shared memory 
segment upon reconnect/initial connection.

N.B. savevm/loadvm both constitute disconnect and reconnect events 
respectively.

Supporting (1) is easy since we just need to memcpy() the contents of 
the shared memory segment to a temporary RAM area upon disconnect.

Supporting (2) is easy when the shared memory segment is viewed as owned 
by the guest since it has the definitive copy of the data.  IMHO, this 
is what role=master means.  However, if we want to support a model where 
the guest does not have a definitive copy of the data, then upon 
reconnect we need to throw away the guest's changes and make the shared 
memory segment appear, from the guest's point of view, to update all at 
once.  This is what role=peer means.

For role=peer, it's necessary to signal to the guest when it's not 
connected.  This means prior to savevm it's necessary to indicate to the 
guest that it's been disconnected.
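
(A sketch of the connect/disconnect handling being described -- the
helper names and the 'bar_copy', 'shm_ptr', 'connected' and 'role_master'
fields are hypothetical, and a real patch would differ in the details:)

    /* on disconnect (including before savevm): snapshot the segment */
    static void ivshmem_disconnect(IVShmemState *s)
    {
        memcpy(s->bar_copy, s->shm_ptr, s->ivshmem_size);
        s->connected = 0;
    }

    /* on (re)connect (including after loadvm) */
    static void ivshmem_connect(IVShmemState *s, void *new_shm)
    {
        s->shm_ptr = new_shm;
        if (s->role_master) {
            /* role=master: the guest's copy is authoritative */
            memcpy(s->shm_ptr, s->bar_copy, s->ivshmem_size);
        }
        /* role=peer: the guest just sees whatever the segment holds now */
        s->connected = 1;
    }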

I think it's important that we build this mechanism in from the start 
because as I've stated in the past, I don't think role=peer is going to 
be the dominant use-case.  I actually don't think that shared memory 
between guests is all that interesting compared to shared memory to an 
external process on the host.

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-11 15:51                                     ` [Qemu-devel] " Anthony Liguori
@ 2010-05-11 16:39                                       ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-11 16:39 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, kvm, qemu-devel

On Tue, May 11, 2010 at 9:51 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 05/11/2010 09:53 AM, Avi Kivity wrote:
>>
>> On 05/11/2010 05:17 PM, Cam Macdonell wrote:
>>>
>>>> The master is the shared memory area.  It's a completely separate entity
>>>> that is represented by the backing file (or shared memory server handing
>>>> out
>>>> the fd to mmap).  It can exists independently of any guest.
>>>
>>> I think the master/peer idea would be necessary if we were sharing
>>> guest memory (sharing guest A's memory with guest B).  Then if the
>>> master (guest A) dies, perhaps something needs to happen to preserve
>>> the memory contents.
>>
>> Definitely.  But we aren't...
>
> Then transparent live migration is impossible.  IMHO, that's a fundamental
> mistake that we will regret down the road.
>
>>>   But since we're sharing host memory, the
>>> applications in the guests can race to determine the master by
>>> grabbing a lock at offset 0 or by using lowest VM ID.
>>>
>>> Looking at it another way, it is the applications using shared memory
>>> that may or may not need a master, the Qemu processes don't need the
>>> concept of a master since the memory belongs to the host.
>>
>> Exactly.  Furthermore, even in a master/slave relationship, there will be
>> different masters for different sub-areas, it would be a pity to expose all
>> this in the hardware abstraction.  This way we have an external device, and
>> PCI HBAs which connect to it - just like a multi-tailed SCSI disk.
>
> To support transparent live migration, it's necessary to do two things:
>
> 1) Preserve the memory contents of the PCI BAR after disconnected from a
> shared memory segment
> 2) Synchronize any changes made to the PCI BAR with the shared memory
> segment upon reconnect/initial connection.
>
> N.B. savevm/loadvm both constitute disconnect and reconnect events
> respectively.
>
> Supporting (1) is easy since we just need to memcpy() the contents of the
> shared memory segment to a temporary RAM area upon disconnect.
>
> Supporting (2) is easy when the shared memory segment is viewed as owned by
> the guest since it has the definitive copy of the data.  IMHO, this is what
> role=master means.  However, if we want to support a model where the guest
> does not have a definitive copy of the data, upon reconnect, we need to
> throw away the guest's changes and make the shared memory segment appear to
> simultaneously update to the guest.  This is what role=peer means.
>
> For role=peer, it's necessary to signal to the guest when it's not
> connected.  This means prior to savevm it's necessary to indicate to the
> guest that it's been disconnected.
>
> I think it's important that we build this mechanism in from the start
> because as I've stated in the past, I don't think role=peer is going to be
> the dominant use-case.  I actually don't think that shared memory between
> guests is all that interesting compared to shared memory to an external
> process on the host.
>

Most of the people I hear from who are using my patch are using a peer
model to share data between applications (simulations, JVMs, etc).
But guest-to-host applications work as well of course.

I think "transparent migration" can be achieved by making the
connected/disconnected state transparent to the application.

When using the shared memory server, the server has to be set up anyway
on the new host and copying the memory region could be part of that as
well if the application needs the contents preserved.  I don't think
it has to be handled by the savevm/loadvm operations.  There's little
difference between naming one VM the master or letting the shared
memory server act like a master.

I think abstractions on top of shared memory could handle
disconnection issues (sort of how TCP handles them for networks) if
the application needs it.  Again, my opinion is to leave it to the
application to decide what is necessary.

Cam

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-11 16:39                                       ` [Qemu-devel] " Cam Macdonell
@ 2010-05-11 17:05                                         ` Anthony Liguori
  -1 siblings, 0 replies; 102+ messages in thread
From: Anthony Liguori @ 2010-05-11 17:05 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: Avi Kivity, kvm, qemu-devel

On 05/11/2010 11:39 AM, Cam Macdonell wrote:
>
> Most of the people I hear from who are using my patch are using a peer
> model to share data between applications (simulations, JVMs, etc).
> But guest-to-host applications work as well of course.
>
> I think "transparent migration" can be achieved by making the
> connected/disconnected state transparent to the application.
>
> When using the shared memory server, the server has to be setup anyway
> on the new host and copying the memory region could be part of that as
> well if the application needs the contents preserved.  I don't think
> it has to be handled by the savevm/loadvm operations.  There's little
> difference between naming one VM the master or letting the shared
> memory server act like a master.
>    

Except that to make it work with the shared memory server, you need the 
server to participate in the live migration protocol, which is something 
I'd prefer to avoid as it introduces additional downtime.

Regards,

Anthony Liguori

> I think abstractions on top of shared memory could handle
> disconnection issues (sort of how TCP handles them for networks) if
> the application needs it.  Again, my opinion is to leave it to the
> application to decide what it necessary.
>
> Cam
>    


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-11 17:05                                         ` [Qemu-devel] " Anthony Liguori
@ 2010-05-11 17:50                                           ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-11 17:50 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, kvm, qemu-devel

On Tue, May 11, 2010 at 11:05 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 05/11/2010 11:39 AM, Cam Macdonell wrote:
>>
>> Most of the people I hear from who are using my patch are using a peer
>> model to share data between applications (simulations, JVMs, etc).
>> But guest-to-host applications work as well of course.
>>
>> I think "transparent migration" can be achieved by making the
>> connected/disconnected state transparent to the application.
>>
>> When using the shared memory server, the server has to be setup anyway
>> on the new host and copying the memory region could be part of that as
>> well if the application needs the contents preserved.  I don't think
>> it has to be handled by the savevm/loadvm operations.  There's little
>> difference between naming one VM the master or letting the shared
>> memory server act like a master.
>>
>
> Except that to make it work with the shared memory server, you need the
> server to participate in the live migration protocol which is something I'd
> prefer to avoid at it introduces additional down time.

Fair enough.  Then, to move to a resolution on this, can we either

not support migration at this point, which leaves us free to add it
later as migration use cases become better understood (my preference).

OR

1 - not support migration when the server is used
2 - if role=master is specified in the non-server case, then that
guest will copy the memory with it.  If role=peer is specified, the
guest will use the shared memory object on the destination host as is
(possibly creating it or output an error if memory object doesn't
exist).

Cam

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-11 15:51                                     ` [Qemu-devel] " Anthony Liguori
@ 2010-05-11 18:09                                       ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-11 18:09 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Cam Macdonell, kvm, qemu-devel

On 05/11/2010 06:51 PM, Anthony Liguori wrote:
> On 05/11/2010 09:53 AM, Avi Kivity wrote:
>> On 05/11/2010 05:17 PM, Cam Macdonell wrote:
>>>
>>>> The master is the shared memory area.  It's a completely separate 
>>>> entity
>>>> that is represented by the backing file (or shared memory server 
>>>> handing out
>>>> the fd to mmap).  It can exists independently of any guest.
>>> I think the master/peer idea would be necessary if we were sharing
>>> guest memory (sharing guest A's memory with guest B).  Then if the
>>> master (guest A) dies, perhaps something needs to happen to preserve
>>> the memory contents.
>>
>> Definitely.  But we aren't...
>
> Then transparent live migration is impossible.  IMHO, that's a 
> fundamental mistake that we will regret down the road.

I don't see why the two cases are any different.  In all cases, all 
guests have to be migrated simultaneously, or we have to support 
distributed shared memory (likely at the kernel level).  Who owns the 
memory makes no difference.

There are two non-transparent variants:
- forcibly disconnect the migrating guest, and migrate it later
   - puts all the burden on the guest application
- ask the guest to detach from the memory device
   - host is at the mercy of the guest

Since the consumers of shared memory are in academia, they'll probably 
implement DSM.

>
>>>    But since we're sharing host memory, the
>>> applications in the guests can race to determine the master by
>>> grabbing a lock at offset 0 or by using lowest VM ID.
>>>
>>> Looking at it another way, it is the applications using shared memory
>>> that may or may not need a master, the Qemu processes don't need the
>>> concept of a master since the memory belongs to the host.
>>
>> Exactly.  Furthermore, even in a master/slave relationship, there 
>> will be different masters for different sub-areas, it would be a pity 
>> to expose all this in the hardware abstraction.  This way we have an 
>> external device, and PCI HBAs which connect to it - just like a 
>> multi-tailed SCSI disk.
>
> To support transparent live migration, it's necessary to do two things:
>
> 1) Preserve the memory contents of the PCI BAR after disconnected from 
> a shared memory segment
> 2) Synchronize any changes made to the PCI BAR with the shared memory 
> segment upon reconnect/initial connection.

Disconnect/reconnect mean it's no longer transparent.

>
> N.B. savevm/loadvm both constitute disconnect and reconnect events 
> respectively.
>
> Supporting (1) is easy since we just need to memcpy() the contents of 
> the shared memory segment to a temporary RAM area upon disconnect.
>
> Supporting (2) is easy when the shared memory segment is viewed as 
> owned by the guest since it has the definitive copy of the data.  
> IMHO, this is what role=master means. 

There is no 'the guest'; if the memory is to be shared, there will be 
multiple guests (or multiple entities).

> However, if we want to support a model where the guest does not have a 
> definitive copy of the data, upon reconnect, we need to throw away the 
> guest's changes and make the shared memory segment appear to 
> simultaneously update to the guest.  This is what role=peer means.
>
> For role=peer, it's necessary to signal to the guest when it's not 
> connected.  This means prior to savevm it's necessary to indicate to 
> the guest that it's been disconnected.
>
> I think it's important that we build this mechanism in from the start 
> because as I've stated in the past, I don't think role=peer is going 
> to be the dominant use-case.  I actually don't think that shared 
> memory between guests is all that interesting compared to shared 
> memory to an external process on the host.

I'd like to avoid making the distinction.  Why limit at the outset?
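
(As an aside on the "grab a lock at offset 0" idea quoted above: a minimal,
purely illustrative user-level sketch of such an election, assuming each
guest application has already mmap()ed the device's shared-memory BAR and
passes it in as 'shm'; this is not part of the patch.)

    /* Illustrative only: elect a master by atomically claiming word 0 of
     * the shared region.  Word 0 is assumed to start out zeroed; the first
     * compare-and-swap to succeed wins. */
    #include <stdint.h>

    static int try_become_master(volatile uint32_t *shm, uint32_t my_vm_id)
    {
        /* store my_vm_id + 1 so that VM ID 0 is distinguishable from
         * "unclaimed" */
        return __sync_bool_compare_and_swap(&shm[0], 0, my_vm_id + 1);
    }

A losing guest can then read word 0 to learn which VM won the election.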

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-11 17:05                                         ` [Qemu-devel] " Anthony Liguori
@ 2010-05-11 18:13                                           ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-11 18:13 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Cam Macdonell, kvm, qemu-devel

On 05/11/2010 08:05 PM, Anthony Liguori wrote:
> On 05/11/2010 11:39 AM, Cam Macdonell wrote:
>>
>> Most of the people I hear from who are using my patch are using a peer
>> model to share data between applications (simulations, JVMs, etc).
>> But guest-to-host applications work as well of course.
>>
>> I think "transparent migration" can be achieved by making the
>> connected/disconnected state transparent to the application.
>>
>> When using the shared memory server, the server has to be setup anyway
>> on the new host and copying the memory region could be part of that as
>> well if the application needs the contents preserved.  I don't think
>> it has to be handled by the savevm/loadvm operations.  There's little
>> difference between naming one VM the master or letting the shared
>> memory server act like a master.
>
> Except that to make it work with the shared memory server, you need 
> the server to participate in the live migration protocol which is 
> something I'd prefer to avoid as it introduces additional down time.

We can tunnel its migration data through qemu.  Of course, gathering its 
dirty bitmap will be interesting.  DSM may be the way to go here (we can 
even live migrate qemu through DSM: share the guest address space and 
immediately start running on the destination node; the guest will fault 
its memory to the destination).  An advantage is that the cpu load is 
immediately transferred.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-11 18:13                                           ` [Qemu-devel] " Avi Kivity
@ 2010-05-12 15:32                                             ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-12 15:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, kvm, qemu-devel

On Tue, May 11, 2010 at 12:13 PM, Avi Kivity <avi@redhat.com> wrote:
> On 05/11/2010 08:05 PM, Anthony Liguori wrote:
>>
>> On 05/11/2010 11:39 AM, Cam Macdonell wrote:
>>>
>>> Most of the people I hear from who are using my patch are using a peer
>>> model to share data between applications (simulations, JVMs, etc).
>>> But guest-to-host applications work as well of course.
>>>
>>> I think "transparent migration" can be achieved by making the
>>> connected/disconnected state transparent to the application.
>>>
>>> When using the shared memory server, the server has to be setup anyway
>>> on the new host and copying the memory region could be part of that as
>>> well if the application needs the contents preserved.  I don't think
>>> it has to be handled by the savevm/loadvm operations.  There's little
>>> difference between naming one VM the master or letting the shared
>>> memory server act like a master.
>>
>> Except that to make it work with the shared memory server, you need the
>> server to participate in the live migration protocol which is something I'd
>> prefer to avoid as it introduces additional down time.
>
> We can tunnel its migration data through qemu.  Of course, gathering its
> dirty bitmap will be interesting.  DSM may be the way to go here (we can
> even live migrate qemu through DSM: share the guest address space and
> immediately start running on the destination node; the guest will fault its
> memory to the destination.  An advantage is that that the cpu load is
> immediately transferred.
>

Given the potential need to develop DSM and to migrate multiple VMs
simultaneously, as well as a few details still to decide on, can the patch
series (with the other review tweaks fixed) be accepted without migration
support?  I'll continue to work on it of course, but I think the patch
is useful to users even without migration at the moment.

Cam

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-12 15:32                                             ` [Qemu-devel] " Cam Macdonell
@ 2010-05-12 15:48                                               ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-12 15:48 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: Anthony Liguori, kvm, qemu-devel

On 05/12/2010 06:32 PM, Cam Macdonell wrote:
>
>> We can tunnel its migration data through qemu.  Of course, gathering its
>> dirty bitmap will be interesting.  DSM may be the way to go here (we can
>> even live migrate qemu through DSM: share the guest address space and
>> immediately start running on the destination node; the guest will fault its
>> memory to the destination.  An advantage is that that the cpu load is
>> immediately transferred.
>>
>>      
> Given the potential need to develop DSM and migrating multiple VMs
> simultaneously as well as few details to decide on, can the patch
> series (with other review tweaks fixed) be accepted without migration
> support?

Definitely.  I don't expect DSM to materialize tomorrow (or ever).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 16:48                     ` [Qemu-devel] " Cam Macdonell
@ 2010-05-12 15:49                       ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-12 15:49 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: kvm, qemu-devel, Anthony Liguori

On 05/10/2010 07:48 PM, Cam Macdonell wrote:
> On Mon, May 10, 2010 at 10:40 AM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> On 05/10/2010 06:41 PM, Cam Macdonell wrote:
>>      
>>>        
>>>> What would happen to any data written to the BAR before the the handshake
>>>> completed?  I think it would disappear.
>>>>
>>>>          
>>> But, the BAR isn't there until the handshake is completed.  Only after
>>> receiving the shared memory fd does my device call pci_register_bar()
>>> in the callback function.  So there may be a case with BAR2 (the
>>> shared memory BAR) missing during initialization.  FWIW, I haven't
>>> encountered this.
>>>
>>>        
>> Well, that violates PCI.  You can't have a PCI device with no BAR, then have
>> a BAR appear.  It may work since the BAR is registered a lot faster than the
>> BIOS is able to peek at it, but it's a race nevertheless.
>>      
> Agreed.  I'll get Anthony's idea up and running.  It seems that is the
> way forward.
>    

What, with the separate allocation and memcpy?  Or another one?

Why can't we complete initialization before exposing the card and BAR?  
Seems to be the simplest solution.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-12 15:49                       ` [Qemu-devel] " Avi Kivity
@ 2010-05-12 16:14                         ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-12 16:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, qemu-devel, Anthony Liguori

On Wed, May 12, 2010 at 9:49 AM, Avi Kivity <avi@redhat.com> wrote:
> On 05/10/2010 07:48 PM, Cam Macdonell wrote:
>>
>> On Mon, May 10, 2010 at 10:40 AM, Avi Kivity<avi@redhat.com>  wrote:
>>
>>>
>>> On 05/10/2010 06:41 PM, Cam Macdonell wrote:
>>>
>>>>
>>>>
>>>>>
>>>>> What would happen to any data written to the BAR before the the
>>>>> handshake
>>>>> completed?  I think it would disappear.
>>>>>
>>>>>
>>>>
>>>> But, the BAR isn't there until the handshake is completed.  Only after
>>>> receiving the shared memory fd does my device call pci_register_bar()
>>>> in the callback function.  So there may be a case with BAR2 (the
>>>> shared memory BAR) missing during initialization.  FWIW, I haven't
>>>> encountered this.
>>>>
>>>>
>>>
>>> Well, that violates PCI.  You can't have a PCI device with no BAR, then
>>> have
>>> a BAR appear.  It may work since the BAR is registered a lot faster than
>>> the
>>> BIOS is able to peek at it, but it's a race nevertheless.
>>>
>>
>> Agreed.  I'll get Anthony's idea up and running.  It seems that is the
>> way forward.
>>
>
> What, with the separate allocation and memcpy?  Or another one?

Mapping in the memory when it is received from the server.

>
> Why can't we complete initialization before exposing the card and BAR?
>  Seems to be the simplest solution.

Looking at it more closely, you're right: the fds for the shared
memory/eventfds are received in a fraction of a second, which is why
I haven't seen any problems, since the memory is mapped before the BIOS
detects and configures the device.

We can't block on a qemu char device (in any way I can see), so we have
to handle mapping the memory BAR in the callback function.  But we can
make the semantics that the VM ID is not set until the memory is
mapped.  So if the VM ID is -1, the memory has not been mapped yet;
reads/writes work but don't do anything useful.  The user can therefore
detect when the memory has been mapped, and it does not violate PCI
since the BAR is always present, just not yet backed by the shared
memory.
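
(A minimal guest-side sketch of the semantics described above: poll the
IVPosition register until it is no longer -1 before relying on the shared
memory BAR.  The register name comes from the device spec in patch 1/5; the
offset and the function name used here are assumptions, and a real driver
would sleep or use an interrupt rather than spin.)

    #include <linux/io.h>          /* readl() */
    #include <asm/processor.h>     /* cpu_relax() */

    #define IVSHMEM_IVPOSITION 0x08   /* assumed offset in the register BAR */

    /* 'regs' is the ioremap()ed register BAR.  Returns the VM ID once the
     * host has mapped the shared memory and assigned one. */
    static int ivshmem_wait_for_mapping(void __iomem *regs)
    {
        int pos;

        while ((pos = (int)readl(regs + IVSHMEM_IVPOSITION)) == -1)
            cpu_relax();

        return pos;
    }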

Cam

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-12 16:14                         ` [Qemu-devel] " Cam Macdonell
@ 2010-05-12 16:45                           ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-12 16:45 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: kvm, qemu-devel, Anthony Liguori

On 05/12/2010 07:14 PM, Cam Macdonell wrote:
>
>> Why can't we complete initialization before exposing the card and BAR?
>>   Seems to be the simplest solution.
>>      
> Looking at it more closely, you're right, the fds for shared
> memory/eventfds are received in a fraction of a second, so that's why
> I haven't seen any problems since the memory is mapped before the BIOS
> detects and configures the device.
>
> We can't block on a qemu char device (in anyway I can see) so we have
> to handle mapping the memory BAR in the callback function.  But, we
> can make the semantics that the VM ID is not set until the memory is
> mapped.  So if the VM ID is -1, then the memory has not been mapped
> yet, reads/writes work but don't do anything useful.  So the user can
> detect the mapping of the memory
> and it does not invalidate PCI since the BAR is always present, but
> just not mapped to the shared memory.
>    

I don't like this very much.  We expose an internal qemu implementation 
detail (the inability to complete negotiation during init) and make the 
device more error-prone to use.

However, it does make some sense if we regard the device as an HBA 
accessing external memory, so it's not unreasonable.  But please be 
sure to document this.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 11:59           ` [Qemu-devel] " Avi Kivity
@ 2010-05-13 21:10             ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-13 21:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, qemu-devel

On Mon, May 10, 2010 at 5:59 AM, Avi Kivity <avi@redhat.com> wrote:
> On 04/21/2010 08:53 PM, Cam Macdonell wrote:

>> +
>> +        /* allocate/initialize space for interrupt handling */
>> +        s->eventfds = qemu_mallocz(s->nr_alloc_guests * sizeof(int *));
>> +        s->eventfd_table = qemu_mallocz(s->vectors *
>> sizeof(EventfdEntry));
>> +        s->eventfds_posn_count = qemu_mallocz(s->nr_alloc_guests *
>> sizeof(int));
>> +
>> +        pci_conf[PCI_INTERRUPT_PIN] = 1; /* we are going to support
>> interrupts */
>>
>
> This is done by the guest BIOS.
>
>

If I remove that line, my driver crashes when it falls back to
pin-based interrupts (when MSI is turned off).  Is there something in
the device driver that I need to set in place of this?  A number of
other devices (mostly network cards) set the interrupt pin this way,
so I'm a little confused.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-13 21:10             ` [Qemu-devel] " Cam Macdonell
@ 2010-05-15  6:05               ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-15  6:05 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: kvm, qemu-devel

On 05/14/2010 12:10 AM, Cam Macdonell wrote:
> On Mon, May 10, 2010 at 5:59 AM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> On 04/21/2010 08:53 PM, Cam Macdonell wrote:
>>      
>    
>>> +
>>> +        /* allocate/initialize space for interrupt handling */
>>> +        s->eventfds = qemu_mallocz(s->nr_alloc_guests * sizeof(int *));
>>> +        s->eventfd_table = qemu_mallocz(s->vectors *
>>> sizeof(EventfdEntry));
>>> +        s->eventfds_posn_count = qemu_mallocz(s->nr_alloc_guests *
>>> sizeof(int));
>>> +
>>> +        pci_conf[PCI_INTERRUPT_PIN] = 1; /* we are going to support
>>> interrupts */
>>>
>>>        
>> This is done by the guest BIOS.
>>
>>
>>      
> If I remove that line, my driver crashes when it falls back to
> pin-based interrupts (when MSI is turned off).  Is there something in
> the device driver that I need to set in place of this?  A number of
> other devices (mostly network cards) set the interrupt pin this way,
> so I'm a little confused.
>    

Sorry, I confused this with PCI_INTERRUPT_LINE.

Note there is a helper to set it, pci_config_set_interrupt_pin().
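
(For reference, a sketch of the difference; the helper writes the same
config-space byte, so nothing else changes:)

    /* instead of poking the byte directly ... */
    pci_conf[PCI_INTERRUPT_PIN] = 1;            /* INTA */

    /* ... use the helper */
    pci_config_set_interrupt_pin(pci_conf, 1);  /* INTA */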

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-10 16:52                     ` [Qemu-devel] " Anthony Liguori
@ 2010-05-18 16:58                       ` Cam Macdonell
  -1 siblings, 0 replies; 102+ messages in thread
From: Cam Macdonell @ 2010-05-18 16:58 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, kvm, qemu-devel

On Mon, May 10, 2010 at 10:52 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
>> Yes, I think the ack is the way to go, so the guest has to be aware of
>> it.  Would setting a flag in the driver-specific config space be an
>> acceptable ack that the shared region is now mapped?
>>
>
> You know it's mapped because it's mapped when the pci map function returns.
>  You don't need the guest to explicitly tell you.
>

I've been playing with migration.  It appears that the memory is
preserved on migration in the default case, which makes sense as it is
part of the qemu memory allocation.  In my current implementation, I
"map" the shared memory in by calling cpu_register_physical_memory()
with the offset returned from qemu_ram_map().
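
(Roughly what that mapping path looks like; qemu_ram_map() is the helper
added in patch 2/5, so its exact signature, and the IVShmemState field
names below, are assumptions made for illustration:)

    /* sketch: map the incoming shared memory fd into qemu's RAM space and
     * expose it at the BAR2 guest-physical address */
    static void ivshmem_map_shared_memory(IVShmemState *s, int shm_fd)
    {
        s->ivshmem_offset = qemu_ram_map(shm_fd, s->ivshmem_size);

        cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
                                     s->ivshmem_offset | IO_MEM_RAM);
    }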

My question is how do I unregister the physical memory so it is not
copied on migration (for the role=peer case).  There isn't a
cpu_unregister_physical_memory().

Cam

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
  2010-05-18 16:58                       ` [Qemu-devel] " Cam Macdonell
@ 2010-05-18 17:27                         ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2010-05-18 17:27 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: Anthony Liguori, kvm, qemu-devel

On 05/18/2010 07:58 PM, Cam Macdonell wrote:
>
> My question is how to I unregister the physical memory so it is not
> copied on migration (for the role=peer case).  There isn't a
> cpu_unregister_physical_memory().
>    

It doesn't need to be unregistered, simply marked not migratable.  
Perhaps a flags argument to c_r_p_m().
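
(Purely hypothetical sketch of that suggestion; no such flag or variant of
cpu_register_physical_memory() exists in qemu.git today, and all the names
below are made up for illustration:)

    /* hypothetical flag telling the migration code to skip this region */
    #define PHYS_MEM_NO_MIGRATE  (1 << 0)

    cpu_register_physical_memory_flags(s->shm_pci_addr, s->ivshmem_size,
                                       s->ivshmem_offset | IO_MEM_RAM,
                                       PHYS_MEM_NO_MIGRATE);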

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 102+ messages in thread

end of thread, other threads:[~2010-05-18 17:27 UTC | newest]

Thread overview: 102+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-21 17:53 [PATCH v5 0/5] PCI Shared Memory device Cam Macdonell
2010-04-21 17:53 ` [Qemu-devel] " Cam Macdonell
2010-04-21 17:53 ` [PATCH v5 1/5] Device specification for shared memory PCI device Cam Macdonell
2010-04-21 17:53   ` [Qemu-devel] " Cam Macdonell
2010-04-21 17:53   ` [PATCH v5 2/5] Support adding a file to qemu's ram allocation Cam Macdonell
2010-04-21 17:53     ` [Qemu-devel] " Cam Macdonell
2010-04-21 17:53     ` [PATCH v5 3/5] Add functions for assigning ioeventfd and irqfds Cam Macdonell
2010-04-21 17:53       ` [Qemu-devel] " Cam Macdonell
2010-04-21 17:53       ` [PATCH v5 4/5] Inter-VM shared memory PCI device Cam Macdonell
2010-04-21 17:53         ` [Qemu-devel] " Cam Macdonell
2010-04-21 18:00         ` [PATCH v5 5/5] shared memory server for inter-VM shared memory Cam Macdonell
2010-04-21 18:00           ` [Qemu-devel] " Cam Macdonell
2010-05-05 16:57         ` [PATCH v5 4/5] RESEND: Inter-VM shared memory PCI device Cam Macdonell
2010-05-05 16:57           ` [Qemu-devel] " Cam Macdonell
2010-05-06 17:32         ` [PATCH v5 4/5] " Anthony Liguori
2010-05-06 17:32           ` [Qemu-devel] " Anthony Liguori
2010-05-06 17:59           ` Cam Macdonell
2010-05-06 17:59             ` [Qemu-devel] " Cam Macdonell
2010-05-10 11:59         ` Avi Kivity
2010-05-10 11:59           ` [Qemu-devel] " Avi Kivity
2010-05-10 15:22           ` Cam Macdonell
2010-05-10 15:22             ` [Qemu-devel] " Cam Macdonell
2010-05-10 15:28             ` Avi Kivity
2010-05-10 15:28               ` [Qemu-devel] " Avi Kivity
2010-05-10 15:38               ` Anthony Liguori
2010-05-10 15:38                 ` [Qemu-devel] " Anthony Liguori
2010-05-10 16:20                 ` Cam Macdonell
2010-05-10 16:20                   ` [Qemu-devel] " Cam Macdonell
2010-05-10 16:52                   ` Anthony Liguori
2010-05-10 16:52                     ` [Qemu-devel] " Anthony Liguori
2010-05-18 16:58                     ` Cam Macdonell
2010-05-18 16:58                       ` [Qemu-devel] " Cam Macdonell
2010-05-18 17:27                       ` Avi Kivity
2010-05-18 17:27                         ` [Qemu-devel] " Avi Kivity
2010-05-10 16:59                 ` Avi Kivity
2010-05-10 16:59                   ` [Qemu-devel] " Avi Kivity
2010-05-10 17:25                   ` Anthony Liguori
2010-05-10 17:25                     ` [Qemu-devel] " Anthony Liguori
2010-05-10 17:43                     ` Cam Macdonell
2010-05-10 17:43                       ` [Qemu-devel] " Cam Macdonell
2010-05-10 17:52                       ` Anthony Liguori
2010-05-10 17:52                         ` [Qemu-devel] " Anthony Liguori
2010-05-10 18:01                         ` Cam Macdonell
2010-05-10 18:01                           ` [Qemu-devel] " Cam Macdonell
2010-05-11  7:59                         ` Avi Kivity
2010-05-11  7:59                           ` [Qemu-devel] " Avi Kivity
2010-05-11 13:10                           ` Anthony Liguori
2010-05-11 13:10                             ` [Qemu-devel] " Anthony Liguori
2010-05-11 14:03                             ` Avi Kivity
2010-05-11 14:03                               ` [Qemu-devel] " Avi Kivity
2010-05-11 14:17                               ` Cam Macdonell
2010-05-11 14:17                                 ` [Qemu-devel] " Cam Macdonell
2010-05-11 14:53                                 ` Avi Kivity
2010-05-11 14:53                                   ` [Qemu-devel] " Avi Kivity
2010-05-11 15:51                                   ` Anthony Liguori
2010-05-11 15:51                                     ` [Qemu-devel] " Anthony Liguori
2010-05-11 16:39                                     ` Cam Macdonell
2010-05-11 16:39                                       ` [Qemu-devel] " Cam Macdonell
2010-05-11 17:05                                       ` Anthony Liguori
2010-05-11 17:05                                         ` [Qemu-devel] " Anthony Liguori
2010-05-11 17:50                                         ` Cam Macdonell
2010-05-11 17:50                                           ` [Qemu-devel] " Cam Macdonell
2010-05-11 18:13                                         ` Avi Kivity
2010-05-11 18:13                                           ` [Qemu-devel] " Avi Kivity
2010-05-12 15:32                                           ` Cam Macdonell
2010-05-12 15:32                                             ` [Qemu-devel] " Cam Macdonell
2010-05-12 15:48                                             ` Avi Kivity
2010-05-12 15:48                                               ` [Qemu-devel] " Avi Kivity
2010-05-11 18:09                                     ` Avi Kivity
2010-05-11 18:09                                       ` [Qemu-devel] " Avi Kivity
2010-05-11  7:55                     ` Avi Kivity
2010-05-11  7:55                       ` [Qemu-devel] " Avi Kivity
2010-05-10 15:41               ` Cam Macdonell
2010-05-10 15:41                 ` [Qemu-devel] " Cam Macdonell
2010-05-10 16:40                 ` Avi Kivity
2010-05-10 16:40                   ` [Qemu-devel] " Avi Kivity
2010-05-10 16:48                   ` Cam Macdonell
2010-05-10 16:48                     ` [Qemu-devel] " Cam Macdonell
2010-05-12 15:49                     ` Avi Kivity
2010-05-12 15:49                       ` [Qemu-devel] " Avi Kivity
2010-05-12 16:14                       ` Cam Macdonell
2010-05-12 16:14                         ` [Qemu-devel] " Cam Macdonell
2010-05-12 16:45                         ` Avi Kivity
2010-05-12 16:45                           ` [Qemu-devel] " Avi Kivity
2010-05-10 23:17           ` Cam Macdonell
2010-05-10 23:17             ` [Qemu-devel] " Cam Macdonell
2010-05-11  8:03             ` Avi Kivity
2010-05-11  8:03               ` [Qemu-devel] " Avi Kivity
2010-05-13 21:10           ` Cam Macdonell
2010-05-13 21:10             ` [Qemu-devel] " Cam Macdonell
2010-05-15  6:05             ` Avi Kivity
2010-05-15  6:05               ` [Qemu-devel] " Avi Kivity
2010-05-10 10:43       ` [PATCH v5 3/5] Add functions for assigning ioeventfd and irqfds Avi Kivity
2010-05-10 10:43         ` [Qemu-devel] " Avi Kivity
2010-05-10 15:13         ` Cam Macdonell
2010-05-10 15:13           ` [Qemu-devel] " Cam Macdonell
2010-05-10 15:17           ` Avi Kivity
2010-05-10 15:17             ` [Qemu-devel] " Avi Kivity
2010-05-10 10:39     ` [PATCH v5 2/5] Support adding a file to qemu's ram allocation Avi Kivity
2010-05-10 10:39       ` [Qemu-devel] " Avi Kivity
2010-05-10 15:32       ` Cam Macdonell
2010-05-10 15:32         ` [Qemu-devel] " Cam Macdonell
