[PATCH v6 0/6] Inter-VM Shared Memory Device with migration support

* [PATCH v6 0/6] Inter-VM Shared Memory Device with migration support
@ 2010-06-04 21:45 ` Cam Macdonell
  0 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-04 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm, Cam Macdonell

Latest version of the patch series for a PCI shared memory device that maps a
host shared memory object so that it can be shared between guests.

New in this series:
    - migration support with 'master' and 'peer' roles that determine which
      guest "owns" the memory.  With 'role=master', the guest holds the
      canonical copy of the shared memory and copies it along on migration.
      With 'role=peer', the guest does not copy the shared memory, but
      attaches to whatever is on the destination machine.
    - modified phys_ram_dirty array for marking memory as not to be migrated
    - add support for non-migrated memory regions

    v5:
    - fixed segfault for non-server case
    - code style fixes
    - removed limit on the number of guests
    - shared memory server is now in qemu.git/contrib
    - made ioeventfd setup function generic
    - removed interrupts when guest joined (let application handle it)

    v4:
    - moved to single Doorbell register and use datamatch to trigger different
      VMs rather than one register per eventfd
    - remove writing arbitrary values to eventfds.  Only values of 1 are now
      written to ensure correct usage
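
The 'master' and 'peer' roles described above can be sketched roughly as
follows.  This is only an illustration of the stated semantics, not code from
the series: the enum, the parser and the predicate are hypothetical names, and
the sketch makes no assumption about what happens when no role is given.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
enum ivshmem_role {
    IVSHMEM_ROLE_MASTER,   /* guest holds the canonical copy of the memory */
    IVSHMEM_ROLE_PEER      /* guest attaches to memory on the destination  */
};

/* Parse a role string; returns 0 on success, -1 if it is not recognized. */
static int ivshmem_parse_role(const char *s, enum ivshmem_role *out)
{
    assert(s != NULL && out != NULL);
    if (strcmp(s, "master") == 0) {
        *out = IVSHMEM_ROLE_MASTER;
        return 0;
    }
    if (strcmp(s, "peer") == 0) {
        *out = IVSHMEM_ROLE_PEER;
        return 0;
    }
    return -1;
}

/* Only the master carries the shared memory contents across a migration. */
static bool ivshmem_copies_shared_memory(enum ivshmem_role role)
{
    return role == IVSHMEM_ROLE_MASTER;
}
```

A peer that migrates therefore relies on a shared memory object already being
present on the destination host.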

Cam Macdonell (6):
  Device specification for shared memory PCI device
  Adds two new functions for assigning ioeventfd and irqfds.
  Change phys_ram_dirty to phys_ram_status
  Add support for marking memory to not be migrated.  On migration,
    memory is checked for the NO_MIGRATION_FLAG.
  Inter-VM shared memory PCI device
  the stand-alone shared memory server for inter-VM shared memory

 Makefile.target                         |    3 +
 arch_init.c                             |   28 +-
 contrib/ivshmem-server/Makefile         |   16 +
 contrib/ivshmem-server/README           |   30 ++
 contrib/ivshmem-server/ivshmem_server.c |  353 +++++++++++++
 contrib/ivshmem-server/send_scm.c       |  208 ++++++++
 contrib/ivshmem-server/send_scm.h       |   19 +
 cpu-all.h                               |   18 +-
 cpu-common.h                            |    2 +
 docs/specs/ivshmem_device_spec.txt      |   96 ++++
 exec.c                                  |   48 ++-
 hw/ivshmem.c                            |  852 +++++++++++++++++++++++++++++++
 kvm-all.c                               |   32 ++
 kvm.h                                   |    1 +
 qemu-char.c                             |    6 +
 qemu-char.h                             |    3 +
 qemu-doc.texi                           |   32 ++
 17 files changed, 1710 insertions(+), 37 deletions(-)
 create mode 100644 contrib/ivshmem-server/Makefile
 create mode 100644 contrib/ivshmem-server/README
 create mode 100644 contrib/ivshmem-server/ivshmem_server.c
 create mode 100644 contrib/ivshmem-server/send_scm.c
 create mode 100644 contrib/ivshmem-server/send_scm.h
 create mode 100644 docs/specs/ivshmem_device_spec.txt
 create mode 100644 hw/ivshmem.c


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v6 1/6] Device specification for shared memory PCI device
  2010-06-04 21:45 ` [Qemu-devel] " Cam Macdonell
@ 2010-06-04 21:45   ` Cam Macdonell
  -1 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-04 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm, Cam Macdonell

---
 docs/specs/ivshmem_device_spec.txt |   96 ++++++++++++++++++++++++++++++++++++
 1 files changed, 96 insertions(+), 0 deletions(-)
 create mode 100644 docs/specs/ivshmem_device_spec.txt

diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
new file mode 100644
index 0000000..23dd2ba
--- /dev/null
+++ b/docs/specs/ivshmem_device_spec.txt
@@ -0,0 +1,96 @@
+
+Device Specification for Inter-VM shared memory device
+------------------------------------------------------
+
+The Inter-VM shared memory device is designed to share a region of memory to
+userspace in multiple virtual guests.  The memory region does not belong to any
+guest, but is a POSIX memory object on the host.  Optionally, the device may
+support sending interrupts to other guests sharing the same memory region.
+
+
+The Inter-VM PCI device
+-----------------------
+
+*BARs*
+
+The device supports three BARs.  BAR0 is a 1 Kbyte MMIO region to support
+registers.  BAR1 is used for MSI-X when it is enabled in the device.  BAR2 is
+used to map the shared memory object from the host.  The size of BAR2 is
+specified when the guest is started and must be a power of 2 in size.
+
+*Registers*
+
+The device currently supports four 32-bit registers.  The registers
+are used for synchronization between guests sharing the same memory object when
+interrupts are supported (this requires using the shared memory server).
+
+The server assigns each VM an ID number and sends this ID number to the QEMU
+process when the guest starts.
+
+enum ivshmem_registers {
+    IntrMask = 0,
+    IntrStatus = 4,
+    IVPosition = 8,
+    Doorbell = 12
+};
+
+The first two registers are the interrupt mask and status registers.  Mask and
+status are only used with pin-based interrupts.  They are unused with MSI
+interrupts.
+
+Status Register: The status register is set to 1 when an interrupt occurs.
+
+Mask Register: The mask register is bitwise ANDed with the interrupt status
+and the result will raise an interrupt if it is non-zero.  However, since 1 is
+the only value the status will be set to, it is only the first bit of the mask
+that has any effect.  Therefore interrupts can be masked by setting the first
+bit to 0 and unmasked by setting the first bit to 1.
+
+IVPosition Register: The IVPosition register is read-only and reports the
+guest's ID number.  The guest IDs are non-negative integers.  When using the
+server, since the server is a separate process, the VM ID will only be set when
+the device is ready (shared memory is received from the server and accessible via
+the device).  If the device is not ready, the IVPosition will return -1.
+Applications should ensure that they have a valid VM ID before accessing the
+shared memory.
+
+Doorbell Register:  To interrupt another guest, a guest must write to the
+Doorbell register.  The doorbell register is 32 bits wide, logically divided
+into two 16-bit fields.  The high 16 bits are the guest ID to interrupt and the
+low 16 bits are the interrupt vector to trigger.  The semantics of the value
+written to the doorbell depend on whether the device is using MSI or a regular
+pin-based interrupt.  In short, MSI uses vectors while regular interrupts set the
+status register.
+
+Regular Interrupts
+
+If regular interrupts are used (because either the guest does not support MSI
+or the user disabled MSI at startup), then the value written to the lower
+16 bits of the Doorbell register is arbitrary: any write triggers an
+interrupt in the destination guest.
+
+Message Signalled Interrupts
+
+An ivshmem device may support multiple MSI vectors.  If so, the lower 16 bits
+written to the Doorbell register must be between 0 and the maximum number of
+vectors the guest supports.  The lower 16 bits written to the doorbell are the
+MSI vector that will be raised in the destination guest.  The number of MSI
+vectors is configurable, but only when the VM is started.
+
+The important thing to remember with MSI is that it is only a signal; no status
+is set (since MSI interrupts are not shared).  All information other than the
+interrupt itself should be communicated via the shared memory region.  Devices
+supporting multiple MSI vectors can use different vectors to indicate different
+events have occurred.  The semantics of interrupt vectors are left to the
+user's discretion.
+
+
+Usage in the Guest
+------------------
+
+The shared memory device is intended to be used with the provided UIO driver.
+Very little configuration is needed.  The guest should map BAR0 to access the
+registers (an array of 32-bit ints allows simple writing) and map BAR2 to
+access the shared memory region itself.  The size of the shared memory region
+is specified when the guest (or shared memory server) is started.  A guest may
+map the whole shared memory region or only part of it.
-- 
1.6.3.2.198.g6096d
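
The register layout in the specification above can be exercised from a guest
roughly as follows.  This is a minimal sketch, assuming BAR0 has already been
mapped as an array of 32-bit words (for example through the UIO driver); the
helper names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets within BAR0, as listed in the specification. */
enum ivshmem_registers {
    IntrMask = 0,
    IntrStatus = 4,
    IVPosition = 8,
    Doorbell = 12
};

/* Hypothetical helper: high 16 bits = destination guest ID,
 * low 16 bits = interrupt vector. */
static inline uint32_t ivshmem_doorbell_value(uint16_t peer_id, uint16_t vector)
{
    return ((uint32_t)peer_id << 16) | vector;
}

/* Hypothetical helper: ring the doorbell through a mapped BAR0. */
static inline void ivshmem_ring(volatile uint32_t *bar0,
                                uint16_t peer_id, uint16_t vector)
{
    assert(bar0 != NULL);
    bar0[Doorbell / sizeof(uint32_t)] = ivshmem_doorbell_value(peer_id, vector);
}

/* Hypothetical helper: read the guest's own ID.  Returns 1 and stores the ID
 * if the device is ready, 0 if IVPosition still reads as -1. */
static inline int ivshmem_read_id(const volatile uint32_t *bar0, int32_t *id_out)
{
    int32_t id = (int32_t)bar0[IVPosition / sizeof(uint32_t)];

    if (id < 0) {
        return 0;
    }
    *id_out = id;
    return 1;
}
```

An application would typically poll ivshmem_read_id() until it succeeds before
touching the shared memory mapped from BAR2.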


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 2/6] Add function to assign ioeventfd to MMIO.
  2010-06-04 21:45   ` [Qemu-devel] " Cam Macdonell
@ 2010-06-04 21:45     ` Cam Macdonell
  -1 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-04 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm, Cam Macdonell

---
 kvm-all.c |   32 ++++++++++++++++++++++++++++++++
 kvm.h     |    1 +
 2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 47f58a6..2982631 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1257,6 +1257,38 @@ int kvm_set_signal_mask(CPUState *env, const sigset_t *sigset)
     return r;
 }
 
+int kvm_set_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val, bool assign)
+{
+#ifdef KVM_IOEVENTFD
+    int ret;
+    struct kvm_ioeventfd iofd;
+
+    iofd.datamatch = val;
+    iofd.addr = addr;
+    iofd.len = 4;
+    iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
+    iofd.fd = fd;
+
+    if (!kvm_enabled()) {
+        return -ENOSYS;
+    }
+
+    if (!assign) {
+        iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
+    }
+
+    ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);
+
+    if (ret < 0) {
+        return -errno;
+    }
+
+    return 0;
+#else
+    return -ENOSYS;
+#endif
+}
+
 int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val, bool assign)
 {
 #ifdef KVM_IOEVENTFD
diff --git a/kvm.h b/kvm.h
index aab5118..52e3a7f 100644
--- a/kvm.h
+++ b/kvm.h
@@ -181,6 +181,7 @@ static inline void cpu_synchronize_post_init(CPUState *env)
 }
 
 #endif
+int kvm_set_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val, bool assign);
 
 #if defined(KVM_IRQFD) && defined(CONFIG_KVM)
 int kvm_set_irqfd(int gsi, int fd, bool assigned);
-- 
1.6.3.2.198.g6096d
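
For readers unfamiliar with datamatch, the request built by
kvm_set_ioeventfd_mmio_long() can be sketched in isolation as below.  The
helper name is hypothetical, and zeroing the structure before filling it is an
addition of this sketch rather than something the patch does; struct
kvm_ioeventfd and the flag names come from <linux/kvm.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/kvm.h>

/* Hypothetical helper: fill in a 4-byte MMIO ioeventfd request that fires
 * only when the guest writes exactly `val` to `addr`. */
static void fill_mmio_long_ioeventfd(struct kvm_ioeventfd *iofd, int fd,
                                     uint32_t addr, uint32_t val, int assign)
{
    assert(iofd != NULL);
    memset(iofd, 0, sizeof(*iofd));      /* also clears the padding bytes */

    iofd->datamatch = val;               /* value the write must match */
    iofd->addr = addr;                   /* guest physical MMIO address */
    iofd->len = 4;                       /* 32-bit access */
    iofd->fd = fd;                       /* eventfd signalled on a match */
    iofd->flags = KVM_IOEVENTFD_FLAG_DATAMATCH;

    if (!assign) {
        iofd->flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
    }
}
```

With a single Doorbell register, one such request per (guest ID, vector) pair
lets each distinct doorbell value signal a different eventfd.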


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 3/6] Change phys_ram_dirty to phys_ram_status
  2010-06-04 21:45     ` [Qemu-devel] " Cam Macdonell
@ 2010-06-04 21:45       ` Cam Macdonell
  -1 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-04 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm, Cam Macdonell

Each phys_ram_dirty entry is an 8-bit value storing 3 dirty bits.  Rename the
array to the more generic phys_ram_flags, use the lower 4 bits for dirty status,
and leave the upper 4 bits for other uses.

Function names such as cpu_physical_memory_get_dirty() may need to change as well.

---
 cpu-all.h |   16 +++++++++-------
 exec.c    |   36 ++++++++++++++++++------------------
 2 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 47a5722..9080cc7 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -858,7 +858,7 @@ target_phys_addr_t cpu_get_phys_page_debug(CPUState *env, target_ulong addr);
 /* memory API */
 
 extern int phys_ram_fd;
-extern uint8_t *phys_ram_dirty;
+extern uint8_t *phys_ram_flags;
 extern ram_addr_t ram_size;
 extern ram_addr_t last_ram_offset;
 
@@ -887,32 +887,34 @@ extern int mem_prealloc;
 #define CODE_DIRTY_FLAG      0x02
 #define MIGRATION_DIRTY_FLAG 0x08
 
+#define DIRTY_ALL_FLAG  (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | MIGRATION_DIRTY_FLAG)
+
 /* read dirty bit (return 0 or 1) */
 static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] == 0xff;
+    return phys_ram_flags[addr >> TARGET_PAGE_BITS] == DIRTY_ALL_FLAG;
 }
 
 static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS];
+    return phys_ram_flags[addr >> TARGET_PAGE_BITS];
 }
 
 static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
                                                 int dirty_flags)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags;
+    return phys_ram_flags[addr >> TARGET_PAGE_BITS] & dirty_flags;
 }
 
 static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
 {
-    phys_ram_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
+    phys_ram_flags[addr >> TARGET_PAGE_BITS] = DIRTY_ALL_FLAG;
 }
 
 static inline int cpu_physical_memory_set_dirty_flags(ram_addr_t addr,
                                                       int dirty_flags)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] |= dirty_flags;
+    return phys_ram_flags[addr >> TARGET_PAGE_BITS] |= dirty_flags;
 }
 
 static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
@@ -924,7 +926,7 @@ static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
 
     len = length >> TARGET_PAGE_BITS;
     mask = ~dirty_flags;
-    p = phys_ram_dirty + (start >> TARGET_PAGE_BITS);
+    p = phys_ram_flags + (start >> TARGET_PAGE_BITS);
     for (i = 0; i < len; i++) {
         p[i] &= mask;
     }
diff --git a/exec.c b/exec.c
index 7b0e1c5..39c18a7 100644
--- a/exec.c
+++ b/exec.c
@@ -116,7 +116,7 @@ uint8_t *code_gen_ptr;
 
 #if !defined(CONFIG_USER_ONLY)
 int phys_ram_fd;
-uint8_t *phys_ram_dirty;
+uint8_t *phys_ram_flags;
 static int in_migration;
 
 typedef struct RAMBlock {
@@ -2801,10 +2801,10 @@ ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
     new_block->next = ram_blocks;
     ram_blocks = new_block;
 
-    phys_ram_dirty = qemu_realloc(phys_ram_dirty,
+    phys_ram_flags = qemu_realloc(phys_ram_flags,
         (last_ram_offset + size) >> TARGET_PAGE_BITS);
-    memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
-           0xff, size >> TARGET_PAGE_BITS);
+    memset(phys_ram_flags + (last_ram_offset >> TARGET_PAGE_BITS),
+           DIRTY_ALL_FLAG, size >> TARGET_PAGE_BITS);
 
     last_ram_offset += size;
 
@@ -2853,10 +2853,10 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
     new_block->next = ram_blocks;
     ram_blocks = new_block;
 
-    phys_ram_dirty = qemu_realloc(phys_ram_dirty,
+    phys_ram_flags = qemu_realloc(phys_ram_flags,
         (last_ram_offset + size) >> TARGET_PAGE_BITS);
-    memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
-           0xff, size >> TARGET_PAGE_BITS);
+    memset(phys_ram_flags + (last_ram_offset >> TARGET_PAGE_BITS),
+           DIRTY_ALL_FLAG, size >> TARGET_PAGE_BITS);
 
     last_ram_offset += size;
 
@@ -3024,11 +3024,11 @@ static void notdirty_mem_writeb(void *opaque, target_phys_addr_t ram_addr,
 #endif
     }
     stb_p(qemu_get_ram_ptr(ram_addr), val);
-    dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
+    dirty_flags |= (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG);
     cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
     /* we remove the notdirty callback only if the code has been
        flushed */
-    if (dirty_flags == 0xff)
+    if (dirty_flags == DIRTY_ALL_FLAG)
         tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
 }
 
@@ -3044,11 +3044,11 @@ static void notdirty_mem_writew(void *opaque, target_phys_addr_t ram_addr,
 #endif
     }
     stw_p(qemu_get_ram_ptr(ram_addr), val);
-    dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
+    dirty_flags |= (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG);
     cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
     /* we remove the notdirty callback only if the code has been
        flushed */
-    if (dirty_flags == 0xff)
+    if (dirty_flags == DIRTY_ALL_FLAG)
         tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
 }
 
@@ -3064,11 +3064,11 @@ static void notdirty_mem_writel(void *opaque, target_phys_addr_t ram_addr,
 #endif
     }
     stl_p(qemu_get_ram_ptr(ram_addr), val);
-    dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
+    dirty_flags |= (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG);
     cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
     /* we remove the notdirty callback only if the code has been
        flushed */
-    if (dirty_flags == 0xff)
+    if (dirty_flags == DIRTY_ALL_FLAG)
         tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
 }
 
@@ -3485,7 +3485,7 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
                     tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
                     /* set dirty bit */
                     cpu_physical_memory_set_dirty_flags(
-                        addr1, (0xff & ~CODE_DIRTY_FLAG));
+                        addr1, (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG));
                 }
 		/* qemu doesn't execute guest code directly, but kvm does
 		   therefore flush instruction caches */
@@ -3699,7 +3699,7 @@ void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
                     tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
                     /* set dirty bit */
                     cpu_physical_memory_set_dirty_flags(
-                        addr1, (0xff & ~CODE_DIRTY_FLAG));
+                        addr1, (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG));
                 }
                 addr1 += l;
                 access_len -= l;
@@ -3860,7 +3860,7 @@ void stl_phys_notdirty(target_phys_addr_t addr, uint32_t val)
                 tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
                 /* set dirty bit */
                 cpu_physical_memory_set_dirty_flags(
-                    addr1, (0xff & ~CODE_DIRTY_FLAG));
+                    addr1, (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG));
             }
         }
     }
@@ -3929,7 +3929,7 @@ void stl_phys(target_phys_addr_t addr, uint32_t val)
             tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
             /* set dirty bit */
             cpu_physical_memory_set_dirty_flags(addr1,
-                (0xff & ~CODE_DIRTY_FLAG));
+                (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG));
         }
     }
 }
@@ -3972,7 +3972,7 @@ void stw_phys(target_phys_addr_t addr, uint32_t val)
             tb_invalidate_phys_page_range(addr1, addr1 + 2, 0);
             /* set dirty bit */
             cpu_physical_memory_set_dirty_flags(addr1,
-                (0xff & ~CODE_DIRTY_FLAG));
+                (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG));
         }
     }
 }
-- 
1.6.3.2.198.g6096d
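
The split of each flags byte into dirty bits (lower 4) and other uses (upper 4)
can be illustrated with the sketch below.  It is not part of the patch: the
helper names and EXAMPLE_UPPER_FLAG (with its value) are assumptions made for
illustration.  It shows one way to test and set the dirty bits so that bits in
the upper nibble are left alone, which a plain comparison against
DIRTY_ALL_FLAG would not do once an upper bit is in use.

```c
#include <assert.h>
#include <stdint.h>

#define VGA_DIRTY_FLAG       0x01
#define CODE_DIRTY_FLAG      0x02
#define MIGRATION_DIRTY_FLAG 0x08
#define DIRTY_ALL_FLAG  (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | MIGRATION_DIRTY_FLAG)

/* Assumed example of a non-dirty flag living in the upper 4 bits. */
#define EXAMPLE_UPPER_FLAG   0x10

/* Hypothetical helper: true if all dirty bits are set, ignoring upper bits. */
static inline int page_is_fully_dirty(uint8_t flags)
{
    return (flags & DIRTY_ALL_FLAG) == DIRTY_ALL_FLAG;
}

/* Hypothetical helper: set all dirty bits while preserving upper bits. */
static inline uint8_t page_set_dirty(uint8_t flags)
{
    return (uint8_t)(flags | DIRTY_ALL_FLAG);
}
```

For example, a page whose byte is DIRTY_ALL_FLAG | EXAMPLE_UPPER_FLAG still
counts as fully dirty here, and marking a page dirty does not clear
EXAMPLE_UPPER_FLAG.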


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v6 3/6] Change phys_ram_dirty to phys_ram_status
@ 2010-06-04 21:45       ` Cam Macdonell
  0 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-04 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: Cam Macdonell, kvm

phys_ram_dirty are 8-bit values storing 3 dirty bits.  Change to more generic
phys_ram_flags and use lower 4-bits for dirty status and leave upper 4 for
other uses.

The names of functions may need to be changed as well, such as c_p_m_get_dirty().

---
 cpu-all.h |   16 +++++++++-------
 exec.c    |   36 ++++++++++++++++++------------------
 2 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 47a5722..9080cc7 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -858,7 +858,7 @@ target_phys_addr_t cpu_get_phys_page_debug(CPUState *env, target_ulong addr);
 /* memory API */
 
 extern int phys_ram_fd;
-extern uint8_t *phys_ram_dirty;
+extern uint8_t *phys_ram_flags;
 extern ram_addr_t ram_size;
 extern ram_addr_t last_ram_offset;
 
@@ -887,32 +887,34 @@ extern int mem_prealloc;
 #define CODE_DIRTY_FLAG      0x02
 #define MIGRATION_DIRTY_FLAG 0x08
 
+#define DIRTY_ALL_FLAG  (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | MIGRATION_DIRTY_FLAG)
+
 /* read dirty bit (return 0 or 1) */
 static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] == 0xff;
+    return phys_ram_flags[addr >> TARGET_PAGE_BITS] == DIRTY_ALL_FLAG;
 }
 
 static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS];
+    return phys_ram_flags[addr >> TARGET_PAGE_BITS];
 }
 
 static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
                                                 int dirty_flags)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags;
+    return phys_ram_flags[addr >> TARGET_PAGE_BITS] & dirty_flags;
 }
 
 static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
 {
-    phys_ram_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
+    phys_ram_flags[addr >> TARGET_PAGE_BITS] = DIRTY_ALL_FLAG;
 }
 
 static inline int cpu_physical_memory_set_dirty_flags(ram_addr_t addr,
                                                       int dirty_flags)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] |= dirty_flags;
+    return phys_ram_flags[addr >> TARGET_PAGE_BITS] |= dirty_flags;
 }
 
 static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
@@ -924,7 +926,7 @@ static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
 
     len = length >> TARGET_PAGE_BITS;
     mask = ~dirty_flags;
-    p = phys_ram_dirty + (start >> TARGET_PAGE_BITS);
+    p = phys_ram_flags + (start >> TARGET_PAGE_BITS);
     for (i = 0; i < len; i++) {
         p[i] &= mask;
     }
diff --git a/exec.c b/exec.c
index 7b0e1c5..39c18a7 100644
--- a/exec.c
+++ b/exec.c
@@ -116,7 +116,7 @@ uint8_t *code_gen_ptr;
 
 #if !defined(CONFIG_USER_ONLY)
 int phys_ram_fd;
-uint8_t *phys_ram_dirty;
+uint8_t *phys_ram_flags;
 static int in_migration;
 
 typedef struct RAMBlock {
@@ -2801,10 +2801,10 @@ ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
     new_block->next = ram_blocks;
     ram_blocks = new_block;
 
-    phys_ram_dirty = qemu_realloc(phys_ram_dirty,
+    phys_ram_flags = qemu_realloc(phys_ram_flags,
         (last_ram_offset + size) >> TARGET_PAGE_BITS);
-    memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
-           0xff, size >> TARGET_PAGE_BITS);
+    memset(phys_ram_flags + (last_ram_offset >> TARGET_PAGE_BITS),
+           DIRTY_ALL_FLAG, size >> TARGET_PAGE_BITS);
 
     last_ram_offset += size;
 
@@ -2853,10 +2853,10 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
     new_block->next = ram_blocks;
     ram_blocks = new_block;
 
-    phys_ram_dirty = qemu_realloc(phys_ram_dirty,
+    phys_ram_flags = qemu_realloc(phys_ram_flags,
         (last_ram_offset + size) >> TARGET_PAGE_BITS);
-    memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
-           0xff, size >> TARGET_PAGE_BITS);
+    memset(phys_ram_flags + (last_ram_offset >> TARGET_PAGE_BITS),
+           DIRTY_ALL_FLAG, size >> TARGET_PAGE_BITS);
 
     last_ram_offset += size;
 
@@ -3024,11 +3024,11 @@ static void notdirty_mem_writeb(void *opaque, target_phys_addr_t ram_addr,
 #endif
     }
     stb_p(qemu_get_ram_ptr(ram_addr), val);
-    dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
+    dirty_flags |= (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG);
     cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
     /* we remove the notdirty callback only if the code has been
        flushed */
-    if (dirty_flags == 0xff)
+    if (dirty_flags == DIRTY_ALL_FLAG)
         tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
 }
 
@@ -3044,11 +3044,11 @@ static void notdirty_mem_writew(void *opaque, target_phys_addr_t ram_addr,
 #endif
     }
     stw_p(qemu_get_ram_ptr(ram_addr), val);
-    dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
+    dirty_flags |= (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG);
     cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
     /* we remove the notdirty callback only if the code has been
        flushed */
-    if (dirty_flags == 0xff)
+    if (dirty_flags == DIRTY_ALL_FLAG)
         tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
 }
 
@@ -3064,11 +3064,11 @@ static void notdirty_mem_writel(void *opaque, target_phys_addr_t ram_addr,
 #endif
     }
     stl_p(qemu_get_ram_ptr(ram_addr), val);
-    dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
+    dirty_flags |= (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG);
     cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
     /* we remove the notdirty callback only if the code has been
        flushed */
-    if (dirty_flags == 0xff)
+    if (dirty_flags == DIRTY_ALL_FLAG)
         tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
 }
 
@@ -3485,7 +3485,7 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
                     tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
                     /* set dirty bit */
                     cpu_physical_memory_set_dirty_flags(
-                        addr1, (0xff & ~CODE_DIRTY_FLAG));
+                        addr1, (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG));
                 }
 		/* qemu doesn't execute guest code directly, but kvm does
 		   therefore flush instruction caches */
@@ -3699,7 +3699,7 @@ void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
                     tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
                     /* set dirty bit */
                     cpu_physical_memory_set_dirty_flags(
-                        addr1, (0xff & ~CODE_DIRTY_FLAG));
+                        addr1, (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG));
                 }
                 addr1 += l;
                 access_len -= l;
@@ -3860,7 +3860,7 @@ void stl_phys_notdirty(target_phys_addr_t addr, uint32_t val)
                 tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
                 /* set dirty bit */
                 cpu_physical_memory_set_dirty_flags(
-                    addr1, (0xff & ~CODE_DIRTY_FLAG));
+                    addr1, (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG));
             }
         }
     }
@@ -3929,7 +3929,7 @@ void stl_phys(target_phys_addr_t addr, uint32_t val)
             tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
             /* set dirty bit */
             cpu_physical_memory_set_dirty_flags(addr1,
-                (0xff & ~CODE_DIRTY_FLAG));
+                (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG));
         }
     }
 }
@@ -3972,7 +3972,7 @@ void stw_phys(target_phys_addr_t addr, uint32_t val)
             tb_invalidate_phys_page_range(addr1, addr1 + 2, 0);
             /* set dirty bit */
             cpu_physical_memory_set_dirty_flags(addr1,
-                (0xff & ~CODE_DIRTY_FLAG));
+                (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG));
         }
     }
 }
-- 
1.6.3.2.198.g6096d
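
For readers following the rename, the accessors touched above all index one flag byte per target page. The following is only a minimal stand-alone sketch of that layout, not the QEMU code: the sketch_* names, the 4-page array, TARGET_PAGE_BITS = 12 and VGA_DIRTY_FLAG = 0x01 are assumptions for illustration, and DIRTY_ALL_FLAG is copied from the cpu-all.h context shown in patch 4/6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; the real ones live in the QEMU headers. */
#define TARGET_PAGE_BITS     12
#define VGA_DIRTY_FLAG       0x01   /* assumption: not visible in this patch */
#define CODE_DIRTY_FLAG      0x02
#define MIGRATION_DIRTY_FLAG 0x08
#define DIRTY_ALL_FLAG \
    (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | MIGRATION_DIRTY_FLAG)

/* Hypothetical stand-in for phys_ram_flags: one byte per page. */
#define SKETCH_PAGES 4
static uint8_t sketch_flags[SKETCH_PAGES];

/* Return the subset of dirty_flags that is set for the page holding addr. */
static int sketch_get_dirty(uint64_t addr, int dirty_flags)
{
    assert((addr >> TARGET_PAGE_BITS) < SKETCH_PAGES);
    return sketch_flags[addr >> TARGET_PAGE_BITS] & dirty_flags;
}

/* Mark the page holding addr dirty for every client (plain assignment). */
static void sketch_set_dirty(uint64_t addr)
{
    assert((addr >> TARGET_PAGE_BITS) < SKETCH_PAGES);
    sketch_flags[addr >> TARGET_PAGE_BITS] = DIRTY_ALL_FLAG;
}

/* Clear dirty_flags on every whole page in [start, start + length). */
static void sketch_mask_dirty_range(uint64_t start, uint64_t length,
                                    int dirty_flags)
{
    size_t i, len = length >> TARGET_PAGE_BITS;
    uint8_t mask = (uint8_t)~dirty_flags;
    uint8_t *p = sketch_flags + (start >> TARGET_PAGE_BITS);

    assert((start >> TARGET_PAGE_BITS) + len <= SKETCH_PAGES);
    for (i = 0; i < len; i++) {
        p[i] &= mask;
    }
}
```

Any address inside a page maps to the same byte, so sketch_get_dirty(0x1000, ...) and sketch_get_dirty(0x1fff, ...) read the same flags.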

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 4/6] Add support for marking memory to not be migrated.  On migration, memory is checked for the NO_MIGRATION_FLAG.
  2010-06-04 21:45       ` [Qemu-devel] " Cam Macdonell
@ 2010-06-04 21:45         ` Cam Macdonell
  -1 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-04 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm, Cam Macdonell

This is useful for devices that do not want to take their memory regions' data with them on migration.
---
 arch_init.c  |   28 ++++++++++++++++------------
 cpu-all.h    |    2 ++
 cpu-common.h |    2 ++
 exec.c       |   12 ++++++++++++
 4 files changed, 32 insertions(+), 12 deletions(-)
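
As a rough illustration of the check this patch adds, the sketch below marks a range of pages with NO_MIGRATION_FLAG and then counts the pages that would still be sent, mirroring the condition in ram_save_remaining(). It is a stand-alone approximation under stated assumptions: the demo_* names, the 8-page array and TARGET_PAGE_BITS = 12 are hypothetical, and only the two flag values are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_PAGE_BITS     12                 /* assumed page size: 4 KiB */
#define TARGET_PAGE_SIZE     (1u << TARGET_PAGE_BITS)
#define MIGRATION_DIRTY_FLAG 0x08               /* value from cpu-all.h */
#define NO_MIGRATION_FLAG    0x10               /* value added by this patch */

/* Hypothetical stand-in for phys_ram_flags: one byte per page. */
#define DEMO_PAGES 8
static uint8_t demo_flags[DEMO_PAGES];

/* Start of migration: every page is dirty from the migration's viewpoint. */
static void demo_mark_all_migration_dirty(void)
{
    int i;

    for (i = 0; i < DEMO_PAGES; i++) {
        demo_flags[i] |= MIGRATION_DIRTY_FLAG;
    }
}

/* Same shape as cpu_mark_pages_no_migrate(): OR the flag into each page. */
static void demo_mark_pages_no_migrate(uint64_t start, uint64_t length)
{
    uint64_t i, len = length >> TARGET_PAGE_BITS;
    uint8_t *p = demo_flags + (start >> TARGET_PAGE_BITS);

    assert((start >> TARGET_PAGE_BITS) + len <= DEMO_PAGES);
    for (i = 0; i < len; i++) {
        p[i] |= NO_MIGRATION_FLAG;
    }
}

/* Count pages that are migration-dirty and not excluded from migration. */
static uint64_t demo_save_remaining(void)
{
    uint64_t addr, count = 0;

    for (addr = 0; addr < DEMO_PAGES * TARGET_PAGE_SIZE;
         addr += TARGET_PAGE_SIZE) {
        uint8_t f = demo_flags[addr >> TARGET_PAGE_BITS];

        if (!(f & NO_MIGRATION_FLAG) && (f & MIGRATION_DIRTY_FLAG)) {
            count++;
        }
    }
    return count;
}
```

Because the flag is OR-ed into the byte, the dirty bits of a marked page are left alone; the page is simply skipped when counting and sending. Note that elsewhere in this series cpu_physical_memory_set_dirty() assigns DIRTY_ALL_FLAG to the whole byte, and an assignment of that kind would also drop NO_MIGRATION_FLAG. Whether that path can reach a marked page is not something this sketch checks.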

diff --git a/arch_init.c b/arch_init.c
index cfc03ea..7a234fa 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -118,18 +118,21 @@ static int ram_save_block(QEMUFile *f)
                                             current_addr + TARGET_PAGE_SIZE,
                                             MIGRATION_DIRTY_FLAG);
 
-            p = qemu_get_ram_ptr(current_addr);
-
-            if (is_dup_page(p, *p)) {
-                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
-                qemu_put_byte(f, *p);
-            } else {
-                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
-                qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
-            }
+            if (!cpu_physical_memory_get_dirty(current_addr,
+                                                    NO_MIGRATION_FLAG)) {
+                p = qemu_get_ram_ptr(current_addr);
+
+                if (is_dup_page(p, *p)) {
+                    qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
+                    qemu_put_byte(f, *p);
+                } else {
+                    qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
+                    qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
+                }
 
-            found = 1;
-            break;
+                found = 1;
+                break;
+            }
         }
         addr += TARGET_PAGE_SIZE;
         current_addr = (saved_addr + addr) % last_ram_offset;
@@ -146,7 +149,8 @@ static ram_addr_t ram_save_remaining(void)
     ram_addr_t count = 0;
 
     for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
-        if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
+        if (!cpu_physical_memory_get_dirty(addr, NO_MIGRATION_FLAG) &&
+                cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
             count++;
         }
     }
diff --git a/cpu-all.h b/cpu-all.h
index 9080cc7..4df00ab 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -887,6 +887,8 @@ extern int mem_prealloc;
 #define CODE_DIRTY_FLAG      0x02
 #define MIGRATION_DIRTY_FLAG 0x08
 
+#define NO_MIGRATION_FLAG 0x10
+
 #define DIRTY_ALL_FLAG  (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | MIGRATION_DIRTY_FLAG)
 
 /* read dirty bit (return 0 or 1) */
diff --git a/cpu-common.h b/cpu-common.h
index 4b0ba60..a1ebbbe 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -39,6 +39,8 @@ static inline void cpu_register_physical_memory(target_phys_addr_t start_addr,
     cpu_register_physical_memory_offset(start_addr, size, phys_offset, 0);
 }
 
+void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t size);
+
 ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
 ram_addr_t qemu_ram_map(ram_addr_t size, void *host);
 ram_addr_t qemu_ram_alloc(ram_addr_t);
diff --git a/exec.c b/exec.c
index 39c18a7..c11d22f 100644
--- a/exec.c
+++ b/exec.c
@@ -2786,6 +2786,18 @@ static void *file_ram_alloc(ram_addr_t memory, const char *path)
 }
 #endif
 
+void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t length)
+{
+    int i, len;
+    uint8_t *p;
+
+    len = length >> TARGET_PAGE_BITS;
+    p = phys_ram_flags + (start >> TARGET_PAGE_BITS);
+    for (i = 0; i < len; i++) {
+        p[i] |= NO_MIGRATION_FLAG;
+    }
+}
+
 ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
 {
     RAMBlock *new_block;
-- 
1.6.3.2.198.g6096d


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 5/6] Inter-VM shared memory PCI device
  2010-06-04 21:45         ` [Qemu-devel] " Cam Macdonell
@ 2010-06-04 21:45           ` Cam Macdonell
  -1 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-04 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm, Cam Macdonell

Support an inter-VM shared memory device that maps a shared-memory object as a
PCI device in the guest.  This patch also supports interrupts between guests by
communicating over a Unix domain socket.  This patch applies to the qemu-kvm
repository.

    -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]

Interrupts between multiple VMs are supported by connecting to a shared memory
server through a chardev socket.

    -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
           [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
    -chardev socket,path=<path>,id=<id>

(shared memory server is qemu.git/contrib/ivshmem-server)

Sample programs and init scripts are in a git repo here:

    www.gitorious.org/nahanni
---
 Makefile.target |    3 +
 hw/ivshmem.c    |  852 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 qemu-char.c     |    6 +
 qemu-char.h     |    3 +
 qemu-doc.texi   |   43 +++
 5 files changed, 907 insertions(+), 0 deletions(-)
 create mode 100644 hw/ivshmem.c
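
To make the Doorbell format concrete: the handler in this patch takes the destination VM ID from the upper 16 bits of the written value and the vector from the low 8 bits, then writes a 64-bit 1 to the matching eventfd. The sketch below shows that encoding and the eventfd write in isolation. It is a Linux-only illustration, not the device code; the doorbell_* helper names are hypothetical, while eventfd(2), read(2) and write(2) are the standard calls.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Pack a destination VM ID and a vector into a Doorbell register value. */
static uint32_t doorbell_encode(uint16_t dest, uint8_t vector)
{
    return ((uint32_t)dest << 16) | vector;
}

/* Destination VM ID: upper 16 bits, as in ivshmem_io_writel(). */
static uint16_t doorbell_dest(uint32_t val)
{
    return (uint16_t)(val >> 16);
}

/* Vector: low 8 bits, as in ivshmem_io_writel(). */
static uint8_t doorbell_vector(uint32_t val)
{
    return (uint8_t)(val & 0xff);
}

/* Ring a peer: add 1 to its eventfd counter. Returns 0 on success. */
static int doorbell_ring(int efd)
{
    uint64_t one = 1;

    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Receiver side: read and reset the counter. Returns 0 on success. */
static int doorbell_drain(int efd, uint64_t *count)
{
    return read(efd, count, sizeof(*count)) == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

A guest wanting vector 1 on VM 3 would write doorbell_encode(3, 1) to the Doorbell register. On the receiving side, several rings that arrive before a read are coalesced into one counter value, which is why only the value 1 is ever written.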

diff --git a/Makefile.target b/Makefile.target
index c4ba592..4888308 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -202,6 +202,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
 obj-y += rtl8139.o
 obj-y += e1000.o
 
+# Inter-VM PCI shared memory
+obj-y += ivshmem.o
+
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o i8259.o pc.o
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
new file mode 100644
index 0000000..9057612
--- /dev/null
+++ b/hw/ivshmem.c
@@ -0,0 +1,852 @@
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *      Cam Macdonell <cam@cs.ualberta.ca>
+ *
+ * Based On: cirrus_vga.c
+ *          Copyright (c) 2004 Fabrice Bellard
+ *          Copyright (c) 2004 Makoto Suzuki (suzu)
+ *
+ *      and rtl8139.c
+ *          Copyright (c) 2006 Igor Kovalenko
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/io.h>
+#include <sys/ioctl.h>
+#include "hw.h"
+#include "console.h"
+#include "pc.h"
+#include "pci.h"
+#include "sysemu.h"
+
+#include "msix.h"
+#include "qemu-kvm.h"
+#include "libkvm.h"
+
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#define IVSHMEM_IRQFD   0
+#define IVSHMEM_MSI     1
+
+//#define DEBUG_IVSHMEM
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...)        \
+    do {printf("IVSHMEM: " fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+typedef struct Peer {
+    int nb_eventfds;
+    int *eventfds;
+} Peer;
+
+typedef struct EventfdEntry {
+    PCIDevice *pdev;
+    int vector;
+} EventfdEntry;
+
+typedef struct IVShmemState {
+    PCIDevice dev;
+    uint32_t intrmask;
+    uint32_t intrstatus;
+    uint32_t doorbell;
+
+    CharDriverState ** eventfd_chr;
+    CharDriverState * server_chr;
+    int ivshmem_mmio_io_addr;
+
+    pcibus_t mmio_addr;
+    pcibus_t shm_pci_addr;
+    uint64_t ivshmem_offset;
+    uint64_t ivshmem_size; /* size of shared memory region */
+    int shm_fd; /* shared memory file descriptor */
+
+    Peer *peers;
+    int nb_peers; /* how many guests we have space for */
+    int max_peer; /* maximum numbered peer */
+
+    int vm_id;
+    uint32_t vectors;
+    uint32_t features;
+    EventfdEntry *eventfd_table;
+
+    char * shmobj;
+    char * sizearg;
+    char * role;
+} IVShmemState;
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+    IntrMask = 0,
+    IntrStatus = 4,
+    IVPosition = 8,
+    Doorbell = 12,
+};
+
+static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
+    return (ivs->features & (1 << feature));
+}
+
+static inline bool is_power_of_two(uint64_t x) {
+    return (x & (x - 1)) == 0;
+}
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+                    pcibus_t addr, pcibus_t size, int type)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
+
+    s->shm_pci_addr = addr;
+
+    if (s->ivshmem_offset > 0) {
+        cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
+                                                            s->ivshmem_offset);
+        if (s->role && strncmp(s->role, "peer", 4) == 0) {
+            IVSHMEM_DPRINTF("marking pages no migrate\n");
+            cpu_mark_pages_no_migrate(s->ivshmem_offset, s->ivshmem_size);
+        }
+    }
+
+    IVSHMEM_DPRINTF("guest pci addr = %u, guest h/w addr = %u, size = %u\n",
+                (uint32_t)addr, (uint32_t)s->ivshmem_offset, (uint32_t)size);
+
+}
+
+/* accessing registers - based on rtl8139 */
+static void ivshmem_update_irq(IVShmemState *s, int val)
+{
+    int isr;
+    isr = (s->intrstatus & s->intrmask) & 0xffffffff;
+
+    /* don't print ISR resets */
+    if (isr) {
+        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
+           isr ? 1 : 0, s->intrstatus, s->intrmask);
+    }
+
+    qemu_set_irq(s->dev.irq[0], (isr != 0));
+}
+
+static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
+
+    s->intrmask = val;
+
+    ivshmem_update_irq(s, val);
+}
+
+static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrmask;
+
+    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
+
+    return ret;
+}
+
+static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
+
+    s->intrstatus = val;
+
+    ivshmem_update_irq(s, val);
+    return;
+}
+
+static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrstatus;
+
+    /* reading ISR clears all interrupts */
+    s->intrstatus = 0;
+
+    ivshmem_update_irq(s, 0);
+
+    return ret;
+}
+
+static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val)
+{
+
+    IVSHMEM_DPRINTF("We shouldn't be writing words\n");
+}
+
+static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVShmemState *s = opaque;
+
+    u_int64_t write_one = 1;
+    u_int16_t dest = val >> 16;
+    u_int16_t vector = val & 0xff;
+
+    addr &= 0xfc;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ivshmem_IntrMask_write(s, val);
+            break;
+
+        case IntrStatus:
+            ivshmem_IntrStatus_write(s, val);
+            break;
+
+        case Doorbell:
+            /* check that dest VM ID is reasonable */
+            if ((dest < 0) || (dest > s->max_peer)) {
+                IVSHMEM_DPRINTF("Invalid destination VM ID (%d)\n", dest);
+                break;
+            }
+
+            /* check doorbell range */
+            if ((vector >= 0) && (vector < s->peers[dest].nb_eventfds)) {
+                IVSHMEM_DPRINTF("Writing %ld to VM %d on vector %d\n",
+                                                    write_one, dest, vector);
+                if (write(s->peers[dest].eventfds[vector],
+                                                    &(write_one), 8) != 8) {
+                    IVSHMEM_DPRINTF("error writing to eventfd\n");
+                }
+            }
+            break;
+        default:
+            IVSHMEM_DPRINTF("Invalid VM Doorbell VM %d\n", dest);
+    }
+}
+
+static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVSHMEM_DPRINTF("We shouldn't be writing bytes\n");
+}
+
+static uint32_t ivshmem_io_readw(void *opaque, uint8_t addr)
+{
+
+    IVSHMEM_DPRINTF("We shouldn't be reading words\n");
+    return 0;
+}
+
+static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr)
+{
+
+    IVShmemState *s = opaque;
+    uint32_t ret;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ret = ivshmem_IntrMask_read(s);
+            break;
+
+        case IntrStatus:
+            ret = ivshmem_IntrStatus_read(s);
+            break;
+
+        case IVPosition:
+            /* return my VM ID if the memory is mapped */
+            if (s->shm_fd > 0) {
+                ret = s->vm_id;
+            } else {
+                ret = -1;
+            }
+            break;
+
+        default:
+            IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr);
+            ret = 0;
+    }
+
+    return ret;
+}
+
+static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr)
+{
+    IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
+
+    return 0;
+}
+
+static void ivshmem_mmio_writeb(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writeb(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writew(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writew(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writel(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writel(opaque, addr & 0xFF, val);
+}
+
+static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr)
+{
+    return ivshmem_io_readb(opaque, addr & 0xFF);
+}
+
+static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readw(opaque, addr & 0xFF);
+    return val;
+}
+
+static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readl(opaque, addr & 0xFF);
+    return val;
+}
+
+static CPUReadMemoryFunc *ivshmem_mmio_read[3] = {
+    ivshmem_mmio_readb,
+    ivshmem_mmio_readw,
+    ivshmem_mmio_readl,
+};
+
+static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = {
+    ivshmem_mmio_writeb,
+    ivshmem_mmio_writew,
+    ivshmem_mmio_writel,
+};
+
+static void ivshmem_receive(void *opaque, const uint8_t *buf, int size)
+{
+    IVShmemState *s = opaque;
+
+    ivshmem_IntrStatus_write(s, *buf);
+
+    IVSHMEM_DPRINTF("ivshmem_receive 0x%02x\n", *buf);
+}
+
+static int ivshmem_can_receive(void * opaque)
+{
+    return 8;
+}
+
+static void ivshmem_event(void *opaque, int event)
+{
+    IVSHMEM_DPRINTF("ivshmem_event %d\n", event);
+}
+
+static void fake_irqfd(void *opaque, const uint8_t *buf, int size) {
+
+    EventfdEntry *entry = opaque;
+    PCIDevice *pdev = entry->pdev;
+
+    IVSHMEM_DPRINTF("fake irqfd on vector %p %d\n", pdev, entry->vector);
+    msix_notify(pdev, entry->vector);
+}
+
+static CharDriverState* create_eventfd_chr_device(void * opaque, int eventfd,
+                                                                    int vector)
+{
+    /* create an event character device based on the passed eventfd */
+    IVShmemState *s = opaque;
+    CharDriverState * chr;
+
+    chr = qemu_chr_open_eventfd(eventfd);
+
+    if (chr == NULL) {
+        IVSHMEM_DPRINTF("creating eventfd for eventfd %d failed\n", eventfd);
+        exit(-1);
+    }
+
+    /* if MSI is supported we need multiple interrupts */
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        s->eventfd_table[vector].pdev = &s->dev;
+        s->eventfd_table[vector].vector = vector;
+
+        qemu_chr_add_handlers(chr, ivshmem_can_receive, fake_irqfd,
+                      ivshmem_event, &s->eventfd_table[vector]);
+    } else {
+        qemu_chr_add_handlers(chr, ivshmem_can_receive, ivshmem_receive,
+                      ivshmem_event, s);
+    }
+
+    return chr;
+
+}
+
+static int check_shm_size(IVShmemState *s, int fd) {
+    /* check that the guest isn't going to try to map more memory than the
+     * object has allocated; return -1 to indicate error */
+
+    struct stat buf;
+
+    fstat(fd, &buf);
+
+    if (s->ivshmem_size > buf.st_size) {
+        fprintf(stderr, "IVSHMEM ERROR: Requested memory size greater");
+        fprintf(stderr, " than shared object size (%ld > %ld)\n",
+                                          s->ivshmem_size, buf.st_size);
+        return -1;
+    } else {
+        return 0;
+    }
+}
+
+/* create the shared memory BAR when we are not using the server, so we can
+ * create the BAR and map the memory immediately */
+static void create_shared_memory_BAR(IVShmemState *s, int fd) {
+
+    void * ptr;
+
+    s->shm_fd = fd;
+
+    ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+
+    s->ivshmem_offset = qemu_ram_map(s->ivshmem_size, ptr);
+
+    /* region for shared memory */
+    pci_register_bar(&s->dev, 2, s->ivshmem_size,
+                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
+}
+
+static void close_guest_eventfds(IVShmemState *s, int posn)
+{
+    int i, guest_curr_max;
+
+    guest_curr_max = s->peers[posn].nb_eventfds;
+
+    for (i = 0; i < guest_curr_max; i++)
+        close(s->peers[posn].eventfds[i]);
+
+    qemu_free(s->peers[posn].eventfds);
+    s->peers[posn].nb_eventfds = 0;
+}
+
+static void setup_ioeventfds(IVShmemState *s) {
+
+    int i, j;
+
+    for (i = 0; i <= s->max_peer; i++) {
+        for (j = 0; j < s->peers[i].nb_eventfds; j++) {
+            kvm_set_ioeventfd_mmio_long(s->peers[i].eventfds[j],
+                    s->mmio_addr + Doorbell, (i << 16) | j, 1);
+        }
+    }
+
+    /* setup irqfd for this VM's eventfds */
+    for (i = 0; i < s->vectors; i++) {
+        kvm_set_irqfd(s->dev.msix_irq_entries[i].gsi,
+                        s->peers[s->vm_id].eventfds[i], 1);
+    }
+}
+
+
+/* this function increases the dynamic storage needed to store data about
+ * other guests */
+static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
+
+    int j, old_nb_alloc;
+
+    old_nb_alloc = s->nb_peers;
+
+    while (new_min_size >= s->nb_peers)
+        s->nb_peers = s->nb_peers * 2;
+
+    IVSHMEM_DPRINTF("bumping storage to %d guests\n", s->nb_peers);
+    s->peers = qemu_realloc(s->peers, s->nb_peers * sizeof(Peer));
+
+    if (s->peers == NULL) {
+        fprintf(stderr, "Allocation error - exiting\n");
+        exit(1);
+    }
+
+    /* zero out new pointers */
+    for (j = old_nb_alloc; j < s->nb_peers; j++) {
+        s->peers[j].eventfds = NULL;
+        s->peers[j].nb_eventfds = 0;
+    }
+}
+
+static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
+{
+    IVShmemState *s = opaque;
+    int incoming_fd, tmp_fd;
+    int guest_curr_max;
+    long incoming_posn;
+
+    memcpy(&incoming_posn, buf, sizeof(long));
+    /* pick off s->server_chr->msgfd and store it, posn should accompany msg */
+    tmp_fd = qemu_chr_get_msgfd(s->server_chr);
+    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
+
+    /* make sure we have enough space for this guest */
+    if (incoming_posn >= s->nb_peers) {
+        increase_dynamic_storage(s, incoming_posn);
+    }
+
+    if (tmp_fd == -1) {
+        /* if posn is non-negative and unseen before then this is our posn */
+        if ((incoming_posn >= 0) && (s->peers[incoming_posn].eventfds == NULL)) {
+            /* receive our posn */
+            s->vm_id = incoming_posn;
+            return;
+        } else {
+            /* otherwise an fd == -1 means an existing guest has gone away */
+            IVSHMEM_DPRINTF("posn %ld has gone away\n", incoming_posn);
+            close_guest_eventfds(s, incoming_posn);
+            return;
+        }
+    }
+
+    /* because of the implementation of get_msgfd, we need a dup */
+    incoming_fd = dup(tmp_fd);
+
+    if (incoming_fd == -1) {
+        fprintf(stderr, "could not allocate file descriptor %s\n",
+                                                            strerror(errno));
+        return;
+    }
+
+    /* if the position is -1, then it's the shared memory region fd */
+    if (incoming_posn == -1) {
+
+        void * map_ptr;
+
+        s->max_peer = 0;
+
+        if (check_shm_size(s, incoming_fd) == -1) {
+            exit(-1);
+        }
+
+        /* mmap the region and map into the BAR2 */
+        map_ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED,
+                                                                incoming_fd, 0);
+        s->ivshmem_offset = qemu_ram_map(s->ivshmem_size, map_ptr);
+
+        IVSHMEM_DPRINTF("guest pci addr = %u, guest h/w addr = %u, size = %u\n",
+                        (uint32_t)s->shm_pci_addr, (uint32_t)s->ivshmem_offset,
+                        (uint32_t)s->ivshmem_size);
+
+        if (s->shm_pci_addr > 0) {
+            /* map memory into BAR2 */
+            cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
+                                                            s->ivshmem_offset);
+            if (s->role && strncmp(s->role, "peer", 4) == 0) {
+                IVSHMEM_DPRINTF("marking pages no migrate\n");
+                cpu_mark_pages_no_migrate(s->ivshmem_offset, s->ivshmem_size);
+            }
+
+        }
+
+        /* only store the fd if it is successfully mapped */
+        s->shm_fd = incoming_fd;
+
+        return;
+    }
+
+    /* each guest has an array of eventfds, and we keep track of how many
+     * eventfds each guest has */
+    guest_curr_max = s->peers[incoming_posn].nb_eventfds;
+    if (guest_curr_max == 0) {
+        /* one eventfd per MSI vector */
+        s->peers[incoming_posn].eventfds = (int *) qemu_malloc(s->vectors *
+                                                                sizeof(int));
+    }
+
+    /* this is an eventfd for a particular guest VM */
+    IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn, guest_curr_max,
+                                                                incoming_fd);
+    s->peers[incoming_posn].eventfds[guest_curr_max] = incoming_fd;
+
+    /* increment count for particular guest */
+    s->peers[incoming_posn].nb_eventfds++;
+
+    /* keep track of the maximum VM ID */
+    if (incoming_posn > s->max_peer) {
+        s->max_peer = incoming_posn;
+    }
+
+    if (incoming_posn == s->vm_id) {
+        int vector = guest_curr_max;
+        if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+            /* initialize char device for callback
+             * if this is one of my eventfds */
+            s->eventfd_chr[vector] = create_eventfd_chr_device(s,
+                       s->peers[s->vm_id].eventfds[vector], vector);
+        }
+    }
+
+    return;
+}
+
+static void ivshmem_reset(DeviceState *d)
+{
+    return;
+}
+
+static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
+                       pcibus_t addr, pcibus_t size, int type)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
+
+    s->mmio_addr = addr;
+    cpu_register_physical_memory(addr + 0, 0x400, s->ivshmem_mmio_io_addr);
+
+    /* ioeventfd and irqfd are enabled together,
+     * so the flag IRQFD refers to both */
+    if (ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+        setup_ioeventfds(s);
+    }
+}
+
+static uint64_t ivshmem_get_size(IVShmemState * s) {
+
+    uint64_t value;
+    char *ptr;
+
+    value = strtoul(s->sizearg, &ptr, 10);
+    switch (*ptr) {
+        case 0: case 'M': case 'm':
+            value <<= 20;
+            break;
+        case 'G': case 'g':
+            value <<= 30;
+            break;
+        default:
+            fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg);
+            exit(1);
+    }
+
+    /* BARs must be a power of 2 */
+    if (!is_power_of_two(value)) {
+        fprintf(stderr, "ivshmem: size must be power of 2\n");
+        exit(1);
+    }
+
+    return value;
+
+}
+
+static void ivshmem_setup_msi(IVShmemState * s) {
+
+    int i;
+
+    /* allocate the MSI-X vectors */
+
+    if (!msix_init(&s->dev, s->vectors, 1, 0)) {
+        pci_register_bar(&s->dev, 1,
+                         msix_bar_size(&s->dev),
+                         PCI_BASE_ADDRESS_SPACE_MEMORY,
+                         msix_mmio_map);
+        IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
+    } else {
+        IVSHMEM_DPRINTF("msix initialization failed\n");
+    }
+
+    /* 'activate' the vectors */
+    for (i = 0; i < s->vectors; i++) {
+        msix_vector_use(&s->dev, i);
+    }
+
+    /* if IRQFDs are not supported, we'll have to trigger the interrupts
+     * via Qemu char devices */
+    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+        /* for handling interrupts when IRQFD is not available */
+        s->eventfd_table = qemu_mallocz(s->vectors * sizeof(EventfdEntry));
+    }
+}
+
+static void ivshmem_save(QEMUFile* f, void *opaque)
+{
+    IVShmemState *proxy = opaque;
+
+    IVSHMEM_DPRINTF("ivshmem_save\n");
+    pci_device_save(&proxy->dev, f);
+
+    if (ivshmem_has_feature(proxy, IVSHMEM_MSI)) {
+        msix_save(&proxy->dev, f);
+    } else {
+        qemu_put_be32(f, proxy->intrstatus);
+        qemu_put_be32(f, proxy->intrmask);
+    }
+
+}
+
+static int ivshmem_load(QEMUFile* f, void *opaque, int version_id)
+{
+    IVSHMEM_DPRINTF("ivshmem_load\n");
+
+    IVShmemState *proxy = opaque;
+    int ret, i;
+
+    ret = pci_device_load(&proxy->dev, f);
+    if (ret) {
+        return ret;
+    }
+
+    if (ivshmem_has_feature(proxy, IVSHMEM_MSI)) {
+        msix_load(&proxy->dev, f);
+        for (i = 0; i < proxy->vectors; i++) {
+            msix_vector_use(&proxy->dev, i);
+        }
+    } else {
+        proxy->intrstatus = qemu_get_be32(f);
+        proxy->intrmask = qemu_get_be32(f);
+    }
+
+    return 0;
+}
+
+static int pci_ivshmem_init(PCIDevice *dev)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
+    uint8_t *pci_conf;
+
+    if (s->sizearg == NULL)
+        s->ivshmem_size = 4 << 20; /* 4 MB default */
+    else {
+        s->ivshmem_size = ivshmem_get_size(s);
+    }
+
+    register_savevm("ivshmem", 0, 0, ivshmem_save, ivshmem_load, dev);
+
+    /* IRQFD requires MSI */
+    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) &&
+        !ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n");
+        exit(1);
+    }
+
+    /* check that role is reasonable */
+    if (s->role && !((strncmp(s->role, "peer", 5) == 0) ||
+                        (strncmp(s->role, "master", 7) == 0))) {
+        fprintf(stderr, "ivshmem: 'role' must be 'peer' or 'master'\n");
+        exit(1);
+    }
+
+    pci_conf = s->dev.config;
+    pci_conf[0x00] = 0xf4; /* Qumranet vendor ID 0x1af4 */
+    pci_conf[0x01] = 0x1a;
+    pci_conf[0x02] = 0x10;
+    pci_conf[0x03] = 0x11;
+    pci_conf[0x04] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
+    pci_conf[0x0a] = 0x00; /* RAM controller */
+    pci_conf[0x0b] = 0x05;
+    pci_conf[0x0e] = 0x00; /* header_type */
+
+    pci_conf[PCI_INTERRUPT_PIN] = 1;
+
+    s->shm_pci_addr = 0;
+    s->ivshmem_offset = 0;
+    s->shm_fd = 0;
+
+    s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read,
+                                    ivshmem_mmio_write, s);
+    /* region for registers*/
+    pci_register_bar(&s->dev, 0, 0x400,
+                           PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_mmio_map);
+
+    if ((s->server_chr != NULL) &&
+                        (strncmp(s->server_chr->filename, "unix:", 5) == 0)) {
+        /* if we get a UNIX socket as the parameter we will talk
+         * to the ivshmem server to receive the memory region */
+
+        IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
+                                                    s->server_chr->filename);
+
+        if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+            ivshmem_setup_msi(s);
+        }
+
+        /* we allocate enough space for 16 guests and grow as needed */
+        s->nb_peers = 16;
+        s->vm_id = -1;
+
+        /* allocate/initialize space for interrupt handling */
+        s->peers = qemu_mallocz(s->nb_peers * sizeof(Peer));
+
+        pci_register_bar(&s->dev, 2, s->ivshmem_size,
+                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
+
+        s->eventfd_chr = (CharDriverState **) qemu_mallocz(s->vectors *
+                                                sizeof(CharDriverState *));
+
+        qemu_chr_add_handlers(s->server_chr, ivshmem_can_receive, ivshmem_read,
+                     ivshmem_event, s);
+    } else {
+        /* just map the file immediately, we're not using a server */
+        int fd;
+
+        if (s->shmobj == NULL) {
+            fprintf(stderr, "Must specify 'chardev' or 'shm' to ivshmem\n");
+            exit(1);
+        }
+        IVSHMEM_DPRINTF("using shm_open (shm object = %s)\n", s->shmobj);
+
+        /* try creating the object with O_EXCL; if that succeeds the object
+         * is new, so size it (ftruncate extends it with zeroed bytes) */
+        if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR|O_EXCL,
+                        S_IRWXU|S_IRWXG|S_IRWXO)) >= 0) {
+            /* truncate file to the length of the PCI device's memory */
+            if (ftruncate(fd, s->ivshmem_size) != 0) {
+                fprintf(stderr, "kvm_ivshmem: could not truncate shared file\n");
+            }
+
+        } else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR,
+                        S_IRWXU|S_IRWXG|S_IRWXO)) < 0) {
+            fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
+            exit(-1);
+
+        }
+
+        if (check_shm_size(s, fd) == -1) {
+            exit(-1);
+        }
+
+        create_shared_memory_BAR(s, fd);
+
+    }
+
+    return 0;
+}
+
+static int pci_ivshmem_uninit(PCIDevice *dev)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
+
+    cpu_unregister_io_memory(s->ivshmem_mmio_io_addr);
+
+    return 0;
+}
+
+static PCIDeviceInfo ivshmem_info = {
+    .qdev.name  = "ivshmem",
+    .qdev.size  = sizeof(IVShmemState),
+    .qdev.reset = ivshmem_reset,
+    .init       = pci_ivshmem_init,
+    .exit       = pci_ivshmem_uninit,
+    .qdev.props = (Property[]) {
+        DEFINE_PROP_CHR("chardev", IVShmemState, server_chr),
+        DEFINE_PROP_STRING("size", IVShmemState, sizearg),
+        DEFINE_PROP_UINT32("vectors", IVShmemState, vectors, 1),
+        DEFINE_PROP_BIT("irqfd", IVShmemState, features, IVSHMEM_IRQFD, false),
+        DEFINE_PROP_BIT("msi", IVShmemState, features, IVSHMEM_MSI, true),
+        DEFINE_PROP_STRING("shm", IVShmemState, shmobj),
+        DEFINE_PROP_STRING("role", IVShmemState, role),
+        DEFINE_PROP_END_OF_LIST(),
+    }
+};
+
+static void ivshmem_register_devices(void)
+{
+    pci_qdev_register(&ivshmem_info);
+}
+
+device_init(ivshmem_register_devices)
diff --git a/qemu-char.c b/qemu-char.c
index ac65a1c..b2e50d0 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2093,6 +2093,12 @@ static void tcp_chr_read(void *opaque)
     }
 }
 
+CharDriverState *qemu_chr_open_eventfd(int eventfd){
+
+    return qemu_chr_open_fd(eventfd, eventfd);
+
+}
+
 static void tcp_chr_connect(void *opaque)
 {
     CharDriverState *chr = opaque;
diff --git a/qemu-char.h b/qemu-char.h
index e3a0783..6ea01ba 100644
--- a/qemu-char.h
+++ b/qemu-char.h
@@ -94,6 +94,9 @@ void qemu_chr_info_print(Monitor *mon, const QObject *ret_data);
 void qemu_chr_info(Monitor *mon, QObject **ret_data);
 CharDriverState *qemu_chr_find(const char *name);
 
+/* add an eventfd to the qemu devices that are polled */
+CharDriverState *qemu_chr_open_eventfd(int eventfd);
+
 extern int term_escape_char;
 
 /* async I/O support */
diff --git a/qemu-doc.texi b/qemu-doc.texi
index 6647b7b..24f8748 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -706,6 +706,49 @@ Using the @option{-net socket} option, it is possible to make VLANs
 that span several QEMU instances. See @ref{sec_invocation} to have a
 basic example.
 
+@section Other Devices
+
+@subsection Inter-VM Shared Memory device
+
+With KVM enabled on a Linux host, a shared memory device is available.  The
+device maps a POSIX shared memory region into the guest as a PCI device,
+enabling zero-copy communication between applications running in different
+guests.  The basic syntax is:
+
+@example
+qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
+@end example
+
+If desired, interrupts can be sent between guest VMs accessing the same shared
+memory region.  Interrupt support requires a shared memory server, which each
+guest connects to through a chardev socket.  The code for the shared memory
+server is in qemu.git/contrib/ivshmem-server.  An example syntax when using the
+shared memory server is:
+
+@example
+qemu -device ivshmem,size=<size in format accepted by -m>[,chardev=<id>]
+                        [,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
+qemu -chardev socket,path=<path>,id=<id>
+@end example
+
+When using the server, the guest will be assigned a VM ID (>=0) that allows guests
+using the same server to communicate via interrupts.  Guests can read their
+VM ID from a device register (see example code).  Since receiving the shared
+memory region from the server is asynchronous, there is a (small) chance the
+guest may boot before the shared memory is attached.  To allow an application
+to ensure shared memory is attached, the VM ID register will return -1 (an
+invalid VM ID) until the memory is attached.  Once the shared memory is
+attached, the register will return the guest's valid VM ID.  With these
+semantics, the guest application can check that the shared memory is attached
+to the guest before proceeding.
+
+The @option{role} argument can be set to either master or peer and affects
+how the shared memory is migrated.  With @option{role=master}, the guest
+holds the canonical copy of the shared memory and copies it to the
+destination host on migration.  With @option{role=peer}, the shared memory is
+not copied on migration; the guest attaches to the region already present on
+the destination host.  Only one guest should be specified as the master.
+
 @node direct_linux_boot
 @section Direct Linux Boot
 
-- 
1.6.3.2.198.g6096d



* [Qemu-devel] [PATCH v6 5/6] Inter-VM shared memory PCI device
@ 2010-06-04 21:45           ` Cam Macdonell
From: Cam Macdonell @ 2010-06-04 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: Cam Macdonell, kvm

Support an inter-VM shared memory device that maps a shared memory object as a
PCI device in the guest.  This patch also supports interrupts between guests by
communicating over a UNIX domain socket.  This patch applies to the qemu-kvm
repository.

    -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
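
As a rough illustration of the size argument (not part of the patch): the
device accepts a plain number or an M/G suffix, treats a bare number as
megabytes, and requires a power of two because the region backs a PCI BAR.
The sketch below mirrors ivshmem_get_size() from hw/ivshmem.c but returns an
error instead of exiting.  The name parse_ivshmem_size is hypothetical, and
rejecting a size of zero is an assumption of this sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse "<n>", "<n>M" or "<n>G" into a byte count.
 * Returns 0 on success, or -1 for an unknown suffix or a size that is
 * not a non-zero power of two. */
int parse_ivshmem_size(const char *arg, uint64_t *out)
{
    char *end;
    uint64_t value = strtoull(arg, &end, 10);

    switch (*end) {
    case '\0': case 'M': case 'm':
        value <<= 20;   /* a bare number means megabytes, as with -m */
        break;
    case 'G': case 'g':
        value <<= 30;
        break;
    default:
        return -1;
    }

    /* BAR sizes must be a power of two; zero is rejected here as well */
    if (value == 0 || (value & (value - 1)) != 0) {
        return -1;
    }

    *out = value;
    return 0;
}
```

With this helper, "4" and "4M" both yield 4 MB, while "3M" is rejected.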

Interrupts are supported between multiple VMs by connecting each guest to a
shared memory server through a chardev socket.

    -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
           [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
    -chardev socket,path=<path>,id=<id>
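
To make the interrupt path concrete, here is a minimal guest-side sketch of
ringing another VM's doorbell.  The register offsets and the value layout
(destination VM ID in the upper 16 bits, vector in the low 8 bits) are taken
from ivshmem_io_writel() and setup_ioeventfds() in the patch below.  The
function names and the "regs" pointer, which stands for BAR0 however the
guest has mapped it, are hypothetical.

```c
#include <stdint.h>

/* Register offsets within BAR0, matching enum ivshmem_registers. */
enum {
    IVSHMEM_INTR_MASK   = 0,
    IVSHMEM_INTR_STATUS = 4,
    IVSHMEM_IV_POSITION = 8,
    IVSHMEM_DOORBELL    = 12,
};

/* Build the 32-bit Doorbell value: destination VM ID in bits 31..16,
 * vector number in bits 7..0. */
uint32_t ivshmem_doorbell_value(uint16_t dest_vm, uint8_t vector)
{
    return ((uint32_t)dest_vm << 16) | vector;
}

/* Ring the doorbell.  "regs" is assumed to point at the mapped BAR0
 * register region; the write must be a single 32-bit store. */
void ivshmem_ring(volatile uint32_t *regs, uint16_t dest_vm, uint8_t vector)
{
    regs[IVSHMEM_DOORBELL / 4] = ivshmem_doorbell_value(dest_vm, vector);
}
```

Only 32-bit writes are handled by the device; byte and word accesses to the
register region are ignored.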

(shared memory server is qemu.git/contrib/ivshmem-server)
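
For readers following the server protocol, the sketch below summarizes how
ivshmem_read() in this patch interprets each message from the server: a
"long" position, optionally accompanied by a file descriptor passed over the
socket.  It is only a restatement of that function's branches, not code from
the patch; the enum and function names are hypothetical, and "peer_known"
stands for the check that the peer's eventfds array is non-NULL.

```c
/* Hypothetical names for the four cases handled by ivshmem_read(). */
enum ivshmem_msg_kind {
    IVSHMEM_MSG_SHM_FD,       /* posn == -1 with an fd: the shared memory */
    IVSHMEM_MSG_OWN_ID,       /* no fd, unseen posn: our own VM ID */
    IVSHMEM_MSG_PEER_GONE,    /* no fd, known posn: that peer has left */
    IVSHMEM_MSG_PEER_EVENTFD, /* posn >= 0 with an fd: a peer's eventfd */
};

/* Classify one server message.  fd is -1 when no descriptor accompanied
 * the message; peer_known is non-zero if eventfds were already stored
 * for this position. */
enum ivshmem_msg_kind ivshmem_classify(long posn, int fd, int peer_known)
{
    if (fd == -1) {
        if (posn >= 0 && !peer_known) {
            return IVSHMEM_MSG_OWN_ID;
        }
        return IVSHMEM_MSG_PEER_GONE;
    }
    if (posn == -1) {
        return IVSHMEM_MSG_SHM_FD;
    }
    return IVSHMEM_MSG_PEER_EVENTFD;
}
```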

Sample programs and init scripts are in a git repo here:

    www.gitorious.org/nahanni
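
Because the shared memory arrives from the server asynchronously, guest code
should confirm it is attached before touching it.  The IVPosition register
(offset 8 in BAR0) reads as -1 until the memory is attached and as the
guest's VM ID afterwards.  A minimal polling sketch, assuming "regs" points
at the mapped register region; the function name and the bounded retry count
are hypothetical, and real code would sleep or yield between reads.

```c
#include <stdint.h>

#define IVSHMEM_IV_POSITION 8   /* register offset, from hw/ivshmem.c */

/* Poll IVPosition up to max_polls times.  Returns the VM ID once the
 * register stops reading as -1 (all ones), or -1 if it never does. */
int ivshmem_wait_for_vm_id(const volatile uint32_t *regs, unsigned max_polls)
{
    unsigned i;

    for (i = 0; i < max_polls; i++) {
        uint32_t raw = regs[IVSHMEM_IV_POSITION / 4];
        if (raw != 0xffffffffu) {
            return (int)raw;    /* shared memory is attached */
        }
    }
    return -1;                  /* still not attached */
}
```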
---
 Makefile.target |    3 +
 hw/ivshmem.c    |  852 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 qemu-char.c     |    6 +
 qemu-char.h     |    3 +
 qemu-doc.texi   |   43 +++
 5 files changed, 907 insertions(+), 0 deletions(-)
 create mode 100644 hw/ivshmem.c

diff --git a/Makefile.target b/Makefile.target
index c4ba592..4888308 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -202,6 +202,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
 obj-y += rtl8139.o
 obj-y += e1000.o
 
+# Inter-VM PCI shared memory
+obj-y += ivshmem.o
+
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o i8259.o pc.o
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
new file mode 100644
index 0000000..9057612
--- /dev/null
+++ b/hw/ivshmem.c
@@ -0,0 +1,852 @@
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *      Cam Macdonell <cam@cs.ualberta.ca>
+ *
+ * Based On: cirrus_vga.c
+ *          Copyright (c) 2004 Fabrice Bellard
+ *          Copyright (c) 2004 Makoto Suzuki (suzu)
+ *
+ *      and rtl8139.c
+ *          Copyright (c) 2006 Igor Kovalenko
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/io.h>
+#include <sys/ioctl.h>
+#include "hw.h"
+#include "console.h"
+#include "pc.h"
+#include "pci.h"
+#include "sysemu.h"
+
+#include "msix.h"
+#include "qemu-kvm.h"
+#include "libkvm.h"
+
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#define IVSHMEM_IRQFD   0
+#define IVSHMEM_MSI     1
+
+//#define DEBUG_IVSHMEM
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...)        \
+    do {printf("IVSHMEM: " fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+typedef struct Peer {
+    int nb_eventfds;
+    int *eventfds;
+} Peer;
+
+typedef struct EventfdEntry {
+    PCIDevice *pdev;
+    int vector;
+} EventfdEntry;
+
+typedef struct IVShmemState {
+    PCIDevice dev;
+    uint32_t intrmask;
+    uint32_t intrstatus;
+    uint32_t doorbell;
+
+    CharDriverState ** eventfd_chr;
+    CharDriverState * server_chr;
+    int ivshmem_mmio_io_addr;
+
+    pcibus_t mmio_addr;
+    pcibus_t shm_pci_addr;
+    uint64_t ivshmem_offset;
+    uint64_t ivshmem_size; /* size of shared memory region */
+    int shm_fd; /* shared memory file descriptor */
+
+    Peer *peers;
+    int nb_peers; /* how many guests we have space for */
+    int max_peer; /* maximum numbered peer */
+
+    int vm_id;
+    uint32_t vectors;
+    uint32_t features;
+    EventfdEntry *eventfd_table;
+
+    char * shmobj;
+    char * sizearg;
+    char * role;
+} IVShmemState;
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+    IntrMask = 0,
+    IntrStatus = 4,
+    IVPosition = 8,
+    Doorbell = 12,
+};
+
+static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
+    return (ivs->features & (1 << feature));
+}
+
+static inline bool is_power_of_two(uint64_t x) {
+    return (x != 0) && ((x & (x - 1)) == 0);
+}
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+                    pcibus_t addr, pcibus_t size, int type)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
+
+    s->shm_pci_addr = addr;
+
+    if (s->ivshmem_offset > 0) {
+        cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
+                                                            s->ivshmem_offset);
+        if (s->role && strncmp(s->role, "peer", 4) == 0) {
+            IVSHMEM_DPRINTF("marking pages no migrate\n");
+            cpu_mark_pages_no_migrate(s->ivshmem_offset, s->ivshmem_size);
+        }
+    }
+
+    IVSHMEM_DPRINTF("guest pci addr = %u, guest h/w addr = %u, size = %u\n",
+                (uint32_t)addr, (uint32_t)s->ivshmem_offset, (uint32_t)size);
+
+}
+
+/* accessing registers - based on rtl8139 */
+static void ivshmem_update_irq(IVShmemState *s, int val)
+{
+    int isr;
+    isr = (s->intrstatus & s->intrmask) & 0xffffffff;
+
+    /* don't print ISR resets */
+    if (isr) {
+        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
+           isr ? 1 : 0, s->intrstatus, s->intrmask);
+    }
+
+    qemu_set_irq(s->dev.irq[0], (isr != 0));
+}
+
+static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
+
+    s->intrmask = val;
+
+    ivshmem_update_irq(s, val);
+}
+
+static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrmask;
+
+    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
+
+    return ret;
+}
+
+static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
+
+    s->intrstatus = val;
+
+    ivshmem_update_irq(s, val);
+    return;
+}
+
+static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrstatus;
+
+    /* reading ISR clears all interrupts */
+    s->intrstatus = 0;
+
+    ivshmem_update_irq(s, 0);
+
+    return ret;
+}
+
+static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val)
+{
+
+    IVSHMEM_DPRINTF("We shouldn't be writing words\n");
+}
+
+static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVShmemState *s = opaque;
+
+    u_int64_t write_one = 1;
+    u_int16_t dest = val >> 16;
+    u_int16_t vector = val & 0xff;
+
+    addr &= 0xfc;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ivshmem_IntrMask_write(s, val);
+            break;
+
+        case IntrStatus:
+            ivshmem_IntrStatus_write(s, val);
+            break;
+
+        case Doorbell:
+            /* check that dest VM ID is reasonable */
+            if ((dest < 0) || (dest > s->max_peer)) {
+                IVSHMEM_DPRINTF("Invalid destination VM ID (%d)\n", dest);
+                break;
+            }
+
+            /* check doorbell range */
+            if ((vector >= 0) && (vector < s->peers[dest].nb_eventfds)) {
+                IVSHMEM_DPRINTF("Writing %ld to VM %d on vector %d\n",
+                                                    write_one, dest, vector);
+                if (write(s->peers[dest].eventfds[vector],
+                                                    &(write_one), 8) != 8) {
+                    IVSHMEM_DPRINTF("error writing to eventfd\n");
+                }
+            }
+            break;
+        default:
+            IVSHMEM_DPRINTF("Invalid write to register offset 0x%x\n", addr);
+    }
+}
+
+static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVSHMEM_DPRINTF("We shouldn't be writing bytes\n");
+}
+
+static uint32_t ivshmem_io_readw(void *opaque, uint8_t addr)
+{
+
+    IVSHMEM_DPRINTF("We shouldn't be reading words\n");
+    return 0;
+}
+
+static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr)
+{
+
+    IVShmemState *s = opaque;
+    uint32_t ret;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ret = ivshmem_IntrMask_read(s);
+            break;
+
+        case IntrStatus:
+            ret = ivshmem_IntrStatus_read(s);
+            break;
+
+        case IVPosition:
+            /* return my VM ID if the memory is mapped */
+            if (s->shm_fd > 0) {
+                ret = s->vm_id;
+            } else {
+                ret = -1;
+            }
+            break;
+
+        default:
+            IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr);
+            ret = 0;
+    }
+
+    return ret;
+}
+
+static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr)
+{
+    IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
+
+    return 0;
+}
+
+static void ivshmem_mmio_writeb(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writeb(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writew(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writew(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writel(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writel(opaque, addr & 0xFF, val);
+}
+
+static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr)
+{
+    return ivshmem_io_readb(opaque, addr & 0xFF);
+}
+
+static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readw(opaque, addr & 0xFF);
+    return val;
+}
+
+static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readl(opaque, addr & 0xFF);
+    return val;
+}
+
+static CPUReadMemoryFunc *ivshmem_mmio_read[3] = {
+    ivshmem_mmio_readb,
+    ivshmem_mmio_readw,
+    ivshmem_mmio_readl,
+};
+
+static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = {
+    ivshmem_mmio_writeb,
+    ivshmem_mmio_writew,
+    ivshmem_mmio_writel,
+};
+
+static void ivshmem_receive(void *opaque, const uint8_t *buf, int size)
+{
+    IVShmemState *s = opaque;
+
+    ivshmem_IntrStatus_write(s, *buf);
+
+    IVSHMEM_DPRINTF("ivshmem_receive 0x%02x\n", *buf);
+}
+
+static int ivshmem_can_receive(void * opaque)
+{
+    return 8;
+}
+
+static void ivshmem_event(void *opaque, int event)
+{
+    IVSHMEM_DPRINTF("ivshmem_event %d\n", event);
+}
+
+static void fake_irqfd(void *opaque, const uint8_t *buf, int size) {
+
+    EventfdEntry *entry = opaque;
+    PCIDevice *pdev = entry->pdev;
+
+    IVSHMEM_DPRINTF("fake irqfd on vector %p %d\n", pdev, entry->vector);
+    msix_notify(pdev, entry->vector);
+}
+
+static CharDriverState* create_eventfd_chr_device(void * opaque, int eventfd,
+                                                                    int vector)
+{
+    /* create an event character device based on the passed eventfd */
+    IVShmemState *s = opaque;
+    CharDriverState * chr;
+
+    chr = qemu_chr_open_eventfd(eventfd);
+
+    if (chr == NULL) {
+        IVSHMEM_DPRINTF("creating eventfd for eventfd %d failed\n", eventfd);
+        exit(-1);
+    }
+
+    /* if MSI is supported we need multiple interrupts */
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        s->eventfd_table[vector].pdev = &s->dev;
+        s->eventfd_table[vector].vector = vector;
+
+        qemu_chr_add_handlers(chr, ivshmem_can_receive, fake_irqfd,
+                      ivshmem_event, &s->eventfd_table[vector]);
+    } else {
+        qemu_chr_add_handlers(chr, ivshmem_can_receive, ivshmem_receive,
+                      ivshmem_event, s);
+    }
+
+    return chr;
+
+}
+
+static int check_shm_size(IVShmemState *s, int fd) {
+    /* check that the guest isn't going to try and map more memory than
+     * the object has allocated; return -1 to indicate error */
+
+    struct stat buf;
+
+    fstat(fd, &buf);
+
+    if (s->ivshmem_size > buf.st_size) {
+        fprintf(stderr, "IVSHMEM ERROR: Requested memory size greater");
+        fprintf(stderr, " than shared object size (%ld > %ld)\n",
+                                          s->ivshmem_size, buf.st_size);
+        return -1;
+    } else {
+        return 0;
+    }
+}
+
+/* create the shared memory BAR when we are not using the server, so we can
+ * create the BAR and map the memory immediately */
+static void create_shared_memory_BAR(IVShmemState *s, int fd) {
+
+    void * ptr;
+
+    s->shm_fd = fd;
+
+    ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+
+    s->ivshmem_offset = qemu_ram_map(s->ivshmem_size, ptr);
+
+    /* region for shared memory */
+    pci_register_bar(&s->dev, 2, s->ivshmem_size,
+                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
+}
+
+static void close_guest_eventfds(IVShmemState *s, int posn)
+{
+    int i, guest_curr_max;
+
+    guest_curr_max = s->peers[posn].nb_eventfds;
+
+    for (i = 0; i < guest_curr_max; i++)
+        close(s->peers[posn].eventfds[i]);
+
+    qemu_free(s->peers[posn].eventfds);
+    s->peers[posn].nb_eventfds = 0;
+}
+
+static void setup_ioeventfds(IVShmemState *s) {
+
+    int i, j;
+
+    for (i = 0; i <= s->max_peer; i++) {
+        for (j = 0; j < s->peers[i].nb_eventfds; j++) {
+            kvm_set_ioeventfd_mmio_long(s->peers[i].eventfds[j],
+                    s->mmio_addr + Doorbell, (i << 16) | j, 1);
+        }
+    }
+
+    /* setup irqfd for this VM's eventfds */
+    for (i = 0; i < s->vectors; i++) {
+        kvm_set_irqfd(s->dev.msix_irq_entries[i].gsi,
+                        s->peers[s->vm_id].eventfds[i], 1);
+    }
+}
+
+
+/* this function increases the dynamic storage needed to store data about
+ * other guests */
+static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
+
+    int j, old_nb_alloc;
+
+    old_nb_alloc = s->nb_peers;
+
+    while (new_min_size >= s->nb_peers)
+        s->nb_peers = s->nb_peers * 2;
+
+    IVSHMEM_DPRINTF("bumping storage to %d guests\n", s->nb_peers);
+    s->peers = qemu_realloc(s->peers, s->nb_peers * sizeof(Peer));
+
+    if (s->peers == NULL) {
+        fprintf(stderr, "Allocation error - exiting\n");
+        exit(1);
+    }
+
+    /* zero out new pointers */
+    for (j = old_nb_alloc; j < s->nb_peers; j++) {
+        s->peers[j].eventfds = NULL;
+        s->peers[j].nb_eventfds = 0;
+    }
+}
+
+static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
+{
+    IVShmemState *s = opaque;
+    int incoming_fd, tmp_fd;
+    int guest_curr_max;
+    long incoming_posn;
+
+    memcpy(&incoming_posn, buf, sizeof(long));
+    /* pick off s->server_chr->msgfd and store it, posn should accompany msg */
+    tmp_fd = qemu_chr_get_msgfd(s->server_chr);
+    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
+
+    /* make sure we have enough space for this guest */
+    if (incoming_posn >= s->nb_peers) {
+        increase_dynamic_storage(s, incoming_posn);
+    }
+
+    if (tmp_fd == -1) {
+        /* if posn is non-negative and unseen before then this is our posn */
+        if ((incoming_posn >= 0) && (s->peers[incoming_posn].eventfds == NULL)) {
+            /* receive our posn */
+            s->vm_id = incoming_posn;
+            return;
+        } else {
+            /* otherwise an fd == -1 means an existing guest has gone away */
+            IVSHMEM_DPRINTF("posn %ld has gone away\n", incoming_posn);
+            close_guest_eventfds(s, incoming_posn);
+            return;
+        }
+    }
+
+    /* because of the implementation of get_msgfd, we need a dup */
+    incoming_fd = dup(tmp_fd);
+
+    if (incoming_fd == -1) {
+        fprintf(stderr, "could not allocate file descriptor %s\n",
+                                                            strerror(errno));
+        return;
+    }
+
+    /* if the position is -1, then it's the shared memory region fd */
+    if (incoming_posn == -1) {
+
+        void * map_ptr;
+
+        s->max_peer = 0;
+
+        if (check_shm_size(s, incoming_fd) == -1) {
+            exit(-1);
+        }
+
+        /* mmap the region and map into the BAR2 */
+        map_ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED,
+                                                                incoming_fd, 0);
+        s->ivshmem_offset = qemu_ram_map(s->ivshmem_size, map_ptr);
+
+        IVSHMEM_DPRINTF("guest pci addr = %u, guest h/w addr = %u, size = %u\n",
+                        (uint32_t)s->shm_pci_addr, (uint32_t)s->ivshmem_offset,
+                        (uint32_t)s->ivshmem_size);
+
+        if (s->shm_pci_addr > 0) {
+            /* map memory into BAR2 */
+            cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
+                                                            s->ivshmem_offset);
+            if (s->role && strncmp(s->role, "peer", 4) == 0) {
+                IVSHMEM_DPRINTF("marking pages no migrate\n");
+                cpu_mark_pages_no_migrate(s->ivshmem_offset, s->ivshmem_size);
+            }
+
+        }
+
+        /* only store the fd if it is successfully mapped */
+        s->shm_fd = incoming_fd;
+
+        return;
+    }
+
+    /* each guest has an array of eventfds, and we keep track of how many
+     * eventfds have been received for each guest */
+    guest_curr_max = s->peers[incoming_posn].nb_eventfds;
+    if (guest_curr_max == 0) {
+        /* one eventfd per MSI vector */
+        s->peers[incoming_posn].eventfds = (int *) qemu_malloc(s->vectors *
+                                                                sizeof(int));
+    }
+
+    /* this is an eventfd for a particular guest VM */
+    IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn, guest_curr_max,
+                                                                incoming_fd);
+    s->peers[incoming_posn].eventfds[guest_curr_max] = incoming_fd;
+
+    /* increment count for particular guest */
+    s->peers[incoming_posn].nb_eventfds++;
+
+    /* keep track of the maximum VM ID */
+    if (incoming_posn > s->max_peer) {
+        s->max_peer = incoming_posn;
+    }
+
+    if (incoming_posn == s->vm_id) {
+        int vector = guest_curr_max;
+        if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+            /* initialize char device for callback
+             * if this is one of my eventfds */
+            s->eventfd_chr[vector] = create_eventfd_chr_device(s,
+                       s->peers[s->vm_id].eventfds[vector], vector);
+        }
+    }
+
+    return;
+}
+
+static void ivshmem_reset(DeviceState *d)
+{
+    return;
+}
+
+static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
+                       pcibus_t addr, pcibus_t size, int type)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
+
+    s->mmio_addr = addr;
+    cpu_register_physical_memory(addr + 0, 0x400, s->ivshmem_mmio_io_addr);
+
+    /* ioeventfd and irqfd are enabled together,
+     * so the flag IRQFD refers to both */
+    if (ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+        setup_ioeventfds(s);
+    }
+}
+
+static uint64_t ivshmem_get_size(IVShmemState * s) {
+
+    uint64_t value;
+    char *ptr;
+
+    value = strtoul(s->sizearg, &ptr, 10);
+    switch (*ptr) {
+        case 0: case 'M': case 'm':
+            value <<= 20;
+            break;
+        case 'G': case 'g':
+            value <<= 30;
+            break;
+        default:
+            fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg);
+            exit(1);
+    }
+
+    /* BARs must be a power of 2 */
+    if (!is_power_of_two(value)) {
+        fprintf(stderr, "ivshmem: size must be power of 2\n");
+        exit(1);
+    }
+
+    return value;
+
+}
+
+static void ivshmem_setup_msi(IVShmemState * s) {
+
+    int i;
+
+    /* allocate the MSI-X vectors */
+
+    if (!msix_init(&s->dev, s->vectors, 1, 0)) {
+        pci_register_bar(&s->dev, 1,
+                         msix_bar_size(&s->dev),
+                         PCI_BASE_ADDRESS_SPACE_MEMORY,
+                         msix_mmio_map);
+        IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
+    } else {
+        IVSHMEM_DPRINTF("msix initialization failed\n");
+    }
+
+    /* 'activate' the vectors */
+    for (i = 0; i < s->vectors; i++) {
+        msix_vector_use(&s->dev, i);
+    }
+
+    /* if IRQFDs are not supported, we'll have to trigger the interrupts
+     * via Qemu char devices */
+    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+        /* for handling interrupts when IRQFD is not available */
+        s->eventfd_table = qemu_mallocz(s->vectors * sizeof(EventfdEntry));
+    }
+}
+
+static void ivshmem_save(QEMUFile* f, void *opaque)
+{
+    IVShmemState *proxy = opaque;
+
+    IVSHMEM_DPRINTF("ivshmem_save\n");
+    pci_device_save(&proxy->dev, f);
+
+    if (ivshmem_has_feature(proxy, IVSHMEM_MSI)) {
+        msix_save(&proxy->dev, f);
+    } else {
+        qemu_put_be32(f, proxy->intrstatus);
+        qemu_put_be32(f, proxy->intrmask);
+    }
+
+}
+
+static int ivshmem_load(QEMUFile* f, void *opaque, int version_id)
+{
+    IVSHMEM_DPRINTF("ivshmem_load\n");
+
+    IVShmemState *proxy = opaque;
+    int ret, i;
+
+    ret = pci_device_load(&proxy->dev, f);
+    if (ret) {
+        return ret;
+    }
+
+    if (ivshmem_has_feature(proxy, IVSHMEM_MSI)) {
+        msix_load(&proxy->dev, f);
+        for (i = 0; i < proxy->vectors; i++) {
+            msix_vector_use(&proxy->dev, i);
+        }
+    } else {
+        proxy->intrstatus = qemu_get_be32(f);
+        proxy->intrmask = qemu_get_be32(f);
+    }
+
+    return 0;
+}
+
+static int pci_ivshmem_init(PCIDevice *dev)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
+    uint8_t *pci_conf;
+
+    if (s->sizearg == NULL)
+        s->ivshmem_size = 4 << 20; /* 4 MB default */
+    else {
+        s->ivshmem_size = ivshmem_get_size(s);
+    }
+
+    register_savevm("ivshmem", 0, 0, ivshmem_save, ivshmem_load, dev);
+
+    /* IRQFD requires MSI */
+    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) &&
+        !ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n");
+        exit(1);
+    }
+
+    /* check that role is reasonable */
+    if (s->role && !((strncmp(s->role, "peer", 5) == 0) ||
+                        (strncmp(s->role, "master", 7) == 0))) {
+        fprintf(stderr, "ivshmem: 'role' must be 'peer' or 'master'\n");
+        exit(1);
+    }
+
+    pci_conf = s->dev.config;
+    pci_conf[0x00] = 0xf4; /* Qumranet vendor ID 0x1af4 */
+    pci_conf[0x01] = 0x1a;
+    pci_conf[0x02] = 0x10;
+    pci_conf[0x03] = 0x11;
+    pci_conf[0x04] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
+    pci_conf[0x0a] = 0x00; /* RAM controller */
+    pci_conf[0x0b] = 0x05;
+    pci_conf[0x0e] = 0x00; /* header_type */
+
+    pci_conf[PCI_INTERRUPT_PIN] = 1;
+
+    s->shm_pci_addr = 0;
+    s->ivshmem_offset = 0;
+    s->shm_fd = 0;
+
+    s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read,
+                                    ivshmem_mmio_write, s);
+    /* region for registers*/
+    pci_register_bar(&s->dev, 0, 0x400,
+                           PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_mmio_map);
+
+    if ((s->server_chr != NULL) &&
+                        (strncmp(s->server_chr->filename, "unix:", 5) == 0)) {
+        /* if we get a UNIX socket as the parameter we will talk
+         * to the ivshmem server to receive the memory region */
+
+        IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
+                                                    s->server_chr->filename);
+
+        if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+            ivshmem_setup_msi(s);
+        }
+
+        /* we allocate enough space for 16 guests and grow as needed */
+        s->nb_peers = 16;
+        s->vm_id = -1;
+
+        /* allocate/initialize space for interrupt handling */
+        s->peers = qemu_mallocz(s->nb_peers * sizeof(Peer));
+
+        pci_register_bar(&s->dev, 2, s->ivshmem_size,
+                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
+
+        s->eventfd_chr = (CharDriverState **) qemu_mallocz(s->vectors *
+                                                sizeof(CharDriverState *));
+
+        qemu_chr_add_handlers(s->server_chr, ivshmem_can_receive, ivshmem_read,
+                     ivshmem_event, s);
+    } else {
+        /* just map the file immediately, we're not using a server */
+        int fd;
+
+        if (s->shmobj == NULL) {
+            fprintf(stderr, "Must specify 'chardev' or 'shm' to ivshmem\n");
+            exit(1);
+        }
+
+        IVSHMEM_DPRINTF("using shm_open (shm object = %s)\n", s->shmobj);
+
+        /* try opening with O_EXCL; if that succeeds the object is new, so
+         * size it (zero-filled) with ftruncate */
+        if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR|O_EXCL,
+                        S_IRWXU|S_IRWXG|S_IRWXO)) >= 0) {
+            /* truncate the file to the length of the PCI device's memory */
+            if (ftruncate(fd, s->ivshmem_size) != 0) {
+                fprintf(stderr, "kvm_ivshmem: could not truncate shared file\n");
+            }
+
+        } else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR,
+                        S_IRWXU|S_IRWXG|S_IRWXO)) < 0) {
+            fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
+            exit(-1);
+
+        }
+
+        if (check_shm_size(s, fd) == -1) {
+            exit(-1);
+        }
+
+        create_shared_memory_BAR(s, fd);
+
+    }
+
+    return 0;
+}
+
+static int pci_ivshmem_uninit(PCIDevice *dev)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
+
+    cpu_unregister_io_memory(s->ivshmem_mmio_io_addr);
+
+    return 0;
+}
+
+static PCIDeviceInfo ivshmem_info = {
+    .qdev.name  = "ivshmem",
+    .qdev.size  = sizeof(IVShmemState),
+    .qdev.reset = ivshmem_reset,
+    .init       = pci_ivshmem_init,
+    .exit       = pci_ivshmem_uninit,
+    .qdev.props = (Property[]) {
+        DEFINE_PROP_CHR("chardev", IVShmemState, server_chr),
+        DEFINE_PROP_STRING("size", IVShmemState, sizearg),
+        DEFINE_PROP_UINT32("vectors", IVShmemState, vectors, 1),
+        DEFINE_PROP_BIT("irqfd", IVShmemState, features, IVSHMEM_IRQFD, false),
+        DEFINE_PROP_BIT("msi", IVShmemState, features, IVSHMEM_MSI, true),
+        DEFINE_PROP_STRING("shm", IVShmemState, shmobj),
+        DEFINE_PROP_STRING("role", IVShmemState, role),
+        DEFINE_PROP_END_OF_LIST(),
+    }
+};
+
+static void ivshmem_register_devices(void)
+{
+    pci_qdev_register(&ivshmem_info);
+}
+
+device_init(ivshmem_register_devices)
diff --git a/qemu-char.c b/qemu-char.c
index ac65a1c..b2e50d0 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2093,6 +2093,12 @@ static void tcp_chr_read(void *opaque)
     }
 }
 
+CharDriverState *qemu_chr_open_eventfd(int eventfd){
+
+    return qemu_chr_open_fd(eventfd, eventfd);
+
+}
+
 static void tcp_chr_connect(void *opaque)
 {
     CharDriverState *chr = opaque;
diff --git a/qemu-char.h b/qemu-char.h
index e3a0783..6ea01ba 100644
--- a/qemu-char.h
+++ b/qemu-char.h
@@ -94,6 +94,9 @@ void qemu_chr_info_print(Monitor *mon, const QObject *ret_data);
 void qemu_chr_info(Monitor *mon, QObject **ret_data);
 CharDriverState *qemu_chr_find(const char *name);
 
+/* add an eventfd to the qemu devices that are polled */
+CharDriverState *qemu_chr_open_eventfd(int eventfd);
+
 extern int term_escape_char;
 
 /* async I/O support */
diff --git a/qemu-doc.texi b/qemu-doc.texi
index 6647b7b..24f8748 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -706,6 +706,49 @@ Using the @option{-net socket} option, it is possible to make VLANs
 that span several QEMU instances. See @ref{sec_invocation} to have a
 basic example.
 
+@section Other Devices
+
+@subsection Inter-VM Shared Memory device
+
+With KVM enabled on a Linux host, a shared memory device is available.  The
+device maps a POSIX shared memory region into the guest as a PCI device,
+which enables zero-copy communication between applications running in the
+guests.  The basic syntax is:
+
+@example
+qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
+@end example
+
+If desired, interrupts can be sent between guest VMs accessing the same shared
+memory region.  Interrupt support requires a shared memory server and a
+chardev socket to connect to it.  The code for the shared memory server is in
+qemu.git/contrib/ivshmem-server.  An example syntax when using the shared
+memory server is:
+
+@example
+qemu -device ivshmem,size=<size in format accepted by -m>[,chardev=<id>]
+                        [,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
+qemu -chardev socket,path=<path>,id=<id>
+@end example
+
+When using the server, the guest will be assigned a VM ID (>=0) that allows guests
+using the same server to communicate via interrupts.  Guests can read their
+VM ID from a device register (see example code).  Since receiving the shared
+memory region from the server is asynchronous, there is a (small) chance the
+guest may boot before the shared memory is attached.  To allow an application
+to ensure shared memory is attached, the VM ID register will return -1 (an
+invalid VM ID) until the memory is attached.  Once the shared memory is
+attached, the VM ID will return the guest's valid VM ID.  With these semantics,
+the guest application can check to ensure the shared memory is attached to the
+guest before proceeding.
+
+The @option{role} argument can be set to either master or peer and affects
+how the shared memory is migrated.  With @option{role=master}, the guest
+copies the shared memory to the destination host on migration.  With
+@option{role=peer}, the shared memory is not copied on migration; the guest
+attaches to whatever shared memory exists on the destination host.  Only one
+guest should be specified as the master.
+
 @node direct_linux_boot
 @section Direct Linux Boot
 
-- 
1.6.3.2.198.g6096d
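
The qemu-doc.texi text in this patch says the VM ID register reads -1 until
the shared memory is attached.  A guest application could wait for that along
the lines of the sketch below.  This is only an illustration: wait_for_vm_id(),
the vm_id_reader callback and the fake_device stand-in are hypothetical names,
not part of the patch, and a real application would read the register through
whatever guest driver or mapping it uses.

```c
#include <stdint.h>

/* Hypothetical callback: returns the current value of the VM ID register. */
typedef int32_t (*vm_id_reader)(void *ctx);

/* Poll until the device reports a valid (>= 0) VM ID.
 * Returns the VM ID, or -1 if it is still unattached after max_attempts. */
static int32_t wait_for_vm_id(vm_id_reader read_id, void *ctx, int max_attempts)
{
    int attempt;

    for (attempt = 0; attempt < max_attempts; attempt++) {
        int32_t id = read_id(ctx);

        if (id >= 0) {
            return id;
        }
        /* a real application would sleep or yield here before retrying */
    }

    return -1;
}

/* Stand-in for the device, for illustration only: it reports -1 for the
 * first pending_reads reads and the assigned VM ID afterwards. */
struct fake_device {
    int pending_reads;
    int32_t vm_id;
};

static int32_t fake_read_id(void *ctx)
{
    struct fake_device *d = ctx;

    if (d->pending_reads > 0) {
        d->pending_reads--;
        return -1;
    }

    return d->vm_id;
}
```

Passing the register read as a callback keeps the retry logic separate from how
the register is actually reached, so the same loop works with a real mapping or
with a stand-in like fake_read_id().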

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory
  2010-06-04 21:45           ` [Qemu-devel] " Cam Macdonell
@ 2010-06-04 21:45             ` Cam Macdonell
  -1 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-04 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm, Cam Macdonell

This code is a standalone server that passes file descriptors for the shared
memory region and eventfds to support interrupts between guests using inter-VM
shared memory.
---
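
Note on the eventfds: the server creates them with eventfd(0, 0), and since v4
of the series only the value 1 is written to them.  The sketch below shows what
that means at the file descriptor level on Linux.  ring_doorbell() and
drain_doorbell() are hypothetical helper names used for illustration; they are
not functions from this series.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Signal the peer by adding 1 to the eventfd counter.
 * Returns 0 on success, -1 on error. */
static int ring_doorbell(int efd)
{
    uint64_t one = 1;

    return write(efd, &one, sizeof(one)) == (ssize_t) sizeof(one) ? 0 : -1;
}

/* Collect pending signals.  For an eventfd created without EFD_SEMAPHORE,
 * read() returns the accumulated counter and resets it to zero.  On a
 * blocking descriptor this waits until the counter is non-zero.
 * Returns the number of signals, or -1 on error. */
static int64_t drain_doorbell(int efd)
{
    uint64_t count;

    if (read(efd, &count, sizeof(count)) != (ssize_t) sizeof(count)) {
        return -1;
    }

    return (int64_t) count;
}
```

Because the counter accumulates, several signals sent before the receiver runs
are delivered as one read whose value is the number of signals.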
 contrib/ivshmem-server/Makefile         |   16 ++
 contrib/ivshmem-server/README           |   30 +++
 contrib/ivshmem-server/ivshmem_server.c |  353 +++++++++++++++++++++++++++++++
 contrib/ivshmem-server/send_scm.c       |  208 ++++++++++++++++++
 contrib/ivshmem-server/send_scm.h       |   19 ++
 5 files changed, 626 insertions(+), 0 deletions(-)
 create mode 100644 contrib/ivshmem-server/Makefile
 create mode 100644 contrib/ivshmem-server/README
 create mode 100644 contrib/ivshmem-server/ivshmem_server.c
 create mode 100644 contrib/ivshmem-server/send_scm.c
 create mode 100644 contrib/ivshmem-server/send_scm.h

diff --git a/contrib/ivshmem-server/Makefile b/contrib/ivshmem-server/Makefile
new file mode 100644
index 0000000..da40ffa
--- /dev/null
+++ b/contrib/ivshmem-server/Makefile
@@ -0,0 +1,16 @@
+CC = gcc
+CFLAGS = -O3 -Wall -Werror
+LIBS = -lrt
+
+# a very simple makefile to build the inter-VM shared memory server
+
+all: ivshmem_server
+
+.c.o:
+	$(CC) $(CFLAGS) -c $^ -o $@
+
+ivshmem_server: ivshmem_server.o send_scm.o
+	$(CC) $(CFLAGS) -o $@ $^ $(LIBS)
+
+clean:
+	rm -f *.o ivshmem_server
diff --git a/contrib/ivshmem-server/README b/contrib/ivshmem-server/README
new file mode 100644
index 0000000..b1fc2a2
--- /dev/null
+++ b/contrib/ivshmem-server/README
@@ -0,0 +1,30 @@
+Using the ivshmem shared memory server
+--------------------------------------
+
+This server is only supported on Linux.
+
+To use the shared memory server, first compile it.  Running 'make' should
+accomplish this.  An executable named 'ivshmem_server' will be built.
+
+To display the options, run:
+
+./ivshmem_server -h
+
+Options
+-------
+
+    -h  print help message
+
+    -p <path on host>
+        unix socket to listen on.  The qemu-kvm chardev needs to connect on
+        this socket. (default: '/tmp/ivshmem_socket')
+
+    -s <string>
+        name of the POSIX shared memory object to create (default: 'ivshmem')
+
+    -m <#>
+        size of the POSIX object in MB; takes an optional M or G suffix (default: 1)
+
+    -n <#>
+        number of eventfds for each guest.  This number must match the
+        'vectors' argument passed to the ivshmem device. (default: 1)
diff --git a/contrib/ivshmem-server/ivshmem_server.c b/contrib/ivshmem-server/ivshmem_server.c
new file mode 100644
index 0000000..e0a7b98
--- /dev/null
+++ b/contrib/ivshmem-server/ivshmem_server.c
@@ -0,0 +1,353 @@
+/*
+ * A stand-alone shared memory server for inter-VM shared memory for KVM
+*/
+
+#include <errno.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <sys/select.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "send_scm.h"
+
+#define DEFAULT_SOCK_PATH "/tmp/ivshmem_socket"
+#define DEFAULT_SHM_OBJ "ivshmem"
+
+#define DEBUG 1
+
+typedef struct server_state {
+    vmguest_t *live_vms;
+    int nr_allocated_vms;
+    int shm_size;
+    long live_count;
+    long total_count;
+    int shm_fd;
+    char * path;
+    char * shmobj;
+    int maxfd, conn_socket;
+    long msi_vectors;
+} server_state_t;
+
+void usage(char const *prg);
+int find_set(fd_set * readset, int max);
+void print_vec(server_state_t * s, const char * c);
+
+void add_new_guest(server_state_t * s);
+void parse_args(int argc, char **argv, server_state_t * s);
+int create_listening_socket(char * path);
+
+int main(int argc, char ** argv)
+{
+    fd_set readset;
+    server_state_t * s;
+
+    s = (server_state_t *)calloc(1, sizeof(server_state_t));
+
+    s->live_count = 0;
+    s->total_count = 0;
+    parse_args(argc, argv, s);
+
+    /* open shared memory file  */
+    if ((s->shm_fd = shm_open(s->shmobj, O_CREAT|O_RDWR, S_IRWXU)) < 0)
+    {
+        fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
+        exit(-1);
+    }
+
+    ftruncate(s->shm_fd, s->shm_size);
+
+    s->conn_socket = create_listening_socket(s->path);
+
+    s->maxfd = s->conn_socket;
+
+    for(;;) {
+        int ret, handle, i;
+        char buf[1024];
+
+        print_vec(s, "vm_sockets");
+
+        FD_ZERO(&readset);
+        /* always watch the listening socket for new connections */
+        FD_SET(s->conn_socket, &readset);
+        for (i = 0; i < s->total_count; i++) {
+            if (s->live_vms[i].alive != 0) {
+                FD_SET(s->live_vms[i].sockfd, &readset);
+            }
+        }
+
+        printf("\nWaiting (maxfd = %d)\n", s->maxfd);
+
+        ret = select(s->maxfd + 1, &readset, NULL, NULL, NULL);
+
+        if (ret == -1) {
+            perror("select()");
+        }
+
+        handle = find_set(&readset, s->maxfd + 1);
+        if (handle == -1) continue;
+
+        if (handle == s->conn_socket) {
+
+            printf("[NC] new connection\n");
+            FD_CLR(s->conn_socket, &readset);
+
+            /* total_count is equal to the new guest's VM ID */
+            add_new_guest(s);
+
+            /* update the maximum file descriptor number */
+            s->maxfd = s->live_vms[s->total_count - 1].sockfd > s->maxfd ?
+                            s->live_vms[s->total_count - 1].sockfd : s->maxfd;
+
+            s->live_count++;
+            printf("Live_count is %ld\n", s->live_count);
+
+        } else {
+            /* then we have received a disconnection */
+            int recv_ret;
+            long i, j;
+            long deadposn = -1;
+
+            recv_ret = recv(handle, buf, 1, 0);
+
+            printf("[DC] recv returned %d\n", recv_ret);
+
+            /* find the dead VM in our list and mark it as not alive */
+            for (i = 0; i < s->total_count; i++) {
+                if (s->live_vms[i].sockfd == handle) {
+                    deadposn = i;
+                    s->live_vms[i].alive = 0;
+                    close(s->live_vms[i].sockfd);
+
+                    for (j = 0; j < s->msi_vectors; j++) {
+                        close(s->live_vms[i].efd[j]);
+                    }
+
+                    free(s->live_vms[i].efd);
+                    s->live_vms[i].sockfd = -1;
+                    break;
+                }
+            }
+
+            for (j = 0; j < s->total_count; j++) {
+                /* update remaining clients that one client has left/died */
+                if (s->live_vms[j].alive) {
+                    printf("[UD] sending kill of fd[%ld] to %ld\n",
+                                                                deadposn, j);
+                    sendKill(s->live_vms[j].sockfd, deadposn, sizeof(deadposn));
+                }
+            }
+
+            s->live_count--;
+
+            /* close the socket for the departed VM */
+            close(handle);
+        }
+
+    }
+
+    return 0;
+}
+
+void add_new_guest(server_state_t * s) {
+
+    struct sockaddr_un remote;
+    socklen_t t = sizeof(remote);
+    long i, j;
+    int vm_sock;
+    long new_posn;
+    long neg1 = -1;
+
+    vm_sock = accept(s->conn_socket, (struct sockaddr *)&remote, &t);
+
+    if ( vm_sock == -1 ) {
+        perror("accept");
+        exit(1);
+    }
+
+    new_posn = s->total_count;
+
+    if (new_posn == s->nr_allocated_vms) {
+        printf("increasing vm slots\n");
+        s->nr_allocated_vms = s->nr_allocated_vms * 2;
+        if (s->nr_allocated_vms < 16)
+            s->nr_allocated_vms = 16;
+        s->live_vms = realloc(s->live_vms,
+                    s->nr_allocated_vms * sizeof(vmguest_t));
+
+        if (s->live_vms == NULL) {
+            fprintf(stderr, "realloc failed - quitting\n");
+            exit(-1);
+        }
+    }
+
+    s->live_vms[new_posn].posn = new_posn;
+    printf("[NC] Live_vms[%ld]\n", new_posn);
+    s->live_vms[new_posn].efd = (int *) malloc(s->msi_vectors * sizeof(int));
+    for (i = 0; i < s->msi_vectors; i++) {
+        s->live_vms[new_posn].efd[i] = eventfd(0, 0);
+        printf("\tefd[%ld] = %d\n", i, s->live_vms[new_posn].efd[i]);
+    }
+    s->live_vms[new_posn].sockfd = vm_sock;
+    s->live_vms[new_posn].alive = 1;
+
+
+    sendPosition(vm_sock, new_posn);
+    sendUpdate(vm_sock, neg1, sizeof(long), s->shm_fd);
+    printf("[NC] trying to send fds to new connection\n");
+    sendRights(vm_sock, new_posn, sizeof(new_posn), s->live_vms, s->msi_vectors);
+
+    printf("[NC] Connected (count = %ld).\n", new_posn);
+    for (i = 0; i < new_posn; i++) {
+        if (s->live_vms[i].alive) {
+            // ping all clients that a new client has joined
+            printf("[UD] sending fd[%ld] to %ld\n", new_posn, i);
+            for (j = 0; j < s->msi_vectors; j++) {
+                printf("\tefd[%ld] = [%d]", j, s->live_vms[new_posn].efd[j]);
+                sendUpdate(s->live_vms[i].sockfd, new_posn,
+                        sizeof(new_posn), s->live_vms[new_posn].efd[j]);
+            }
+            printf("\n");
+        }
+    }
+
+    s->total_count++;
+}
+
+int create_listening_socket(char * path) {
+
+    struct sockaddr_un local;
+    int len, conn_socket;
+
+    if ((conn_socket = socket(AF_UNIX, SOCK_STREAM, 0)) == -1) {
+        perror("socket");
+        exit(1);
+    }
+
+    local.sun_family = AF_UNIX;
+    strcpy(local.sun_path, path);
+    unlink(local.sun_path);
+    len = strlen(local.sun_path) + sizeof(local.sun_family);
+    if (bind(conn_socket, (struct sockaddr *)&local, len) == -1) {
+        perror("bind");
+        exit(1);
+    }
+
+    if (listen(conn_socket, 5) == -1) {
+        perror("listen");
+        exit(1);
+    }
+
+    return conn_socket;
+
+}
+
+void parse_args(int argc, char **argv, server_state_t * s) {
+
+    int c;
+
+    s->shm_size = 1024 * 1024; // default shm_size
+    s->path = NULL;
+    s->shmobj = NULL;
+    s->msi_vectors = 1;
+
+	while ((c = getopt(argc, argv, "hp:s:m:n:")) != -1) {
+
+        switch (c) {
+            // path to listening socket
+            case 'p':
+                s->path = optarg;
+                break;
+            // name of shared memory object
+            case 's':
+                s->shmobj = optarg;
+                break;
+            // size of shared memory object
+            case 'm': {
+                    uint64_t value;
+                    char *ptr;
+
+                    value = strtoul(optarg, &ptr, 10);
+                    switch (*ptr) {
+                    case 0: case 'M': case 'm':
+                        value <<= 20;
+                        break;
+                    case 'G': case 'g':
+                        value <<= 30;
+                        break;
+                    default:
+                        fprintf(stderr, "ivshmem_server: invalid size: %s\n", optarg);
+                        exit(1);
+                    }
+                    s->shm_size = value;
+                    break;
+                }
+            case 'n':
+                s->msi_vectors = atol(optarg);
+                break;
+            case 'h':
+            default:
+	            usage(argv[0]);
+		        exit(1);
+		}
+	}
+
+    if (s->path == NULL) {
+        s->path = strdup(DEFAULT_SOCK_PATH);
+    }
+
+    printf("listening socket: %s\n", s->path);
+
+    if (s->shmobj == NULL) {
+        s->shmobj = strdup(DEFAULT_SHM_OBJ);
+    }
+
+    printf("shared object: %s\n", s->shmobj);
+    printf("shared object size: %d (bytes)\n", s->shm_size);
+
+}
+
+void print_vec(server_state_t * s, const char * c) {
+
+    int i, j;
+
+#if DEBUG
+    printf("%s (%ld) = ", c, s->total_count);
+    for (i = 0; i < s->total_count; i++) {
+        if (s->live_vms[i].alive) {
+            for (j = 0; j < s->msi_vectors; j++) {
+                printf("[%d|%d] ", s->live_vms[i].sockfd, s->live_vms[i].efd[j]);
+            }
+        }
+    }
+    printf("\n");
+#endif
+
+}
+
+int find_set(fd_set * readset, int max) {
+
+    int i;
+
+    for (i = 1; i < max; i++) {
+        if (FD_ISSET(i, readset)) {
+            return i;
+        }
+    }
+
+    printf("nothing set\n");
+    return -1;
+
+}
+
+void usage(char const *prg) {
+	fprintf(stderr, "use: %s [-h]  [-p <unix socket>] [-s <shm obj>] "
+            "[-m <size in MB>] [-n <# of MSI vectors>]\n", prg);
+}
diff --git a/contrib/ivshmem-server/send_scm.c b/contrib/ivshmem-server/send_scm.c
new file mode 100644
index 0000000..b1bb4a3
--- /dev/null
+++ b/contrib/ivshmem-server/send_scm.c
@@ -0,0 +1,208 @@
+#include <stdint.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <sys/syscall.h>
+#include <sys/un.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <poll.h>
+#include "send_scm.h"
+
+#ifndef POLLRDHUP
+#define POLLRDHUP 0x2000
+#endif
+
+int readUpdate(int fd, long * posn, int * newfd)
+{
+    struct msghdr msg;
+    struct iovec iov[1];
+    struct cmsghdr *cmptr;
+    size_t len;
+    size_t msg_size = sizeof(int);
+    char control[CMSG_SPACE(msg_size)];
+
+    msg.msg_name = 0;
+    msg.msg_namelen = 0;
+    msg.msg_control = control;
+    msg.msg_controllen = sizeof(control);
+    msg.msg_flags = 0;
+    msg.msg_iov = iov;
+    msg.msg_iovlen = 1;
+
+    iov[0].iov_base = posn;
+    iov[0].iov_len = sizeof(*posn);
+
+    do {
+        len = recvmsg(fd, &msg, 0);
+    } while (len == (size_t) (-1) && (errno == EINTR || errno == EAGAIN));
+
+    printf("iov[0].buf is %ld\n", *((long *)iov[0].iov_base));
+    printf("len is %ld\n", len);
+    // TODO: Logging
+    if (len == (size_t) (-1)) {
+        perror("recvmsg()");
+        return -1;
+    }
+
+    if (msg.msg_controllen < sizeof(struct cmsghdr))
+        return *posn;
+
+    for (cmptr = CMSG_FIRSTHDR(&msg); cmptr != NULL;
+        cmptr = CMSG_NXTHDR(&msg, cmptr)) {
+        if (cmptr->cmsg_level != SOL_SOCKET ||
+            cmptr->cmsg_type != SCM_RIGHTS){
+                printf("continuing %ld\n", sizeof(size_t));
+                printf("read msg_size = %ld\n", msg_size);
+                if (cmptr->cmsg_len != sizeof(control))
+                    printf("not equal (%ld != %ld)\n",cmptr->cmsg_len,sizeof(control));
+                continue;
+        }
+
+        memcpy(newfd, CMSG_DATA(cmptr), sizeof(int));
+        printf("posn is %ld (fd = %d)\n", *posn, *newfd);
+        return 0;
+    }
+
+    fprintf(stderr, "bad data in packet\n");
+    return -1;
+}
+
+int readRights(int fd, long count, size_t count_len, int **fds, int msi_vectors)
+{
+    int j, newfd;
+
+    for (; ;){
+        long posn = 0;
+
+        readUpdate(fd, &posn, &newfd);
+        printf("reading posn %ld ", posn);
+        fds[posn] = (int *)malloc (msi_vectors * sizeof(int));
+        fds[posn][0] = newfd;
+        for (j = 1; j < msi_vectors; j++) {
+            readUpdate(fd, &posn, &newfd);
+            fds[posn][j] = newfd;
+            printf("%d.", fds[posn][j]);
+        }
+        printf("\n");
+
+        /* stop reading once I've read my own eventfds */
+        if (posn == count)
+            break;
+    }
+
+    return 0;
+}
+
+int sendKill(int fd, long const posn, size_t posn_len) {
+
+    struct cmsghdr *cmsg;
+    size_t msg_size = sizeof(int);
+    char control[CMSG_SPACE(msg_size)];
+    struct iovec iov[1];
+    size_t len;
+    struct msghdr msg = { 0, 0, iov, 1, control, sizeof control, 0 };
+
+    struct pollfd mypollfd;
+    int rv;
+
+    iov[0].iov_base = (void *) &posn;
+    iov[0].iov_len = posn_len;
+
+    // from cmsg(3)
+    cmsg = CMSG_FIRSTHDR(&msg);
+    cmsg->cmsg_level = SOL_SOCKET;
+    cmsg->cmsg_len = 0;
+    msg.msg_controllen = cmsg->cmsg_len;
+
+    printf("Killing posn %ld\n", posn);
+
+    // check if the fd is dead or not
+    mypollfd.fd = fd;
+    mypollfd.events = POLLRDHUP;
+    mypollfd.revents = 0;
+
+    rv = poll(&mypollfd, 1, 0);
+
+    printf("rv is %d\n", rv);
+
+    if (rv == 0) {
+        len = sendmsg(fd, &msg, 0);
+        if (len == (size_t) (-1)) {
+            perror("sendmsg()");
+            return -1;
+        }
+        return (len == posn_len);
+    } else {
+        printf("already dead\n");
+        return 0;
+    }
+}
+
+int sendUpdate(int fd, long posn, size_t posn_len, int sendfd)
+{
+
+    struct cmsghdr *cmsg;
+    size_t msg_size = sizeof(int);
+    char control[CMSG_SPACE(msg_size)];
+    struct iovec iov[1];
+    size_t len;
+    struct msghdr msg = { 0, 0, iov, 1, control, sizeof control, 0 };
+
+    iov[0].iov_base = (void *) (&posn);
+    iov[0].iov_len = posn_len;
+
+    // from cmsg(3)
+    cmsg = CMSG_FIRSTHDR(&msg);
+    cmsg->cmsg_level = SOL_SOCKET;
+    cmsg->cmsg_type = SCM_RIGHTS;
+    cmsg->cmsg_len = CMSG_LEN(msg_size);
+    msg.msg_controllen = cmsg->cmsg_len;
+
+    memcpy((CMSG_DATA(cmsg)), &sendfd, msg_size);
+
+    len = sendmsg(fd, &msg, 0);
+    if (len == (size_t) (-1)) {
+        perror("sendmsg()");
+        return -1;
+    }
+
+    return (len == posn_len);
+
+}
+
+int sendPosition(int fd, long const posn)
+{
+    int rv;
+
+    rv = send(fd, &posn, sizeof(long), 0);
+    if (rv != sizeof(long)) {
+        fprintf(stderr, "error sending posn\n");
+        return -1;
+    }
+
+    return 0;
+}
+
+int sendRights(int fd, long const count, size_t count_len, vmguest_t * Live_vms,
+                                                            long msi_vectors)
+{
+    /* updates about new guests are sent one at a time */
+
+    long i, j;
+
+    for (i = 0; i <= count; i++) {
+        if (Live_vms[i].alive) {
+            for (j = 0; j < msi_vectors; j++) {
+                sendUpdate(Live_vms[count].sockfd, i, sizeof(long),
+                                                        Live_vms[i].efd[j]);
+            }
+        }
+    }
+
+    return 0;
+
+}
diff --git a/contrib/ivshmem-server/send_scm.h b/contrib/ivshmem-server/send_scm.h
new file mode 100644
index 0000000..48c9a8d
--- /dev/null
+++ b/contrib/ivshmem-server/send_scm.h
@@ -0,0 +1,19 @@
+#ifndef SEND_SCM
+#define SEND_SCM
+
+struct vm_guest_conn {
+    int posn;
+    int sockfd;
+    int * efd;
+    int alive;
+};
+
+typedef struct vm_guest_conn vmguest_t;
+
+int readRights(int fd, long count, size_t count_len, int **fds, int msi_vectors);
+int sendRights(int fd, long const count, size_t count_len, vmguest_t *Live_vms, long msi_vectors);
+int readUpdate(int fd, long * posn, int * newfd);
+int sendUpdate(int fd, long const posn, size_t posn_len, int sendfd);
+int sendPosition(int fd, long const posn);
+int sendKill(int fd, long const posn, size_t posn_len);
+#endif
-- 
1.6.3.2.198.g6096d
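
For readers unfamiliar with the mechanism behind send_scm.c: the server hands
descriptors to each QEMU process as SCM_RIGHTS ancillary data on the UNIX
socket, with the peer position as the ordinary payload.  The following is a
minimal sketch of that technique under stated assumptions, not the code from
this patch; send_fd() and recv_fd() are hypothetical names, and the union is
only there to keep the control buffer suitably aligned.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one descriptor as SCM_RIGHTS data, with posn as the payload.
 * Returns 0 on success, -1 on error. */
static int send_fd(int sock, long posn, int fd)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cmsg;
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } control;

    memset(&msg, 0, sizeof(msg));
    memset(&control, 0, sizeof(control));

    iov.iov_base = &posn;
    iov.iov_len = sizeof(posn);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == (ssize_t) sizeof(posn) ? 0 : -1;
}

/* Receive the payload into *posn and the passed descriptor into *fd.
 * Returns 0 if a descriptor arrived, -1 otherwise. */
static int recv_fd(int sock, long *posn, int *fd)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cmsg;
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } control;

    memset(&msg, 0, sizeof(msg));

    iov.iov_base = posn;            /* the buffer itself, not its address */
    iov.iov_len = sizeof(*posn);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    if (recvmsg(sock, &msg, 0) != (ssize_t) sizeof(*posn)) {
        return -1;
    }

    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS) {
            memcpy(fd, CMSG_DATA(cmsg), sizeof(int));
            return 0;
        }
    }

    return -1;
}
```

The received descriptor is a new number in the receiving process that refers to
the same open file, which is how every guest ends up sharing the same memory
object and eventfds.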


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory
@ 2010-06-04 21:45             ` Cam Macdonell
  0 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-04 21:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: Cam Macdonell, kvm

This code is a standalone server that passes file descriptors for the shared
memory region and eventfds to support interrupts between guests using inter-VM
shared memory.
---
 contrib/ivshmem-server/Makefile         |   16 ++
 contrib/ivshmem-server/README           |   30 +++
 contrib/ivshmem-server/ivshmem_server.c |  353 +++++++++++++++++++++++++++++++
 contrib/ivshmem-server/send_scm.c       |  208 ++++++++++++++++++
 contrib/ivshmem-server/send_scm.h       |   19 ++
 5 files changed, 626 insertions(+), 0 deletions(-)
 create mode 100644 contrib/ivshmem-server/Makefile
 create mode 100644 contrib/ivshmem-server/README
 create mode 100644 contrib/ivshmem-server/ivshmem_server.c
 create mode 100644 contrib/ivshmem-server/send_scm.c
 create mode 100644 contrib/ivshmem-server/send_scm.h

diff --git a/contrib/ivshmem-server/Makefile b/contrib/ivshmem-server/Makefile
new file mode 100644
index 0000000..da40ffa
--- /dev/null
+++ b/contrib/ivshmem-server/Makefile
@@ -0,0 +1,16 @@
+CC = gcc
+CFLAGS = -O3 -Wall -Werror
+LIBS = -lrt
+
+# a very simple makefile to build the inter-VM shared memory server
+
+all: ivshmem_server
+
+.c.o:
+	$(CC) $(CFLAGS) -c $^ -o $@
+
+ivshmem_server: ivshmem_server.o send_scm.o
+	$(CC) $(CFLAGS) -o $@ $^ $(LIBS)
+
+clean:
+	rm -f *.o ivshmem_server
diff --git a/contrib/ivshmem-server/README b/contrib/ivshmem-server/README
new file mode 100644
index 0000000..b1fc2a2
--- /dev/null
+++ b/contrib/ivshmem-server/README
@@ -0,0 +1,30 @@
+Using the ivshmem shared memory server
+--------------------------------------
+
+This server is only supported on Linux.
+
+To use the shared memory server, first compile it.  Running 'make' should
+accomplish this.  An executable named 'ivshmem_server' will be built.
+
+To display the options, run:
+
+./ivshmem_server -h
+
+Options
+-------
+
+    -h  print help message
+
+    -p <path on host>
+        unix socket to listen on.  The qemu-kvm chardev needs to connect on
+        this socket. (default: '/tmp/ivshmem_socket')
+
+    -s <string>
+        name of the POSIX shared memory object to create (default: 'ivshmem')
+
+    -m <#>
+        size of the POSIX object in MB; takes an optional M or G suffix (default: 1)
+
+    -n <#>
+        number of eventfds for each guest.  This number must match the
+        'vectors' argument passed to the ivshmem device. (default: 1)
diff --git a/contrib/ivshmem-server/ivshmem_server.c b/contrib/ivshmem-server/ivshmem_server.c
new file mode 100644
index 0000000..e0a7b98
--- /dev/null
+++ b/contrib/ivshmem-server/ivshmem_server.c
@@ -0,0 +1,353 @@
+/*
+ * A stand-alone shared memory server for inter-VM shared memory for KVM
+*/
+
+#include <errno.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <sys/select.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "send_scm.h"
+
+#define DEFAULT_SOCK_PATH "/tmp/ivshmem_socket"
+#define DEFAULT_SHM_OBJ "ivshmem"
+
+#define DEBUG 1
+
+typedef struct server_state {
+    vmguest_t *live_vms;
+    int nr_allocated_vms;
+    int shm_size;
+    long live_count;
+    long total_count;
+    int shm_fd;
+    char * path;
+    char * shmobj;
+    int maxfd, conn_socket;
+    long msi_vectors;
+} server_state_t;
+
+void usage(char const *prg);
+int find_set(fd_set * readset, int max);
+void print_vec(server_state_t * s, const char * c);
+
+void add_new_guest(server_state_t * s);
+void parse_args(int argc, char **argv, server_state_t * s);
+int create_listening_socket(char * path);
+
+int main(int argc, char ** argv)
+{
+    fd_set readset;
+    server_state_t * s;
+
+    s = (server_state_t *)calloc(1, sizeof(server_state_t));
+
+    s->live_count = 0;
+    s->total_count = 0;
+    parse_args(argc, argv, s);
+
+    /* open shared memory file  */
+    if ((s->shm_fd = shm_open(s->shmobj, O_CREAT|O_RDWR, S_IRWXU)) < 0)
+    {
+        fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
+        exit(-1);
+    }
+
+    ftruncate(s->shm_fd, s->shm_size);
+
+    s->conn_socket = create_listening_socket(s->path);
+
+    s->maxfd = s->conn_socket;
+
+    for(;;) {
+        int ret, handle, i;
+        char buf[1024];
+
+        print_vec(s, "vm_sockets");
+
+        FD_ZERO(&readset);
+        /* watch the listening socket and each live guest's socket */
+        FD_SET(s->conn_socket, &readset);
+        for (i = 0; i < s->total_count; i++) {
+            if (s->live_vms[i].alive != 0) {
+                FD_SET(s->live_vms[i].sockfd, &readset);
+            }
+        }
+
+        printf("\nWaiting (maxfd = %d)\n", s->maxfd);
+
+        ret = select(s->maxfd + 1, &readset, NULL, NULL, NULL);
+
+        if (ret == -1) {
+            perror("select()");
+        }
+
+        handle = find_set(&readset, s->maxfd + 1);
+        if (handle == -1) continue;
+
+        if (handle == s->conn_socket) {
+
+            printf("[NC] new connection\n");
+            FD_CLR(s->conn_socket, &readset);
+
+            /* total_count is equal to the new guest's VM ID */
+            add_new_guest(s);
+
+            /* update the maximum file descriptor number */
+            s->maxfd = s->live_vms[s->total_count - 1].sockfd > s->maxfd ?
+                            s->live_vms[s->total_count - 1].sockfd : s->maxfd;
+
+            s->live_count++;
+            printf("Live_count is %ld\n", s->live_count);
+
+        } else {
+            /* then we have received a disconnection */
+            int recv_ret;
+            long i, j;
+            long deadposn = -1;
+
+            recv_ret = recv(handle, buf, 1, 0);
+
+            printf("[DC] recv returned %d\n", recv_ret);
+
+            /* find the dead VM in our list and mark it as dead */
+            for (i = 0; i < s->total_count; i++) {
+                if (s->live_vms[i].sockfd == handle) {
+                    deadposn = i;
+                    s->live_vms[i].alive = 0;
+                    /* sockfd == handle; it is closed once, after this loop */
+
+                    for (j = 0; j < s->msi_vectors; j++) {
+                        close(s->live_vms[i].efd[j]);
+                    }
+
+                    free(s->live_vms[i].efd);
+                    s->live_vms[i].sockfd = -1;
+                    break;
+                }
+            }
+
+            for (j = 0; j < s->total_count; j++) {
+                /* update remaining clients that one client has left/died */
+                if (s->live_vms[j].alive) {
+                    printf("[UD] sending kill of fd[%ld] to %ld\n",
+                                                                deadposn, j);
+                    sendKill(s->live_vms[j].sockfd, deadposn, sizeof(deadposn));
+                }
+            }
+
+            s->live_count--;
+
+            /* close the socket for the departed VM */
+            close(handle);
+        }
+
+    }
+
+    return 0;
+}
+
+void add_new_guest(server_state_t * s) {
+
+    struct sockaddr_un remote;
+    socklen_t t = sizeof(remote);
+    long i, j;
+    int vm_sock;
+    long new_posn;
+    long neg1 = -1;
+
+    vm_sock = accept(s->conn_socket, (struct sockaddr *)&remote, &t);
+
+    if ( vm_sock == -1 ) {
+        perror("accept");
+        exit(1);
+    }
+
+    new_posn = s->total_count;
+
+    if (new_posn == s->nr_allocated_vms) {
+        printf("increasing vm slots\n");
+        s->nr_allocated_vms = s->nr_allocated_vms * 2;
+        if (s->nr_allocated_vms < 16)
+            s->nr_allocated_vms = 16;
+        s->live_vms = realloc(s->live_vms,
+                    s->nr_allocated_vms * sizeof(vmguest_t));
+
+        if (s->live_vms == NULL) {
+            fprintf(stderr, "realloc failed - quitting\n");
+            exit(-1);
+        }
+    }
+
+    s->live_vms[new_posn].posn = new_posn;
+    printf("[NC] Live_vms[%ld]\n", new_posn);
+    s->live_vms[new_posn].efd = (int *) malloc(s->msi_vectors * sizeof(int));
+    for (i = 0; i < s->msi_vectors; i++) {
+        s->live_vms[new_posn].efd[i] = eventfd(0, 0);
+        printf("\tefd[%ld] = %d\n", i, s->live_vms[new_posn].efd[i]);
+    }
+    s->live_vms[new_posn].sockfd = vm_sock;
+    s->live_vms[new_posn].alive = 1;
+
+
+    sendPosition(vm_sock, new_posn);
+    sendUpdate(vm_sock, neg1, sizeof(long), s->shm_fd);
+    printf("[NC] trying to send fds to new connection\n");
+    sendRights(vm_sock, new_posn, sizeof(new_posn), s->live_vms, s->msi_vectors);
+
+    printf("[NC] Connected (count = %ld).\n", new_posn);
+    for (i = 0; i < new_posn; i++) {
+        if (s->live_vms[i].alive) {
+            // ping all clients that a new client has joined
+            printf("[UD] sending fd[%ld] to %ld\n", new_posn, i);
+            for (j = 0; j < s->msi_vectors; j++) {
+                printf("\tefd[%ld] = [%d]", j, s->live_vms[new_posn].efd[j]);
+                sendUpdate(s->live_vms[i].sockfd, new_posn,
+                        sizeof(new_posn), s->live_vms[new_posn].efd[j]);
+            }
+            printf("\n");
+        }
+    }
+
+    s->total_count++;
+}
+
+int create_listening_socket(char * path) {
+
+    struct sockaddr_un local;
+    int len, conn_socket;
+
+    if ((conn_socket = socket(AF_UNIX, SOCK_STREAM, 0)) == -1) {
+        perror("socket");
+        exit(1);
+    }
+
+    local.sun_family = AF_UNIX;
+    strcpy(local.sun_path, path);
+    unlink(local.sun_path);
+    len = strlen(local.sun_path) + sizeof(local.sun_family);
+    if (bind(conn_socket, (struct sockaddr *)&local, len) == -1) {
+        perror("bind");
+        exit(1);
+    }
+
+    if (listen(conn_socket, 5) == -1) {
+        perror("listen");
+        exit(1);
+    }
+
+    return conn_socket;
+
+}
+
+void parse_args(int argc, char **argv, server_state_t * s) {
+
+    int c;
+
+    s->shm_size = 1024 * 1024; // default shm_size
+    s->path = NULL;
+    s->shmobj = NULL;
+    s->msi_vectors = 1;
+
+    while ((c = getopt(argc, argv, "hp:s:m:n:")) != -1) {
+
+        switch (c) {
+            // path to listening socket
+            case 'p':
+                s->path = optarg;
+                break;
+            // name of shared memory object
+            case 's':
+                s->shmobj = optarg;
+                break;
+            // size of shared memory object
+            case 'm': {
+                    uint64_t value;
+                    char *ptr;
+
+                    value = strtoul(optarg, &ptr, 10);
+                    switch (*ptr) {
+                    case 0: case 'M': case 'm':
+                        value <<= 20;
+                        break;
+                    case 'G': case 'g':
+                        value <<= 30;
+                        break;
+                    default:
+                        fprintf(stderr, "ivshmem_server: invalid shm size: %s\n", optarg);
+                        exit(1);
+                    }
+                    s->shm_size = value;
+                    break;
+                }
+            case 'n':
+                s->msi_vectors = atol(optarg);
+                break;
+            case 'h':
+            default:
+                usage(argv[0]);
+                exit(1);
+        }
+    }
+
+    if (s->path == NULL) {
+        s->path = strdup(DEFAULT_SOCK_PATH);
+    }
+
+    printf("listening socket: %s\n", s->path);
+
+    if (s->shmobj == NULL) {
+        s->shmobj = strdup(DEFAULT_SHM_OBJ);
+    }
+
+    printf("shared object: %s\n", s->shmobj);
+    printf("shared object size: %d (bytes)\n", s->shm_size);
+
+}
+
+void print_vec(server_state_t * s, const char * c) {
+
+    int i, j;
+
+#if DEBUG
+    printf("%s (%ld) = ", c, s->total_count);
+    for (i = 0; i < s->total_count; i++) {
+        if (s->live_vms[i].alive) {
+            for (j = 0; j < s->msi_vectors; j++) {
+                printf("[%d|%d] ", s->live_vms[i].sockfd, s->live_vms[i].efd[j]);
+            }
+        }
+    }
+    printf("\n");
+#endif
+
+}
+
+int find_set(fd_set * readset, int max) {
+
+    int i;
+
+    for (i = 1; i < max; i++) {
+        if (FD_ISSET(i, readset)) {
+            return i;
+        }
+    }
+
+    printf("nothing set\n");
+    return -1;
+
+}
+
+void usage(char const *prg) {
+	fprintf(stderr, "use: %s [-h]  [-p <unix socket>] [-s <shm obj>] "
+            "[-m <size in MB>] [-n <# of MSI vectors>]\n", prg);
+}
diff --git a/contrib/ivshmem-server/send_scm.c b/contrib/ivshmem-server/send_scm.c
new file mode 100644
index 0000000..b1bb4a3
--- /dev/null
+++ b/contrib/ivshmem-server/send_scm.c
@@ -0,0 +1,208 @@
+#include <stdint.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <string.h>
+#include <sys/un.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <poll.h>
+#include "send_scm.h"
+
+#ifndef POLLRDHUP
+#define POLLRDHUP 0x2000
+#endif
+
+int readUpdate(int fd, long * posn, int * newfd)
+{
+    struct msghdr msg;
+    struct iovec iov[1];
+    struct cmsghdr *cmptr;
+    size_t len;
+    size_t msg_size = sizeof(int);
+    char control[CMSG_SPACE(msg_size)];
+
+    msg.msg_name = 0;
+    msg.msg_namelen = 0;
+    msg.msg_control = control;
+    msg.msg_controllen = sizeof(control);
+    msg.msg_flags = 0;
+    msg.msg_iov = iov;
+    msg.msg_iovlen = 1;
+
+    iov[0].iov_base = posn;
+    iov[0].iov_len = sizeof(*posn);
+
+    do {
+        len = recvmsg(fd, &msg, 0);
+    } while (len == (size_t) (-1) && (errno == EINTR || errno == EAGAIN));
+
+    printf("iov[0].buf is %ld\n", *((long *)iov[0].iov_base));
+    printf("len is %zu\n", len);
+    // TODO: Logging
+    if (len == (size_t) (-1)) {
+        perror("recvmsg()");
+        return -1;
+    }
+
+    if (msg.msg_controllen < sizeof(struct cmsghdr))
+        return *posn;
+
+    for (cmptr = CMSG_FIRSTHDR(&msg); cmptr != NULL;
+        cmptr = CMSG_NXTHDR(&msg, cmptr)) {
+        if (cmptr->cmsg_level != SOL_SOCKET ||
+            cmptr->cmsg_type != SCM_RIGHTS){
+                printf("continuing %zu\n", sizeof(size_t));
+                printf("read msg_size = %zu\n", msg_size);
+                if (cmptr->cmsg_len != sizeof(control))
+                    printf("not equal (%zu != %zu)\n", (size_t)cmptr->cmsg_len, sizeof(control));
+                continue;
+        }
+
+        memcpy(newfd, CMSG_DATA(cmptr), sizeof(int));
+        printf("posn is %ld (fd = %d)\n", *posn, *newfd);
+        return 0;
+    }
+
+    fprintf(stderr, "bad data in packet\n");
+    return -1;
+}
+
+int readRights(int fd, long count, size_t count_len, int **fds, int msi_vectors)
+{
+    int j, newfd;
+
+    for (; ;){
+        long posn = 0;
+
+        readUpdate(fd, &posn, &newfd);
+        printf("reading posn %ld ", posn);
+        fds[posn] = (int *)malloc (msi_vectors * sizeof(int));
+        fds[posn][0] = newfd;
+        for (j = 1; j < msi_vectors; j++) {
+            readUpdate(fd, &posn, &newfd);
+            fds[posn][j] = newfd;
+            printf("%d.", fds[posn][j]);
+        }
+        printf("\n");
+
+        /* stop reading once i've read my own eventfds */
+        if (posn == count)
+            break;
+    }
+
+    return 0;
+}
+
+int sendKill(int fd, long const posn, size_t posn_len) {
+
+    struct cmsghdr *cmsg;
+    size_t msg_size = sizeof(int);
+    char control[CMSG_SPACE(msg_size)];
+    struct iovec iov[1];
+    size_t len;
+    struct msghdr msg = { 0, 0, iov, 1, control, sizeof control, 0 };
+
+    struct pollfd mypollfd;
+    int rv;
+
+    iov[0].iov_base = (void *) &posn;
+    iov[0].iov_len = posn_len;
+
+    // from cmsg(3)
+    cmsg = CMSG_FIRSTHDR(&msg);
+    cmsg->cmsg_level = SOL_SOCKET;
+    cmsg->cmsg_len = 0;
+    msg.msg_controllen = cmsg->cmsg_len;
+
+    printf("Killing posn %ld\n", posn);
+
+    // check if the fd is dead or not
+    mypollfd.fd = fd;
+    mypollfd.events = POLLRDHUP;
+    mypollfd.revents = 0;
+
+    rv = poll(&mypollfd, 1, 0);
+
+    printf("rv is %d\n", rv);
+
+    if (rv == 0) {
+        len = sendmsg(fd, &msg, 0);
+        if (len == (size_t) (-1)) {
+            perror("sendmsg()");
+            return -1;
+        }
+        return (len == posn_len);
+    } else {
+        printf("already dead\n");
+        return 0;
+    }
+}
+
+int sendUpdate(int fd, long posn, size_t posn_len, int sendfd)
+{
+
+    struct cmsghdr *cmsg;
+    size_t msg_size = sizeof(int);
+    char control[CMSG_SPACE(msg_size)];
+    struct iovec iov[1];
+    size_t len;
+    struct msghdr msg = { 0, 0, iov, 1, control, sizeof control, 0 };
+
+    iov[0].iov_base = (void *) (&posn);
+    iov[0].iov_len = posn_len;
+
+    // from cmsg(3)
+    cmsg = CMSG_FIRSTHDR(&msg);
+    cmsg->cmsg_level = SOL_SOCKET;
+    cmsg->cmsg_type = SCM_RIGHTS;
+    cmsg->cmsg_len = CMSG_LEN(msg_size);
+    msg.msg_controllen = cmsg->cmsg_len;
+
+    memcpy((CMSG_DATA(cmsg)), &sendfd, msg_size);
+
+    len = sendmsg(fd, &msg, 0);
+    if (len == (size_t) (-1)) {
+        perror("sendmsg()");
+        return -1;
+    }
+
+    return (len == posn_len);
+
+}
+
+int sendPosition(int fd, long const posn)
+{
+    int rv;
+
+    rv = send(fd, &posn, sizeof(long), 0);
+    if (rv != sizeof(long)) {
+        fprintf(stderr, "error sending posn\n");
+        return -1;
+    }
+
+    return 0;
+}
+
+int sendRights(int fd, long const count, size_t count_len, vmguest_t * Live_vms,
+                                                            long msi_vectors)
+{
+    /* updates about new guests are sent one at a time */
+
+    long i, j;
+
+    for (i = 0; i <= count; i++) {
+        if (Live_vms[i].alive) {
+            for (j = 0; j < msi_vectors; j++) {
+                sendUpdate(Live_vms[count].sockfd, i, sizeof(long),
+                                                        Live_vms[i].efd[j]);
+            }
+        }
+    }
+
+    return 0;
+
+}
diff --git a/contrib/ivshmem-server/send_scm.h b/contrib/ivshmem-server/send_scm.h
new file mode 100644
index 0000000..48c9a8d
--- /dev/null
+++ b/contrib/ivshmem-server/send_scm.h
@@ -0,0 +1,19 @@
+#ifndef SEND_SCM
+#define SEND_SCM
+
+struct vm_guest_conn {
+    int posn;
+    int sockfd;
+    int * efd;
+    int alive;
+};
+
+typedef struct vm_guest_conn vmguest_t;
+
+int readRights(int fd, long count, size_t count_len, int **fds, int msi_vectors);
+int sendRights(int fd, long const count, size_t count_len, vmguest_t *Live_vms, long msi_vectors);
+int readUpdate(int fd, long * posn, int * newfd);
+int sendUpdate(int fd, long const posn, size_t posn_len, int sendfd);
+int sendPosition(int fd, long const posn);
+int sendKill(int fd, long const posn, size_t posn_len);
+#endif
-- 
1.6.3.2.198.g6096d
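
For reference, the descriptor-passing mechanism that sendUpdate() and
readUpdate() rely on can be sketched in isolation.  This is a minimal sketch
and not part of the patch: send_fd(), recv_fd(), fd_passing_demo() and the
fd_control union are hypothetical names, and a socketpair() stands in for the
server/guest connection.  It assumes Linux with glibc-style CMSG_* macros.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer for exactly one descriptor, aligned as cmsg(3) suggests. */
union fd_control {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send posn as ordinary data and fd_to_send as SCM_RIGHTS ancillary data. */
int send_fd(int sock, long posn, int fd_to_send)
{
    union fd_control control;
    struct iovec iov = { .iov_base = &posn, .iov_len = sizeof(posn) };
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&control, 0, sizeof(control));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == (ssize_t)sizeof(posn) ? 0 : -1;
}

/* Receive one long into *posn and one descriptor into *newfd. */
int recv_fd(int sock, long *posn, int *newfd)
{
    union fd_control control;
    /* Note: the buffer is posn itself, not the address of the pointer. */
    struct iovec iov = { .iov_base = posn, .iov_len = sizeof(*posn) };
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&control, 0, sizeof(control));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    *newfd = -1;
    if (recvmsg(sock, &msg, 0) != (ssize_t)sizeof(*posn))
        return -1;

    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS) {
            memcpy(newfd, CMSG_DATA(cmsg), sizeof(int));
            return 0;
        }
    }
    return -1;  /* data arrived without a descriptor */
}

/* Pass the write end of a pipe across a socketpair and use the copy. */
int fd_passing_demo(void)
{
    int sv[2], pipefd[2], received = -1;
    long posn = -1;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1 || pipe(pipefd) == -1)
        return -1;
    if (send_fd(sv[0], 3, pipefd[1]) != 0 ||
        recv_fd(sv[1], &posn, &received) != 0 || posn != 3)
        return -1;

    /* The receiver gets a new descriptor number for the same open file. */
    assert(received != pipefd[1]);
    if (write(received, "x", 1) != 1 ||
        read(pipefd[0], &c, 1) != 1 || c != 'x')
        return -1;

    close(received); close(pipefd[0]); close(pipefd[1]);
    close(sv[0]); close(sv[1]);
    return 0;
}
```

Calling fd_passing_demo() from a small main() is enough to try it; it returns
0 when the byte written through the received descriptor arrives on the
original pipe.  Using a union for the control buffer is a choice made here to
guarantee alignment; the patch uses a plain char array instead.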

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6] Shared memory uio_pci driver
  2010-06-04 21:45             ` [Qemu-devel] " Cam Macdonell
@ 2010-06-04 21:47               ` Cam Macdonell
  -1 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-04 21:47 UTC (permalink / raw)
  To: kvm; +Cc: qemu-devel, Cam Macdonell

This patch adds a driver for my shared memory PCI device using the uio_pci
interface.  The device has three memory regions (BARs).  The first holds the
device registers used for sending interrupts, the second is used for receiving
MSI-X interrupts, and the third maps the shared memory.  The driver exports
only the first and third memory regions to userspace.

This driver supports MSI-X and regular pin interrupts.  Currently, the number
of MSI-X vectors is set to 1 but it could easily be increased.  If MSI-X is not
available, then regular interrupts will be used.
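
To illustrate how userspace might consume the two exported regions, here is a
minimal sketch based on the generic UIO conventions (mapping N is selected by
an mmap() offset of N pages, and a 4-byte read() blocks until the next
interrupt).  It is not part of the patch: the function names, the device path
and the region sizes passed by the caller are assumptions; the real sizes
would normally be read from sysfs.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* UIO selects mapping N by using N * page size as the mmap() offset. */
off_t uio_map_offset(int map_index)
{
    assert(map_index >= 0);
    return (off_t)map_index * (off_t)sysconf(_SC_PAGESIZE);
}

/* Map one UIO region read/write; returns MAP_FAILED on error. */
void *uio_map(int uio_fd, int map_index, size_t size)
{
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                uio_fd, uio_map_offset(map_index));
}

/* Block until the next interrupt; stores the running event count. */
int uio_wait_irq(int uio_fd, int32_t *count)
{
    if (read(uio_fd, count, sizeof(*count)) != (ssize_t)sizeof(*count))
        return -1;
    return 0;
}

/*
 * Open the device and map both regions exported by the driver.
 * Returns 0 on success and -1 if the device cannot be opened or mapped.
 */
int ivshmem_uio_demo(const char *dev_path, size_t reg_size, size_t shm_size)
{
    int fd = open(dev_path, O_RDWR);
    void *regs, *shm;

    if (fd == -1)
        return -1;

    regs = uio_map(fd, 0, reg_size);  /* mem[0]: device registers (BAR 0) */
    shm = uio_map(fd, 1, shm_size);   /* mem[1]: shared memory (BAR 2) */
    if (regs == MAP_FAILED || shm == MAP_FAILED) {
        if (regs != MAP_FAILED)
            munmap(regs, reg_size);
        if (shm != MAP_FAILED)
            munmap(shm, shm_size);
        close(fd);
        return -1;
    }

    /*
     * An application would now read and write through shm and call
     * uio_wait_irq(fd, &count) to sleep until a peer signals it.
     */

    munmap(shm, shm_size);
    munmap(regs, reg_size);
    close(fd);
    return 0;
}
```

A caller could try ivshmem_uio_demo("/dev/uio0", 256, 1024 * 1024), where the
path and both sizes are placeholders for whatever the guest actually exposes.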
---
 drivers/uio/Kconfig       |    8 ++
 drivers/uio/Makefile      |    1 +
 drivers/uio/uio_ivshmem.c |  252 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 261 insertions(+), 0 deletions(-)
 create mode 100644 drivers/uio/uio_ivshmem.c

diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
index 1da73ec..b92cded 100644
--- a/drivers/uio/Kconfig
+++ b/drivers/uio/Kconfig
@@ -74,6 +74,14 @@ config UIO_SERCOS3
 
 	  If you compile this as a module, it will be called uio_sercos3.
 
+config UIO_IVSHMEM
+	tristate "KVM shared memory PCI driver"
+	default n
+	help
+	  Userspace I/O interface for the KVM shared memory device.  This
+	  driver makes two memory regions available: the first holds the
+	  device registers and the second is a region shared between VMs.
+
 config UIO_PCI_GENERIC
 	tristate "Generic driver for PCI 2.3 and PCI Express cards"
 	depends on PCI
diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
index 18fd818..25c1ca5 100644
--- a/drivers/uio/Makefile
+++ b/drivers/uio/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_UIO_AEC)	+= uio_aec.o
 obj-$(CONFIG_UIO_SERCOS3)	+= uio_sercos3.o
 obj-$(CONFIG_UIO_PCI_GENERIC)	+= uio_pci_generic.o
 obj-$(CONFIG_UIO_NETX)	+= uio_netx.o
+obj-$(CONFIG_UIO_IVSHMEM) += uio_ivshmem.o
diff --git a/drivers/uio/uio_ivshmem.c b/drivers/uio/uio_ivshmem.c
new file mode 100644
index 0000000..95be1e0
--- /dev/null
+++ b/drivers/uio/uio_ivshmem.c
@@ -0,0 +1,252 @@
+/*
+ * UIO IVShmem Driver
+ *
+ * (C) 2009 Cam Macdonell
+ * based on Hilscher CIF card driver (C) 2007 Hans J. Koch <hjk@linutronix.de>
+ *
+ * Licensed under GPL version 2 only.
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/uio_driver.h>
+
+#include <asm/io.h>
+
+#define IntrStatus 0x04
+#define IntrMask 0x00
+
+struct ivshmem_info {
+	struct uio_info *uio;
+	struct pci_dev *dev;
+	char (*msix_names)[256];
+	struct msix_entry *msix_entries;
+	int nvectors;
+};
+
+static irqreturn_t ivshmem_handler(int irq, struct uio_info *dev_info)
+{
+
+	void __iomem *plx_intscr = dev_info->mem[0].internal_addr
+					+ IntrStatus;
+	u32 val;
+
+	val = readl(plx_intscr);
+	if (val == 0)
+		return IRQ_NONE;
+
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t ivshmem_msix_handler(int irq, void *opaque)
+{
+
+	struct uio_info * dev_info = (struct uio_info *) opaque;
+
+	/* we have to do this explicitly when using MSI-X */
+	uio_event_notify(dev_info);
+	return IRQ_HANDLED;
+}
+
+static void free_msix_vectors(struct ivshmem_info *ivs_info,
+							const int max_vector)
+{
+	int i;
+
+	for (i = 0; i < max_vector; i++)
+		free_irq(ivs_info->msix_entries[i].vector, ivs_info->uio);
+}
+
+static int request_msix_vectors(struct ivshmem_info *ivs_info, int nvectors)
+{
+	int i, err;
+	const char *name = "ivshmem";
+
+	ivs_info->nvectors = nvectors;
+
+	ivs_info->msix_entries = kmalloc(nvectors * sizeof *
+						ivs_info->msix_entries,
+						GFP_KERNEL);
+	if (ivs_info->msix_entries == NULL)
+		return -ENOMEM;
+
+	ivs_info->msix_names = kmalloc(nvectors * sizeof *ivs_info->msix_names,
+			GFP_KERNEL);
+	if (ivs_info->msix_names == NULL) {
+		kfree(ivs_info->msix_entries);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < nvectors; ++i)
+		ivs_info->msix_entries[i].entry = i;
+
+	err = pci_enable_msix(ivs_info->dev, ivs_info->msix_entries,
+					ivs_info->nvectors);
+	if (err > 0) {
+		ivs_info->nvectors = err; /* msi-x positive error code
+					 returns the number available*/
+		err = pci_enable_msix(ivs_info->dev, ivs_info->msix_entries,
+					ivs_info->nvectors);
+		if (err) {
+			printk(KERN_INFO "no MSI (%d). Back to INTx.\n", err);
+			goto error;
+		}
+	}
+
+	if (err)
+		goto error;
+
+	for (i = 0; i < ivs_info->nvectors; i++) {
+
+		snprintf(ivs_info->msix_names[i], sizeof *ivs_info->msix_names,
+			"%s-config", name);
+
+		err = request_irq(ivs_info->msix_entries[i].vector,
+			ivshmem_msix_handler, 0,
+			ivs_info->msix_names[i], ivs_info->uio);
+
+		if (err) {
+			free_msix_vectors(ivs_info, i);
+			goto error;
+		}
+
+	}
+
+	return 0;
+error:
+	kfree(ivs_info->msix_entries);
+	kfree(ivs_info->msix_names);
+	return err;
+
+}
+
+static int __devinit ivshmem_pci_probe(struct pci_dev *dev,
+					const struct pci_device_id *id)
+{
+	struct uio_info *info;
+	struct ivshmem_info * ivshmem_info;
+	int nvectors = 1;
+
+	info = kzalloc(sizeof(struct uio_info), GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
+
+	ivshmem_info = kzalloc(sizeof(struct ivshmem_info), GFP_KERNEL);
+	if (!ivshmem_info) {
+		kfree(info);
+		return -ENOMEM;
+	}
+
+	if (pci_enable_device(dev))
+		goto out_free;
+
+	if (pci_request_regions(dev, "ivshmem"))
+		goto out_disable;
+
+	info->mem[0].addr = pci_resource_start(dev, 0);
+	if (!info->mem[0].addr)
+		goto out_release;
+
+	info->mem[0].size = pci_resource_len(dev, 0);
+	info->mem[0].internal_addr = pci_ioremap_bar(dev, 0);
+	if (!info->mem[0].internal_addr) {
+		goto out_release;
+	}
+
+	info->mem[0].memtype = UIO_MEM_PHYS;
+
+	info->mem[1].addr = pci_resource_start(dev, 2);
+	if (!info->mem[1].addr)
+		goto out_unmap;
+	info->mem[1].internal_addr = pci_ioremap_bar(dev, 2);
+	if (!info->mem[1].internal_addr)
+		goto out_unmap;
+
+	info->mem[1].size = pci_resource_len(dev, 2);
+	info->mem[1].memtype = UIO_MEM_PHYS;
+
+	ivshmem_info->uio = info;
+	ivshmem_info->dev = dev;
+
+	if (request_msix_vectors(ivshmem_info, nvectors) != 0) {
+		printk(KERN_INFO "regular IRQs\n");
+		info->irq = dev->irq;
+		info->irq_flags = IRQF_SHARED;
+		info->handler = ivshmem_handler;
+		writel(0xffffffff, info->mem[0].internal_addr + IntrMask);
+	} else {
+		printk(KERN_INFO "MSI-X enabled\n");
+		info->irq = -1;
+	}
+
+	info->name = "ivshmem";
+	info->version = "0.0.1";
+
+	if (uio_register_device(&dev->dev, info))
+		goto out_unmap2;
+
+	pci_set_drvdata(dev, info);
+
+
+	return 0;
+out_unmap2:
+	iounmap(info->mem[1].internal_addr);
+out_unmap:
+	iounmap(info->mem[0].internal_addr);
+out_release:
+	pci_release_regions(dev);
+out_disable:
+	pci_disable_device(dev);
+out_free:
+	kfree (info);
+	return -ENODEV;
+}
+
+static void ivshmem_pci_remove(struct pci_dev *dev)
+{
+	struct uio_info *info = pci_get_drvdata(dev);
+
+	uio_unregister_device(info);
+	pci_release_regions(dev);
+	pci_disable_device(dev);
+	pci_set_drvdata(dev, NULL);
+	iounmap(info->mem[0].internal_addr);
+
+	kfree (info);
+}
+
+static struct pci_device_id ivshmem_pci_ids[] __devinitdata = {
+	{
+		.vendor =	0x1af4,
+		.device =	0x1110,
+		.subvendor =	PCI_ANY_ID,
+		.subdevice =	PCI_ANY_ID,
+	},
+	{ 0, }
+};
+
+static struct pci_driver ivshmem_pci_driver = {
+	.name = "uio_ivshmem",
+	.id_table = ivshmem_pci_ids,
+	.probe = ivshmem_pci_probe,
+	.remove = ivshmem_pci_remove,
+};
+
+static int __init ivshmem_init_module(void)
+{
+	return pci_register_driver(&ivshmem_pci_driver);
+}
+
+static void __exit ivshmem_exit_module(void)
+{
+	pci_unregister_driver(&ivshmem_pci_driver);
+}
+
+module_init(ivshmem_init_module);
+module_exit(ivshmem_exit_module);
+
+MODULE_DEVICE_TABLE(pci, ivshmem_pci_ids);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Cam Macdonell");
-- 
1.6.3.2.198.g6096d


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v6 5/6] Inter-VM shared memory PCI device
  2010-06-04 21:45           ` [Qemu-devel] " Cam Macdonell
  (?)
  (?)
@ 2010-06-05  9:44           ` Blue Swirl
  2010-06-06 15:02             ` Avi Kivity
  2010-06-07 16:41             ` Cam Macdonell
  -1 siblings, 2 replies; 42+ messages in thread
From: Blue Swirl @ 2010-06-05  9:44 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: qemu-devel, kvm

On Fri, Jun 4, 2010 at 9:45 PM, Cam Macdonell <cam@cs.ualberta.ca> wrote:
> Support an inter-vm shared memory device that maps a shared-memory object as a
> PCI device in the guest.  This patch also supports interrupts between guests by
> communicating over a unix domain socket.  This patch applies to the qemu-kvm
> repository.
>
>    -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>
> Interrupts are supported between multiple VMs by using a shared memory server
> that is reached through a chardev socket.
>
>    -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>           [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
>    -chardev socket,path=<path>,id=<id>
>
> (shared memory server is qemu.git/contrib/ivshmem-server)
>
> Sample programs and init scripts are in a git repo here:
>
>    www.gitorious.org/nahanni
> ---
>  Makefile.target |    3 +
>  hw/ivshmem.c    |  852 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  qemu-char.c     |    6 +
>  qemu-char.h     |    3 +
>  qemu-doc.texi   |   43 +++
>  5 files changed, 907 insertions(+), 0 deletions(-)
>  create mode 100644 hw/ivshmem.c
>
> diff --git a/Makefile.target b/Makefile.target
> index c4ba592..4888308 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -202,6 +202,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
>  obj-y += rtl8139.o
>  obj-y += e1000.o
>
> +# Inter-VM PCI shared memory
> +obj-y += ivshmem.o
> +

Can this be compiled once, simply by moving it to Makefile.objs
instead of Makefile.target? Also, because the code seems to be KVM
specific, it can't be compiled unconditionally; it should depend on at
least CONFIG_KVM and maybe CONFIG_EVENTFD.

Why is this KVM specific, BTW? POSIX SHM is available on many
platforms. What would happen if the kvm_set_foobar functions were not
called when KVM is not being used? Is host eventfd support essential?

>  # Hardware support
>  obj-i386-y += vga.o
>  obj-i386-y += mc146818rtc.o i8259.o pc.o
> diff --git a/hw/ivshmem.c b/hw/ivshmem.c
> new file mode 100644
> index 0000000..9057612
> --- /dev/null
> +++ b/hw/ivshmem.c
> @@ -0,0 +1,852 @@
> +/*
> + * Inter-VM Shared Memory PCI device.
> + *
> + * Author:
> + *      Cam Macdonell <cam@cs.ualberta.ca>
> + *
> + * Based On: cirrus_vga.c
> + *          Copyright (c) 2004 Fabrice Bellard
> + *          Copyright (c) 2004 Makoto Suzuki (suzu)
> + *
> + *      and rtl8139.c
> + *          Copyright (c) 2006 Igor Kovalenko
> + *
> + * This code is licensed under the GNU GPL v2.
> + */
> +#include <sys/mman.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <sys/io.h>
> +#include <sys/ioctl.h>
> +#include "hw.h"
> +#include "console.h"
> +#include "pc.h"
> +#include "pci.h"
> +#include "sysemu.h"
> +
> +#include "msix.h"
> +#include "qemu-kvm.h"
> +#include "libkvm.h"
> +
> +#include <sys/eventfd.h>
> +#include <sys/mman.h>
> +#include <sys/socket.h>
> +#include <sys/ioctl.h>
> +
> +#define IVSHMEM_IRQFD   0
> +#define IVSHMEM_MSI     1
> +
> +//#define DEBUG_IVSHMEM
> +#ifdef DEBUG_IVSHMEM
> +#define IVSHMEM_DPRINTF(fmt, args...)        \
> +    do {printf("IVSHMEM: " fmt, ##args); } while (0)

Please use __VA_ARGS__.
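
A minimal sketch of the macro pair in that form. The '##' before
__VA_ARGS__ is the GNU extension that drops the trailing comma when no
arguments follow the format string; DEBUG_IVSHMEM is defined here only
so that the sketch prints something.

```c
#include <stdio.h>

/* Defined only for this sketch; the patch leaves it commented out. */
#define DEBUG_IVSHMEM

#ifdef DEBUG_IVSHMEM
#define IVSHMEM_DPRINTF(fmt, ...) \
    do { printf("IVSHMEM: " fmt, ## __VA_ARGS__); } while (0)
#else
/* Expands to an empty statement so call sites still need a ';'. */
#define IVSHMEM_DPRINTF(fmt, ...) do { } while (0)
#endif
```

Both IVSHMEM_DPRINTF("msix initialization failed\n") and
IVSHMEM_DPRINTF("posn is %d\n", posn) expand cleanly with this form.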

> +#else
> +#define IVSHMEM_DPRINTF(fmt, args...)
> +#endif
> +
> +typedef struct Peer {
> +    int nb_eventfds;
> +    int *eventfds;
> +} Peer;
> +
> +typedef struct EventfdEntry {
> +    PCIDevice *pdev;
> +    int vector;
> +} EventfdEntry;
> +
> +typedef struct IVShmemState {
> +    PCIDevice dev;
> +    uint32_t intrmask;
> +    uint32_t intrstatus;
> +    uint32_t doorbell;
> +
> +    CharDriverState ** eventfd_chr;

I'd remove the space between '**' and 'eventfd_chr'; the spacing is used inconsistently.

> +    CharDriverState * server_chr;
> +    int ivshmem_mmio_io_addr;
> +
> +    pcibus_t mmio_addr;
> +    pcibus_t shm_pci_addr;
> +    uint64_t ivshmem_offset;
> +    uint64_t ivshmem_size; /* size of shared memory region */
> +    int shm_fd; /* shared memory file descriptor */
> +
> +    Peer *peers;
> +    int nb_peers; /* how many guests we have space for */
> +    int max_peer; /* maximum numbered peer */
> +
> +    int vm_id;
> +    uint32_t vectors;
> +    uint32_t features;
> +    EventfdEntry *eventfd_table;
> +
> +    char * shmobj;
> +    char * sizearg;
> +    char * role;
> +} IVShmemState;
> +
> +/* registers for the Inter-VM shared memory device */
> +enum ivshmem_registers {
> +    IntrMask = 0,
> +    IntrStatus = 4,
> +    IVPosition = 8,
> +    Doorbell = 12,
> +};

IIRC these should be uppercase.

> +
> +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
> +    return (ivs->features & (1 << feature));
> +}

Since this is the first version, do we need any features at this
point? Can't we expect that all features are available now? Why does
the user need to specify the features?

To avoid a negative shift, I'd make 'feature' unsigned.
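
Roughly like this, as a sketch only. has_feature() is a stand-in that
takes the feature word directly so it does not depend on IVShmemState;
the assert is my addition, since a shift by 32 or more is still
undefined even with unsigned operands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for ivshmem_has_feature(); 'feature' is a bit number. */
static inline bool has_feature(uint32_t features, unsigned int feature)
{
    /* Shifting by the full width or more is undefined, so bound it. */
    assert(feature < 32);

    /* UINT32_C(1) keeps the shift unsigned, so bit 31 is safe too. */
    return (features & (UINT32_C(1) << feature)) != 0;
}
```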

> +
> +static inline bool is_power_of_two(uint64_t x) {
> +    return (x & (x - 1)) == 0;
> +}
> +
> +static void ivshmem_map(PCIDevice *pci_dev, int region_num,
> +                    pcibus_t addr, pcibus_t size, int type)
> +{
> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
> +
> +    s->shm_pci_addr = addr;
> +
> +    if (s->ivshmem_offset > 0) {
> +        cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
> +                                                            s->ivshmem_offset);
> +        if (s->role && strncmp(s->role, "peer", 4) == 0) {
> +            IVSHMEM_DPRINTF("marking pages no migrate\n");
> +            cpu_mark_pages_no_migrate(s->ivshmem_offset, s->ivshmem_size);
> +        }
> +    }
> +
> +    IVSHMEM_DPRINTF("guest pci addr = %u, guest h/w addr = %u, size = %u\n",
> +                (uint32_t)addr, (uint32_t)s->ivshmem_offset, (uint32_t)size);

Please use FMT_PCIBUS for addr and size and PRIu64 for s->ivshmem_offset.
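
A standalone sketch of the message with width-correct format macros.
The typedef and the FMT_PCIBUS definition below are stand-ins that
assume pcibus_t is a 64-bit unsigned type; format_map_msg() is a
hypothetical helper used only to make the result checkable.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins, assuming a 64-bit pcibus_t printed in hex. */
typedef uint64_t pcibus_t;
#define FMT_PCIBUS PRIx64

/* Formats the mapping message without truncating any value to 32 bits. */
static int format_map_msg(char *buf, size_t len, pcibus_t addr,
                          uint64_t ram_offset, pcibus_t size)
{
    assert(buf != NULL && len > 0);

    return snprintf(buf, len,
                    "guest pci addr = 0x%" FMT_PCIBUS
                    ", guest h/w addr = %" PRIu64
                    ", size = 0x%" FMT_PCIBUS "\n",
                    addr, ram_offset, size);
}
```

With the (uint32_t) casts, a BAR mapped at 0x100000000 would be printed
as 0; with the macros above the full value survives.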

> +
> +}
> +
> +/* accessing registers - based on rtl8139 */
> +static void ivshmem_update_irq(IVShmemState *s, int val)
> +{
> +    int isr;
> +    isr = (s->intrstatus & s->intrmask) & 0xffffffff;
> +
> +    /* don't print ISR resets */
> +    if (isr) {
> +        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
> +           isr ? 1 : 0, s->intrstatus, s->intrmask);
> +    }
> +
> +    qemu_set_irq(s->dev.irq[0], (isr != 0));
> +}
> +
> +static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
> +{
> +    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
> +
> +    s->intrmask = val;
> +
> +    ivshmem_update_irq(s, val);
> +}
> +
> +static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
> +{
> +    uint32_t ret = s->intrmask;
> +
> +    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
> +
> +    return ret;
> +}
> +
> +static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
> +{
> +    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
> +
> +    s->intrstatus = val;
> +
> +    ivshmem_update_irq(s, val);
> +    return;
> +}
> +
> +static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
> +{
> +    uint32_t ret = s->intrstatus;
> +
> +    /* reading ISR clears all interrupts */
> +    s->intrstatus = 0;
> +
> +    ivshmem_update_irq(s, 0);
> +
> +    return ret;
> +}
> +
> +static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val)
> +{
> +
> +    IVSHMEM_DPRINTF("We shouldn't be writing words\n");
> +}
> +
> +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
> +{
> +    IVShmemState *s = opaque;
> +
> +    u_int64_t write_one = 1;

Please use uintNN_t instead of u_intNN_t.
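
As a sketch, the Doorbell decoding with the standard types. The struct
and function names are hypothetical; the bit layout (destination VM ID
in the upper 16 bits, vector in the low 8 bits) is taken from the two
assignments in the patch.

```c
#include <stdint.h>

/* Hypothetical helper type for one decoded Doorbell write. */
typedef struct DoorbellWrite {
    uint16_t dest;    /* destination VM ID, bits 31..16 */
    uint16_t vector;  /* vector number, bits 7..0 */
} DoorbellWrite;

static DoorbellWrite decode_doorbell(uint32_t val)
{
    DoorbellWrite d = {
        .dest = (uint16_t)(val >> 16),
        .vector = (uint16_t)(val & 0xff),
    };
    return d;
}
```

Since both fields are unsigned, the '< 0' and '>= 0' comparisons in the
Doorbell case can never be false or true respectively; only the upper
bounds do any work.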

> +    u_int16_t dest = val >> 16;
> +    u_int16_t vector = val & 0xff;
> +
> +    addr &= 0xfc;

I'd add a debug printf here, likewise for exit of ivshmem_io_readl().
When you do the merge (see below), the correct printf format for the
addresses will be TARGET_FMT_plx.

> +
> +    switch (addr)
> +    {
> +        case IntrMask:
> +            ivshmem_IntrMask_write(s, val);
> +            break;
> +
> +        case IntrStatus:
> +            ivshmem_IntrStatus_write(s, val);
> +            break;
> +
> +        case Doorbell:
> +            /* check that dest VM ID is reasonable */
> +            if ((dest < 0) || (dest > s->max_peer)) {
> +                IVSHMEM_DPRINTF("Invalid destination VM ID (%d)\n", dest);
> +                break;
> +            }
> +
> +            /* check doorbell range */
> +            if ((vector >= 0) && (vector < s->peers[dest].nb_eventfds)) {
> +                IVSHMEM_DPRINTF("Writing %ld to VM %d on vector %d\n",
> +                                                    write_one, dest, vector);

PRId64 for write_one, %ld is not enough on ILP32.

> +                if (write(s->peers[dest].eventfds[vector],
> +                                                    &(write_one), 8) != 8) {
> +                    IVSHMEM_DPRINTF("error writing to eventfd\n");
> +                }
> +            }
> +            break;
> +        default:
> +            IVSHMEM_DPRINTF("Invalid VM Doorbell VM %d\n", dest);
> +    }
> +}
> +
> +static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val)
> +{
> +    IVSHMEM_DPRINTF("We shouldn't be writing bytes\n");
> +}
> +
> +static uint32_t ivshmem_io_readw(void *opaque, uint8_t addr)
> +{
> +
> +    IVSHMEM_DPRINTF("We shouldn't be reading words\n");
> +    return 0;
> +}
> +
> +static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr)
> +{
> +
> +    IVShmemState *s = opaque;
> +    uint32_t ret;
> +
> +    switch (addr)
> +    {
> +        case IntrMask:
> +            ret = ivshmem_IntrMask_read(s);
> +            break;
> +
> +        case IntrStatus:
> +            ret = ivshmem_IntrStatus_read(s);
> +            break;
> +
> +        case IVPosition:
> +            /* return my VM ID if the memory is mapped */
> +            if (s->shm_fd > 0) {
> +                ret = s->vm_id;
> +            } else {
> +                ret = -1;
> +            }
> +            break;
> +
> +        default:
> +            IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr);
> +            ret = 0;
> +    }
> +
> +    return ret;
> +}
> +
> +static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr)
> +{
> +    IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
> +
> +    return 0;
> +}
> +
> +static void ivshmem_mmio_writeb(void *opaque,
> +                                target_phys_addr_t addr, uint32_t val)
> +{
> +    ivshmem_io_writeb(opaque, addr & 0xFF, val);
> +}

This function and the others below only perform a cast and useless
masking (the address passed is these days an offset from the start of
the area). Please merge these into ivshmem_io_readl() etc.

> +
> +static void ivshmem_mmio_writew(void *opaque,
> +                                target_phys_addr_t addr, uint32_t val)
> +{
> +    ivshmem_io_writew(opaque, addr & 0xFF, val);
> +}
> +
> +static void ivshmem_mmio_writel(void *opaque,
> +                                target_phys_addr_t addr, uint32_t val)
> +{
> +    ivshmem_io_writel(opaque, addr & 0xFF, val);
> +}
> +
> +static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr)
> +{
> +    return ivshmem_io_readb(opaque, addr & 0xFF);
> +}
> +
> +static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr)
> +{
> +    uint32_t val = ivshmem_io_readw(opaque, addr & 0xFF);
> +    return val;
> +}
> +
> +static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr)
> +{
> +    uint32_t val = ivshmem_io_readl(opaque, addr & 0xFF);
> +    return val;
> +}
> +
> +static CPUReadMemoryFunc *ivshmem_mmio_read[3] = {

Please add 'const'.
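
The placement matters: the 'const' goes after the '*', so that the
array elements (the function pointers) are read-only. A standalone
sketch, with stand-in typedefs and demo handlers that are not part of
the patch:

```c
#include <stdint.h>

/* Stand-ins for the QEMU types so the sketch compiles on its own. */
typedef uint64_t target_phys_addr_t;
typedef uint32_t CPUReadMemoryFunc(void *opaque, target_phys_addr_t addr);

/* Demo handlers; each returns its access width in bytes. */
static uint32_t demo_readb(void *opaque, target_phys_addr_t addr)
{
    (void)opaque; (void)addr;
    return 1;
}

static uint32_t demo_readw(void *opaque, target_phys_addr_t addr)
{
    (void)opaque; (void)addr;
    return 2;
}

static uint32_t demo_readl(void *opaque, target_phys_addr_t addr)
{
    (void)opaque; (void)addr;
    return 4;
}

/* 'const' after '*': the table of pointers itself cannot be modified. */
static CPUReadMemoryFunc * const demo_mmio_read[3] = {
    demo_readb,
    demo_readw,
    demo_readl,
};
```

The same applies to the CPUWriteMemoryFunc table.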

> +    ivshmem_mmio_readb,
> +    ivshmem_mmio_readw,
> +    ivshmem_mmio_readl,
> +};
> +
> +static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = {
> +    ivshmem_mmio_writeb,
> +    ivshmem_mmio_writew,
> +    ivshmem_mmio_writel,
> +};
> +
> +static void ivshmem_receive(void *opaque, const uint8_t *buf, int size)
> +{
> +    IVShmemState *s = opaque;
> +
> +    ivshmem_IntrStatus_write(s, *buf);
> +
> +    IVSHMEM_DPRINTF("ivshmem_receive 0x%02x\n", *buf);
> +}
> +
> +static int ivshmem_can_receive(void * opaque)
> +{
> +    return 8;
> +}
> +
> +static void ivshmem_event(void *opaque, int event)
> +{
> +    IVSHMEM_DPRINTF("ivshmem_event %d\n", event);
> +}
> +
> +static void fake_irqfd(void *opaque, const uint8_t *buf, int size) {
> +
> +    EventfdEntry *entry = opaque;
> +    PCIDevice *pdev = entry->pdev;
> +
> +    IVSHMEM_DPRINTF("fake irqfd on vector %p %d\n", pdev, entry->vector);
> +    msix_notify(pdev, entry->vector);
> +}
> +
> +static CharDriverState* create_eventfd_chr_device(void * opaque, int eventfd,
> +                                                                    int vector)
> +{
> +    /* create a event character device based on the passed eventfd */
> +    IVShmemState *s = opaque;
> +    CharDriverState * chr;
> +
> +    chr = qemu_chr_open_eventfd(eventfd);
> +
> +    if (chr == NULL) {
> +        IVSHMEM_DPRINTF("creating eventfd for eventfd %d failed\n", eventfd);

This should not be a DPRINTF.

> +        exit(-1);
> +    }
> +
> +    /* if MSI is supported we need multiple interrupts */
> +    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
> +        s->eventfd_table[vector].pdev = &s->dev;
> +        s->eventfd_table[vector].vector = vector;
> +
> +        qemu_chr_add_handlers(chr, ivshmem_can_receive, fake_irqfd,
> +                      ivshmem_event, &s->eventfd_table[vector]);
> +    } else {
> +        qemu_chr_add_handlers(chr, ivshmem_can_receive, ivshmem_receive,
> +                      ivshmem_event, s);
> +    }
> +
> +    return chr;
> +
> +}
> +
> +static int check_shm_size(IVShmemState *s, int fd) {
> +    /* check that the guest isn't going to try and map more memory than the
> +     * the object has allocated return -1 to indicate error */
> +
> +    struct stat buf;
> +
> +    fstat(fd, &buf);
> +
> +    if (s->ivshmem_size > buf.st_size) {
> +        fprintf(stderr, "IVSHMEM ERROR: Requested memory size greater");
> +        fprintf(stderr, " than shared object size (%ld > %ld)\n",
> +                                          s->ivshmem_size, buf.st_size);

Please use PRIx64 for s->ivshmem_size, this will cause a warning on ILP32.

> +        return -1;
> +    } else {
> +        return 0;
> +    }
> +}
> +
> +/* create the shared memory BAR when we are not using the server, so we can
> + * create the BAR and map the memory immediately */
> +static void create_shared_memory_BAR(IVShmemState *s, int fd) {
> +
> +    void * ptr;
> +
> +    s->shm_fd = fd;
> +
> +    ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
> +
> +    s->ivshmem_offset = qemu_ram_map(s->ivshmem_size, ptr);

qemu_ram_map() does not exist in HEAD.

> +
> +    /* region for shared memory */
> +    pci_register_bar(&s->dev, 2, s->ivshmem_size,
> +                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
> +}
> +
> +static void close_guest_eventfds(IVShmemState *s, int posn)
> +{
> +    int i, guest_curr_max;
> +
> +    guest_curr_max = s->peers[posn].nb_eventfds;
> +
> +    for (i = 0; i < guest_curr_max; i++)
> +        close(s->peers[posn].eventfds[i]);

CODING_STYLE: the for loop body needs braces.

> +
> +    qemu_free(s->peers[posn].eventfds);
> +    s->peers[posn].nb_eventfds = 0;
> +}
> +
> +static void setup_ioeventfds(IVShmemState *s) {
> +
> +    int i, j;
> +
> +    for (i = 0; i <= s->max_peer; i++) {
> +        for (j = 0; j < s->peers[i].nb_eventfds; j++) {
> +            kvm_set_ioeventfd_mmio_long(s->peers[i].eventfds[j],
> +                    s->mmio_addr + Doorbell, (i << 16) | j, 1);
> +        }
> +    }
> +
> +    /* setup irqfd for this VM's eventfds */
> +    for (i = 0; i < s->vectors; i++) {
> +        kvm_set_irqfd(s->dev.msix_irq_entries[i].gsi,
> +                        s->peers[s->vm_id].eventfds[i], 1);

kvm_set_irqfd() does not exist in HEAD.

> +    }
> +}
> +
> +
> +/* this function increase the dynamic storage need to store data about other
> + * guests */
> +static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
> +
> +    int j, old_nb_alloc;
> +
> +    old_nb_alloc = s->nb_peers;
> +
> +    while (new_min_size >= s->nb_peers)
> +        s->nb_peers = s->nb_peers * 2;
> +
> +    IVSHMEM_DPRINTF("bumping storage to %d guests\n", s->nb_peers);
> +    s->peers = qemu_realloc(s->peers, s->nb_peers * sizeof(Peer));
> +
> +    if (s->peers == NULL) {
> +        fprintf(stderr, "Allocation error - exiting\n");
> +        exit(1);
> +    }

qemu_realloc() will never return NULL, so this check is dead code.
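
To illustrate the pattern with plain libc: a wrapper that aborts on
failure lets callers drop their own error path. xrealloc() and
grow_table() are hypothetical names for this sketch, not QEMU
functions; the doubling loop mirrors increase_dynamic_storage().

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: returns a valid pointer or does not return. */
static void *xrealloc(void *ptr, size_t size)
{
    void *p = realloc(ptr, size ? size : 1);

    if (p == NULL) {
        fprintf(stderr, "out of memory\n");
        abort();
    }
    return p;
}

/* Doubles the table until 'min_index' fits, zeroing the new slots. */
static int *grow_table(int *table, size_t *nb_alloc, size_t min_index)
{
    size_t old_nb = *nb_alloc;
    size_t new_nb = old_nb ? old_nb : 1;
    size_t i;

    while (min_index >= new_nb) {
        new_nb *= 2;
    }

    /* No NULL check needed: xrealloc() aborts on failure. */
    table = xrealloc(table, new_nb * sizeof(*table));

    for (i = old_nb; i < new_nb; i++) {
        table[i] = 0;
    }
    *nb_alloc = new_nb;
    return table;
}
```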

> +
> +    /* zero out new pointers */
> +    for (j = old_nb_alloc; j < s->nb_peers; j++) {
> +        s->peers[j].eventfds = NULL;
> +        s->peers[j].nb_eventfds = 0;
> +    }
> +}
> +
> +static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
> +{
> +    IVShmemState *s = opaque;
> +    int incoming_fd, tmp_fd;
> +    int guest_curr_max;
> +    long incoming_posn;
> +
> +    memcpy(&incoming_posn, buf, sizeof(long));
> +    /* pick off s->server_chr->msgfd and store it, posn should accompany msg */
> +    tmp_fd = qemu_chr_get_msgfd(s->server_chr);
> +    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
> +
> +    /* make sure we have enough space for this guest */
> +    if (incoming_posn >= s->nb_peers) {
> +        increase_dynamic_storage(s, incoming_posn);
> +    }
> +
> +    if (tmp_fd == -1) {
> +        /* if posn is positive and unseen before then this is our posn*/
> +        if ((incoming_posn >= 0) && (s->peers[incoming_posn].eventfds == NULL)) {
> +            /* receive our posn */
> +            s->vm_id = incoming_posn;
> +            return;
> +        } else {
> +            /* otherwise an fd == -1 means an existing guest has gone away */
> +            IVSHMEM_DPRINTF("posn %ld has gone away\n", incoming_posn);
> +            close_guest_eventfds(s, incoming_posn);
> +            return;
> +        }
> +    }
> +
> +    /* because of the implementation of get_msgfd, we need a dup */
> +    incoming_fd = dup(tmp_fd);
> +
> +    if (incoming_fd == -1) {
> +        fprintf(stderr, "could not allocate file descriptor %s\n",
> +                                                            strerror(errno));
> +        return;
> +    }
> +
> +    /* if the position is -1, then it's shared memory region fd */
> +    if (incoming_posn == -1) {
> +
> +        void * map_ptr;
> +
> +        s->max_peer = 0;
> +
> +        if (check_shm_size(s, incoming_fd) == -1) {
> +            exit(-1);
> +        }
> +
> +        /* mmap the region and map into the BAR2 */
> +        map_ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED,
> +                                                                incoming_fd, 0);
> +        s->ivshmem_offset = qemu_ram_map(s->ivshmem_size, map_ptr);
> +
> +        IVSHMEM_DPRINTF("guest pci addr = %u, guest h/w addr = %u, size = %u\n",
> +                        (uint32_t)s->shm_pci_addr, (uint32_t)s->ivshmem_offset,
> +                        (uint32_t)s->ivshmem_size);
> +
> +        if (s->shm_pci_addr > 0) {
> +            /* map memory into BAR2 */
> +            cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
> +                                                            s->ivshmem_offset);
> +            if (s->role && strncmp(s->role, "peer", 4) == 0) {
> +                IVSHMEM_DPRINTF("marking pages no migrate\n");
> +                cpu_mark_pages_no_migrate(s->ivshmem_offset, s->ivshmem_size);
> +            }
> +
> +        }
> +
> +        /* only store the fd if it is successfully mapped */
> +        s->shm_fd = incoming_fd;
> +
> +        return;
> +    }
> +
> +    /* each guest has an array of eventfds, and we keep track of how many
> +     * guests for each VM */
> +    guest_curr_max = s->peers[incoming_posn].nb_eventfds;
> +    if (guest_curr_max == 0) {
> +        /* one eventfd per MSI vector */
> +        s->peers[incoming_posn].eventfds = (int *) qemu_malloc(s->vectors *
> +                                                                sizeof(int));

Useless cast in C.

> +    }
> +
> +    /* this is an eventfd for a particular guest VM */
> +    IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn, guest_curr_max,
> +                                                                incoming_fd);
> +    s->peers[incoming_posn].eventfds[guest_curr_max] = incoming_fd;
> +
> +    /* increment count for particular guest */
> +    s->peers[incoming_posn].nb_eventfds++;
> +
> +    /* keep track of the maximum VM ID */
> +    if (incoming_posn > s->max_peer) {
> +        s->max_peer = incoming_posn;
> +    }
> +
> +    if (incoming_posn == s->vm_id) {
> +        int vector = guest_curr_max;

Why add a new variable?

> +        if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
> +            /* initialize char device for callback
> +             * if this is one of my eventfds */
> +            s->eventfd_chr[vector] = create_eventfd_chr_device(s,
> +                       s->peers[s->vm_id].eventfds[vector], vector);
> +        }
> +    }
> +
> +    return;
> +}
> +
> +static void ivshmem_reset(DeviceState *d)
> +{
> +    return;

Nothing to do?

> +}
> +
> +static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
> +                       pcibus_t addr, pcibus_t size, int type)
> +{
> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
> +
> +    s->mmio_addr = addr;
> +    cpu_register_physical_memory(addr + 0, 0x400, s->ivshmem_mmio_io_addr);

The 0x400 should be #defined earlier. Why 0x400, since you are only
interested in the first 0x100 bytes?

> +
> +    /* ioeventfd and irqfd are enabled together,
> +     * so the flag IRQFD refers to both */
> +    if (ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
> +        setup_ioeventfds(s);
> +    }
> +}
> +
> +static uint64_t ivshmem_get_size(IVShmemState * s) {
> +
> +    uint64_t value;
> +    char *ptr;
> +
> +    value = strtoul(s->sizearg, &ptr, 10);

I'd use strtoull(), but the whole function should be removed, see below.

> +    switch (*ptr) {
> +        case 0: case 'M': case 'm':
> +            value <<= 20;
> +            break;
> +        case 'G': case 'g':
> +            value <<= 30;
> +            break;
> +        default:
> +            fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg);
> +            exit(1);
> +    }
> +
> +    /* BARs must be a power of 2 */
> +    if (!is_power_of_two(value)) {
> +        fprintf(stderr, "ivshmem: size must be power of 2\n");
> +        exit(1);
> +    }

Isn't the BAR check in pci.c enough?

> +
> +    return value;
> +
> +}
> +
> +static void ivshmem_setup_msi(IVShmemState * s) {
> +
> +    int i;
> +
> +    /* allocate the MSI-X vectors */
> +
> +    if (!msix_init(&s->dev, s->vectors, 1, 0)) {
> +        pci_register_bar(&s->dev, 1,
> +                         msix_bar_size(&s->dev),
> +                         PCI_BASE_ADDRESS_SPACE_MEMORY,
> +                         msix_mmio_map);
> +        IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
> +    } else {
> +        IVSHMEM_DPRINTF("msix initialization failed\n");

Is this fatal considering the msix_vector_use() below?

> +    }
> +
> +    /* 'activate' the vectors */
> +    for (i = 0; i < s->vectors; i++) {
> +        msix_vector_use(&s->dev, i);
> +    }
> +
> +    /* if IRQFDs are not supported, we'll have to trigger the interrupts
> +     * via Qemu char devices */
> +    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
> +        /* for handling interrupts when IRQFD is not available */
> +        s->eventfd_table = qemu_mallocz(s->vectors * sizeof(EventfdEntry));
> +    }
> +}
> +
> +static void ivshmem_save(QEMUFile* f, void *opaque)
> +{
> +    IVShmemState *proxy = opaque;
> +
> +    IVSHMEM_DPRINTF("ivshmem_save\n");
> +    pci_device_save(&proxy->dev, f);
> +
> +    if (ivshmem_has_feature(proxy, IVSHMEM_MSI)) {
> +        msix_save(&proxy->dev, f);
> +    } else {
> +        qemu_put_be32(f, proxy->intrstatus);
> +        qemu_put_be32(f, proxy->intrmask);
> +    }
> +
> +}

There may be VMState magic to handle conditional structures (or just
make the structures unconditional), so VMState should be used instead.

> +
> +static int ivshmem_load(QEMUFile* f, void *opaque, int version_id)
> +{
> +    IVSHMEM_DPRINTF("ivshmem_load\n");
> +
> +    IVShmemState *proxy = opaque;
> +    int ret, i;
> +

Missing version check.
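
Something along these lines at the top of the load handler, as a
sketch. The macro and function names are hypothetical; the value 0
matches the version_id passed to register_savevm() in this patch.

```c
#include <errno.h>

/* Hypothetical name; 0 is the version_id the patch registers. */
#define IVSHMEM_SAVE_VERSION 0

/* Returns 0 if the incoming stream version is one we can load. */
static int ivshmem_check_version(int version_id)
{
    if (version_id != IVSHMEM_SAVE_VERSION) {
        return -EINVAL;
    }
    return 0;
}
```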

> +    ret = pci_device_load(&proxy->dev, f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    if (ivshmem_has_feature(proxy, IVSHMEM_MSI)) {
> +        msix_load(&proxy->dev, f);
> +        for (i = 0; i < proxy->vectors; i++) {
> +            msix_vector_use(&proxy->dev, i);
> +        }
> +    } else {
> +        proxy->intrstatus = qemu_get_be32(f);
> +        proxy->intrmask = qemu_get_be32(f);
> +    }
> +
> +    return 0;
> +}
> +
> +static int pci_ivshmem_init(PCIDevice *dev)
> +{
> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
> +    uint8_t *pci_conf;
> +
> +    if (s->sizearg == NULL)
> +        s->ivshmem_size = 4 << 20; /* 4 MB default */
> +    else {
> +        s->ivshmem_size = ivshmem_get_size(s);
> +    }
> +
> +    register_savevm("ivshmem", 0, 0, ivshmem_save, ivshmem_load, dev);
> +
> +    /* IRQFD requires MSI */
> +    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) &&
> +        !ivshmem_has_feature(s, IVSHMEM_MSI)) {
> +        fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n");
> +        exit(1);
> +    }
> +
> +    /* check that role is reasonable */
> +    if (s->role && !((strncmp(s->role, "peer", 5) == 0) ||
> +                        (strncmp(s->role, "master", 7) == 0))) {
> +        fprintf(stderr, "ivshmem: 'role' must be 'peer' or 'master'\n");
> +        exit(1);
> +    }

I'd add a scalar flag in IVShmemState for role so that further strcmps
are avoided.
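
A sketch of what I mean: parse the string once at init time and keep
an enum. The type and function names are hypothetical. Treating an
unset role as master is an assumption, based on the patch only
special-casing "peer".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical scalar representation of the 'role' property. */
typedef enum IVShmemRole {
    IVSHMEM_ROLE_MASTER,
    IVSHMEM_ROLE_PEER,
    IVSHMEM_ROLE_INVALID,
} IVShmemRole;

static IVShmemRole ivshmem_parse_role(const char *role)
{
    if (role == NULL) {
        /* Assumption: no role given behaves like master. */
        return IVSHMEM_ROLE_MASTER;
    }
    if (strcmp(role, "master") == 0) {
        return IVSHMEM_ROLE_MASTER;
    }
    if (strcmp(role, "peer") == 0) {
        return IVSHMEM_ROLE_PEER;
    }
    return IVSHMEM_ROLE_INVALID;
}
```

The map callbacks can then test a plain 's->role == IVSHMEM_ROLE_PEER'
style comparison instead of repeating strncmp() with differing lengths.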

> +
> +    pci_conf = s->dev.config;
> +    pci_conf[0x00] = 0xf4; /* Qumranet vendor ID 0x5002 */
> +    pci_conf[0x01] = 0x1a;
> +    pci_conf[0x02] = 0x10;
> +    pci_conf[0x03] = 0x11;

Please add the DID to hw/pci_ids.h and use pci_config_set_xyz() here.
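
To show the effect with a standalone sketch: config_set_word() below is
a hypothetical stand-in for the pci_config_set_*() helpers, and the
DEMO_* names are placeholders for whatever ends up in hw/pci_ids.h. The
ID values are just the four bytes from the patch read as little-endian
words.

```c
#include <stdint.h>

/* Placeholder names; values are the bytes from the patch. */
#define DEMO_VENDOR_ID_OFFSET   0x00
#define DEMO_DEVICE_ID_OFFSET   0x02
#define DEMO_VENDOR_ID          0x1af4  /* bytes 0xf4, 0x1a */
#define DEMO_DEVICE_ID_IVSHMEM  0x1110  /* bytes 0x10, 0x11 */

/* Stand-in helper: store a 16-bit value in little-endian byte order. */
static void config_set_word(uint8_t *config, unsigned int offset,
                            uint16_t val)
{
    config[offset] = val & 0xff;
    config[offset + 1] = val >> 8;
}

static void demo_set_ids(uint8_t *config)
{
    config_set_word(config, DEMO_VENDOR_ID_OFFSET, DEMO_VENDOR_ID);
    config_set_word(config, DEMO_DEVICE_ID_OFFSET, DEMO_DEVICE_ID_IVSHMEM);
}
```

Note that the bytes decode to vendor 0x1af4 and device 0x1110, which
does not match the 0x5002 mentioned in the patch's comment.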

> +    pci_conf[0x04] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
> +    pci_conf[0x0a] = 0x00; /* RAM controller */
> +    pci_conf[0x0b] = 0x05;
> +    pci_conf[0x0e] = 0x00; /* header_type */
> +
> +    pci_conf[PCI_INTERRUPT_PIN] = 1;
> +
> +    s->shm_pci_addr = 0;
> +    s->ivshmem_offset = 0;
> +    s->shm_fd = 0;
> +
> +    s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read,
> +                                    ivshmem_mmio_write, s);
> +    /* region for registers*/
> +    pci_register_bar(&s->dev, 0, 0x400,
> +                           PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_mmio_map);
> +
> +    if ((s->server_chr != NULL) &&
> +                        (strncmp(s->server_chr->filename, "unix:", 5) == 0)) {
> +        /* if we get a UNIX socket as the parameter we will talk
> +         * to the ivshmem server to receive the memory region */
> +
> +        IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
> +                                                    s->server_chr->filename);
> +
> +        if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
> +            ivshmem_setup_msi(s);
> +        }
> +
> +        /* we allocate enough space for 16 guests and grow as needed */
> +        s->nb_peers = 16;
> +        s->vm_id = -1;
> +
> +        /* allocate/initialize space for interrupt handling */
> +        s->peers = qemu_mallocz(s->nb_peers * sizeof(Peer));
> +
> +        pci_register_bar(&s->dev, 2, s->ivshmem_size,
> +                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
> +
> +        s->eventfd_chr = (CharDriverState **) qemu_mallocz(s->vectors *
> +                                                sizeof(CharDriverState *));

Useless cast in C.

> +
> +        qemu_chr_add_handlers(s->server_chr, ivshmem_can_receive, ivshmem_read,
> +                     ivshmem_event, s);
> +    } else {
> +        /* just map the file immediately, we're not using a server */
> +        int fd;
> +
> +        if (s->shmobj == NULL) {
> +            fprintf(stderr, "Must specify 'chardev' or 'shm' to ivshmem\n");
> +        }

I'd rather have separate 'chardev' and 'shm_file' parameters. Then
'info qtree' could return more useful information about the chardev.

> +
> +        IVSHMEM_DPRINTF("using shm_open (shm object = %s)\n", s->shmobj);
> +
> +        /* try opening with O_EXCL and if it succeeds zero the memory
> +         * by truncating to 0 */
> +        if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR|O_EXCL,
> +                        S_IRWXU|S_IRWXG|S_IRWXO)) > 0) {
> +           /* truncate file to length PCI device's memory */
> +            if (ftruncate(fd, s->ivshmem_size) != 0) {
> +                fprintf(stderr, "kvm_ivshmem: could not truncate shared file\n");

Why 'kvm_ivshmem'?

> +            }
> +
> +        } else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR,
> +                        S_IRWXU|S_IRWXG|S_IRWXO)) < 0) {
> +            fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
> +            exit(-1);
> +
> +        }
> +
> +        if (check_shm_size(s, fd) == -1) {
> +            exit(-1);
> +        }
> +
> +        create_shared_memory_BAR(s, fd);
> +
> +    }
> +
> +    return 0;
> +}
> +
> +static int pci_ivshmem_uninit(PCIDevice *dev)
> +{
> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
> +
> +    cpu_unregister_io_memory(s->ivshmem_mmio_io_addr);
> +
> +    return 0;
> +}
> +
> +static PCIDeviceInfo ivshmem_info = {
> +    .qdev.name  = "ivshmem",
> +    .qdev.size  = sizeof(IVShmemState),
> +    .qdev.reset = ivshmem_reset,
> +    .init       = pci_ivshmem_init,
> +    .exit       = pci_ivshmem_uninit,
> +    .qdev.props = (Property[]) {
> +        DEFINE_PROP_CHR("chardev", IVShmemState, server_chr),
> +        DEFINE_PROP_STRING("size", IVShmemState, sizearg),

This should be scalar type, not string.

> +        DEFINE_PROP_UINT32("vectors", IVShmemState, vectors, 1),
> +        DEFINE_PROP_BIT("irqfd", IVShmemState, features, IVSHMEM_IRQFD, false),
> +        DEFINE_PROP_BIT("msi", IVShmemState, features, IVSHMEM_MSI, true),
> +        DEFINE_PROP_STRING("shm", IVShmemState, shmobj),
> +        DEFINE_PROP_STRING("role", IVShmemState, role),
> +        DEFINE_PROP_END_OF_LIST(),
> +    }
> +};
> +
> +static void ivshmem_register_devices(void)
> +{
> +    pci_qdev_register(&ivshmem_info);
> +}
> +
> +device_init(ivshmem_register_devices)
> diff --git a/qemu-char.c b/qemu-char.c
> index ac65a1c..b2e50d0 100644
> --- a/qemu-char.c
> +++ b/qemu-char.c
> @@ -2093,6 +2093,12 @@ static void tcp_chr_read(void *opaque)
>     }
>  }
>
> +CharDriverState *qemu_chr_open_eventfd(int eventfd){
> +
> +    return qemu_chr_open_fd(eventfd, eventfd);
> +
> +}
> +
>  static void tcp_chr_connect(void *opaque)
>  {
>     CharDriverState *chr = opaque;
> diff --git a/qemu-char.h b/qemu-char.h
> index e3a0783..6ea01ba 100644
> --- a/qemu-char.h
> +++ b/qemu-char.h
> @@ -94,6 +94,9 @@ void qemu_chr_info_print(Monitor *mon, const QObject *ret_data);
>  void qemu_chr_info(Monitor *mon, QObject **ret_data);
>  CharDriverState *qemu_chr_find(const char *name);
>
> +/* add an eventfd to the qemu devices that are polled */
> +CharDriverState *qemu_chr_open_eventfd(int eventfd);

Maybe this should be removed and just open coded with qemu_chr_open_fd.

> +
>  extern int term_escape_char;
>
>  /* async I/O support */
> diff --git a/qemu-doc.texi b/qemu-doc.texi
> index 6647b7b..24f8748 100644
> --- a/qemu-doc.texi
> +++ b/qemu-doc.texi
> @@ -706,6 +706,49 @@ Using the @option{-net socket} option, it is possible to make VLANs
>  that span several QEMU instances. See @ref{sec_invocation} to have a
>  basic example.
>
> +@section Other Devices
> +
> +@subsection Inter-VM Shared Memory device
> +
> +With KVM enabled on a Linux host, a shared memory device is available.  Guests
> +map a POSIX shared memory region into the guest as a PCI device that enables
> +zero-copy communication to the application level of the guests.  The basic
> +syntax is:
> +
> +@example
> +qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
> +@end example
> +
> +If desired, interrupts can be sent between guest VMs accessing the same shared
> +memory region.  Interrupt support requires using a shared memory server and
> +using a chardev socket to connect to it.  The code for the shared memory server
> +is qemu.git/contrib/ivshmem-server.  An example syntax when using the shared
> +memory server is:
> +
> +@example
> +qemu -device ivshmem,size=<size in format accepted by -m>[,chardev=<id>]
> +                        [,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
> +qemu -chardev socket,path=<path>,id=<id>
> +@end example
> +
> +When using the server, the guest will be assigned a VM ID (>=0) that allows guests
> +using the same server to communicate via interrupts.  Guests can read their
> +VM ID from a device register (see example code).  Since receiving the shared
> +memory region from the server is asynchronous, there is a (small) chance the
> +guest may boot before the shared memory is attached.  To allow an application
> +to ensure shared memory is attached, the VM ID register will return -1 (an
> +invalid VM ID) until the memory is attached.  Once the shared memory is
> +attached, the VM ID will return the guest's valid VM ID.  With these semantics,
> +the guest application can check to ensure the shared memory is attached to the
> +guest before proceeding.
> +
> +The @option{role} argument can be set to either master or peer and will affect
> +how the shared memory is migrated.  With @option{role=master}, the guest will
> +copy the shared memory on migration to the destination host.  With
> +@option{role=peer}, the shared memory will not be copied on migration.  Only
> +one guest should be specified as
> +the master.
> +
>  @node direct_linux_boot
>  @section Direct Linux Boot
>
> --
> 1.6.3.2.198.g6096d
>
>
>


* Re: [Qemu-devel] [PATCH v6 5/6] Inter-VM shared memory PCI device
  2010-06-05  9:44           ` [Qemu-devel] [PATCH v6 5/6] Inter-VM shared memory PCI device Blue Swirl
@ 2010-06-06 15:02             ` Avi Kivity
  2010-06-07 16:41             ` Cam Macdonell
  1 sibling, 0 replies; 42+ messages in thread
From: Avi Kivity @ 2010-06-06 15:02 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Cam Macdonell, qemu-devel, kvm

On 06/05/2010 12:44 PM, Blue Swirl wrote:
> On Fri, Jun 4, 2010 at 9:45 PM, Cam Macdonell<cam@cs.ualberta.ca>  wrote:
>    
>> Support an inter-vm shared memory device that maps a shared-memory object as a
>> PCI device in the guest.  This patch also supports interrupts between guest by
>> communicating over a unix domain socket.  This patch applies to the qemu-kvm
>> repository.
>>
>>     -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>
>> Interrupts are supported between multiple VMs by using a shared memory server
>> by using a chardev socket.
>>
>>     -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>            [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
>>     -chardev socket,path=<path>,id=<id>
>>
>> (shared memory server is qemu.git/contrib/ivshmem-server)
>>
>> Sample programs and init scripts are in a git repo here:
>>
>>      
> Why is this KVM specific BTW, Posix SHM is available on many
> platforms? What would happen if kvm_set_foobar functions were not
> called when KVM is not being used? Is host eventfd support essential?
>    

It's not kvm specific, it's 
atomic-ops-on-shared-memory-are-visible-as-atomic-ops specific, which is 
currently only available with kvm.  When tcg gains true smp support (and 
not just against other tcg threads) this can work with tcg as well.

I guess that needs a host with at least 32/64 bit CAS for 32/64 bit 
targets respectively, and double that if the target has DCAS.  Not sure 
how targets with ll/sc can be implemented, especially if there are 
limits as to what can go in between.
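
As a rough illustration of the kind of operation this depends on, here is a minimal C11 sketch of a lock word kept in the shared region and taken with compare-and-swap. The helpers `shm_try_lock` and `shm_unlock` are hypothetical and not part of the patch; the sketch assumes every party maps the same 32-bit word and that the host provides a lock-free 32-bit CAS.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, not part of the ivshmem patch.
 * 'lock' is assumed to point at a 32-bit word inside the shared region:
 * 0 means free, any other value is the owner's tag (for example VM ID + 1). */
static bool shm_try_lock(_Atomic uint32_t *lock, uint32_t owner_tag)
{
    uint32_t expected = 0;

    /* Succeeds only if the word is still 0; compare and store happen
     * as one atomic step, which is the property discussed above. */
    return atomic_compare_exchange_strong(lock, &expected, owner_tag);
}

static void shm_unlock(_Atomic uint32_t *lock)
{
    atomic_store(lock, 0);
}
```

If the emulator splits the guest's CAS into a separate load and store, two guests can both see 0 and both believe they hold the lock, which is why the guarantee matters here.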

-- 
error compiling committee.c: too many arguments to function



* Re: [Qemu-devel] [PATCH v6 5/6] Inter-VM shared memory PCI device
  2010-06-05  9:44           ` [Qemu-devel] [PATCH v6 5/6] Inter-VM shared memory PCI device Blue Swirl
  2010-06-06 15:02             ` Avi Kivity
@ 2010-06-07 16:41             ` Cam Macdonell
  2010-06-09 20:12               ` Blue Swirl
  1 sibling, 1 reply; 42+ messages in thread
From: Cam Macdonell @ 2010-06-07 16:41 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, kvm

On Sat, Jun 5, 2010 at 3:44 AM, Blue Swirl <blauwirbel@gmail.com> wrote:
> On Fri, Jun 4, 2010 at 9:45 PM, Cam Macdonell <cam@cs.ualberta.ca> wrote:
>> Support an inter-vm shared memory device that maps a shared-memory object as a
>> PCI device in the guest.  This patch also supports interrupts between guest by
>> communicating over a unix domain socket.  This patch applies to the qemu-kvm
>> repository.
>>
>>    -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>
>> Interrupts are supported between multiple VMs by using a shared memory server
>> by using a chardev socket.
>>
>>    -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>           [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
>>    -chardev socket,path=<path>,id=<id>
>>
>> (shared memory server is qemu.git/contrib/ivshmem-server)
>>
>> Sample programs and init scripts are in a git repo here:
>>
>>    www.gitorious.org/nahanni
>> ---
>>  Makefile.target |    3 +
>>  hw/ivshmem.c    |  852 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  qemu-char.c     |    6 +
>>  qemu-char.h     |    3 +
>>  qemu-doc.texi   |   43 +++
>>  5 files changed, 907 insertions(+), 0 deletions(-)
>>  create mode 100644 hw/ivshmem.c
>>
>> diff --git a/Makefile.target b/Makefile.target
>> index c4ba592..4888308 100644
>> --- a/Makefile.target
>> +++ b/Makefile.target
>> @@ -202,6 +202,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
>>  obj-y += rtl8139.o
>>  obj-y += e1000.o
>>
>> +# Inter-VM PCI shared memory
>> +obj-y += ivshmem.o
>> +
>
> Can this be compiled once, simply by moving this to Makefile.objs
> instead of Makefile.target? Also, because the code seems to be KVM
> specific, it can't be compiled unconditionally but depending on at
> least CONFIG_KVM and maybe CONFIG_EVENTFD.

The device uses eventfds for signalling, but never creates the
eventfds itself as they are passed from a server using SCM_RIGHTS.
So, it does not depend on the eventfd API.  Its dependence on
irqfd/ioeventfd (the only KVM-specific bits) is optional via the
command line.

A runtime check for KVM is done for the reasons Avi mentioned.
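
For readers unfamiliar with SCM_RIGHTS, the following is a minimal sketch of passing a file descriptor together with a `long` position over a unix domain socket. The helper names `send_fd` and `recv_fd` are hypothetical, and the message layout (one `long` plus one fd) is an assumption modelled on what ivshmem_read() expects, not a description of the actual server.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized for exactly one file descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;   /* forces correct alignment of buf */
};

/* Hypothetical helper: send 'posn' as payload and 'fd' as ancillary data.
 * Returns 0 on success, -1 on failure. */
static int send_fd(int sock, long posn, int fd)
{
    struct iovec iov = { .iov_base = &posn, .iov_len = sizeof(posn) };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == (ssize_t)sizeof(posn) ? 0 : -1;
}

/* Hypothetical helper: receive the position and the passed descriptor.
 * Returns the new fd, or -1 if the read failed or no fd was attached. */
static int recv_fd(int sock, long *posn)
{
    struct iovec iov = { .iov_base = posn, .iov_len = sizeof(*posn) };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != (ssize_t)sizeof(*posn)) {
        return -1;
    }

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS &&
        cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    }
    return fd;
}
```

The receiver gets a new descriptor that refers to the same open file as the sender's, so the device can write to or poll an eventfd it never created.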

>
> Why is this KVM specific BTW, Posix SHM is available on many
> platforms? What would happen if kvm_set_foobar functions were not
> called when KVM is not being used? Is host eventfd support essential?
>
>>  # Hardware support
>>  obj-i386-y += vga.o
>>  obj-i386-y += mc146818rtc.o i8259.o pc.o
>> diff --git a/hw/ivshmem.c b/hw/ivshmem.c
>> new file mode 100644
>> index 0000000..9057612
>> --- /dev/null
>> +++ b/hw/ivshmem.c
>> @@ -0,0 +1,852 @@
>> +/*
>> + * Inter-VM Shared Memory PCI device.
>> + *
>> + * Author:
>> + *      Cam Macdonell <cam@cs.ualberta.ca>
>> + *
>> + * Based On: cirrus_vga.c
>> + *          Copyright (c) 2004 Fabrice Bellard
>> + *          Copyright (c) 2004 Makoto Suzuki (suzu)
>> + *
>> + *      and rtl8139.c
>> + *          Copyright (c) 2006 Igor Kovalenko
>> + *
>> + * This code is licensed under the GNU GPL v2.
>> + */
>> +#include <sys/mman.h>
>> +#include <sys/types.h>
>> +#include <sys/socket.h>
>> +#include <sys/io.h>
>> +#include <sys/ioctl.h>
>> +#include "hw.h"
>> +#include "console.h"
>> +#include "pc.h"
>> +#include "pci.h"
>> +#include "sysemu.h"
>> +
>> +#include "msix.h"
>> +#include "qemu-kvm.h"
>> +#include "libkvm.h"
>> +
>> +#include <sys/eventfd.h>
>> +#include <sys/mman.h>
>> +#include <sys/socket.h>
>> +#include <sys/ioctl.h>
>> +
>> +#define IVSHMEM_IRQFD   0
>> +#define IVSHMEM_MSI     1
>> +
>> +//#define DEBUG_IVSHMEM
>> +#ifdef DEBUG_IVSHMEM
>> +#define IVSHMEM_DPRINTF(fmt, args...)        \
>> +    do {printf("IVSHMEM: " fmt, ##args); } while (0)
>
> Please use __VA_ARGS__.
>
>> +#else
>> +#define IVSHMEM_DPRINTF(fmt, args...)
>> +#endif
>> +
>> +typedef struct Peer {
>> +    int nb_eventfds;
>> +    int *eventfds;
>> +} Peer;
>> +
>> +typedef struct EventfdEntry {
>> +    PCIDevice *pdev;
>> +    int vector;
>> +} EventfdEntry;
>> +
>> +typedef struct IVShmemState {
>> +    PCIDevice dev;
>> +    uint32_t intrmask;
>> +    uint32_t intrstatus;
>> +    uint32_t doorbell;
>> +
>> +    CharDriverState ** eventfd_chr;
>
> I'd remove the space between '**' and 'eventfd_chr', it's used inconsistently.
>
>> +    CharDriverState * server_chr;
>> +    int ivshmem_mmio_io_addr;
>> +
>> +    pcibus_t mmio_addr;
>> +    pcibus_t shm_pci_addr;
>> +    uint64_t ivshmem_offset;
>> +    uint64_t ivshmem_size; /* size of shared memory region */
>> +    int shm_fd; /* shared memory file descriptor */
>> +
>> +    Peer *peers;
>> +    int nb_peers; /* how many guests we have space for */
>> +    int max_peer; /* maximum numbered peer */
>> +
>> +    int vm_id;
>> +    uint32_t vectors;
>> +    uint32_t features;
>> +    EventfdEntry *eventfd_table;
>> +
>> +    char * shmobj;
>> +    char * sizearg;
>> +    char * role;
>> +} IVShmemState;
>> +
>> +/* registers for the Inter-VM shared memory device */
>> +enum ivshmem_registers {
>> +    IntrMask = 0,
>> +    IntrStatus = 4,
>> +    IVPosition = 8,
>> +    Doorbell = 12,
>> +};
>
> IIRC these should be uppercase.

I worked from rtl8139, which doesn't have them uppercase.  But doing a
quick search, I can see most devices do, so I'll fix that.

>
>> +
>> +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
>> +    return (ivs->features & (1 << feature));
>> +}
>
> Since this is the first version, do we need any features at this
> point, can't we expect that all features are available now? Why does
> the user need to specify the features?

Some features require host support such as irqfd/ioeventfds.  So
having features allows them to be disabled on the command-line (e.g.,
irqfd=off).
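
To make the feature bits concrete, here is a small sketch of how such a mask could be tested and toggled. `has_feature` and `set_feature` are hypothetical names that take the mask directly rather than an IVShmemState; only the two bit positions come from the patch. The feature index is unsigned, which also covers the negative-shift concern raised below.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined in the patch. */
#define IVSHMEM_IRQFD   0
#define IVSHMEM_MSI     1

/* Hypothetical variant of ivshmem_has_feature() that takes the mask itself.
 * 'feature' is unsigned, so the shift count can never be negative. */
static inline bool has_feature(uint32_t features, unsigned int feature)
{
    return feature < 32 && ((features >> feature) & 1u);
}

/* Hypothetical helper: switch one feature bit on or off, as a command-line
 * option such as irqfd=off might do. Assumes feature < 32. */
static inline uint32_t set_feature(uint32_t features, unsigned int feature,
                                   bool on)
{
    uint32_t bit = UINT32_C(1) << feature;

    return on ? (features | bit) : (features & ~bit);
}
```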

>
> To avoid a negative shift, I'd make 'feature' unsigned.
>
>> +
>> +static inline bool is_power_of_two(uint64_t x) {
>> +    return (x & (x - 1)) == 0;
>> +}
>> +
>> +static void ivshmem_map(PCIDevice *pci_dev, int region_num,
>> +                    pcibus_t addr, pcibus_t size, int type)
>> +{
>> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
>> +
>> +    s->shm_pci_addr = addr;
>> +
>> +    if (s->ivshmem_offset > 0) {
>> +        cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
>> +                                                            s->ivshmem_offset);
>> +        if (s->role && strncmp(s->role, "peer", 4) == 0) {
>> +            IVSHMEM_DPRINTF("marking pages no migrate\n");
>> +            cpu_mark_pages_no_migrate(s->ivshmem_offset, s->ivshmem_size);
>> +        }
>> +    }
>> +
>> +    IVSHMEM_DPRINTF("guest pci addr = %u, guest h/w addr = %u, size = %u\n",
>> +                (uint32_t)addr, (uint32_t)s->ivshmem_offset, (uint32_t)size);
>
> Please use FMT_PCIBUS for addr and size and PRIu64 for s->ivshmem_offset.
>
>> +
>> +}
>> +
>> +/* accessing registers - based on rtl8139 */
>> +static void ivshmem_update_irq(IVShmemState *s, int val)
>> +{
>> +    int isr;
>> +    isr = (s->intrstatus & s->intrmask) & 0xffffffff;
>> +
>> +    /* don't print ISR resets */
>> +    if (isr) {
>> +        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
>> +           isr ? 1 : 0, s->intrstatus, s->intrmask);
>> +    }
>> +
>> +    qemu_set_irq(s->dev.irq[0], (isr != 0));
>> +}
>> +
>> +static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
>> +{
>> +    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
>> +
>> +    s->intrmask = val;
>> +
>> +    ivshmem_update_irq(s, val);
>> +}
>> +
>> +static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
>> +{
>> +    uint32_t ret = s->intrmask;
>> +
>> +    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
>> +{
>> +    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
>> +
>> +    s->intrstatus = val;
>> +
>> +    ivshmem_update_irq(s, val);
>> +    return;
>> +}
>> +
>> +static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
>> +{
>> +    uint32_t ret = s->intrstatus;
>> +
>> +    /* reading ISR clears all interrupts */
>> +    s->intrstatus = 0;
>> +
>> +    ivshmem_update_irq(s, 0);
>> +
>> +    return ret;
>> +}
>> +
>> +static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val)
>> +{
>> +
>> +    IVSHMEM_DPRINTF("We shouldn't be writing words\n");
>> +}
>> +
>> +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
>> +{
>> +    IVShmemState *s = opaque;
>> +
>> +    u_int64_t write_one = 1;
>
> Please use uintNN_t instead of u_intNN_t.
>
>> +    u_int16_t dest = val >> 16;
>> +    u_int16_t vector = val & 0xff;
>> +
>> +    addr &= 0xfc;
>
> I'd add a debug printf here, likewise for exit of ivshmem_io_readl().
> When you do the merge (see below), the correct printf format for the
> addresses will be TARGET_FMT_plx.
>
>> +
>> +    switch (addr)
>> +    {
>> +        case IntrMask:
>> +            ivshmem_IntrMask_write(s, val);
>> +            break;
>> +
>> +        case IntrStatus:
>> +            ivshmem_IntrStatus_write(s, val);
>> +            break;
>> +
>> +        case Doorbell:
>> +            /* check that dest VM ID is reasonable */
>> +            if ((dest < 0) || (dest > s->max_peer)) {
>> +                IVSHMEM_DPRINTF("Invalid destination VM ID (%d)\n", dest);
>> +                break;
>> +            }
>> +
>> +            /* check doorbell range */
>> +            if ((vector >= 0) && (vector < s->peers[dest].nb_eventfds)) {
>> +                IVSHMEM_DPRINTF("Writing %ld to VM %d on vector %d\n",
>> +                                                    write_one, dest, vector);
>
> PRId64 for write_one, %ld is not enough on ILP32.
>
>> +                if (write(s->peers[dest].eventfds[vector],
>> +                                                    &(write_one), 8) != 8) {
>> +                    IVSHMEM_DPRINTF("error writing to eventfd\n");
>> +                }
>> +            }
>> +            break;
>> +        default:
>> +            IVSHMEM_DPRINTF("Invalid VM Doorbell VM %d\n", dest);
>> +    }
>> +}
>> +
>> +static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val)
>> +{
>> +    IVSHMEM_DPRINTF("We shouldn't be writing bytes\n");
>> +}
>> +
>> +static uint32_t ivshmem_io_readw(void *opaque, uint8_t addr)
>> +{
>> +
>> +    IVSHMEM_DPRINTF("We shouldn't be reading words\n");
>> +    return 0;
>> +}
>> +
>> +static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr)
>> +{
>> +
>> +    IVShmemState *s = opaque;
>> +    uint32_t ret;
>> +
>> +    switch (addr)
>> +    {
>> +        case IntrMask:
>> +            ret = ivshmem_IntrMask_read(s);
>> +            break;
>> +
>> +        case IntrStatus:
>> +            ret = ivshmem_IntrStatus_read(s);
>> +            break;
>> +
>> +        case IVPosition:
>> +            /* return my VM ID if the memory is mapped */
>> +            if (s->shm_fd > 0) {
>> +                ret = s->vm_id;
>> +            } else {
>> +                ret = -1;
>> +            }
>> +            break;
>> +
>> +        default:
>> +            IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr);
>> +            ret = 0;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr)
>> +{
>> +    IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
>> +
>> +    return 0;
>> +}
>> +
>> +static void ivshmem_mmio_writeb(void *opaque,
>> +                                target_phys_addr_t addr, uint32_t val)
>> +{
>> +    ivshmem_io_writeb(opaque, addr & 0xFF, val);
>> +}
>
> This function and others below only performs a cast and useless
> masking (the address passed is these days an offset from start of the
> area). Please merge these to ivshmem_io_readl() etc.

Ok, these are artifacts from basing my patch on rtl8139.c.  I'll remove them.

>
>> +
>> +static void ivshmem_mmio_writew(void *opaque,
>> +                                target_phys_addr_t addr, uint32_t val)
>> +{
>> +    ivshmem_io_writew(opaque, addr & 0xFF, val);
>> +}
>> +
>> +static void ivshmem_mmio_writel(void *opaque,
>> +                                target_phys_addr_t addr, uint32_t val)
>> +{
>> +    ivshmem_io_writel(opaque, addr & 0xFF, val);
>> +}
>> +
>> +static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr)
>> +{
>> +    return ivshmem_io_readb(opaque, addr & 0xFF);
>> +}
>> +
>> +static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr)
>> +{
>> +    uint32_t val = ivshmem_io_readw(opaque, addr & 0xFF);
>> +    return val;
>> +}
>> +
>> +static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr)
>> +{
>> +    uint32_t val = ivshmem_io_readl(opaque, addr & 0xFF);
>> +    return val;
>> +}
>> +
>> +static CPUReadMemoryFunc *ivshmem_mmio_read[3] = {
>
> Please add 'const'.
>
>> +    ivshmem_mmio_readb,
>> +    ivshmem_mmio_readw,
>> +    ivshmem_mmio_readl,
>> +};
>> +
>> +static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = {
>> +    ivshmem_mmio_writeb,
>> +    ivshmem_mmio_writew,
>> +    ivshmem_mmio_writel,
>> +};
>> +
>> +static void ivshmem_receive(void *opaque, const uint8_t *buf, int size)
>> +{
>> +    IVShmemState *s = opaque;
>> +
>> +    ivshmem_IntrStatus_write(s, *buf);
>> +
>> +    IVSHMEM_DPRINTF("ivshmem_receive 0x%02x\n", *buf);
>> +}
>> +
>> +static int ivshmem_can_receive(void * opaque)
>> +{
>> +    return 8;
>> +}
>> +
>> +static void ivshmem_event(void *opaque, int event)
>> +{
>> +    IVSHMEM_DPRINTF("ivshmem_event %d\n", event);
>> +}
>> +
>> +static void fake_irqfd(void *opaque, const uint8_t *buf, int size) {
>> +
>> +    EventfdEntry *entry = opaque;
>> +    PCIDevice *pdev = entry->pdev;
>> +
>> +    IVSHMEM_DPRINTF("fake irqfd on vector %p %d\n", pdev, entry->vector);
>> +    msix_notify(pdev, entry->vector);
>> +}
>> +
>> +static CharDriverState* create_eventfd_chr_device(void * opaque, int eventfd,
>> +                                                                    int vector)
>> +{
>> +    /* create a event character device based on the passed eventfd */
>> +    IVShmemState *s = opaque;
>> +    CharDriverState * chr;
>> +
>> +    chr = qemu_chr_open_eventfd(eventfd);
>> +
>> +    if (chr == NULL) {
>> +        IVSHMEM_DPRINTF("creating eventfd for eventfd %d failed\n", eventfd);
>
> This should not be a DPRINTF.
>
>> +        exit(-1);
>> +    }
>> +
>> +    /* if MSI is supported we need multiple interrupts */
>> +    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
>> +        s->eventfd_table[vector].pdev = &s->dev;
>> +        s->eventfd_table[vector].vector = vector;
>> +
>> +        qemu_chr_add_handlers(chr, ivshmem_can_receive, fake_irqfd,
>> +                      ivshmem_event, &s->eventfd_table[vector]);
>> +    } else {
>> +        qemu_chr_add_handlers(chr, ivshmem_can_receive, ivshmem_receive,
>> +                      ivshmem_event, s);
>> +    }
>> +
>> +    return chr;
>> +
>> +}
>> +
>> +static int check_shm_size(IVShmemState *s, int fd) {
>> +    /* check that the guest isn't going to try and map more memory than the
>> +     * the object has allocated return -1 to indicate error */
>> +
>> +    struct stat buf;
>> +
>> +    fstat(fd, &buf);
>> +
>> +    if (s->ivshmem_size > buf.st_size) {
>> +        fprintf(stderr, "IVSHMEM ERROR: Requested memory size greater");
>> +        fprintf(stderr, " than shared object size (%ld > %ld)\n",
>> +                                          s->ivshmem_size, buf.st_size);
>
> Please use PRIx64 for s->ivshmem_size, this will cause a warning on ILP32.
>
>> +        return -1;
>> +    } else {
>> +        return 0;
>> +    }
>> +}
>> +
>> +/* create the shared memory BAR when we are not using the server, so we can
>> + * create the BAR and map the memory immediately */
>> +static void create_shared_memory_BAR(IVShmemState *s, int fd) {
>> +
>> +    void * ptr;
>> +
>> +    s->shm_fd = fd;
>> +
>> +    ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>> +
>> +    s->ivshmem_offset = qemu_ram_map(s->ivshmem_size, ptr);
>
> qemu_ram_map() does not exist in HEAD.
>

Ok, so qemu_ram_map() and kvm_set_irq() are in the KVM HEAD.  I had my
own version of both functions, but removed them when Marcelo's were
merged into KVM.

>> +
>> +    /* region for shared memory */
>> +    pci_register_bar(&s->dev, 2, s->ivshmem_size,
>> +                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
>> +}
>> +
>> +static void close_guest_eventfds(IVShmemState *s, int posn)
>> +{
>> +    int i, guest_curr_max;
>> +
>> +    guest_curr_max = s->peers[posn].nb_eventfds;
>> +
>> +    for (i = 0; i < guest_curr_max; i++)
>> +        close(s->peers[posn].eventfds[i]);
>
> CODING_STYLE.
>
>> +
>> +    qemu_free(s->peers[posn].eventfds);
>> +    s->peers[posn].nb_eventfds = 0;
>> +}
>> +
>> +static void setup_ioeventfds(IVShmemState *s) {
>> +
>> +    int i, j;
>> +
>> +    for (i = 0; i <= s->max_peer; i++) {
>> +        for (j = 0; j < s->peers[i].nb_eventfds; j++) {
>> +            kvm_set_ioeventfd_mmio_long(s->peers[i].eventfds[j],
>> +                    s->mmio_addr + Doorbell, (i << 16) | j, 1);
>> +        }
>> +    }
>> +
>> +    /* setup irqfd for this VM's eventfds */
>> +    for (i = 0; i < s->vectors; i++) {
>> +        kvm_set_irqfd(s->dev.msix_irq_entries[i].gsi,
>> +                        s->peers[s->vm_id].eventfds[i], 1);
>
> kvm_set_irqfd() does not exist in HEAD.
>
>> +    }
>> +}
>> +
>> +
>> +/* this function increase the dynamic storage need to store data about other
>> + * guests */
>> +static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
>> +
>> +    int j, old_nb_alloc;
>> +
>> +    old_nb_alloc = s->nb_peers;
>> +
>> +    while (new_min_size >= s->nb_peers)
>> +        s->nb_peers = s->nb_peers * 2;
>> +
>> +    IVSHMEM_DPRINTF("bumping storage to %d guests\n", s->nb_peers);
>> +    s->peers = qemu_realloc(s->peers, s->nb_peers * sizeof(Peer));
>> +
>> +    if (s->peers == NULL) {
>> +        fprintf(stderr, "Allocation error - exiting\n");
>> +        exit(1);
>> +    }
>
> qemu_realloc will never return zero.
>
>> +
>> +    /* zero out new pointers */
>> +    for (j = old_nb_alloc; j < s->nb_peers; j++) {
>> +        s->peers[j].eventfds = NULL;
>> +        s->peers[j].nb_eventfds = 0;
>> +    }
>> +}
>> +
>> +static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
>> +{
>> +    IVShmemState *s = opaque;
>> +    int incoming_fd, tmp_fd;
>> +    int guest_curr_max;
>> +    long incoming_posn;
>> +
>> +    memcpy(&incoming_posn, buf, sizeof(long));
>> +    /* pick off s->server_chr->msgfd and store it, posn should accompany msg */
>> +    tmp_fd = qemu_chr_get_msgfd(s->server_chr);
>> +    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
>> +
>> +    /* make sure we have enough space for this guest */
>> +    if (incoming_posn >= s->nb_peers) {
>> +        increase_dynamic_storage(s, incoming_posn);
>> +    }
>> +
>> +    if (tmp_fd == -1) {
>> +        /* if posn is positive and unseen before then this is our posn*/
>> +        if ((incoming_posn >= 0) && (s->peers[incoming_posn].eventfds == NULL)) {
>> +            /* receive our posn */
>> +            s->vm_id = incoming_posn;
>> +            return;
>> +        } else {
>> +            /* otherwise an fd == -1 means an existing guest has gone away */
>> +            IVSHMEM_DPRINTF("posn %ld has gone away\n", incoming_posn);
>> +            close_guest_eventfds(s, incoming_posn);
>> +            return;
>> +        }
>> +    }
>> +
>> +    /* because of the implementation of get_msgfd, we need a dup */
>> +    incoming_fd = dup(tmp_fd);
>> +
>> +    if (incoming_fd == -1) {
>> +        fprintf(stderr, "could not allocate file descriptor %s\n",
>> +                                                            strerror(errno));
>> +        return;
>> +    }
>> +
>> +    /* if the position is -1, then it's shared memory region fd */
>> +    if (incoming_posn == -1) {
>> +
>> +        void * map_ptr;
>> +
>> +        s->max_peer = 0;
>> +
>> +        if (check_shm_size(s, incoming_fd) == -1) {
>> +            exit(-1);
>> +        }
>> +
>> +        /* mmap the region and map into the BAR2 */
>> +        map_ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED,
>> +                                                                incoming_fd, 0);
>> +        s->ivshmem_offset = qemu_ram_map(s->ivshmem_size, map_ptr);
>> +
>> +        IVSHMEM_DPRINTF("guest pci addr = %u, guest h/w addr = %u, size = %u\n",
>> +                        (uint32_t)s->shm_pci_addr, (uint32_t)s->ivshmem_offset,
>> +                        (uint32_t)s->ivshmem_size);
>> +
>> +        if (s->shm_pci_addr > 0) {
>> +            /* map memory into BAR2 */
>> +            cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
>> +                                                            s->ivshmem_offset);
>> +            if (s->role && strncmp(s->role, "peer", 4) == 0) {
>> +                IVSHMEM_DPRINTF("marking pages no migrate\n");
>> +                cpu_mark_pages_no_migrate(s->ivshmem_offset, s->ivshmem_size);
>> +            }
>> +
>> +        }
>> +
>> +        /* only store the fd if it is successfully mapped */
>> +        s->shm_fd = incoming_fd;
>> +
>> +        return;
>> +    }
>> +
>> +    /* each guest has an array of eventfds, and we keep track of how many
>> +     * guests for each VM */
>> +    guest_curr_max = s->peers[incoming_posn].nb_eventfds;
>> +    if (guest_curr_max == 0) {
>> +        /* one eventfd per MSI vector */
>> +        s->peers[incoming_posn].eventfds = (int *) qemu_malloc(s->vectors *
>> +                                                                sizeof(int));
>
> Useless cast in C.
>
>> +    }
>> +
>> +    /* this is an eventfd for a particular guest VM */
>> +    IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn, guest_curr_max,
>> +                                                                incoming_fd);
>> +    s->peers[incoming_posn].eventfds[guest_curr_max] = incoming_fd;
>> +
>> +    /* increment count for particular guest */
>> +    s->peers[incoming_posn].nb_eventfds++;
>> +
>> +    /* keep track of the maximum VM ID */
>> +    if (incoming_posn > s->max_peer) {
>> +        s->max_peer = incoming_posn;
>> +    }
>> +
>> +    if (incoming_posn == s->vm_id) {
>> +        int vector = guest_curr_max;
>
> Why add a new variable?

For clarity, so that when looking at the code it's clear that the current
maximum is an MSI vector.  Perhaps I'll rename guest_curr_max to achieve
this.

>
>> +        if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
>> +            /* initialize char device for callback
>> +             * if this is one of my eventfds */
>> +            s->eventfd_chr[vector] = create_eventfd_chr_device(s,
>> +                       s->peers[s->vm_id].eventfds[vector], vector);
>> +        }
>> +    }
>> +
>> +    return;
>> +}
>> +
>> +static void ivshmem_reset(DeviceState *d)
>> +{
>> +    return;
>
> Nothing to do?

Oversight, will fix.

>
>> +}
>> +
>> +static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
>> +                       pcibus_t addr, pcibus_t size, int type)
>> +{
>> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
>> +
>> +    s->mmio_addr = addr;
>> +    cpu_register_physical_memory(addr + 0, 0x400, s->ivshmem_mmio_io_addr);
>
> The 0x400 should be #defined earlier. Why 0x400 since you are only
> interested in the 0x100 first bytes?

We bumped it to 1 KB (0x400) for a previous version that had multiple
doorbells.  ioeventfd masks made multiple doorbells unnecessary.  I'll
put it back to 0x100.
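
For reference, the single Doorbell register carries both the destination and the vector in one 32-bit value. The sketch below mirrors the layout used by ivshmem_io_writel() and setup_ioeventfds() in the patch (destination VM ID in the upper 16 bits, vector in the low 8 bits); the helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helpers mirroring the patch's layout:
 * dest = val >> 16, vector = val & 0xff. */
static inline uint32_t doorbell_value(uint16_t dest, uint8_t vector)
{
    return ((uint32_t)dest << 16) | vector;
}

static inline uint16_t doorbell_dest(uint32_t val)
{
    return (uint16_t)(val >> 16);
}

static inline uint8_t doorbell_vector(uint32_t val)
{
    return (uint8_t)(val & 0xff);
}
```

Because each (dest, vector) pair maps to a distinct value, one register address with a per-eventfd match value is enough to select the target.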

>
>> +
>> +    /* ioeventfd and irqfd are enabled together,
>> +     * so the flag IRQFD refers to both */
>> +    if (ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
>> +        setup_ioeventfds(s);
>> +    }
>> +}
>> +
>> +static uint64_t ivshmem_get_size(IVShmemState * s) {
>> +
>> +    uint64_t value;
>> +    char *ptr;
>> +
>> +    value = strtoul(s->sizearg, &ptr, 10);
>
> I'd use strtoull() but the whole function should be suppressed, see below.
>
>> +    switch (*ptr) {
>> +        case 0: case 'M': case 'm':
>> +            value <<= 20;
>> +            break;
>> +        case 'G': case 'g':
>> +            value <<= 30;
>> +            break;
>> +        default:
>> +            fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg);
>> +            exit(1);
>> +    }
>> +
>> +    /* BARs must be a power of 2 */
>> +    if (!is_power_of_two(value)) {
>> +        fprintf(stderr, "ivshmem: size must be power of 2\n");
>> +        exit(1);
>> +    }
>
> Isn't the BAR check in pci.c enough?

This would produce a clearer error that it is the ivshmem device that
is the problem.
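
A standalone sketch of the size handling discussed here, under a few assumptions: `parse_shm_size` and `is_nonzero_power_of_two` are hypothetical names, the suffix rules copy ivshmem_get_size() (no suffix or M means megabytes, G means gigabytes), strtoull() is used as suggested above, and 0 is returned on error instead of calling exit(). Unlike is_power_of_two() in the patch, this check also rejects zero, since 0 & (0 - 1) is 0.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Zero is rejected explicitly; (x & (x - 1)) == 0 alone accepts it. */
static bool is_nonzero_power_of_two(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical parser: returns the size in bytes, or 0 if the argument
 * has an unknown suffix or is not a power of two. */
static uint64_t parse_shm_size(const char *arg)
{
    char *end;
    uint64_t value = strtoull(arg, &end, 10);

    switch (*end) {
    case '\0': case 'M': case 'm':
        value <<= 20;   /* megabytes */
        break;
    case 'G': case 'g':
        value <<= 30;   /* gigabytes */
        break;
    default:
        return 0;
    }

    return is_nonzero_power_of_two(value) ? value : 0;
}
```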

>
>> +
>> +    return value;
>> +
>> +}
>> +
>> +static void ivshmem_setup_msi(IVShmemState * s) {
>> +
>> +    int i;
>> +
>> +    /* allocate the MSI-X vectors */
>> +
>> +    if (!msix_init(&s->dev, s->vectors, 1, 0)) {
>> +        pci_register_bar(&s->dev, 1,
>> +                         msix_bar_size(&s->dev),
>> +                         PCI_BASE_ADDRESS_SPACE_MEMORY,
>> +                         msix_mmio_map);
>> +        IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
>> +    } else {
>> +        IVSHMEM_DPRINTF("msix initialization failed\n");
>
> Is this fatal considering the msix_vector_use() below?

Perhaps not, we could fall back to regular interrupts.  Could move the
msix_vector_use() into the "then" clause.

>
>> +    }
>> +
>> +    /* 'activate' the vectors */
>> +    for (i = 0; i < s->vectors; i++) {
>> +        msix_vector_use(&s->dev, i);
>> +    }
>> +
>> +    /* if IRQFDs are not supported, we'll have to trigger the interrupts
>> +     * via Qemu char devices */
>> +    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
>> +        /* for handling interrupts when IRQFD is not available */
>> +        s->eventfd_table = qemu_mallocz(s->vectors * sizeof(EventfdEntry));
>> +    }
>> +}
>> +
>> +static void ivshmem_save(QEMUFile* f, void *opaque)
>> +{
>> +    IVShmemState *proxy = opaque;
>> +
>> +    IVSHMEM_DPRINTF("ivshmem_save\n");
>> +    pci_device_save(&proxy->dev, f);
>> +
>> +    if (ivshmem_has_feature(proxy, IVSHMEM_MSI)) {
>> +        msix_save(&proxy->dev, f);
>> +    } else {
>> +        qemu_put_be32(f, proxy->intrstatus);
>> +        qemu_put_be32(f, proxy->intrmask);
>> +    }
>> +
>> +}
>
> There may be VMState magic to handle conditional structures (or just
> make the structures unconditional), so VMState should be used instead.

MSI-X requires the use of the old-style savevm/loadvm functions.
Should VMState and the load/save be used together?

>
>> +
>> +static int ivshmem_load(QEMUFile* f, void *opaque, int version_id)
>> +{
>> +    IVSHMEM_DPRINTF("ivshmem_load\n");
>> +
>> +    IVShmemState *proxy = opaque;
>> +    int ret, i;
>> +
>
> Missing version check.
>
>> +    ret = pci_device_load(&proxy->dev, f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    if (ivshmem_has_feature(proxy, IVSHMEM_MSI)) {
>> +        msix_load(&proxy->dev, f);
>> +        for (i = 0; i < proxy->vectors; i++) {
>> +            msix_vector_use(&proxy->dev, i);
>> +        }
>> +    } else {
>> +        proxy->intrstatus = qemu_get_be32(f);
>> +        proxy->intrmask = qemu_get_be32(f);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int pci_ivshmem_init(PCIDevice *dev)
>> +{
>> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
>> +    uint8_t *pci_conf;
>> +
>> +    if (s->sizearg == NULL)
>> +        s->ivshmem_size = 4 << 20; /* 4 MB default */
>> +    else {
>> +        s->ivshmem_size = ivshmem_get_size(s);
>> +    }
>> +
>> +    register_savevm("ivshmem", 0, 0, ivshmem_save, ivshmem_load, dev);
>> +
>> +    /* IRQFD requires MSI */
>> +    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) &&
>> +        !ivshmem_has_feature(s, IVSHMEM_MSI)) {
>> +        fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n");
>> +        exit(1);
>> +    }
>> +
>> +    /* check that role is reasonable */
>> +    if (s->role && !((strncmp(s->role, "peer", 5) == 0) ||
>> +                        (strncmp(s->role, "master", 7) == 0))) {
>> +        fprintf(stderr, "ivshmem: 'role' must be 'peer' or 'master'\n");
>> +        exit(1);
>> +    }
>
> I'd add a scalar flag in IVShmemState for role so that further strcmps
> are avoided.
>
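A rough sketch of what such a scalar role flag could look like, with the string parsed once at init time. The names IVShmemRole and parse_role() are hypothetical and not part of the patch; treating an unset role like master is an assumption based on the patch only special-casing "peer".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scalar replacement for the 'role' string. */
typedef enum {
    IVSHMEM_ROLE_MASTER,
    IVSHMEM_ROLE_PEER,
} IVShmemRole;

/* Parse the role once; returns 0 on success, -1 on an unknown role.
 * Assumption: a missing role behaves like "master", since the patch
 * only marks memory as non-migratable for "peer". */
static int parse_role(const char *role, IVShmemRole *out)
{
    assert(out != NULL);

    if (role == NULL || strcmp(role, "master") == 0) {
        *out = IVSHMEM_ROLE_MASTER;
        return 0;
    }
    if (strcmp(role, "peer") == 0) {
        *out = IVSHMEM_ROLE_PEER;
        return 0;
    }
    return -1;
}
```

Later checks then become a plain comparison against IVSHMEM_ROLE_PEER instead of repeated strncmp() calls.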
>> +
>> +    pci_conf = s->dev.config;
>> +    pci_conf[0x00] = 0xf4; /* Qumranet vendor ID 0x5002 */
>> +    pci_conf[0x01] = 0x1a;
>> +    pci_conf[0x02] = 0x10;
>> +    pci_conf[0x03] = 0x11;
>
> Please add the DID to hw/pci_ids.h and use pci_config_set_xyz() here.
>
>> +    pci_conf[0x04] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
>> +    pci_conf[0x0a] = 0x00; /* RAM controller */
>> +    pci_conf[0x0b] = 0x05;
>> +    pci_conf[0x0e] = 0x00; /* header_type */
>> +
>> +    pci_conf[PCI_INTERRUPT_PIN] = 1;
>> +
>> +    s->shm_pci_addr = 0;
>> +    s->ivshmem_offset = 0;
>> +    s->shm_fd = 0;
>> +
>> +    s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read,
>> +                                    ivshmem_mmio_write, s);
>> +    /* region for registers*/
>> +    pci_register_bar(&s->dev, 0, 0x400,
>> +                           PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_mmio_map);
>> +
>> +    if ((s->server_chr != NULL) &&
>> +                        (strncmp(s->server_chr->filename, "unix:", 5) == 0)) {
>> +        /* if we get a UNIX socket as the parameter we will talk
>> +         * to the ivshmem server to receive the memory region */
>> +
>> +        IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
>> +                                                    s->server_chr->filename);
>> +
>> +        if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
>> +            ivshmem_setup_msi(s);
>> +        }
>> +
>> +        /* we allocate enough space for 16 guests and grow as needed */
>> +        s->nb_peers = 16;
>> +        s->vm_id = -1;
>> +
>> +        /* allocate/initialize space for interrupt handling */
>> +        s->peers = qemu_mallocz(s->nb_peers * sizeof(Peer));
>> +
>> +        pci_register_bar(&s->dev, 2, s->ivshmem_size,
>> +                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
>> +
>> +        s->eventfd_chr = (CharDriverState **) qemu_mallocz(s->vectors *
>> +                                                sizeof(CharDriverState *));
>
> Useless cast in C.
>
>> +
>> +        qemu_chr_add_handlers(s->server_chr, ivshmem_can_receive, ivshmem_read,
>> +                     ivshmem_event, s);
>> +    } else {
>> +        /* just map the file immediately, we're not using a server */
>> +        int fd;
>> +
>> +        if (s->shmobj == NULL) {
>> +            fprintf(stderr, "Must specify 'chardev' or 'shm' to ivshmem\n");
>> +        }
>
> I'd rather have separate 'chardev' and 'shm_file' parameters. Then
> 'info qtree' could return more useful information about the chardev.

Can you elaborate?  The command line currently must be one of the following:

server case:
-ivshmem chardev=x,...
-chardev id=x,...

or

non-server case:
-ivshmem shm=<name>,...

>
>> +
>> +        IVSHMEM_DPRINTF("using shm_open (shm object = %s)\n", s->shmobj);
>> +
>> +        /* try opening with O_EXCL and if it succeeds zero the memory
>> +         * by truncating to 0 */
>> +        if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR|O_EXCL,
>> +                        S_IRWXU|S_IRWXG|S_IRWXO)) > 0) {
>> +           /* truncate file to length PCI device's memory */
>> +            if (ftruncate(fd, s->ivshmem_size) != 0) {
>> +                fprintf(stderr, "kvm_ivshmem: could not truncate shared file\n");
>
> Why 'kvm_ivshmem'?

Old name, will remove.

>
>> +            }
>> +
>> +        } else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR,
>> +                        S_IRWXU|S_IRWXG|S_IRWXO)) < 0) {
>> +            fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
>> +            exit(-1);
>> +
>> +        }
>> +
>> +        if (check_shm_size(s, fd) == -1) {
>> +            exit(-1);
>> +        }
>> +
>> +        create_shared_memory_BAR(s, fd);
>> +
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int pci_ivshmem_uninit(PCIDevice *dev)
>> +{
>> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
>> +
>> +    cpu_unregister_io_memory(s->ivshmem_mmio_io_addr);
>> +
>> +    return 0;
>> +}
>> +
>> +static PCIDeviceInfo ivshmem_info = {
>> +    .qdev.name  = "ivshmem",
>> +    .qdev.size  = sizeof(IVShmemState),
>> +    .qdev.reset = ivshmem_reset,
>> +    .init       = pci_ivshmem_init,
>> +    .exit       = pci_ivshmem_uninit,
>> +    .qdev.props = (Property[]) {
>> +        DEFINE_PROP_CHR("chardev", IVShmemState, server_chr),
>> +        DEFINE_PROP_STRING("size", IVShmemState, sizearg),
>
> This should be scalar type, not string.

but it needs to handle the 'm' and 'g' suffixes for memory sizes.
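To illustrate the suffix handling meant here, a minimal stand-alone sketch follows. parse_size_arg() is a hypothetical helper, not the patch's ivshmem_get_size(); treating a bare number as megabytes is an assumption borrowed from how -m is usually interpreted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse "<number>[m|M|g|G]" into a byte count.
 * Returns 0 on success, -1 on malformed input or overflow. */
static int parse_size_arg(const char *arg, uint64_t *out)
{
    char *end;
    unsigned long long value;
    unsigned int shift;

    assert(arg != NULL && out != NULL);

    /* reject empty strings, signs and leading whitespace */
    if (*arg < '0' || *arg > '9') {
        return -1;
    }

    errno = 0;
    value = strtoull(arg, &end, 10);
    if (errno != 0) {
        return -1;
    }

    switch (*end) {
    case '\0':  /* assumed: a bare number means megabytes, as with -m */
    case 'm':
    case 'M':
        shift = 20;
        break;
    case 'g':
    case 'G':
        shift = 30;
        break;
    default:
        return -1;
    }

    /* nothing may follow the suffix */
    if (*end != '\0' && end[1] != '\0') {
        return -1;
    }

    /* refuse values that would not fit after scaling */
    if (value > (UINT64_MAX >> shift)) {
        return -1;
    }

    *out = (uint64_t)value << shift;
    return 0;
}
```

A string property plus a helper like this keeps the 'm'/'g' forms working; a scalar property would need the suffix parsing to happen somewhere else.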

>
>> +        DEFINE_PROP_UINT32("vectors", IVShmemState, vectors, 1),
>> +        DEFINE_PROP_BIT("irqfd", IVShmemState, features, IVSHMEM_IRQFD, false),
>> +        DEFINE_PROP_BIT("msi", IVShmemState, features, IVSHMEM_MSI, true),
>> +        DEFINE_PROP_STRING("shm", IVShmemState, shmobj),
>> +        DEFINE_PROP_STRING("role", IVShmemState, role),
>> +        DEFINE_PROP_END_OF_LIST(),
>> +    }
>> +};
>> +
>> +static void ivshmem_register_devices(void)
>> +{
>> +    pci_qdev_register(&ivshmem_info);
>> +}
>> +
>> +device_init(ivshmem_register_devices)
>> diff --git a/qemu-char.c b/qemu-char.c
>> index ac65a1c..b2e50d0 100644
>> --- a/qemu-char.c
>> +++ b/qemu-char.c
>> @@ -2093,6 +2093,12 @@ static void tcp_chr_read(void *opaque)
>>     }
>>  }
>>
>> +CharDriverState *qemu_chr_open_eventfd(int eventfd){
>> +
>> +    return qemu_chr_open_fd(eventfd, eventfd);
>> +
>> +}
>> +
>>  static void tcp_chr_connect(void *opaque)
>>  {
>>     CharDriverState *chr = opaque;
>> diff --git a/qemu-char.h b/qemu-char.h
>> index e3a0783..6ea01ba 100644
>> --- a/qemu-char.h
>> +++ b/qemu-char.h
>> @@ -94,6 +94,9 @@ void qemu_chr_info_print(Monitor *mon, const QObject *ret_data);
>>  void qemu_chr_info(Monitor *mon, QObject **ret_data);
>>  CharDriverState *qemu_chr_find(const char *name);
>>
>> +/* add an eventfd to the qemu devices that are polled */
>> +CharDriverState *qemu_chr_open_eventfd(int eventfd);
>
> Maybe this should be removed and just open coded with qemu_chr_open_fd.

I'm indifferent; it just looked a little funny passing the parameter twice.

>
>> +
>>  extern int term_escape_char;
>>
>>  /* async I/O support */
>> diff --git a/qemu-doc.texi b/qemu-doc.texi
>> index 6647b7b..24f8748 100644
>> --- a/qemu-doc.texi
>> +++ b/qemu-doc.texi
>> @@ -706,6 +706,49 @@ Using the @option{-net socket} option, it is possible to make VLANs
>>  that span several QEMU instances. See @ref{sec_invocation} to have a
>>  basic example.
>>
>> +@section Other Devices
>> +
>> +@subsection Inter-VM Shared Memory device
>> +
>> +With KVM enabled on a Linux host, a shared memory device is available.  Guests
>> +map a POSIX shared memory region into the guest as a PCI device that enables
>> +zero-copy communication to the application level of the guests.  The basic
>> +syntax is:
>> +
>> +@example
>> +qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>> +@end example
>> +
>> +If desired, interrupts can be sent between guest VMs accessing the same shared
>> +memory region.  Interrupt support requires using a shared memory server and
>> +using a chardev socket to connect to it.  The code for the shared memory server
>> +is qemu.git/contrib/ivshmem-server.  An example syntax when using the shared
>> +memory server is:
>> +
>> +@example
>> +qemu -device ivshmem,size=<size in format accepted by -m>[,chardev=<id>]
>> +                        [,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
>> +qemu -chardev socket,path=<path>,id=<id>
>> +@end example
>> +
>> +When using the server, the guest will be assigned a VM ID (>=0) that allows guests
>> +using the same server to communicate via interrupts.  Guests can read their
>> +VM ID from a device register (see example code).  Since receiving the shared
>> +memory region from the server is asynchronous, there is a (small) chance the
>> +guest may boot before the shared memory is attached.  To allow an application
>> +to ensure shared memory is attached, the VM ID register will return -1 (an
>> +invalid VM ID) until the memory is attached.  Once the shared memory is
>> +attached, the VM ID will return the guest's valid VM ID.  With these semantics,
>> +the guest application can check to ensure the shared memory is attached to the
>> +guest before proceeding.
>> +
>> +The @option{role} argument can be set to either master or peer and will affect
>> +how the shared memory is migrated.  With @option{role=master}, the guest will
>> +copy the shared memory on migration to the destination host.  With
>> +@option{role=peer}, the shared memory will not be copied on migration.  Only
>> +one guest should be specified as
>> +the master.
>> +
>>  @node direct_linux_boot
>>  @section Direct Linux Boot
>>
>> --
>> 1.6.3.2.198.g6096d
>>
>>
>>
>
>


* Re: [Qemu-devel] [PATCH v6 5/6] Inter-VM shared memory PCI device
  2010-06-07 16:41             ` Cam Macdonell
@ 2010-06-09 20:12               ` Blue Swirl
  0 siblings, 0 replies; 42+ messages in thread
From: Blue Swirl @ 2010-06-09 20:12 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: qemu-devel, kvm

On Mon, Jun 7, 2010 at 4:41 PM, Cam Macdonell <cam@cs.ualberta.ca> wrote:
> On Sat, Jun 5, 2010 at 3:44 AM, Blue Swirl <blauwirbel@gmail.com> wrote:
>> On Fri, Jun 4, 2010 at 9:45 PM, Cam Macdonell <cam@cs.ualberta.ca> wrote:
>>> Support an inter-vm shared memory device that maps a shared-memory object as a
>>> PCI device in the guest.  This patch also supports interrupts between guests by
>>> communicating over a unix domain socket.  This patch applies to the qemu-kvm
>>> repository.
>>>
>>>    -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>>
>>> Interrupts are supported between multiple VMs by using a shared memory server
>>> that is reached through a chardev socket.
>>>
>>>    -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>>           [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
>>>    -chardev socket,path=<path>,id=<id>
>>>
>>> (shared memory server is qemu.git/contrib/ivshmem-server)
>>>
>>> Sample programs and init scripts are in a git repo here:
>>>
>>>    www.gitorious.org/nahanni
>>> ---
>>>  Makefile.target |    3 +
>>>  hw/ivshmem.c    |  852 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  qemu-char.c     |    6 +
>>>  qemu-char.h     |    3 +
>>>  qemu-doc.texi   |   43 +++
>>>  5 files changed, 907 insertions(+), 0 deletions(-)
>>>  create mode 100644 hw/ivshmem.c
>>>
>>> diff --git a/Makefile.target b/Makefile.target
>>> index c4ba592..4888308 100644
>>> --- a/Makefile.target
>>> +++ b/Makefile.target
>>> @@ -202,6 +202,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
>>>  obj-y += rtl8139.o
>>>  obj-y += e1000.o
>>>
>>> +# Inter-VM PCI shared memory
>>> +obj-y += ivshmem.o
>>> +
>>
>> Can this be compiled once, simply by moving this to Makefile.objs
>> instead of Makefile.target? Also, because the code seems to be KVM
>> specific, it can't be compiled unconditionally but depending on at
>> least CONFIG_KVM and maybe CONFIG_EVENTFD.
>
> The device uses eventfds for signalling, but never creates the
> eventfds itself as they are passed from a server using SCM_RIGHTS.
> So, it does not depend on the eventfd API.  Its dependence on
> irqfd/ioeventfd (the only KVM-specific bits) is optional via the
> command line.
>
> A runtime check for KVM is done for the reasons Avi mentioned.
>
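For readers unfamiliar with the mechanism described above, here is a minimal sketch of passing a file descriptor over a UNIX domain socket with SCM_RIGHTS. The helpers send_fd() and recv_fd() are hypothetical names, and the single dummy payload byte is an assumption made for brevity (the patch's server sends a position value alongside the descriptor).

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send 'fd' over the UNIX socket 'sock' as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
    char byte = 0;                      /* at least one payload byte is sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;           /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver ends up with its own descriptor referring to the same open file, which is why the device can signal peers without ever creating an eventfd itself.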
>>
>> Why is this KVM specific BTW, Posix SHM is available on many
>> platforms? What would happen if kvm_set_foobar functions were not
>> called when KVM is not being used? Is host eventfd support essential?
>>
>>>  # Hardware support
>>>  obj-i386-y += vga.o
>>>  obj-i386-y += mc146818rtc.o i8259.o pc.o
>>> diff --git a/hw/ivshmem.c b/hw/ivshmem.c
>>> new file mode 100644
>>> index 0000000..9057612
>>> --- /dev/null
>>> +++ b/hw/ivshmem.c
>>> @@ -0,0 +1,852 @@
>>> +/*
>>> + * Inter-VM Shared Memory PCI device.
>>> + *
>>> + * Author:
>>> + *      Cam Macdonell <cam@cs.ualberta.ca>
>>> + *
>>> + * Based On: cirrus_vga.c
>>> + *          Copyright (c) 2004 Fabrice Bellard
>>> + *          Copyright (c) 2004 Makoto Suzuki (suzu)
>>> + *
>>> + *      and rtl8139.c
>>> + *          Copyright (c) 2006 Igor Kovalenko
>>> + *
>>> + * This code is licensed under the GNU GPL v2.
>>> + */
>>> +#include <sys/mman.h>
>>> +#include <sys/types.h>
>>> +#include <sys/socket.h>
>>> +#include <sys/io.h>
>>> +#include <sys/ioctl.h>
>>> +#include "hw.h"
>>> +#include "console.h"
>>> +#include "pc.h"
>>> +#include "pci.h"
>>> +#include "sysemu.h"
>>> +
>>> +#include "msix.h"
>>> +#include "qemu-kvm.h"
>>> +#include "libkvm.h"
>>> +
>>> +#include <sys/eventfd.h>
>>> +#include <sys/mman.h>
>>> +#include <sys/socket.h>
>>> +#include <sys/ioctl.h>
>>> +
>>> +#define IVSHMEM_IRQFD   0
>>> +#define IVSHMEM_MSI     1
>>> +
>>> +//#define DEBUG_IVSHMEM
>>> +#ifdef DEBUG_IVSHMEM
>>> +#define IVSHMEM_DPRINTF(fmt, args...)        \
>>> +    do {printf("IVSHMEM: " fmt, ##args); } while (0)
>>
>> Please use __VA_ARGS__.
>>
>>> +#else
>>> +#define IVSHMEM_DPRINTF(fmt, args...)
>>> +#endif
>>> +
>>> +typedef struct Peer {
>>> +    int nb_eventfds;
>>> +    int *eventfds;
>>> +} Peer;
>>> +
>>> +typedef struct EventfdEntry {
>>> +    PCIDevice *pdev;
>>> +    int vector;
>>> +} EventfdEntry;
>>> +
>>> +typedef struct IVShmemState {
>>> +    PCIDevice dev;
>>> +    uint32_t intrmask;
>>> +    uint32_t intrstatus;
>>> +    uint32_t doorbell;
>>> +
>>> +    CharDriverState ** eventfd_chr;
>>
>> I'd remove the space between '**' and 'eventfd_chr', it's used inconsistently.
>>
>>> +    CharDriverState * server_chr;
>>> +    int ivshmem_mmio_io_addr;
>>> +
>>> +    pcibus_t mmio_addr;
>>> +    pcibus_t shm_pci_addr;
>>> +    uint64_t ivshmem_offset;
>>> +    uint64_t ivshmem_size; /* size of shared memory region */
>>> +    int shm_fd; /* shared memory file descriptor */
>>> +
>>> +    Peer *peers;
>>> +    int nb_peers; /* how many guests we have space for */
>>> +    int max_peer; /* maximum numbered peer */
>>> +
>>> +    int vm_id;
>>> +    uint32_t vectors;
>>> +    uint32_t features;
>>> +    EventfdEntry *eventfd_table;
>>> +
>>> +    char * shmobj;
>>> +    char * sizearg;
>>> +    char * role;
>>> +} IVShmemState;
>>> +
>>> +/* registers for the Inter-VM shared memory device */
>>> +enum ivshmem_registers {
>>> +    IntrMask = 0,
>>> +    IntrStatus = 4,
>>> +    IVPosition = 8,
>>> +    Doorbell = 12,
>>> +};
>>
>> IIRC these should be uppercase.
>
> I worked from rtl8139, which doesn't have them uppercase.  But doing a
> quick search, I can see most devices do, so I'll fix that.
>
>>
>>> +
>>> +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
>>> +    return (ivs->features & (1 << feature));
>>> +}
>>
>> Since this is the first version, do we need any features at this
>> point, can't we expect that all features are available now? Why does
>> the user need to specify the features?
>
> Some features require host support such as irqfd/ioeventfds.  So
> having features allows them to be disabled on the command-line (e.g.,
> irqfd=off).
>
>>
>> To avoid a negative shift, I'd make 'feature' unsigned.
>>
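One possible shape for the unsigned variant, as a stand-alone sketch. It takes the feature mask directly rather than an IVShmemState pointer so that it compiles on its own; has_feature() is a hypothetical name and the range assert is an addition not present in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IVSHMEM_IRQFD   0u
#define IVSHMEM_MSI     1u

/* Test one feature bit.  An unsigned index cannot be negative, and the
 * assert (an addition for illustration) guards against shifting by 32
 * or more, which would be undefined for a 32-bit operand. */
static bool has_feature(uint32_t features, unsigned int feature)
{
    assert(feature < 32);
    return (features & (UINT32_C(1) << feature)) != 0;
}
```

Returning bool also avoids callers depending on the exact bit value that the original uint32_t return type exposes.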
>>> +
>>> +static inline bool is_power_of_two(uint64_t x) {
>>> +    return (x & (x - 1)) == 0;
>>> +}
>>> +
>>> +static void ivshmem_map(PCIDevice *pci_dev, int region_num,
>>> +                    pcibus_t addr, pcibus_t size, int type)
>>> +{
>>> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
>>> +
>>> +    s->shm_pci_addr = addr;
>>> +
>>> +    if (s->ivshmem_offset > 0) {
>>> +        cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
>>> +                                                            s->ivshmem_offset);
>>> +        if (s->role && strncmp(s->role, "peer", 4) == 0) {
>>> +            IVSHMEM_DPRINTF("marking pages no migrate\n");
>>> +            cpu_mark_pages_no_migrate(s->ivshmem_offset, s->ivshmem_size);
>>> +        }
>>> +    }
>>> +
>>> +    IVSHMEM_DPRINTF("guest pci addr = %u, guest h/w addr = %u, size = %u\n",
>>> +                (uint32_t)addr, (uint32_t)s->ivshmem_offset, (uint32_t)size);
>>
>> Please use FMT_PCIBUS for addr and size and PRIu64 for s->ivshmem_offset.
>>
>>> +
>>> +}
>>> +
>>> +/* accessing registers - based on rtl8139 */
>>> +static void ivshmem_update_irq(IVShmemState *s, int val)
>>> +{
>>> +    int isr;
>>> +    isr = (s->intrstatus & s->intrmask) & 0xffffffff;
>>> +
>>> +    /* don't print ISR resets */
>>> +    if (isr) {
>>> +        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
>>> +           isr ? 1 : 0, s->intrstatus, s->intrmask);
>>> +    }
>>> +
>>> +    qemu_set_irq(s->dev.irq[0], (isr != 0));
>>> +}
>>> +
>>> +static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
>>> +{
>>> +    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
>>> +
>>> +    s->intrmask = val;
>>> +
>>> +    ivshmem_update_irq(s, val);
>>> +}
>>> +
>>> +static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
>>> +{
>>> +    uint32_t ret = s->intrmask;
>>> +
>>> +    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
>>> +{
>>> +    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
>>> +
>>> +    s->intrstatus = val;
>>> +
>>> +    ivshmem_update_irq(s, val);
>>> +    return;
>>> +}
>>> +
>>> +static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
>>> +{
>>> +    uint32_t ret = s->intrstatus;
>>> +
>>> +    /* reading ISR clears all interrupts */
>>> +    s->intrstatus = 0;
>>> +
>>> +    ivshmem_update_irq(s, 0);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val)
>>> +{
>>> +
>>> +    IVSHMEM_DPRINTF("We shouldn't be writing words\n");
>>> +}
>>> +
>>> +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
>>> +{
>>> +    IVShmemState *s = opaque;
>>> +
>>> +    u_int64_t write_one = 1;
>>
>> Please use uintNN_t instead of u_intNN_t.
>>
>>> +    u_int16_t dest = val >> 16;
>>> +    u_int16_t vector = val & 0xff;
>>> +
>>> +    addr &= 0xfc;
>>
>> I'd add a debug printf here, likewise for exit of ivshmem_io_readl().
>> When you do the merge (see below), the correct printf format for the
>> addresses will be TARGET_FMT_plx.
>>
>>> +
>>> +    switch (addr)
>>> +    {
>>> +        case IntrMask:
>>> +            ivshmem_IntrMask_write(s, val);
>>> +            break;
>>> +
>>> +        case IntrStatus:
>>> +            ivshmem_IntrStatus_write(s, val);
>>> +            break;
>>> +
>>> +        case Doorbell:
>>> +            /* check that dest VM ID is reasonable */
>>> +            if ((dest < 0) || (dest > s->max_peer)) {
>>> +                IVSHMEM_DPRINTF("Invalid destination VM ID (%d)\n", dest);
>>> +                break;
>>> +            }
>>> +
>>> +            /* check doorbell range */
>>> +            if ((vector >= 0) && (vector < s->peers[dest].nb_eventfds)) {
>>> +                IVSHMEM_DPRINTF("Writing %ld to VM %d on vector %d\n",
>>> +                                                    write_one, dest, vector);
>>
>> PRId64 for write_one, %ld is not enough on ILP32.
>>
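As an illustration of the format macro point: the <inttypes.h> macros expand to the right conversion for the fixed-width type on both ILP32 and LP64 hosts. The sketch below uses PRIu64 since write_one is unsigned; format_doorbell_msg() is a hypothetical helper that writes into a buffer only so the result can be inspected.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format the doorbell debug message portably.  PRIu64 matches uint64_t
 * regardless of whether 'long' is 32 or 64 bits wide. */
static int format_doorbell_msg(char *buf, size_t len, uint64_t value,
                               uint16_t dest, uint16_t vector)
{
    assert(buf != NULL);
    return snprintf(buf, len, "Writing %" PRIu64 " to VM %u on vector %u\n",
                    value, (unsigned int)dest, (unsigned int)vector);
}
```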
>>> +                if (write(s->peers[dest].eventfds[vector],
>>> +                                                    &(write_one), 8) != 8) {
>>> +                    IVSHMEM_DPRINTF("error writing to eventfd\n");
>>> +                }
>>> +            }
>>> +            break;
>>> +        default:
>>> +            IVSHMEM_DPRINTF("Invalid VM Doorbell VM %d\n", dest);
>>> +    }
>>> +}
>>> +
>>> +static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val)
>>> +{
>>> +    IVSHMEM_DPRINTF("We shouldn't be writing bytes\n");
>>> +}
>>> +
>>> +static uint32_t ivshmem_io_readw(void *opaque, uint8_t addr)
>>> +{
>>> +
>>> +    IVSHMEM_DPRINTF("We shouldn't be reading words\n");
>>> +    return 0;
>>> +}
>>> +
>>> +static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr)
>>> +{
>>> +
>>> +    IVShmemState *s = opaque;
>>> +    uint32_t ret;
>>> +
>>> +    switch (addr)
>>> +    {
>>> +        case IntrMask:
>>> +            ret = ivshmem_IntrMask_read(s);
>>> +            break;
>>> +
>>> +        case IntrStatus:
>>> +            ret = ivshmem_IntrStatus_read(s);
>>> +            break;
>>> +
>>> +        case IVPosition:
>>> +            /* return my VM ID if the memory is mapped */
>>> +            if (s->shm_fd > 0) {
>>> +                ret = s->vm_id;
>>> +            } else {
>>> +                ret = -1;
>>> +            }
>>> +            break;
>>> +
>>> +        default:
>>> +            IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr);
>>> +            ret = 0;
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr)
>>> +{
>>> +    IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static void ivshmem_mmio_writeb(void *opaque,
>>> +                                target_phys_addr_t addr, uint32_t val)
>>> +{
>>> +    ivshmem_io_writeb(opaque, addr & 0xFF, val);
>>> +}
>>
>> This function and others below only performs a cast and useless
>> masking (the address passed is these days an offset from start of the
>> area). Please merge these to ivshmem_io_readl() etc.
>
> Ok, these are artifacts from basing my patch on rtl8139.c.  I'll remove them.
>
>>
>>> +
>>> +static void ivshmem_mmio_writew(void *opaque,
>>> +                                target_phys_addr_t addr, uint32_t val)
>>> +{
>>> +    ivshmem_io_writew(opaque, addr & 0xFF, val);
>>> +}
>>> +
>>> +static void ivshmem_mmio_writel(void *opaque,
>>> +                                target_phys_addr_t addr, uint32_t val)
>>> +{
>>> +    ivshmem_io_writel(opaque, addr & 0xFF, val);
>>> +}
>>> +
>>> +static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr)
>>> +{
>>> +    return ivshmem_io_readb(opaque, addr & 0xFF);
>>> +}
>>> +
>>> +static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr)
>>> +{
>>> +    uint32_t val = ivshmem_io_readw(opaque, addr & 0xFF);
>>> +    return val;
>>> +}
>>> +
>>> +static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr)
>>> +{
>>> +    uint32_t val = ivshmem_io_readl(opaque, addr & 0xFF);
>>> +    return val;
>>> +}
>>> +
>>> +static CPUReadMemoryFunc *ivshmem_mmio_read[3] = {
>>
>> Please add 'const'.
>>
>>> +    ivshmem_mmio_readb,
>>> +    ivshmem_mmio_readw,
>>> +    ivshmem_mmio_readl,
>>> +};
>>> +
>>> +static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = {
>>> +    ivshmem_mmio_writeb,
>>> +    ivshmem_mmio_writew,
>>> +    ivshmem_mmio_writel,
>>> +};
>>> +
>>> +static void ivshmem_receive(void *opaque, const uint8_t *buf, int size)
>>> +{
>>> +    IVShmemState *s = opaque;
>>> +
>>> +    ivshmem_IntrStatus_write(s, *buf);
>>> +
>>> +    IVSHMEM_DPRINTF("ivshmem_receive 0x%02x\n", *buf);
>>> +}
>>> +
>>> +static int ivshmem_can_receive(void * opaque)
>>> +{
>>> +    return 8;
>>> +}
>>> +
>>> +static void ivshmem_event(void *opaque, int event)
>>> +{
>>> +    IVSHMEM_DPRINTF("ivshmem_event %d\n", event);
>>> +}
>>> +
>>> +static void fake_irqfd(void *opaque, const uint8_t *buf, int size) {
>>> +
>>> +    EventfdEntry *entry = opaque;
>>> +    PCIDevice *pdev = entry->pdev;
>>> +
>>> +    IVSHMEM_DPRINTF("fake irqfd on vector %p %d\n", pdev, entry->vector);
>>> +    msix_notify(pdev, entry->vector);
>>> +}
>>> +
>>> +static CharDriverState* create_eventfd_chr_device(void * opaque, int eventfd,
>>> +                                                                    int vector)
>>> +{
>>> +    /* create a event character device based on the passed eventfd */
>>> +    IVShmemState *s = opaque;
>>> +    CharDriverState * chr;
>>> +
>>> +    chr = qemu_chr_open_eventfd(eventfd);
>>> +
>>> +    if (chr == NULL) {
>>> +        IVSHMEM_DPRINTF("creating eventfd for eventfd %d failed\n", eventfd);
>>
>> This should not be a DPRINTF.
>>
>>> +        exit(-1);
>>> +    }
>>> +
>>> +    /* if MSI is supported we need multiple interrupts */
>>> +    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
>>> +        s->eventfd_table[vector].pdev = &s->dev;
>>> +        s->eventfd_table[vector].vector = vector;
>>> +
>>> +        qemu_chr_add_handlers(chr, ivshmem_can_receive, fake_irqfd,
>>> +                      ivshmem_event, &s->eventfd_table[vector]);
>>> +    } else {
>>> +        qemu_chr_add_handlers(chr, ivshmem_can_receive, ivshmem_receive,
>>> +                      ivshmem_event, s);
>>> +    }
>>> +
>>> +    return chr;
>>> +
>>> +}
>>> +
>>> +static int check_shm_size(IVShmemState *s, int fd) {
>>> +    /* check that the guest isn't going to try and map more memory than the
>>> +     * the object has allocated return -1 to indicate error */
>>> +
>>> +    struct stat buf;
>>> +
>>> +    fstat(fd, &buf);
>>> +
>>> +    if (s->ivshmem_size > buf.st_size) {
>>> +        fprintf(stderr, "IVSHMEM ERROR: Requested memory size greater");
>>> +        fprintf(stderr, " than shared object size (%ld > %ld)\n",
>>> +                                          s->ivshmem_size, buf.st_size);
>>
>> Please use PRIx64 for s->ivshmem_size, this will cause a warning on ILP32.
>>
>>> +        return -1;
>>> +    } else {
>>> +        return 0;
>>> +    }
>>> +}
>>> +
>>> +/* create the shared memory BAR when we are not using the server, so we can
>>> + * create the BAR and map the memory immediately */
>>> +static void create_shared_memory_BAR(IVShmemState *s, int fd) {
>>> +
>>> +    void * ptr;
>>> +
>>> +    s->shm_fd = fd;
>>> +
>>> +    ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>>> +
>>> +    s->ivshmem_offset = qemu_ram_map(s->ivshmem_size, ptr);
>>
>> qemu_ram_map() does not exist in HEAD.
>>
>
> Ok, so qemu_ram_map() and kvm_set_irq() are in the KVM HEAD.  I had my
> own version of both functions, but removed them when Marcelo's were
> merged into KVM.
>
>>> +
>>> +    /* region for shared memory */
>>> +    pci_register_bar(&s->dev, 2, s->ivshmem_size,
>>> +                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
>>> +}
>>> +
>>> +static void close_guest_eventfds(IVShmemState *s, int posn)
>>> +{
>>> +    int i, guest_curr_max;
>>> +
>>> +    guest_curr_max = s->peers[posn].nb_eventfds;
>>> +
>>> +    for (i = 0; i < guest_curr_max; i++)
>>> +        close(s->peers[posn].eventfds[i]);
>>
>> CODING_STYLE.
>>
>>> +
>>> +    qemu_free(s->peers[posn].eventfds);
>>> +    s->peers[posn].nb_eventfds = 0;
>>> +}
>>> +
>>> +static void setup_ioeventfds(IVShmemState *s) {
>>> +
>>> +    int i, j;
>>> +
>>> +    for (i = 0; i <= s->max_peer; i++) {
>>> +        for (j = 0; j < s->peers[i].nb_eventfds; j++) {
>>> +            kvm_set_ioeventfd_mmio_long(s->peers[i].eventfds[j],
>>> +                    s->mmio_addr + Doorbell, (i << 16) | j, 1);
>>> +        }
>>> +    }
>>> +
>>> +    /* setup irqfd for this VM's eventfds */
>>> +    for (i = 0; i < s->vectors; i++) {
>>> +        kvm_set_irqfd(s->dev.msix_irq_entries[i].gsi,
>>> +                        s->peers[s->vm_id].eventfds[i], 1);
>>
>> kvm_set_irqfd() does not exist in HEAD.
>>
>>> +    }
>>> +}
>>> +
>>> +
>>> +/* this function increase the dynamic storage need to store data about other
>>> + * guests */
>>> +static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
>>> +
>>> +    int j, old_nb_alloc;
>>> +
>>> +    old_nb_alloc = s->nb_peers;
>>> +
>>> +    while (new_min_size >= s->nb_peers)
>>> +        s->nb_peers = s->nb_peers * 2;
>>> +
>>> +    IVSHMEM_DPRINTF("bumping storage to %d guests\n", s->nb_peers);
>>> +    s->peers = qemu_realloc(s->peers, s->nb_peers * sizeof(Peer));
>>> +
>>> +    if (s->peers == NULL) {
>>> +        fprintf(stderr, "Allocation error - exiting\n");
>>> +        exit(1);
>>> +    }
>>
>> qemu_realloc will never return zero.
>>
>>> +
>>> +    /* zero out new pointers */
>>> +    for (j = old_nb_alloc; j < s->nb_peers; j++) {
>>> +        s->peers[j].eventfds = NULL;
>>> +        s->peers[j].nb_eventfds = 0;
>>> +    }
>>> +}
>>> +
>>> +static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
>>> +{
>>> +    IVShmemState *s = opaque;
>>> +    int incoming_fd, tmp_fd;
>>> +    int guest_curr_max;
>>> +    long incoming_posn;
>>> +
>>> +    memcpy(&incoming_posn, buf, sizeof(long));
>>> +    /* pick off s->server_chr->msgfd and store it, posn should accompany msg */
>>> +    tmp_fd = qemu_chr_get_msgfd(s->server_chr);
>>> +    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
>>> +
>>> +    /* make sure we have enough space for this guest */
>>> +    if (incoming_posn >= s->nb_peers) {
>>> +        increase_dynamic_storage(s, incoming_posn);
>>> +    }
>>> +
>>> +    if (tmp_fd == -1) {
>>> +        /* if posn is positive and unseen before then this is our posn*/
>>> +        if ((incoming_posn >= 0) && (s->peers[incoming_posn].eventfds == NULL)) {
>>> +            /* receive our posn */
>>> +            s->vm_id = incoming_posn;
>>> +            return;
>>> +        } else {
>>> +            /* otherwise an fd == -1 means an existing guest has gone away */
>>> +            IVSHMEM_DPRINTF("posn %ld has gone away\n", incoming_posn);
>>> +            close_guest_eventfds(s, incoming_posn);
>>> +            return;
>>> +        }
>>> +    }
>>> +
>>> +    /* because of the implementation of get_msgfd, we need a dup */
>>> +    incoming_fd = dup(tmp_fd);
>>> +
>>> +    if (incoming_fd == -1) {
>>> +        fprintf(stderr, "could not allocate file descriptor %s\n",
>>> +                                                            strerror(errno));
>>> +        return;
>>> +    }
>>> +
>>> +    /* if the position is -1, then it's shared memory region fd */
>>> +    if (incoming_posn == -1) {
>>> +
>>> +        void * map_ptr;
>>> +
>>> +        s->max_peer = 0;
>>> +
>>> +        if (check_shm_size(s, incoming_fd) == -1) {
>>> +            exit(-1);
>>> +        }
>>> +
>>> +        /* mmap the region and map into the BAR2 */
>>> +        map_ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED,
>>> +                                                                incoming_fd, 0);
>>> +        s->ivshmem_offset = qemu_ram_map(s->ivshmem_size, map_ptr);
>>> +
>>> +        IVSHMEM_DPRINTF("guest pci addr = %u, guest h/w addr = %u, size = %u\n",
>>> +                        (uint32_t)s->shm_pci_addr, (uint32_t)s->ivshmem_offset,
>>> +                        (uint32_t)s->ivshmem_size);
>>> +
>>> +        if (s->shm_pci_addr > 0) {
>>> +            /* map memory into BAR2 */
>>> +            cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
>>> +                                                            s->ivshmem_offset);
>>> +            if (s->role && strncmp(s->role, "peer", 4) == 0) {
>>> +                IVSHMEM_DPRINTF("marking pages no migrate\n");
>>> +                cpu_mark_pages_no_migrate(s->ivshmem_offset, s->ivshmem_size);
>>> +            }
>>> +
>>> +        }
>>> +
>>> +        /* only store the fd if it is successfully mapped */
>>> +        s->shm_fd = incoming_fd;
>>> +
>>> +        return;
>>> +    }
>>> +
>>> +    /* each guest has an array of eventfds, and we keep track of how many
>>> +     * guests for each VM */
>>> +    guest_curr_max = s->peers[incoming_posn].nb_eventfds;
>>> +    if (guest_curr_max == 0) {
>>> +        /* one eventfd per MSI vector */
>>> +        s->peers[incoming_posn].eventfds = (int *) qemu_malloc(s->vectors *
>>> +                                                                sizeof(int));
>>
>> Useless cast in C.
>>
>>> +    }
>>> +
>>> +    /* this is an eventfd for a particular guest VM */
>>> +    IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn, guest_curr_max,
>>> +                                                                incoming_fd);
>>> +    s->peers[incoming_posn].eventfds[guest_curr_max] = incoming_fd;
>>> +
>>> +    /* increment count for particular guest */
>>> +    s->peers[incoming_posn].nb_eventfds++;
>>> +
>>> +    /* keep track of the maximum VM ID */
>>> +    if (incoming_posn > s->max_peer) {
>>> +        s->max_peer = incoming_posn;
>>> +    }
>>> +
>>> +    if (incoming_posn == s->vm_id) {
>>> +        int vector = guest_curr_max;
>>
>> Why add a new variable?
>
> For clarity, so when looking at the code it's clear that the current
> maximum is an MSI vector.  Perhaps I'll rename guest_curr_max to achieve
> this.
>
>>
>>> +        if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
>>> +            /* initialize char device for callback
>>> +             * if this is one of my eventfds */
>>> +            s->eventfd_chr[vector] = create_eventfd_chr_device(s,
>>> +                       s->peers[s->vm_id].eventfds[vector], vector);
>>> +        }
>>> +    }
>>> +
>>> +    return;
>>> +}
>>> +
>>> +static void ivshmem_reset(DeviceState *d)
>>> +{
>>> +    return;
>>
>> Nothing to do?
>
> Oversight, will fix.
>
>>
>>> +}
>>> +
>>> +static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
>>> +                       pcibus_t addr, pcibus_t size, int type)
>>> +{
>>> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
>>> +
>>> +    s->mmio_addr = addr;
>>> +    cpu_register_physical_memory(addr + 0, 0x400, s->ivshmem_mmio_io_addr);
>>
>> The 0x400 should be #defined earlier. Why 0x400 since you are only
>> interested in the 0x100 first bytes?
>
> We bumped it to 1KB (0x400) for a previous version that had multiple
> doorbells.  ioeventfd masks made multiple doorbells unnecessary.  I'll
> put it back to 0x100.
>
>>
>>> +
>>> +    /* ioeventfd and irqfd are enabled together,
>>> +     * so the flag IRQFD refers to both */
>>> +    if (ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
>>> +        setup_ioeventfds(s);
>>> +    }
>>> +}
>>> +
>>> +static uint64_t ivshmem_get_size(IVShmemState * s) {
>>> +
>>> +    uint64_t value;
>>> +    char *ptr;
>>> +
>>> +    value = strtoul(s->sizearg, &ptr, 10);
>>
>> I'd use strtoull() but the whole function should be suppressed, see below.
>>
>>> +    switch (*ptr) {
>>> +        case 0: case 'M': case 'm':
>>> +            value <<= 20;
>>> +            break;
>>> +        case 'G': case 'g':
>>> +            value <<= 30;
>>> +            break;
>>> +        default:
>>> +            fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg);
>>> +            exit(1);
>>> +    }
>>> +
>>> +    /* BARs must be a power of 2 */
>>> +    if (!is_power_of_two(value)) {
>>> +        fprintf(stderr, "ivshmem: size must be power of 2\n");
>>> +        exit(1);
>>> +    }
>>
>> Isn't the BAR check in pci.c enough?
>
> This would produce a clearer error that it is the ivshmem device that
> is the problem.
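
The two points above (strtoull() instead of strtoul(), and a device-specific
power-of-two error) could be combined roughly as follows.  This is only a
sketch: ivshmem_parse_size() is a hypothetical helper, not something in the
patch.  It returns an error code instead of calling exit(1) so the caller
can print the ivshmem-specific message, and it is stricter than the quoted
code in that it rejects trailing characters after the suffix.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not in the patch).  Parses "<n>", "<n>M" or "<n>G",
 * defaulting to megabytes as the quoted ivshmem_get_size() does.
 * Returns 0 on success, -1 if the size is malformed, overflows, or is not
 * a power of two (PCI BAR sizes must be powers of two). */
static int ivshmem_parse_size(const char *arg, uint64_t *out)
{
    char *end;
    unsigned long long value;
    unsigned shift;

    /* strtoull() would accept a sign or leading blanks; require a digit */
    if (arg == NULL || arg[0] < '0' || arg[0] > '9') {
        return -1;
    }

    errno = 0;
    value = strtoull(arg, &end, 10);
    if (errno == ERANGE) {
        return -1;
    }

    if (*end == '\0' || *end == 'M' || *end == 'm') {
        shift = 20;
    } else if (*end == 'G' || *end == 'g') {
        shift = 30;
    } else {
        return -1;
    }

    /* assumption: nothing may follow the suffix (e.g. "4MB" is rejected) */
    if (*end != '\0' && end[1] != '\0') {
        return -1;
    }

    /* refuse zero and anything that would overflow 64 bits when shifted */
    if (value == 0 || value > (UINT64_MAX >> shift)) {
        return -1;
    }
    value <<= shift;

    /* power-of-two test: exactly one bit set */
    if (value & (value - 1)) {
        return -1;
    }

    *out = value;
    return 0;
}
```

With this shape, pci_ivshmem_init() would keep the "ivshmem: size must be
power of 2" style message in one place, next to the call.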
>
>>
>>> +
>>> +    return value;
>>> +
>>> +}
>>> +
>>> +static void ivshmem_setup_msi(IVShmemState * s) {
>>> +
>>> +    int i;
>>> +
>>> +    /* allocate the MSI-X vectors */
>>> +
>>> +    if (!msix_init(&s->dev, s->vectors, 1, 0)) {
>>> +        pci_register_bar(&s->dev, 1,
>>> +                         msix_bar_size(&s->dev),
>>> +                         PCI_BASE_ADDRESS_SPACE_MEMORY,
>>> +                         msix_mmio_map);
>>> +        IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
>>> +    } else {
>>> +        IVSHMEM_DPRINTF("msix initialization failed\n");
>>
>> Is this fatal considering the msix_vector_use() below?
>
> Perhaps not, we could fall back to regular interrupts.  Could move the
> msix_vector_use() into the "then" clause.
>
>>
>>> +    }
>>> +
>>> +    /* 'activate' the vectors */
>>> +    for (i = 0; i < s->vectors; i++) {
>>> +        msix_vector_use(&s->dev, i);
>>> +    }
>>> +
>>> +    /* if IRQFDs are not supported, we'll have to trigger the interrupts
>>> +     * via Qemu char devices */
>>> +    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
>>> +        /* for handling interrupts when IRQFD is not available */
>>> +        s->eventfd_table = qemu_mallocz(s->vectors * sizeof(EventfdEntry));
>>> +    }
>>> +}
>>> +
>>> +static void ivshmem_save(QEMUFile* f, void *opaque)
>>> +{
>>> +    IVShmemState *proxy = opaque;
>>> +
>>> +    IVSHMEM_DPRINTF("ivshmem_save\n");
>>> +    pci_device_save(&proxy->dev, f);
>>> +
>>> +    if (ivshmem_has_feature(proxy, IVSHMEM_MSI)) {
>>> +        msix_save(&proxy->dev, f);
>>> +    } else {
>>> +        qemu_put_be32(f, proxy->intrstatus);
>>> +        qemu_put_be32(f, proxy->intrmask);
>>> +    }
>>> +
>>> +}
>>
>> There may be VMState magic to handle conditional structures (or just
>> make the structures unconditional), so VMState should be used instead.
>
> MSI-X requires the use of the old-style savevm/loadvm functions.
> Should VMState and the load/save be used together?

They do the same thing. Maybe MSI-X should be fixed then.

>>
>>> +
>>> +static int ivshmem_load(QEMUFile* f, void *opaque, int version_id)
>>> +{
>>> +    IVSHMEM_DPRINTF("ivshmem_load\n");
>>> +
>>> +    IVShmemState *proxy = opaque;
>>> +    int ret, i;
>>> +
>>
>> Missing version check.
>>
>>> +    ret = pci_device_load(&proxy->dev, f);
>>> +    if (ret) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    if (ivshmem_has_feature(proxy, IVSHMEM_MSI)) {
>>> +        msix_load(&proxy->dev, f);
>>> +        for (i = 0; i < proxy->vectors; i++) {
>>> +            msix_vector_use(&proxy->dev, i);
>>> +        }
>>> +    } else {
>>> +        proxy->intrstatus = qemu_get_be32(f);
>>> +        proxy->intrmask = qemu_get_be32(f);
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static int pci_ivshmem_init(PCIDevice *dev)
>>> +{
>>> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
>>> +    uint8_t *pci_conf;
>>> +
>>> +    if (s->sizearg == NULL)
>>> +        s->ivshmem_size = 4 << 20; /* 4 MB default */
>>> +    else {
>>> +        s->ivshmem_size = ivshmem_get_size(s);
>>> +    }
>>> +
>>> +    register_savevm("ivshmem", 0, 0, ivshmem_save, ivshmem_load, dev);
>>> +
>>> +    /* IRQFD requires MSI */
>>> +    if (ivshmem_has_feature(s, IVSHMEM_IRQFD) &&
>>> +        !ivshmem_has_feature(s, IVSHMEM_MSI)) {
>>> +        fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n");
>>> +        exit(1);
>>> +    }
>>> +
>>> +    /* check that role is reasonable */
>>> +    if (s->role && !((strncmp(s->role, "peer", 5) == 0) ||
>>> +                        (strncmp(s->role, "master", 7) == 0))) {
>>> +        fprintf(stderr, "ivshmem: 'role' must be 'peer' or 'master'\n");
>>> +        exit(1);
>>> +    }
>>
>> I'd add a scalar flag in IVShmemState for role so that further strcmps
>> are avoided.
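
A minimal sketch of the scalar-flag idea, assuming a new enum and helper
(both names are hypothetical and do not appear in the patch).  The role
string is parsed once at init time; later code compares an enum value
instead of calling strncmp() again.  It also removes the mismatch between
the length 4 used in the mapping path and the lengths 5 and 7 used in
pci_ivshmem_init().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type: a scalar replacement for the 'role' string. */
typedef enum {
    IVSHMEM_ROLE_UNSET = 0,   /* no role= property was given */
    IVSHMEM_ROLE_MASTER,      /* shared memory is copied on migration */
    IVSHMEM_ROLE_PEER,        /* shared memory is not copied on migration */
    IVSHMEM_ROLE_INVALID      /* anything else; caller should report it */
} IVShmemRole;

/* Hypothetical helper: parse the property string exactly once. */
static IVShmemRole ivshmem_parse_role(const char *role)
{
    if (role == NULL) {
        return IVSHMEM_ROLE_UNSET;
    }
    /* full-string compare, so "peerless" is not mistaken for "peer" */
    if (strcmp(role, "master") == 0) {
        return IVSHMEM_ROLE_MASTER;
    }
    if (strcmp(role, "peer") == 0) {
        return IVSHMEM_ROLE_PEER;
    }
    return IVSHMEM_ROLE_INVALID;
}
```

The init function would store the result in IVShmemState and exit on
IVSHMEM_ROLE_INVALID; the mapping code would then test for
IVSHMEM_ROLE_PEER before calling cpu_mark_pages_no_migrate().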
>>
>>> +
>>> +    pci_conf = s->dev.config;
>>> +    pci_conf[0x00] = 0xf4; /* Qumranet vendor ID 0x5002 */
>>> +    pci_conf[0x01] = 0x1a;
>>> +    pci_conf[0x02] = 0x10;
>>> +    pci_conf[0x03] = 0x11;
>>
>> Please add the DID to hw/pci_ids.h and use pci_config_set_xyz() here.
>>
>>> +    pci_conf[0x04] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
>>> +    pci_conf[0x0a] = 0x00; /* RAM controller */
>>> +    pci_conf[0x0b] = 0x05;
>>> +    pci_conf[0x0e] = 0x00; /* header_type */
>>> +
>>> +    pci_conf[PCI_INTERRUPT_PIN] = 1;
>>> +
>>> +    s->shm_pci_addr = 0;
>>> +    s->ivshmem_offset = 0;
>>> +    s->shm_fd = 0;
>>> +
>>> +    s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read,
>>> +                                    ivshmem_mmio_write, s);
>>> +    /* region for registers*/
>>> +    pci_register_bar(&s->dev, 0, 0x400,
>>> +                           PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_mmio_map);
>>> +
>>> +    if ((s->server_chr != NULL) &&
>>> +                        (strncmp(s->server_chr->filename, "unix:", 5) == 0)) {
>>> +        /* if we get a UNIX socket as the parameter we will talk
>>> +         * to the ivshmem server to receive the memory region */
>>> +
>>> +        IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
>>> +                                                    s->server_chr->filename);
>>> +
>>> +        if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
>>> +            ivshmem_setup_msi(s);
>>> +        }
>>> +
>>> +        /* we allocate enough space for 16 guests and grow as needed */
>>> +        s->nb_peers = 16;
>>> +        s->vm_id = -1;
>>> +
>>> +        /* allocate/initialize space for interrupt handling */
>>> +        s->peers = qemu_mallocz(s->nb_peers * sizeof(Peer));
>>> +
>>> +        pci_register_bar(&s->dev, 2, s->ivshmem_size,
>>> +                                    PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
>>> +
>>> +        s->eventfd_chr = (CharDriverState **) qemu_mallocz(s->vectors *
>>> +                                                sizeof(CharDriverState *));
>>
>> Useless cast in C.
>>
>>> +
>>> +        qemu_chr_add_handlers(s->server_chr, ivshmem_can_receive, ivshmem_read,
>>> +                     ivshmem_event, s);
>>> +    } else {
>>> +        /* just map the file immediately, we're not using a server */
>>> +        int fd;
>>> +
>>> +        if (s->shmobj == NULL) {
>>> +            fprintf(stderr, "Must specify 'chardev' or 'shm' to ivshmem\n");
>>> +        }
>>
>> I'd rather have separate 'chardev' and 'shm_file' parameters. Then
>> 'info qtree' could return more useful information about the chardev.
>
> Can you elaborate?  The command line currently must be one of the following:
>
> server case:
> -ivshmem chardev=x,...
> -chardev id=x,...
>
> or
>
> non-server case:
> -ivshmem shm=<name>,...

Never mind, I was confused.

>
>>
>>> +
>>> +        IVSHMEM_DPRINTF("using shm_open (shm object = %s)\n", s->shmobj);
>>> +
>>> +        /* try opening with O_EXCL and if it succeeds zero the memory
>>> +         * by truncating to 0 */
>>> +        if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR|O_EXCL,
>>> +                        S_IRWXU|S_IRWXG|S_IRWXO)) > 0) {
>>> +           /* truncate file to length PCI device's memory */
>>> +            if (ftruncate(fd, s->ivshmem_size) != 0) {
>>> +                fprintf(stderr, "kvm_ivshmem: could not truncate shared file\n");
>>
>> Why 'kvm_ivshmem'?
>
> old name, will remove.
>
>>
>>> +            }
>>> +
>>> +        } else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR,
>>> +                        S_IRWXU|S_IRWXG|S_IRWXO)) < 0) {
>>> +            fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
>>> +            exit(-1);
>>> +
>>> +        }
>>> +
>>> +        if (check_shm_size(s, fd) == -1) {
>>> +            exit(-1);
>>> +        }
>>> +
>>> +        create_shared_memory_BAR(s, fd);
>>> +
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static int pci_ivshmem_uninit(PCIDevice *dev)
>>> +{
>>> +    IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev);
>>> +
>>> +    cpu_unregister_io_memory(s->ivshmem_mmio_io_addr);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static PCIDeviceInfo ivshmem_info = {
>>> +    .qdev.name  = "ivshmem",
>>> +    .qdev.size  = sizeof(IVShmemState),
>>> +    .qdev.reset = ivshmem_reset,
>>> +    .init       = pci_ivshmem_init,
>>> +    .exit       = pci_ivshmem_uninit,
>>> +    .qdev.props = (Property[]) {
>>> +        DEFINE_PROP_CHR("chardev", IVShmemState, server_chr),
>>> +        DEFINE_PROP_STRING("size", IVShmemState, sizearg),
>>
>> This should be scalar type, not string.
>
> but it needs to handle the 'm' and 'g' suffixes for memory sizes.
>
>>
>>> +        DEFINE_PROP_UINT32("vectors", IVShmemState, vectors, 1),
>>> +        DEFINE_PROP_BIT("irqfd", IVShmemState, features, IVSHMEM_IRQFD, false),
>>> +        DEFINE_PROP_BIT("msi", IVShmemState, features, IVSHMEM_MSI, true),
>>> +        DEFINE_PROP_STRING("shm", IVShmemState, shmobj),
>>> +        DEFINE_PROP_STRING("role", IVShmemState, role),
>>> +        DEFINE_PROP_END_OF_LIST(),
>>> +    }
>>> +};
>>> +
>>> +static void ivshmem_register_devices(void)
>>> +{
>>> +    pci_qdev_register(&ivshmem_info);
>>> +}
>>> +
>>> +device_init(ivshmem_register_devices)
>>> diff --git a/qemu-char.c b/qemu-char.c
>>> index ac65a1c..b2e50d0 100644
>>> --- a/qemu-char.c
>>> +++ b/qemu-char.c
>>> @@ -2093,6 +2093,12 @@ static void tcp_chr_read(void *opaque)
>>>     }
>>>  }
>>>
>>> +CharDriverState *qemu_chr_open_eventfd(int eventfd){
>>> +
>>> +    return qemu_chr_open_fd(eventfd, eventfd);
>>> +
>>> +}
>>> +
>>>  static void tcp_chr_connect(void *opaque)
>>>  {
>>>     CharDriverState *chr = opaque;
>>> diff --git a/qemu-char.h b/qemu-char.h
>>> index e3a0783..6ea01ba 100644
>>> --- a/qemu-char.h
>>> +++ b/qemu-char.h
>>> @@ -94,6 +94,9 @@ void qemu_chr_info_print(Monitor *mon, const QObject *ret_data);
>>>  void qemu_chr_info(Monitor *mon, QObject **ret_data);
>>>  CharDriverState *qemu_chr_find(const char *name);
>>>
>>> +/* add an eventfd to the qemu devices that are polled */
>>> +CharDriverState *qemu_chr_open_eventfd(int eventfd);
>>
>> Maybe this should be removed and just open coded with qemu_chr_open_fd.
>
> I'm indifferent, it just looked a little funny passing the parameter twice.
>
>>
>>> +
>>>  extern int term_escape_char;
>>>
>>>  /* async I/O support */
>>> diff --git a/qemu-doc.texi b/qemu-doc.texi
>>> index 6647b7b..24f8748 100644
>>> --- a/qemu-doc.texi
>>> +++ b/qemu-doc.texi
>>> @@ -706,6 +706,49 @@ Using the @option{-net socket} option, it is possible to make VLANs
>>>  that span several QEMU instances. See @ref{sec_invocation} to have a
>>>  basic example.
>>>
>>> +@section Other Devices
>>> +
>>> +@subsection Inter-VM Shared Memory device
>>> +
>>> +With KVM enabled on a Linux host, a shared memory device is available.  Guests
>>> +map a POSIX shared memory region into the guest as a PCI device that enables
>>> +zero-copy communication to the application level of the guests.  The basic
>>> +syntax is:
>>> +
>>> +@example
>>> +qemu -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>>> +@end example
>>> +
>>> +If desired, interrupts can be sent between guest VMs accessing the same shared
>>> +memory region.  Interrupt support requires using a shared memory server and
>>> +using a chardev socket to connect to it.  The code for the shared memory server
>>> +is qemu.git/contrib/ivshmem-server.  An example syntax when using the shared
>>> +memory server is:
>>> +
>>> +@example
>>> +qemu -device ivshmem,size=<size in format accepted by -m>[,chardev=<id>]
>>> +                        [,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
>>> +qemu -chardev socket,path=<path>,id=<id>
>>> +@end example
>>> +
>>> +When using the server, the guest will be assigned a VM ID (>=0) that allows guests
>>> +using the same server to communicate via interrupts.  Guests can read their
>>> +VM ID from a device register (see example code).  Since receiving the shared
>>> +memory region from the server is asynchronous, there is a (small) chance the
>>> +guest may boot before the shared memory is attached.  To allow an application
>>> +to ensure shared memory is attached, the VM ID register will return -1 (an
>>> +invalid VM ID) until the memory is attached.  Once the shared memory is
>>> +attached, the VM ID will return the guest's valid VM ID.  With these semantics,
>>> +the guest application can check to ensure the shared memory is attached to the
>>> +guest before proceeding.
>>> +
>>> +The @option{role} argument can be set to either master or peer and will affect
>>> +how the shared memory is migrated.  With @option{role=master}, the guest will
>>> +copy the shared memory on migration to the destination host.  With
>>> +@option{role=peer}, the shared memory will not be copied on migration.  Only
>>> +one guest should be specified as
>>> +the master.
>>> +
>>>  @node direct_linux_boot
>>>  @section Direct Linux Boot
>>>
>>> --
>>> 1.6.3.2.198.g6096d
>>>
>>>
>>>
>>
>>
>


* Re: [PATCH v6 0/6] Inter-VM Shared Memory Device with migration support
  2010-06-04 21:45 ` [Qemu-devel] " Cam Macdonell
@ 2010-06-11 22:03   ` Cam Macdonell
  -1 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-11 22:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: KVM General, Anthony Liguori, Blue Swirl

Hi Anthony,

Is my implementation of master/peer roles acceptable?  I realize with
Alex's RAMList changes I may need to modify my patch, but is the
approach of marking memory non-migratable an acceptable
implementation?

Thanks,
Cam

On Fri, Jun 4, 2010 at 3:45 PM, Cam Macdonell <cam@cs.ualberta.ca> wrote:
> Latest patch for PCI shared memory device that maps a host shared memory object
> to be shared between guests
>
> new in this series
>    - migration support with 'master' and 'peer' roles for guest to determine
>      who "owns" memory.  With 'master', the guest has the canonical copy of
>      the shared memory and will copy it with it on migration.  With 'role=peer',
>      the guest will not copy the shared memory, but attach to what is on the
>      destination machine.
>    - modified phys_ram_dirty array for marking memory as not to be migrated
>    - add support for non-migrated memory regions
>
>    v5:
>    - fixed segfault for non-server case
>    - code style fixes
>    - removed limit on the number of guests
>    - shared memory server is now in qemu.git/contrib
>    - made ioeventfd setup function generic
>    - removed interrupts when guest joined (let application handle it)
>
>    v4:
>    - moved to single Doorbell register and use datamatch to trigger different
>      VMs rather than one register per eventfd
>    - remove writing arbitrary values to eventfds.  Only values of 1 are now
>      written to ensure correct usage
>
> Cam Macdonell (6):
>  Device specification for shared memory PCI device
>  Adds two new functions for assigning ioeventfd and irqfds.
>  Change phys_ram_dirty to phys_ram_status
>  Add support for marking memory to not be migrated.  On migration,
>    memory is checked for the NO_MIGRATION_FLAG.
>  Inter-VM shared memory PCI device
>  the stand-alone shared memory server for inter-VM shared memory
>
>  Makefile.target                         |    3 +
>  arch_init.c                             |   28 +-
>  contrib/ivshmem-server/Makefile         |   16 +
>  contrib/ivshmem-server/README           |   30 ++
>  contrib/ivshmem-server/ivshmem_server.c |  353 +++++++++++++
>  contrib/ivshmem-server/send_scm.c       |  208 ++++++++
>  contrib/ivshmem-server/send_scm.h       |   19 +
>  cpu-all.h                               |   18 +-
>  cpu-common.h                            |    2 +
>  docs/specs/ivshmem_device_spec.txt      |   96 ++++
>  exec.c                                  |   48 ++-
>  hw/ivshmem.c                            |  852 +++++++++++++++++++++++++++++++
>  kvm-all.c                               |   32 ++
>  kvm.h                                   |    1 +
>  qemu-char.c                             |    6 +
>  qemu-char.h                             |    3 +
>  qemu-doc.texi                           |   32 ++
>  17 files changed, 1710 insertions(+), 37 deletions(-)
>  create mode 100644 contrib/ivshmem-server/Makefile
>  create mode 100644 contrib/ivshmem-server/README
>  create mode 100644 contrib/ivshmem-server/ivshmem_server.c
>  create mode 100644 contrib/ivshmem-server/send_scm.c
>  create mode 100644 contrib/ivshmem-server/send_scm.h
>  create mode 100644 docs/specs/ivshmem_device_spec.txt
>  create mode 100644 hw/ivshmem.c
>
>


* Re: [Qemu-devel] [PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG.
  2010-06-04 21:45         ` [Qemu-devel] " Cam Macdonell
  (?)
  (?)
@ 2010-06-14 15:51         ` Anthony Liguori
  2010-06-14 16:08           ` Cam Macdonell
  -1 siblings, 1 reply; 42+ messages in thread
From: Anthony Liguori @ 2010-06-14 15:51 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: qemu-devel, kvm

On 06/04/2010 04:45 PM, Cam Macdonell wrote:
> This is useful for devices that do not want to take memory regions data with them on migration.
> ---
>   arch_init.c  |   28 ++++++++++++++++------------
>   cpu-all.h    |    2 ++
>   cpu-common.h |    2 ++
>   exec.c       |   12 ++++++++++++
>   4 files changed, 32 insertions(+), 12 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index cfc03ea..7a234fa 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -118,18 +118,21 @@ static int ram_save_block(QEMUFile *f)
>                                               current_addr + TARGET_PAGE_SIZE,
>                                               MIGRATION_DIRTY_FLAG);
>
> -            p = qemu_get_ram_ptr(current_addr);
> -
> -            if (is_dup_page(p, *p)) {
> -                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
> -                qemu_put_byte(f, *p);
> -            } else {
> -                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
> -                qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
> -            }
> +            if (!cpu_physical_memory_get_dirty(current_addr,
> +                                                    NO_MIGRATION_FLAG)) {
> +                p = qemu_get_ram_ptr(current_addr);
> +
> +                if (is_dup_page(p, *p)) {
> +                    qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
> +                    qemu_put_byte(f, *p);
> +                } else {
> +                    qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
> +                    qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
> +                }
>
> -            found = 1;
> -            break;
> +                found = 1;
> +                break;
> +            }
>           }
>    

Shouldn't we just disable live migration outright?

I would rather that the device mark migration as impossible, so that the
user has to hot-remove the device before migration and then add it again
after migration.  Device assignment could also use this functionality.

What this does is make migration possible but fundamentally broken, which
is not a good thing.
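
To make the alternative concrete, here is a self-contained model of a
device registering a reason why migration cannot proceed.  Every name in
it is hypothetical; it illustrates the idea only and is not an existing
QEMU interface.

```c
#include <stdio.h>

/* Hypothetical model: a device registers a "blocker" while it holds state
 * that cannot be migrated, and removes it when the device is unplugged. */
typedef struct MigrationBlocker {
    const char *reason;
    struct MigrationBlocker *next;
} MigrationBlocker;

static MigrationBlocker *blockers;  /* singly linked list of active blockers */

/* Called by a device (e.g. at init) to forbid migration. */
static void migration_block(MigrationBlocker *b)
{
    b->next = blockers;
    blockers = b;
}

/* Called when the device goes away (e.g. on hot remove). */
static void migration_unblock(MigrationBlocker *b)
{
    MigrationBlocker **p;

    for (p = &blockers; *p != NULL; p = &(*p)->next) {
        if (*p == b) {
            *p = b->next;
            b->next = NULL;
            return;
        }
    }
}

/* Returns 0 if migration may start, -1 if any device has blocked it. */
static int migration_start(void)
{
    if (blockers != NULL) {
        fprintf(stderr, "migration blocked: %s\n", blockers->reason);
        return -1;
    }
    return 0;
}
```

In this model the user gets a clear refusal while the device is present,
and migration works again once the device has been hot-removed.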

Regards,

Anthony Liguori

>           addr += TARGET_PAGE_SIZE;
>           current_addr = (saved_addr + addr) % last_ram_offset;
> @@ -146,7 +149,8 @@ static ram_addr_t ram_save_remaining(void)
>       ram_addr_t count = 0;
>
>       for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
> -        if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
> +        if (!cpu_physical_memory_get_dirty(addr, NO_MIGRATION_FLAG) &&
> +                cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
>               count++;
>           }
>       }
> diff --git a/cpu-all.h b/cpu-all.h
> index 9080cc7..4df00ab 100644
> --- a/cpu-all.h
> +++ b/cpu-all.h
> @@ -887,6 +887,8 @@ extern int mem_prealloc;
>   #define CODE_DIRTY_FLAG      0x02
>   #define MIGRATION_DIRTY_FLAG 0x08
>
> +#define NO_MIGRATION_FLAG 0x10
> +
>   #define DIRTY_ALL_FLAG  (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | MIGRATION_DIRTY_FLAG)
>
>   /* read dirty bit (return 0 or 1) */
> diff --git a/cpu-common.h b/cpu-common.h
> index 4b0ba60..a1ebbbe 100644
> --- a/cpu-common.h
> +++ b/cpu-common.h
> @@ -39,6 +39,8 @@ static inline void cpu_register_physical_memory(target_phys_addr_t start_addr,
>       cpu_register_physical_memory_offset(start_addr, size, phys_offset, 0);
>   }
>
> +void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t size);
> +
>   ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
>   ram_addr_t qemu_ram_map(ram_addr_t size, void *host);
>   ram_addr_t qemu_ram_alloc(ram_addr_t);
> diff --git a/exec.c b/exec.c
> index 39c18a7..c11d22f 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2786,6 +2786,18 @@ static void *file_ram_alloc(ram_addr_t memory, const char *path)
>   }
>   #endif
>
> +void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t length)
> +{
> +    int i, len;
> +    uint8_t *p;
> +
> +    len = length >> TARGET_PAGE_BITS;
> +    p = phys_ram_flags + (start >> TARGET_PAGE_BITS);
> +    for (i = 0; i < len; i++) {
> +        p[i] |= NO_MIGRATION_FLAG;
> +    }
> +}
> +
>   ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
>   {
>       RAMBlock *new_block;
>    



* Re: [Qemu-devel] [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory
  2010-06-04 21:45             ` [Qemu-devel] " Cam Macdonell
  (?)
  (?)
@ 2010-06-14 15:53             ` Anthony Liguori
  2010-06-14 22:03               ` Cam Macdonell
  2010-06-23 13:12               ` Avi Kivity
  -1 siblings, 2 replies; 42+ messages in thread
From: Anthony Liguori @ 2010-06-14 15:53 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: qemu-devel, kvm

On 06/04/2010 04:45 PM, Cam Macdonell wrote:
> this code is a standalone server which will pass file descriptors for the shared
> memory region and eventfds to support interrupts between guests using inter-VM
> shared memory.
> ---
>   contrib/ivshmem-server/Makefile         |   16 ++
>   contrib/ivshmem-server/README           |   30 +++
>   contrib/ivshmem-server/ivshmem_server.c |  353 +++++++++++++++++++++++++++++++
>   contrib/ivshmem-server/send_scm.c       |  208 ++++++++++++++++++
>   contrib/ivshmem-server/send_scm.h       |   19 ++
>   5 files changed, 626 insertions(+), 0 deletions(-)
>   create mode 100644 contrib/ivshmem-server/Makefile
>   create mode 100644 contrib/ivshmem-server/README
>   create mode 100644 contrib/ivshmem-server/ivshmem_server.c
>   create mode 100644 contrib/ivshmem-server/send_scm.c
>   create mode 100644 contrib/ivshmem-server/send_scm.h
>
> diff --git a/contrib/ivshmem-server/Makefile b/contrib/ivshmem-server/Makefile
> new file mode 100644
> index 0000000..da40ffa
> --- /dev/null
> +++ b/contrib/ivshmem-server/Makefile
> @@ -0,0 +1,16 @@
> +CC = gcc
> +CFLAGS = -O3 -Wall -Werror
> +LIBS = -lrt
> +
> +# a very simple makefile to build the inter-VM shared memory server
> +
> +all: ivshmem_server
> +
> +.c.o:
> +	$(CC) $(CFLAGS) -c $^ -o $@
> +
> +ivshmem_server: ivshmem_server.o send_scm.o
> +	$(CC) $(CFLAGS) -o $@ $^ $(LIBS)
> +
> +clean:
> +	rm -f *.o ivshmem_server
> diff --git a/contrib/ivshmem-server/README b/contrib/ivshmem-server/README
> new file mode 100644
> index 0000000..b1fc2a2
> --- /dev/null
> +++ b/contrib/ivshmem-server/README
> @@ -0,0 +1,30 @@
> +Using the ivshmem shared memory server
> +--------------------------------------
> +
> +This server is only supported on Linux.
> +
> +To use the shared memory server, first compile it.  Running 'make' should
> +accomplish this.  An executable named 'ivshmem_server' will be built.
> +
> +to display the options run:
> +
> +./ivshmem_server -h
> +
> +Options
> +-------
> +
> +    -h  print help message
> +
> +    -p <path on host>
> +        unix socket to listen on.  The qemu-kvm chardev needs to connect on
> +        this socket. (default: '/tmp/ivshmem_socket')
> +
> +    -s <string>
> +        POSIX shared object to create that is the shared memory (default: 'ivshmem')
> +
> +    -m <#>
> +        size of the POSIX object in MBs (default: 1)
> +
> +    -n <#>
> +        number of eventfds for each guest.  This number must match the
> +        'vectors' argument passed the ivshmem device. (default: 1)
> diff --git a/contrib/ivshmem-server/ivshmem_server.c b/contrib/ivshmem-server/ivshmem_server.c
> new file mode 100644
> index 0000000..e0a7b98
> --- /dev/null
> +++ b/contrib/ivshmem-server/ivshmem_server.c
>    

There's no licensing here.  I don't think this belongs in the qemu tree 
either to be honest.  If it were to be included, it ought to use all of 
the existing qemu infrastructure like the other qemu-* tools.

Regards,

Anthony Liguori

> @@ -0,0 +1,353 @@
> +/*
> + * A stand-alone shared memory server for inter-VM shared memory for KVM
> +*/
> +
> +#include <errno.h>
> +#include <string.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <sys/un.h>
> +#include <unistd.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <sys/eventfd.h>
> +#include <sys/mman.h>
> +#include <sys/select.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include "send_scm.h"
> +
> +#define DEFAULT_SOCK_PATH "/tmp/ivshmem_socket"
> +#define DEFAULT_SHM_OBJ "ivshmem"
> +
> +#define DEBUG 1
> +
> +typedef struct server_state {
> +    vmguest_t *live_vms;
> +    int nr_allocated_vms;
> +    int shm_size;
> +    long live_count;
> +    long total_count;
> +    int shm_fd;
> +    char * path;
> +    char * shmobj;
> +    int maxfd, conn_socket;
> +    long msi_vectors;
> +} server_state_t;
> +
> +void usage(char const *prg);
> +int find_set(fd_set * readset, int max);
> +void print_vec(server_state_t * s, const char * c);
> +
> +void add_new_guest(server_state_t * s);
> +void parse_args(int argc, char **argv, server_state_t * s);
> +int create_listening_socket(char * path);
> +
> +int main(int argc, char ** argv)
> +{
> +    fd_set readset;
> +    server_state_t * s;
> +
> +    s = (server_state_t *)calloc(1, sizeof(server_state_t));
> +
> +    s->live_count = 0;
> +    s->total_count = 0;
> +    parse_args(argc, argv, s);
> +
> +    /* open shared memory file  */
> +    if ((s->shm_fd = shm_open(s->shmobj, O_CREAT|O_RDWR, S_IRWXU)) < 0)
> +    {
> +        fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
> +        exit(-1);
> +    }
> +
> +    ftruncate(s->shm_fd, s->shm_size);
> +
> +    s->conn_socket = create_listening_socket(s->path);
> +
> +    s->maxfd = s->conn_socket;
> +
> +    for(;;) {
> +        int ret, handle, i;
> +        char buf[1024];
> +
> +        print_vec(s, "vm_sockets");
> +
> +        FD_ZERO(&readset);
> +        /* conn socket is in Live_vms at posn 0 */
> +        FD_SET(s->conn_socket, &readset);
> +        for (i = 0; i < s->total_count; i++) {
> +            if (s->live_vms[i].alive != 0) {
> +                FD_SET(s->live_vms[i].sockfd, &readset);
> +            }
> +        }
> +
> +        printf("\nWaiting (maxfd = %d)\n", s->maxfd);
> +
> +        ret = select(s->maxfd + 1, &readset, NULL, NULL, NULL);
> +
> +        if (ret == -1) {
> +            perror("select()");
> +        }
> +
> +        handle = find_set(&readset, s->maxfd + 1);
> +        if (handle == -1) continue;
> +
> +        if (handle == s->conn_socket) {
> +
> +            printf("[NC] new connection\n");
> +            FD_CLR(s->conn_socket, &readset);
> +
> +            /* total_count is equal to the new guest's VM ID */
> +            add_new_guest(s);
> +
> +            /* update the maximum file descriptor number */
> +            s->maxfd = s->live_vms[s->total_count - 1].sockfd > s->maxfd ?
> +                            s->live_vms[s->total_count - 1].sockfd : s->maxfd;
> +
> +            s->live_count++;
> +            printf("Live_count is %ld\n", s->live_count);
> +
> +        } else {
> +            /* then we have received a disconnection */
> +            int recv_ret;
> +            long i, j;
> +            long deadposn = -1;
> +
> +            recv_ret = recv(handle, buf, 1, 0);
> +
> +            printf("[DC] recv returned %d\n", recv_ret);
> +
> +            /* find the dead VM in our list and mark it as dead */
> +            for (i = 0; i < s->total_count; i++) {
> +                if (s->live_vms[i].sockfd == handle) {
> +                    deadposn = i;
> +                    s->live_vms[i].alive = 0;
> +                    close(s->live_vms[i].sockfd);
> +
> +                    for (j = 0; j < s->msi_vectors; j++) {
> +                        close(s->live_vms[i].efd[j]);
> +                    }
> +
> +                    free(s->live_vms[i].efd);
> +                    s->live_vms[i].sockfd = -1;
> +                    break;
> +                }
> +            }
> +
> +            for (j = 0; j < s->total_count; j++) {
> +                /* update remaining clients that one client has left/died */
> +                if (s->live_vms[j].alive) {
> +                    printf("[UD] sending kill of fd[%ld] to %ld\n",
> +                                                                deadposn, j);
> +                    sendKill(s->live_vms[j].sockfd, deadposn, sizeof(deadposn));
> +                }
> +            }
> +
> +            s->live_count--;
> +
> +            /* the socket of a VM found above was already closed there */
> +            if (deadposn == -1) close(handle);
> +        }
> +
> +    }
> +
> +    return 0;
> +}
> +
> +void add_new_guest(server_state_t * s) {
> +
> +    struct sockaddr_un remote;
> +    socklen_t t = sizeof(remote);
> +    long i, j;
> +    int vm_sock;
> +    long new_posn;
> +    long neg1 = -1;
> +
> +    vm_sock = accept(s->conn_socket, (struct sockaddr *)&remote, &t);
> +
> +    if ( vm_sock == -1 ) {
> +        perror("accept");
> +        exit(1);
> +    }
> +
> +    new_posn = s->total_count;
> +
> +    if (new_posn == s->nr_allocated_vms) {
> +        printf("increasing vm slots\n");
> +        s->nr_allocated_vms = s->nr_allocated_vms * 2;
> +        if (s->nr_allocated_vms < 16)
> +            s->nr_allocated_vms = 16;
> +        s->live_vms = realloc(s->live_vms,
> +                    s->nr_allocated_vms * sizeof(vmguest_t));
> +
> +        if (s->live_vms == NULL) {
> +            fprintf(stderr, "realloc failed - quitting\n");
> +            exit(-1);
> +        }
> +    }
> +
> +    s->live_vms[new_posn].posn = new_posn;
> +    printf("[NC] Live_vms[%ld]\n", new_posn);
> +    s->live_vms[new_posn].efd = (int *) malloc(s->msi_vectors * sizeof(int));
> +    for (i = 0; i < s->msi_vectors; i++) {
> +        s->live_vms[new_posn].efd[i] = eventfd(0, 0);
> +        printf("\tefd[%ld] = %d\n", i, s->live_vms[new_posn].efd[i]);
> +    }
> +    s->live_vms[new_posn].sockfd = vm_sock;
> +    s->live_vms[new_posn].alive = 1;
> +
> +
> +    sendPosition(vm_sock, new_posn);
> +    sendUpdate(vm_sock, neg1, sizeof(long), s->shm_fd);
> +    printf("[NC] trying to send fds to new connection\n");
> +    sendRights(vm_sock, new_posn, sizeof(new_posn), s->live_vms, s->msi_vectors);
> +
> +    printf("[NC] Connected (count = %ld).\n", new_posn);
> +    for (i = 0; i < new_posn; i++) {
> +        if (s->live_vms[i].alive) {
> +            // ping all clients that a new client has joined
> +            printf("[UD] sending fd[%ld] to %ld\n", new_posn, i);
> +            for (j = 0; j < s->msi_vectors; j++) {
> +                printf("\tefd[%ld] = [%d]", j, s->live_vms[new_posn].efd[j]);
> +                sendUpdate(s->live_vms[i].sockfd, new_posn,
> +                        sizeof(new_posn), s->live_vms[new_posn].efd[j]);
> +            }
> +            printf("\n");
> +        }
> +    }
> +
> +    s->total_count++;
> +}
> +
> +int create_listening_socket(char * path) {
> +
> +    struct sockaddr_un local;
> +    int len, conn_socket;
> +
> +    if ((conn_socket = socket(AF_UNIX, SOCK_STREAM, 0)) == -1) {
> +        perror("socket");
> +        exit(1);
> +    }
> +
> +    local.sun_family = AF_UNIX;
> +    strcpy(local.sun_path, path);
> +    unlink(local.sun_path);
> +    len = strlen(local.sun_path) + sizeof(local.sun_family);
> +    if (bind(conn_socket, (struct sockaddr *)&local, len) == -1) {
> +        perror("bind");
> +        exit(1);
> +    }
> +
> +    if (listen(conn_socket, 5) == -1) {
> +        perror("listen");
> +        exit(1);
> +    }
> +
> +    return conn_socket;
> +
> +}
> +
> +void parse_args(int argc, char **argv, server_state_t * s) {
> +
> +    int c;
> +
> +    s->shm_size = 1024 * 1024; // default shm_size
> +    s->path = NULL;
> +    s->shmobj = NULL;
> +    s->msi_vectors = 1;
> +
> +	while ((c = getopt(argc, argv, "hp:s:m:n:")) != -1) {
> +
> +        switch (c) {
> +            // path to listening socket
> +            case 'p':
> +                s->path = optarg;
> +                break;
> +            // name of shared memory object
> +            case 's':
> +                s->shmobj = optarg;
> +                break;
> +            // size of shared memory object
> +            case 'm': {
> +                    uint64_t value;
> +                    char *ptr;
> +
> +                    value = strtoul(optarg, &ptr, 10);
> +                    switch (*ptr) {
> +                    case 0: case 'M': case 'm':
> +                        value <<= 20;
> +                        break;
> +                    case 'G': case 'g':
> +                        value <<= 30;
> +                        break;
> +                    default:
> +                        fprintf(stderr, "qemu: invalid ram size: %s\n", optarg);
> +                        exit(1);
> +                    }
> +                    s->shm_size = value;
> +                    break;
> +                }
> +            case 'n':
> +                s->msi_vectors = atol(optarg);
> +                break;
> +            case 'h':
> +            default:
> +	            usage(argv[0]);
> +		        exit(1);
> +		}
> +	}
> +
> +    if (s->path == NULL) {
> +        s->path = strdup(DEFAULT_SOCK_PATH);
> +    }
> +
> +    printf("listening socket: %s\n", s->path);
> +
> +    if (s->shmobj == NULL) {
> +        s->shmobj = strdup(DEFAULT_SHM_OBJ);
> +    }
> +
> +    printf("shared object: %s\n", s->shmobj);
> +    printf("shared object size: %d (bytes)\n", s->shm_size);
> +
> +}
> +
> +void print_vec(server_state_t * s, const char * c) {
> +
> +    int i, j;
> +
> +#if DEBUG
> +    printf("%s (%ld) = ", c, s->total_count);
> +    for (i = 0; i < s->total_count; i++) {
> +        if (s->live_vms[i].alive) {
> +            for (j = 0; j < s->msi_vectors; j++) {
> +                printf("[%d|%d] ", s->live_vms[i].sockfd, s->live_vms[i].efd[j]);
> +            }
> +        }
> +    }
> +    printf("\n");
> +#endif
> +
> +}
> +
> +int find_set(fd_set * readset, int max) {
> +
> +    int i;
> +
> +    for (i = 1; i < max; i++) {
> +        if (FD_ISSET(i, readset)) {
> +            return i;
> +        }
> +    }
> +
> +    printf("nothing set\n");
> +    return -1;
> +
> +}
> +
> +void usage(char const *prg) {
> +	fprintf(stderr, "use: %s [-h]  [-p <unix socket>] [-s <shm obj>] "
> +            "[-m <size in MB>] [-n <# of MSI vectors>]\n", prg);
> +}
> diff --git a/contrib/ivshmem-server/send_scm.c b/contrib/ivshmem-server/send_scm.c
> new file mode 100644
> index 0000000..b1bb4a3
> --- /dev/null
> +++ b/contrib/ivshmem-server/send_scm.c
> @@ -0,0 +1,208 @@
> +#include <stdint.h>
> +#include <stdlib.h>
> +#include <errno.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <sys/socket.h>
> +#include <string.h>
> +#include <sys/un.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <poll.h>
> +#include "send_scm.h"
> +
> +#ifndef POLLRDHUP
> +#define POLLRDHUP 0x2000
> +#endif
> +
> +int readUpdate(int fd, long * posn, int * newfd)
> +{
> +    struct msghdr msg;
> +    struct iovec iov[1];
> +    struct cmsghdr *cmptr;
> +    size_t len;
> +    size_t msg_size = sizeof(int);
> +    char control[CMSG_SPACE(msg_size)];
> +
> +    msg.msg_name = 0;
> +    msg.msg_namelen = 0;
> +    msg.msg_control = control;
> +    msg.msg_controllen = sizeof(control);
> +    msg.msg_flags = 0;
> +    msg.msg_iov = iov;
> +    msg.msg_iovlen = 1;
> +
> +    iov[0].iov_base = posn;
> +    iov[0].iov_len = sizeof(*posn);
> +
> +    do {
> +        len = recvmsg(fd, &msg, 0);
> +    } while (len == (size_t) (-1) && (errno == EINTR || errno == EAGAIN));
> +
> +    printf("iov[0].buf is %ld\n", *((long *)iov[0].iov_base));
> +    printf("len is %ld\n", len);
> +    // TODO: Logging
> +    if (len == (size_t) (-1)) {
> +        perror("recvmsg()");
> +        return -1;
> +    }
> +
> +    if (msg.msg_controllen < sizeof(struct cmsghdr))
> +        return *posn;
> +
> +    for (cmptr = CMSG_FIRSTHDR(&msg); cmptr != NULL;
> +        cmptr = CMSG_NXTHDR(&msg, cmptr)) {
> +        if (cmptr->cmsg_level != SOL_SOCKET ||
> +            cmptr->cmsg_type != SCM_RIGHTS){
> +                printf("continuing %ld\n", sizeof(size_t));
> +                printf("read msg_size = %ld\n", msg_size);
> +                if (cmptr->cmsg_len != sizeof(control))
> +                    printf("not equal (%ld != %ld)\n",cmptr->cmsg_len,sizeof(control));
> +                continue;
> +        }
> +
> +        memcpy(newfd, CMSG_DATA(cmptr), sizeof(int));
> +        printf("posn is %ld (fd = %d)\n", *posn, *newfd);
> +        return 0;
> +    }
> +
> +    fprintf(stderr, "bad data in packet\n");
> +    return -1;
> +}
> +
> +int readRights(int fd, long count, size_t count_len, int **fds, int msi_vectors)
> +{
> +    int j, newfd;
> +
> +    for (; ;){
> +        long posn = 0;
> +
> +        readUpdate(fd, &posn, &newfd);
> +        printf("reading posn %ld ", posn);
> +        fds[posn] = (int *)malloc (msi_vectors * sizeof(int));
> +        fds[posn][0] = newfd;
> +        for (j = 1; j < msi_vectors; j++) {
> +            readUpdate(fd, &posn, &newfd);
> +            fds[posn][j] = newfd;
> +            printf("%d.", fds[posn][j]);
> +        }
> +        printf("\n");
> +
> +        /* stop reading once i've read my own eventfds */
> +        if (posn == count)
> +            break;
> +    }
> +
> +    return 0;
> +}
> +
> +int sendKill(int fd, long const posn, size_t posn_len) {
> +
> +    struct cmsghdr *cmsg;
> +    size_t msg_size = sizeof(int);
> +    char control[CMSG_SPACE(msg_size)];
> +    struct iovec iov[1];
> +    size_t len;
> +    struct msghdr msg = { 0, 0, iov, 1, control, sizeof control, 0 };
> +
> +    struct pollfd mypollfd;
> +    int rv;
> +
> +    iov[0].iov_base = (void *)&posn;
> +    iov[0].iov_len = posn_len;
> +
> +    // from cmsg(3)
> +    cmsg = CMSG_FIRSTHDR(&msg);
> +    cmsg->cmsg_level = SOL_SOCKET;
> +    cmsg->cmsg_len = 0;
> +    msg.msg_controllen = cmsg->cmsg_len;
> +
> +    printf("Killing posn %ld\n", posn);
> +
> +    // check if the fd is dead or not
> +    mypollfd.fd = fd;
> +    mypollfd.events = POLLRDHUP;
> +    mypollfd.revents = 0;
> +
> +    rv = poll(&mypollfd, 1, 0);
> +
> +    printf("rv is %d\n", rv);
> +
> +    if (rv == 0) {
> +        len = sendmsg(fd, &msg, 0);
> +        if (len == (size_t) (-1)) {
> +            perror("sendmsg()");
> +            return -1;
> +        }
> +        return (len == posn_len);
> +    } else {
> +        printf("already dead\n");
> +        return 0;
> +    }
> +}
> +
> +int sendUpdate(int fd, long posn, size_t posn_len, int sendfd)
> +{
> +
> +    struct cmsghdr *cmsg;
> +    size_t msg_size = sizeof(int);
> +    char control[CMSG_SPACE(msg_size)];
> +    struct iovec iov[1];
> +    size_t len;
> +    struct msghdr msg = { 0, 0, iov, 1, control, sizeof control, 0 };
> +
> +    iov[0].iov_base = (void *) (&posn);
> +    iov[0].iov_len = posn_len;
> +
> +    // from cmsg(3)
> +    cmsg = CMSG_FIRSTHDR(&msg);
> +    cmsg->cmsg_level = SOL_SOCKET;
> +    cmsg->cmsg_type = SCM_RIGHTS;
> +    cmsg->cmsg_len = CMSG_LEN(msg_size);
> +    msg.msg_controllen = cmsg->cmsg_len;
> +
> +    memcpy((CMSG_DATA(cmsg)), &sendfd, msg_size);
> +
> +    len = sendmsg(fd, &msg, 0);
> +    if (len == (size_t) (-1)) {
> +        perror("sendmsg()");
> +        return -1;
> +    }
> +
> +    return (len == posn_len);
> +
> +}
> +
> +int sendPosition(int fd, long const posn)
> +{
> +    int rv;
> +
> +    rv = send(fd, &posn, sizeof(long), 0);
> +    if (rv != sizeof(long)) {
> +        fprintf(stderr, "error sending posn\n");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +int sendRights(int fd, long const count, size_t count_len, vmguest_t * Live_vms,
> +                                                            long msi_vectors)
> +{
> +    /* updates about new guests are sent one at a time */
> +
> +    long i, j;
> +
> +    for (i = 0; i <= count; i++) {
> +        if (Live_vms[i].alive) {
> +            for (j = 0; j < msi_vectors; j++) {
> +                sendUpdate(Live_vms[count].sockfd, i, sizeof(long),
> +                                                        Live_vms[i].efd[j]);
> +            }
> +        }
> +    }
> +
> +    return 0;
> +
> +}
> diff --git a/contrib/ivshmem-server/send_scm.h b/contrib/ivshmem-server/send_scm.h
> new file mode 100644
> index 0000000..48c9a8d
> --- /dev/null
> +++ b/contrib/ivshmem-server/send_scm.h
> @@ -0,0 +1,19 @@
> +#ifndef SEND_SCM
> +#define SEND_SCM
> +
> +struct vm_guest_conn {
> +    int posn;
> +    int sockfd;
> +    int * efd;
> +    int alive;
> +};
> +
> +typedef struct vm_guest_conn vmguest_t;
> +
> +int readRights(int fd, long count, size_t count_len, int **fds, int msi_vectors);
> +int sendRights(int fd, long const count, size_t count_len, vmguest_t *Live_vms, long msi_vectors);
> +int readUpdate(int fd, long * posn, int * newfd);
> +int sendUpdate(int fd, long const posn, size_t posn_len, int sendfd);
> +int sendPosition(int fd, long const posn);
> +int sendKill(int fd, long const posn, size_t posn_len);
> +#endif
>    
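
For readers following the quoted send_scm.c: the server hands the shared
memory fd and each guest's eventfds to clients as SCM_RIGHTS ancillary data
on the Unix socket, with the guest position as the ordinary payload.  The
following is only a minimal sketch of that technique, not the patch's code;
send_fd() and recv_fd() are hypothetical names, and the union-based control
buffer follows the pattern shown in cmsg(3).

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send `posn` as payload and `fd` as SCM_RIGHTS data. */
int send_fd(int sock, long posn, int fd)
{
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;       /* forces suitable buffer alignment */
    } control;
    struct iovec iov;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&control, 0, sizeof(control));
    memset(&msg, 0, sizeof(msg));

    iov.iov_base = &posn;
    iov.iov_len = sizeof(posn);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == (ssize_t)sizeof(posn) ? 0 : -1;
}

/* Hypothetical helper: receive the payload into *posn and the fd into *fd. */
int recv_fd(int sock, long *posn, int *fd)
{
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } control;
    struct iovec iov;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(posn != NULL && fd != NULL);
    memset(&msg, 0, sizeof(msg));

    /* The payload must land in the caller's long, not in the pointer. */
    iov.iov_base = posn;
    iov.iov_len = sizeof(*posn);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    if (recvmsg(sock, &msg, 0) != (ssize_t)sizeof(*posn)) {
        return -1;
    }

    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
            memcpy(fd, CMSG_DATA(cmsg), sizeof(int));
            return 0;
        }
    }
    return -1;                      /* payload arrived without an fd */
}
```

Both helpers return 0 on success and -1 on failure.  The received fd is a
new descriptor in the receiving process that refers to the same open file
as the one that was sent.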


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] Re: [PATCH v6 0/6] Inter-VM Shared Memory Device with migration support
  2010-06-11 22:03   ` [Qemu-devel] " Cam Macdonell
@ 2010-06-14 15:54     ` Anthony Liguori
  -1 siblings, 0 replies; 42+ messages in thread
From: Anthony Liguori @ 2010-06-14 15:54 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: qemu-devel, Blue Swirl, KVM General

On 06/11/2010 05:03 PM, Cam Macdonell wrote:
> Hi Anthony,
>
> Is my implementation of master/peer roles acceptable?

Yes, it looks good.

>    I realize with
> Alex's RAMList changes I may need to modify my patch, but is the
> approach of marking memory non-migratable an acceptable
> implementation?
>    

Please make sure to address some of the CODING_STYLE comments too.

Regards,

Anthony Liguori

> Thanks,
> Cam
>
> On Fri, Jun 4, 2010 at 3:45 PM, Cam Macdonell <cam@cs.ualberta.ca> wrote:
>    
>> Latest patch for PCI shared memory device that maps a host shared memory object
>> to be shared between guests
>>
>> new in this series
>>     - migration support with 'master' and 'peer' roles for guest to determine
>>       who "owns" memory.  With 'master', the guest has the canonical copy of
>>       the shared memory and will copy it with it on migration.  With 'role=peer',
>>       the guest will not copy the shared memory, but attach to what is on the
>>       destination machine.
>>     - modified phys_ram_dirty array for marking memory as not to be migrated
>>     - add support for non-migrated memory regions
>>
>>     v5:
>>     - fixed segfault for non-server case
>>     - code style fixes
>>     - removed limit on the number of guests
>>     - shared memory server is now in qemu.git/contrib
>>     - made ioeventfd setup function generic
>>     - removed interrupts when guest joined (let application handle it)
>>
>>     v4:
>>     - moved to single Doorbell register and use datamatch to trigger different
>>       VMs rather than one register per eventfd
>>     - remove writing arbitrary values to eventfds.  Only values of 1 are now
>>       written to ensure correct usage
>>
>> Cam Macdonell (6):
>>   Device specification for shared memory PCI device
>>   Adds two new functions for assigning ioeventfd and irqfds.
>>   Change phys_ram_dirty to phys_ram_status
>>   Add support for marking memory to not be migrated.  On migration,
>>     memory is checked for the NO_MIGRATION_FLAG.
>>   Inter-VM shared memory PCI device
>>   the stand-alone shared memory server for inter-VM shared memory
>>
>>   Makefile.target                         |    3 +
>>   arch_init.c                             |   28 +-
>>   contrib/ivshmem-server/Makefile         |   16 +
>>   contrib/ivshmem-server/README           |   30 ++
>>   contrib/ivshmem-server/ivshmem_server.c |  353 +++++++++++++
>>   contrib/ivshmem-server/send_scm.c       |  208 ++++++++
>>   contrib/ivshmem-server/send_scm.h       |   19 +
>>   cpu-all.h                               |   18 +-
>>   cpu-common.h                            |    2 +
>>   docs/specs/ivshmem_device_spec.txt      |   96 ++++
>>   exec.c                                  |   48 ++-
>>   hw/ivshmem.c                            |  852 +++++++++++++++++++++++++++++++
>>   kvm-all.c                               |   32 ++
>>   kvm.h                                   |    1 +
>>   qemu-char.c                             |    6 +
>>   qemu-char.h                             |    3 +
>>   qemu-doc.texi                           |   32 ++
>>   17 files changed, 1710 insertions(+), 37 deletions(-)
>>   create mode 100644 contrib/ivshmem-server/Makefile
>>   create mode 100644 contrib/ivshmem-server/README
>>   create mode 100644 contrib/ivshmem-server/ivshmem_server.c
>>   create mode 100644 contrib/ivshmem-server/send_scm.c
>>   create mode 100644 contrib/ivshmem-server/send_scm.h
>>   create mode 100644 docs/specs/ivshmem_device_spec.txt
>>   create mode 100644 hw/ivshmem.c
>>
>>
>>      
>
>    


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG.
  2010-06-14 15:51         ` [Qemu-devel] [PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG Anthony Liguori
@ 2010-06-14 16:08           ` Cam Macdonell
  2010-06-14 16:15             ` Anthony Liguori
  0 siblings, 1 reply; 42+ messages in thread
From: Cam Macdonell @ 2010-06-14 16:08 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, kvm

On Mon, Jun 14, 2010 at 9:51 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 06/04/2010 04:45 PM, Cam Macdonell wrote:
>>
>> This is useful for devices that do not want to take memory regions data
>> with them on migration.
>> ---
>>  arch_init.c  |   28 ++++++++++++++++------------
>>  cpu-all.h    |    2 ++
>>  cpu-common.h |    2 ++
>>  exec.c       |   12 ++++++++++++
>>  4 files changed, 32 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index cfc03ea..7a234fa 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -118,18 +118,21 @@ static int ram_save_block(QEMUFile *f)
>>                                              current_addr + TARGET_PAGE_SIZE,
>>                                              MIGRATION_DIRTY_FLAG);
>>
>> -            p = qemu_get_ram_ptr(current_addr);
>> -
>> -            if (is_dup_page(p, *p)) {
>> -                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
>> -                qemu_put_byte(f, *p);
>> -            } else {
>> -                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
>> -                qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
>> -            }
>> +            if (!cpu_physical_memory_get_dirty(current_addr,
>> +                                                    NO_MIGRATION_FLAG)) {
>> +                p = qemu_get_ram_ptr(current_addr);
>> +
>> +                if (is_dup_page(p, *p)) {
>> +                    qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
>> +                    qemu_put_byte(f, *p);
>> +                } else {
>> +                    qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
>> +                    qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
>> +                }
>>
>> -            found = 1;
>> -            break;
>> +                found = 1;
>> +                break;
>> +            }
>>          }
>>
>
> Shouldn't we just disable live migration outright?

I'm confused, as you seemed insistent on migration before.  Do you
want to support static migration (suspend/resume), but not live
migration?  What information do the master/peer roles represent then?

>
> I would rather that the device mark migration as impossible, having the user
> hot-remove the device before migration and then add it again after
> migration.  Device assignment could also use this functionality.

Would marking migration impossible be a new mechanism, or are there
other devices that already mark migration impossible?  Or would it be
something added to QMP, e.g. "Sorry, you can't migrate with device 'x'
attached"?

Cam

>>          addr += TARGET_PAGE_SIZE;
>>          current_addr = (saved_addr + addr) % last_ram_offset;
>> @@ -146,7 +149,8 @@ static ram_addr_t ram_save_remaining(void)
>>      ram_addr_t count = 0;
>>
>>      for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
>> -        if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
>> +        if (!cpu_physical_memory_get_dirty(addr, NO_MIGRATION_FLAG) &&
>> +                cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
>>              count++;
>>          }
>>      }
>> diff --git a/cpu-all.h b/cpu-all.h
>> index 9080cc7..4df00ab 100644
>> --- a/cpu-all.h
>> +++ b/cpu-all.h
>> @@ -887,6 +887,8 @@ extern int mem_prealloc;
>>  #define CODE_DIRTY_FLAG      0x02
>>  #define MIGRATION_DIRTY_FLAG 0x08
>>
>> +#define NO_MIGRATION_FLAG 0x10
>> +
>>  #define DIRTY_ALL_FLAG  (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | MIGRATION_DIRTY_FLAG)
>>
>>  /* read dirty bit (return 0 or 1) */
>> diff --git a/cpu-common.h b/cpu-common.h
>> index 4b0ba60..a1ebbbe 100644
>> --- a/cpu-common.h
>> +++ b/cpu-common.h
>> @@ -39,6 +39,8 @@ static inline void cpu_register_physical_memory(target_phys_addr_t start_addr,
>>      cpu_register_physical_memory_offset(start_addr, size, phys_offset, 0);
>>  }
>>
>> +void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t size);
>> +
>>  ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
>>  ram_addr_t qemu_ram_map(ram_addr_t size, void *host);
>>  ram_addr_t qemu_ram_alloc(ram_addr_t);
>> diff --git a/exec.c b/exec.c
>> index 39c18a7..c11d22f 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -2786,6 +2786,18 @@ static void *file_ram_alloc(ram_addr_t memory, const char *path)
>>  }
>>  #endif
>>
>> +void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t length)
>> +{
>> +    int i, len;
>> +    uint8_t *p;
>> +
>> +    len = length >> TARGET_PAGE_BITS;
>> +    p = phys_ram_flags + (start >> TARGET_PAGE_BITS);
>> +    for (i = 0; i < len; i++) {
>> +        p[i] |= NO_MIGRATION_FLAG;
>> +    }
>> +}
>> +
>>  ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
>>  {
>>      RAMBlock *new_block;
>>
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG.
  2010-06-14 16:08           ` Cam Macdonell
@ 2010-06-14 16:15             ` Anthony Liguori
  2010-06-15 16:16                 ` [Qemu-devel] " Cam Macdonell
  0 siblings, 1 reply; 42+ messages in thread
From: Anthony Liguori @ 2010-06-14 16:15 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: qemu-devel, kvm

On 06/14/2010 11:08 AM, Cam Macdonell wrote:
> On Mon, Jun 14, 2010 at 9:51 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
>    
>> On 06/04/2010 04:45 PM, Cam Macdonell wrote:
>>      
>>> This is useful for devices that do not want to take memory regions data
>>> with them on migration.
>>> ---
>>>   arch_init.c  |   28 ++++++++++++++++------------
>>>   cpu-all.h    |    2 ++
>>>   cpu-common.h |    2 ++
>>>   exec.c       |   12 ++++++++++++
>>>   4 files changed, 32 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/arch_init.c b/arch_init.c
>>> index cfc03ea..7a234fa 100644
>>> --- a/arch_init.c
>>> +++ b/arch_init.c
>>> @@ -118,18 +118,21 @@ static int ram_save_block(QEMUFile *f)
>>>                                               current_addr + TARGET_PAGE_SIZE,
>>>                                               MIGRATION_DIRTY_FLAG);
>>>
>>> -            p = qemu_get_ram_ptr(current_addr);
>>> -
>>> -            if (is_dup_page(p, *p)) {
>>> -                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
>>> -                qemu_put_byte(f, *p);
>>> -            } else {
>>> -                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
>>> -                qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
>>> -            }
>>> +            if (!cpu_physical_memory_get_dirty(current_addr,
>>> +                                                    NO_MIGRATION_FLAG)) {
>>> +                p = qemu_get_ram_ptr(current_addr);
>>> +
>>> +                if (is_dup_page(p, *p)) {
>>> +                    qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
>>> +                    qemu_put_byte(f, *p);
>>> +                } else {
>>> +                    qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
>>> +                    qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
>>> +                }
>>>
>>> -            found = 1;
>>> -            break;
>>> +                found = 1;
>>> +                break;
>>> +            }
>>>           }
>>>
>>>        
>> Shouldn't we just disable live migration outright?
>>      
> I'm confused, as you seemed insistent on migration before.  Do you
> want to support static migration (suspend/resume), but not live
> migration?  What information do the master/peer roles represent then?
>    

When role=master, you should not disable live migration and you should 
still migrate the contents of the data.  Otherwise, when you migrate, 
you lose the contents of shared memory and since the role is master, 
it's the one responsible for the data.
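
A minimal sketch of that rule, assuming hypothetical names (shm_role,
shm_contents_must_migrate) that do not appear in the patches; it only
captures which role has to carry the shared memory contents across a
migration:

```c
#include <stdbool.h>

/* Hypothetical names for illustration; they are not taken from the patches. */
enum shm_role {
    SHM_ROLE_MASTER, /* owns the canonical copy of the shared memory */
    SHM_ROLE_PEER    /* attaches to whatever exists on the destination */
};

/* Returns true when the shared memory contents must travel with the guest. */
bool shm_contents_must_migrate(enum shm_role role)
{
    return role == SHM_ROLE_MASTER;
}
```

With role=peer the function returns false, which matches the series
description of a peer attaching to the memory already present on the
destination machine.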

>> I would rather that the device mark migration as impossible, having the user
>> hot-remove the device before migration and then add it again after
>> migration.  Device assignment could also use this functionality.
>>      
> Would marking migration impossible be a new mechanism, or are there
> other devices that already mark migration impossible?  Or would it be
> something added to QMP: "Sorry, you can't migrate with device 'x' attached"?
>    

We don't have such a mechanism today.

Regards,

Anthony Liguori

> Cam
>
>    
>>>           addr += TARGET_PAGE_SIZE;
>>>           current_addr = (saved_addr + addr) % last_ram_offset;
>>> @@ -146,7 +149,8 @@ static ram_addr_t ram_save_remaining(void)
>>>       ram_addr_t count = 0;
>>>
>>>       for (addr = 0; addr<    last_ram_offset; addr += TARGET_PAGE_SIZE) {
>>> -        if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
>>> +        if (!cpu_physical_memory_get_dirty(addr, NO_MIGRATION_FLAG)&&
>>> +                cpu_physical_memory_get_dirty(addr,
>>> MIGRATION_DIRTY_FLAG)) {
>>>               count++;
>>>           }
>>>       }
>>> diff --git a/cpu-all.h b/cpu-all.h
>>> index 9080cc7..4df00ab 100644
>>> --- a/cpu-all.h
>>> +++ b/cpu-all.h
>>> @@ -887,6 +887,8 @@ extern int mem_prealloc;
>>>   #define CODE_DIRTY_FLAG      0x02
>>>   #define MIGRATION_DIRTY_FLAG 0x08
>>>
>>> +#define NO_MIGRATION_FLAG 0x10
>>> +
>>>   #define DIRTY_ALL_FLAG  (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG |
>>> MIGRATION_DIRTY_FLAG)
>>>
>>>   /* read dirty bit (return 0 or 1) */
>>> diff --git a/cpu-common.h b/cpu-common.h
>>> index 4b0ba60..a1ebbbe 100644
>>> --- a/cpu-common.h
>>> +++ b/cpu-common.h
>>> @@ -39,6 +39,8 @@ static inline void
>>> cpu_register_physical_memory(target_phys_addr_t start_addr,
>>>       cpu_register_physical_memory_offset(start_addr, size, phys_offset,
>>> 0);
>>>   }
>>>
>>> +void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t size);
>>> +
>>>   ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
>>>   ram_addr_t qemu_ram_map(ram_addr_t size, void *host);
>>>   ram_addr_t qemu_ram_alloc(ram_addr_t);
>>> diff --git a/exec.c b/exec.c
>>> index 39c18a7..c11d22f 100644
>>> --- a/exec.c
>>> +++ b/exec.c
>>> @@ -2786,6 +2786,18 @@ static void *file_ram_alloc(ram_addr_t memory,
>>> const char *path)
>>>   }
>>>   #endif
>>>
>>> +void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t length)
>>> +{
>>> +    int i, len;
>>> +    uint8_t *p;
>>> +
>>> +    len = length>>    TARGET_PAGE_BITS;
>>> +    p = phys_ram_flags + (start>>    TARGET_PAGE_BITS);
>>> +    for (i = 0; i<    len; i++) {
>>> +        p[i] |= NO_MIGRATION_FLAG;
>>> +    }
>>> +}
>>> +
>>>   ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
>>>   {
>>>       RAMBlock *new_block;
>>>
>>>        
>>
>>      


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory
  2010-06-14 15:53             ` [Qemu-devel] [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory Anthony Liguori
@ 2010-06-14 22:03               ` Cam Macdonell
  2010-06-23 13:12               ` Avi Kivity
  1 sibling, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-14 22:03 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, kvm

On Mon, Jun 14, 2010 at 9:53 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 06/04/2010 04:45 PM, Cam Macdonell wrote:
>>
>> this code is a standalone server which will pass file descriptors for the
>> shared
>> memory region and eventfds to support interrupts between guests using
>> inter-VM
>> shared memory.
>> ---
>>  contrib/ivshmem-server/Makefile         |   16 ++
>>  contrib/ivshmem-server/README           |   30 +++
>>  contrib/ivshmem-server/ivshmem_server.c |  353
>> +++++++++++++++++++++++++++++++
>>  contrib/ivshmem-server/send_scm.c       |  208 ++++++++++++++++++
>>  contrib/ivshmem-server/send_scm.h       |   19 ++
>>  5 files changed, 626 insertions(+), 0 deletions(-)
>>  create mode 100644 contrib/ivshmem-server/Makefile
>>  create mode 100644 contrib/ivshmem-server/README
>>  create mode 100644 contrib/ivshmem-server/ivshmem_server.c
>>  create mode 100644 contrib/ivshmem-server/send_scm.c
>>  create mode 100644 contrib/ivshmem-server/send_scm.h
>>
>> diff --git a/contrib/ivshmem-server/Makefile
>> b/contrib/ivshmem-server/Makefile
>> new file mode 100644
>> index 0000000..da40ffa
>> --- /dev/null
>> +++ b/contrib/ivshmem-server/Makefile
>> @@ -0,0 +1,16 @@
>> +CC = gcc
>> +CFLAGS = -O3 -Wall -Werror
>> +LIBS = -lrt
>> +
>> +# a very simple makefile to build the inter-VM shared memory server
>> +
>> +all: ivshmem_server
>> +
>> +.c.o:
>> +       $(CC) $(CFLAGS) -c $^ -o $@
>> +
>> +ivshmem_server: ivshmem_server.o send_scm.o
>> +       $(CC) $(CFLAGS) -o $@ $^ $(LIBS)
>> +
>> +clean:
>> +       rm -f *.o ivshmem_server
>> diff --git a/contrib/ivshmem-server/README b/contrib/ivshmem-server/README
>> new file mode 100644
>> index 0000000..b1fc2a2
>> --- /dev/null
>> +++ b/contrib/ivshmem-server/README
>> @@ -0,0 +1,30 @@
>> +Using the ivshmem shared memory server
>> +--------------------------------------
>> +
>> +This server is only supported on Linux.
>> +
>> +To use the shared memory server, first compile it.  Running 'make' should
>> +accomplish this.  An executable named 'ivshmem_server' will be built.
>> +
>> +to display the options run:
>> +
>> +./ivshmem_server -h
>> +
>> +Options
>> +-------
>> +
>> +    -h  print help message
>> +
>> +    -p<path on host>
>> +        unix socket to listen on.  The qemu-kvm chardev needs to connect
>> on
>> +        this socket. (default: '/tmp/ivshmem_socket')
>> +
>> +    -s<string>
>> +        POSIX shared object to create that is the shared memory (default:
>> 'ivshmem')
>> +
>> +    -m<#>
>> +        size of the POSIX object in MBs (default: 1)
>> +
>> +    -n<#>
>> +        number of eventfds for each guest.  This number must match the
>> +        'vectors' argument passed to the ivshmem device. (default: 1)
>> diff --git a/contrib/ivshmem-server/ivshmem_server.c
>> b/contrib/ivshmem-server/ivshmem_server.c
>> new file mode 100644
>> index 0000000..e0a7b98
>> --- /dev/null
>> +++ b/contrib/ivshmem-server/ivshmem_server.c
>>
>
> There's no licensing here.  I don't think this belongs in the qemu tree
> either to be honest.  If it were to be included, it ought to use all of the
> existing qemu infrastructure like the other qemu-* tools.

For the time being, I'm willing to leave it out and host it externally.

>
> Regards,
>
> Anthony Liguori
>
>> @@ -0,0 +1,353 @@
>> +/*
>> + * A stand-alone shared memory server for inter-VM shared memory for KVM
>> +*/
>> +
>> +#include<errno.h>
>> +#include<string.h>
>> +#include<sys/types.h>
>> +#include<sys/socket.h>
>> +#include<sys/un.h>
>> +#include<unistd.h>
>> +#include<sys/types.h>
>> +#include<sys/stat.h>
>> +#include<fcntl.h>
>> +#include<sys/eventfd.h>
>> +#include<sys/mman.h>
>> +#include<sys/select.h>
>> +#include<stdio.h>
>> +#include<stdlib.h>
>> +#include "send_scm.h"
>> +
>> +#define DEFAULT_SOCK_PATH "/tmp/ivshmem_socket"
>> +#define DEFAULT_SHM_OBJ "ivshmem"
>> +
>> +#define DEBUG 1
>> +
>> +typedef struct server_state {
>> +    vmguest_t *live_vms;
>> +    int nr_allocated_vms;
>> +    int shm_size;
>> +    long live_count;
>> +    long total_count;
>> +    int shm_fd;
>> +    char * path;
>> +    char * shmobj;
>> +    int maxfd, conn_socket;
>> +    long msi_vectors;
>> +} server_state_t;
>> +
>> +void usage(char const *prg);
>> +int find_set(fd_set * readset, int max);
>> +void print_vec(server_state_t * s, const char * c);
>> +
>> +void add_new_guest(server_state_t * s);
>> +void parse_args(int argc, char **argv, server_state_t * s);
>> +int create_listening_socket(char * path);
>> +
>> +int main(int argc, char ** argv)
>> +{
>> +    fd_set readset;
>> +    server_state_t * s;
>> +
>> +    s = (server_state_t *)calloc(1, sizeof(server_state_t));
>> +
>> +    s->live_count = 0;
>> +    s->total_count = 0;
>> +    parse_args(argc, argv, s);
>> +
>> +    /* open shared memory file  */
>> +    if ((s->shm_fd = shm_open(s->shmobj, O_CREAT|O_RDWR, S_IRWXU))<  0)
>> +    {
>> +        fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
>> +        exit(-1);
>> +    }
>> +
>> +    ftruncate(s->shm_fd, s->shm_size);
>> +
>> +    s->conn_socket = create_listening_socket(s->path);
>> +
>> +    s->maxfd = s->conn_socket;
>> +
>> +    for(;;) {
>> +        int ret, handle, i;
>> +        char buf[1024];
>> +
>> +        print_vec(s, "vm_sockets");
>> +
>> +        FD_ZERO(&readset);
>> +        /* conn socket is in Live_vms at posn 0 */
>> +        FD_SET(s->conn_socket,&readset);
>> +        for (i = 0; i<  s->total_count; i++) {
>> +            if (s->live_vms[i].alive != 0) {
>> +                FD_SET(s->live_vms[i].sockfd,&readset);
>> +            }
>> +        }
>> +
>> +        printf("\nWaiting (maxfd = %d)\n", s->maxfd);
>> +
>> +        ret = select(s->maxfd + 1,&readset, NULL, NULL, NULL);
>> +
>> +        if (ret == -1) {
>> +            perror("select()");
>> +        }
>> +
>> +        handle = find_set(&readset, s->maxfd + 1);
>> +        if (handle == -1) continue;
>> +
>> +        if (handle == s->conn_socket) {
>> +
>> +            printf("[NC] new connection\n");
>> +            FD_CLR(s->conn_socket,&readset);
>> +
>> +            /* The Total_count is equal to the new guest's VM ID */
>> +            add_new_guest(s);
>> +
>> +            /* update the maximum file descriptor number */
>> +            s->maxfd = s->live_vms[s->total_count - 1].sockfd>  s->maxfd
>> ?
>> +                            s->live_vms[s->total_count - 1].sockfd :
>> s->maxfd;
>> +
>> +            s->live_count++;
>> +            printf("Live_count is %ld\n", s->live_count);
>> +
>> +        } else {
>> +            /* then we have received a disconnection */
>> +            int recv_ret;
>> +            long i, j;
>> +            long deadposn = -1;
>> +
>> +            recv_ret = recv(handle, buf, 1, 0);
>> +
>> +            printf("[DC] recv returned %d\n", recv_ret);
>> +
>> +            /* find the dead VM in our list and move it to the dead list.
>> */
>> +            for (i = 0; i<  s->total_count; i++) {
>> +                if (s->live_vms[i].sockfd == handle) {
>> +                    deadposn = i;
>> +                    s->live_vms[i].alive = 0;
>> +                    close(s->live_vms[i].sockfd);
>> +
>> +                    for (j = 0; j<  s->msi_vectors; j++) {
>> +                        close(s->live_vms[i].efd[j]);
>> +                    }
>> +
>> +                    free(s->live_vms[i].efd);
>> +                    s->live_vms[i].sockfd = -1;
>> +                    break;
>> +                }
>> +            }
>> +
>> +            for (j = 0; j<  s->total_count; j++) {
>> +                /* update remaining clients that one client has left/died
>> */
>> +                if (s->live_vms[j].alive) {
>> +                    printf("[UD] sending kill of fd[%ld] to %ld\n",
>> +                                                                deadposn,
>> j);
>> +                    sendKill(s->live_vms[j].sockfd, deadposn,
>> sizeof(deadposn));
>> +                }
>> +            }
>> +
>> +            s->live_count--;
>> +
>> +            /* close the socket for the departed VM */
>> +            close(handle);
>> +        }
>> +
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +void add_new_guest(server_state_t * s) {
>> +
>> +    struct sockaddr_un remote;
>> +    socklen_t t = sizeof(remote);
>> +    long i, j;
>> +    int vm_sock;
>> +    long new_posn;
>> +    long neg1 = -1;
>> +
>> +    vm_sock = accept(s->conn_socket, (struct sockaddr *)&remote,&t);
>> +
>> +    if ( vm_sock == -1 ) {
>> +        perror("accept");
>> +        exit(1);
>> +    }
>> +
>> +    new_posn = s->total_count;
>> +
>> +    if (new_posn == s->nr_allocated_vms) {
>> +        printf("increasing vm slots\n");
>> +        s->nr_allocated_vms = s->nr_allocated_vms * 2;
>> +        if (s->nr_allocated_vms<  16)
>> +            s->nr_allocated_vms = 16;
>> +        s->live_vms = realloc(s->live_vms,
>> +                    s->nr_allocated_vms * sizeof(vmguest_t));
>> +
>> +        if (s->live_vms == NULL) {
>> +            fprintf(stderr, "realloc failed - quitting\n");
>> +            exit(-1);
>> +        }
>> +    }
>> +
>> +    s->live_vms[new_posn].posn = new_posn;
>> +    printf("[NC] Live_vms[%ld]\n", new_posn);
>> +    s->live_vms[new_posn].efd = (int *) malloc(s->msi_vectors * sizeof(int));
>> +    for (i = 0; i<  s->msi_vectors; i++) {
>> +        s->live_vms[new_posn].efd[i] = eventfd(0, 0);
>> +        printf("\tefd[%ld] = %d\n", i, s->live_vms[new_posn].efd[i]);
>> +    }
>> +    s->live_vms[new_posn].sockfd = vm_sock;
>> +    s->live_vms[new_posn].alive = 1;
>> +
>> +
>> +    sendPosition(vm_sock, new_posn);
>> +    sendUpdate(vm_sock, neg1, sizeof(long), s->shm_fd);
>> +    printf("[NC] trying to send fds to new connection\n");
>> +    sendRights(vm_sock, new_posn, sizeof(new_posn), s->live_vms,
>> s->msi_vectors);
>> +
>> +    printf("[NC] Connected (count = %ld).\n", new_posn);
>> +    for (i = 0; i<  new_posn; i++) {
>> +        if (s->live_vms[i].alive) {
>> +            // ping all clients that a new client has joined
>> +            printf("[UD] sending fd[%ld] to %ld\n", new_posn, i);
>> +            for (j = 0; j<  s->msi_vectors; j++) {
>> +                printf("\tefd[%ld] = [%d]", j,
>> s->live_vms[new_posn].efd[j]);
>> +                sendUpdate(s->live_vms[i].sockfd, new_posn,
>> +                        sizeof(new_posn), s->live_vms[new_posn].efd[j]);
>> +            }
>> +            printf("\n");
>> +        }
>> +    }
>> +
>> +    s->total_count++;
>> +}
>> +
>> +int create_listening_socket(char * path) {
>> +
>> +    struct sockaddr_un local;
>> +    int len, conn_socket;
>> +
>> +    if ((conn_socket = socket(AF_UNIX, SOCK_STREAM, 0)) == -1) {
>> +        perror("socket");
>> +        exit(1);
>> +    }
>> +
>> +    local.sun_family = AF_UNIX;
>> +    strcpy(local.sun_path, path);
>> +    unlink(local.sun_path);
>> +    len = strlen(local.sun_path) + sizeof(local.sun_family);
>> +    if (bind(conn_socket, (struct sockaddr *)&local, len) == -1) {
>> +        perror("bind");
>> +        exit(1);
>> +    }
>> +
>> +    if (listen(conn_socket, 5) == -1) {
>> +        perror("listen");
>> +        exit(1);
>> +    }
>> +
>> +    return conn_socket;
>> +
>> +}
>> +
>> +void parse_args(int argc, char **argv, server_state_t * s) {
>> +
>> +    int c;
>> +
>> +    s->shm_size = 1024 * 1024; // default shm_size
>> +    s->path = NULL;
>> +    s->shmobj = NULL;
>> +    s->msi_vectors = 1;
>> +
>> +       while ((c = getopt(argc, argv, "hp:s:m:n:")) != -1) {
>> +
>> +        switch (c) {
>> +            // path to listening socket
>> +            case 'p':
>> +                s->path = optarg;
>> +                break;
>> +            // name of shared memory object
>> +            case 's':
>> +                s->shmobj = optarg;
>> +                break;
>> +            // size of shared memory object
>> +            case 'm': {
>> +                    uint64_t value;
>> +                    char *ptr;
>> +
>> +                    value = strtoul(optarg,&ptr, 10);
>> +                    switch (*ptr) {
>> +                    case 0: case 'M': case 'm':
>> +                        value<<= 20;
>> +                        break;
>> +                    case 'G': case 'g':
>> +                        value<<= 30;
>> +                        break;
>> +                    default:
>> +                        fprintf(stderr, "qemu: invalid ram size: %s\n",
>> optarg);
>> +                        exit(1);
>> +                    }
>> +                    s->shm_size = value;
>> +                    break;
>> +                }
>> +            case 'n':
>> +                s->msi_vectors = atol(optarg);
>> +                break;
>> +            case 'h':
>> +            default:
>> +                   usage(argv[0]);
>> +                       exit(1);
>> +               }
>> +       }
>> +
>> +    if (s->path == NULL) {
>> +        s->path = strdup(DEFAULT_SOCK_PATH);
>> +    }
>> +
>> +    printf("listening socket: %s\n", s->path);
>> +
>> +    if (s->shmobj == NULL) {
>> +        s->shmobj = strdup(DEFAULT_SHM_OBJ);
>> +    }
>> +
>> +    printf("shared object: %s\n", s->shmobj);
>> +    printf("shared object size: %d (bytes)\n", s->shm_size);
>> +
>> +}
>> +
>> +void print_vec(server_state_t * s, const char * c) {
>> +
>> +    int i, j;
>> +
>> +#if DEBUG
>> +    printf("%s (%ld) = ", c, s->total_count);
>> +    for (i = 0; i<  s->total_count; i++) {
>> +        if (s->live_vms[i].alive) {
>> +            for (j = 0; j<  s->msi_vectors; j++) {
>> +                printf("[%d|%d] ", s->live_vms[i].sockfd,
>> s->live_vms[i].efd[j]);
>> +            }
>> +        }
>> +    }
>> +    printf("\n");
>> +#endif
>> +
>> +}
>> +
>> +int find_set(fd_set * readset, int max) {
>> +
>> +    int i;
>> +
>> +    for (i = 1; i<  max; i++) {
>> +        if (FD_ISSET(i, readset)) {
>> +            return i;
>> +        }
>> +    }
>> +
>> +    printf("nothing set\n");
>> +    return -1;
>> +
>> +}
>> +
>> +void usage(char const *prg) {
>> +       fprintf(stderr, "use: %s [-h]  [-p<unix socket>] [-s<shm obj>] "
>> +            "[-m<size in MB>] [-n<# of MSI vectors>]\n", prg);
>> +}
>> diff --git a/contrib/ivshmem-server/send_scm.c
>> b/contrib/ivshmem-server/send_scm.c
>> new file mode 100644
>> index 0000000..b1bb4a3
>> --- /dev/null
>> +++ b/contrib/ivshmem-server/send_scm.c
>> @@ -0,0 +1,208 @@
>> +#include<stdint.h>
>> +#include<stdlib.h>
>> +#include<errno.h>
>> +#include<stdio.h>
>> +#include<unistd.h>
>> +#include<sys/socket.h>
>> +#include<sys/syscall.h>
>> +#include<sys/un.h>
>> +#include<sys/types.h>
>> +#include<sys/stat.h>
>> +#include<fcntl.h>
>> +#include<poll.h>
>> +#include "send_scm.h"
>> +
>> +#ifndef POLLRDHUP
>> +#define POLLRDHUP 0x2000
>> +#endif
>> +
>> +int readUpdate(int fd, long * posn, int * newfd)
>> +{
>> +    struct msghdr msg;
>> +    struct iovec iov[1];
>> +    struct cmsghdr *cmptr;
>> +    size_t len;
>> +    size_t msg_size = sizeof(int);
>> +    char control[CMSG_SPACE(msg_size)];
>> +
>> +    msg.msg_name = 0;
>> +    msg.msg_namelen = 0;
>> +    msg.msg_control = control;
>> +    msg.msg_controllen = sizeof(control);
>> +    msg.msg_flags = 0;
>> +    msg.msg_iov = iov;
>> +    msg.msg_iovlen = 1;
>> +
>> +    iov[0].iov_base = posn;
>> +    iov[0].iov_len = sizeof(*posn);
>> +
>> +    do {
>> +        len = recvmsg(fd,&msg, 0);
>> +    } while (len == (size_t) (-1)&&  (errno == EINTR || errno ==
>> EAGAIN));
>> +
>> +    printf("iov[0].buf is %ld\n", *((long *)iov[0].iov_base));
>> +    printf("len is %ld\n", len);
>> +    // TODO: Logging
>> +    if (len == (size_t) (-1)) {
>> +        perror("recvmsg()");
>> +        return -1;
>> +    }
>> +
>> +    if (msg.msg_controllen<  sizeof(struct cmsghdr))
>> +        return *posn;
>> +
>> +    for (cmptr = CMSG_FIRSTHDR(&msg); cmptr != NULL;
>> +        cmptr = CMSG_NXTHDR(&msg, cmptr)) {
>> +        if (cmptr->cmsg_level != SOL_SOCKET ||
>> +            cmptr->cmsg_type != SCM_RIGHTS){
>> +                printf("continuing %ld\n", sizeof(size_t));
>> +                printf("read msg_size = %ld\n", msg_size);
>> +                if (cmptr->cmsg_len != sizeof(control))
>> +                    printf("not equal (%ld !=
>> %ld)\n",cmptr->cmsg_len,sizeof(control));
>> +                continue;
>> +        }
>> +
>> +        memcpy(newfd, CMSG_DATA(cmptr), sizeof(int));
>> +        printf("posn is %ld (fd = %d)\n", *posn, *newfd);
>> +        return 0;
>> +    }
>> +
>> +    fprintf(stderr, "bad data in packet\n");
>> +    return -1;
>> +}
>> +
>> +int readRights(int fd, long count, size_t count_len, int **fds, int
>> msi_vectors)
>> +{
>> +    int j, newfd;
>> +
>> +    for (; ;){
>> +        long posn = 0;
>> +
>> +        readUpdate(fd,&posn,&newfd);
>> +        printf("reading posn %ld ", posn);
>> +        fds[posn] = (int *)malloc (msi_vectors * sizeof(int));
>> +        fds[posn][0] = newfd;
>> +        for (j = 1; j<  msi_vectors; j++) {
>> +            readUpdate(fd,&posn,&newfd);
>> +            fds[posn][j] = newfd;
>> +            printf("%d.", fds[posn][j]);
>> +        }
>> +        printf("\n");
>> +
>> +        /* stop reading once i've read my own eventfds */
>> +        if (posn == count)
>> +            break;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int sendKill(int fd, long const posn, size_t posn_len) {
>> +
>> +    struct cmsghdr *cmsg;
>> +    size_t msg_size = sizeof(int);
>> +    char control[CMSG_SPACE(msg_size)];
>> +    struct iovec iov[1];
>> +    size_t len;
>> +    struct msghdr msg = { 0, 0, iov, 1, control, sizeof control, 0 };
>> +
>> +    struct pollfd mypollfd;
>> +    int rv;
>> +
>> +    iov[0].iov_base = (void *)&posn;
>> +    iov[0].iov_len = posn_len;
>> +
>> +    // from cmsg(3)
>> +    cmsg = CMSG_FIRSTHDR(&msg);
>> +    cmsg->cmsg_level = SOL_SOCKET;
>> +    cmsg->cmsg_len = 0;
>> +    msg.msg_controllen = cmsg->cmsg_len;
>> +
>> +    printf("Killing posn %ld\n", posn);
>> +
>> +    // check if the fd is dead or not
>> +    mypollfd.fd = fd;
>> +    mypollfd.events = POLLRDHUP;
>> +    mypollfd.revents = 0;
>> +
>> +    rv = poll(&mypollfd, 1, 0);
>> +
>> +    printf("rv is %d\n", rv);
>> +
>> +    if (rv == 0) {
>> +        len = sendmsg(fd,&msg, 0);
>> +        if (len == (size_t) (-1)) {
>> +            perror("sendmsg()");
>> +            return -1;
>> +        }
>> +        return (len == posn_len);
>> +    } else {
>> +        printf("already dead\n");
>> +        return 0;
>> +    }
>> +}
>> +
>> +int sendUpdate(int fd, long posn, size_t posn_len, int sendfd)
>> +{
>> +
>> +    struct cmsghdr *cmsg;
>> +    size_t msg_size = sizeof(int);
>> +    char control[CMSG_SPACE(msg_size)];
>> +    struct iovec iov[1];
>> +    size_t len;
>> +    struct msghdr msg = { 0, 0, iov, 1, control, sizeof control, 0 };
>> +
>> +    iov[0].iov_base = (void *) (&posn);
>> +    iov[0].iov_len = posn_len;
>> +
>> +    // from cmsg(3)
>> +    cmsg = CMSG_FIRSTHDR(&msg);
>> +    cmsg->cmsg_level = SOL_SOCKET;
>> +    cmsg->cmsg_type = SCM_RIGHTS;
>> +    cmsg->cmsg_len = CMSG_LEN(msg_size);
>> +    msg.msg_controllen = cmsg->cmsg_len;
>> +
>> +    memcpy((CMSG_DATA(cmsg)),&sendfd, msg_size);
>> +
>> +    len = sendmsg(fd,&msg, 0);
>> +    if (len == (size_t) (-1)) {
>> +        perror("sendmsg()");
>> +        return -1;
>> +    }
>> +
>> +    return (len == posn_len);
>> +
>> +}
>> +
>> +int sendPosition(int fd, long const posn)
>> +{
>> +    int rv;
>> +
>> +    rv = send(fd,&posn, sizeof(long), 0);
>> +    if (rv != sizeof(long)) {
>> +        fprintf(stderr, "error sending posn\n");
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int sendRights(int fd, long const count, size_t count_len, vmguest_t *
>> Live_vms,
>> +                                                            long
>> msi_vectors)
>> +{
>> +    /* updates about new guests are sent one at a time */
>> +
>> +    long i, j;
>> +
>> +    for (i = 0; i<= count; i++) {
>> +        if (Live_vms[i].alive) {
>> +            for (j = 0; j<  msi_vectors; j++) {
>> +                sendUpdate(Live_vms[count].sockfd, i, sizeof(long),
>> +
>>  Live_vms[i].efd[j]);
>> +            }
>> +        }
>> +    }
>> +
>> +    return 0;
>> +
>> +}
>> diff --git a/contrib/ivshmem-server/send_scm.h
>> b/contrib/ivshmem-server/send_scm.h
>> new file mode 100644
>> index 0000000..48c9a8d
>> --- /dev/null
>> +++ b/contrib/ivshmem-server/send_scm.h
>> @@ -0,0 +1,19 @@
>> +#ifndef SEND_SCM
>> +#define SEND_SCM
>> +
>> +struct vm_guest_conn {
>> +    int posn;
>> +    int sockfd;
>> +    int * efd;
>> +    int alive;
>> +};
>> +
>> +typedef struct vm_guest_conn vmguest_t;
>> +
>> +int readRights(int fd, long count, size_t count_len, int **fds, int
>> msi_vectors);
>> +int sendRights(int fd, long const count, size_t count_len, vmguest_t
>> *Live_vms, long msi_vectors);
>> +int readUpdate(int fd, long * posn, int * newfd);
>> +int sendUpdate(int fd, long const posn, size_t posn_len, int sendfd);
>> +int sendPosition(int fd, long const posn);
>> +int sendKill(int fd, long const posn, size_t posn_len);
>> +#endif
>>
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH RFC] Mark a device as non-migratable
  2010-06-14 16:15             ` Anthony Liguori
@ 2010-06-15 16:16                 ` Cam Macdonell
  0 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-15 16:16 UTC (permalink / raw)
  To: anthony; +Cc: qemu-devel, kvm, Cam Macdonell

How does this look for marking the device as non-migratable?  It adds a field
'no_migrate' to the SaveStateEntry and tests for it in vmstate_save.  This would
replace anything that touches memory.
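
To make the intended behaviour concrete, here is a small self-contained
model of the check.  It is only a sketch: struct entry, entry_save() and
save_all() are hypothetical stand-ins for SaveStateEntry, vmstate_save()
and the loop in qemu_savevm_state_complete(), not the QEMU code itself.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-in for SaveStateEntry; only the fields needed here. */
struct entry {
    const char *idstr;
    int no_migrate;  /* set for devices that must be unplugged first */
    int saved;       /* set once the (pretend) state has been written */
};

/* Models the proposed vmstate_save(): refuse entries flagged no_migrate. */
int entry_save(struct entry *e)
{
    if (e->no_migrate) {
        return -1;
    }
    e->saved = 1;
    return 0;
}

/* Models the save loop: stop at the first entry that refuses to be saved. */
int save_all(struct entry *entries, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int r = entry_save(&entries[i]);
        if (r < 0) {
            fprintf(stderr, "cannot migrate with device '%s'\n",
                    entries[i].idstr);
            return r;
        }
    }
    return 0;
}
```

In this model, entries that come before the flagged one have already
been written by the time the save is refused, so a caller would need to
treat the whole save as failed.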

Cam

---
 hw/hw.h  |    1 +
 savevm.c |   32 +++++++++++++++++++++++++++++---
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index d78d814..7c93f08 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -263,6 +263,7 @@ int register_savevm_live(const char *idstr,
                          void *opaque);
 
 void unregister_savevm(const char *idstr, void *opaque);
+void mark_no_migrate(const char *idstr, void *opaque);
 
 typedef void QEMUResetHandler(void *opaque);
 
diff --git a/savevm.c b/savevm.c
index 017695b..2642a9c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1023,6 +1023,7 @@ typedef struct SaveStateEntry {
     LoadStateHandler *load_state;
     const VMStateDescription *vmsd;
     void *opaque;
+    int no_migrate;
 } SaveStateEntry;
 
 
@@ -1069,6 +1070,7 @@ int register_savevm_live(const char *idstr,
     se->load_state = load_state;
     se->opaque = opaque;
     se->vmsd = NULL;
+    se->no_migrate = 0;
 
     if (instance_id == -1) {
         se->instance_id = calculate_new_instance_id(idstr);
@@ -1103,6 +1105,19 @@ void unregister_savevm(const char *idstr, void *opaque)
     }
 }
 
+/* mark a device as not to be migrated, that is the device should be
+   unplugged before migration */
+void mark_no_migrate(const char *idstr, void *opaque)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+        if (strcmp(se->idstr, idstr) == 0 && se->opaque == opaque) {
+            se->no_migrate = 1;
+        }
+    }
+}
+
 int vmstate_register_with_alias_id(int instance_id,
                                    const VMStateDescription *vmsd,
                                    void *opaque, int alias_id,
@@ -1277,13 +1292,19 @@ static int vmstate_load(QEMUFile *f, SaveStateEntry *se, int version_id)
     return vmstate_load_state(f, se->vmsd, se->opaque, version_id);
 }
 
-static void vmstate_save(QEMUFile *f, SaveStateEntry *se)
+static int vmstate_save(QEMUFile *f, SaveStateEntry *se)
 {
+    if (se->no_migrate) {
+        return -1;
+    }
+
     if (!se->vmsd) {         /* Old style */
         se->save_state(f, se->opaque);
-        return;
+        return 0;
     }
     vmstate_save_state(f,se->vmsd, se->opaque);
+
+    return 0;
 }
 
 #define QEMU_VM_FILE_MAGIC           0x5145564d
@@ -1377,6 +1398,7 @@ int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f)
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
 {
     SaveStateEntry *se;
+    int r;
 
     cpu_synchronize_all_states();
 
@@ -1409,7 +1431,11 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
         qemu_put_be32(f, se->instance_id);
         qemu_put_be32(f, se->version_id);
 
-        vmstate_save(f, se);
+        r = vmstate_save(f, se);
+        if (r < 0) {
+            monitor_printf(mon, "cannot migrate with device '%s'\n", se->idstr);
+            return r;
+        }
     }
 
     qemu_put_byte(f, QEMU_VM_EOF);
-- 
1.6.3.2.198.g6096d


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH RFC] Mark a device as non-migratable
@ 2010-06-15 16:16                 ` Cam Macdonell
  0 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-15 16:16 UTC (permalink / raw)
  To: anthony; +Cc: Cam Macdonell, qemu-devel, kvm

How does this look for marking the device as non-migratable?  It adds a field
'no_migrate' to the SaveStateEntry and tests for it in vmstate_save.  This would
replace anything that touches memory.

Cam

---
 hw/hw.h  |    1 +
 savevm.c |   32 +++++++++++++++++++++++++++++---
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index d78d814..7c93f08 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -263,6 +263,7 @@ int register_savevm_live(const char *idstr,
                          void *opaque);
 
 void unregister_savevm(const char *idstr, void *opaque);
+void mark_no_migrate(const char *idstr, void *opaque);
 
 typedef void QEMUResetHandler(void *opaque);
 
diff --git a/savevm.c b/savevm.c
index 017695b..2642a9c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1023,6 +1023,7 @@ typedef struct SaveStateEntry {
     LoadStateHandler *load_state;
     const VMStateDescription *vmsd;
     void *opaque;
+    int no_migrate;
 } SaveStateEntry;
 
 
@@ -1069,6 +1070,7 @@ int register_savevm_live(const char *idstr,
     se->load_state = load_state;
     se->opaque = opaque;
     se->vmsd = NULL;
+    se->no_migrate = 0;
 
     if (instance_id == -1) {
         se->instance_id = calculate_new_instance_id(idstr);
@@ -1103,6 +1105,19 @@ void unregister_savevm(const char *idstr, void *opaque)
     }
 }
 
+/* mark a device as not to be migrated, that is the device should be
+   unplugged before migration */
+void mark_no_migrate(const char *idstr, void *opaque)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+        if (strcmp(se->idstr, idstr) == 0 && se->opaque == opaque) {
+            se->no_migrate = 1;
+        }
+    }
+}
+
 int vmstate_register_with_alias_id(int instance_id,
                                    const VMStateDescription *vmsd,
                                    void *opaque, int alias_id,
@@ -1277,13 +1292,19 @@ static int vmstate_load(QEMUFile *f, SaveStateEntry *se, int version_id)
     return vmstate_load_state(f, se->vmsd, se->opaque, version_id);
 }
 
-static void vmstate_save(QEMUFile *f, SaveStateEntry *se)
+static int vmstate_save(QEMUFile *f, SaveStateEntry *se)
 {
+    if (se->no_migrate) {
+        return -1;
+    }
+
     if (!se->vmsd) {         /* Old style */
         se->save_state(f, se->opaque);
-        return;
+        return 0;
     }
     vmstate_save_state(f,se->vmsd, se->opaque);
+
+    return 0;
 }
 
 #define QEMU_VM_FILE_MAGIC           0x5145564d
@@ -1377,6 +1398,7 @@ int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f)
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
 {
     SaveStateEntry *se;
+    int r;
 
     cpu_synchronize_all_states();
 
@@ -1409,7 +1431,11 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
         qemu_put_be32(f, se->instance_id);
         qemu_put_be32(f, se->version_id);
 
-        vmstate_save(f, se);
+        r = vmstate_save(f, se);
+        if (r < 0) {
+            monitor_printf(mon, "cannot migrate with device '%s'\n", se->idstr);
+            return r;
+        }
     }
 
     qemu_put_byte(f, QEMU_VM_EOF);
-- 
1.6.3.2.198.g6096d
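
The control flow this patch adds can be modelled outside QEMU. The sketch below is a toy, not QEMU code: every toy_* name is hypothetical, the handler list is a fixed array rather than a QTAILQ, and nothing is written to a QEMUFile. It shows only the shape of the check: a per-entry flag, a lookup by (idstr, opaque), and a walk that stops at the first flagged entry.

```c
/* Toy model of the no_migrate check. This is NOT QEMU code: every toy_*
 * name is hypothetical and only mirrors the shape of the patch. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ENTRIES 8

/* Stand-in for SaveStateEntry, reduced to the fields the check needs. */
struct toy_entry {
    const char *idstr;
    void *opaque;
    int no_migrate;
};

static struct toy_entry toy_handlers[TOY_MAX_ENTRIES];
static size_t toy_count;

/* Register a device. The flag starts cleared, as in register_savevm_live(). */
int toy_register(const char *idstr, void *opaque)
{
    assert(idstr != NULL);
    if (toy_count == TOY_MAX_ENTRIES) {
        return -1;
    }
    toy_handlers[toy_count].idstr = idstr;
    toy_handlers[toy_count].opaque = opaque;
    toy_handlers[toy_count].no_migrate = 0;
    toy_count++;
    return 0;
}

/* Flag every entry matching both idstr and opaque, like mark_no_migrate(). */
void toy_mark_no_migrate(const char *idstr, void *opaque)
{
    size_t i;

    for (i = 0; i < toy_count; i++) {
        if (strcmp(toy_handlers[i].idstr, idstr) == 0 &&
            toy_handlers[i].opaque == opaque) {
            toy_handlers[i].no_migrate = 1;
        }
    }
}

/* Walk the handlers in registration order. Returns NULL when every entry
 * may be saved, otherwise the idstr of the first entry that blocks. */
const char *toy_first_blocker(void)
{
    size_t i;

    for (i = 0; i < toy_count; i++) {
        if (toy_handlers[i].no_migrate) {
            return toy_handlers[i].idstr;
        }
    }
    return NULL;
}
```

In this model a caller would register "ram" and "ivshmem", mark the latter, and get "ivshmem" back from toy_first_blocker(), which corresponds to the "cannot migrate with device" message in the patch. Matching on opaque as well as idstr means a second instance with the same idstr is left unflagged.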

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH RFC] Mark a device as non-migratable
  2010-06-15 16:16                 ` [Qemu-devel] " Cam Macdonell
@ 2010-06-15 16:32                   ` Anthony Liguori
  -1 siblings, 0 replies; 42+ messages in thread
From: Anthony Liguori @ 2010-06-15 16:32 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: qemu-devel, kvm

On 06/15/2010 11:16 AM, Cam Macdonell wrote:
> How does this look for marking the device as non-migratable?  It adds a field
> 'no_migrate' to the SaveStateEntry and tests for it in vmstate_save.  This would
> replace anything that touches memory.
>
> Cam
>
> ---
>   hw/hw.h  |    1 +
>   savevm.c |   32 +++++++++++++++++++++++++++++---
>   2 files changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/hw/hw.h b/hw/hw.h
> index d78d814..7c93f08 100644
> --- a/hw/hw.h
> +++ b/hw/hw.h
> @@ -263,6 +263,7 @@ int register_savevm_live(const char *idstr,
>                            void *opaque);
>
>   void unregister_savevm(const char *idstr, void *opaque);
> +void mark_no_migrate(const char *idstr, void *opaque);
>    

I'm not thrilled with the name but the functionality is spot on.  I lack 
the creativity to offer a better name suggestion :-)

Regards,

Anthony Liguori

>   typedef void QEMUResetHandler(void *opaque);
>
> diff --git a/savevm.c b/savevm.c
> index 017695b..2642a9c 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -1023,6 +1023,7 @@ typedef struct SaveStateEntry {
>       LoadStateHandler *load_state;
>       const VMStateDescription *vmsd;
>       void *opaque;
> +    int no_migrate;
>   } SaveStateEntry;
>
>
> @@ -1069,6 +1070,7 @@ int register_savevm_live(const char *idstr,
>       se->load_state = load_state;
>       se->opaque = opaque;
>       se->vmsd = NULL;
> +    se->no_migrate = 0;
>
>       if (instance_id == -1) {
>           se->instance_id = calculate_new_instance_id(idstr);
> @@ -1103,6 +1105,19 @@ void unregister_savevm(const char *idstr, void *opaque)
>       }
>   }
>
> +/* mark a device as not to be migrated, that is the device should be
> +   unplugged before migration */
> +void mark_no_migrate(const char *idstr, void *opaque)
> +{
> +    SaveStateEntry *se;
> +
> +    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
> +        if (strcmp(se->idstr, idstr) == 0 && se->opaque == opaque) {
> +            se->no_migrate = 1;
> +        }
> +    }
> +}
> +
>   int vmstate_register_with_alias_id(int instance_id,
>                                      const VMStateDescription *vmsd,
>                                      void *opaque, int alias_id,
> @@ -1277,13 +1292,19 @@ static int vmstate_load(QEMUFile *f, SaveStateEntry *se, int version_id)
>       return vmstate_load_state(f, se->vmsd, se->opaque, version_id);
>   }
>
> -static void vmstate_save(QEMUFile *f, SaveStateEntry *se)
> +static int vmstate_save(QEMUFile *f, SaveStateEntry *se)
>   {
> +    if (se->no_migrate) {
> +        return -1;
> +    }
> +
>       if (!se->vmsd) {         /* Old style */
>           se->save_state(f, se->opaque);
> -        return;
> +        return 0;
>       }
>       vmstate_save_state(f,se->vmsd, se->opaque);
> +
> +    return 0;
>   }
>
>   #define QEMU_VM_FILE_MAGIC           0x5145564d
> @@ -1377,6 +1398,7 @@ int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f)
>   int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
>   {
>       SaveStateEntry *se;
> +    int r;
>
>       cpu_synchronize_all_states();
>
> @@ -1409,7 +1431,11 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
>           qemu_put_be32(f, se->instance_id);
>           qemu_put_be32(f, se->version_id);
>
> -        vmstate_save(f, se);
> +        r = vmstate_save(f, se);
> +        if (r < 0) {
> +            monitor_printf(mon, "cannot migrate with device '%s'\n", se->idstr);
> +            return r;
> +        }
>       }
>
>       qemu_put_byte(f, QEMU_VM_EOF);
>    


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] Re: [PATCH RFC] Mark a device as non-migratable
  2010-06-15 16:32                   ` [Qemu-devel] " Anthony Liguori
  (?)
@ 2010-06-15 17:45                   ` Markus Armbruster
  -1 siblings, 0 replies; 42+ messages in thread
From: Markus Armbruster @ 2010-06-15 17:45 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Cam Macdonell, qemu-devel, kvm

Anthony Liguori <anthony@codemonkey.ws> writes:

> On 06/15/2010 11:16 AM, Cam Macdonell wrote:
>> How does this look for marking the device as non-migratable?  It adds a field
>> 'no_migrate' to the SaveStateEntry and tests for it in vmstate_save.  This would
>> replace anything that touches memory.
>>
>> Cam
>>
>> ---
>>   hw/hw.h  |    1 +
>>   savevm.c |   32 +++++++++++++++++++++++++++++---
>>   2 files changed, 30 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/hw.h b/hw/hw.h
>> index d78d814..7c93f08 100644
>> --- a/hw/hw.h
>> +++ b/hw/hw.h
>> @@ -263,6 +263,7 @@ int register_savevm_live(const char *idstr,
>>                            void *opaque);
>>
>>   void unregister_savevm(const char *idstr, void *opaque);
>> +void mark_no_migrate(const char *idstr, void *opaque);
>>    
>
> I'm not thrilled with the name but the functionality is spot on.  I
> lack the creativity to offer a better name suggestion :-)

Tongue firmly in cheek: mark_sedentary()?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RFC] Mark a device as non-migratable
  2010-06-15 16:32                   ` [Qemu-devel] " Anthony Liguori
@ 2010-06-15 22:26                     ` Cam Macdonell
  -1 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-15 22:26 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, kvm

On Tue, Jun 15, 2010 at 10:32 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 06/15/2010 11:16 AM, Cam Macdonell wrote:
>>
>> How does this look for marking the device as non-migratable?  It adds a
>> field
>> 'no_migrate' to the SaveStateEntry and tests for it in vmstate_save.  This
>> would
>> replace anything that touches memory.
>>
>> Cam
>>
>> ---
>>  hw/hw.h  |    1 +
>>  savevm.c |   32 +++++++++++++++++++++++++++++---
>>  2 files changed, 30 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/hw.h b/hw/hw.h
>> index d78d814..7c93f08 100644
>> --- a/hw/hw.h
>> +++ b/hw/hw.h
>> @@ -263,6 +263,7 @@ int register_savevm_live(const char *idstr,
>>                           void *opaque);
>>
>>  void unregister_savevm(const char *idstr, void *opaque);
>> +void mark_no_migrate(const char *idstr, void *opaque);
>>
>
> I'm not thrilled with the name but the functionality is spot on.  I lack the
> creativity to offer a better name suggestion :-)
>
> Regards,
>
> Anthony Liguori

Hmmm, in working on this it seems that the memory (from
qemu_ram_map()) is still attached even when the device is removed
(which causes migration to fail because there is an unexpected
memory).

Is something like cpu_unregister_physical_memory()/qemu_ram_free() needed?

Cam

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] Re: [PATCH RFC] Mark a device as non-migratable
  2010-06-15 22:26                     ` [Qemu-devel] " Cam Macdonell
  (?)
@ 2010-06-15 22:33                     ` Anthony Liguori
  2010-06-16  5:05                       ` Cam Macdonell
  -1 siblings, 1 reply; 42+ messages in thread
From: Anthony Liguori @ 2010-06-15 22:33 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: qemu-devel

On 06/15/2010 05:26 PM, Cam Macdonell wrote:
> On Tue, Jun 15, 2010 at 10:32 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
>    
>> On 06/15/2010 11:16 AM, Cam Macdonell wrote:
>>      
>>> How does this look for marking the device as non-migratable?  It adds a
>>> field
>>> 'no_migrate' to the SaveStateEntry and tests for it in vmstate_save.  This
>>> would
>>> replace anything that touches memory.
>>>
>>> Cam
>>>
>>> ---
>>>   hw/hw.h  |    1 +
>>>   savevm.c |   32 +++++++++++++++++++++++++++++---
>>>   2 files changed, 30 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/hw/hw.h b/hw/hw.h
>>> index d78d814..7c93f08 100644
>>> --- a/hw/hw.h
>>> +++ b/hw/hw.h
>>> @@ -263,6 +263,7 @@ int register_savevm_live(const char *idstr,
>>>                            void *opaque);
>>>
>>>   void unregister_savevm(const char *idstr, void *opaque);
>>> +void mark_no_migrate(const char *idstr, void *opaque);
>>>
>>>        
>> I'm not thrilled with the name but the functionality is spot on.  I lack the
>> creativity to offer a better name suggestion :-)
>>
>> Regards,
>>
>> Anthony Liguori
>>      
> Hmmm, in working on this it seems that the memory (from
> qemu_ram_map()) is still attached even when the device is removed
> (which causes migration to fail because there is an unexpected
> memory).
>
> Is something like cpu_unregister_physical_memory()/qemu_ram_free() needed?
>    

Yes.  You need to unregister any memory that you have registered upon 
device removal.

Regards,

Anthony Liguori

> Cam
>
>    

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] Re: [PATCH RFC] Mark a device as non-migratable
  2010-06-15 22:33                     ` Anthony Liguori
@ 2010-06-16  5:05                       ` Cam Macdonell
  2010-06-16 12:34                         ` Anthony Liguori
  0 siblings, 1 reply; 42+ messages in thread
From: Cam Macdonell @ 2010-06-16  5:05 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

On Tue, Jun 15, 2010 at 4:33 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 06/15/2010 05:26 PM, Cam Macdonell wrote:
>>
>> On Tue, Jun 15, 2010 at 10:32 AM, Anthony Liguori<anthony@codemonkey.ws>
>>  wrote:
>>
>>>
>>> On 06/15/2010 11:16 AM, Cam Macdonell wrote:
>>>
>>>>
>>>> How does this look for marking the device as non-migratable?  It adds a
>>>> field
>>>> 'no_migrate' to the SaveStateEntry and tests for it in vmstate_save.
>>>>  This
>>>> would
>>>> replace anything that touches memory.
>>>>
>>>> Cam
>>>>
>>>> ---
>>>>  hw/hw.h  |    1 +
>>>>  savevm.c |   32 +++++++++++++++++++++++++++++---
>>>>  2 files changed, 30 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/hw/hw.h b/hw/hw.h
>>>> index d78d814..7c93f08 100644
>>>> --- a/hw/hw.h
>>>> +++ b/hw/hw.h
>>>> @@ -263,6 +263,7 @@ int register_savevm_live(const char *idstr,
>>>>                           void *opaque);
>>>>
>>>>  void unregister_savevm(const char *idstr, void *opaque);
>>>> +void mark_no_migrate(const char *idstr, void *opaque);
>>>>
>>>>
>>>
>>> I'm not thrilled with the name but the functionality is spot on.  I lack
>>> the
>>> creativity to offer a better name suggestion :-)
>>>
>>> Regards,
>>>
>>> Anthony Liguori
>>>
>>
>> Hmmm, in working on this it seems that the memory (from
>> qemu_ram_map()) is still attached even when the device is removed
>> (which causes migration to fail because there is an unexpected
>> memory).
>>
>> Is something like cpu_unregister_physical_memory()/qemu_ram_free() needed?
>>
>
> Yes.  You need to unregister any memory that you have registered upon device
> removal.

Is there an established way to achieve this?  I can't seem to find
another device that unregisters memory registered with
cpu_register_physical_memory().  Is something like
cpu_unregister_physical_memory() needed?

Thanks,
Cam

>
> Regards,
>
> Anthony Liguori
>
>> Cam
>>
>>
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] Re: [PATCH RFC] Mark a device as non-migratable
  2010-06-16  5:05                       ` Cam Macdonell
@ 2010-06-16 12:34                         ` Anthony Liguori
  2010-06-17  4:18                           ` Cam Macdonell
  0 siblings, 1 reply; 42+ messages in thread
From: Anthony Liguori @ 2010-06-16 12:34 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: qemu-devel

On 06/16/2010 12:05 AM, Cam Macdonell wrote:
> On Tue, Jun 15, 2010 at 4:33 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
>    
>> On 06/15/2010 05:26 PM, Cam Macdonell wrote:
>>      
>>> On Tue, Jun 15, 2010 at 10:32 AM, Anthony Liguori<anthony@codemonkey.ws>
>>>   wrote:
>>>
>>>        
>>>> On 06/15/2010 11:16 AM, Cam Macdonell wrote:
>>>>
>>>>          
>>>>> How does this look for marking the device as non-migratable?  It adds a
>>>>> field
>>>>> 'no_migrate' to the SaveStateEntry and tests for it in vmstate_save.
>>>>>   This
>>>>> would
>>>>> replace anything that touches memory.
>>>>>
>>>>> Cam
>>>>>
>>>>> ---
>>>>>   hw/hw.h  |    1 +
>>>>>   savevm.c |   32 +++++++++++++++++++++++++++++---
>>>>>   2 files changed, 30 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/hw/hw.h b/hw/hw.h
>>>>> index d78d814..7c93f08 100644
>>>>> --- a/hw/hw.h
>>>>> +++ b/hw/hw.h
>>>>> @@ -263,6 +263,7 @@ int register_savevm_live(const char *idstr,
>>>>>                            void *opaque);
>>>>>
>>>>>   void unregister_savevm(const char *idstr, void *opaque);
>>>>> +void mark_no_migrate(const char *idstr, void *opaque);
>>>>>
>>>>>
>>>>>            
>>>> I'm not thrilled with the name but the functionality is spot on.  I lack
>>>> the
>>>> creativity to offer a better name suggestion :-)
>>>>
>>>> Regards,
>>>>
>>>> Anthony Liguori
>>>>
>>>>          
>>> Hmmm, in working on this it seems that the memory (from
>>> qemu_ram_map()) is still attached even when the device is removed
>>> (which causes migration to fail because there is an unexpected
>>> memory).
>>>
>>> Is something like cpu_unregister_physical_memory()/qemu_ram_free() needed?
>>>
>>>        
>> Yes.  You need to unregister any memory that you have registered upon device
>> removal.
>>      
> Is there an established way to achieve this?  I can't seem to find
> another device that unregisters memory registered with
> cpu_register_physical_memory().  Is something like
> cpu_unregister_physical_memory() needed?
>    

cpu_register_physical_memory(IO_MEM_UNASSIGNED).

If you look at pci.c, you'll see that it automatically unregisters any 
mapped io regions on remove.
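
The idea that unmapping is just registering the same range again with an "unassigned" value can be shown with a small model. This is a sketch under loud assumptions: it is not QEMU code, TOY_UNASSIGNED is a hypothetical stand-in for IO_MEM_UNASSIGNED, and the page table is a flat array. The real call also takes address and size arguments that the shorthand above leaves out.

```c
/* Toy model of "unregister by re-registering as unassigned".
 * NOT QEMU code; all toy_* and TOY_* names are hypothetical. */
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_COUNT 16
#define TOY_UNASSIGNED 0   /* stand-in for IO_MEM_UNASSIGNED */

/* Owner id per page; zero-initialised, so every page starts unassigned. */
static int toy_page_owner[TOY_PAGE_COUNT];

/* Point pages [first, first + count) at 'owner'. Passing TOY_UNASSIGNED
 * unmaps the range, so mapping and unmapping share one entry point. */
int toy_register_pages(size_t first, size_t count, int owner)
{
    size_t i;

    assert(owner >= 0);
    if (first > TOY_PAGE_COUNT || count > TOY_PAGE_COUNT - first) {
        return -1;  /* range does not fit in the toy address space */
    }
    for (i = first; i < first + count; i++) {
        toy_page_owner[i] = owner;
    }
    return 0;
}

/* Count the pages currently mapped to 'owner'. */
size_t toy_pages_owned_by(int owner)
{
    size_t i, n = 0;

    for (i = 0; i < TOY_PAGE_COUNT; i++) {
        if (toy_page_owner[i] == owner) {
            n++;
        }
    }
    return n;
}
```

A device with id 7 that maps pages 4..11 on init would call toy_register_pages(4, 8, 7), and its removal path would call toy_register_pages(4, 8, TOY_UNASSIGNED) over the same range. Note that this only undoes the mapping; releasing the backing allocation is a separate step in the discussion above.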

Regards,

Anthony Liguori

> Thanks,
> Cam
>
>    
>> Regards,
>>
>> Anthony Liguori
>>
>>      
>>> Cam
>>>
>>>
>>>        
>>
>>      

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] Re: [PATCH RFC] Mark a device as non-migratable
  2010-06-16 12:34                         ` Anthony Liguori
@ 2010-06-17  4:18                           ` Cam Macdonell
  0 siblings, 0 replies; 42+ messages in thread
From: Cam Macdonell @ 2010-06-17  4:18 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

On Wed, Jun 16, 2010 at 6:34 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 06/16/2010 12:05 AM, Cam Macdonell wrote:
>>
>> On Tue, Jun 15, 2010 at 4:33 PM, Anthony Liguori<anthony@codemonkey.ws>
>>  wrote:
>>
>>>
>>> On 06/15/2010 05:26 PM, Cam Macdonell wrote:
>>>
>>>>
>>>> On Tue, Jun 15, 2010 at 10:32 AM, Anthony Liguori<anthony@codemonkey.ws>
>>>>  wrote:
>>>>
>>>>
>>>>>
>>>>> On 06/15/2010 11:16 AM, Cam Macdonell wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> How does this look for marking the device as non-migratable?  It adds
>>>>>> a
>>>>>> field
>>>>>> 'no_migrate' to the SaveStateEntry and tests for it in vmstate_save.
>>>>>>  This
>>>>>> would
>>>>>> replace anything that touches memory.
>>>>>>
>>>>>> Cam
>>>>>>
>>>>>> ---
>>>>>>  hw/hw.h  |    1 +
>>>>>>  savevm.c |   32 +++++++++++++++++++++++++++++---
>>>>>>  2 files changed, 30 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/hw.h b/hw/hw.h
>>>>>> index d78d814..7c93f08 100644
>>>>>> --- a/hw/hw.h
>>>>>> +++ b/hw/hw.h
>>>>>> @@ -263,6 +263,7 @@ int register_savevm_live(const char *idstr,
>>>>>>                           void *opaque);
>>>>>>
>>>>>>  void unregister_savevm(const char *idstr, void *opaque);
>>>>>> +void mark_no_migrate(const char *idstr, void *opaque);
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> I'm not thrilled with the name but the functionality is spot on.  I
>>>>> lack
>>>>> the
>>>>> creativity to offer a better name suggestion :-)
>>>>>
>>>>> Regards,
>>>>>
>>>>> Anthony Liguori
>>>>>
>>>>>
>>>>
>>>> Hmmm, in working on this it seems that the memory (from
>>>> qemu_ram_map()) is still attached even when the device is removed
>>>> (which causes migration to fail because there is an unexpected
>>>> memory).
>>>>
>>>> Is something like cpu_unregister_physical_memory()/qemu_ram_free()
>>>> needed?
>>>>
>>>>
>>>
>>> Yes.  You need to unregister any memory that you have registered upon
>>> device
>>> removal.
>>>
>>
>> Is there an established way to achieve this?  I can't seem to find
>> another device that unregisters memory registered with
>> cpu_register_physical_memory().  Is something like
>> cpu_unregister_physical_memory() needed?
>>
>
> cpu_register_physical_memory(IO_MEM_UNASSIGNED).
>
> If you look at pci.c, you'll see that it automatically unregisters any
> mapped io regions on remove.
>

It appears that the 'peer' migration won't work until memory hotplug
is supported, correct?  AFAICT the memory sizes will not match between
the source and destination VMs after the device is removed and the
memory system currently doesn't support gaps.  A technique similar to
my patch for non-migratable memory would be needed to mark free'd
memory pages without Alex's patches in.
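
To make the size mismatch concrete, here is a toy model. It assumes, as described above and not verified against the QEMU source, that RAM blocks are laid out back to back with no gaps and that migration needs both sides to agree on the total. All toy_* names and the block sizes are hypothetical.

```c
/* Toy model of the RAM size mismatch after device removal.
 * NOT QEMU code; the layout rule is an assumption taken from the thread. */
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BLOCKS 4

/* RAM blocks laid out back to back, with no way to leave a hole. */
struct toy_ram_layout {
    size_t block_size[TOY_MAX_BLOCKS];
    size_t nblocks;
};

/* Sum of all block sizes still accounted for in the layout. */
size_t toy_total_ram(const struct toy_ram_layout *layout)
{
    size_t i, total = 0;

    assert(layout->nblocks <= TOY_MAX_BLOCKS);
    for (i = 0; i < layout->nblocks; i++) {
        total += layout->block_size[i];
    }
    return total;
}

/* In this model, migration requires identical totals on both sides. */
int toy_can_migrate(const struct toy_ram_layout *src,
                    const struct toy_ram_layout *dst)
{
    return toy_total_ram(src) == toy_total_ram(dst);
}
```

With a source that still accounts for a 256-unit device block after unplug ({1024, 256}) and a destination started without the device ({1024}), toy_can_migrate() returns 0. It returns 1 only once both layouts carry the same blocks, which is the situation that either memory hotplug or a "do not migrate" marking on the freed pages would have to create.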

For the purposes of my patch, can it be merged without the 'peer' case
(pending Alex's patches and hotplug)?

Thanks,
Cam

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory
  2010-06-14 15:53             ` [Qemu-devel] [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory Anthony Liguori
  2010-06-14 22:03               ` Cam Macdonell
@ 2010-06-23 13:12               ` Avi Kivity
  2010-06-23 21:54                 ` Anthony Liguori
  1 sibling, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2010-06-23 13:12 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Cam Macdonell, qemu-devel, kvm

On 06/14/2010 06:53 PM, Anthony Liguori wrote:
>> index 0000000..e0a7b98
>> --- /dev/null
>> +++ b/contrib/ivshmem-server/ivshmem_server.c
>
>
> There's no licensing here.  I don't think this belongs in the qemu 
> tree either to be honest. 

I asked for this, to simplify life for people trying this out.

> If it were to be included, it ought to use all of the existing qemu 
> infrastructure like the other qemu-* tools.

That's why it's in contrib/, a customary place for things included for 
convenience but not really belonging.

I don't mind leaving it out though.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory
  2010-06-23 13:12               ` Avi Kivity
@ 2010-06-23 21:54                 ` Anthony Liguori
  0 siblings, 0 replies; 42+ messages in thread
From: Anthony Liguori @ 2010-06-23 21:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Cam Macdonell, qemu-devel, kvm

On 06/23/2010 08:12 AM, Avi Kivity wrote:
> On 06/14/2010 06:53 PM, Anthony Liguori wrote:
>>> index 0000000..e0a7b98
>>> --- /dev/null
>>> +++ b/contrib/ivshmem-server/ivshmem_server.c
>>
>>
>> There's no licensing here.  I don't think this belongs in the qemu 
>> tree either to be honest. 
>
> I asked for this, to simplify life for people trying this out.
>
>> If it were to be included, it ought to use all of the existing qemu 
>> infrastructure like the other qemu-* tools.
>
> That's why it's in contrib/, a customary place for things included for 
> convenience but not really belonging.
>
> I don't mind leaving it out though.

I think it's better in the long term.  Then it has its own tree and can
evolve at its own rate.

Regards,

Anthony Liguori



^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2010-06-23 21:54 UTC | newest]

Thread overview: 42+ messages
2010-06-04 21:45 [PATCH v6 0/6] Inter-VM Shared Memory Device with migration support Cam Macdonell
2010-06-04 21:45 ` [Qemu-devel] " Cam Macdonell
2010-06-04 21:45 ` [PATCH v6 1/6] Device specification for shared memory PCI device Cam Macdonell
2010-06-04 21:45   ` [Qemu-devel] " Cam Macdonell
2010-06-04 21:45   ` [PATCH v6 2/6] Add function to assign ioeventfd to MMIO Cam Macdonell
2010-06-04 21:45     ` [Qemu-devel] " Cam Macdonell
2010-06-04 21:45     ` [PATCH v6 3/6] Change phys_ram_dirty to phys_ram_status Cam Macdonell
2010-06-04 21:45       ` [Qemu-devel] " Cam Macdonell
2010-06-04 21:45       ` [PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG Cam Macdonell
2010-06-04 21:45         ` [Qemu-devel] " Cam Macdonell
2010-06-04 21:45         ` [PATCH v6 5/6] Inter-VM shared memory PCI device Cam Macdonell
2010-06-04 21:45           ` [Qemu-devel] " Cam Macdonell
2010-06-04 21:45           ` [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory Cam Macdonell
2010-06-04 21:45             ` [Qemu-devel] " Cam Macdonell
2010-06-04 21:47             ` [PATCH v6] Shared memory uio_pci driver Cam Macdonell
2010-06-04 21:47               ` [Qemu-devel] " Cam Macdonell
2010-06-14 15:53             ` [Qemu-devel] [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory Anthony Liguori
2010-06-14 22:03               ` Cam Macdonell
2010-06-23 13:12               ` Avi Kivity
2010-06-23 21:54                 ` Anthony Liguori
2010-06-05  9:44           ` [Qemu-devel] [PATCH v6 5/6] Inter-VM shared memory PCI device Blue Swirl
2010-06-06 15:02             ` Avi Kivity
2010-06-07 16:41             ` Cam Macdonell
2010-06-09 20:12               ` Blue Swirl
2010-06-14 15:51         ` [Qemu-devel] [PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG Anthony Liguori
2010-06-14 16:08           ` Cam Macdonell
2010-06-14 16:15             ` Anthony Liguori
2010-06-15 16:16               ` [PATCH RFC] Mark a device as non-migratable Cam Macdonell
2010-06-15 16:16                 ` [Qemu-devel] " Cam Macdonell
2010-06-15 16:32                 ` Anthony Liguori
2010-06-15 16:32                   ` [Qemu-devel] " Anthony Liguori
2010-06-15 17:45                   ` Markus Armbruster
2010-06-15 22:26                   ` Cam Macdonell
2010-06-15 22:26                     ` [Qemu-devel] " Cam Macdonell
2010-06-15 22:33                     ` Anthony Liguori
2010-06-16  5:05                       ` Cam Macdonell
2010-06-16 12:34                         ` Anthony Liguori
2010-06-17  4:18                           ` Cam Macdonell
2010-06-11 22:03 ` [PATCH v6 0/6] Inter-VM Shared Memory Device with migration support Cam Macdonell
2010-06-11 22:03   ` [Qemu-devel] " Cam Macdonell
2010-06-14 15:54   ` Anthony Liguori
2010-06-14 15:54     ` Anthony Liguori
