qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 00/21] Initial support for multi-process qemu
@ 2020-06-27 17:09 elena.ufimtseva
  2020-06-27 17:09 ` [PATCH v7 01/21] memory: alloc RAM from file at offset elena.ufimtseva
                   ` (21 more replies)
  0 siblings, 22 replies; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

Hello

This is the v7 of the patchset.
Thank you very much for the detailed feedback for v6. We appreciate your time.

We have addressed the latest comments and suggestions that were
provided on v6 patch series and incorporated to this patchset.

This is the list of changes for v7:
 - QEMU & remote process share the same binary.
   This allowed us to reduce the number of patches as well.

 - We introduced the machine type "remote" that drives the remote process
   initialization.

 - v7 now uses QIOChannel for communication and descriptors management.

 - The remote process uses the main loop instead of a separate loop.

 - Co-routines support in the QEMU Proxy-remote process communication
   The communication model based on co-routines needs some more work and
   we would like to hear your take on it.
   Stefan has shared some ideas how we can proceed and we will take this
   to the next version after additional discussion.
   We did not implement the protocol to listen and accept new connections.

There are other changes that were incorporated from the feedback we have
received on v6.

We posted the Proof Of Concept patches [2] before the BoF session in 2018.
Subsequently, we posted RFC v1 [3], RFC v2 [4], RFC v3 [5], RFC v4 [6],
v5 [7] and v6 [8] of the patch series.
Following people contributed to this patchset:

John G Johnson <john.g.johnson@oracle.com>
Jagannathan Raman <jag.raman@oracle.com>
Elena Ufimtseva <elena.ufimtseva@oracle.com>
Kanth Ghatraju <kanth.ghatraju@oracle.com>
Konrad Wilk <konrad.wilk@oracle.com>

Also we would like to thank QEMU community for your help, suggestions
and reviewing this large series of patches.

For the full concept writeup about QEMU multi-process, please refer to
docs/devel/qemu-multiprocess.rst. Also see docs/qemu-multiprocess.txt for
usage information.

We will post separate patchsets for the following improvements for
the experimental Qemu multi-process:
 - Live migration;
 - communication channel improvements;

We welcome all your ideas, concerns, and questions for this patchset.

[1]: http://events17.linuxfoundation.org/sites/events/files/slides/KVM%20FORUM%20multi-process.pdf
[1]: https://www.youtube.com/watch?v=Kq1-coHh7lg
[2]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg566538.html
[3]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg602285.html
[4]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg624877.html
[5]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg642000.html
[6]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg655118.html
[7]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg682429.html
[8]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg697484.html

Elena Ufimtseva (9):
  multi-process: add qio channel function to transmit
  multi-process: define MPQemuMsg format and transmission functions
  multi-process: add co-routines to communicate with remote
  multi-process: Initialize communication channel at the remote end
  multi-process: introduce proxy object
  multi-process: Forward PCI config space acceses to the remote process
  multi-process: heartbeat messages to remote
  multi-process: perform device reset in the remote process
  multi-process: add configure and usage information

Jagannathan Raman (11):
  memory: alloc RAM from file at offset
  multi-process: Add config option for multi-process QEMU
  multi-process: setup PCI host bridge for remote device
  multi-process: setup a machine object for remote device process
  multi-process: Initialize message handler in remote device
  multi-process: setup memory manager for remote device
  multi-process: Connect Proxy Object with device in the remote process
  multi-process: PCI BAR read/write handling for proxy & remote
    endpoints
  multi-process: Synchronize remote memory
  multi-process: create IOHUB object to handle irq
  multi-process: Retrieve PCI info from remote process

John G Johnson (1):
  multi-process: add the concept description to
    docs/devel/qemu-multiprocess

 MAINTAINERS                          |  24 +
 backends/hostmem-memfd.c             |   2 +-
 configure                            |  11 +
 docs/devel/index.rst                 |   1 +
 docs/devel/multi-process.rst         | 957 +++++++++++++++++++++++++++
 docs/multi-process.rst               |  71 ++
 exec.c                               |  11 +-
 hw/Makefile.objs                     |   1 +
 hw/i386/Makefile.objs                |   3 +
 hw/i386/remote-memory.c              |  58 ++
 hw/i386/remote-msg.c                 | 301 +++++++++
 hw/i386/remote.c                     |  99 +++
 hw/misc/ivshmem.c                    |   3 +-
 hw/pci-host/Makefile.objs            |   1 +
 hw/pci-host/remote.c                 |  63 ++
 hw/pci/Makefile.objs                 |   2 +
 hw/pci/memory-sync.c                 | 214 ++++++
 hw/pci/proxy.c                       | 436 ++++++++++++
 hw/remote/Makefile.objs              |   1 +
 hw/remote/iohub.c                    | 153 +++++
 include/exec/memory.h                |   2 +
 include/exec/ram_addr.h              |   2 +-
 include/hw/i386/remote-memory.h      |  20 +
 include/hw/i386/remote.h             |  38 ++
 include/hw/pci-host/remote.h         |  34 +
 include/hw/pci/memory-sync.h         |  30 +
 include/hw/pci/pci_ids.h             |   3 +
 include/hw/pci/proxy.h               |  69 ++
 include/hw/remote/iohub.h            |  50 ++
 include/io/channel.h                 |  24 +
 include/io/mpqemu-link.h             | 140 ++++
 include/qemu/mmap-alloc.h            |   3 +-
 io/Makefile.objs                     |   2 +
 io/channel.c                         |  45 ++
 io/mpqemu-link.c                     | 277 ++++++++
 memory.c                             |   3 +-
 scripts/mpqemu-launcher-perf-mode.py |  67 ++
 scripts/mpqemu-launcher.py           |  47 ++
 util/mmap-alloc.c                    |   7 +-
 util/oslib-posix.c                   |   2 +-
 40 files changed, 3264 insertions(+), 13 deletions(-)
 create mode 100644 docs/devel/multi-process.rst
 create mode 100644 docs/multi-process.rst
 create mode 100644 hw/i386/remote-memory.c
 create mode 100644 hw/i386/remote-msg.c
 create mode 100644 hw/i386/remote.c
 create mode 100644 hw/pci-host/remote.c
 create mode 100644 hw/pci/memory-sync.c
 create mode 100644 hw/pci/proxy.c
 create mode 100644 hw/remote/Makefile.objs
 create mode 100644 hw/remote/iohub.c
 create mode 100644 include/hw/i386/remote-memory.h
 create mode 100644 include/hw/i386/remote.h
 create mode 100644 include/hw/pci-host/remote.h
 create mode 100644 include/hw/pci/memory-sync.h
 create mode 100644 include/hw/pci/proxy.h
 create mode 100644 include/hw/remote/iohub.h
 create mode 100644 include/io/mpqemu-link.h
 create mode 100644 io/mpqemu-link.c
 create mode 100644 scripts/mpqemu-launcher-perf-mode.py
 create mode 100755 scripts/mpqemu-launcher.py

-- 
2.25.GIT



^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v7 01/21] memory: alloc RAM from file at offset
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-06-30 14:59   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 02/21] multi-process: Add config option for multi-process QEMU elena.ufimtseva
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Jagannathan Raman <jag.raman@oracle.com>

Allow RAM MemoryRegion to be created from an offset in a file, instead
of allocating at offset of 0 by default. This is needed to synchronize
RAM between QEMU & remote process.

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---
 backends/hostmem-memfd.c  |  2 +-
 exec.c                    | 11 +++++++----
 hw/misc/ivshmem.c         |  3 ++-
 include/exec/memory.h     |  2 ++
 include/exec/ram_addr.h   |  2 +-
 include/qemu/mmap-alloc.h |  3 ++-
 memory.c                  |  3 ++-
 util/mmap-alloc.c         |  7 ++++---
 util/oslib-posix.c        |  2 +-
 9 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 1b5e4bfe0d..a862d010ab 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -56,7 +56,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     name = host_memory_backend_get_name(backend);
     memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
                                    name, backend->size,
-                                   backend->share, fd, errp);
+                                   backend->share, fd, 0, errp);
     g_free(name);
 }
 
diff --git a/exec.c b/exec.c
index 21926dc9c7..afc42722b6 100644
--- a/exec.c
+++ b/exec.c
@@ -1854,6 +1854,7 @@ static void *file_ram_alloc(RAMBlock *block,
                             ram_addr_t memory,
                             int fd,
                             bool truncate,
+                            off_t offset,
                             Error **errp)
 {
     void *area;
@@ -1904,7 +1905,8 @@ static void *file_ram_alloc(RAMBlock *block,
     }
 
     area = qemu_ram_mmap(fd, memory, block->mr->align,
-                         block->flags & RAM_SHARED, block->flags & RAM_PMEM);
+                         block->flags & RAM_SHARED, block->flags & RAM_PMEM,
+                         offset);
     if (area == MAP_FAILED) {
         error_setg_errno(errp, errno,
                          "unable to map backing store for guest RAM");
@@ -2336,7 +2338,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
 #ifdef CONFIG_POSIX
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
                                  uint32_t ram_flags, int fd,
-                                 Error **errp)
+                                 off_t offset, Error **errp)
 {
     RAMBlock *new_block;
     Error *local_err = NULL;
@@ -2389,7 +2391,8 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     new_block->used_length = size;
     new_block->max_length = size;
     new_block->flags = ram_flags;
-    new_block->host = file_ram_alloc(new_block, size, fd, !file_size, errp);
+    new_block->host = file_ram_alloc(new_block, size, fd, !file_size, offset,
+                                     errp);
     if (!new_block->host) {
         g_free(new_block);
         return NULL;
@@ -2419,7 +2422,7 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
 
-    block = qemu_ram_alloc_from_fd(size, mr, ram_flags, fd, errp);
+    block = qemu_ram_alloc_from_fd(size, mr, ram_flags, fd, 0, errp);
     if (!block) {
         if (created) {
             unlink(mem_path);
diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index a8dc9b377d..b3cffbd1e0 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -492,7 +492,8 @@ static void process_msg_shmem(IVShmemState *s, int fd, Error **errp)
 
     /* mmap the region and map into the BAR2 */
     memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s),
-                                   "ivshmem.bar2", size, true, fd, &local_err);
+                                   "ivshmem.bar2", size, true, fd, 0,
+                                   &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
         return;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7207025bd4..a8fc1c36db 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -902,6 +902,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
  * @size: size of the region.
  * @share: %true if memory must be mmaped with the MAP_SHARED flag
  * @fd: the fd to mmap.
+ * @offset: offset within the file referenced by fd
  * @errp: pointer to Error*, to store an error if it happens.
  *
  * Note that this function does not do anything to cause the data in the
@@ -913,6 +914,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
                                     uint64_t size,
                                     bool share,
                                     int fd,
+                                    ram_addr_t offset,
                                     Error **errp);
 #endif
 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 7b5c24e928..ff0784a79c 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -121,7 +121,7 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
                                    Error **errp);
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
                                  uint32_t ram_flags, int fd,
-                                 Error **errp);
+                                 off_t offset, Error **errp);
 
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                   MemoryRegion *mr, Error **errp);
diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index e786266b92..4f579858bc 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -25,7 +25,8 @@ void *qemu_ram_mmap(int fd,
                     size_t size,
                     size_t align,
                     bool shared,
-                    bool is_pmem);
+                    bool is_pmem,
+                    off_t start);
 
 void qemu_ram_munmap(int fd, void *ptr, size_t size);
 
diff --git a/memory.c b/memory.c
index 9200b20130..130760bd0f 100644
--- a/memory.c
+++ b/memory.c
@@ -1576,6 +1576,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
                                     uint64_t size,
                                     bool share,
                                     int fd,
+                                    ram_addr_t offset,
                                     Error **errp)
 {
     Error *err = NULL;
@@ -1585,7 +1586,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
     mr->destructor = memory_region_destructor_ram;
     mr->ram_block = qemu_ram_alloc_from_fd(size, mr,
                                            share ? RAM_SHARED : 0,
-                                           fd, &err);
+                                           fd, offset, &err);
     mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
     if (err) {
         mr->size = int128_zero();
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 27dcccd8ec..a28f7025f0 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -86,7 +86,8 @@ void *qemu_ram_mmap(int fd,
                     size_t size,
                     size_t align,
                     bool shared,
-                    bool is_pmem)
+                    bool is_pmem,
+                    off_t start)
 {
     int flags;
     int map_sync_flags = 0;
@@ -147,7 +148,7 @@ void *qemu_ram_mmap(int fd,
     offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
 
     ptr = mmap(guardptr + offset, size, PROT_READ | PROT_WRITE,
-               flags | map_sync_flags, fd, 0);
+               flags | map_sync_flags, fd, start);
 
     if (ptr == MAP_FAILED && map_sync_flags) {
         if (errno == ENOTSUP) {
@@ -172,7 +173,7 @@ void *qemu_ram_mmap(int fd,
          * we will remove these flags to handle compatibility.
          */
         ptr = mmap(guardptr + offset, size, PROT_READ | PROT_WRITE,
-                   flags, fd, 0);
+                   flags, fd, start);
     }
 
     if (ptr == MAP_FAILED) {
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 39ddc77c85..132e7c46c6 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -222,7 +222,7 @@ void *qemu_memalign(size_t alignment, size_t size)
 void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
 {
     size_t align = QEMU_VMALLOC_ALIGN;
-    void *ptr = qemu_ram_mmap(-1, size, align, shared, false);
+    void *ptr = qemu_ram_mmap(-1, size, align, shared, false, 0);
 
     if (ptr == MAP_FAILED) {
         return NULL;
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 02/21] multi-process: Add config option for multi-process QEMU
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
  2020-06-27 17:09 ` [PATCH v7 01/21] memory: alloc RAM from file at offset elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-06-30 14:57   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 03/21] multi-process: setup PCI host bridge for remote device elena.ufimtseva
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Jagannathan Raman <jag.raman@oracle.com>

Add a configuration option to separate multi-process code

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---
 configure | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/configure b/configure
index 4a22dcd563..591e3c62ba 100755
--- a/configure
+++ b/configure
@@ -519,6 +519,7 @@ fuzzing="no"
 rng_none="no"
 secret_keyring=""
 libdaxctl=""
+mpqemu="no"
 
 supported_cpu="no"
 supported_os="no"
@@ -1631,6 +1632,10 @@ for opt do
   ;;
   --disable-libdaxctl) libdaxctl=no
   ;;
+  --enable-mpqemu) mpqemu=yes
+  ;;
+  --disable-mpqemu) mpqemu=no
+  ;;
   *)
       echo "ERROR: unknown option $opt"
       echo "Try '$0 --help' for more information"
@@ -1933,6 +1938,8 @@ disabled with --disable-FEATURE, default is enabled if available:
   xkbcommon       xkbcommon support
   rng-none        dummy RNG, avoid using /dev/(u)random and getrandom()
   libdaxctl       libdaxctl support
+  mpqemu          multi-process QEMU support
+
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -6999,6 +7006,7 @@ echo "fuzzing support   $fuzzing"
 echo "gdb               $gdb_bin"
 echo "rng-none          $rng_none"
 echo "Linux keyring     $secret_keyring"
+echo "multiprocess QEMU $mpqemu"
 
 if test "$supported_cpu" = "no"; then
     echo
@@ -7856,6 +7864,9 @@ fi
 if test "$sheepdog" = "yes" ; then
   echo "CONFIG_SHEEPDOG=y" >> $config_host_mak
 fi
+if test "$mpqemu" = "yes" ; then
+  echo "CONFIG_MPQEMU=y" >> $config_host_mak
+fi
 if test "$fuzzing" = "yes" ; then
   if test "$have_fuzzer" = "yes"; then
     FUZZ_LDFLAGS=" -fsanitize=address,fuzzer"
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 03/21] multi-process: setup PCI host bridge for remote device
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
  2020-06-27 17:09 ` [PATCH v7 01/21] memory: alloc RAM from file at offset elena.ufimtseva
  2020-06-27 17:09 ` [PATCH v7 02/21] multi-process: Add config option for multi-process QEMU elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-06-30 15:17   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 04/21] multi-process: setup a machine object for remote device process elena.ufimtseva
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Jagannathan Raman <jag.raman@oracle.com>

PCI host bridge is setup for the remote device process. It is
implemented using remote-pcihost object. It is an extension of the PCI
host bridge setup by QEMU.
Remote-pcihost configures a PCI bus which could be used by the remote
PCI device to latch on to.

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---
 MAINTAINERS                  |  8 +++++
 hw/pci-host/Makefile.objs    |  1 +
 hw/pci-host/remote.c         | 63 ++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/remote.h | 34 +++++++++++++++++++
 4 files changed, 106 insertions(+)
 create mode 100644 hw/pci-host/remote.c
 create mode 100644 include/hw/pci-host/remote.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 1b40446c73..e46f1960bf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2938,6 +2938,14 @@ S: Maintained
 F: hw/semihosting/
 F: include/hw/semihosting/
 
+Multi-process QEMU
+M: Jagannathan Raman <jag.raman@oracle.com>
+M: Elena Ufimtseva <elena.ufimtseva@oracle.com>
+M: John G Johnson <john.g.johnson@oracle.com>
+S: Maintained
+F: hw/pci-host/remote.c
+F: include/hw/pci-host/remote.h
+
 Build and test automation
 -------------------------
 Build and test automation
diff --git a/hw/pci-host/Makefile.objs b/hw/pci-host/Makefile.objs
index e422e0aca0..daf900710d 100644
--- a/hw/pci-host/Makefile.objs
+++ b/hw/pci-host/Makefile.objs
@@ -18,6 +18,7 @@ common-obj-$(CONFIG_XEN_IGD_PASSTHROUGH) += xen_igd_pt.o
 common-obj-$(CONFIG_PCI_EXPRESS_Q35) += q35.o
 common-obj-$(CONFIG_PCI_EXPRESS_GENERIC_BRIDGE) += gpex.o
 common-obj-$(CONFIG_PCI_EXPRESS_XILINX) += xilinx-pcie.o
+common-obj-$(CONFIG_MPQEMU) += remote.o
 
 common-obj-$(CONFIG_PCI_EXPRESS_DESIGNWARE) += designware.o
 obj-$(CONFIG_POWERNV) += pnv_phb4.o pnv_phb4_pec.o
diff --git a/hw/pci-host/remote.c b/hw/pci-host/remote.c
new file mode 100644
index 0000000000..5ea9af4154
--- /dev/null
+++ b/hw/pci-host/remote.c
@@ -0,0 +1,63 @@
+/*
+ * Remote PCI host device
+ *
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/pci/pci.h"
+#include "hw/pci/pci_host.h"
+#include "hw/pci/pcie_host.h"
+#include "hw/qdev-properties.h"
+#include "hw/pci-host/remote.h"
+#include "exec/memory.h"
+
+static const char *remote_pcihost_root_bus_path(PCIHostState *host_bridge,
+                                                PCIBus *rootbus)
+{
+    return "0000:00";
+}
+
+static void remote_pcihost_realize(DeviceState *dev, Error **errp)
+{
+    char *busname = g_strdup_printf("remote-pci-%ld", (unsigned long)getpid());
+    PCIHostState *pci = PCI_HOST_BRIDGE(dev);
+    RemotePCIHost *s = REMOTE_HOST_DEVICE(dev);
+
+    pci->bus = pci_root_bus_new(DEVICE(s), busname,
+                                s->mr_pci_mem, s->mr_sys_io,
+                                0, TYPE_PCIE_BUS);
+}
+
+static void remote_pcihost_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIHostBridgeClass *hc = PCI_HOST_BRIDGE_CLASS(klass);
+
+    hc->root_bus_path = remote_pcihost_root_bus_path;
+    dc->realize = remote_pcihost_realize;
+
+    dc->user_creatable = false;
+    set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
+    dc->fw_name = "pci";
+}
+
+static const TypeInfo remote_pcihost_info = {
+    .name = TYPE_REMOTE_HOST_DEVICE,
+    .parent = TYPE_PCIE_HOST_BRIDGE,
+    .instance_size = sizeof(RemotePCIHost),
+    .class_init = remote_pcihost_class_init,
+};
+
+static void remote_pcihost_register(void)
+{
+    type_register_static(&remote_pcihost_info);
+}
+
+type_init(remote_pcihost_register)
diff --git a/include/hw/pci-host/remote.h b/include/hw/pci-host/remote.h
new file mode 100644
index 0000000000..3df1b53c17
--- /dev/null
+++ b/include/hw/pci-host/remote.h
@@ -0,0 +1,34 @@
+/*
+ * PCI Host for remote device
+ *
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef REMOTE_PCIHOST_H
+#define REMOTE_PCIHOST_H
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "exec/memory.h"
+#include "hw/pci/pcie_host.h"
+
+#define TYPE_REMOTE_HOST_DEVICE "remote-pcihost"
+#define REMOTE_HOST_DEVICE(obj) \
+    OBJECT_CHECK(RemotePCIHost, (obj), TYPE_REMOTE_HOST_DEVICE)
+
+typedef struct RemotePCIHost {
+    /*< private >*/
+    PCIExpressHost parent_obj;
+    /*< public >*/
+
+    MemoryRegion *mr_pci_mem;
+    MemoryRegion *mr_sys_mem;
+    MemoryRegion *mr_sys_io;
+} RemotePCIHost;
+
+#endif
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 04/21] multi-process: setup a machine object for remote device process
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (2 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 03/21] multi-process: setup PCI host bridge for remote device elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-06-30 15:26   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 05/21] multi-process: add qio channel function to transmit elena.ufimtseva
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Jagannathan Raman <jag.raman@oracle.com>

remote-machine object sets up various subsystems of the remote
device process. Instantiate PCI host bridge object and initialize RAM, IO &
PCI memory regions.

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---
 MAINTAINERS              |  2 ++
 hw/i386/Makefile.objs    |  1 +
 hw/i386/remote.c         | 64 ++++++++++++++++++++++++++++++++++++++++
 include/hw/i386/remote.h | 31 +++++++++++++++++++
 4 files changed, 98 insertions(+)
 create mode 100644 hw/i386/remote.c
 create mode 100644 include/hw/i386/remote.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e46f1960bf..83aae5b441 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2945,6 +2945,8 @@ M: John G Johnson <john.g.johnson@oracle.com>
 S: Maintained
 F: hw/pci-host/remote.c
 F: include/hw/pci-host/remote.h
+F: hw/i386/remote.c
+F: include/hw/i386/remote.h
 
 Build and test automation
 -------------------------
diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
index 6abc74551a..5ead266a15 100644
--- a/hw/i386/Makefile.objs
+++ b/hw/i386/Makefile.objs
@@ -14,6 +14,7 @@ obj-$(CONFIG_XEN) += ../xenpv/ xen/
 obj-$(CONFIG_VMPORT) += vmport.o
 obj-$(CONFIG_VMMOUSE) += vmmouse.o
 obj-$(CONFIG_PC) += port92.o
+obj-$(CONFIG_MPQEMU) += remote.o
 
 obj-y += kvmvapic.o
 obj-$(CONFIG_ACPI) += acpi-common.o
diff --git a/hw/i386/remote.c b/hw/i386/remote.c
new file mode 100644
index 0000000000..4d13abe9f3
--- /dev/null
+++ b/hw/i386/remote.c
@@ -0,0 +1,64 @@
+/*
+ * Machine for remote device
+ *
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/i386/remote.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "qapi/error.h"
+
+static void remote_machine_init(MachineState *machine)
+{
+    MemoryRegion *system_memory, *system_io, *pci_memory;
+    RemMachineState *s = REMOTE_MACHINE(machine);
+    RemotePCIHost *rem_host;
+
+    system_memory = get_system_memory();
+    system_io = get_system_io();
+
+    pci_memory = g_new(MemoryRegion, 1);
+    memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
+
+    rem_host = REMOTE_HOST_DEVICE(qdev_new(TYPE_REMOTE_HOST_DEVICE));
+
+    rem_host->mr_pci_mem = pci_memory;
+    rem_host->mr_sys_mem = system_memory;
+    rem_host->mr_sys_io = system_io;
+
+    s->host = rem_host;
+
+    object_property_add_child(OBJECT(s), "remote-device", OBJECT(rem_host));
+    memory_region_add_subregion_overlap(system_memory, 0x0, pci_memory, -1);
+
+    qdev_realize(DEVICE(rem_host), sysbus_get_default(), &error_fatal);
+}
+
+static void remote_machine_class_init(ObjectClass *oc, void *data)
+{
+    MachineClass *mc = MACHINE_CLASS(oc);
+
+    mc->init = remote_machine_init;
+}
+
+static const TypeInfo remote_machine = {
+    .name = TYPE_REMOTE_MACHINE,
+    .parent = TYPE_MACHINE,
+    .instance_size = sizeof(RemMachineState),
+    .class_init = remote_machine_class_init,
+};
+
+static void remote_machine_register_types(void)
+{
+    type_register_static(&remote_machine);
+}
+
+type_init(remote_machine_register_types);
diff --git a/include/hw/i386/remote.h b/include/hw/i386/remote.h
new file mode 100644
index 0000000000..d118a940be
--- /dev/null
+++ b/include/hw/i386/remote.h
@@ -0,0 +1,31 @@
+/*
+ * Remote machine configuration
+ *
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef REMOTE_MACHINE_H
+#define REMOTE_MACHINE_H
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "qom/object.h"
+#include "hw/boards.h"
+#include "hw/pci-host/remote.h"
+
+typedef struct RemMachineState {
+    MachineState parent_obj;
+
+    RemotePCIHost *host;
+} RemMachineState;
+
+#define TYPE_REMOTE_MACHINE "remote-machine"
+#define REMOTE_MACHINE(obj) \
+    OBJECT_CHECK(RemMachineState, (obj), TYPE_REMOTE_MACHINE)
+
+#endif
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 05/21] multi-process: add qio channel function to transmit
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (3 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 04/21] multi-process: setup a machine object for remote device process elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-06-30 15:29   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 06/21] multi-process: define MPQemuMsg format and transmission functions elena.ufimtseva
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

The entire array of the memory regions and file handlers.
Will be used in the next patch.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 include/io/channel.h | 24 +++++++++++++++++++++++
 io/channel.c         | 45 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index d4557f0930..c2e3eaeafc 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -779,5 +779,29 @@ void qio_channel_set_aio_fd_handler(QIOChannel *ioc,
                                     IOHandler *io_read,
                                     IOHandler *io_write,
                                     void *opaque);
+/**
+ * qio_channel_writev_full_all:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to write data from
+ * @niov: the length of the @iov array
+ * @fds: an array of file handles to send
+ * @nfds: number of file handles in @fds
+ * @errp: pointer to a NULL-initialized error object
+ *
+ *
+ * Behaves like qio_channel_writev_full but will attempt
+ * to send all data passed (file handles and memory regions).
+ * The function will wait for all requested data
+ * to be written, yielding from the current coroutine
+ * if required.
+ *
+ * Returns: 0 if all bytes were written, or -1 on error
+ */
+
+int qio_channel_writev_full_all(QIOChannel *ioc,
+                           const struct iovec *iov,
+                           size_t niov,
+                           int *fds, size_t nfds,
+                           Error **errp);
 
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel.c b/io/channel.c
index e4376eb0bc..22c10c5ccc 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -190,6 +190,51 @@ int qio_channel_writev_all(QIOChannel *ioc,
     return ret;
 }
 
+int qio_channel_writev_full_all(QIOChannel *ioc,
+                                const struct iovec *iov,
+                                size_t niov,
+                                int *fds, size_t nfds,
+                                Error **errp)
+{
+    int ret = -1;
+    struct iovec *local_iov = g_new(struct iovec, niov);
+    struct iovec *local_iov_head = local_iov;
+    unsigned int nlocal_iov = niov;
+
+    nlocal_iov = iov_copy(local_iov, nlocal_iov,
+                          iov, niov,
+                          0, iov_size(iov, niov));
+
+    while (nlocal_iov > 0) {
+        ssize_t len;
+        len = qio_channel_writev_full(ioc, local_iov, nlocal_iov, fds,
+                                      nfds, errp);
+        if (len == QIO_CHANNEL_ERR_BLOCK) {
+            if (qemu_in_coroutine()) {
+                qio_channel_yield(ioc, G_IO_OUT);
+            } else {
+                qio_channel_wait(ioc, G_IO_OUT);
+            }
+            continue;
+        }
+        if (len < 0) {
+            goto cleanup;
+        }
+
+        iov_discard_front(&local_iov, &nlocal_iov, len);
+
+        if (len > 0) {
+            fds = NULL;
+            nfds = 0;
+        }
+    }
+
+    ret = 0;
+ cleanup:
+    g_free(local_iov_head);
+    return ret;
+}
+
 ssize_t qio_channel_readv(QIOChannel *ioc,
                           const struct iovec *iov,
                           size_t niov,
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 06/21] multi-process: define MPQemuMsg format and transmission functions
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (4 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 05/21] multi-process: add qio channel function to transmit elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-06-30 15:53   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 07/21] multi-process: add co-routines to communicate with remote elena.ufimtseva
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

Defines MPQemuMsg, which is the message that is sent to the remote
process. This message is sent over QIOChannel and is used to
command the remote process to perform various tasks.

Also defined the helper functions to send and receive messages over the
QIOChannel

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---
 MAINTAINERS              |   2 +
 include/io/mpqemu-link.h |  75 +++++++++++++++++++
 io/Makefile.objs         |   2 +
 io/mpqemu-link.c         | 151 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 230 insertions(+)
 create mode 100644 include/io/mpqemu-link.h
 create mode 100644 io/mpqemu-link.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 83aae5b441..50a5fc53d6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2947,6 +2947,8 @@ F: hw/pci-host/remote.c
 F: include/hw/pci-host/remote.h
 F: hw/i386/remote.c
 F: include/hw/i386/remote.h
+F: io/mpqemu-link.c
+F: include/io/mpqemu-link.h
 
 Build and test automation
 -------------------------
diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
new file mode 100644
index 0000000000..1542e8ed07
--- /dev/null
+++ b/include/io/mpqemu-link.h
@@ -0,0 +1,75 @@
+/*
+ * Communication channel between QEMU and remote device process
+ *
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef MPQEMU_LINK_H
+#define MPQEMU_LINK_H
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "qom/object.h"
+#include "qemu/thread.h"
+#include "io/channel.h"
+
+#define REMOTE_MAX_FDS 8
+
+#define MPQEMU_MSG_HDR_SIZE offsetof(MPQemuMsg, data1.u64)
+
+/**
+ * MPQemuCmd:
+ *
+ * MPQemuCmd enum type to specify the command to be executed on the remote
+ * device.
+ */
+typedef enum {
+    INIT = 0,
+    MAX = INT_MAX,
+} MPQemuCmd;
+
+/**
+ * Maximum size of data2 field in the message to be transmitted.
+ */
+#define MPQEMU_MSG_DATA_MAX 256
+
+/**
+ * MPQemuMsg:
+ * @cmd: The remote command
+ * @bytestream: Indicates if the data to be shared is structured (data1)
+ *              or unstructured (data2)
+ * @size: Size of the data to be shared
+ * @data1: Structured data
+ * @fds: File descriptors to be shared with remote device
+ * @data2: Unstructured data
+ *
+ * MPQemuMsg Format of the message sent to the remote device from QEMU.
+ *
+ */
+typedef struct {
+    int cmd;
+    int bytestream;
+    size_t size;
+
+    union {
+        uint64_t u64;
+    } data1;
+
+    int fds[REMOTE_MAX_FDS];
+    int num_fds;
+
+    /* Max size of data2 is MPQEMU_MSG_DATA_MAX */
+    uint8_t *data2;
+} MPQemuMsg;
+
+void mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc);
+int mpqemu_msg_recv(MPQemuMsg *msg, QIOChannel *ioc);
+
+bool mpqemu_msg_valid(MPQemuMsg *msg);
+
+#endif
diff --git a/io/Makefile.objs b/io/Makefile.objs
index 9a20fce4ed..5875ab0697 100644
--- a/io/Makefile.objs
+++ b/io/Makefile.objs
@@ -10,3 +10,5 @@ io-obj-y += channel-util.o
 io-obj-y += dns-resolver.o
 io-obj-y += net-listener.o
 io-obj-y += task.o
+
+io-obj-$(CONFIG_MPQEMU) += mpqemu-link.o
diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
new file mode 100644
index 0000000000..bfc542b5fd
--- /dev/null
+++ b/io/mpqemu-link.c
@@ -0,0 +1,151 @@
+/*
+ * Communication channel between QEMU and remote device process
+ *
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "qemu/module.h"
+#include "io/mpqemu-link.h"
+#include "qapi/error.h"
+#include "qemu/iov.h"
+#include "qemu/error-report.h"
+
+void mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc)
+{
+    Error *local_err = NULL;
+    struct iovec send[2];
+    int *fds = NULL;
+    size_t nfds = 0;
+
+    send[0].iov_base = msg;
+    send[0].iov_len = MPQEMU_MSG_HDR_SIZE;
+
+    send[1].iov_base = msg->bytestream ? msg->data2 : (void *)&msg->data1;
+    send[1].iov_len = msg->size;
+
+    if (msg->num_fds) {
+        nfds = msg->num_fds;
+        fds = msg->fds;
+    }
+
+    (void)qio_channel_writev_full_all(ioc, send, G_N_ELEMENTS(send), fds, nfds,
+                                      &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+    }
+}
+
+static int mpqemu_readv(QIOChannel *ioc, struct iovec *iov, int **fds,
+                        size_t *nfds, Error **errp)
+{
+    size_t size, len;
+
+    size = iov->iov_len;
+
+    while (size > 0) {
+        len = qio_channel_readv_full(ioc, iov, 1, fds, nfds, errp);
+
+        if (len == QIO_CHANNEL_ERR_BLOCK) {
+            if (qemu_in_coroutine()) {
+                qio_channel_yield(ioc, G_IO_IN);
+            } else {
+                qio_channel_wait(ioc, G_IO_IN);
+            }
+            continue;
+        }
+
+        if (len <= 0) {
+            return -EIO;
+        }
+
+        size -= len;
+    }
+
+    return iov->iov_len;
+}
+
+int mpqemu_msg_recv(MPQemuMsg *msg, QIOChannel *ioc)
+{
+    Error *local_err = NULL;
+    int *fds = NULL;
+    struct iovec hdr, data;
+    size_t nfds = 0;
+
+    hdr.iov_base = g_malloc0(MPQEMU_MSG_HDR_SIZE);
+    hdr.iov_len = MPQEMU_MSG_HDR_SIZE;
+
+    if (mpqemu_readv(ioc, &hdr, &fds, &nfds, &local_err) < 0) {
+        return -EIO;
+    }
+
+    memcpy(msg, hdr.iov_base, hdr.iov_len);
+
+    free(hdr.iov_base);
+    if (msg->size > MPQEMU_MSG_DATA_MAX) {
+        error_report("The message size is more than MPQEMU_MSG_DATA_MAX %d",
+                     MPQEMU_MSG_DATA_MAX);
+        return -EINVAL;
+    }
+
+    data.iov_base = g_malloc0(msg->size);
+    data.iov_len = msg->size;
+
+    if (mpqemu_readv(ioc, &data, NULL, NULL, &local_err) < 0) {
+        return -EIO;
+    }
+
+    if (msg->bytestream) {
+        msg->data2 = calloc(1, msg->size);
+        memcpy(msg->data2, data.iov_base, msg->size);
+    } else {
+        memcpy((void *)&msg->data1, data.iov_base, msg->size);
+    }
+
+    free(data.iov_base);
+
+    if (nfds) {
+        msg->num_fds = nfds;
+        memcpy(msg->fds, fds, nfds * sizeof(int));
+    }
+
+    return 0;
+}
+
+bool mpqemu_msg_valid(MPQemuMsg *msg)
+{
+    if (msg->cmd >= MAX && msg->cmd < 0) {
+        return false;
+    }
+
+    if (msg->bytestream) {
+        if (!msg->data2) {
+            return false;
+        }
+    } else {
+        if (msg->data2) {
+            return false;
+        }
+    }
+
+    /* Verify FDs. */
+    if (msg->num_fds >= REMOTE_MAX_FDS) {
+        return false;
+    }
+
+    if (msg->num_fds > 0) {
+        for (int i = 0; i < msg->num_fds; i++) {
+            if (fcntl(msg->fds[i], F_GETFL) == -1) {
+                return false;
+            }
+        }
+    }
+
+    return true;
+}
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 07/21] multi-process: add co-routines to communicate with remote
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (5 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 06/21] multi-process: define MPQemuMsg format and transmission functions elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-06-30 18:31   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 08/21] multi-process: Initialize communication channel at the remote end elena.ufimtseva
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

process to avoid blocking the main loop during the message exchanges.
To be used by proxy device.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
 include/io/mpqemu-link.h | 16 +++++++++
 io/mpqemu-link.c         | 78 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+)

diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
index 1542e8ed07..52aa89656c 100644
--- a/include/io/mpqemu-link.h
+++ b/include/io/mpqemu-link.h
@@ -17,6 +17,7 @@
 #include "qom/object.h"
 #include "qemu/thread.h"
 #include "io/channel.h"
+#include "io/channel-socket.h"
 
 #define REMOTE_MAX_FDS 8
 
@@ -30,6 +31,7 @@
  */
 typedef enum {
     INIT = 0,
+    RET_MSG,
     MAX = INT_MAX,
 } MPQemuCmd;
 
@@ -67,6 +69,20 @@ typedef struct {
     uint8_t *data2;
 } MPQemuMsg;
 
+struct MPQemuRequest {
+    MPQemuMsg *msg;
+    QIOChannelSocket *sioc;
+    Coroutine *co;
+    bool finished;
+    int error;
+    long ret;
+};
+
+typedef struct MPQemuRequest MPQemuRequest;
+
+uint64_t mpqemu_msg_send_reply_co(MPQemuMsg *msg, QIOChannel *ioc,
+                                  Error **errp);
+
 void mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc);
 int mpqemu_msg_recv(MPQemuMsg *msg, QIOChannel *ioc);
 
diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
index bfc542b5fd..c430b4d6a2 100644
--- a/io/mpqemu-link.c
+++ b/io/mpqemu-link.c
@@ -16,6 +16,8 @@
 #include "qapi/error.h"
 #include "qemu/iov.h"
 #include "qemu/error-report.h"
+#include "qemu/main-loop.h"
+#include "io/channel-socket.h"
 
 void mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc)
 {
@@ -118,6 +120,82 @@ int mpqemu_msg_recv(MPQemuMsg *msg, QIOChannel *ioc)
     return 0;
 }
 
+/* Use in proxy only as it clobbers fd handlers. */
+static void coroutine_fn mpqemu_msg_send_co(void *data)
+{
+    MPQemuRequest *req = (MPQemuRequest *)data;
+    MPQemuMsg msg_reply = {0};
+    long ret = -EINVAL;
+
+    if (!req->sioc) {
+        error_report("No channel available to send command %d",
+                     req->msg->cmd);
+        atomic_mb_set(&req->finished, true);
+        req->error = -EINVAL;
+        return;
+    }
+
+    req->co = qemu_coroutine_self();
+    mpqemu_msg_send(req->msg, QIO_CHANNEL(req->sioc));
+
+    yield_until_fd_readable(req->sioc->fd);
+
+    ret = mpqemu_msg_recv(&msg_reply, QIO_CHANNEL(req->sioc));
+    if (ret < 0) {
+        error_report("ERROR: failed to get a reply for command %d, \
+                     errno %s, ret is %ld",
+                     req->msg->cmd, strerror(errno), ret);
+        req->error = -errno;
+    } else {
+        if (!mpqemu_msg_valid(&msg_reply) || msg_reply.cmd != RET_MSG) {
+            error_report("ERROR: Invalid reply received for command %d",
+                         req->msg->cmd);
+            req->error = -EINVAL;
+        } else {
+            req->ret = msg_reply.data1.u64;
+        }
+    }
+    atomic_mb_set(&req->finished, true);
+}
+
+/*
+ * Create if needed and enter co-routine to send the message to the
+ * remote channel ioc and wait for the reply.
+ * Resturns the value from the reply message, sets the error on failure.
+ */
+
+uint64_t mpqemu_msg_send_reply_co(MPQemuMsg *msg, QIOChannel *ioc,
+                                  Error **errp)
+{
+    MPQemuRequest req = {0};
+    uint64_t ret = UINT64_MAX;
+
+    req.sioc = QIO_CHANNEL_SOCKET(ioc);
+    if (!req.sioc) {
+        return ret;
+    }
+
+    req.msg = msg;
+    req.ret = 0;
+    req.finished = false;
+
+    if (!req.co) {
+        req.co = qemu_coroutine_create(mpqemu_msg_send_co, &req);
+    }
+
+    qemu_coroutine_enter(req.co);
+    while (!req.finished) {
+        aio_poll(qemu_get_aio_context(), false);
+    }
+    if (req.error) {
+        error_setg(errp, "Error exchanging message with remote process, "\
+                        "socket %d, error %d", req.sioc->fd, req.error);
+    }
+    ret = req.ret;
+
+    return ret;
+}
+
 bool mpqemu_msg_valid(MPQemuMsg *msg)
 {
     if (msg->cmd >= MAX && msg->cmd < 0) {
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 08/21] multi-process: Initialize communication channel at the remote end
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (6 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 07/21] multi-process: add co-routines to communicate with remote elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-01  6:44   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 09/21] multi-process: Initialize message handler in remote device elena.ufimtseva
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

Initialize the common QIOChannel at remote the end

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/i386/remote.c         | 16 ++++++++++++++++
 include/hw/i386/remote.h |  2 ++
 2 files changed, 18 insertions(+)

diff --git a/hw/i386/remote.c b/hw/i386/remote.c
index 4d13abe9f3..1a1becffe0 100644
--- a/hw/i386/remote.c
+++ b/hw/i386/remote.c
@@ -15,6 +15,7 @@
 #include "exec/address-spaces.h"
 #include "exec/memory.h"
 #include "qapi/error.h"
+#include "io/channel-util.h"
 
 static void remote_machine_init(MachineState *machine)
 {
@@ -42,6 +43,20 @@ static void remote_machine_init(MachineState *machine)
     qdev_realize(DEVICE(rem_host), sysbus_get_default(), &error_fatal);
 }
 
+static void remote_set_socket(Object *obj, const char *str, Error **errp)
+{
+    RemMachineState *s = REMOTE_MACHINE(obj);
+    Error *local_err = NULL;
+    int fd = atoi(str);
+
+    s->ioc = qio_channel_new_fd(fd, &local_err);
+}
+
+static void remote_instance_init(Object *obj)
+{
+    object_property_add_str(obj, "socket", NULL, remote_set_socket);
+}
+
 static void remote_machine_class_init(ObjectClass *oc, void *data)
 {
     MachineClass *mc = MACHINE_CLASS(oc);
@@ -53,6 +68,7 @@ static const TypeInfo remote_machine = {
     .name = TYPE_REMOTE_MACHINE,
     .parent = TYPE_MACHINE,
     .instance_size = sizeof(RemMachineState),
+    .instance_init = remote_instance_init,
     .class_init = remote_machine_class_init,
 };
 
diff --git a/include/hw/i386/remote.h b/include/hw/i386/remote.h
index d118a940be..0f8b861e7a 100644
--- a/include/hw/i386/remote.h
+++ b/include/hw/i386/remote.h
@@ -17,11 +17,13 @@
 #include "qom/object.h"
 #include "hw/boards.h"
 #include "hw/pci-host/remote.h"
+#include "io/channel.h"
 
 typedef struct RemMachineState {
     MachineState parent_obj;
 
     RemotePCIHost *host;
+    QIOChannel *ioc;
 } RemMachineState;
 
 #define TYPE_REMOTE_MACHINE "remote-machine"
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 09/21] multi-process: Initialize message handler in remote device
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (7 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 08/21] multi-process: Initialize communication channel at the remote end elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-01  6:53   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 10/21] multi-process: setup memory manager for " elena.ufimtseva
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Jagannathan Raman <jag.raman@oracle.com>

Initializes the message handler function in the remote process. It is
called whenever there's an event pending on QIOChannel that registers
this function.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 MAINTAINERS              |  1 +
 hw/i386/Makefile.objs    |  1 +
 hw/i386/remote-msg.c     | 52 ++++++++++++++++++++++++++++++++++++++++
 hw/i386/remote.c         |  4 ++++
 include/hw/i386/remote.h |  3 +++
 5 files changed, 61 insertions(+)
 create mode 100644 hw/i386/remote-msg.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 50a5fc53d6..e204b3e0c6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2949,6 +2949,7 @@ F: hw/i386/remote.c
 F: include/hw/i386/remote.h
 F: io/mpqemu-link.c
 F: include/io/mpqemu-link.h
+F: hw/i386/remote-msg.c
 
 Build and test automation
 -------------------------
diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
index 5ead266a15..83969585c1 100644
--- a/hw/i386/Makefile.objs
+++ b/hw/i386/Makefile.objs
@@ -15,6 +15,7 @@ obj-$(CONFIG_VMPORT) += vmport.o
 obj-$(CONFIG_VMMOUSE) += vmmouse.o
 obj-$(CONFIG_PC) += port92.o
 obj-$(CONFIG_MPQEMU) += remote.o
+obj-$(CONFIG_MPQEMU) += remote-msg.o
 
 obj-y += kvmvapic.o
 obj-$(CONFIG_ACPI) += acpi-common.o
diff --git a/hw/i386/remote-msg.c b/hw/i386/remote-msg.c
new file mode 100644
index 0000000000..58e24ab2ad
--- /dev/null
+++ b/hw/i386/remote-msg.c
@@ -0,0 +1,52 @@
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/i386/remote.h"
+#include "io/channel.h"
+#include "io/mpqemu-link.h"
+#include "qapi/error.h"
+#include "sysemu/runstate.h"
+
+gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
+                            gpointer opaque)
+{
+    Error *local_err = NULL;
+    MPQemuMsg msg = { 0 };
+
+    if (cond & G_IO_HUP) {
+        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
+    }
+
+    if (cond & (G_IO_ERR | G_IO_NVAL)) {
+        error_setg(&local_err, "Error %d while processing message from proxy \
+                   in remote process pid=%d", errno, getpid());
+        return FALSE;
+    }
+
+    if (mpqemu_msg_recv(&msg, ioc) < 0) {
+        return FALSE;
+    }
+
+    if (!mpqemu_msg_valid(&msg)) {
+        error_report("Received invalid message from proxy \
+                     in remote process pid=%d", getpid());
+        return TRUE;
+    }
+
+    switch (msg.cmd) {
+    default:
+        error_setg(&local_err, "Unknown command (%d) received from proxy \
+                   in remote process pid=%d", msg.cmd, getpid());
+    }
+
+    if (msg.data2) {
+        free(msg.data2);
+    }
+
+    if (local_err) {
+        error_report_err(local_err);
+        return FALSE;
+    }
+
+    return TRUE;
+}
diff --git a/hw/i386/remote.c b/hw/i386/remote.c
index 1a1becffe0..5342e884ad 100644
--- a/hw/i386/remote.c
+++ b/hw/i386/remote.c
@@ -16,6 +16,7 @@
 #include "exec/memory.h"
 #include "qapi/error.h"
 #include "io/channel-util.h"
+#include "io/channel.h"
 
 static void remote_machine_init(MachineState *machine)
 {
@@ -50,6 +51,9 @@ static void remote_set_socket(Object *obj, const char *str, Error **errp)
     int fd = atoi(str);
 
     s->ioc = qio_channel_new_fd(fd, &local_err);
+
+    qio_channel_add_watch(s->ioc, G_IO_IN | G_IO_HUP | G_IO_ERR | G_IO_NVAL,
+                          mpqemu_process_msg, NULL, NULL);
 }
 
 static void remote_instance_init(Object *obj)
diff --git a/include/hw/i386/remote.h b/include/hw/i386/remote.h
index 0f8b861e7a..c3890e57ab 100644
--- a/include/hw/i386/remote.h
+++ b/include/hw/i386/remote.h
@@ -30,4 +30,7 @@ typedef struct RemMachineState {
 #define REMOTE_MACHINE(obj) \
     OBJECT_CHECK(RemMachineState, (obj), TYPE_REMOTE_MACHINE)
 
+gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
+                            gpointer opaque);
+
 #endif
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 10/21] multi-process: setup memory manager for remote device
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (8 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 09/21] multi-process: Initialize message handler in remote device elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-01  7:58   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 11/21] multi-process: introduce proxy object elena.ufimtseva
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Jagannathan Raman <jag.raman@oracle.com>

SyncSysMemMsg message format is defined. It is used to send
file descriptors of the RAM regions to remote device.
RAM on the remote device is configured with a set of file descriptors.
Old RAM regions are deleted and new regions, each with an fd, is
added to the RAM.

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---
 MAINTAINERS                     |  2 ++
 hw/i386/Makefile.objs           |  1 +
 hw/i386/remote-memory.c         | 58 +++++++++++++++++++++++++++++++++
 include/hw/i386/remote-memory.h | 20 ++++++++++++
 include/io/mpqemu-link.h        | 13 ++++++++
 io/mpqemu-link.c                | 13 ++++++++
 6 files changed, 107 insertions(+)
 create mode 100644 hw/i386/remote-memory.c
 create mode 100644 include/hw/i386/remote-memory.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e204b3e0c6..017c96eace 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2950,6 +2950,8 @@ F: include/hw/i386/remote.h
 F: io/mpqemu-link.c
 F: include/io/mpqemu-link.h
 F: hw/i386/remote-msg.c
+F: include/hw/i386/remote-memory.h
+F: hw/i386/remote-memory.c
 
 Build and test automation
 -------------------------
diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
index 83969585c1..d85537713e 100644
--- a/hw/i386/Makefile.objs
+++ b/hw/i386/Makefile.objs
@@ -16,6 +16,7 @@ obj-$(CONFIG_VMMOUSE) += vmmouse.o
 obj-$(CONFIG_PC) += port92.o
 obj-$(CONFIG_MPQEMU) += remote.o
 obj-$(CONFIG_MPQEMU) += remote-msg.o
+obj-$(CONFIG_MPQEMU) += remote-memory.o
 
 obj-y += kvmvapic.o
 obj-$(CONFIG_ACPI) += acpi-common.o
diff --git a/hw/i386/remote-memory.c b/hw/i386/remote-memory.c
new file mode 100644
index 0000000000..4fad426bbb
--- /dev/null
+++ b/hw/i386/remote-memory.c
@@ -0,0 +1,58 @@
+/*
+ * Memory manager for remote device
+ *
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/i386/remote-memory.h"
+#include "exec/address-spaces.h"
+#include "exec/ram_addr.h"
+#include "qapi/error.h"
+
+void remote_sysmem_reconfig(MPQemuMsg *msg, Error **errp)
+{
+    SyncSysmemMsg *sysmem_info = &msg->data1.sync_sysmem;
+    MemoryRegion *sysmem, *subregion, *next;
+    static unsigned int suffix;
+    Error *local_err = NULL;
+    char *name;
+    int region;
+
+    sysmem = get_system_memory();
+
+    memory_region_transaction_begin();
+
+    QTAILQ_FOREACH_SAFE(subregion, &sysmem->subregions, subregions_link, next) {
+        if (subregion->ram) {
+            memory_region_del_subregion(sysmem, subregion);
+            qemu_ram_free(subregion->ram_block);
+        }
+    }
+
+    for (region = 0; region < msg->num_fds; region++) {
+        subregion = g_new(MemoryRegion, 1);
+        name = g_strdup_printf("remote-mem-%u", suffix++);
+        memory_region_init_ram_from_fd(subregion, OBJECT(sysmem),
+                                       name, sysmem_info->sizes[region],
+                                       RAM_SHARED, msg->fds[region],
+                                       sysmem_info->offsets[region],
+                                       &local_err);
+        g_free(name);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            break;
+        }
+
+        memory_region_add_subregion(sysmem, sysmem_info->gpas[region],
+                                    subregion);
+    }
+
+    memory_region_transaction_commit();
+}
diff --git a/include/hw/i386/remote-memory.h b/include/hw/i386/remote-memory.h
new file mode 100644
index 0000000000..e2e479bb6f
--- /dev/null
+++ b/include/hw/i386/remote-memory.h
@@ -0,0 +1,20 @@
+/*
+ * Memory manager for remote device
+ *
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef REMOTE_MEMORY_H
+#define REMOTE_MEMORY_H
+
+#include "qemu/osdep.h"
+#include "exec/hwaddr.h"
+#include "io/mpqemu-link.h"
+
+void remote_sysmem_reconfig(MPQemuMsg *msg, Error **errp);
+
+#endif
diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
index 52aa89656c..c6d2b6bf8b 100644
--- a/include/io/mpqemu-link.h
+++ b/include/io/mpqemu-link.h
@@ -18,6 +18,8 @@
 #include "qemu/thread.h"
 #include "io/channel.h"
 #include "io/channel-socket.h"
+#include "exec/cpu-common.h"
+#include "exec/hwaddr.h"
 
 #define REMOTE_MAX_FDS 8
 
@@ -28,13 +30,22 @@
  *
  * MPQemuCmd enum type to specify the command to be executed on the remote
  * device.
+ *
+ * SYNC_SYSMEM      Shares QEMU's RAM with remote device's RAM
  */
 typedef enum {
     INIT = 0,
+    SYNC_SYSMEM,
     RET_MSG,
     MAX = INT_MAX,
 } MPQemuCmd;
 
+typedef struct {
+    hwaddr gpas[REMOTE_MAX_FDS];
+    uint64_t sizes[REMOTE_MAX_FDS];
+    off_t offsets[REMOTE_MAX_FDS];
+} SyncSysmemMsg;
+
 /**
  * Maximum size of data2 field in the message to be transmitted.
  */
@@ -53,6 +64,7 @@ typedef enum {
  * MPQemuMsg Format of the message sent to the remote device from QEMU.
  *
  */
+
 typedef struct {
     int cmd;
     int bytestream;
@@ -60,6 +72,7 @@ typedef struct {
 
     union {
         uint64_t u64;
+        SyncSysmemMsg sync_sysmem;
     } data1;
 
     int fds[REMOTE_MAX_FDS];
diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
index c430b4d6a2..5887c8c6c0 100644
--- a/io/mpqemu-link.c
+++ b/io/mpqemu-link.c
@@ -224,6 +224,19 @@ bool mpqemu_msg_valid(MPQemuMsg *msg)
             }
         }
     }
+     /* Verify message specific fields. */
+    switch (msg->cmd) {
+    case SYNC_SYSMEM:
+        if (msg->num_fds == 0 || msg->bytestream) {
+            return false;
+        }
+        if (msg->size != sizeof(msg->data1)) {
+            return false;
+        }
+        break;
+    default:
+        break;
+    }
 
     return true;
 }
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 11/21] multi-process: introduce proxy object
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (9 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 10/21] multi-process: setup memory manager for " elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-01  8:58   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 12/21] multi-process: Connect Proxy Object with device in the remote process elena.ufimtseva
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

Defines a PCI Device proxy object as a child of TYPE_PCI_DEVICE.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
 MAINTAINERS            |  2 ++
 hw/pci/Makefile.objs   |  1 +
 hw/pci/proxy.c         | 70 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci/proxy.h | 43 ++++++++++++++++++++++++++
 4 files changed, 116 insertions(+)
 create mode 100644 hw/pci/proxy.c
 create mode 100644 include/hw/pci/proxy.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 017c96eace..b48c3114c1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2952,6 +2952,8 @@ F: include/io/mpqemu-link.h
 F: hw/i386/remote-msg.c
 F: include/hw/i386/remote-memory.h
 F: hw/i386/remote-memory.c
+F: hw/pci/proxy.c
+F: include/hw/pci/proxy.h
 
 Build and test automation
 -------------------------
diff --git a/hw/pci/Makefile.objs b/hw/pci/Makefile.objs
index c78f2fb24b..515dda506c 100644
--- a/hw/pci/Makefile.objs
+++ b/hw/pci/Makefile.objs
@@ -12,3 +12,4 @@ common-obj-$(CONFIG_PCI_EXPRESS) += pcie_port.o pcie_host.o
 
 common-obj-$(call lnot,$(CONFIG_PCI)) += pci-stub.o
 common-obj-$(CONFIG_ALL) += pci-stub.o
+obj-$(CONFIG_MPQEMU) += proxy.o
diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
new file mode 100644
index 0000000000..6d62399c52
--- /dev/null
+++ b/hw/pci/proxy.c
@@ -0,0 +1,70 @@
+/*
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/pci/proxy.h"
+#include "hw/pci/pci.h"
+#include "qapi/error.h"
+#include "io/channel-util.h"
+#include "hw/qdev-properties.h"
+#include "monitor/monitor.h"
+
+static void proxy_set_socket(PCIProxyDev *pdev, int fd, Error **errp)
+{
+    pdev->com = qio_channel_new_fd(fd, errp);
+}
+
+static Property proxy_properties[] = {
+    DEFINE_PROP_STRING("fd", PCIProxyDev, fd),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
+{
+    PCIProxyDev *dev = PCI_PROXY_DEV(device);
+    int proxyfd;
+
+    if (dev->fd) {
+        proxyfd = monitor_fd_param(cur_mon, dev->fd, errp);
+        if (proxyfd == -1) {
+            error_prepend(errp, "proxy: unable to parse proxyfd: ");
+            return;
+        }
+        proxy_set_socket(dev, proxyfd, errp);
+    }
+}
+
+static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+
+    k->realize = pci_proxy_dev_realize;
+    device_class_set_props(dc, proxy_properties);
+}
+
+static const TypeInfo pci_proxy_dev_type_info = {
+    .name          = TYPE_PCI_PROXY_DEV,
+    .parent        = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(PCIProxyDev),
+    .class_size    = sizeof(PCIProxyDevClass),
+    .class_init    = pci_proxy_dev_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
+        { },
+    },
+};
+
+static void pci_proxy_dev_register_types(void)
+{
+    type_register_static(&pci_proxy_dev_type_info);
+}
+
+type_init(pci_proxy_dev_register_types)
diff --git a/include/hw/pci/proxy.h b/include/hw/pci/proxy.h
new file mode 100644
index 0000000000..c1c7142fa2
--- /dev/null
+++ b/include/hw/pci/proxy.h
@@ -0,0 +1,43 @@
+/*
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PROXY_H
+#define PROXY_H
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/pci/pci.h"
+#include "io/channel.h"
+
+#define TYPE_PCI_PROXY_DEV "pci-proxy-dev"
+
+#define PCI_PROXY_DEV(obj) \
+            OBJECT_CHECK(PCIProxyDev, (obj), TYPE_PCI_PROXY_DEV)
+
+#define PCI_PROXY_DEV_CLASS(klass) \
+            OBJECT_CLASS_CHECK(PCIProxyDevClass, (klass), TYPE_PCI_PROXY_DEV)
+
+#define PCI_PROXY_DEV_GET_CLASS(obj) \
+            OBJECT_GET_CLASS(PCIProxyDevClass, (obj), TYPE_PCI_PROXY_DEV)
+
+typedef struct PCIProxyDev {
+    PCIDevice parent_dev;
+    char *fd;
+    QIOChannel *com;
+} PCIProxyDev;
+
+typedef struct PCIProxyDevClass {
+    PCIDeviceClass parent_class;
+
+    void (*realize)(PCIProxyDev *dev, Error **errp);
+
+    char *command;
+} PCIProxyDevClass;
+
+#endif /* PROXY_H */
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 12/21] multi-process: Connect Proxy Object with device in the remote process
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (10 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 11/21] multi-process: introduce proxy object elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-01  9:20   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 13/21] multi-process: Forward PCI config space acceses to " elena.ufimtseva
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Jagannathan Raman <jag.raman@oracle.com>

Send a message to the remote process to connect PCI device with the
corresponding Proxy object in QEMU

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/i386/remote-msg.c     | 39 +++++++++++++++++++++++++++++++++++++++
 hw/pci/proxy.c           | 28 ++++++++++++++++++++++++++++
 include/hw/pci/proxy.h   |  1 +
 include/io/mpqemu-link.h |  1 +
 io/mpqemu-link.c         |  8 ++++++++
 5 files changed, 77 insertions(+)

diff --git a/hw/i386/remote-msg.c b/hw/i386/remote-msg.c
index 58e24ab2ad..68f50866bb 100644
--- a/hw/i386/remote-msg.c
+++ b/hw/i386/remote-msg.c
@@ -6,6 +6,11 @@
 #include "io/mpqemu-link.h"
 #include "qapi/error.h"
 #include "sysemu/runstate.h"
+#include "io/channel-util.h"
+#include "hw/pci/pci.h"
+
+static void process_connect_dev_msg(MPQemuMsg *msg, QIOChannel *com,
+                                    Error **errp);
 
 gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
                             gpointer opaque)
@@ -34,6 +39,9 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
     }
 
     switch (msg.cmd) {
+    case CONNECT_DEV:
+        process_connect_dev_msg(&msg, ioc, &local_err);
+        break;
     default:
         error_setg(&local_err, "Unknown command (%d) received from proxy \
                    in remote process pid=%d", msg.cmd, getpid());
@@ -50,3 +58,34 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
 
     return TRUE;
 }
+
+static void process_connect_dev_msg(MPQemuMsg *msg, QIOChannel *com,
+                                    Error **errp)
+{
+    char *devid = (char *)msg->data2;
+    QIOChannel *dioc = NULL;
+    DeviceState *dev = NULL;
+    MPQemuMsg ret = { 0 };
+    int rc = 0;
+
+    g_assert(devid && (devid[msg->size - 1] == '\0'));
+
+    dev = qdev_find_recursive(sysbus_get_default(), devid);
+    if (!dev || !object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
+        rc = 0xff;
+        goto exit;
+    }
+
+    dioc = qio_channel_new_fd(msg->fds[0], errp);
+
+    qio_channel_add_watch(dioc, G_IO_IN | G_IO_HUP, mpqemu_process_msg,
+                          (void *)dev, NULL);
+
+exit:
+    ret.cmd = RET_MSG;
+    ret.bytestream = 0;
+    ret.data1.u64 = rc;
+    ret.size = sizeof(ret.data1);
+
+    mpqemu_msg_send(&ret, com);
+}
diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
index 6d62399c52..16649ed0ec 100644
--- a/hw/pci/proxy.c
+++ b/hw/pci/proxy.c
@@ -15,10 +15,38 @@
 #include "io/channel-util.h"
 #include "hw/qdev-properties.h"
 #include "monitor/monitor.h"
+#include "io/mpqemu-link.h"
 
 static void proxy_set_socket(PCIProxyDev *pdev, int fd, Error **errp)
 {
+    DeviceState *dev = DEVICE(pdev);
+    MPQemuMsg msg = { 0 };
+    int fds[2];
+    Error *local_err = NULL;
+
     pdev->com = qio_channel_new_fd(fd, errp);
+
+    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds)) {
+        error_setg(errp, "Failed to create proxy channel with fd %d", fd);
+        return;
+    }
+
+    msg.cmd = CONNECT_DEV;
+    msg.bytestream = 1;
+    msg.data2 = (uint8_t *)dev->id;
+    msg.size = strlen(dev->id) + 1;
+    msg.num_fds = 1;
+    msg.fds[0] = fds[1];
+
+    (void)mpqemu_msg_send_reply_co(&msg, pdev->com, &local_err);
+    if (local_err) {
+        error_setg(errp, "Failed to send DEV_CONNECT to the remote process");
+        close(fds[0]);
+    } else {
+        pdev->dev = qio_channel_new_fd(fds[0], errp);
+    }
+
+    close(fds[1]);
 }
 
 static Property proxy_properties[] = {
diff --git a/include/hw/pci/proxy.h b/include/hw/pci/proxy.h
index c1c7142fa2..72dd7e0944 100644
--- a/include/hw/pci/proxy.h
+++ b/include/hw/pci/proxy.h
@@ -30,6 +30,7 @@ typedef struct PCIProxyDev {
     PCIDevice parent_dev;
     char *fd;
     QIOChannel *com;
+    QIOChannel *dev;
 } PCIProxyDev;
 
 typedef struct PCIProxyDevClass {
diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
index c6d2b6bf8b..d620806c17 100644
--- a/include/io/mpqemu-link.h
+++ b/include/io/mpqemu-link.h
@@ -36,6 +36,7 @@
 typedef enum {
     INIT = 0,
     SYNC_SYSMEM,
+    CONNECT_DEV,
     RET_MSG,
     MAX = INT_MAX,
 } MPQemuCmd;
diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
index 5887c8c6c0..54df3b254e 100644
--- a/io/mpqemu-link.c
+++ b/io/mpqemu-link.c
@@ -234,6 +234,14 @@ bool mpqemu_msg_valid(MPQemuMsg *msg)
             return false;
         }
         break;
+    case CONNECT_DEV:
+        if ((msg->num_fds != 1) ||
+            (msg->fds[0] == -1) ||
+            (msg->fds[0] == -1) ||
+            !msg->bytestream) {
+            return false;
+        }
+        break;
     default:
         break;
     }
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 13/21] multi-process: Forward PCI config space acceses to the remote process
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (11 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 12/21] multi-process: Connect Proxy Object with device in the remote process elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-01  9:40   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 14/21] multi-process: PCI BAR read/write handling for proxy & remote endpoints elena.ufimtseva
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

The Proxy Object sends the PCI config space accesses as messages
to the remote process over the communication channel

TODO:
Investigate if the local PCI config writes can be dropped.
Without the proxy local PCI config space writes for the device,
the driver in the guest times out on the probing.
We have tried to only refer to the remote for the PCI config writes,
but the driver timeout in the guest forced as to left this
as it is (removing local PCI config only).

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
 hw/i386/remote-msg.c     | 54 ++++++++++++++++++++++++++++++++++++++++
 hw/pci/proxy.c           | 54 ++++++++++++++++++++++++++++++++++++++++
 include/io/mpqemu-link.h |  8 ++++++
 io/mpqemu-link.c         | 14 +++++++++++
 4 files changed, 130 insertions(+)

diff --git a/hw/i386/remote-msg.c b/hw/i386/remote-msg.c
index 68f50866bb..aa5780d521 100644
--- a/hw/i386/remote-msg.c
+++ b/hw/i386/remote-msg.c
@@ -11,10 +11,16 @@
 
 static void process_connect_dev_msg(MPQemuMsg *msg, QIOChannel *com,
                                     Error **errp);
+static void process_config_write(QIOChannel *ioc, PCIDevice *dev,
+                                 MPQemuMsg *msg);
+static void process_config_read(QIOChannel *ioc, PCIDevice *dev,
+                                MPQemuMsg *msg);
 
 gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
                             gpointer opaque)
 {
+    DeviceState *dev = (DeviceState *)opaque;
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
     Error *local_err = NULL;
     MPQemuMsg msg = { 0 };
 
@@ -42,6 +48,12 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
     case CONNECT_DEV:
         process_connect_dev_msg(&msg, ioc, &local_err);
         break;
+    case PCI_CONFIG_WRITE:
+        process_config_write(ioc, pci_dev, &msg);
+        break;
+    case PCI_CONFIG_READ:
+        process_config_read(ioc, pci_dev, &msg);
+        break;
     default:
         error_setg(&local_err, "Unknown command (%d) received from proxy \
                    in remote process pid=%d", msg.cmd, getpid());
@@ -89,3 +101,45 @@ exit:
 
     mpqemu_msg_send(&ret, com);
 }
+
+static void process_config_write(QIOChannel *ioc, PCIDevice *dev,
+                                 MPQemuMsg *msg)
+{
+    struct conf_data_msg *conf = (struct conf_data_msg *)msg->data2;
+    MPQemuMsg ret = { 0 };
+
+    if (conf->addr >= PCI_CFG_SPACE_EXP_SIZE) {
+        error_report("Bad address received when writing PCI config, pid %d",
+                     getpid());
+        ret.data1.u64 = UINT64_MAX;
+    } else {
+        pci_default_write_config(dev, conf->addr, conf->val, conf->l);
+    }
+
+    ret.cmd = RET_MSG;
+    ret.bytestream = 0;
+    ret.size = sizeof(ret.data1);
+
+    mpqemu_msg_send(&ret, ioc);
+}
+
+static void process_config_read(QIOChannel *ioc, PCIDevice *dev,
+                                MPQemuMsg *msg)
+{
+    struct conf_data_msg *conf = (struct conf_data_msg *)msg->data2;
+    MPQemuMsg ret = { 0 };
+
+    if (conf->addr >= PCI_CFG_SPACE_EXP_SIZE) {
+        error_report("Bad address received when reading PCI config, pid %d",
+                     getpid());
+        ret.data1.u64 = UINT64_MAX;
+    } else {
+        ret.data1.u64 = pci_default_read_config(dev, conf->addr, conf->l);
+    }
+
+    ret.cmd = RET_MSG;
+    ret.bytestream = 0;
+    ret.size = sizeof(ret.data1);
+
+    mpqemu_msg_send(&ret, ioc);
+}
diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
index 16649ed0ec..8934070a20 100644
--- a/hw/pci/proxy.c
+++ b/hw/pci/proxy.c
@@ -16,6 +16,7 @@
 #include "hw/qdev-properties.h"
 #include "monitor/monitor.h"
 #include "io/mpqemu-link.h"
+#include "qemu/error-report.h"
 
 static void proxy_set_socket(PCIProxyDev *pdev, int fd, Error **errp)
 {
@@ -69,12 +70,65 @@ static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
     }
 }
 
+static int config_op_send(PCIProxyDev *pdev, uint32_t addr, uint32_t *val,
+                          int l, unsigned int op)
+{
+    struct conf_data_msg conf_data;
+    MPQemuMsg msg = { 0 };
+    long ret = -EINVAL;
+    Error *local_err = NULL;
+
+    conf_data.addr = addr;
+    conf_data.val = (op == PCI_CONFIG_WRITE) ? *val : 0;
+    conf_data.l = l;
+
+    msg.data2 = (uint8_t *)&conf_data;
+
+    msg.size = sizeof(conf_data);
+    msg.cmd = op;
+    msg.bytestream = 1;
+
+    ret = mpqemu_msg_send_reply_co(&msg, pdev->dev, &local_err);
+    if (local_err) {
+        error_report("Failed to exchange PCI_CONFIG message with remote");
+    }
+    if (op == PCI_CONFIG_READ) {
+        *val = (uint32_t)ret;
+    }
+
+    return ret;
+}
+
+static uint32_t pci_proxy_read_config(PCIDevice *d, uint32_t addr, int len)
+{
+    uint32_t val;
+
+    (void)config_op_send(PCI_PROXY_DEV(d), addr, &val, len, PCI_CONFIG_READ);
+
+    return val;
+}
+
+static void pci_proxy_write_config(PCIDevice *d, uint32_t addr, uint32_t val,
+                                   int l)
+{
+    /*
+     * Some of the functions access the copy of the remote device
+     * PCI config space, therefore maintain it updated.
+     */
+    pci_default_write_config(d, addr, val, l);
+
+    (void)config_op_send(PCI_PROXY_DEV(d), addr, &val, l, PCI_CONFIG_WRITE);
+}
+
 static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
     PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
 
     k->realize = pci_proxy_dev_realize;
+    k->config_read = pci_proxy_read_config;
+    k->config_write = pci_proxy_write_config;
+
     device_class_set_props(dc, proxy_properties);
 }
 
diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
index d620806c17..bbd6f3db44 100644
--- a/include/io/mpqemu-link.h
+++ b/include/io/mpqemu-link.h
@@ -38,6 +38,8 @@ typedef enum {
     SYNC_SYSMEM,
     CONNECT_DEV,
     RET_MSG,
+    PCI_CONFIG_WRITE,
+    PCI_CONFIG_READ,
     MAX = INT_MAX,
 } MPQemuCmd;
 
@@ -47,6 +49,12 @@ typedef struct {
     off_t offsets[REMOTE_MAX_FDS];
 } SyncSysmemMsg;
 
+struct conf_data_msg {
+    uint32_t addr;
+    uint32_t val;
+    int l;
+};
+
 /**
  * Maximum size of data2 field in the message to be transmitted.
  */
diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
index 54df3b254e..19887611d0 100644
--- a/io/mpqemu-link.c
+++ b/io/mpqemu-link.c
@@ -242,6 +242,20 @@ bool mpqemu_msg_valid(MPQemuMsg *msg)
             return false;
         }
         break;
+    case PCI_CONFIG_WRITE:
+    case PCI_CONFIG_READ:
+        if (msg->size != sizeof(struct conf_data_msg)) {
+            return false;
+        }
+        if (!msg->bytestream) {
+            return false;
+        }
+        struct conf_data_msg *conf = (struct conf_data_msg *)msg->data2;
+
+        if (conf->l != 1 && conf->l != 2 && conf->l != 4) {
+            return false;
+        }
+        break;
     default:
         break;
     }
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 14/21] multi-process: PCI BAR read/write handling for proxy & remote endpoints
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (12 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 13/21] multi-process: Forward PCI config space acceses to " elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-01 10:41   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 15/21] multi-process: Synchronize remote memory elena.ufimtseva
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Jagannathan Raman <jag.raman@oracle.com>

Proxy device object implements handler for PCI BAR writes and reads.
The handler uses BAR_WRITE/BAR_READ message to communicate to the
remote process with the BAR address and value to be written/read.
The remote process implements handler for BAR_WRITE/BAR_READ
message.

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
 hw/i386/remote-msg.c     | 95 ++++++++++++++++++++++++++++++++++++++++
 hw/pci/proxy.c           | 61 ++++++++++++++++++++++++++
 include/hw/pci/proxy.h   | 16 ++++++-
 include/io/mpqemu-link.h | 10 +++++
 io/mpqemu-link.c         |  6 +++
 5 files changed, 186 insertions(+), 2 deletions(-)

diff --git a/hw/i386/remote-msg.c b/hw/i386/remote-msg.c
index aa5780d521..ffb4143736 100644
--- a/hw/i386/remote-msg.c
+++ b/hw/i386/remote-msg.c
@@ -8,6 +8,7 @@
 #include "sysemu/runstate.h"
 #include "io/channel-util.h"
 #include "hw/pci/pci.h"
+#include "exec/memattrs.h"
 
 static void process_connect_dev_msg(MPQemuMsg *msg, QIOChannel *com,
                                     Error **errp);
@@ -15,6 +16,8 @@ static void process_config_write(QIOChannel *ioc, PCIDevice *dev,
                                  MPQemuMsg *msg);
 static void process_config_read(QIOChannel *ioc, PCIDevice *dev,
                                 MPQemuMsg *msg);
+static void process_bar_write(QIOChannel *ioc, MPQemuMsg *msg, Error **errp);
+static void process_bar_read(QIOChannel *ioc, MPQemuMsg *msg, Error **errp);
 
 gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
                             gpointer opaque)
@@ -54,6 +57,12 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
     case PCI_CONFIG_READ:
         process_config_read(ioc, pci_dev, &msg);
         break;
+    case BAR_WRITE:
+        process_bar_write(ioc, &msg, &local_err);
+        break;
+    case BAR_READ:
+        process_bar_read(ioc, &msg, &local_err);
+        break;
     default:
         error_setg(&local_err, "Unknown command (%d) received from proxy \
                    in remote process pid=%d", msg.cmd, getpid());
@@ -143,3 +152,89 @@ static void process_config_read(QIOChannel *ioc, PCIDevice *dev,
 
     mpqemu_msg_send(&ret, ioc);
 }
+
+static void process_bar_write(QIOChannel *ioc, MPQemuMsg *msg, Error **errp)
+{
+    BarAccessMsg *bar_access = &msg->data1.bar_access;
+    AddressSpace *as =
+        bar_access->memory ? &address_space_memory : &address_space_io;
+    MPQemuMsg ret = { 0 };
+    MemTxResult res;
+
+    if (!is_power_of_2(bar_access->size) ||
+       (bar_access->size > sizeof(uint64_t))) {
+        ret.data1.u64 = UINT64_MAX;
+        goto fail;
+    }
+
+    res = address_space_rw(as, bar_access->addr, MEMTXATTRS_UNSPECIFIED,
+                           (void *)&bar_access->val, bar_access->size,
+                           true);
+
+    if (res != MEMTX_OK) {
+        error_setg(errp, "Could not perform address space write operation,"
+                   " inaccessible address: %lx in pid %d.",
+                   bar_access->addr, getpid());
+        ret.data1.u64 = -1;
+    }
+
+fail:
+    ret.cmd = RET_MSG;
+    ret.size = sizeof(ret.data1);
+
+    mpqemu_msg_send(&ret, ioc);
+}
+
+static void process_bar_read(QIOChannel *ioc, MPQemuMsg *msg, Error **errp)
+{
+    BarAccessMsg *bar_access = &msg->data1.bar_access;
+    MPQemuMsg ret = { 0 };
+    AddressSpace *as;
+    MemTxResult res;
+    uint64_t val = 0;
+
+    as = bar_access->memory ? &address_space_memory : &address_space_io;
+
+    if (!is_power_of_2(bar_access->size) ||
+       (bar_access->size > sizeof(uint64_t))) {
+        val = UINT64_MAX;
+        goto fail;
+    }
+
+    res = address_space_rw(as, bar_access->addr, MEMTXATTRS_UNSPECIFIED,
+                           (void *)&val, bar_access->size, false);
+
+    if (res != MEMTX_OK) {
+        error_setg(errp, "Could not perform address space read operation,"
+                   " inaccessible address: %lx in pid %d.",
+                   bar_access->addr, getpid());
+        val = UINT64_MAX;
+        goto fail;
+    }
+
+    switch (bar_access->size) {
+    case 8:
+        /* Nothing to do as val is already 8 bytes long */
+        break;
+    case 4:
+        val = *((uint32_t *)&val);
+        break;
+    case 2:
+        val = *((uint16_t *)&val);
+        break;
+    case 1:
+        val = *((uint8_t *)&val);
+        break;
+    default:
+        error_setg(errp, "Invalid PCI BAR read size in pid %d",
+                   getpid());
+        val = (uint64_t)-1;
+    }
+
+fail:
+    ret.cmd = RET_MSG;
+    ret.data1.u64 = val;
+    ret.size = sizeof(ret.data1);
+
+    mpqemu_msg_send(&ret, ioc);
+}
diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
index 8934070a20..fff021a06a 100644
--- a/hw/pci/proxy.c
+++ b/hw/pci/proxy.c
@@ -150,3 +150,64 @@ static void pci_proxy_dev_register_types(void)
 }
 
 type_init(pci_proxy_dev_register_types)
+
+static void send_bar_access_msg(PCIProxyDev *pdev, MemoryRegion *mr,
+                                bool write, hwaddr addr, uint64_t *val,
+                                unsigned size, bool memory)
+{
+    MPQemuMsg msg = { 0 };
+    long ret = -EINVAL;
+    Error *local_err = NULL;
+
+    msg.bytestream = 0;
+    msg.size = sizeof(msg.data1);
+    msg.data1.bar_access.addr = mr->addr + addr;
+    msg.data1.bar_access.size = size;
+    msg.data1.bar_access.memory = memory;
+
+    if (write) {
+        msg.cmd = BAR_WRITE;
+        msg.data1.bar_access.val = *val;
+    } else {
+        msg.cmd = BAR_READ;
+    }
+
+    ret = mpqemu_msg_send_reply_co(&msg, pdev->com, &local_err);
+    if (local_err) {
+        error_report("Failed to send BAR command to the remote process.");
+    }
+
+    if (!write) {
+        *val = ret;
+    }
+}
+
+static void proxy_bar_write(void *opaque, hwaddr addr, uint64_t val,
+                            unsigned size)
+{
+    ProxyMemoryRegion *pmr = opaque;
+
+    send_bar_access_msg(pmr->dev, &pmr->mr, true, addr, &val, size,
+                        pmr->memory);
+}
+
+static uint64_t proxy_bar_read(void *opaque, hwaddr addr, unsigned size)
+{
+    ProxyMemoryRegion *pmr = opaque;
+    uint64_t val;
+
+    send_bar_access_msg(pmr->dev, &pmr->mr, false, addr, &val, size,
+                        pmr->memory);
+
+    return val;
+}
+
+const MemoryRegionOps proxy_mr_ops = {
+    .read = proxy_bar_read,
+    .write = proxy_bar_write,
+    .endianness = DEVICE_NATIVE_ENDIAN,
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 1,
+    },
+};
diff --git a/include/hw/pci/proxy.h b/include/hw/pci/proxy.h
index 72dd7e0944..4f9f9c4e15 100644
--- a/include/hw/pci/proxy.h
+++ b/include/hw/pci/proxy.h
@@ -26,12 +26,24 @@
 #define PCI_PROXY_DEV_GET_CLASS(obj) \
             OBJECT_GET_CLASS(PCIProxyDevClass, (obj), TYPE_PCI_PROXY_DEV)
 
-typedef struct PCIProxyDev {
+typedef struct PCIProxyDev PCIProxyDev;
+
+typedef struct ProxyMemoryRegion {
+    PCIProxyDev *dev;
+    MemoryRegion mr;
+    bool memory;
+    bool present;
+    uint8_t type;
+} ProxyMemoryRegion;
+
+struct PCIProxyDev {
     PCIDevice parent_dev;
     char *fd;
     QIOChannel *com;
     QIOChannel *dev;
-} PCIProxyDev;
+
+    ProxyMemoryRegion region[PCI_NUM_REGIONS];
+};
 
 typedef struct PCIProxyDevClass {
     PCIDeviceClass parent_class;
diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
index bbd6f3db44..0422213863 100644
--- a/include/io/mpqemu-link.h
+++ b/include/io/mpqemu-link.h
@@ -40,6 +40,8 @@ typedef enum {
     RET_MSG,
     PCI_CONFIG_WRITE,
     PCI_CONFIG_READ,
+    BAR_WRITE,
+    BAR_READ,
     MAX = INT_MAX,
 } MPQemuCmd;
 
@@ -55,6 +57,13 @@ struct conf_data_msg {
     int l;
 };
 
+typedef struct {
+    hwaddr addr;
+    uint64_t val;
+    unsigned size;
+    bool memory;
+} BarAccessMsg;
+
 /**
  * Maximum size of data2 field in the message to be transmitted.
  */
@@ -82,6 +91,7 @@ typedef struct {
     union {
         uint64_t u64;
         SyncSysmemMsg sync_sysmem;
+        BarAccessMsg bar_access;
     } data1;
 
     int fds[REMOTE_MAX_FDS];
diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
index 19887611d0..026e25dca4 100644
--- a/io/mpqemu-link.c
+++ b/io/mpqemu-link.c
@@ -256,6 +256,12 @@ bool mpqemu_msg_valid(MPQemuMsg *msg)
             return false;
         }
         break;
+    case BAR_WRITE:
+    case BAR_READ:
+        if (msg->size != sizeof(msg->data1)) {
+            return false;
+        }
+        break;
     default:
         break;
     }
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 15/21] multi-process: Synchronize remote memory
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (13 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 14/21] multi-process: PCI BAR read/write handling for proxy & remote endpoints elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-01 10:55   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 16/21] multi-process: create IOHUB object to handle irq elena.ufimtseva
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Jagannathan Raman <jag.raman@oracle.com>

Add memory-listener object which is used to keep the view of the RAM
in sync between QEMU and remote process.
A MemoryListener is registered for system-memory AddressSpace. The
listener sends SYNC_SYSMEM message to the remote process when memory
listener commits the changes to memory, the remote process receives
the message and processes it in the handler for SYNC_SYSMEM message.

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---
 MAINTAINERS                  |   2 +
 hw/i386/remote-msg.c         |   4 +
 hw/pci/Makefile.objs         |   1 +
 hw/pci/memory-sync.c         | 214 +++++++++++++++++++++++++++++++++++
 hw/pci/proxy.c               |   4 +
 include/hw/pci/memory-sync.h |  30 +++++
 include/hw/pci/proxy.h       |   3 +
 7 files changed, 258 insertions(+)
 create mode 100644 hw/pci/memory-sync.c
 create mode 100644 include/hw/pci/memory-sync.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b48c3114c1..38d605445e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2954,6 +2954,8 @@ F: include/hw/i386/remote-memory.h
 F: hw/i386/remote-memory.c
 F: hw/pci/proxy.c
 F: include/hw/pci/proxy.h
+F: hw/pci/memory-sync.c
+F: include/hw/pci/memory-sync.h
 
 Build and test automation
 -------------------------
diff --git a/hw/i386/remote-msg.c b/hw/i386/remote-msg.c
index ffb4143736..48b153eaae 100644
--- a/hw/i386/remote-msg.c
+++ b/hw/i386/remote-msg.c
@@ -9,6 +9,7 @@
 #include "io/channel-util.h"
 #include "hw/pci/pci.h"
 #include "exec/memattrs.h"
+#include "hw/i386/remote-memory.h"
 
 static void process_connect_dev_msg(MPQemuMsg *msg, QIOChannel *com,
                                     Error **errp);
@@ -63,6 +64,9 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
     case BAR_READ:
         process_bar_read(ioc, &msg, &local_err);
         break;
+    case SYNC_SYSMEM:
+        remote_sysmem_reconfig(&msg, &local_err);
+        break;
     default:
         error_setg(&local_err, "Unknown command (%d) received from proxy \
                    in remote process pid=%d", msg.cmd, getpid());
diff --git a/hw/pci/Makefile.objs b/hw/pci/Makefile.objs
index 515dda506c..c90acd5a6e 100644
--- a/hw/pci/Makefile.objs
+++ b/hw/pci/Makefile.objs
@@ -13,3 +13,4 @@ common-obj-$(CONFIG_PCI_EXPRESS) += pcie_port.o pcie_host.o
 common-obj-$(call lnot,$(CONFIG_PCI)) += pci-stub.o
 common-obj-$(CONFIG_ALL) += pci-stub.o
 obj-$(CONFIG_MPQEMU) += proxy.o
+obj-$(CONFIG_MPQEMU) += memory-sync.o
diff --git a/hw/pci/memory-sync.c b/hw/pci/memory-sync.c
new file mode 100644
index 0000000000..5f867974c4
--- /dev/null
+++ b/hw/pci/memory-sync.c
@@ -0,0 +1,214 @@
+/*
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "qemu/compiler.h"
+#include "qemu/int128.h"
+#include "qemu/range.h"
+#include "exec/memory.h"
+#include "exec/cpu-common.h"
+#include "cpu.h"
+#include "exec/ram_addr.h"
+#include "exec/address-spaces.h"
+#include "io/mpqemu-link.h"
+#include "hw/pci/memory-sync.h"
+
+static void proxy_ml_begin(MemoryListener *listener)
+{
+    RemoteMemSync *sync = container_of(listener, RemoteMemSync, listener);
+    int mrs;
+
+    for (mrs = 0; mrs < sync->n_mr_sections; mrs++) {
+        memory_region_unref(sync->mr_sections[mrs].mr);
+    }
+
+    g_free(sync->mr_sections);
+    sync->mr_sections = NULL;
+    sync->n_mr_sections = 0;
+}
+
+static int get_fd_from_hostaddr(uint64_t host, ram_addr_t *offset)
+{
+    MemoryRegion *mr;
+    ram_addr_t off;
+
+    /**
+     * Assumes that the host address is a valid address as it's
+     * coming from the MemoryListener system. In the case host
+     * address is not valid, the following call would return
+     * the default subregion of "system_memory" region, and
+     * not NULL. So it's not possible to check for NULL here.
+     */
+    mr = memory_region_from_host((void *)(uintptr_t)host, &off);
+
+    if (offset) {
+        *offset = off;
+    }
+
+    return memory_region_get_fd(mr);
+}
+
+static bool proxy_mrs_can_merge(uint64_t host, uint64_t prev_host, size_t size)
+{
+    bool merge;
+    int fd1, fd2;
+
+    fd1 = get_fd_from_hostaddr(host, NULL);
+
+    fd2 = get_fd_from_hostaddr(prev_host, NULL);
+
+    merge = (fd1 == fd2);
+
+    merge &= ((prev_host + size) == host);
+
+    return merge;
+}
+
+static bool try_merge(RemoteMemSync *sync, MemoryRegionSection *section)
+{
+    uint64_t mrs_size, mrs_gpa, mrs_page;
+    MemoryRegionSection *prev_sec;
+    bool merged = false;
+    uintptr_t mrs_host;
+    RAMBlock *mrs_rb;
+
+    if (!sync->n_mr_sections) {
+        return false;
+    }
+
+    mrs_rb = section->mr->ram_block;
+    mrs_page = (uint64_t)qemu_ram_pagesize(mrs_rb);
+    mrs_size = int128_get64(section->size);
+    mrs_gpa = section->offset_within_address_space;
+    mrs_host = (uintptr_t)memory_region_get_ram_ptr(section->mr) +
+               section->offset_within_region;
+
+    if (get_fd_from_hostaddr(mrs_host, NULL) < 0) {
+        return true;
+    }
+
+    mrs_host = mrs_host & ~(mrs_page - 1);
+    mrs_gpa = mrs_gpa & ~(mrs_page - 1);
+    mrs_size = ROUND_UP(mrs_size, mrs_page);
+
+    if (sync->n_mr_sections) {
+        prev_sec = sync->mr_sections + (sync->n_mr_sections - 1);
+        uint64_t prev_gpa_start = prev_sec->offset_within_address_space;
+        uint64_t prev_size = int128_get64(prev_sec->size);
+        uint64_t prev_gpa_end   = range_get_last(prev_gpa_start, prev_size);
+        uint64_t prev_host_start =
+            (uintptr_t)memory_region_get_ram_ptr(prev_sec->mr) +
+            prev_sec->offset_within_region;
+        uint64_t prev_host_end = range_get_last(prev_host_start, prev_size);
+
+        if (mrs_gpa <= (prev_gpa_end + 1)) {
+            g_assert(mrs_gpa > prev_gpa_start);
+
+            if ((section->mr == prev_sec->mr) &&
+                proxy_mrs_can_merge(mrs_host, prev_host_start,
+                                    (mrs_gpa - prev_gpa_start))) {
+                uint64_t max_end = MAX(prev_host_end, mrs_host + mrs_size);
+                merged = true;
+                prev_sec->offset_within_address_space =
+                    MIN(prev_gpa_start, mrs_gpa);
+                prev_sec->offset_within_region =
+                    MIN(prev_host_start, mrs_host) -
+                    (uintptr_t)memory_region_get_ram_ptr(prev_sec->mr);
+                prev_sec->size = int128_make64(max_end - MIN(prev_host_start,
+                                                             mrs_host));
+            }
+        }
+    }
+
+    return merged;
+}
+
+static void proxy_ml_region_addnop(MemoryListener *listener,
+                                   MemoryRegionSection *section)
+{
+    RemoteMemSync *sync = container_of(listener, RemoteMemSync, listener);
+
+    if (!(memory_region_is_ram(section->mr) &&
+          !memory_region_is_rom(section->mr))) {
+        return;
+    }
+
+    if (try_merge(sync, section)) {
+        return;
+    }
+
+    ++sync->n_mr_sections;
+    sync->mr_sections = g_renew(MemoryRegionSection, sync->mr_sections,
+                                sync->n_mr_sections);
+    sync->mr_sections[sync->n_mr_sections - 1] = *section;
+    sync->mr_sections[sync->n_mr_sections - 1].fv = NULL;
+    memory_region_ref(section->mr);
+}
+
+static void proxy_ml_commit(MemoryListener *listener)
+{
+    RemoteMemSync *sync = container_of(listener, RemoteMemSync, listener);
+    MPQemuMsg msg;
+    MemoryRegionSection section;
+    ram_addr_t offset;
+    uintptr_t host_addr;
+    int region;
+
+    memset(&msg, 0, sizeof(MPQemuMsg));
+
+    msg.cmd = SYNC_SYSMEM;
+    msg.bytestream = 0;
+    msg.num_fds = sync->n_mr_sections;
+    msg.size = sizeof(msg.data1);
+    assert(msg.num_fds <= REMOTE_MAX_FDS);
+
+    for (region = 0; region < sync->n_mr_sections; region++) {
+        section = sync->mr_sections[region];
+        msg.data1.sync_sysmem.gpas[region] =
+            section.offset_within_address_space;
+        msg.data1.sync_sysmem.sizes[region] = int128_get64(section.size);
+        host_addr = (uintptr_t)memory_region_get_ram_ptr(section.mr) +
+                    section.offset_within_region;
+        msg.fds[region] = get_fd_from_hostaddr(host_addr, &offset);
+        msg.data1.sync_sysmem.offsets[region] = offset;
+    }
+    mpqemu_msg_send(&msg, sync->ioc);
+}
+
+void deconfigure_memory_sync(RemoteMemSync *sync)
+{
+    memory_listener_unregister(&sync->listener);
+}
+
+/*
+ * TODO: Memory Sync need not be instantianted once per every proxy device.
+ *       All remote devices are going to get the exact same updates at the
+ *       same time. It therefore makes sense to have a broadcast model.
+ *
+ *       Broadcast model would involve running the MemorySync object in a
+ *       thread. MemorySync would contain a list of mpqemu-link objects
+ *       that need notification. proxy_ml_commit() could send the same
+ *       message to all the links at the same time.
+ */
+void configure_memory_sync(RemoteMemSync *sync, QIOChannel *ioc)
+{
+    sync->n_mr_sections = 0;
+    sync->mr_sections = NULL;
+
+    sync->ioc = ioc;
+
+    sync->listener.begin = proxy_ml_begin;
+    sync->listener.commit = proxy_ml_commit;
+    sync->listener.region_add = proxy_ml_region_addnop;
+    sync->listener.region_nop = proxy_ml_region_addnop;
+    sync->listener.priority = 10;
+
+    memory_listener_register(&sync->listener, &address_space_memory);
+}
diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
index fff021a06a..5ecbdd2dcf 100644
--- a/hw/pci/proxy.c
+++ b/hw/pci/proxy.c
@@ -17,6 +17,8 @@
 #include "monitor/monitor.h"
 #include "io/mpqemu-link.h"
 #include "qemu/error-report.h"
+#include "hw/pci/memory-sync.h"
+#include "qom/object.h"
 
 static void proxy_set_socket(PCIProxyDev *pdev, int fd, Error **errp)
 {
@@ -68,6 +70,8 @@ static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
         }
         proxy_set_socket(dev, proxyfd, errp);
     }
+
+    configure_memory_sync(&dev->sync, dev->com);
 }
 
 static int config_op_send(PCIProxyDev *pdev, uint32_t addr, uint32_t *val,
diff --git a/include/hw/pci/memory-sync.h b/include/hw/pci/memory-sync.h
new file mode 100644
index 0000000000..3c9007f318
--- /dev/null
+++ b/include/hw/pci/memory-sync.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef MEMORY_SYNC_H
+#define MEMORY_SYNC_H
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "exec/memory.h"
+#include "io/channel.h"
+
+typedef struct RemoteMemSync {
+    MemoryListener listener;
+
+    int n_mr_sections;
+    MemoryRegionSection *mr_sections;
+
+    QIOChannel *ioc;
+} RemoteMemSync;
+
+void configure_memory_sync(RemoteMemSync *sync, QIOChannel *ioc);
+void deconfigure_memory_sync(RemoteMemSync *sync);
+
+#endif
diff --git a/include/hw/pci/proxy.h b/include/hw/pci/proxy.h
index 4f9f9c4e15..a41a6aeaa5 100644
--- a/include/hw/pci/proxy.h
+++ b/include/hw/pci/proxy.h
@@ -14,6 +14,7 @@
 
 #include "hw/pci/pci.h"
 #include "io/channel.h"
+#include "hw/pci/memory-sync.h"
 
 #define TYPE_PCI_PROXY_DEV "pci-proxy-dev"
 
@@ -42,6 +43,8 @@ struct PCIProxyDev {
     QIOChannel *com;
     QIOChannel *dev;
 
+    RemoteMemSync sync;
+
     ProxyMemoryRegion region[PCI_NUM_REGIONS];
 };
 
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 16/21] multi-process: create IOHUB object to handle irq
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (14 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 15/21] multi-process: Synchronize remote memory elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-02 12:09   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 17/21] multi-process: Retrieve PCI info from remote process elena.ufimtseva
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Jagannathan Raman <jag.raman@oracle.com>

IOHUB object is added to manage PCI IRQs. It uses KVM_IRQFD
ioctl to create irqfd to injecting PCI interrupts to the guest.
IOHUB object forwards the irqfd to the remote process. Remote process
uses this fd to directly send interrupts to the guest, bypassing QEMU.

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---
 MAINTAINERS               |   2 +
 hw/Makefile.objs          |   1 +
 hw/i386/remote-msg.c      |   4 +
 hw/i386/remote.c          |  15 ++++
 hw/pci/proxy.c            |  52 +++++++++++++
 hw/remote/Makefile.objs   |   1 +
 hw/remote/iohub.c         | 153 ++++++++++++++++++++++++++++++++++++++
 include/hw/i386/remote.h  |   2 +
 include/hw/pci/pci_ids.h  |   3 +
 include/hw/pci/proxy.h    |   8 ++
 include/hw/remote/iohub.h |  50 +++++++++++++
 include/io/mpqemu-link.h  |   6 ++
 io/mpqemu-link.c          |   1 +
 13 files changed, 298 insertions(+)
 create mode 100644 hw/remote/Makefile.objs
 create mode 100644 hw/remote/iohub.c
 create mode 100644 include/hw/remote/iohub.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 38d605445e..f9ede7e094 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2956,6 +2956,8 @@ F: hw/pci/proxy.c
 F: include/hw/pci/proxy.h
 F: hw/pci/memory-sync.c
 F: include/hw/pci/memory-sync.h
+F: hw/remote/iohub.c
+F: include/hw/remote/iohub.h
 
 Build and test automation
 -------------------------
diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index 4cbe5e4e57..8caf659de0 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -40,6 +40,7 @@ devices-dirs-$(CONFIG_MEM_DEVICE) += mem/
 devices-dirs-$(CONFIG_NUBUS) += nubus/
 devices-dirs-y += semihosting/
 devices-dirs-y += smbios/
+devices-dirs-y += remote/
 endif
 
 common-obj-y += $(devices-dirs-y)
diff --git a/hw/i386/remote-msg.c b/hw/i386/remote-msg.c
index 48b153eaae..67fee4bb57 100644
--- a/hw/i386/remote-msg.c
+++ b/hw/i386/remote-msg.c
@@ -10,6 +10,7 @@
 #include "hw/pci/pci.h"
 #include "exec/memattrs.h"
 #include "hw/i386/remote-memory.h"
+#include "hw/remote/iohub.h"
 
 static void process_connect_dev_msg(MPQemuMsg *msg, QIOChannel *com,
                                     Error **errp);
@@ -67,6 +68,9 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
     case SYNC_SYSMEM:
         remote_sysmem_reconfig(&msg, &local_err);
         break;
+    case SET_IRQFD:
+        process_set_irqfd_msg(pci_dev, &msg);
+        break;
     default:
         error_setg(&local_err, "Unknown command (%d) received from proxy \
                    in remote process pid=%d", msg.cmd, getpid());
diff --git a/hw/i386/remote.c b/hw/i386/remote.c
index 5342e884ad..8e74a6f1af 100644
--- a/hw/i386/remote.c
+++ b/hw/i386/remote.c
@@ -17,12 +17,16 @@
 #include "qapi/error.h"
 #include "io/channel-util.h"
 #include "io/channel.h"
+#include "hw/pci/pci_host.h"
+#include "hw/remote/iohub.h"
 
 static void remote_machine_init(MachineState *machine)
 {
     MemoryRegion *system_memory, *system_io, *pci_memory;
     RemMachineState *s = REMOTE_MACHINE(machine);
     RemotePCIHost *rem_host;
+    PCIHostState *pci_host;
+    PCIDevice *pci_dev;
 
     system_memory = get_system_memory();
     system_io = get_system_io();
@@ -42,6 +46,17 @@ static void remote_machine_init(MachineState *machine)
     memory_region_add_subregion_overlap(system_memory, 0x0, pci_memory, -1);
 
     qdev_realize(DEVICE(rem_host), sysbus_get_default(), &error_fatal);
+
+    pci_host = PCI_HOST_BRIDGE(rem_host);
+    pci_dev = pci_create_simple_multifunction(pci_host->bus,
+                                              PCI_DEVFN(REMOTE_IOHUB_DEV,
+                                                        REMOTE_IOHUB_FUNC),
+                                              true, TYPE_REMOTE_IOHUB_DEVICE);
+
+    s->iohub = REMOTE_IOHUB_DEVICE(pci_dev);
+
+    pci_bus_irqs(pci_host->bus, remote_iohub_set_irq, remote_iohub_map_irq,
+                 s->iohub, REMOTE_IOHUB_NB_PIRQS);
 }
 
 static void remote_set_socket(Object *obj, const char *str, Error **errp)
diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
index 5ecbdd2dcf..9d8559b6d4 100644
--- a/hw/pci/proxy.c
+++ b/hw/pci/proxy.c
@@ -19,6 +19,9 @@
 #include "qemu/error-report.h"
 #include "hw/pci/memory-sync.h"
 #include "qom/object.h"
+#include "qemu/event_notifier.h"
+#include "sysemu/kvm.h"
+#include "util/event_notifier-posix.c"
 
 static void proxy_set_socket(PCIProxyDev *pdev, int fd, Error **errp)
 {
@@ -57,6 +60,53 @@ static Property proxy_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void proxy_intx_update(PCIDevice *pci_dev)
+{
+    PCIProxyDev *dev = PCI_PROXY_DEV(pci_dev);
+    PCIINTxRoute route;
+    int pin = pci_get_byte(pci_dev->config + PCI_INTERRUPT_PIN) - 1;
+
+    if (dev->irqfd.fd) {
+        dev->irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
+        (void) kvm_vm_ioctl(kvm_state, KVM_IRQFD, &dev->irqfd);
+        memset(&dev->irqfd, 0, sizeof(struct kvm_irqfd));
+    }
+
+    route = pci_device_route_intx_to_irq(pci_dev, pin);
+
+    dev->irqfd.fd = event_notifier_get_fd(&dev->intr);
+    dev->irqfd.resamplefd = event_notifier_get_fd(&dev->resample);
+    dev->irqfd.gsi = route.irq;
+    dev->irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
+    (void) kvm_vm_ioctl(kvm_state, KVM_IRQFD, &dev->irqfd);
+}
+
+static void setup_irqfd(PCIProxyDev *dev)
+{
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+    MPQemuMsg msg;
+
+    event_notifier_init(&dev->intr, 0);
+    event_notifier_init(&dev->resample, 0);
+
+    memset(&msg, 0, sizeof(MPQemuMsg));
+    msg.cmd = SET_IRQFD;
+    msg.num_fds = 2;
+    msg.fds[0] = event_notifier_get_fd(&dev->intr);
+    msg.fds[1] = event_notifier_get_fd(&dev->resample);
+    msg.data1.set_irqfd.intx =
+        pci_get_byte(pci_dev->config + PCI_INTERRUPT_PIN) - 1;
+    msg.size = sizeof(msg.data1);
+
+    mpqemu_msg_send(&msg, dev->dev);
+
+    memset(&dev->irqfd, 0, sizeof(struct kvm_irqfd));
+
+    proxy_intx_update(pci_dev);
+
+    pci_device_set_intx_routing_notifier(pci_dev, proxy_intx_update);
+}
+
 static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
 {
     PCIProxyDev *dev = PCI_PROXY_DEV(device);
@@ -72,6 +122,8 @@ static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
     }
 
     configure_memory_sync(&dev->sync, dev->com);
+
+    setup_irqfd(dev);
 }
 
 static int config_op_send(PCIProxyDev *pdev, uint32_t addr, uint32_t *val,
diff --git a/hw/remote/Makefile.objs b/hw/remote/Makefile.objs
new file mode 100644
index 0000000000..635ce5e0ab
--- /dev/null
+++ b/hw/remote/Makefile.objs
@@ -0,0 +1 @@
+common-obj-$(CONFIG_MPQEMU) += iohub.o
diff --git a/hw/remote/iohub.c b/hw/remote/iohub.c
new file mode 100644
index 0000000000..9c6cfaecd4
--- /dev/null
+++ b/hw/remote/iohub.c
@@ -0,0 +1,153 @@
+/*
+ * Remote IO Hub
+ *
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/pci/pci.h"
+#include "hw/pci/pci_ids.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/remote/iohub.h"
+#include "qemu/thread.h"
+#include "hw/boards.h"
+#include "hw/i386/remote.h"
+#include "qemu/main-loop.h"
+
+static void remote_iohub_initfn(Object *obj)
+{
+    RemoteIOHubState *iohub = REMOTE_IOHUB_DEVICE(obj);
+    int slot, intx, pirq;
+
+    memset(&iohub->irqfds, 0, sizeof(iohub->irqfds));
+    memset(&iohub->resamplefds, 0, sizeof(iohub->resamplefds));
+
+    for (slot = 0; slot < PCI_SLOT_MAX; slot++) {
+        for (intx = 0; intx < PCI_NUM_PINS; intx++) {
+            iohub->irq_num[slot][intx] = (slot + intx) % 4 + 4;
+        }
+    }
+
+    for (pirq = 0; pirq < REMOTE_IOHUB_NB_PIRQS; pirq++) {
+        qemu_mutex_init(&iohub->irq_level_lock[pirq]);
+        iohub->irq_level[pirq] = 0;
+        event_notifier_init_fd(&iohub->irqfds[pirq], -1);
+        event_notifier_init_fd(&iohub->resamplefds[pirq], -1);
+    }
+}
+
+static void remote_iohub_class_init(ObjectClass *klass, void *data)
+{
+    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+    k->vendor_id = PCI_VENDOR_ID_ORACLE;
+    k->device_id = PCI_DEVICE_ID_REMOTE_IOHUB;
+}
+
+static const TypeInfo remote_iohub_info = {
+    .name       = TYPE_REMOTE_IOHUB_DEVICE,
+    .parent     = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(RemoteIOHubState),
+    .instance_init = remote_iohub_initfn,
+    .class_init  = remote_iohub_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
+        { }
+    }
+};
+
+static void remote_iohub_register(void)
+{
+    type_register_static(&remote_iohub_info);
+}
+
+type_init(remote_iohub_register);
+
+int remote_iohub_map_irq(PCIDevice *pci_dev, int intx)
+{
+    BusState *bus = qdev_get_parent_bus(&pci_dev->qdev);
+    PCIBus *pci_bus = PCI_BUS(bus);
+    PCIDevice *pci_iohub =
+        pci_bus->devices[PCI_DEVFN(REMOTE_IOHUB_DEV, REMOTE_IOHUB_FUNC)];
+    RemoteIOHubState *iohub = REMOTE_IOHUB_DEVICE(pci_iohub);
+
+    return iohub->irq_num[PCI_SLOT(pci_dev->devfn)][intx];
+}
+
+/*
+ * TODO: Using lock to set the interrupt level could become a
+ *       performance bottleneck. Check if atomic arithmetic
+ *       is possible.
+ */
+void remote_iohub_set_irq(void *opaque, int pirq, int level)
+{
+    RemoteIOHubState *iohub = opaque;
+
+    assert(pirq >= 0);
+    assert(pirq < REMOTE_IOHUB_NB_PIRQS);
+
+    qemu_mutex_lock(&iohub->irq_level_lock[pirq]);
+
+    if (level) {
+        if (++iohub->irq_level[pirq] == 1) {
+            event_notifier_set(&iohub->irqfds[pirq]);
+        }
+    } else if (iohub->irq_level[pirq] > 0) {
+        iohub->irq_level[pirq]--;
+    }
+
+    qemu_mutex_unlock(&iohub->irq_level_lock[pirq]);
+}
+
+static void intr_resample_handler(void *opaque)
+{
+    ResampleToken *token = opaque;
+    RemoteIOHubState *iohub = token->iohub;
+    int pirq, s;
+
+    pirq = token->pirq;
+
+    s = event_notifier_test_and_clear(&iohub->resamplefds[pirq]);
+
+    assert(s >= 0);
+
+    qemu_mutex_lock(&iohub->irq_level_lock[pirq]);
+
+    if (iohub->irq_level[pirq]) {
+        event_notifier_set(&iohub->irqfds[pirq]);
+    }
+
+    qemu_mutex_unlock(&iohub->irq_level_lock[pirq]);
+}
+
+void process_set_irqfd_msg(PCIDevice *pci_dev, MPQemuMsg *msg)
+{
+    RemMachineState *machine = REMOTE_MACHINE(current_machine);
+    RemoteIOHubState *iohub = machine->iohub;
+    int pirq;
+
+    g_assert(msg->data1.set_irqfd.intx < 4);
+    g_assert(msg->num_fds == 2);
+
+    pirq = remote_iohub_map_irq(pci_dev, msg->data1.set_irqfd.intx);
+
+    if (event_notifier_get_fd(&iohub->irqfds[pirq]) != -1) {
+        event_notifier_cleanup(&iohub->irqfds[pirq]);
+        event_notifier_cleanup(&iohub->resamplefds[pirq]);
+        memset(&iohub->token[pirq], 0, sizeof(ResampleToken));
+    }
+
+    event_notifier_init_fd(&iohub->irqfds[pirq], msg->fds[0]);
+    event_notifier_init_fd(&iohub->resamplefds[pirq], msg->fds[1]);
+
+    iohub->token[pirq].iohub = iohub;
+    iohub->token[pirq].pirq = pirq;
+
+    qemu_set_fd_handler(msg->fds[1], intr_resample_handler, NULL,
+                        &iohub->token[pirq]);
+}
diff --git a/include/hw/i386/remote.h b/include/hw/i386/remote.h
index c3890e57ab..6af423ab54 100644
--- a/include/hw/i386/remote.h
+++ b/include/hw/i386/remote.h
@@ -18,12 +18,14 @@
 #include "hw/boards.h"
 #include "hw/pci-host/remote.h"
 #include "io/channel.h"
+#include "hw/remote/iohub.h"
 
 typedef struct RemMachineState {
     MachineState parent_obj;
 
     RemotePCIHost *host;
     QIOChannel *ioc;
+    RemoteIOHubState *iohub;
 } RemMachineState;
 
 #define TYPE_REMOTE_MACHINE "remote-machine"
diff --git a/include/hw/pci/pci_ids.h b/include/hw/pci/pci_ids.h
index 11f8ab7149..bd0c17dc78 100644
--- a/include/hw/pci/pci_ids.h
+++ b/include/hw/pci/pci_ids.h
@@ -192,6 +192,9 @@
 #define PCI_DEVICE_ID_SUN_SIMBA          0x5000
 #define PCI_DEVICE_ID_SUN_SABRE          0xa000
 
+#define PCI_VENDOR_ID_ORACLE             0x108e
+#define PCI_DEVICE_ID_REMOTE_IOHUB       0xb000
+
 #define PCI_VENDOR_ID_CMD                0x1095
 #define PCI_DEVICE_ID_CMD_646            0x0646
 
diff --git a/include/hw/pci/proxy.h b/include/hw/pci/proxy.h
index a41a6aeaa5..e6f076ae95 100644
--- a/include/hw/pci/proxy.h
+++ b/include/hw/pci/proxy.h
@@ -12,9 +12,12 @@
 #include "qemu/osdep.h"
 #include "qemu-common.h"
 
+#include <linux/kvm.h>
+
 #include "hw/pci/pci.h"
 #include "io/channel.h"
 #include "hw/pci/memory-sync.h"
+#include "qemu/event_notifier.h"
 
 #define TYPE_PCI_PROXY_DEV "pci-proxy-dev"
 
@@ -45,6 +48,11 @@ struct PCIProxyDev {
 
     RemoteMemSync sync;
 
+    struct kvm_irqfd irqfd;
+
+    EventNotifier intr;
+    EventNotifier resample;
+
     ProxyMemoryRegion region[PCI_NUM_REGIONS];
 };
 
diff --git a/include/hw/remote/iohub.h b/include/hw/remote/iohub.h
new file mode 100644
index 0000000000..9aacf3e04c
--- /dev/null
+++ b/include/hw/remote/iohub.h
@@ -0,0 +1,50 @@
+/*
+ * IO Hub for remote device
+ *
+ * Copyright © 2018, 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef REMOTE_IOHUB_H
+#define REMOTE_IOHUB_H
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/pci/pci.h"
+#include "qemu/event_notifier.h"
+#include "qemu/thread-posix.h"
+#include "io/mpqemu-link.h"
+
+#define REMOTE_IOHUB_NB_PIRQS    8
+
+#define REMOTE_IOHUB_DEV         31
+#define REMOTE_IOHUB_FUNC        0
+
+#define TYPE_REMOTE_IOHUB_DEVICE "remote-iohub"
+#define REMOTE_IOHUB_DEVICE(obj) \
+    OBJECT_CHECK(RemoteIOHubState, (obj), TYPE_REMOTE_IOHUB_DEVICE)
+
+typedef struct ResampleToken {
+    void *iohub;
+    int pirq;
+} ResampleToken;
+
+typedef struct RemoteIOHubState {
+    PCIDevice d;
+    uint8_t irq_num[PCI_SLOT_MAX][PCI_NUM_PINS];
+    EventNotifier irqfds[REMOTE_IOHUB_NB_PIRQS];
+    EventNotifier resamplefds[REMOTE_IOHUB_NB_PIRQS];
+    unsigned int irq_level[REMOTE_IOHUB_NB_PIRQS];
+    ResampleToken token[REMOTE_IOHUB_NB_PIRQS];
+    QemuMutex irq_level_lock[REMOTE_IOHUB_NB_PIRQS];
+} RemoteIOHubState;
+
+int remote_iohub_map_irq(PCIDevice *pci_dev, int intx);
+void remote_iohub_set_irq(void *opaque, int pirq, int level);
+void process_set_irqfd_msg(PCIDevice *pci_dev, MPQemuMsg *msg);
+
+#endif
diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
index 0422213863..a563b557ce 100644
--- a/include/io/mpqemu-link.h
+++ b/include/io/mpqemu-link.h
@@ -42,6 +42,7 @@ typedef enum {
     PCI_CONFIG_READ,
     BAR_WRITE,
     BAR_READ,
+    SET_IRQFD,
     MAX = INT_MAX,
 } MPQemuCmd;
 
@@ -64,6 +65,10 @@ typedef struct {
     bool memory;
 } BarAccessMsg;
 
+typedef struct {
+    int intx;
+} SetIrqFdMsg;
+
 /**
  * Maximum size of data2 field in the message to be transmitted.
  */
@@ -92,6 +97,7 @@ typedef struct {
         uint64_t u64;
         SyncSysmemMsg sync_sysmem;
         BarAccessMsg bar_access;
+        SetIrqFdMsg set_irqfd;
     } data1;
 
     int fds[REMOTE_MAX_FDS];
diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
index 026e25dca4..561ac0576f 100644
--- a/io/mpqemu-link.c
+++ b/io/mpqemu-link.c
@@ -258,6 +258,7 @@ bool mpqemu_msg_valid(MPQemuMsg *msg)
         break;
     case BAR_WRITE:
     case BAR_READ:
+    case SET_IRQFD:
         if (msg->size != sizeof(msg->data1)) {
             return false;
         }
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 17/21] multi-process: Retrieve PCI info from remote process
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (15 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 16/21] multi-process: create IOHUB object to handle irq elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-02 12:59   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 18/21] multi-process: heartbeat messages to remote elena.ufimtseva
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Jagannathan Raman <jag.raman@oracle.com>

Retrieve PCI configuration info about the remote device and
configure the Proxy PCI object based on the returned information

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/i386/remote-msg.c     | 23 +++++++++++
 hw/pci/proxy.c           | 89 ++++++++++++++++++++++++++++++++++++++++
 include/io/mpqemu-link.h |  9 ++++
 io/mpqemu-link.c         |  5 +++
 4 files changed, 126 insertions(+)

diff --git a/hw/i386/remote-msg.c b/hw/i386/remote-msg.c
index 67fee4bb57..9379ee6442 100644
--- a/hw/i386/remote-msg.c
+++ b/hw/i386/remote-msg.c
@@ -20,6 +20,8 @@ static void process_config_read(QIOChannel *ioc, PCIDevice *dev,
                                 MPQemuMsg *msg);
 static void process_bar_write(QIOChannel *ioc, MPQemuMsg *msg, Error **errp);
 static void process_bar_read(QIOChannel *ioc, MPQemuMsg *msg, Error **errp);
+static void process_get_pci_info_msg(QIOChannel *ioc, MPQemuMsg *msg,
+                                     PCIDevice *pci_dev);
 
 gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
                             gpointer opaque)
@@ -71,6 +73,9 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
     case SET_IRQFD:
         process_set_irqfd_msg(pci_dev, &msg);
         break;
+    case GET_PCI_INFO:
+        process_get_pci_info_msg(ioc, &msg, pci_dev);
+        break;
     default:
         error_setg(&local_err, "Unknown command (%d) received from proxy \
                    in remote process pid=%d", msg.cmd, getpid());
@@ -246,3 +251,21 @@ fail:
 
     mpqemu_msg_send(&ret, ioc);
 }
+
+static void process_get_pci_info_msg(QIOChannel *ioc, MPQemuMsg *msg,
+                                     PCIDevice *pci_dev)
+{
+    PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
+    MPQemuMsg ret = { 0 };
+
+    ret.cmd = RET_MSG;
+
+    ret.data1.ret_pci_info.vendor_id = pc->vendor_id;
+    ret.data1.ret_pci_info.device_id = pc->device_id;
+    ret.data1.ret_pci_info.class_id = pc->class_id;
+    ret.data1.ret_pci_info.subsystem_id = pc->subsystem_id;
+
+    ret.size = sizeof(ret.data1);
+
+    mpqemu_msg_send(&ret, ioc);
+}
diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
index 9d8559b6d4..449341e459 100644
--- a/hw/pci/proxy.c
+++ b/hw/pci/proxy.c
@@ -23,6 +23,8 @@
 #include "sysemu/kvm.h"
 #include "util/event_notifier-posix.c"
 
+static void probe_pci_info(PCIDevice *dev);
+
 static void proxy_set_socket(PCIProxyDev *pdev, int fd, Error **errp)
 {
     DeviceState *dev = DEVICE(pdev);
@@ -110,6 +112,7 @@ static void setup_irqfd(PCIProxyDev *dev)
 static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
 {
     PCIProxyDev *dev = PCI_PROXY_DEV(device);
+    uint8_t *pci_conf = device->config;
     int proxyfd;
 
     if (dev->fd) {
@@ -121,9 +124,14 @@ static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
         proxy_set_socket(dev, proxyfd, errp);
     }
 
+    pci_conf[PCI_LATENCY_TIMER] = 0xff;
+    pci_conf[PCI_INTERRUPT_PIN] = 0x01;
+
     configure_memory_sync(&dev->sync, dev->com);
 
     setup_irqfd(dev);
+
+    probe_pci_info(PCI_DEVICE(dev));
 }
 
 static int config_op_send(PCIProxyDev *pdev, uint32_t addr, uint32_t *val,
@@ -267,3 +275,84 @@ const MemoryRegionOps proxy_mr_ops = {
         .max_access_size = 1,
     },
 };
+
+static void probe_pci_info(PCIDevice *dev)
+{
+    PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
+    DeviceClass *dc = DEVICE_CLASS(pc);
+    PCIProxyDev *pdev = PCI_PROXY_DEV(dev);
+    MPQemuMsg msg = { 0 }, ret =  { 0 };
+    uint32_t orig_val, new_val, class;
+    uint8_t type;
+    int i, size;
+    char *name;
+    Error *local_err = NULL;
+
+    msg.bytestream = 0;
+    msg.size = 0;
+    msg.cmd = GET_PCI_INFO;
+
+    ret.data1.u64  = mpqemu_msg_send_reply_co(&msg, pdev->dev, &local_err);
+    if (local_err) {
+        error_report("Error while sending GET_PCI_INFO message");
+    }
+
+    pc->vendor_id = ret.data1.ret_pci_info.vendor_id;
+    pc->device_id = ret.data1.ret_pci_info.device_id;
+    pc->class_id = ret.data1.ret_pci_info.class_id;
+    pc->subsystem_id = ret.data1.ret_pci_info.subsystem_id;
+
+    config_op_send(pdev, PCI_CLASS_DEVICE, &class, 1, PCI_CONFIG_READ);
+    switch (class) {
+    case PCI_BASE_CLASS_BRIDGE:
+        set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
+        break;
+    case PCI_BASE_CLASS_STORAGE:
+        set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
+        break;
+    case PCI_BASE_CLASS_NETWORK:
+        set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
+        break;
+    case PCI_BASE_CLASS_INPUT:
+        set_bit(DEVICE_CATEGORY_INPUT, dc->categories);
+        break;
+    case PCI_BASE_CLASS_DISPLAY:
+        set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
+        break;
+    case PCI_BASE_CLASS_PROCESSOR:
+        set_bit(DEVICE_CATEGORY_CPU, dc->categories);
+        break;
+    default:
+        set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+        break;
+    }
+
+    for (i = 0; i < 6; i++) {
+        config_op_send(pdev, PCI_BASE_ADDRESS_0 + (4 * i), &orig_val, 4,
+                       PCI_CONFIG_READ);
+        new_val = 0xffffffff;
+        config_op_send(pdev, PCI_BASE_ADDRESS_0 + (4 * i), &new_val, 4,
+                       PCI_CONFIG_WRITE);
+        config_op_send(pdev, PCI_BASE_ADDRESS_0 + (4 * i), &new_val, 4,
+                       PCI_CONFIG_READ);
+        size = (~(new_val & 0xFFFFFFF0)) + 1;
+        config_op_send(pdev, PCI_BASE_ADDRESS_0 + (4 * i), &orig_val, 4,
+                       PCI_CONFIG_WRITE);
+        type = (new_val & 0x1) ?
+                   PCI_BASE_ADDRESS_SPACE_IO : PCI_BASE_ADDRESS_SPACE_MEMORY;
+
+        if (size) {
+            pdev->region[i].dev = pdev;
+            pdev->region[i].present = true;
+            if (type == PCI_BASE_ADDRESS_SPACE_MEMORY) {
+                pdev->region[i].memory = true;
+            }
+            name = g_strdup_printf("bar-region-%d", i);
+            memory_region_init_io(&pdev->region[i].mr, OBJECT(pdev),
+                                  &proxy_mr_ops, &pdev->region[i],
+                                  name, size);
+            pci_register_bar(dev, i, type, &pdev->region[i].mr);
+            g_free(name);
+        }
+    }
+}
diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
index a563b557ce..4b96cb8ccb 100644
--- a/include/io/mpqemu-link.h
+++ b/include/io/mpqemu-link.h
@@ -43,6 +43,7 @@ typedef enum {
     BAR_WRITE,
     BAR_READ,
     SET_IRQFD,
+    GET_PCI_INFO,
     MAX = INT_MAX,
 } MPQemuCmd;
 
@@ -69,6 +70,13 @@ typedef struct {
     int intx;
 } SetIrqFdMsg;
 
+typedef struct {
+    uint16_t vendor_id;
+    uint16_t device_id;
+    uint16_t class_id;
+    uint16_t subsystem_id;
+} RetPCIInfoMsg;
+
 /**
  * Maximum size of data2 field in the message to be transmitted.
  */
@@ -98,6 +106,7 @@ typedef struct {
         SyncSysmemMsg sync_sysmem;
         BarAccessMsg bar_access;
         SetIrqFdMsg set_irqfd;
+        RetPCIInfoMsg ret_pci_info;
     } data1;
 
     int fds[REMOTE_MAX_FDS];
diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
index 561ac0576f..d09b2a2f50 100644
--- a/io/mpqemu-link.c
+++ b/io/mpqemu-link.c
@@ -263,6 +263,11 @@ bool mpqemu_msg_valid(MPQemuMsg *msg)
             return false;
         }
         break;
+    case GET_PCI_INFO:
+        if (msg->size) {
+            return false;
+        }
+        break;
     default:
         break;
     }
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 18/21] multi-process: heartbeat messages to remote
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (16 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 17/21] multi-process: Retrieve PCI info from remote process elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-02 13:16   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 19/21] multi-process: perform device reset in the remote process elena.ufimtseva
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

In order to detect remote processes which are hung, the
proxy periodically sends heartbeat messages to confirm if
the remote process is alive. The remote process responds
to this heartbeat message to confirm it is alive.

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---
 hw/i386/remote-msg.c     | 14 ++++++++++
 hw/pci/proxy.c           | 58 ++++++++++++++++++++++++++++++++++++++++
 include/hw/pci/proxy.h   |  2 ++
 include/io/mpqemu-link.h |  1 +
 io/mpqemu-link.c         |  1 +
 5 files changed, 76 insertions(+)

diff --git a/hw/i386/remote-msg.c b/hw/i386/remote-msg.c
index 9379ee6442..919bddc1d5 100644
--- a/hw/i386/remote-msg.c
+++ b/hw/i386/remote-msg.c
@@ -22,6 +22,7 @@ static void process_bar_write(QIOChannel *ioc, MPQemuMsg *msg, Error **errp);
 static void process_bar_read(QIOChannel *ioc, MPQemuMsg *msg, Error **errp);
 static void process_get_pci_info_msg(QIOChannel *ioc, MPQemuMsg *msg,
                                      PCIDevice *pci_dev);
+static void process_proxy_ping_msg(QIOChannel *ioc);
 
 gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
                             gpointer opaque)
@@ -76,6 +77,9 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
     case GET_PCI_INFO:
         process_get_pci_info_msg(ioc, &msg, pci_dev);
         break;
+    case PROXY_PING:
+        process_proxy_ping_msg(ioc);
+        break;
     default:
         error_setg(&local_err, "Unknown command (%d) received from proxy \
                    in remote process pid=%d", msg.cmd, getpid());
@@ -269,3 +273,13 @@ static void process_get_pci_info_msg(QIOChannel *ioc, MPQemuMsg *msg,
 
     mpqemu_msg_send(&ret, ioc);
 }
+
+static void process_proxy_ping_msg(QIOChannel *ioc)
+{
+    MPQemuMsg ret = { 0 };
+
+    ret.cmd = RET_MSG;
+    ret.size = sizeof(ret.data1);
+
+    mpqemu_msg_send(&ret, ioc);
+}
diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
index 449341e459..e2e9a13287 100644
--- a/hw/pci/proxy.c
+++ b/hw/pci/proxy.c
@@ -24,6 +24,8 @@
 #include "util/event_notifier-posix.c"
 
 static void probe_pci_info(PCIDevice *dev);
+static void start_hb_timer(PCIProxyDev *dev);
+static void pci_proxy_dev_exit(PCIDevice *pdev);
 
 static void proxy_set_socket(PCIProxyDev *pdev, int fd, Error **errp)
 {
@@ -132,6 +134,8 @@ static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
     setup_irqfd(dev);
 
     probe_pci_info(PCI_DEVICE(dev));
+
+    start_hb_timer(dev);
 }
 
 static int config_op_send(PCIProxyDev *pdev, uint32_t addr, uint32_t *val,
@@ -192,6 +196,7 @@ static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
     k->realize = pci_proxy_dev_realize;
     k->config_read = pci_proxy_read_config;
     k->config_write = pci_proxy_write_config;
+    k->exit = pci_proxy_dev_exit;
 
     device_class_set_props(dc, proxy_properties);
 }
@@ -356,3 +361,56 @@ static void probe_pci_info(PCIDevice *dev)
         }
     }
 }
+
+static void hb_msg(PCIProxyDev *dev)
+{
+    DeviceState *ds = DEVICE(dev);
+    MPQemuMsg msg = { 0 };
+    long ret = -EINVAL;
+    Error *local_err = NULL;
+
+    msg.cmd = PROXY_PING;
+    msg.bytestream = 0;
+    msg.size = 0;
+
+    ret = mpqemu_msg_send_reply_co(&msg, dev->com, &local_err);
+    if (local_err) {
+        error_report("Lost contact with remote device %s, error code %ld",
+                     ds->id, ret);
+    }
+}
+
+#define NOP_INTERVAL 1000
+
+static void remote_ping(void *opaque)
+{
+    PCIProxyDev *dev = opaque;
+
+    hb_msg(dev);
+
+    timer_mod(dev->hb_timer,
+              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + NOP_INTERVAL);
+}
+
+static void start_hb_timer(PCIProxyDev *dev)
+{
+    dev->hb_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+                                 remote_ping,
+                                 dev);
+
+    timer_mod(dev->hb_timer,
+              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + NOP_INTERVAL);
+}
+
+static void stop_hb_timer(PCIProxyDev *dev)
+{
+    timer_del(dev->hb_timer);
+    timer_free(dev->hb_timer);
+}
+
+static void pci_proxy_dev_exit(PCIDevice *pdev)
+{
+    PCIProxyDev *dev = PCI_PROXY_DEV(pdev);
+
+    stop_hb_timer(dev);
+}
diff --git a/include/hw/pci/proxy.h b/include/hw/pci/proxy.h
index e6f076ae95..037740309d 100644
--- a/include/hw/pci/proxy.h
+++ b/include/hw/pci/proxy.h
@@ -53,6 +53,8 @@ struct PCIProxyDev {
     EventNotifier intr;
     EventNotifier resample;
 
+    QEMUTimer *hb_timer;
+
     ProxyMemoryRegion region[PCI_NUM_REGIONS];
 };
 
diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
index 4b96cb8ccb..676d7eb3ef 100644
--- a/include/io/mpqemu-link.h
+++ b/include/io/mpqemu-link.h
@@ -44,6 +44,7 @@ typedef enum {
     BAR_READ,
     SET_IRQFD,
     GET_PCI_INFO,
+    PROXY_PING,
     MAX = INT_MAX,
 } MPQemuCmd;
 
diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
index d09b2a2f50..7452e55e17 100644
--- a/io/mpqemu-link.c
+++ b/io/mpqemu-link.c
@@ -264,6 +264,7 @@ bool mpqemu_msg_valid(MPQemuMsg *msg)
         }
         break;
     case GET_PCI_INFO:
+    case PROXY_PING:
         if (msg->size) {
             return false;
         }
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 19/21] multi-process: perform device reset in the remote process
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (17 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 18/21] multi-process: heartbeat messages to remote elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-02 13:19   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 20/21] multi-process: add the concept description to docs/devel/qemu-multiprocess elena.ufimtseva
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

Perform device reset in the remote process when QEMU performs
device reset. This is required to reset the internal state
(like registers, etc...) of emulated devices

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/i386/remote-msg.c     | 16 ++++++++++++++++
 hw/pci/proxy.c           | 20 ++++++++++++++++++++
 include/io/mpqemu-link.h |  1 +
 3 files changed, 37 insertions(+)

diff --git a/hw/i386/remote-msg.c b/hw/i386/remote-msg.c
index 919bddc1d5..63d2be4701 100644
--- a/hw/i386/remote-msg.c
+++ b/hw/i386/remote-msg.c
@@ -11,6 +11,7 @@
 #include "exec/memattrs.h"
 #include "hw/i386/remote-memory.h"
 #include "hw/remote/iohub.h"
+#include "sysemu/reset.h"
 
 static void process_connect_dev_msg(MPQemuMsg *msg, QIOChannel *com,
                                     Error **errp);
@@ -23,6 +24,7 @@ static void process_bar_read(QIOChannel *ioc, MPQemuMsg *msg, Error **errp);
 static void process_get_pci_info_msg(QIOChannel *ioc, MPQemuMsg *msg,
                                      PCIDevice *pci_dev);
 static void process_proxy_ping_msg(QIOChannel *ioc);
+static void process_device_reset_msg(QIOChannel *ioc);
 
 gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
                             gpointer opaque)
@@ -80,6 +82,9 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
     case PROXY_PING:
         process_proxy_ping_msg(ioc);
         break;
+    case DEVICE_RESET:
+        process_device_reset_msg(ioc);
+        break;
     default:
         error_setg(&local_err, "Unknown command (%d) received from proxy \
                    in remote process pid=%d", msg.cmd, getpid());
@@ -283,3 +288,14 @@ static void process_proxy_ping_msg(QIOChannel *ioc)
 
     mpqemu_msg_send(&ret, ioc);
 }
+
+static void process_device_reset_msg(QIOChannel *ioc)
+{
+    MPQemuMsg ret = { 0 };
+
+    qemu_devices_reset();
+
+    ret.cmd = RET_MSG;
+
+    mpqemu_msg_send(&ret, ioc);
+}
diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
index e2e9a13287..9cba2bec65 100644
--- a/hw/pci/proxy.c
+++ b/hw/pci/proxy.c
@@ -26,6 +26,7 @@
 static void probe_pci_info(PCIDevice *dev);
 static void start_hb_timer(PCIProxyDev *dev);
 static void pci_proxy_dev_exit(PCIDevice *pdev);
+static void proxy_device_reset(DeviceState *dev);
 
 static void proxy_set_socket(PCIProxyDev *pdev, int fd, Error **errp)
 {
@@ -198,6 +199,8 @@ static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
     k->config_write = pci_proxy_write_config;
     k->exit = pci_proxy_dev_exit;
 
+    dc->reset = proxy_device_reset;
+
     device_class_set_props(dc, proxy_properties);
 }
 
@@ -414,3 +417,20 @@ static void pci_proxy_dev_exit(PCIDevice *pdev)
 
     stop_hb_timer(dev);
 }
+
+static void proxy_device_reset(DeviceState *dev)
+{
+    PCIProxyDev *pdev = PCI_PROXY_DEV(dev);
+    MPQemuMsg msg = { 0 };
+    Error *local_err = NULL;
+
+    msg.bytestream = 0;
+    msg.size = sizeof(msg.data1);
+    msg.cmd = DEVICE_RESET;
+
+    (void)mpqemu_msg_send_reply_co(&msg, pdev->com, &local_err);
+    if (local_err) {
+        error_report("Failed to send DEVICE_RESET to the remote process");
+    }
+
+}
diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
index 676d7eb3ef..29dce60035 100644
--- a/include/io/mpqemu-link.h
+++ b/include/io/mpqemu-link.h
@@ -45,6 +45,7 @@ typedef enum {
     SET_IRQFD,
     GET_PCI_INFO,
     PROXY_PING,
+    DEVICE_RESET,
     MAX = INT_MAX,
 } MPQemuCmd;
 
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 20/21] multi-process: add the concept description to docs/devel/qemu-multiprocess
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (18 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 19/21] multi-process: perform device reset in the remote process elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-02 13:31   ` Stefan Hajnoczi
  2020-06-27 17:09 ` [PATCH v7 21/21] multi-process: add configure and usage information elena.ufimtseva
  2020-07-02 13:40 ` [PATCH v7 00/21] Initial support for multi-process qemu Stefan Hajnoczi
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: John G Johnson <john.g.johnson@oracle.com>

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 MAINTAINERS                  |   1 +
 docs/devel/index.rst         |   1 +
 docs/devel/multi-process.rst | 957 +++++++++++++++++++++++++++++++++++
 3 files changed, 959 insertions(+)
 create mode 100644 docs/devel/multi-process.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index f9ede7e094..a65a5dd279 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2958,6 +2958,7 @@ F: hw/pci/memory-sync.c
 F: include/hw/pci/memory-sync.h
 F: hw/remote/iohub.c
 F: include/hw/remote/iohub.h
+F: docs/devel/multi-process.rst
 
 Build and test automation
 -------------------------
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index bb8238c5d6..8ed4369632 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -28,3 +28,4 @@ Contents:
    reset
    s390-dasd-ipl
    clocks
+   multi-process
diff --git a/docs/devel/multi-process.rst b/docs/devel/multi-process.rst
new file mode 100644
index 0000000000..406728854c
--- /dev/null
+++ b/docs/devel/multi-process.rst
@@ -0,0 +1,957 @@
+Multi-process QEMU
+===================
+
+QEMU is often used as the hypervisor for virtual machines running in the
+Oracle cloud. Since one of the advantages of cloud computing is the
+ability to run many VMs from different tenants in the same cloud
+infrastructure, a guest that compromised its hypervisor could
+potentially use the hypervisor's access privileges to access data it is
+not authorized for.
+
+QEMU can be susceptible to security attacks because it is a large,
+monolithic program that provides many features to the VMs it services.
+Many of these features can be configured out of QEMU, but even a reduced
+configuration QEMU has a large amount of code a guest can potentially
+attack. Separating QEMU reduces the attack surface by aiding to
+limit each component in the system to only access the resources that
+it needs to perform its job.
+
+QEMU services
+-------------
+
+QEMU can be broadly described as providing three main services. One is a
+VM control point, where VMs can be created, migrated, re-configured, and
+destroyed. A second is to emulate the CPU instructions within the VM,
+often accelerated by HW virtualization features such as Intel's VT
+extensions. Finally, it provides IO services to the VM by emulating HW
+IO devices, such as disk and network devices.
+
+A multi-process QEMU
+~~~~~~~~~~~~~~~~~~~~
+
+A multi-process QEMU involves separating QEMU services into separate
+host processes. Each of these processes can be given only the privileges
+it needs to provide its service, e.g., a disk service could be given
+access only to the disk images it provides, and not be allowed to
+access other files, or any network devices. An attacker who compromised
+this service would not be able to use this exploit to access files or
+devices beyond what the disk service was given access to.
+
+A QEMU control process would remain, but in multi-process mode, will
+have no direct interfaces to the VM. During VM execution, it would still
+provide the user interface to hot-plug devices or live migrate the VM.
+
+A first step in creating a multi-process QEMU is to separate IO services
+from the main QEMU program, which would continue to provide CPU
+emulation. i.e., the control process would also be the CPU emulation
+process. In a later phase, CPU emulation could be separated from the
+control process.
+
+Separating IO services
+----------------------
+
+Separating IO services into individual host processes is a good place to
+begin for a couple of reasons. One is the sheer number of IO devices QEMU
+can emulate provides a large surface of interfaces which could potentially
+be exploited, and, indeed, have been a source of exploits in the past.
+Another is the modular nature of QEMU device emulation code provides
+interface points where the QEMU functions that perform device emulation
+can be separated from the QEMU functions that manage the emulation of
+guest CPU instructions. The devices emulated in the separate process are
+referred to as remote devices.
+
+QEMU device emulation
+~~~~~~~~~~~~~~~~~~~~~
+
+QEMU uses an object oriented SW architecture for device emulation code.
+Configured objects are all compiled into the QEMU binary, then objects
+are instantiated by name when used by the guest VM. For example, the
+code to emulate a device named "foo" is always present in QEMU, but its
+instantiation code is only run when the device is included in the target
+VM. (e.g., via the QEMU command line as *-device foo*)
+
+The object model is hierarchical, so device emulation code names its
+parent object (such as "pci-device" for a PCI device) and QEMU will
+instantiate a parent object before calling the device's instantiation
+code.
+
+Current separation models
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to separate the device emulation code from the CPU emulation
+code, the device object code must run in a different process. There are
+a couple of existing QEMU features that can run emulation code
+separately from the main QEMU process. These are examined below.
+
+vhost user model
+^^^^^^^^^^^^^^^^
+
+Virtio guest device drivers can be connected to vhost user applications
+in order to perform their IO operations. This model uses special virtio
+device drivers in the guest and vhost user device objects in QEMU, but
+once the QEMU vhost user code has configured the vhost user application,
+mission-mode IO is performed by the application. The vhost user
+application is a daemon process that can be contacted via a known UNIX
+domain socket.
+
+vhost socket
+''''''''''''
+
+As mentioned above, one of the tasks of the vhost device object within
+QEMU is to contact the vhost application and send it configuration
+information about this device instance. As part of the configuration
+process, the application can also be sent other file descriptors over
+the socket, which then can be used by the vhost user application in
+various ways, some of which are described below.
+
+vhost MMIO store acceleration
+'''''''''''''''''''''''''''''
+
+VMs are often run using HW virtualization features via the KVM kernel
+driver. This driver allows QEMU to accelerate the emulation of guest CPU
+instructions by running the guest in a virtual HW mode. When the guest
+executes instructions that cannot be executed by virtual HW mode,
+execution returns to the KVM driver so it can inform QEMU to emulate the
+instructions in SW.
+
+One of the events that can cause a return to QEMU is when a guest device
+driver accesses an IO location. QEMU then dispatches the memory
+operation to the corresponding QEMU device object. In the case of a
+vhost user device, the memory operation would need to be sent over a
+socket to the vhost application. This path is accelerated by the QEMU
+virtio code by setting up an eventfd file descriptor that the vhost
+application can directly receive MMIO store notifications from the KVM
+driver, instead of needing them to be sent to the QEMU process first.
+
+vhost interrupt acceleration
+''''''''''''''''''''''''''''
+
+Another optimization used by the vhost application is the ability to
+directly inject interrupts into the VM via the KVM driver, again,
+bypassing the need to send the interrupt back to the QEMU process first.
+The QEMU virtio setup code configures the KVM driver with an eventfd
+that triggers the device interrupt in the guest when the eventfd is
+written. This irqfd file descriptor is then passed to the vhost user
+application program.
+
+vhost access to guest memory
+''''''''''''''''''''''''''''
+
+The vhost application is also allowed to directly access guest memory,
+instead of needing to send the data as messages to QEMU. This is also
+done with file descriptors sent to the vhost user application by QEMU.
+These descriptors can be passed to ``mmap()`` by the vhost application
+to map the guest address space into the vhost application.
+
+IOMMUs introduce another level of complexity, since the address given to
+the guest virtio device to DMA to or from is not a guest physical
+address. This case is handled by having vhost code within QEMU register
+as a listener for IOMMU mapping changes. The vhost application maintains
+a cache of IOMMMU translations: sending translation requests back to
+QEMU on cache misses, and in turn receiving flush requests from QEMU
+when mappings are purged.
+
+applicability to device separation
+''''''''''''''''''''''''''''''''''
+
+Much of the vhost model can be re-used by separated device emulation. In
+particular, the ideas of using a socket between QEMU and the device
+emulation application, using a file descriptor to inject interrupts into
+the VM via KVM, and allowing the application to ``mmap()`` the guest
+should be re used.
+
+There are, however, some notable differences between how a vhost
+application works and the needs of separated device emulation. The most
+basic is that vhost uses custom virtio device drivers which always
+trigger IO with MMIO stores. A separated device emulation model must
+work with existing IO device models and guest device drivers. MMIO loads
+break vhost store acceleration since they are synchronous - guest
+progress cannot continue until the load has been emulated. By contrast,
+stores are asynchronous, the guest can continue after the store event
+has been sent to the vhost application.
+
+Another difference is that in the vhost user model, a single daemon can
+support multiple QEMU instances. This is contrary to the security regime
+desired, in which the emulation application should only be allowed to
+access the files or devices the VM it's running on behalf of can access.
+#### qemu-io model
+
+Qemu-io is a test harness used to test changes to the QEMU block backend
+object code. (e.g., the code that implements disk images for disk driver
+emulation) Qemu-io is not a device emulation application per se, but it
+does compile the QEMU block objects into a separate binary from the main
+QEMU one. This could be useful for disk device emulation, since its
+emulation applications will need to include the QEMU block objects.
+
+New separation model based on proxy objects
+-------------------------------------------
+
+A different model based on proxy objects in the QEMU program
+communicating with remote emulation programs could provide separation
+while minimizing the changes needed to the device emulation code. The
+rest of this section is a discussion of how a proxy object model would
+work.
+
+Remote emulation processes
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The remote emulation process will run the QEMU object hierarchy without
+modification. The device emulation objects will be also be based on the
+QEMU code, because for anything but the simplest device, it would not be
+a tractable to re-implement both the object model and the many device
+backends that QEMU has.
+
+The processes will communicate with the QEMU process over UNIX domain
+sockets. The processes can be executed either as standalone processes,
+or be executed by QEMU. In both cases, the host backends the emulation
+processes will provide are specified on its command line, as they would
+be for QEMU. For example:
+
+::
+
+    disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0  \
+    -blockdev driver=qcow2,node-name=drive0,file=file0
+
+would indicate process *disk-proc* uses a qcow2 emulated disk named
+*file0* as its backend.
+
+Emulation processes may emulate more than one guest controller. A common
+configuration might be to put all controllers of the same device class
+(e.g., disk, network, etc.) in a single process, so that all backends of
+the same type can be managed by a single QMP monitor.
+
+communication with QEMU
+^^^^^^^^^^^^^^^^^^^^^^^
+
+The first argument to the remote emulation process will be a Unix domain
+socket that connects with the Proxy object. This is a required argument.
+
+::
+
+    disk-proc <socket number> <backend list>
+
+remote process QMP monitor
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Remote emulation processes can be monitored via QMP, similar to QEMU
+itself. The QMP monitor socket is specified the same as for a QEMU
+process:
+
+::
+
+    disk-proc -qmp unix:/tmp/disk-mon,server
+
+can be monitored over the UNIX socket path */tmp/disk-mon*.
+
+QEMU command line
+~~~~~~~~~~~~~~~~~
+
+Each remote device emulated in a remote process on the host is
+represented as a *-device* of type *pci-proxy-dev*. A socket
+sub-option to this option specifies the Unix socket that connects
+to the remote process. An *id* sub-option is required, and it should
+be the same id as used in the remote process.
+
+::
+
+    qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3
+
+can be used to add a device emulated in a remote process
+
+
+QEMU management of remote processes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU is not aware of the type of type of the remote PCI device. It is
+a pass through device as far as QEMU is concerned.
+
+communication with emulation process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+primary channel
+'''''''''''''''
+
+The primary channel (referred to as com in the code) is used to bootstrap
+the remote process. It is also used to pass on device-agnostic commands
+like reset.
+
+per-device channels
+'''''''''''''''''''
+
+Each remote device communicates with QEMU using a dedicated communication
+channel. The proxy object sets up this channel using the primary
+channel during its initialization.
+
+QEMU device proxy objects
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU has an object model based on sub-classes inherited from the
+"object" super-class. The sub-classes that are of interest here are the
+"device" and "bus" sub-classes whose child sub-classes make up the
+device tree of a QEMU emulated system.
+
+The proxy object model will use device proxy objects to replace the
+device emulation code within the QEMU process. These objects will live
+in the same place in the object and bus hierarchies as the objects they
+replace. i.e., the proxy object for an LSI SCSI controller will be a
+sub-class of the "pci-device" class, and will have the same PCI bus
+parent and the same SCSI bus child objects as the LSI controller object
+it replaces.
+
+It is worth noting that the same proxy object is used to mediate with
+all types of remote PCI devices.
+
+object initialization
+^^^^^^^^^^^^^^^^^^^^^
+
+The Proxy device objects are initialized in the exact same manner in
+which any other QEMU device would be initialized.
+
+In addition, the Proxy objects perform the following two tasks:
+- Parses the "socket" sub option and connects to the remote process
+using this channel
+- Uses the "id" sub-option to connect to the emulated device on the
+separate process
+
+class\_init
+'''''''''''
+
+The ``class_init()`` method of a proxy object will, in general behave
+similarly to the object it replaces, including setting any static
+properties and methods needed by the proxy.
+
+instance\_init / realize
+''''''''''''''''''''''''
+
+The ``instance_init()`` and ``realize()`` functions would only need to
+perform tasks related to being a proxy, such are registering its own
+MMIO handlers, or creating a child bus that other proxy devices can be
+attached to later.
+
+Other tasks will be device-specific. For example, PCI device objects
+will initialize the PCI config space in order to make a valid PCI device
+tree within the QEMU process.
+
+address space registration
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most devices are driven by guest device driver accesses to IO addresses
+or ports. The QEMU device emulation code uses QEMU's memory region
+function calls (such as ``memory_region_init_io()``) to add callback
+functions that QEMU will invoke when the guest accesses the device's
+areas of the IO address space. When a guest driver does access the
+device, the VM will exit HW virtualization mode and return to QEMU,
+which will then lookup and execute the corresponding callback function.
+
+A proxy object would need to mirror the memory region calls the actual
+device emulator would perform in its initialization code, but with its
+own callbacks. When invoked by QEMU as a result of a guest IO operation,
+they will forward the operation to the device emulation process.
+
+PCI config space
+^^^^^^^^^^^^^^^^
+
+PCI devices also have a configuration space that can be accessed by the
+guest driver. Guest accesses to this space is not handled by the device
+emulation object, but by its PCI parent object. Much of this space is
+read-only, but certain registers (especially BAR and MSI-related ones)
+need to be propagated to the emulation process.
+
+PCI parent proxy
+''''''''''''''''
+
+One way to propagate guest PCI config accesses is to create a
+"pci-device-proxy" class that can serve as the parent of a PCI device
+proxy object. This class's parent would be "pci-device" and it would
+override the PCI parent's ``config_read()`` and ``config_write()``
+methods with ones that forward these operations to the emulation
+program.
+
+interrupt receipt
+^^^^^^^^^^^^^^^^^
+
+A proxy for a device that generates interrupts will need to create a
+socket to receive interrupt indications from the emulation process. An
+incoming interrupt indication would then be sent up to its bus parent to
+be injected into the guest. For example, a PCI device object may use
+``pci_set_irq()``.
+
+live migration
+^^^^^^^^^^^^^^
+
+The proxy will register to save and restore any *vmstate* it needs over
+a live migration event. The device proxy does not need to manage the
+remote device's *vmstate*; that will be handled by the remote process
+proxy (see below).
+
+QEMU remote device operation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generic device operations, such as DMA, will be performed by the remote
+process proxy by sending messages to the remote process.
+
+DMA operations
+^^^^^^^^^^^^^^
+
+DMA operations would be handled much like vhost applications do. One of
+the initial messages sent to the emulation process is a guest memory
+table. Each entry in this table consists of a file descriptor and size
+that the emulation process can ``mmap()`` to directly access guest
+memory, similar to ``vhost_user_set_mem_table()``. Note guest memory
+must be backed by file descriptors, such as when QEMU is given the
+*-mem-path* command line option.
+
+IOMMU operations
+^^^^^^^^^^^^^^^^
+
+When the emulated system includes an IOMMU, the remote process proxy in
+QEMU will need to create a socket for IOMMU requests from the emulation
+process. It will handle those requests with an
+``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
+unmaps, the remote process proxy will also register as a listener on the
+device's DMA address space. When an IOMMU memory region is created
+within the DMA address space, an IOMMU notifier for unmaps will be added
+to the memory region that will forward unmaps to the emulation process
+over the IOMMU socket.
+
+device hot-plug via QMP
+^^^^^^^^^^^^^^^^^^^^^^^
+
+An QMP "device\_add" command can add a device emulated by a remote
+process. It will also have "rid" option to the command, just as the
+*-device* command line option does. The remote process may either be one
+started at QEMU startup, or be one added by the "add-process" QMP
+command described above. In either case, the remote process proxy will
+forward the new device's JSON description to the corresponding emulation
+process.
+
+live migration
+^^^^^^^^^^^^^^
+
+The remote process proxy will also register for live migration
+notifications with ``vmstate_register()``. When called to save state,
+the proxy will send the remote process a secondary socket file
+descriptor to save the remote process's device *vmstate* over. The
+incoming byte stream length and data will be saved as the proxy's
+*vmstate*. When the proxy is resumed on its new host, this *vmstate*
+will be extracted, and a secondary socket file descriptor will be sent
+to the new remote process through which it receives the *vmstate* in
+order to restore the devices there.
+
+device emulation in remote process
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The parts of QEMU that the emulation program will need include the
+object model; the memory emulation objects; the device emulation objects
+of the targeted device, and any dependent devices; and, the device's
+backends. It will also need code to setup the machine environment,
+handle requests from the QEMU process, and route machine-level requests
+(such as interrupts or IOMMU mappings) back to the QEMU process.
+
+initialization
+^^^^^^^^^^^^^^
+
+The process initialization sequence will follow the same sequence
+followed by QEMU. It will first initialize the backend objects, then
+device emulation objects. The JSON descriptions sent by the QEMU process
+will drive which objects need to be created.
+
+-  address spaces
+
+Before the device objects are created, the initial address spaces and
+memory regions must be configured with ``memory_map_init()``. This
+creates a RAM memory region object (*system\_memory*) and an IO memory
+region object (*system\_io*).
+
+-  RAM
+
+RAM memory region creation will follow how ``pc_memory_init()`` creates
+them, but must use ``memory_region_init_ram_from_fd()`` instead of
+``memory_region_allocate_system_memory()``. The file descriptors needed
+will be supplied by the guest memory table from above. Those RAM regions
+would then be added to the *system\_memory* memory region with
+``memory_region_add_subregion()``.
+
+-  PCI
+
+IO initialization will be driven by the JSON descriptions sent from the
+QEMU process. For a PCI device, a PCI bus will need to be created with
+``pci_root_bus_new()``, and a PCI memory region will need to be created
+and added to the *system\_memory* memory region with
+``memory_region_add_subregion_overlap()``. The overlap version is
+required for architectures where PCI memory overlaps with RAM memory.
+
+MMIO handling
+^^^^^^^^^^^^^
+
+The device emulation objects will use ``memory_region_init_io()`` to
+install their MMIO handlers, and ``pci_register_bar()`` to associate
+those handlers with a PCI BAR, as they do within QEMU currently.
+
+In order to use ``address_space_rw()`` in the emulation process to
+handle MMIO requests from QEMU, the PCI physical addresses must be the
+same in the QEMU process and the device emulation process. In order to
+accomplish that, guest BAR programming must also be forwarded from QEMU
+to the emulation process.
+
+interrupt injection
+^^^^^^^^^^^^^^^^^^^
+
+When device emulation wants to inject an interrupt into the VM, the
+request climbs the device's bus object hierarchy until the point where a
+bus object knows how to signal the interrupt to the guest. The details
+depend on the type of interrupt being raised.
+
+-  PCI pin interrupts
+
+On x86 systems, there is an emulated IOAPIC object attached to the root
+PCI bus object, and the root PCI object forwards interrupt requests to
+it. The IOAPIC object, in turn, calls the KVM driver to inject the
+corresponding interrupt into the VM. The simplest way to handle this in
+an emulation process would be to setup the root PCI bus driver (via
+``pci_bus_irqs()``) to send a interrupt request back to the QEMU
+process, and have the device proxy object reflect it up the PCI tree
+there.
+
+-  PCI MSI/X interrupts
+
+PCI MSI/X interrupts are implemented in HW as DMA writes to a
+CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
+these DMA writes, then calls into the KVM driver to inject the interrupt
+into the VM. A simple emulation process implementation would be to send
+the MSI DMA address from QEMU as a message at initialization, then
+install an address space handler at that address which forwards the MSI
+message back to QEMU.
+
+DMA operations
+^^^^^^^^^^^^^^
+
+When a emulation object wants to DMA into or out of guest memory, it
+first must use dma\_memory\_map() to convert the DMA address to a local
+virtual address. The emulation process memory region objects setup above
+will be used to translate the DMA address to a local virtual address the
+device emulation code can access.
+
+IOMMU
+^^^^^
+
+When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
+regions to translate the DMA address to a guest physical address before
+that physical address can be translated to a local virtual address. The
+emulation process will need similar functionality.
+
+-  IOTLB cache
+
+The emulation process will maintain a cache of recent IOMMU translations
+(the IOTLB). When the translate() callback of an IOMMU memory region is
+invoked, the IOTLB cache will be searched for an entry that will map the
+DMA address to a guest PA. On a cache miss, a message will be sent back
+to QEMU requesting the corresponding translation entry, which be both be
+used to return a guest address and be added to the cache.
+
+-  IOTLB purge
+
+The IOMMU emulation will also need to act on unmap requests from QEMU.
+These happen when the guest IOMMU driver purges an entry from the
+guest's translation table.
+
+live migration
+^^^^^^^^^^^^^^
+
+When a remote process receives a live migration indication from QEMU, it
+will set up a channel using the received file descriptor with
+``qio_channel_socket_new_fd()``. This channel will be used to create a
+*QEMUfile* that can be passed to ``qemu_save_device_state()`` to send
+the process's device state back to QEMU. This method will be reversed on
+restore - the channel will be passed to ``qemu_loadvm_state()`` to
+restore the device state.
+
+Accelerating device emulation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The messages that are required to be sent between QEMU and the emulation
+process can add considerable latency to IO operations. The optimizations
+described below attempt to ameliorate this effect by allowing the
+emulation process to communicate directly with the kernel KVM driver.
+The KVM file descriptors created would be passed to the emulation process
+via initialization messages, much like the guest memory table is done.
+#### MMIO acceleration
+
+Vhost user applications can receive guest virtio driver stores directly
+from KVM. The issue with the eventfd mechanism used by vhost user is
+that it does not pass any data with the event indication, so it cannot
+handle guest loads or guest stores that carry store data. This concept
+could, however, be expanded to cover more cases.
+
+The expanded idea would require a new type of KVM device:
+*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
+descriptor that QEMU can use for configuration, and a slave descriptor
+that the emulation process can use to receive MMIO notifications. QEMU
+would create both descriptors using the KVM driver, and pass the slave
+descriptor to the emulation process via an initialization message.
+
+data structures
+^^^^^^^^^^^^^^^
+
+-  guest physical range
+
+The guest physical range structure describes the address range that a
+device will respond to. It includes the base and length of the range, as
+well as which bus the range resides on (e.g., on an x86machine, it can
+specify whether the range refers to memory or IO addresses).
+
+A device can have multiple physical address ranges it responds to (e.g.,
+a PCI device can have multiple BARs), so the structure will also include
+an enumerated identifier to specify which of the device's ranges is
+being referred to.
+
++--------+----------------------------+
+| Name   | Description                |
++========+============================+
+| addr   | range base address         |
++--------+----------------------------+
+| len    | range length               |
++--------+----------------------------+
+| bus    | addr type (memory or IO)   |
++--------+----------------------------+
+| id     | range ID (e.g., PCI BAR)   |
++--------+----------------------------+
+
+-  MMIO request structure
+
+This structure describes an MMIO operation. It includes which guest
+physical range the MMIO was within, the offset within that range, the
+MMIO type (e.g., load or store), and its length and data. It also
+includes a sequence number that can be used to reply to the MMIO, and
+the CPU that issued the MMIO.
+
++----------+------------------------+
+| Name     | Description            |
++==========+========================+
+| rid      | range MMIO is within   |
++----------+------------------------+
+| offset   | offset withing *rid*   |
++----------+------------------------+
+| type     | e.g., load or store    |
++----------+------------------------+
+| len      | MMIO length            |
++----------+------------------------+
+| data     | store data             |
++----------+------------------------+
+| seq      | sequence ID            |
++----------+------------------------+
+
+-  MMIO request queues
+
+MMIO request queues are FIFO arrays of MMIO request structures. There
+are two queues: pending queue is for MMIOs that haven't been read by the
+emulation program, and the sent queue is for MMIOs that haven't been
+acknowledged. The main use of the second queue is to validate MMIO
+replies from the emulation program.
+
+-  scoreboard
+
+Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
+MMIOs may be waiting to be consumed by an emulation program and multiple
+threads may be waiting for MMIO replies. The scoreboard would contain a
+wait queue and sequence number for the per-CPU threads, allowing them to
+be individually woken when the MMIO reply is received from the emulation
+program. It also tracks the number of posted MMIO stores to the device
+that haven't been replied to, in order to satisfy the PCI constraint
+that a load to a device will not complete until all previous stores to
+that device have been completed.
+
+-  device shadow memory
+
+Some MMIO loads do not have device side-effects. These MMIOs can be
+completed without sending a MMIO request to the emulation program if the
+emulation program shares a shadow image of the device's memory image
+with the KVM driver.
+
+The emulation program will ask the KVM driver to allocate memory for the
+shadow image, and will then use ``mmap()`` to directly access it. The
+emulation program can control KVM access to the shadow image by sending
+KVM an access map telling it which areas of the image have no
+side-effects (and can be completed immediately), and which require a
+MMIO request to the emulation program. The access map can also inform
+the KVM drive which size accesses are allowed to the image.
+
+master descriptor
+^^^^^^^^^^^^^^^^^
+
+The master descriptor is used by QEMU to configure the new KVM device.
+The descriptor would be returned by the KVM driver when QEMU issues a
+*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.
+
+KVM\_DEV\_TYPE\_USER device ops
+
+
+The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
+``kvm_register_device_ops()`` call when the KVM system in initialized by
+``kvm_init()``. These device ops are called by the KVM driver when QEMU
+executes certain ``ioctl()`` operations on its KVM file descriptor. They
+include:
+
+-  create
+
+This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
+``ioctl()`` on its per-VM file descriptor. It will allocate and
+initialize a KVM user device specific data structure, and assign the
+*kvm\_device* private field to it.
+
+-  ioctl
+
+This routine is invoked when QEMU issues an ``ioctl()`` on the master
+descriptor. The ``ioctl()`` commands supported are defined by the KVM
+device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands:
+
+*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
+be passed to the device emulation program. Only one slave can be created
+by each master descriptor. The file operations performed by this
+descriptor are described below.
+
+The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
+address range that the slave descriptor will receive MMIO notifications
+for. The range is specified by a guest physical range structure
+argument. For buses that assign addresses to devices dynamically, this
+command can be executed while the guest is running, such as the case
+when a guest changes a device's PCI BAR registers.
+
+*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
+register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
+performs a MMIO operation within the range. When a range is changed,
+``kvm_io_bus_unregister_dev()`` is used to remove the previous
+instantiation.
+
+*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies
+how long KVM will wait for the emulation process to respond to a MMIO
+indication.
+
+-  destroy
+
+This routine is called when the VM instance is destroyed. It will need
+to destroy the slave descriptor; and free any memory allocated by the
+driver, as well as the *kvm\_device* structure itself.
+
+slave descriptor
+^^^^^^^^^^^^^^^^
+
+The slave descriptor will have its own file operations vector, which
+responds to system calls on the descriptor performed by the device
+emulation program.
+
+-  read
+
+A read returns any pending MMIO requests from the KVM driver as MMIO
+request structures. Multiple structures can be returned if there are
+multiple MMIO operations pending. The MMIO requests are moved from the
+pending queue to the sent queue, and if there are threads waiting for
+space in the pending to add new MMIO operations, they will be woken
+here.
+
+-  write
+
+A write also consists of a set of MMIO requests. They are compared to
+the MMIO requests in the sent queue. Matches are removed from the sent
+queue, and any threads waiting for the reply are woken. If a store is
+removed, then the number of posted stores in the per-CPU scoreboard is
+decremented. When the number is zero, and a non side-effect load was
+waiting for posted stores to complete, the load is continued.
+
+-  ioctl
+
+There are several ioctl()s that can be performed on the slave
+descriptor.
+
+A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
+allocate memory for the shadow image. This memory can later be
+``mmap()``\ ed by the emulation process to share the emulation's view of
+device memory with the KVM driver.
+
+A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
+shadow image. It will send the KVM driver a shadow control map, which
+specifies which areas of the image can complete guest loads without
+sending the load request to the emulation program. It will also specify
+the size of load operations that are allowed.
+
+-  poll
+
+An emulation program will use the ``poll()`` call with a *POLLIN* flag
+to determine if there are MMIO requests waiting to be read. It will
+return if the pending MMIO request queue is not empty.
+
+-  mmap
+
+This call allows the emulation program to directly access the shadow
+image allocated by the KVM driver. As device emulation updates device
+memory, changes with no side-effects will be reflected in the shadow,
+and the KVM driver can satisfy guest loads from the shadow image without
+needing to wait for the emulation program.
+
+kvm\_io\_device ops
+^^^^^^^^^^^^^^^^^^^
+
+Each KVM per-CPU thread can handle MMIO operation on behalf of the guest
+VM. KVM will use the MMIO's guest physical address to search for a
+matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
+driver instead of exiting back to QEMU. If a match is found, the
+corresponding callback will be invoked.
+
+-  read
+
+This callback is invoked when the guest performs a load to the device.
+Loads with side-effects must be handled synchronously, with the KVM
+driver putting the QEMU thread to sleep waiting for the emulation
+process reply before re-starting the guest. Loads that do not have
+side-effects may be optimized by satisfying them from the shadow image,
+if there are no outstanding stores to the device by this CPU. PCI memory
+ordering demands that a load cannot complete before all older stores to
+the same device have been completed.
+
+-  write
+
+Stores can be handled asynchronously unless the pending MMIO request
+queue is full. In this case, the QEMU thread must sleep waiting for
+space in the queue. Stores will increment the number of posted stores in
+the per-CPU scoreboard, in order to implement the PCI ordering
+constraint above.
+
+interrupt acceleration
+^^^^^^^^^^^^^^^^^^^^^^
+
+This performance optimization would work much like a vhost user
+application does, where the QEMU process sets up *eventfds* that cause
+the device's corresponding interrupt to be triggered by the KVM driver.
+These irq file descriptors are sent to the emulation process at
+initialization, and are used when the emulation code raises a device
+interrupt.
+
+intx acceleration
+'''''''''''''''''
+
+Traditional PCI pin interrupts are level based, so, in addition to an
+irq file descriptor, a re-sampling file descriptor needs to be sent to
+the emulation program. This second file descriptor allows multiple
+devices sharing an irq to be notified when the interrupt has been
+acknowledged by the guest, so they can re-trigger the interrupt if their
+device has not de-asserted its interrupt.
+
+intx irq descriptor
+
+
+The irq descriptors are created by the proxy object
+``using event_notifier_init()`` to create the irq and re-sampling
+*eventds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt.
+The interrupt route can be found with
+``pci_device_route_intx_to_irq()``.
+
+intx routing changes
+
+
+Intx routing can be changed when the guest programs the APIC the device
+pin is connected to. The proxy object in QEMU will use
+``pci_device_set_intx_routing_notifier()`` to be informed of any guest
+changes to the route. This handler will broadly follow the VFIO
+interrupt logic to change the route: de-assigning the existing irq
+descriptor from its route, then assigning it the new route. (see
+``vfio_intx_update()``)
+
+MSI/X acceleration
+''''''''''''''''''
+
+MSI/X interrupts are sent as DMA transactions to the host. The interrupt
+data contains a vector that is programmed by the guest, A device may have
+multiple MSI interrupts associated with it, so multiple irq descriptors
+may need to be sent to the emulation program.
+
+MSI/X irq descriptor
+
+
+This case will also follow the VFIO example. For each MSI/X interrupt,
+an *eventfd* is created, a virtual interrupt is allocated by
+``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
+the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.
+
+MSI/X config space changes
+
+
+The guest may dynamically update several MSI-related tables in the
+device's PCI config space. These include per-MSI interrupt enables and
+vector data. Additionally, MSIX tables exist in device memory space, not
+config space. Much like the BAR case above, the proxy object must look
+at guest config space programming to keep the MSI interrupt state
+consistent between QEMU and the emulation program.
+
+--------------
+
+Disaggregated CPU emulation
+---------------------------
+
+After IO services have been disaggregated, a second phase would be to
+separate a process to handle CPU instruction emulation from the main
+QEMU control function. There are no object separation points for this
+code, so the first task would be to create one.
+
+Host access controls
+--------------------
+
+Separating QEMU relies on the host OS's access restriction mechanisms to
+enforce that the differing processes can only access the objects they
+are entitled to. There are a couple types of mechanisms usually provided
+by general purpose OSs.
+
+Discretionary access control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Discretionary access control allows each user to control who can access
+their files. In Linux, this type of control is usually too coarse for
+QEMU separation, since it only provides three separate access controls:
+one for the same user ID, the second for users IDs with the same group
+ID, and the third for all other user IDs. Each device instance would
+need a separate user ID to provide access control, which is likely to be
+unwieldy for dynamically created VMs.
+
+Mandatory access control
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Mandatory access control allows the OS to add an additional set of
+controls on top of discretionary access for the OS to control. It also
+adds other attributes to processes and files such as types, roles, and
+categories, and can establish rules for how processes and files can
+interact.
+
+Type enforcement
+^^^^^^^^^^^^^^^^
+
+Type enforcement assigns a *type* attribute to processes and files, and
+allows rules to be written on what operations a process with a given
+type can perform on a file with a given type. QEMU separation could take
+advantage of type enforcement by running the emulation processes with
+different types, both from the main QEMU process, and from the emulation
+processes of different classes of devices.
+
+For example, guest disk images and disk emulation processes could have
+types separate from the main QEMU process and non-disk emulation
+processes, and the type rules could prevent processes other than disk
+emulation ones from accessing guest disk images. Similarly, network
+emulation processes can have a type separate from the main QEMU process
+and non-network emulation process, and only that type can access the
+host tun/tap device used to provide guest networking.
+
+Category enforcement
+^^^^^^^^^^^^^^^^^^^^
+
+Category enforcement assigns a set of numbers within a given range to
+the process or file. The process is granted access to the file if the
+process's set is a superset of the file's set. This enforcement can be
+used to separate multiple instances of devices in the same class.
+
+For example, if there are multiple disk devices provides to a guest,
+each device emulation process could be provisioned with a separate
+category. The different device emulation processes would not be able to
+access each other's backing disk images.
+
+Alternatively, categories could be used in lieu of the type enforcement
+scheme described above. In this scenario, different categories would be
+used to prevent device emulation processes in different classes from
+accessing resources assigned to other classes.
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 21/21] multi-process: add configure and usage information
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (19 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 20/21] multi-process: add the concept description to docs/devel/qemu-multiprocess elena.ufimtseva
@ 2020-06-27 17:09 ` elena.ufimtseva
  2020-07-02 13:26   ` Stefan Hajnoczi
  2020-07-02 13:40 ` [PATCH v7 00/21] Initial support for multi-process qemu Stefan Hajnoczi
  21 siblings, 1 reply; 51+ messages in thread
From: elena.ufimtseva @ 2020-06-27 17:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, fam, swapnil.ingle, john.g.johnson, kraxel,
	jag.raman, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, stefanha,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
 MAINTAINERS                          |  2 +
 docs/multi-process.rst               | 71 ++++++++++++++++++++++++++++
 scripts/mpqemu-launcher-perf-mode.py | 67 ++++++++++++++++++++++++++
 scripts/mpqemu-launcher.py           | 47 ++++++++++++++++++
 4 files changed, 187 insertions(+)
 create mode 100644 docs/multi-process.rst
 create mode 100644 scripts/mpqemu-launcher-perf-mode.py
 create mode 100755 scripts/mpqemu-launcher.py

diff --git a/MAINTAINERS b/MAINTAINERS
index a65a5dd279..60c21d8210 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2959,6 +2959,8 @@ F: include/hw/pci/memory-sync.h
 F: hw/remote/iohub.c
 F: include/hw/remote/iohub.h
 F: docs/devel/multi-process.rst
+F: scripts/mpqemu-launcher.py
+F: scripts/mpqemu-launcher-perf-mode.py
 
 Build and test automation
 -------------------------
diff --git a/docs/multi-process.rst b/docs/multi-process.rst
new file mode 100644
index 0000000000..0fa643a511
--- /dev/null
+++ b/docs/multi-process.rst
@@ -0,0 +1,71 @@
+Multi-process QEMU
+==================
+
+This document describes how to configure and use multi-process qemu.
+For the design document refer to docs/devel/qemu-multiprocess.
+
+1) Configuration
+----------------
+
+To enable support for multi-process add --enable-mpqemu
+to the list of options for the "configure" script.
+
+
+2) Usage
+--------
+
+Multi-process QEMU requires an orchestrator to launch. Please refer to a
+light-weight python based orchestrator for mpqemu in
+scripts/mpqemu-launcher.py to lauch QEMU in multi-process mode.
+
+Please refer to scripts/mpqemu-launcher-perf-mode.py for running the remote
+process in performance mode.
+
+As of now, we only support the emulation of lsi53c895a in a separate process.
+
+Following is a description of command-line used to launch mpqemu.
+
+* Orchestrator:
+
+  - The Orchestrator creates a unix socketpair
+
+  - It launches the remote process and passes one of the
+    sockets to it via command-line.
+
+  - It then launches QEMU and specifies the other socket as an option
+    to the Proxy device object
+
+* Remote Process:
+
+  - QEMU can enter remote process mode by using the "remote" machine
+    option. The socket suboption specifies the channel in which the
+    remote process received messages
+
+  - The remaining options are no different from how one launches QEMU with
+    devices. The only other requirement is each PCI device must have a
+    unique ID specified to it. This is needed to pair remote device with the
+    Proxy object.
+
+  - Example command-line for the remote process is as follows:
+
+      /usr/bin/qemu-system-x86_64                                        \
+      -machine remote,socket=4                                           \
+      -device lsi53c895a,id=lsi0                                         \
+      -drive id=drive_image2,file=/build/ol7-nvme-test-1.qcow2           \
+      -device scsi-hd,id=drive2,drive=drive_image2,bus=lsi0.0,scsi-id=0
+
+* QEMU:
+
+  - Since parts of the RAM are shared between QEMU & remote process, a
+    memory-backend-memfd is required to facilitate this, as follows:
+
+    -object memory-backend-memfd,id=mem,size=2G
+
+  - A "pci-proxy-dev" device is created for each of the PCI devices emulated
+    in the remote process. A "socket" sub-option specifies the other end of
+    unix channel created by orchestrator. The "id" sub-option must be specified
+    and should be the same as the "id" specified for the remote PCI device
+
+  - Example commandline for QEMU is as follows:
+
+      -device pci-proxy-dev,id=lsi0,socket=3
diff --git a/scripts/mpqemu-launcher-perf-mode.py b/scripts/mpqemu-launcher-perf-mode.py
new file mode 100644
index 0000000000..583f65ca99
--- /dev/null
+++ b/scripts/mpqemu-launcher-perf-mode.py
@@ -0,0 +1,67 @@
+#!/usr/bin/env python3
+import socket
+import os
+import subprocess
+import time
+
+PROC_QEMU='/usr/bin/qemu-system-x86_64'
+
+proxy_1, remote_1 = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
+proxy_2, remote_2 = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
+
+remote_cmd_1 = [ PROC_QEMU,                                                    \
+                 '-machine', 'remote,socket='+str(remote_1.fileno()),          \
+                 '-device', 'lsi53c895a,id=lsi1',                              \
+                 '-drive', 'id=drive_image1,file=/test-1.qcow2',               \
+                 '-device', 'scsi-hd,id=drive1,drive=drive_image1,bus=lsi1.0,' \
+                                'scsi-id=0',                                   \
+                 '-nographic',                                                 \
+               ]
+
+remote_cmd_2 = [ PROC_QEMU,                                                    \
+                 '-machine', 'remote,socket='+str(remote_2.fileno()),          \
+                 '-device', 'lsi53c895a,id=lsi2',                              \
+                 '-drive', 'id=drive_image2,file=/test-2.qcow2',               \
+                 '-device', 'scsi-hd,id=drive2,drive=drive_image2,bus=lsi2.0,' \
+                                'scsi-id=0',                                   \
+                 '-nographic',                                                 \
+               ]
+
+proxy_cmd = [ PROC_QEMU,                                                       \
+              '-name', 'OL7.4',                                                \
+              '-machine', 'q35,accel=kvm',                                     \
+              '-smp', 'sockets=1,cores=1,threads=1',                           \
+              '-m', '2048',                                                    \
+              '-object', 'memory-backend-memfd,id=sysmem-file,size=2G',        \
+              '-numa', 'node,memdev=sysmem-file',                              \
+              '-device', 'virtio-scsi-pci,id=virtio_scsi_pci0',                \
+              '-drive', 'id=drive_image1,if=none,format=qcow2,'                \
+                            'file=/home/ol7-hdd-1.qcow2',                      \
+              '-device', 'scsi-hd,id=image1,drive=drive_image1,'               \
+                             'bus=virtio_scsi_pci0.0',                         \
+              '-boot', 'd',                                                    \
+              '-vnc', ':0',                                                    \
+              '-device', 'pci-proxy-dev,id=lsi1,fd='+str(proxy_1.fileno()),    \
+              '-device', 'pci-proxy-dev,id=lsi2,fd='+str(proxy_2.fileno()),    \
+            ]
+
+
+pid = os.fork();
+if pid == 0:
+    # In remote_1
+    print('Launching Remote process 1');
+    process = subprocess.Popen(remote_cmd_1, pass_fds=[remote_1.fileno()])
+    os._exit(0)
+
+
+pid = os.fork();
+if pid == 0:
+    # In remote_2
+    print('Launching Remote process 2');
+    process = subprocess.Popen(remote_cmd_2, pass_fds=[remote_2.fileno()])
+    os._exit(0)
+
+
+print('Launching Proxy process');
+process = subprocess.Popen(proxy_cmd, pass_fds=[proxy_1.fileno(),
+                           proxy_2.fileno()])
diff --git a/scripts/mpqemu-launcher.py b/scripts/mpqemu-launcher.py
new file mode 100755
index 0000000000..751b51eae7
--- /dev/null
+++ b/scripts/mpqemu-launcher.py
@@ -0,0 +1,47 @@
+#!/usr/bin/env python3
+import socket
+import os
+import subprocess
+import time
+
+PROC_QEMU='/usr/bin/qemu-system-x86_64'
+
+proxy, remote = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
+
+remote_cmd = [ PROC_QEMU,                                                      \
+               '-machine', 'remote,socket='+str(remote.fileno()),              \
+               '-device', 'lsi53c895a,id=lsi1',                                \
+               '-drive', 'id=drive_image1,file=/build/ol7-nvme-test-1.qcow2',  \
+               '-device', 'scsi-hd,id=drive1,drive=drive_image1,bus=lsi1.0,'   \
+                              'scsi-id=0',                                     \
+               '-nographic',                                                   \
+             ]
+
+proxy_cmd = [ PROC_QEMU,                                                       \
+              '-name', 'OL7.4',                                                \
+              '-machine', 'q35,accel=kvm',                                     \
+              '-smp', 'sockets=1,cores=1,threads=1',                           \
+              '-m', '2048',                                                    \
+              '-object', 'memory-backend-memfd,id=sysmem-file,size=2G',        \
+              '-numa', 'node,memdev=sysmem-file',                              \
+              '-device', 'virtio-scsi-pci,id=virtio_scsi_pci0',                \
+              '-drive', 'id=drive_image1,if=none,format=qcow2,'                \
+                            'file=/home/ol7-hdd-1.qcow2',                      \
+              '-device', 'scsi-hd,id=image1,drive=drive_image1,'               \
+                             'bus=virtio_scsi_pci0.0',                         \
+              '-boot', 'd',                                                    \
+              '-vnc', ':0',                                                    \
+              '-device', 'pci-proxy-dev,id=lsi1,fd='+str(proxy.fileno()),      \
+            ]
+
+
+pid = os.fork();
+
+if pid:
+    # In Proxy
+    print('Launching QEMU with Proxy object');
+    process = subprocess.Popen(proxy_cmd, pass_fds=[proxy.fileno()])
+else:
+    # In remote
+    print('Launching Remote process');
+    process = subprocess.Popen(remote_cmd, pass_fds=[remote.fileno()])
-- 
2.25.GIT



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 02/21] multi-process: Add config option for multi-process QEMU
  2020-06-27 17:09 ` [PATCH v7 02/21] multi-process: Add config option for multi-process QEMU elena.ufimtseva
@ 2020-06-30 14:57   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-06-30 14:57 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 508 bytes --]

On Sat, Jun 27, 2020 at 10:09:24AM -0700, elena.ufimtseva@oracle.com wrote:
> From: Jagannathan Raman <jag.raman@oracle.com>
> 
> Add a configuration option to separate multi-process code
> 
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> ---
>  configure | 11 +++++++++++
>  1 file changed, 11 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 01/21] memory: alloc RAM from file at offset
  2020-06-27 17:09 ` [PATCH v7 01/21] memory: alloc RAM from file at offset elena.ufimtseva
@ 2020-06-30 14:59   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-06-30 14:59 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 979 bytes --]

On Sat, Jun 27, 2020 at 10:09:23AM -0700, elena.ufimtseva@oracle.com wrote:
> From: Jagannathan Raman <jag.raman@oracle.com>
> 
> Allow RAM MemoryRegion to be created from an offset in a file, instead
> of allocating at offset of 0 by default. This is needed to synchronize
> RAM between QEMU & remote process.
> 
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> ---
>  backends/hostmem-memfd.c  |  2 +-
>  exec.c                    | 11 +++++++----
>  hw/misc/ivshmem.c         |  3 ++-
>  include/exec/memory.h     |  2 ++
>  include/exec/ram_addr.h   |  2 +-
>  include/qemu/mmap-alloc.h |  3 ++-
>  memory.c                  |  3 ++-
>  util/mmap-alloc.c         |  7 ++++---
>  util/oslib-posix.c        |  2 +-
>  9 files changed, 22 insertions(+), 13 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 03/21] multi-process: setup PCI host bridge for remote device
  2020-06-27 17:09 ` [PATCH v7 03/21] multi-process: setup PCI host bridge for remote device elena.ufimtseva
@ 2020-06-30 15:17   ` Stefan Hajnoczi
  2020-07-09 14:23     ` Jag Raman
  0 siblings, 1 reply; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-06-30 15:17 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 1528 bytes --]

On Sat, Jun 27, 2020 at 10:09:25AM -0700, elena.ufimtseva@oracle.com wrote:
> diff --git a/hw/pci-host/remote.c b/hw/pci-host/remote.c
> new file mode 100644
> index 0000000000..5ea9af4154
> --- /dev/null
> +++ b/hw/pci-host/remote.c
> @@ -0,0 +1,63 @@
> +/*
> + * Remote PCI host device
> + *
> + * Copyright © 2018, 2020 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *

A little more detail would be nice:

  Unlike PCI host devices that model physical hardware, the purpose of
  this PCI host is to host multi-process QEMU remote PCI devices.

  Multi-process QEMU talks to a remote PCI device that runs in a
  separate process. In order to reuse QEMU device models in the remote
  process we need a PCI bus that holds the devices.

  This PCI host is purely a container for PCI devices. It's fake in the
  sense that the guest never sees this PCI host and has no way of
  accessing it. It's job is just to provide the environment that QEMU
  PCI device models need when running in a remote process.

I think this could be restated more clearly but hopefully it
communicates the purpose of hw/pci-host/remote.c. :P

> +typedef struct RemotePCIHost {
> +    /*< private >*/
> +    PCIExpressHost parent_obj;
> +    /*< public >*/
> +
> +    MemoryRegion *mr_pci_mem;
> +    MemoryRegion *mr_sys_mem;

Unused? Please add mr_sys_mem if and when it is used.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 04/21] multi-process: setup a machine object for remote device process
  2020-06-27 17:09 ` [PATCH v7 04/21] multi-process: setup a machine object for remote device process elena.ufimtseva
@ 2020-06-30 15:26   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-06-30 15:26 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 1059 bytes --]

On Sat, Jun 27, 2020 at 10:09:26AM -0700, elena.ufimtseva@oracle.com wrote:
> diff --git a/hw/i386/remote.c b/hw/i386/remote.c
> new file mode 100644
> index 0000000000..4d13abe9f3
> --- /dev/null
> +++ b/hw/i386/remote.c
> @@ -0,0 +1,64 @@
> +/*
> + * Machine for remote device
> + *
> + * Copyright © 2018, 2020 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.

A suggestion for expanding the comment:

  This machine type is used by the remote device process in
  multi-process QEMU. QEMU device models depend on parent busses,
  interrupt controllers, memory regions, etc. The remote machine type
  offers this environment so that QEMU device models can be used as
  remote devices.

> +typedef struct RemMachineState {
> +    MachineState parent_obj;
> +
> +    RemotePCIHost *host;
> +} RemMachineState;

s/RemMachineState/RemoteMachineState/ for consistency with
"remote-machine" and RemotePCIHost.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 05/21] multi-process: add qio channel function to transmit
  2020-06-27 17:09 ` [PATCH v7 05/21] multi-process: add qio channel function to transmit elena.ufimtseva
@ 2020-06-30 15:29   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-06-30 15:29 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 646 bytes --]

On Sat, Jun 27, 2020 at 10:09:27AM -0700, elena.ufimtseva@oracle.com wrote:
> From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> 
> The entire array of the memory regions and file handlers.
> Will be used in the next patch.
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  include/io/channel.h | 24 +++++++++++++++++++++++
>  io/channel.c         | 45 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 69 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 06/21] multi-process: define MPQemuMsg format and transmission functions
  2020-06-27 17:09 ` [PATCH v7 06/21] multi-process: define MPQemuMsg format and transmission functions elena.ufimtseva
@ 2020-06-30 15:53   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-06-30 15:53 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 4045 bytes --]

On Sat, Jun 27, 2020 at 10:09:28AM -0700, elena.ufimtseva@oracle.com wrote:
> +void mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc)
> +{
> +    Error *local_err = NULL;
> +    struct iovec send[2];
> +    int *fds = NULL;
> +    size_t nfds = 0;
> +
> +    send[0].iov_base = msg;
> +    send[0].iov_len = MPQEMU_MSG_HDR_SIZE;
> +
> +    send[1].iov_base = msg->bytestream ? msg->data2 : (void *)&msg->data1;
> +    send[1].iov_len = msg->size;
> +
> +    if (msg->num_fds) {
> +        nfds = msg->num_fds;
> +        fds = msg->fds;
> +    }
> +
> +    (void)qio_channel_writev_full_all(ioc, send, G_N_ELEMENTS(send), fds, nfds,
> +                                      &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +    }

Error propagation is missing. Is this by design, i.e. errors only need
to be handled in the read path, or is the patch incomplete?

> +}
> +
> +static int mpqemu_readv(QIOChannel *ioc, struct iovec *iov, int **fds,
> +                        size_t *nfds, Error **errp)
> +{
> +    size_t size, len;
> +
> +    size = iov->iov_len;
> +
> +    while (size > 0) {
> +        len = qio_channel_readv_full(ioc, iov, 1, fds, nfds, errp);

iovec buffers are overwritten when this loop iterates because iov is
not updated after reading some data.

fds is leaked when this loop iterates multiple times and there are fds
each time. fds/nfds should probably be set to 0 after the first
iteration.

> +
> +        if (len == QIO_CHANNEL_ERR_BLOCK) {
> +            if (qemu_in_coroutine()) {
> +                qio_channel_yield(ioc, G_IO_IN);
> +            } else {
> +                qio_channel_wait(ioc, G_IO_IN);
> +            }
> +            continue;
> +        }
> +
> +        if (len <= 0) {
> +            return -EIO;
> +        }
> +
> +        size -= len;
> +    }
> +
> +    return iov->iov_len;
> +}
> +
> +int mpqemu_msg_recv(MPQemuMsg *msg, QIOChannel *ioc)
> +{
> +    Error *local_err = NULL;
> +    int *fds = NULL;
> +    struct iovec hdr, data;
> +    size_t nfds = 0;
> +
> +    hdr.iov_base = g_malloc0(MPQEMU_MSG_HDR_SIZE);
> +    hdr.iov_len = MPQEMU_MSG_HDR_SIZE;
> +
> +    if (mpqemu_readv(ioc, &hdr, &fds, &nfds, &local_err) < 0) {
> +        return -EIO;

hdr.iov_base is leaked.

Why is hdr.iov_base allocated on the heap? Can you read directly into
msg instead of using an extra variable?

> +    }
> +
> +    memcpy(msg, hdr.iov_base, hdr.iov_len);
> +
> +    free(hdr.iov_base);
> +    if (msg->size > MPQEMU_MSG_DATA_MAX) {
> +        error_report("The message size is more than MPQEMU_MSG_DATA_MAX %d",
> +                     MPQEMU_MSG_DATA_MAX);
> +        return -EINVAL;
> +    }
> +
> +    data.iov_base = g_malloc0(msg->size);
> +    data.iov_len = msg->size;
> +
> +    if (mpqemu_readv(ioc, &data, NULL, NULL, &local_err) < 0) {
> +        return -EIO;

data.iov_base is leaked.

> +    }
> +
> +    if (msg->bytestream) {
> +        msg->data2 = calloc(1, msg->size);

QEMU uses glib memory allocation functions when possible. Please use
g_malloc0().

The calloc() and memcpy() can be avoided like this:

  msg->data2 = data.iov_base;
  data.iov_base = NULL; /* the pointer has been moved, don't free it */

> +        memcpy(msg->data2, data.iov_base, msg->size);
> +    } else {
> +        memcpy((void *)&msg->data1, data.iov_base, msg->size);

Buffer overflow when msg->size > sizeof(msg->data1). There should be a
check:

  if (msg->bytestream) {
      ...
  } else {
      if (msg->size > sizeof(msg->data1)) {
          ...handle invalid size...
	  return -EIO;
      }
      memcpy(&msg->data1, data.iov_base, msg->size);
  }

> +    }
> +
> +    free(data.iov_base);

Should be g_free() since the allocation was made with g_malloc0().

> +
> +    if (nfds) {
> +        msg->num_fds = nfds;
> +        memcpy(msg->fds, fds, nfds * sizeof(int));
> +    }

It's safer to move the msg->num_fds = nfds outside the if statement so
that it always happens. That way num_fds is set to 0 and we don't depend
on the caller initializing it to 0.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 07/21] multi-process: add co-routines to communicate with remote
  2020-06-27 17:09 ` [PATCH v7 07/21] multi-process: add co-routines to communicate with remote elena.ufimtseva
@ 2020-06-30 18:31   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-06-30 18:31 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 8358 bytes --]

On Sat, Jun 27, 2020 at 10:09:29AM -0700, elena.ufimtseva@oracle.com wrote:
> From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> 
> process to avoid blocking the main loop during the message exchanges.
> To be used by proxy device.

The nested aio_poll() in this patch could be a problem. I've highlighted
it here so that others skimming through this series can join the
discussion without reading all my review comments:

  qemu_coroutine_enter(req.co);
  while (!req.finished) {
      aio_poll(qemu_get_aio_context(), false);

This is called from the proxy device, which means the vCPU thread. The
QEMU global mutex is held here.

aio_poll() does not release the mutex, so other vCPU threads will not be
able to run QEMU code until mpqemu_msg_send_reply_co() completes and the
QEMU global mutex has been released again.

Our goal is to do blocking socket I/O here since our vCPU thread cannot
make progress until the remote device replies. This means we can forget
about QEMU's event loop and simply do blocking I/O from the vCPU thread.
The QEMU global mutex can be dropped around the send/recv but a
per-socket mutex is needed to protect against multiple vCPU threads
interleaving requests.

I think this boils down to:

  /*
   * Blocking version of send+recv for the proxy device. Called from
   * vCPU thread where we cannot make progress until the remote device
   * has replied. Releases and re-acquires the QEMU global mutex.
   */
  uint64_t mpqemu_msg_send_and_await_reply(MPQemuMsg *msg, QIOChannel *ioc,
                                           Error **errp)
  {
      uint64_t ret;

      qemu_mutex_unlock_iothread();

      qemu_mutex_lock(&mplink->io_mutex);
      blocking_send(ioc, msg);
      ret = blocking_recv(ioc);
      qemu_mutex_unlock(&mplink->io_mutex);

      qemu_mutex_lock_iothread();

      return ret;
  }

This is very simple! The tricky thing is combining this approach with
mplink I/O that happens in the event loop. The heartbeat is one example.
I don't remember the protocol details enough to know how the other
message exchanges work. Code running in the event loop is not allowed to
block, so it really needs to use coroutines.

> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> ---
>  include/io/mpqemu-link.h | 16 +++++++++
>  io/mpqemu-link.c         | 78 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 94 insertions(+)
> 
> diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
> index 1542e8ed07..52aa89656c 100644
> --- a/include/io/mpqemu-link.h
> +++ b/include/io/mpqemu-link.h
> @@ -17,6 +17,7 @@
>  #include "qom/object.h"
>  #include "qemu/thread.h"
>  #include "io/channel.h"
> +#include "io/channel-socket.h"
>  
>  #define REMOTE_MAX_FDS 8
>  
> @@ -30,6 +31,7 @@
>   */
>  typedef enum {
>      INIT = 0,
> +    RET_MSG,
>      MAX = INT_MAX,
>  } MPQemuCmd;
>  
> @@ -67,6 +69,20 @@ typedef struct {
>      uint8_t *data2;
>  } MPQemuMsg;
>  
> +struct MPQemuRequest {
> +    MPQemuMsg *msg;
> +    QIOChannelSocket *sioc;
> +    Coroutine *co;
> +    bool finished;
> +    int error;
> +    long ret;
> +};
> +
> +typedef struct MPQemuRequest MPQemuRequest;
> +
> +uint64_t mpqemu_msg_send_reply_co(MPQemuMsg *msg, QIOChannel *ioc,

This function sends a message and waits for the reply. The name confuses
me, I would guess that this function sends a reply. Perhaps
mpqemu_msg_send_and_await_reply() is a clearer name.

> +                                  Error **errp);

Functions with "co" in their name run in coroutine context. This
function does not. Please remove the _co from the name to avoid
confusion.

> +
>  void mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc);
>  int mpqemu_msg_recv(MPQemuMsg *msg, QIOChannel *ioc);
>  
> diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
> index bfc542b5fd..c430b4d6a2 100644
> --- a/io/mpqemu-link.c
> +++ b/io/mpqemu-link.c
> @@ -16,6 +16,8 @@
>  #include "qapi/error.h"
>  #include "qemu/iov.h"
>  #include "qemu/error-report.h"
> +#include "qemu/main-loop.h"
> +#include "io/channel-socket.h"
>  
>  void mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc)
>  {
> @@ -118,6 +120,82 @@ int mpqemu_msg_recv(MPQemuMsg *msg, QIOChannel *ioc)
>      return 0;
>  }
>  
> +/* Use in proxy only as it clobbers fd handlers. */
> +static void coroutine_fn mpqemu_msg_send_co(void *data)
> +{
> +    MPQemuRequest *req = (MPQemuRequest *)data;
> +    MPQemuMsg msg_reply = {0};
> +    long ret = -EINVAL;
> +
> +    if (!req->sioc) {
> +        error_report("No channel available to send command %d",
> +                     req->msg->cmd);

Can this ever happen?

> +        atomic_mb_set(&req->finished, true);

Why is atomic_mb_set() used here?

> +        req->error = -EINVAL;
> +        return;
> +    }
> +
> +    req->co = qemu_coroutine_self();
> +    mpqemu_msg_send(req->msg, QIO_CHANNEL(req->sioc));
> +
> +    yield_until_fd_readable(req->sioc->fd);
> +
> +    ret = mpqemu_msg_recv(&msg_reply, QIO_CHANNEL(req->sioc));

mpqemu_msg_recv() already yields when called from a coroutine. Why is
yield_until_fd_readable() called explicitly here?

After removing the yield_until_fd_readable() it will also be possible to
use QIOChannel instead of the more specific QIOChannelSocket. That will
make the code a little simpler because the QIO_CHANNEL() casts can be
removed.

> +    if (ret < 0) {
> +        error_report("ERROR: failed to get a reply for command %d, \
> +                     errno %s, ret is %ld",

C doesn't support backslash string wrapping in a useful way. The
backslash tells the preprocessor to wrap the line but the string will
contain the leading whitespace.

This format string is equivalent to:

  "ERROR: failed to get a reply for command %d,                    errno %s, ret is %ld"

A common way of wrapping strings is:

  error_report("ERROR: failed to get a reply for command %d, "
               "errno %s, ret is %ld",

> +                     req->msg->cmd, strerror(errno), ret);
> +        req->error = -errno;

s/errno/ret/

> +    } else {
> +        if (!mpqemu_msg_valid(&msg_reply) || msg_reply.cmd != RET_MSG) {
> +            error_report("ERROR: Invalid reply received for command %d",
> +                         req->msg->cmd);
> +            req->error = -EINVAL;
> +        } else {
> +            req->ret = msg_reply.data1.u64;
> +        }
> +    }
> +    atomic_mb_set(&req->finished, true);
> +}
> +
> +/*
> + * Create if needed and enter co-routine to send the message to the
> + * remote channel ioc and wait for the reply.
> + * Resturns the value from the reply message, sets the error on failure.

s/Resturns/Returns/

> + */
> +
> +uint64_t mpqemu_msg_send_reply_co(MPQemuMsg *msg, QIOChannel *ioc,
> +                                  Error **errp)
> +{
> +    MPQemuRequest req = {0};
> +    uint64_t ret = UINT64_MAX;
> +
> +    req.sioc = QIO_CHANNEL_SOCKET(ioc);
> +    if (!req.sioc) {
> +        return ret;

The doc comment says "sets the error on failure" but this error return
does not set errp.

> +    }
> +
> +    req.msg = msg;
> +    req.ret = 0;
> +    req.finished = false;
> +
> +    if (!req.co) {

req.co is always NULL here because it's a local variable initialized to
0 at the beginning of this function.

> +        req.co = qemu_coroutine_create(mpqemu_msg_send_co, &req);
> +    }
> +
> +    qemu_coroutine_enter(req.co);
> +    while (!req.finished) {
> +        aio_poll(qemu_get_aio_context(), false);

The second argument must be true so that aio_poll() waits for activity
instead of returning immediately. If it returns immediately then this
loop will consume 100% CPU.

> +    }
> +    if (req.error) {
> +        error_setg(errp, "Error exchanging message with remote process, "\
> +                        "socket %d, error %d", req.sioc->fd, req.error);
> +    }
> +    ret = req.ret;
> +
> +    return ret;
> +}
> +
>  bool mpqemu_msg_valid(MPQemuMsg *msg)
>  {
>      if (msg->cmd >= MAX && msg->cmd < 0) {
> -- 
> 2.25.GIT
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 08/21] multi-process: Initialize communication channel at the remote end
  2020-06-27 17:09 ` [PATCH v7 08/21] multi-process: Initialize communication channel at the remote end elena.ufimtseva
@ 2020-07-01  6:44   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-01  6:44 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 1177 bytes --]

On Sat, Jun 27, 2020 at 10:09:30AM -0700, elena.ufimtseva@oracle.com wrote:
> @@ -42,6 +43,20 @@ static void remote_machine_init(MachineState *machine)
>      qdev_realize(DEVICE(rem_host), sysbus_get_default(), &error_fatal);
>  }
>  
> +static void remote_set_socket(Object *obj, const char *str, Error **errp)
> +{
> +    RemMachineState *s = REMOTE_MACHINE(obj);
> +    Error *local_err = NULL;
> +    int fd = atoi(str);
> +
> +    s->ioc = qio_channel_new_fd(fd, &local_err);

Missing error handling and local_err is leaked. errp should be set.

> +}
> +
> +static void remote_instance_init(Object *obj)
> +{
> +    object_property_add_str(obj, "socket", NULL, remote_set_socket);
> +}

The name "socket" does not communicate the structure of the value. Is it
a <host>:<port> pair, a UNIX domain socket path, a file descriptor, etc?
It's common to name file descriptor arguments with an "fd" suffix (e.g.
vhostfd) and that would work here too.

Please also include a help string with

  object_property_set_description(obj, property_name, help_text);

The help string when QEMU is invoked like this:

  $ qemu-system-x86_64 -M remote,\?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 09/21] multi-process: Initialize message handler in remote device
  2020-06-27 17:09 ` [PATCH v7 09/21] multi-process: Initialize message handler in remote device elena.ufimtseva
@ 2020-07-01  6:53   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-01  6:53 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 1971 bytes --]

On Sat, Jun 27, 2020 at 10:09:31AM -0700, elena.ufimtseva@oracle.com wrote:
> diff --git a/hw/i386/remote-msg.c b/hw/i386/remote-msg.c
> new file mode 100644
> index 0000000000..58e24ab2ad
> --- /dev/null
> +++ b/hw/i386/remote-msg.c
> @@ -0,0 +1,52 @@
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +
> +#include "hw/i386/remote.h"
> +#include "io/channel.h"
> +#include "io/mpqemu-link.h"
> +#include "qapi/error.h"
> +#include "sysemu/runstate.h"
> +
> +gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
> +                            gpointer opaque)
> +{
> +    Error *local_err = NULL;
> +    MPQemuMsg msg = { 0 };
> +
> +    if (cond & G_IO_HUP) {
> +        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
> +    }

Missing return FALSE?

> +
> +    if (cond & (G_IO_ERR | G_IO_NVAL)) {
> +        error_setg(&local_err, "Error %d while processing message from proxy \
> +                   in remote process pid=%d", errno, getpid());

This sets and leaks local_err. Should this be error_report()?

> +        return FALSE;
> +    }
> +
> +    if (mpqemu_msg_recv(&msg, ioc) < 0) {
> +        return FALSE;
> +    }
> +
> +    if (!mpqemu_msg_valid(&msg)) {
> +        error_report("Received invalid message from proxy \
> +                     in remote process pid=%d", getpid());
> +        return TRUE;
> +    }

Why return TRUE here but FALSE in previous error cases?

> +
> +    switch (msg.cmd) {
> +    default:
> +        error_setg(&local_err, "Unknown command (%d) received from proxy \
> +                   in remote process pid=%d", msg.cmd, getpid());
> +    }
> +
> +    if (msg.data2) {
> +        free(msg.data2);
> +    }

A mpqemu_msg_cleanup() function in mpqemu-link.h would be nice. That way
the code that frees data2 can be next to the code that allocates it.

Do passed file descriptors need to be closed too (especially in cases
where the command normally does not expect passed file descriptors)?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 10/21] multi-process: setup memory manager for remote device
  2020-06-27 17:09 ` [PATCH v7 10/21] multi-process: setup memory manager for " elena.ufimtseva
@ 2020-07-01  7:58   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-01  7:58 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 1358 bytes --]

On Sat, Jun 27, 2020 at 10:09:32AM -0700, elena.ufimtseva@oracle.com wrote:
> +void remote_sysmem_reconfig(MPQemuMsg *msg, Error **errp)
> +{
> +    SyncSysmemMsg *sysmem_info = &msg->data1.sync_sysmem;
> +    MemoryRegion *sysmem, *subregion, *next;
> +    static unsigned int suffix;
> +    Error *local_err = NULL;
> +    char *name;
> +    int region;
> +
> +    sysmem = get_system_memory();
> +
> +    memory_region_transaction_begin();
> +
> +    QTAILQ_FOREACH_SAFE(subregion, &sysmem->subregions, subregions_link, next) {
> +        if (subregion->ram) {
> +            memory_region_del_subregion(sysmem, subregion);
> +            qemu_ram_free(subregion->ram_block);

Why is qemu_ram_free() called explicitly? Normally the only caller is
memory_region_destructor_ram().

MemoryRegions are reference counted. The qemu_ram_free() call should
probably be replaced with memory_region_unref(subregion). This also
solves the subregion memory leak.

That said, I haven't read the MemoryRegion lifecycle code so please
check that what I'm suggesting is correct.

> +typedef struct {
> +    hwaddr gpas[REMOTE_MAX_FDS];
> +    uint64_t sizes[REMOTE_MAX_FDS];
> +    off_t offsets[REMOTE_MAX_FDS];
> +} SyncSysmemMsg;

It's safer to use fixed-width types like uint64_t for external
communication, but in this case we expect the client and server to
match.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 11/21] multi-process: introduce proxy object
  2020-06-27 17:09 ` [PATCH v7 11/21] multi-process: introduce proxy object elena.ufimtseva
@ 2020-07-01  8:58   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-01  8:58 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 3659 bytes --]

On Sat, Jun 27, 2020 at 10:09:33AM -0700, elena.ufimtseva@oracle.com wrote:
> From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> 
> Defines a PCI Device proxy object as a child of TYPE_PCI_DEVICE.
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> ---
>  MAINTAINERS            |  2 ++
>  hw/pci/Makefile.objs   |  1 +
>  hw/pci/proxy.c         | 70 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci/proxy.h | 43 ++++++++++++++++++++++++++
>  4 files changed, 116 insertions(+)
>  create mode 100644 hw/pci/proxy.c
>  create mode 100644 include/hw/pci/proxy.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 017c96eace..b48c3114c1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2952,6 +2952,8 @@ F: include/io/mpqemu-link.h
>  F: hw/i386/remote-msg.c
>  F: include/hw/i386/remote-memory.h
>  F: hw/i386/remote-memory.c
> +F: hw/pci/proxy.c
> +F: include/hw/pci/proxy.h
>  
>  Build and test automation
>  -------------------------
> diff --git a/hw/pci/Makefile.objs b/hw/pci/Makefile.objs
> index c78f2fb24b..515dda506c 100644
> --- a/hw/pci/Makefile.objs
> +++ b/hw/pci/Makefile.objs
> @@ -12,3 +12,4 @@ common-obj-$(CONFIG_PCI_EXPRESS) += pcie_port.o pcie_host.o
>  
>  common-obj-$(call lnot,$(CONFIG_PCI)) += pci-stub.o
>  common-obj-$(CONFIG_ALL) += pci-stub.o
> +obj-$(CONFIG_MPQEMU) += proxy.o
> diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
> new file mode 100644
> index 0000000000..6d62399c52
> --- /dev/null
> +++ b/hw/pci/proxy.c
> @@ -0,0 +1,70 @@
> +/*
> + * Copyright © 2018, 2020 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +
> +#include "hw/pci/proxy.h"
> +#include "hw/pci/pci.h"
> +#include "qapi/error.h"
> +#include "io/channel-util.h"
> +#include "hw/qdev-properties.h"
> +#include "monitor/monitor.h"
> +
> +static void proxy_set_socket(PCIProxyDev *pdev, int fd, Error **errp)
> +{
> +    pdev->com = qio_channel_new_fd(fd, errp);

The caller needs to close(fd) on failure:

  if (!pdev->com) {
      close(fd);
  }

> +}

pdev->com is never freed. It seems like hotplug should be possible
eventually. Implementing ->unrealize() from the start will make that
easier because it will be necessary to think through the lifecycle of
this object.

> +
> +static Property proxy_properties[] = {
> +    DEFINE_PROP_STRING("fd", PCIProxyDev, fd),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
> +{
> +    PCIProxyDev *dev = PCI_PROXY_DEV(device);
> +    int proxyfd;
> +
> +    if (dev->fd) {

Does it make sense to succeed without fd? If not then errp should be
set so the caller knows .realize() failed.

> +        proxyfd = monitor_fd_param(cur_mon, dev->fd, errp);
> +        if (proxyfd == -1) {
> +            error_prepend(errp, "proxy: unable to parse proxyfd: ");

The user-visible property name is "fd":
s/proxyfd/fd/

> +typedef struct PCIProxyDevClass {
> +    PCIDeviceClass parent_class;
> +
> +    void (*realize)(PCIProxyDev *dev, Error **errp);
> +
> +    char *command;
> +} PCIProxyDevClass;

Neither ->realize nor ->command are used in this patch. Please drop
PCIProxyDevClass for now. If these fields are used later on then they
should be introduced in the patches that need them.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 12/21] multi-process: Connect Proxy Object with device in the remote process
  2020-06-27 17:09 ` [PATCH v7 12/21] multi-process: Connect Proxy Object with device in the remote process elena.ufimtseva
@ 2020-07-01  9:20   ` Stefan Hajnoczi
  2020-07-24 16:57     ` Jag Raman
  0 siblings, 1 reply; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-01  9:20 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 3637 bytes --]

On Sat, Jun 27, 2020 at 10:09:34AM -0700, elena.ufimtseva@oracle.com wrote:
> From: Jagannathan Raman <jag.raman@oracle.com>
> 
> Send a message to the remote process to connect PCI device with the
> corresponding Proxy object in QEMU

I thought the protocol was simplified to a 1:1 device:socket model, but
this patch seems to implement an N:1 model?

In a 1:1 model the CONNECT_DEV message is not necessary because each
socket is already associated with a specific remote device (e.g. qemu -M
remote -object mplink,dev=lsi-scsi-1,sockpath=/tmp/lsi-scsi-1.sock).
Connecting to the socket already indicates which device we are talking
to.

The N:1 model will work but it's more complex. There is a main socket
that is used for CONNECT_DEV (anything else?) and we need to worry about
the lifecycle of the per-device sockets that are passed over the main
socket.

> @@ -50,3 +58,34 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
>  
>      return TRUE;
>  }
> +
> +static void process_connect_dev_msg(MPQemuMsg *msg, QIOChannel *com,
> +                                    Error **errp)
> +{
> +    char *devid = (char *)msg->data2;
> +    QIOChannel *dioc = NULL;
> +    DeviceState *dev = NULL;
> +    MPQemuMsg ret = { 0 };
> +    int rc = 0;
> +
> +    g_assert(devid && (devid[msg->size - 1] == '\0'));

Asserts are not suitable for external input validation since a failure
aborts the program and lets the client cause a denial-of-service. When
there are multiple clients, one misbehaved client shouldn't be able to
kill the server. Please validate devid using an if statement and set
errp on failure.

Can msg->size be 0? If yes, this code accesses before the beginning of
the buffer.

> +
> +    dev = qdev_find_recursive(sysbus_get_default(), devid);
> +    if (!dev || !object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
> +        rc = 0xff;
> +        goto exit;
> +    }
> +
> +    dioc = qio_channel_new_fd(msg->fds[0], errp);

Missing error handling if qio_channel_new_fd() fails. We need to
close(msg->fds[0]) ourselves in this case.

> +
> +    qio_channel_add_watch(dioc, G_IO_IN | G_IO_HUP, mpqemu_process_msg,
> +                          (void *)dev, NULL);
> +
> +exit:
> +    ret.cmd = RET_MSG;
> +    ret.bytestream = 0;
> +    ret.data1.u64 = rc;
> +    ret.size = sizeof(ret.data1);
> +
> +    mpqemu_msg_send(&ret, com);
> +}
> diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
> index 6d62399c52..16649ed0ec 100644
> --- a/hw/pci/proxy.c
> +++ b/hw/pci/proxy.c
> @@ -15,10 +15,38 @@
>  #include "io/channel-util.h"
>  #include "hw/qdev-properties.h"
>  #include "monitor/monitor.h"
> +#include "io/mpqemu-link.h"
>  
>  static void proxy_set_socket(PCIProxyDev *pdev, int fd, Error **errp)
>  {
> +    DeviceState *dev = DEVICE(pdev);
> +    MPQemuMsg msg = { 0 };
> +    int fds[2];
> +    Error *local_err = NULL;
> +
>      pdev->com = qio_channel_new_fd(fd, errp);
> +
> +    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds)) {
> +        error_setg(errp, "Failed to create proxy channel with fd %d", fd);
> +        return;

pdev->com needs to be cleaned up.

> diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
> index 5887c8c6c0..54df3b254e 100644
> --- a/io/mpqemu-link.c
> +++ b/io/mpqemu-link.c
> @@ -234,6 +234,14 @@ bool mpqemu_msg_valid(MPQemuMsg *msg)
>              return false;
>          }
>          break;
> +    case CONNECT_DEV:
> +        if ((msg->num_fds != 1) ||
> +            (msg->fds[0] == -1) ||
> +            (msg->fds[0] == -1) ||

This line is duplicated.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 13/21] multi-process: Forward PCI config space acceses to the remote process
  2020-06-27 17:09 ` [PATCH v7 13/21] multi-process: Forward PCI config space acceses to " elena.ufimtseva
@ 2020-07-01  9:40   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-01  9:40 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 1823 bytes --]

On Sat, Jun 27, 2020 at 10:09:35AM -0700, elena.ufimtseva@oracle.com wrote:
> @@ -42,6 +48,12 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
>      case CONNECT_DEV:
>          process_connect_dev_msg(&msg, ioc, &local_err);
>          break;
> +    case PCI_CONFIG_WRITE:
> +        process_config_write(ioc, pci_dev, &msg);
> +        break;
> +    case PCI_CONFIG_READ:
> +        process_config_read(ioc, pci_dev, &msg);
> +        break;

pci_dev is NULL when mpqemu_process_msg() is called on the main socket.
This is an example of how the N:1 model complicates things.  Now
process_config_read/write() need to check that pci_dev is non-NULL to
avoid crashing.

>      default:
>          error_setg(&local_err, "Unknown command (%d) received from proxy \
>                     in remote process pid=%d", msg.cmd, getpid());
> @@ -89,3 +101,45 @@ exit:
>  
>      mpqemu_msg_send(&ret, com);
>  }
> +
> +static void process_config_write(QIOChannel *ioc, PCIDevice *dev,
> +                                 MPQemuMsg *msg)
> +{
> +    struct conf_data_msg *conf = (struct conf_data_msg *)msg->data2;
> +    MPQemuMsg ret = { 0 };
> +
> +    if (conf->addr >= PCI_CFG_SPACE_EXP_SIZE) {

This check treats all devices as PCIe devices. Traditional PCI devices
have a smaller config space and pci_default_write_config() has an
assertion that fails on out-of-bounds writes:

  assert(addr + l <= pci_config_size(d));

Are you sure all devices are PCIe? If yes, please enforce that in the
code. If no, then please fix the size check.

> +struct conf_data_msg {
> +    uint32_t addr;
> +    uint32_t val;
> +    int l;
> +};

QEMU coding style uses typedefs:

  typedef struct {
      uint32_t addr;
      uint32_t val;
      int l;
  } ConfDataMsg;

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 14/21] multi-process: PCI BAR read/write handling for proxy & remote endpoints
  2020-06-27 17:09 ` [PATCH v7 14/21] multi-process: PCI BAR read/write handling for proxy & remote endpoints elena.ufimtseva
@ 2020-07-01 10:41   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-01 10:41 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 2566 bytes --]

On Sat, Jun 27, 2020 at 10:09:36AM -0700, elena.ufimtseva@oracle.com wrote:
> @@ -54,6 +57,12 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
>      case PCI_CONFIG_READ:
>          process_config_read(ioc, pci_dev, &msg);
>          break;
> +    case BAR_WRITE:
> +        process_bar_write(ioc, &msg, &local_err);
> +        break;
> +    case BAR_READ:
> +        process_bar_read(ioc, &msg, &local_err);
> +        break;

These functions are more than BAR read/write functions, they are general
address space read/write functions. This is could be a security problem
in the future because the client has access to more than just the PCI
device they are supposed to communicate with.

I don't have a strong opinion against leaving it as-is, but wanted to
mention it because the current approach is risky as a long-term
solution. The protocol message could have a uint8_t bar_index field or
the remote device could validate that addr falls within the device BARs.

>      default:
>          error_setg(&local_err, "Unknown command (%d) received from proxy \
>                     in remote process pid=%d", msg.cmd, getpid());
> @@ -143,3 +152,89 @@ static void process_config_read(QIOChannel *ioc, PCIDevice *dev,
>  
>      mpqemu_msg_send(&ret, ioc);
>  }
> +
> +static void process_bar_write(QIOChannel *ioc, MPQemuMsg *msg, Error **errp)
> +{
> +    BarAccessMsg *bar_access = &msg->data1.bar_access;
> +    AddressSpace *as =
> +        bar_access->memory ? &address_space_memory : &address_space_io;
> +    MPQemuMsg ret = { 0 };
> +    MemTxResult res;
> +
> +    if (!is_power_of_2(bar_access->size) ||
> +       (bar_access->size > sizeof(uint64_t))) {
> +        ret.data1.u64 = UINT64_MAX;
> +        goto fail;
> +    }
> +
> +    res = address_space_rw(as, bar_access->addr, MEMTXATTRS_UNSPECIFIED,
> +                           (void *)&bar_access->val, bar_access->size,
> +                           true);

(void *)&bar_access->val doesn't work on big-endian hosts. A uint64_t ->
{uint32_t, uint16_t, uint8_t} conversion must be performed before the
address_space_rw() call analogous to what process_bar_read() does.

Although it's unlikely that this will be run on big-endian hosts anytime
soon, it's good practice to keep the code portable when possible.

> +    case BAR_WRITE:
> +    case BAR_READ:
> +        if (msg->size != sizeof(msg->data1)) {
> +            return false;
> +        }

Is there a check that the number of passed fds is zero somewhere?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 15/21] multi-process: Synchronize remote memory
  2020-06-27 17:09 ` [PATCH v7 15/21] multi-process: Synchronize remote memory elena.ufimtseva
@ 2020-07-01 10:55   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-01 10:55 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 2617 bytes --]

On Sat, Jun 27, 2020 at 10:09:37AM -0700, elena.ufimtseva@oracle.com wrote:
> +static bool try_merge(RemoteMemSync *sync, MemoryRegionSection *section)
> +{
> +    uint64_t mrs_size, mrs_gpa, mrs_page;
> +    MemoryRegionSection *prev_sec;
> +    bool merged = false;
> +    uintptr_t mrs_host;
> +    RAMBlock *mrs_rb;
> +
> +    if (!sync->n_mr_sections) {
> +        return false;
> +    }
> +
> +    mrs_rb = section->mr->ram_block;
> +    mrs_page = (uint64_t)qemu_ram_pagesize(mrs_rb);
> +    mrs_size = int128_get64(section->size);
> +    mrs_gpa = section->offset_within_address_space;
> +    mrs_host = (uintptr_t)memory_region_get_ram_ptr(section->mr) +
> +               section->offset_within_region;
> +
> +    if (get_fd_from_hostaddr(mrs_host, NULL) < 0) {
> +        return true;
> +    }
> +
> +    mrs_host = mrs_host & ~(mrs_page - 1);
> +    mrs_gpa = mrs_gpa & ~(mrs_page - 1);
> +    mrs_size = ROUND_UP(mrs_size, mrs_page);
> +
> +    if (sync->n_mr_sections) {

The check is unnecessary because of the if statement above:

  if (!sync->n_mr_sections) {
     return false;
  }

> +static void proxy_ml_commit(MemoryListener *listener)
> +{
> +    RemoteMemSync *sync = container_of(listener, RemoteMemSync, listener);
> +    MPQemuMsg msg;
> +    MemoryRegionSection section;
> +    ram_addr_t offset;
> +    uintptr_t host_addr;
> +    int region;
> +
> +    memset(&msg, 0, sizeof(MPQemuMsg));
> +
> +    msg.cmd = SYNC_SYSMEM;
> +    msg.bytestream = 0;
> +    msg.num_fds = sync->n_mr_sections;
> +    msg.size = sizeof(msg.data1);
> +    assert(msg.num_fds <= REMOTE_MAX_FDS);
> +
> +    for (region = 0; region < sync->n_mr_sections; region++) {
> +        section = sync->mr_sections[region];

Why is section copied here? It's conventional to take a pointer to a
struct instead of copying its value.

> +        msg.data1.sync_sysmem.gpas[region] =
> +            section.offset_within_address_space;
> +        msg.data1.sync_sysmem.sizes[region] = int128_get64(section.size);
> +        host_addr = (uintptr_t)memory_region_get_ram_ptr(section.mr) +
> +                    section.offset_within_region;
> +        msg.fds[region] = get_fd_from_hostaddr(host_addr, &offset);
> +        msg.data1.sync_sysmem.offsets[region] = offset;
> +    }
> +    mpqemu_msg_send(&msg, sync->ioc);
> +}
> +
> +void deconfigure_memory_sync(RemoteMemSync *sync)
> +{

Missing proxy_ml_begin() call to free sync->mr_sections[] and unref
sections.

> +    memory_listener_unregister(&sync->listener);
> +}

This function isn't called. It's good practice to implement the object
lifecycle from the start.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 16/21] multi-process: create IOHUB object to handle irq
  2020-06-27 17:09 ` [PATCH v7 16/21] multi-process: create IOHUB object to handle irq elena.ufimtseva
@ 2020-07-02 12:09   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-02 12:09 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 19919 bytes --]

On Sat, Jun 27, 2020 at 10:09:38AM -0700, elena.ufimtseva@oracle.com wrote:
> From: Jagannathan Raman <jag.raman@oracle.com>
> 

CCing Alex Williamson in case he has time to think about how PCI INTx
interrupts should work in multi-process QEMU where the device emulation
runs as a separate process. I don't know enough about interrupts to give
this a full review.

> IOHUB object is added to manage PCI IRQs. It uses KVM_IRQFD
> ioctl to create irqfd to injecting PCI interrupts to the guest.
> IOHUB object forwards the irqfd to the remote process. Remote process
> uses this fd to directly send interrupts to the guest, bypassing QEMU.

Each PCIDevice has exactly one legacy interrupt pin. This raises two
questions:

1. Why perform interrupt mapping in the remote device program? The IOHUB
   object really only needs a PCIDevice -> irqfd lookup so it can set
   the correct EventNotifier when the device raises an interrupt.

   The mapping scheme in this patch can suffer from collisions when
   multiple devices are attached to an IOHUB. On a physical machine that
   results in shared legacy interrupts, but there is no need to recreate
   that in the remote device program since it's a guest issue, not a
   device issue. When a collision occurs this patch doesn't seem to
   share the irqfd for both devices. It overwrites the irqfd so the
   older device stops getting interrupts!

   Unless I'm missing something I suggest dropping the N:8
   PCIDevice:pirq interrupt mapping and using a 1:1 PCIDevice:irqfd
   mapping.

2. Why is the intx pin number sent as part of the SetIrqFdMsg? The intx
   pin does not matter - each PCIDevice still just exactly one legacy
   interrupt pin, regardless of whether it is A, B, C, or D.

> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> ---
>  MAINTAINERS               |   2 +
>  hw/Makefile.objs          |   1 +
>  hw/i386/remote-msg.c      |   4 +
>  hw/i386/remote.c          |  15 ++++
>  hw/pci/proxy.c            |  52 +++++++++++++
>  hw/remote/Makefile.objs   |   1 +
>  hw/remote/iohub.c         | 153 ++++++++++++++++++++++++++++++++++++++
>  include/hw/i386/remote.h  |   2 +
>  include/hw/pci/pci_ids.h  |   3 +
>  include/hw/pci/proxy.h    |   8 ++
>  include/hw/remote/iohub.h |  50 +++++++++++++
>  include/io/mpqemu-link.h  |   6 ++
>  io/mpqemu-link.c          |   1 +
>  13 files changed, 298 insertions(+)
>  create mode 100644 hw/remote/Makefile.objs
>  create mode 100644 hw/remote/iohub.c
>  create mode 100644 include/hw/remote/iohub.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 38d605445e..f9ede7e094 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2956,6 +2956,8 @@ F: hw/pci/proxy.c
>  F: include/hw/pci/proxy.h
>  F: hw/pci/memory-sync.c
>  F: include/hw/pci/memory-sync.h
> +F: hw/remote/iohub.c
> +F: include/hw/remote/iohub.h
>  
>  Build and test automation
>  -------------------------
> diff --git a/hw/Makefile.objs b/hw/Makefile.objs
> index 4cbe5e4e57..8caf659de0 100644
> --- a/hw/Makefile.objs
> +++ b/hw/Makefile.objs
> @@ -40,6 +40,7 @@ devices-dirs-$(CONFIG_MEM_DEVICE) += mem/
>  devices-dirs-$(CONFIG_NUBUS) += nubus/
>  devices-dirs-y += semihosting/
>  devices-dirs-y += smbios/
> +devices-dirs-y += remote/
>  endif
>  
>  common-obj-y += $(devices-dirs-y)
> diff --git a/hw/i386/remote-msg.c b/hw/i386/remote-msg.c
> index 48b153eaae..67fee4bb57 100644
> --- a/hw/i386/remote-msg.c
> +++ b/hw/i386/remote-msg.c
> @@ -10,6 +10,7 @@
>  #include "hw/pci/pci.h"
>  #include "exec/memattrs.h"
>  #include "hw/i386/remote-memory.h"
> +#include "hw/remote/iohub.h"
>  
>  static void process_connect_dev_msg(MPQemuMsg *msg, QIOChannel *com,
>                                      Error **errp);
> @@ -67,6 +68,9 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
>      case SYNC_SYSMEM:
>          remote_sysmem_reconfig(&msg, &local_err);
>          break;
> +    case SET_IRQFD:
> +        process_set_irqfd_msg(pci_dev, &msg);
> +        break;
>      default:
>          error_setg(&local_err, "Unknown command (%d) received from proxy \
>                     in remote process pid=%d", msg.cmd, getpid());
> diff --git a/hw/i386/remote.c b/hw/i386/remote.c
> index 5342e884ad..8e74a6f1af 100644
> --- a/hw/i386/remote.c
> +++ b/hw/i386/remote.c
> @@ -17,12 +17,16 @@
>  #include "qapi/error.h"
>  #include "io/channel-util.h"
>  #include "io/channel.h"
> +#include "hw/pci/pci_host.h"
> +#include "hw/remote/iohub.h"
>  
>  static void remote_machine_init(MachineState *machine)
>  {
>      MemoryRegion *system_memory, *system_io, *pci_memory;
>      RemMachineState *s = REMOTE_MACHINE(machine);
>      RemotePCIHost *rem_host;
> +    PCIHostState *pci_host;
> +    PCIDevice *pci_dev;
>  
>      system_memory = get_system_memory();
>      system_io = get_system_io();
> @@ -42,6 +46,17 @@ static void remote_machine_init(MachineState *machine)
>      memory_region_add_subregion_overlap(system_memory, 0x0, pci_memory, -1);
>  
>      qdev_realize(DEVICE(rem_host), sysbus_get_default(), &error_fatal);
> +
> +    pci_host = PCI_HOST_BRIDGE(rem_host);
> +    pci_dev = pci_create_simple_multifunction(pci_host->bus,
> +                                              PCI_DEVFN(REMOTE_IOHUB_DEV,
> +                                                        REMOTE_IOHUB_FUNC),
> +                                              true, TYPE_REMOTE_IOHUB_DEVICE);
> +
> +    s->iohub = REMOTE_IOHUB_DEVICE(pci_dev);
> +
> +    pci_bus_irqs(pci_host->bus, remote_iohub_set_irq, remote_iohub_map_irq,
> +                 s->iohub, REMOTE_IOHUB_NB_PIRQS);
>  }
>  
>  static void remote_set_socket(Object *obj, const char *str, Error **errp)
> diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
> index 5ecbdd2dcf..9d8559b6d4 100644
> --- a/hw/pci/proxy.c
> +++ b/hw/pci/proxy.c
> @@ -19,6 +19,9 @@
>  #include "qemu/error-report.h"
>  #include "hw/pci/memory-sync.h"
>  #include "qom/object.h"
> +#include "qemu/event_notifier.h"
> +#include "sysemu/kvm.h"
> +#include "util/event_notifier-posix.c"
>  
>  static void proxy_set_socket(PCIProxyDev *pdev, int fd, Error **errp)
>  {
> @@ -57,6 +60,53 @@ static Property proxy_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +static void proxy_intx_update(PCIDevice *pci_dev)
> +{
> +    PCIProxyDev *dev = PCI_PROXY_DEV(pci_dev);
> +    PCIINTxRoute route;
> +    int pin = pci_get_byte(pci_dev->config + PCI_INTERRUPT_PIN) - 1;
> +
> +    if (dev->irqfd.fd) {

0 is a valid file descriptor number. In practice fd 0 is standard input
so this code won't encounter it, but for clarity please stick to the
commonly-used -1 constant for invalid file descriptors.

> +        dev->irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
> +        (void) kvm_vm_ioctl(kvm_state, KVM_IRQFD, &dev->irqfd);
> +        memset(&dev->irqfd, 0, sizeof(struct kvm_irqfd));
> +    }
> +
> +    route = pci_device_route_intx_to_irq(pci_dev, pin);
> +
> +    dev->irqfd.fd = event_notifier_get_fd(&dev->intr);
> +    dev->irqfd.resamplefd = event_notifier_get_fd(&dev->resample);
> +    dev->irqfd.gsi = route.irq;
> +    dev->irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> +    (void) kvm_vm_ioctl(kvm_state, KVM_IRQFD, &dev->irqfd);
> +}

Can this code use kvm_irqchip_add/remove_irqfd_notifier_gsi() instead of
duplicating it?

> +
> +static void setup_irqfd(PCIProxyDev *dev)
> +{
> +    PCIDevice *pci_dev = PCI_DEVICE(dev);
> +    MPQemuMsg msg;
> +
> +    event_notifier_init(&dev->intr, 0);
> +    event_notifier_init(&dev->resample, 0);

Missing .unrealize() function that does cleanup.

> +
> +    memset(&msg, 0, sizeof(MPQemuMsg));
> +    msg.cmd = SET_IRQFD;
> +    msg.num_fds = 2;
> +    msg.fds[0] = event_notifier_get_fd(&dev->intr);
> +    msg.fds[1] = event_notifier_get_fd(&dev->resample);
> +    msg.data1.set_irqfd.intx =
> +        pci_get_byte(pci_dev->config + PCI_INTERRUPT_PIN) - 1;
> +    msg.size = sizeof(msg.data1);
> +
> +    mpqemu_msg_send(&msg, dev->dev);
> +
> +    memset(&dev->irqfd, 0, sizeof(struct kvm_irqfd));
> +
> +    proxy_intx_update(pci_dev);
> +
> +    pci_device_set_intx_routing_notifier(pci_dev, proxy_intx_update);
> +}
> +
>  static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
>  {
>      PCIProxyDev *dev = PCI_PROXY_DEV(device);
> @@ -72,6 +122,8 @@ static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
>      }
>  
>      configure_memory_sync(&dev->sync, dev->com);
> +
> +    setup_irqfd(dev);
>  }
>  
>  static int config_op_send(PCIProxyDev *pdev, uint32_t addr, uint32_t *val,
> diff --git a/hw/remote/Makefile.objs b/hw/remote/Makefile.objs
> new file mode 100644
> index 0000000000..635ce5e0ab
> --- /dev/null
> +++ b/hw/remote/Makefile.objs
> @@ -0,0 +1 @@
> +common-obj-$(CONFIG_MPQEMU) += iohub.o
> diff --git a/hw/remote/iohub.c b/hw/remote/iohub.c
> new file mode 100644
> index 0000000000..9c6cfaecd4
> --- /dev/null
> +++ b/hw/remote/iohub.c
> @@ -0,0 +1,153 @@
> +/*
> + * Remote IO Hub
> + *
> + * Copyright © 2018, 2020 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +
> +#include "hw/pci/pci.h"
> +#include "hw/pci/pci_ids.h"
> +#include "hw/pci/pci_bus.h"
> +#include "hw/remote/iohub.h"
> +#include "qemu/thread.h"
> +#include "hw/boards.h"
> +#include "hw/i386/remote.h"
> +#include "qemu/main-loop.h"
> +
> +static void remote_iohub_initfn(Object *obj)
> +{
> +    RemoteIOHubState *iohub = REMOTE_IOHUB_DEVICE(obj);
> +    int slot, intx, pirq;
> +
> +    memset(&iohub->irqfds, 0, sizeof(iohub->irqfds));
> +    memset(&iohub->resamplefds, 0, sizeof(iohub->resamplefds));
> +
> +    for (slot = 0; slot < PCI_SLOT_MAX; slot++) {
> +        for (intx = 0; intx < PCI_NUM_PINS; intx++) {
> +            iohub->irq_num[slot][intx] = (slot + intx) % 4 + 4;
> +        }
> +    }
> +
> +    for (pirq = 0; pirq < REMOTE_IOHUB_NB_PIRQS; pirq++) {
> +        qemu_mutex_init(&iohub->irq_level_lock[pirq]);
> +        iohub->irq_level[pirq] = 0;
> +        event_notifier_init_fd(&iohub->irqfds[pirq], -1);
> +        event_notifier_init_fd(&iohub->resamplefds[pirq], -1);
> +    }
> +}
> +
> +static void remote_iohub_class_init(ObjectClass *klass, void *data)
> +{
> +    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
> +    k->vendor_id = PCI_VENDOR_ID_ORACLE;
> +    k->device_id = PCI_DEVICE_ID_REMOTE_IOHUB;
> +}
> +
> +static const TypeInfo remote_iohub_info = {
> +    .name       = TYPE_REMOTE_IOHUB_DEVICE,
> +    .parent     = TYPE_PCI_DEVICE,
> +    .instance_size = sizeof(RemoteIOHubState),
> +    .instance_init = remote_iohub_initfn,
> +    .class_init  = remote_iohub_class_init,
> +    .interfaces = (InterfaceInfo[]) {
> +        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
> +        { }
> +    }
> +};
> +
> +static void remote_iohub_register(void)
> +{
> +    type_register_static(&remote_iohub_info);
> +}
> +
> +type_init(remote_iohub_register);

Why is this a PCI device? It has no BARs, configuration space,
interrupts, DMA transfers, etc. It's simpler to make this a plain old
struct + functions that are used by the remote machine type.

> +
> +int remote_iohub_map_irq(PCIDevice *pci_dev, int intx)
> +{
> +    BusState *bus = qdev_get_parent_bus(&pci_dev->qdev);
> +    PCIBus *pci_bus = PCI_BUS(bus);
> +    PCIDevice *pci_iohub =
> +        pci_bus->devices[PCI_DEVFN(REMOTE_IOHUB_DEV, REMOTE_IOHUB_FUNC)];
> +    RemoteIOHubState *iohub = REMOTE_IOHUB_DEVICE(pci_iohub);
> +
> +    return iohub->irq_num[PCI_SLOT(pci_dev->devfn)][intx];
> +}

This function can be rewritten like this:

  int remote_iohub_map_irq(PCIDevice *pci_dev, int intx)
  {
      return (PCI_SLOT(pci_dev->devfn) + intx) % 4 + 4;
  }

There is no need to access RemoteIOHubState and iohub->irq_num[] can be
dropped.

> +
> +/*
> + * TODO: Using lock to set the interrupt level could become a
> + *       performance bottleneck. Check if atomic arithmetic
> + *       is possible.
> + */
> +void remote_iohub_set_irq(void *opaque, int pirq, int level)
> +{
> +    RemoteIOHubState *iohub = opaque;
> +
> +    assert(pirq >= 0);
> +    assert(pirq < REMOTE_IOHUB_NB_PIRQS);
> +
> +    qemu_mutex_lock(&iohub->irq_level_lock[pirq]);
> +
> +    if (level) {
> +        if (++iohub->irq_level[pirq] == 1) {
> +            event_notifier_set(&iohub->irqfds[pirq]);
> +        }
> +    } else if (iohub->irq_level[pirq] > 0) {
> +        iohub->irq_level[pirq]--;
> +    }
> +
> +    qemu_mutex_unlock(&iohub->irq_level_lock[pirq]);
> +}
> +
> +static void intr_resample_handler(void *opaque)
> +{
> +    ResampleToken *token = opaque;
> +    RemoteIOHubState *iohub = token->iohub;
> +    int pirq, s;
> +
> +    pirq = token->pirq;
> +
> +    s = event_notifier_test_and_clear(&iohub->resamplefds[pirq]);
> +
> +    assert(s >= 0);
> +
> +    qemu_mutex_lock(&iohub->irq_level_lock[pirq]);
> +
> +    if (iohub->irq_level[pirq]) {
> +        event_notifier_set(&iohub->irqfds[pirq]);
> +    }
> +
> +    qemu_mutex_unlock(&iohub->irq_level_lock[pirq]);
> +}
> +
> +void process_set_irqfd_msg(PCIDevice *pci_dev, MPQemuMsg *msg)
> +{
> +    RemMachineState *machine = REMOTE_MACHINE(current_machine);
> +    RemoteIOHubState *iohub = machine->iohub;
> +    int pirq;
> +
> +    g_assert(msg->data1.set_irqfd.intx < 4);
> +    g_assert(msg->num_fds == 2);

Please handle errors instead of using asserts that allow clients to
terminate the program by sending invalid values.

> +
> +    pirq = remote_iohub_map_irq(pci_dev, msg->data1.set_irqfd.intx);
> +
> +    if (event_notifier_get_fd(&iohub->irqfds[pirq]) != -1) {
> +        event_notifier_cleanup(&iohub->irqfds[pirq]);
> +        event_notifier_cleanup(&iohub->resamplefds[pirq]);
> +        memset(&iohub->token[pirq], 0, sizeof(ResampleToken));
> +    }

Missing qemu_set_fd_handler(event_notifier_get_fd(&resamplefds[pirq]), NULL, NULL, NULL);

> +
> +    event_notifier_init_fd(&iohub->irqfds[pirq], msg->fds[0]);
> +    event_notifier_init_fd(&iohub->resamplefds[pirq], msg->fds[1]);
> +
> +    iohub->token[pirq].iohub = iohub;
> +    iohub->token[pirq].pirq = pirq;
> +
> +    qemu_set_fd_handler(msg->fds[1], intr_resample_handler, NULL,
> +                        &iohub->token[pirq]);
> +}

Missing event_notifier_cleanup() and qemu_set_fd_handler() calls when
the mplink connection closes. The connection lifecycle isn't implemented
and will leak resources.

> diff --git a/include/hw/i386/remote.h b/include/hw/i386/remote.h
> index c3890e57ab..6af423ab54 100644
> --- a/include/hw/i386/remote.h
> +++ b/include/hw/i386/remote.h
> @@ -18,12 +18,14 @@
>  #include "hw/boards.h"
>  #include "hw/pci-host/remote.h"
>  #include "io/channel.h"
> +#include "hw/remote/iohub.h"
>  
>  typedef struct RemMachineState {
>      MachineState parent_obj;
>  
>      RemotePCIHost *host;
>      QIOChannel *ioc;
> +    RemoteIOHubState *iohub;
>  } RemMachineState;
>  
>  #define TYPE_REMOTE_MACHINE "remote-machine"
> diff --git a/include/hw/pci/pci_ids.h b/include/hw/pci/pci_ids.h
> index 11f8ab7149..bd0c17dc78 100644
> --- a/include/hw/pci/pci_ids.h
> +++ b/include/hw/pci/pci_ids.h
> @@ -192,6 +192,9 @@
>  #define PCI_DEVICE_ID_SUN_SIMBA          0x5000
>  #define PCI_DEVICE_ID_SUN_SABRE          0xa000
>  
> +#define PCI_VENDOR_ID_ORACLE             0x108e
> +#define PCI_DEVICE_ID_REMOTE_IOHUB       0xb000
> +
>  #define PCI_VENDOR_ID_CMD                0x1095
>  #define PCI_DEVICE_ID_CMD_646            0x0646
>  
> diff --git a/include/hw/pci/proxy.h b/include/hw/pci/proxy.h
> index a41a6aeaa5..e6f076ae95 100644
> --- a/include/hw/pci/proxy.h
> +++ b/include/hw/pci/proxy.h
> @@ -12,9 +12,12 @@
>  #include "qemu/osdep.h"
>  #include "qemu-common.h"
>  
> +#include <linux/kvm.h>
> +
>  #include "hw/pci/pci.h"
>  #include "io/channel.h"
>  #include "hw/pci/memory-sync.h"
> +#include "qemu/event_notifier.h"
>  
>  #define TYPE_PCI_PROXY_DEV "pci-proxy-dev"
>  
> @@ -45,6 +48,11 @@ struct PCIProxyDev {
>  
>      RemoteMemSync sync;
>  
> +    struct kvm_irqfd irqfd;
> +
> +    EventNotifier intr;
> +    EventNotifier resample;
> +
>      ProxyMemoryRegion region[PCI_NUM_REGIONS];
>  };
>  
> diff --git a/include/hw/remote/iohub.h b/include/hw/remote/iohub.h
> new file mode 100644
> index 0000000000..9aacf3e04c
> --- /dev/null
> +++ b/include/hw/remote/iohub.h
> @@ -0,0 +1,50 @@
> +/*
> + * IO Hub for remote device
> + *
> + * Copyright © 2018, 2020 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef REMOTE_IOHUB_H
> +#define REMOTE_IOHUB_H
> +
> +#include "qemu/osdep.h"

CODING_STYLE.rst says:

  Do not include "qemu/osdep.h" from header files since the .c file will
  have already included it.

> +#include "qemu-common.h"
> +
> +#include "hw/pci/pci.h"
> +#include "qemu/event_notifier.h"
> +#include "qemu/thread-posix.h"
> +#include "io/mpqemu-link.h"
> +
> +#define REMOTE_IOHUB_NB_PIRQS    8
> +
> +#define REMOTE_IOHUB_DEV         31
> +#define REMOTE_IOHUB_FUNC        0
> +
> +#define TYPE_REMOTE_IOHUB_DEVICE "remote-iohub"
> +#define REMOTE_IOHUB_DEVICE(obj) \
> +    OBJECT_CHECK(RemoteIOHubState, (obj), TYPE_REMOTE_IOHUB_DEVICE)
> +
> +typedef struct ResampleToken {
> +    void *iohub;
> +    int pirq;
> +} ResampleToken;
> +
> +typedef struct RemoteIOHubState {
> +    PCIDevice d;
> +    uint8_t irq_num[PCI_SLOT_MAX][PCI_NUM_PINS];
> +    EventNotifier irqfds[REMOTE_IOHUB_NB_PIRQS];
> +    EventNotifier resamplefds[REMOTE_IOHUB_NB_PIRQS];
> +    unsigned int irq_level[REMOTE_IOHUB_NB_PIRQS];
> +    ResampleToken token[REMOTE_IOHUB_NB_PIRQS];
> +    QemuMutex irq_level_lock[REMOTE_IOHUB_NB_PIRQS];
> +} RemoteIOHubState;
> +
> +int remote_iohub_map_irq(PCIDevice *pci_dev, int intx);
> +void remote_iohub_set_irq(void *opaque, int pirq, int level);
> +void process_set_irqfd_msg(PCIDevice *pci_dev, MPQemuMsg *msg);
> +
> +#endif
> diff --git a/include/io/mpqemu-link.h b/include/io/mpqemu-link.h
> index 0422213863..a563b557ce 100644
> --- a/include/io/mpqemu-link.h
> +++ b/include/io/mpqemu-link.h
> @@ -42,6 +42,7 @@ typedef enum {
>      PCI_CONFIG_READ,
>      BAR_WRITE,
>      BAR_READ,
> +    SET_IRQFD,
>      MAX = INT_MAX,
>  } MPQemuCmd;
>  
> @@ -64,6 +65,10 @@ typedef struct {
>      bool memory;
>  } BarAccessMsg;
>  
> +typedef struct {
> +    int intx;
> +} SetIrqFdMsg;
> +
>  /**
>   * Maximum size of data2 field in the message to be transmitted.
>   */
> @@ -92,6 +97,7 @@ typedef struct {
>          uint64_t u64;
>          SyncSysmemMsg sync_sysmem;
>          BarAccessMsg bar_access;
> +        SetIrqFdMsg set_irqfd;
>      } data1;
>  
>      int fds[REMOTE_MAX_FDS];
> diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
> index 026e25dca4..561ac0576f 100644
> --- a/io/mpqemu-link.c
> +++ b/io/mpqemu-link.c
> @@ -258,6 +258,7 @@ bool mpqemu_msg_valid(MPQemuMsg *msg)
>          break;
>      case BAR_WRITE:
>      case BAR_READ:
> +    case SET_IRQFD:
>          if (msg->size != sizeof(msg->data1)) {
>              return false;
>          }
> -- 
> 2.25.GIT
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 17/21] multi-process: Retrieve PCI info from remote process
  2020-06-27 17:09 ` [PATCH v7 17/21] multi-process: Retrieve PCI info from remote process elena.ufimtseva
@ 2020-07-02 12:59   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-02 12:59 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]

On Sat, Jun 27, 2020 at 10:09:39AM -0700, elena.ufimtseva@oracle.com wrote:
> @@ -267,3 +275,84 @@ const MemoryRegionOps proxy_mr_ops = {
>          .max_access_size = 1,
>      },
>  };
> +
> +static void probe_pci_info(PCIDevice *dev)
> +{
> +    PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
> +    DeviceClass *dc = DEVICE_CLASS(pc);
> +    PCIProxyDev *pdev = PCI_PROXY_DEV(dev);
> +    MPQemuMsg msg = { 0 }, ret =  { 0 };
> +    uint32_t orig_val, new_val, class;
> +    uint8_t type;
> +    int i, size;
> +    char *name;
> +    Error *local_err = NULL;
> +
> +    msg.bytestream = 0;
> +    msg.size = 0;
> +    msg.cmd = GET_PCI_INFO;
> +
> +    ret.data1.u64  = mpqemu_msg_send_reply_co(&msg, pdev->dev, &local_err);
> +    if (local_err) {
> +        error_report("Error while sending GET_PCI_INFO message");
> +    }
> +
> +    pc->vendor_id = ret.data1.ret_pci_info.vendor_id;
> +    pc->device_id = ret.data1.ret_pci_info.device_id;
> +    pc->class_id = ret.data1.ret_pci_info.class_id;
> +    pc->subsystem_id = ret.data1.ret_pci_info.subsystem_id;

Why are the vendor/device/class/subsystem IDs read using a special
GET_PCI_INFO message when they are already accessible from the PCI
Configuration Space? All the other values are being read from the config
space so I wonder why a special message is needed.

> +
> +    config_op_send(pdev, PCI_CLASS_DEVICE, &class, 1, PCI_CONFIG_READ);
> +    switch (class) {
> +    case PCI_BASE_CLASS_BRIDGE:
> +        set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
> +        break;
> +    case PCI_BASE_CLASS_STORAGE:
> +        set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
> +        break;
> +    case PCI_BASE_CLASS_NETWORK:
> +        set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
> +        break;
> +    case PCI_BASE_CLASS_INPUT:
> +        set_bit(DEVICE_CATEGORY_INPUT, dc->categories);
> +        break;
> +    case PCI_BASE_CLASS_DISPLAY:
> +        set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
> +        break;
> +    case PCI_BASE_CLASS_PROCESSOR:
> +        set_bit(DEVICE_CATEGORY_CPU, dc->categories);
> +        break;
> +    default:
> +        set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +        break;
> +    }
> +
> +    for (i = 0; i < 6; i++) {

s/6/G_N_ELEMENTS(pdev->region)/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 18/21] multi-process: heartbeat messages to remote
  2020-06-27 17:09 ` [PATCH v7 18/21] multi-process: heartbeat messages to remote elena.ufimtseva
@ 2020-07-02 13:16   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-02 13:16 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 2202 bytes --]

On Sat, Jun 27, 2020 at 10:09:40AM -0700, elena.ufimtseva@oracle.com wrote:
> From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> 
> In order to detect remote processes which are hung, the
> proxy periodically sends heartbeat messages to confirm if
> the remote process is alive. The remote process responds
> to this heartbeat message to confirm it is alive.
> 
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> ---
>  hw/i386/remote-msg.c     | 14 ++++++++++
>  hw/pci/proxy.c           | 58 ++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci/proxy.h   |  2 ++
>  include/io/mpqemu-link.h |  1 +
>  io/mpqemu-link.c         |  1 +
>  5 files changed, 76 insertions(+)
> 

This patch seems incomplete since no action is taken when the device
fails to respond. vCPU threads that access the device will still get
stuck.

The simplest way to make this useful is to close the connection when a
timeout occurs. Then the G_IO_HUP handler for the UNIX domain socket
should perform connection cleanup. At that point there are a few
choices:

1. Stop guest execution and wait for the host admin to restore the
   mplink so execution can resume. This is similar to how -drive
   rerror=stop pauses the guest when a disk I/O error is encountered.

2. Stop guest execution but defer it until this stale device is actually
   accessed. This maximizes guest uptime. Guests that rarely access the
   device may not notice at all.

3. Return 0 from MemoryRegion read operations and ignore writes. The
   guest continues executing but the device is broken. This is risky
   because device drivers inside the guest may not be ready to deal with
   this. The result could be data loss or corruption.

4. Raise a bus-level event. Maybe PCI error reporting can be used to
   offline the device.

5. Terminate the guest with an error message.

6. ?

Until the heartbeat is fully implemented and tested I suggest dropping
it from this patch series. Remember the G_IO_HUP will happen anyway if
the remote device process terminates.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 19/21] multi-process: perform device reset in the remote process
  2020-06-27 17:09 ` [PATCH v7 19/21] multi-process: perform device reset in the remote process elena.ufimtseva
@ 2020-07-02 13:19   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-02 13:19 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 417 bytes --]

On Sat, Jun 27, 2020 at 10:09:41AM -0700, elena.ufimtseva@oracle.com wrote:
> @@ -283,3 +288,14 @@ static void process_proxy_ping_msg(QIOChannel *ioc)
>  
>      mpqemu_msg_send(&ret, ioc);
>  }
> +
> +static void process_device_reset_msg(QIOChannel *ioc)
> +{
> +    MPQemuMsg ret = { 0 };
> +
> +    qemu_devices_reset();

Why are all devices reset globally instead of just the PCIDevice in
question?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 21/21] multi-process: add configure and usage information
  2020-06-27 17:09 ` [PATCH v7 21/21] multi-process: add configure and usage information elena.ufimtseva
@ 2020-07-02 13:26   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-02 13:26 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 445 bytes --]

On Sat, Jun 27, 2020 at 10:09:43AM -0700, elena.ufimtseva@oracle.com wrote:
> +2) Usage
> +--------
> +
> +Multi-process QEMU requires an orchestrator to launch. Please refer to a
> +light-weight python based orchestrator for mpqemu in
> +scripts/mpqemu-launcher.py to lauch QEMU in multi-process mode.
> +
> +Please refer to scripts/mpqemu-launcher-perf-mode.py for running the remote
> +process in performance mode.

What is performance mode?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 20/21] multi-process: add the concept description to docs/devel/qemu-multiprocess
  2020-06-27 17:09 ` [PATCH v7 20/21] multi-process: add the concept description to docs/devel/qemu-multiprocess elena.ufimtseva
@ 2020-07-02 13:31   ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-02 13:31 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 1769 bytes --]

On Sat, Jun 27, 2020 at 10:09:42AM -0700, elena.ufimtseva@oracle.com wrote:
> diff --git a/docs/devel/multi-process.rst b/docs/devel/multi-process.rst
> new file mode 100644
> index 0000000000..406728854c
> --- /dev/null
> +++ b/docs/devel/multi-process.rst
> @@ -0,0 +1,957 @@
> +Multi-process QEMU
> +===================
> +
> +QEMU is often used as the hypervisor for virtual machines running in the
> +Oracle cloud. Since one of the advantages of cloud computing is the
> +ability to run many VMs from different tenants in the same cloud
> +infrastructure, a guest that compromised its hypervisor could
> +potentially use the hypervisor's access privileges to access data it is
> +not authorized for.
> +
> +QEMU can be susceptible to security attacks because it is a large,
> +monolithic program that provides many features to the VMs it services.
> +Many of these features can be configured out of QEMU, but even a reduced
> +configuration QEMU has a large amount of code a guest can potentially
> +attack. Separating QEMU reduces the attack surface by aiding to
> +limit each component in the system to only access the resources that
> +it needs to perform its job.

This document does not reflect the functionality, internals, or syntax
implemented in this patch series closely. It can still be useful as
background reading for someone interested in diving into the code, but
please add a disclaimer at the top to avoid confusion:

  This is the design document for multi-process QEMU. It does not
  necessarily reflect the status of the current implementation, which
  may lack features or be considerably different from what is described
  in this document. This document is still useful as a description of
  the goals and general direction of this feature.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 00/21] Initial support for multi-process qemu
  2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
                   ` (20 preceding siblings ...)
  2020-06-27 17:09 ` [PATCH v7 21/21] multi-process: add configure and usage information elena.ufimtseva
@ 2020-07-02 13:40 ` Stefan Hajnoczi
  2020-07-09 14:16   ` Jag Raman
  21 siblings, 1 reply; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-02 13:40 UTC (permalink / raw)
  To: elena.ufimtseva
  Cc: fam, john.g.johnson, swapnil.ingle, mst, qemu-devel, kraxel,
	jag.raman, quintela, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 824 bytes --]

On Sat, Jun 27, 2020 at 10:09:22AM -0700, elena.ufimtseva@oracle.com wrote:
> From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> 
> This is the v7 of the patchset.

I have completed the review and left comments on the patches.

I'm glad it was possible to simplify this feature. The overall approach
makes sense to me and I see how it forms the base on which
VFIO-over-socket and smaller remote program builds using Kconfig can be
developed.

My main concern is that the object lifecycle has not been fully
implemented in the proxy and remote device. Error handling is
incomplete, resources are leaked, and hot unplug does not work. Thinking
through the lifecycle is very important so that additional work can
build on top of this later. I have tried to point out these issues in
the individual patches.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 00/21] Initial support for multi-process qemu
  2020-07-02 13:40 ` [PATCH v7 00/21] Initial support for multi-process qemu Stefan Hajnoczi
@ 2020-07-09 14:16   ` Jag Raman
  2020-07-13 11:21     ` Stefan Hajnoczi
  0 siblings, 1 reply; 51+ messages in thread
From: Jag Raman @ 2020-07-09 14:16 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, fam, Swapnil Ingle, John G Johnson, qemu-devel,
	kraxel, quintela, Michael S. Tsirkin, armbru, kanth.ghatraju,
	felipe, thuth, ehabkost, konrad.wilk, dgilbert, liran.alon,
	pbonzini, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, thanos.makatos



> On Jul 2, 2020, at 9:40 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Sat, Jun 27, 2020 at 10:09:22AM -0700, elena.ufimtseva@oracle.com wrote:
>> From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> 
>> This is the v7 of the patchset.
> 
> I have completed the review and left comments on the patches.
> 
> I'm glad it was possible to simplify this feature. The overall approach

Hi Stefan,

We’re also with you on this. The feature looks much simpler now.

> makes sense to me and I see how it forms the base on which
> VFIO-over-socket and smaller remote program builds using Kconfig can be
> developed.
> 
> My main concern is that the object lifecycle has not been fully
> implemented in the proxy and remote device. Error handling is

Thank you for your feedback on. FWIW, we did check about the unrealize() path
in the object lifecycle management. We noticed that the destructor for the PCI
devices (pci_qdev_unrealize()) is currently not invoking the instance specific
destructor/unrealize functions. While this is not an excuse for not implementing
the unrealize functions, it currently doesn’t have an impact on the hot unplug path.

You’re correct, we should implement the unrealize/destructor for the Proxy & remote
objects. We’ll also look into any background for why the PCI devices don’t call
instance specific destructor.

> incomplete, resources are leaked, and hot unplug does not work. Thinking

We’ll double check the resource leak issues, specifically with respect to open
file destructors.

> through the lifecycle is very important so that additional work can
> build on top of this later. I have tried to point out these issues in
> the individual patches.

We got a chance to go over your feedback. We will send out responses to them
shortly.

Thank you very much!
--
Jag

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 03/21] multi-process: setup PCI host bridge for remote device
  2020-06-30 15:17   ` Stefan Hajnoczi
@ 2020-07-09 14:23     ` Jag Raman
  0 siblings, 0 replies; 51+ messages in thread
From: Jag Raman @ 2020-07-09 14:23 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, fam, swapnil.ingle, john.g.johnson, qemu-devel,
	kraxel, quintela, Michael S. Tsirkin, armbru, kanth.ghatraju,
	felipe, thuth, ehabkost, konrad.wilk, dgilbert, liran.alon,
	pbonzini, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, thanos.makatos



> On Jun 30, 2020, at 11:17 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Sat, Jun 27, 2020 at 10:09:25AM -0700, elena.ufimtseva@oracle.com wrote:
>> diff --git a/hw/pci-host/remote.c b/hw/pci-host/remote.c
>> new file mode 100644
>> index 0000000000..5ea9af4154
>> --- /dev/null
>> +++ b/hw/pci-host/remote.c
>> @@ -0,0 +1,63 @@
>> +/*
>> + * Remote PCI host device
>> + *
>> + * Copyright © 2018, 2020 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + *
> 
> A little more detail would be nice:
> 
>  Unlike PCI host devices that model physical hardware, the purpose of
>  this PCI host is to host multi-process QEMU remote PCI devices.
> 
>  Multi-process QEMU talks to a remote PCI device that runs in a
>  separate process. In order to reuse QEMU device models in the remote
>  process we need a PCI bus that holds the devices.
> 
>  This PCI host is purely a container for PCI devices. It's fake in the
>  sense that the guest never sees this PCI host and has no way of
>  accessing it. It's job is just to provide the environment that QEMU
>  PCI device models need when running in a remote process.
> 
> I think this could be restated more clearly but hopefully it
> communicates the purpose of hw/pci-host/remote.c. :P

OK, we get the theme of the message. :)

> 
>> +typedef struct RemotePCIHost {
>> +    /*< private >*/
>> +    PCIExpressHost parent_obj;
>> +    /*< public >*/
>> +
>> +    MemoryRegion *mr_pci_mem;
>> +    MemoryRegion *mr_sys_mem;
> 
> Unused? Please add mr_sys_mem if and when it is used.

Got it. We will add it in the later patches that use these.

--
Jag

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 00/21] Initial support for multi-process qemu
  2020-07-09 14:16   ` Jag Raman
@ 2020-07-13 11:21     ` Stefan Hajnoczi
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-13 11:21 UTC (permalink / raw)
  To: Jag Raman
  Cc: Elena Ufimtseva, fam, Swapnil Ingle, John G Johnson, qemu-devel,
	kraxel, quintela, Michael S. Tsirkin, armbru, kanth.ghatraju,
	felipe, thuth, ehabkost, konrad.wilk, dgilbert, liran.alon,
	pbonzini, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, thanos.makatos

[-- Attachment #1: Type: text/plain, Size: 1337 bytes --]

On Thu, Jul 09, 2020 at 10:16:31AM -0400, Jag Raman wrote:
> > On Jul 2, 2020, at 9:40 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Sat, Jun 27, 2020 at 10:09:22AM -0700, elena.ufimtseva@oracle.com wrote:
> > makes sense to me and I see how it forms the base on which
> > VFIO-over-socket and smaller remote program builds using Kconfig can be
> > developed.
> > 
> > My main concern is that the object lifecycle has not been fully
> > implemented in the proxy and remote device. Error handling is
> 
> Thank you for your feedback on. FWIW, we did check about the unrealize() path
> in the object lifecycle management. We noticed that the destructor for the PCI
> devices (pci_qdev_unrealize()) is currently not invoking the instance specific
> destructor/unrealize functions. While this is not an excuse for not implementing
> the unrealize functions, it currently doesn’t have an impact on the hot unplug path.
> 
> You’re correct, we should implement the unrealize/destructor for the Proxy & remote
> objects. We’ll also look into any background for why the PCI devices don’t call
> instance specific destructor.

PCIDeviceClass->exit() is invoked by pci_qdev_unrealize(). I'm not sure
why it's called "exit" instead of "unrealize" but PCI devices implement
it to perform clean-up.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 12/21] multi-process: Connect Proxy Object with device in the remote process
  2020-07-01  9:20   ` Stefan Hajnoczi
@ 2020-07-24 16:57     ` Jag Raman
  2020-07-27 13:18       ` Stefan Hajnoczi
  0 siblings, 1 reply; 51+ messages in thread
From: Jag Raman @ 2020-07-24 16:57 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, fam, swapnil.ingle, John G Johnson, qemu-devel,
	kraxel, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini



> On Jul 1, 2020, at 5:20 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Sat, Jun 27, 2020 at 10:09:34AM -0700, elena.ufimtseva@oracle.com wrote:
>> From: Jagannathan Raman <jag.raman@oracle.com>
>> 
>> Send a message to the remote process to connect PCI device with the
>> corresponding Proxy object in QEMU
> 
> I thought the protocol was simplified to a 1:1 device:socket model, but
> this patch seems to implement an N:1 model?

Hi Stefan,

Thanks for your feedback on all the patches. We were able to address most
of your feedback in this series, and we are looking forward to sending the next
version out for review.

We wanted to double check with you regarding this patch alone.

The protocol is still a 1:1 device:socket model. It’s not possible for a proxy device
object to send messages to arbitrary devices in the remote.

> 
> In a 1:1 model the CONNECT_DEV message is not necessary because each
> socket is already associated with a specific remote device (e.g. qemu -M
> remote -object mplink,dev=lsi-scsi-1,sockpath=/tmp/lsi-scsi-1.sock).
> Connecting to the socket already indicates which device we are talking
> to.
> 
> The N:1 model will work but it's more complex. There is a main socket
> that is used for CONNECT_DEV (anything else?) and we need to worry about
> the lifecycle of the per-device sockets that are passed over the main
> socket.

The main socket is only used for CONNECT_DEV. The CONNECT_DEV
message sticks a fd to the remote device.

We are using the following command-line in the remote process:
qemu-system-x86_64 -machine remote,fd=4 -device lsi53c895a,id=lsi1 ...

The alternative approach would be to let the orchestrator to assign fds for
each remote device. In this approach, we would specify an ‘fd’ for each
device object like below:
qemu-system-x86_64 -machine remote -device lsi53c895a,id=lsi1,fd=4 …

The alternative approach would entail changes to the DeviceState struct
and qdev_device_add() function, which we thought is not preferable. Could
you please share your thoughts on this?

Thanks!
—
Jag

> 
>> @@ -50,3 +58,34 @@ gboolean mpqemu_process_msg(QIOChannel *ioc, GIOCondition cond,
>> 
>>     return TRUE;
>> }
>> +
>> +static void process_connect_dev_msg(MPQemuMsg *msg, QIOChannel *com,
>> +                                    Error **errp)
>> +{
>> +    char *devid = (char *)msg->data2;
>> +    QIOChannel *dioc = NULL;
>> +    DeviceState *dev = NULL;
>> +    MPQemuMsg ret = { 0 };
>> +    int rc = 0;
>> +
>> +    g_assert(devid && (devid[msg->size - 1] == '\0'));
> 
> Asserts are not suitable for external input validation since a failure
> aborts the program and lets the client cause a denial-of-service. When
> there are multiple clients, one misbehaved client shouldn't be able to
> kill the server. Please validate devid using an if statement and set
> errp on failure.
> 
> Can msg->size be 0? If yes, this code accesses before the beginning of
> the buffer.
> 
>> +
>> +    dev = qdev_find_recursive(sysbus_get_default(), devid);
>> +    if (!dev || !object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
>> +        rc = 0xff;
>> +        goto exit;
>> +    }
>> +
>> +    dioc = qio_channel_new_fd(msg->fds[0], errp);
> 
> Missing error handling if qio_channel_new_fd() fails. We need to
> close(msg->fds[0]) ourselves in this case.
> 
>> +
>> +    qio_channel_add_watch(dioc, G_IO_IN | G_IO_HUP, mpqemu_process_msg,
>> +                          (void *)dev, NULL);
>> +
>> +exit:
>> +    ret.cmd = RET_MSG;
>> +    ret.bytestream = 0;
>> +    ret.data1.u64 = rc;
>> +    ret.size = sizeof(ret.data1);
>> +
>> +    mpqemu_msg_send(&ret, com);
>> +}
>> diff --git a/hw/pci/proxy.c b/hw/pci/proxy.c
>> index 6d62399c52..16649ed0ec 100644
>> --- a/hw/pci/proxy.c
>> +++ b/hw/pci/proxy.c
>> @@ -15,10 +15,38 @@
>> #include "io/channel-util.h"
>> #include "hw/qdev-properties.h"
>> #include "monitor/monitor.h"
>> +#include "io/mpqemu-link.h"
>> 
>> static void proxy_set_socket(PCIProxyDev *pdev, int fd, Error **errp)
>> {
>> +    DeviceState *dev = DEVICE(pdev);
>> +    MPQemuMsg msg = { 0 };
>> +    int fds[2];
>> +    Error *local_err = NULL;
>> +
>>     pdev->com = qio_channel_new_fd(fd, errp);
>> +
>> +    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds)) {
>> +        error_setg(errp, "Failed to create proxy channel with fd %d", fd);
>> +        return;
> 
> pdev->com needs to be cleaned up.
> 
>> diff --git a/io/mpqemu-link.c b/io/mpqemu-link.c
>> index 5887c8c6c0..54df3b254e 100644
>> --- a/io/mpqemu-link.c
>> +++ b/io/mpqemu-link.c
>> @@ -234,6 +234,14 @@ bool mpqemu_msg_valid(MPQemuMsg *msg)
>>             return false;
>>         }
>>         break;
>> +    case CONNECT_DEV:
>> +        if ((msg->num_fds != 1) ||
>> +            (msg->fds[0] == -1) ||
>> +            (msg->fds[0] == -1) ||
> 
> This line is duplicated.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 12/21] multi-process: Connect Proxy Object with device in the remote process
  2020-07-24 16:57     ` Jag Raman
@ 2020-07-27 13:18       ` Stefan Hajnoczi
  2020-07-27 13:22         ` Michael S. Tsirkin
  2020-07-31 18:31         ` Jag Raman
  0 siblings, 2 replies; 51+ messages in thread
From: Stefan Hajnoczi @ 2020-07-27 13:18 UTC (permalink / raw)
  To: Jag Raman
  Cc: Elena Ufimtseva, fam, swapnil.ingle, John G Johnson, qemu-devel,
	kraxel, quintela, mst, armbru, kanth.ghatraju, felipe, thuth,
	ehabkost, konrad.wilk, dgilbert, liran.alon, thanos.makatos, rth,
	kwolf, berrange, mreitz, ross.lagerwall, marcandre.lureau,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 2343 bytes --]

On Fri, Jul 24, 2020 at 12:57:22PM -0400, Jag Raman wrote:
> > On Jul 1, 2020, at 5:20 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Sat, Jun 27, 2020 at 10:09:34AM -0700, elena.ufimtseva@oracle.com wrote:
> > In a 1:1 model the CONNECT_DEV message is not necessary because each
> > socket is already associated with a specific remote device (e.g. qemu -M
> > remote -object mplink,dev=lsi-scsi-1,sockpath=/tmp/lsi-scsi-1.sock).
> > Connecting to the socket already indicates which device we are talking
> > to.
> > 
> > The N:1 model will work but it's more complex. There is a main socket
> > that is used for CONNECT_DEV (anything else?) and we need to worry about
> > the lifecycle of the per-device sockets that are passed over the main
> > socket.
> 
> The main socket is only used for CONNECT_DEV. The CONNECT_DEV
> message sticks a fd to the remote device.
> 
> We are using the following command-line in the remote process:
> qemu-system-x86_64 -machine remote,fd=4 -device lsi53c895a,id=lsi1 ...
> 
> The alternative approach would be to let the orchestrator to assign fds for
> each remote device. In this approach, we would specify an ‘fd’ for each
> device object like below:
> qemu-system-x86_64 -machine remote -device lsi53c895a,id=lsi1,fd=4 …
> 
> The alternative approach would entail changes to the DeviceState struct
> and qdev_device_add() function, which we thought is not preferable. Could
> you please share your thoughts on this?

I suggest dropping multi-device support for now. It will be implemented
differently with VFIO-over-socket anyway, so it's not worth investing
much time into.

The main socket approach needs authentication support if multiple guests
share a remote device emulation process. Otherwise guest A can access
guest B's devices.

It's simpler if each device has a separate UNIX domain socket. It is not
necessary to modify lsi53c895a in order to do this. Either the socket
can be associated with the remote PCIe port (although I think the
current code implements the older PCI Local Bus instead of PCIe) or a
separate -object mpqlink,device=lsi1,fd=4 object can be defined (I think
that's the syntax I've shared in the past).

For now though, just using the -machine remote,fd=4 approach is fine -
but limited to 1 device.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 12/21] multi-process: Connect Proxy Object with device in the remote process
  2020-07-27 13:18       ` Stefan Hajnoczi
@ 2020-07-27 13:22         ` Michael S. Tsirkin
  2020-07-31 18:31         ` Jag Raman
  1 sibling, 0 replies; 51+ messages in thread
From: Michael S. Tsirkin @ 2020-07-27 13:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, fam, swapnil.ingle, John G Johnson, qemu-devel,
	kraxel, Jag Raman, quintela, armbru, kanth.ghatraju, felipe,
	thuth, ehabkost, konrad.wilk, dgilbert, liran.alon,
	thanos.makatos, rth, kwolf, berrange, mreitz, ross.lagerwall,
	marcandre.lureau, pbonzini

On Mon, Jul 27, 2020 at 02:18:29PM +0100, Stefan Hajnoczi wrote:
> I suggest dropping multi-device support for now. It will be implemented
> differently with VFIO-over-socket anyway, so it's not worth investing
> much time into.

I'm not sure I buy the VFIO-over-socket yet. It seems a weirdly QEMU
specific IPC mechanism. However ...

> The main socket approach needs authentication support if multiple guests
> share a remote device emulation process. Otherwise guest A can access
> guest B's devices.
> 
> It's simpler if each device has a separate UNIX domain socket. It is not
> necessary to modify lsi53c895a in order to do this. Either the socket
> can be associated with the remote PCIe port (although I think the
> current code implements the older PCI Local Bus instead of PCIe) or a
> separate -object mpqlink,device=lsi1,fd=4 object can be defined (I think
> that's the syntax I've shared in the past).
> 
> For now though, just using the -machine remote,fd=4 approach is fine -
> but limited to 1 device.
> 
> Stefan

I agree to all of the above. Starting with a single fd is a good idea.

-- 
MST




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 12/21] multi-process: Connect Proxy Object with device in the remote process
  2020-07-27 13:18       ` Stefan Hajnoczi
  2020-07-27 13:22         ` Michael S. Tsirkin
@ 2020-07-31 18:31         ` Jag Raman
  1 sibling, 0 replies; 51+ messages in thread
From: Jag Raman @ 2020-07-31 18:31 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, fam, Swapnil Ingle, John G Johnson, qemu-devel,
	kraxel, quintela, Michael S. Tsirkin, Markus Armbruster,
	kanth.ghatraju, Felipe Franciosi, thuth, ehabkost, konrad.wilk,
	dgilbert, liran.alon, pbonzini, rth, kwolf, berrange, mreitz,
	ross.lagerwall, marcandre.lureau, thanos.makatos



> On Jul 27, 2020, at 9:18 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Fri, Jul 24, 2020 at 12:57:22PM -0400, Jag Raman wrote:
>>> On Jul 1, 2020, at 5:20 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> On Sat, Jun 27, 2020 at 10:09:34AM -0700, elena.ufimtseva@oracle.com wrote:
>>> In a 1:1 model the CONNECT_DEV message is not necessary because each
>>> socket is already associated with a specific remote device (e.g. qemu -M
>>> remote -object mplink,dev=lsi-scsi-1,sockpath=/tmp/lsi-scsi-1.sock).
>>> Connecting to the socket already indicates which device we are talking
>>> to.
>>> 
>>> The N:1 model will work but it's more complex. There is a main socket
>>> that is used for CONNECT_DEV (anything else?) and we need to worry about
>>> the lifecycle of the per-device sockets that are passed over the main
>>> socket.
>> 
>> The main socket is only used for CONNECT_DEV. The CONNECT_DEV
>> message sticks a fd to the remote device.
>> 
>> We are using the following command-line in the remote process:
>> qemu-system-x86_64 -machine remote,fd=4 -device lsi53c895a,id=lsi1 ...
>> 
>> The alternative approach would be to let the orchestrator to assign fds for
>> each remote device. In this approach, we would specify an ‘fd’ for each
>> device object like below:
>> qemu-system-x86_64 -machine remote -device lsi53c895a,id=lsi1,fd=4 …
>> 
>> The alternative approach would entail changes to the DeviceState struct
>> and qdev_device_add() function, which we thought is not preferable. Could
>> you please share your thoughts on this?
> 
> I suggest dropping multi-device support for now. It will be implemented
> differently with VFIO-over-socket anyway, so it's not worth investing
> much time into.
> 
> The main socket approach needs authentication support if multiple guests
> share a remote device emulation process. Otherwise guest A can access
> guest B's devices.
> 
> It's simpler if each device has a separate UNIX domain socket. It is not
> necessary to modify lsi53c895a in order to do this. Either the socket
> can be associated with the remote PCIe port (although I think the
> current code implements the older PCI Local Bus instead of PCIe) or a
> separate -object mpqlink,device=lsi1,fd=4 object can be defined (I think
> that's the syntax I've shared in the past).
> 
> For now though, just using the -machine remote,fd=4 approach is fine -
> but limited to 1 device.

Hi Stefan,

Thanks for your response.

We didn’t quite get the purpose of the “object” when you proposed the
command-line. But we understand it now.

We also agree to wait for the authentication protocol before enabling
multiple devices in the same remote process. Therefore we have limited
the number of devices to 1.

Even with one device, we still had the problem of attaching the file
descriptor with a device int the remote end. The object idea you proposed
seemed to fit well with connecting the fd with device. We have added this
object in v8, which we just sent out.

We have dropped the main-socket/control-channel as it’s not needed
anymore.

We look forward to your responses to v8.

Thank you very much!

> 
> Stefan



^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2020-07-31 18:34 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-27 17:09 [PATCH v7 00/21] Initial support for multi-process qemu elena.ufimtseva
2020-06-27 17:09 ` [PATCH v7 01/21] memory: alloc RAM from file at offset elena.ufimtseva
2020-06-30 14:59   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 02/21] multi-process: Add config option for multi-process QEMU elena.ufimtseva
2020-06-30 14:57   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 03/21] multi-process: setup PCI host bridge for remote device elena.ufimtseva
2020-06-30 15:17   ` Stefan Hajnoczi
2020-07-09 14:23     ` Jag Raman
2020-06-27 17:09 ` [PATCH v7 04/21] multi-process: setup a machine object for remote device process elena.ufimtseva
2020-06-30 15:26   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 05/21] multi-process: add qio channel function to transmit elena.ufimtseva
2020-06-30 15:29   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 06/21] multi-process: define MPQemuMsg format and transmission functions elena.ufimtseva
2020-06-30 15:53   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 07/21] multi-process: add co-routines to communicate with remote elena.ufimtseva
2020-06-30 18:31   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 08/21] multi-process: Initialize communication channel at the remote end elena.ufimtseva
2020-07-01  6:44   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 09/21] multi-process: Initialize message handler in remote device elena.ufimtseva
2020-07-01  6:53   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 10/21] multi-process: setup memory manager for " elena.ufimtseva
2020-07-01  7:58   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 11/21] multi-process: introduce proxy object elena.ufimtseva
2020-07-01  8:58   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 12/21] multi-process: Connect Proxy Object with device in the remote process elena.ufimtseva
2020-07-01  9:20   ` Stefan Hajnoczi
2020-07-24 16:57     ` Jag Raman
2020-07-27 13:18       ` Stefan Hajnoczi
2020-07-27 13:22         ` Michael S. Tsirkin
2020-07-31 18:31         ` Jag Raman
2020-06-27 17:09 ` [PATCH v7 13/21] multi-process: Forward PCI config space acceses to " elena.ufimtseva
2020-07-01  9:40   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 14/21] multi-process: PCI BAR read/write handling for proxy & remote endpoints elena.ufimtseva
2020-07-01 10:41   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 15/21] multi-process: Synchronize remote memory elena.ufimtseva
2020-07-01 10:55   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 16/21] multi-process: create IOHUB object to handle irq elena.ufimtseva
2020-07-02 12:09   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 17/21] multi-process: Retrieve PCI info from remote process elena.ufimtseva
2020-07-02 12:59   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 18/21] multi-process: heartbeat messages to remote elena.ufimtseva
2020-07-02 13:16   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 19/21] multi-process: perform device reset in the remote process elena.ufimtseva
2020-07-02 13:19   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 20/21] multi-process: add the concept description to docs/devel/qemu-multiprocess elena.ufimtseva
2020-07-02 13:31   ` Stefan Hajnoczi
2020-06-27 17:09 ` [PATCH v7 21/21] multi-process: add configure and usage information elena.ufimtseva
2020-07-02 13:26   ` Stefan Hajnoczi
2020-07-02 13:40 ` [PATCH v7 00/21] Initial support for multi-process qemu Stefan Hajnoczi
2020-07-09 14:16   ` Jag Raman
2020-07-13 11:21     ` Stefan Hajnoczi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).