All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 00/14] vfio-user server in QEMU
@ 2021-12-15 15:35 Jagannathan Raman
  2021-12-15 15:35 ` [PATCH v4 01/14] configure, meson: override C compiler for cmake Jagannathan Raman
                   ` (13 more replies)
  0 siblings, 14 replies; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

Based-on: <cover.1636057885.git.john.g.johnson@oracle.com>

Hi,

Thank you very much for reviewing the patches in the previous revision!

We have addressed your comments. We added the support for MSI/x and
IOMMU in this series.

Please share your feedback for the latest series - appreciate your time.

Please the list below for changes since last revision:

[PATCH v4 02/14] tests/avocado: Specify target VM argument to helper routines
  - New in this series

[PATCH v4 03/14] vfio-user: build library
 - Added VFIO_USER_SERVER Kconfig option and vfio_user_server option to
   the configure script. VFIO_USER_SERVER depends on MULTIPROCESS
 - Reverted to enabling multiprocess by default. As such, tagging distros
   in the multiprocess acceptance test for multiprocess is not
   necessary - reverted them.

[PATCH v4 04/14] vfio-user: define vfio-user-server object
 - Using qapi_free_SocketAddress to free SocketAddress
 - Changed object name to x-vfio-user-server
 - Removed max_devs class variable. This is not needed anymore as we added
   IOMMU support in this series

[PATCH v4 05/14] vfio-user: instantiate vfio-user context
 - Redesigned init code to abort QEMU during startup upon error, whereas
   synchronously report the error and continue running QEMU during hotplug
 - Added qemu_remove_machine_init_done_notifier() in object finalize
 - Removed NULL initialization of o->vfu_ctx

[PATCH v4 07/14] vfio-user: run vfio-user context
 - Addressed comment about hot plug from this patch, in patch 4
 - poll for fd to become readable instead of busy spinning alone

[PATCH v4 11/14] vfio-user: IOMMU support for remote device
 - New in this series

[PATCH v4 12/14] handle device interrupts
 - Added support for MSI/MSI-x interrupts
 - refactored INTx handling code to leverage the hash table used by
   the MSI code
 - Added a hash table to lookup vfio-user context by PCI device
   function index

[PATCH v4 13/14] vfio-user: register handlers to facilitate migration
 - Compute the size of the migration data when registering for migration
   BAR. This alleviates the need to hard code a sufficiently large value
   for the migration BAR.

[PATCH v4 14/14] vfio-user: avocado tests for vfio-user
 - Corrected the object’s name in the test
 - Added tests for hotplug and migration cases

Thank you very much!

Jagannathan Raman (14):
  configure, meson: override C compiler for cmake
  tests/avocado: Specify target VM argument to helper routines
  vfio-user: build library
  vfio-user: define vfio-user-server object
  vfio-user: instantiate vfio-user context
  vfio-user: find and init PCI device
  vfio-user: run vfio-user context
  vfio-user: handle PCI config space accesses
  vfio-user: handle DMA mappings
  vfio-user: handle PCI BAR accesses
  vfio-user: IOMMU support for remote device
  vfio-user: handle device interrupts
  vfio-user: register handlers to facilitate migration
  vfio-user: avocado tests for vfio-user

 configure                                  |  21 +-
 meson.build                                |  44 +-
 qapi/qom.json                              |  20 +-
 include/hw/pci/pci.h                       |   8 +
 include/hw/remote/iohub.h                  |   1 +
 include/hw/remote/iommu.h                  |  24 +
 include/migration/vmstate.h                |   2 +
 migration/savevm.h                         |   2 +
 hw/pci/msi.c                               |  13 +-
 hw/pci/msix.c                              |  12 +-
 hw/pci/pci.c                               |   2 +-
 hw/remote/iohub.c                          |   7 +
 hw/remote/iommu.c                          | 117 +++
 hw/remote/machine.c                        |   5 +
 hw/remote/vfio-user-obj.c                  | 957 +++++++++++++++++++++
 migration/savevm.c                         |  73 ++
 migration/vmstate.c                        |  19 +
 .gitlab-ci.d/buildtest.yml                 |   2 +
 .gitmodules                                |   3 +
 Kconfig.host                               |   4 +
 MAINTAINERS                                |   5 +
 hw/remote/Kconfig                          |   4 +
 hw/remote/meson.build                      |   4 +
 hw/remote/trace-events                     |  11 +
 meson_options.txt                          |   2 +
 subprojects/libvfio-user                   |   1 +
 tests/avocado/avocado_qemu/__init__.py     |  10 +-
 tests/avocado/vfio-user.py                 | 225 +++++
 tests/docker/dockerfiles/centos8.docker    |   2 +
 tests/docker/dockerfiles/ubuntu2004.docker |   2 +
 30 files changed, 1591 insertions(+), 11 deletions(-)
 create mode 100644 include/hw/remote/iommu.h
 create mode 100644 hw/remote/iommu.c
 create mode 100644 hw/remote/vfio-user-obj.c
 create mode 160000 subprojects/libvfio-user
 create mode 100644 tests/avocado/vfio-user.py

-- 
2.20.1



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v4 01/14] configure, meson: override C compiler for cmake
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-15 15:35 ` [PATCH v4 02/14] tests/avocado: Specify target VM argument to helper routines Jagannathan Raman
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

The compiler path that cmake gets from meson is corrupted. It results in
the following error:
| -- The C compiler identification is unknown
| CMake Error at CMakeLists.txt:35 (project):
| The CMAKE_C_COMPILER:
| /opt/rh/devtoolset-9/root/bin/cc;-m64;-mcx16
| is not a full path to an existing compiler tool.

Explicitly specify the C compiler for cmake to avoid this error

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 configure | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/configure b/configure
index 48c21775f3..dd000ce299 100755
--- a/configure
+++ b/configure
@@ -3919,6 +3919,8 @@ if test "$skip_meson" = no; then
   echo "cpp_args = [$(meson_quote $CXXFLAGS $EXTRA_CXXFLAGS)]" >> $cross
   echo "c_link_args = [$(meson_quote $CFLAGS $LDFLAGS $EXTRA_CFLAGS $EXTRA_LDFLAGS)]" >> $cross
   echo "cpp_link_args = [$(meson_quote $CXXFLAGS $LDFLAGS $EXTRA_CXXFLAGS $EXTRA_LDFLAGS)]" >> $cross
+  echo "[cmake]" >> $cross
+  echo "CMAKE_C_COMPILER = [$(meson_quote $cc $CPU_CFLAGS)]" >> $cross
   echo "[binaries]" >> $cross
   echo "c = [$(meson_quote $cc $CPU_CFLAGS)]" >> $cross
   test -n "$cxx" && echo "cpp = [$(meson_quote $cxx $CPU_CFLAGS)]" >> $cross
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 02/14] tests/avocado: Specify target VM argument to helper routines
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
  2021-12-15 15:35 ` [PATCH v4 01/14] configure, meson: override C compiler for cmake Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-15 15:54   ` Philippe Mathieu-Daudé
  2021-12-15 22:04   ` Beraldo Leal
  2021-12-15 15:35 ` [PATCH v4 03/14] vfio-user: build library Jagannathan Raman
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

Specify target VM for exec_command and
exec_command_and_wait_for_pattern routines

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 tests/avocado/avocado_qemu/__init__.py | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py
index 75063c0c30..26ac782f53 100644
--- a/tests/avocado/avocado_qemu/__init__.py
+++ b/tests/avocado/avocado_qemu/__init__.py
@@ -198,7 +198,7 @@ def wait_for_console_pattern(test, success_message, failure_message=None,
     """
     _console_interaction(test, success_message, failure_message, None, vm=vm)
 
-def exec_command(test, command):
+def exec_command(test, command, vm=None):
     """
     Send a command to a console (appending CRLF characters), while logging
     the content.
@@ -208,10 +208,11 @@ def exec_command(test, command):
     :param command: the command to send
     :type command: str
     """
-    _console_interaction(test, None, None, command + '\r')
+    _console_interaction(test, None, None, command + '\r', vm=vm)
 
 def exec_command_and_wait_for_pattern(test, command,
-                                      success_message, failure_message=None):
+                                      success_message, failure_message=None,
+                                      vm=None):
     """
     Send a command to a console (appending CRLF characters), then wait
     for success_message to appear on the console, while logging the.
@@ -224,7 +225,8 @@ def exec_command_and_wait_for_pattern(test, command,
     :param success_message: if this message appears, test succeeds
     :param failure_message: if this message appears, test fails
     """
-    _console_interaction(test, success_message, failure_message, command + '\r')
+    _console_interaction(test, success_message, failure_message, command + '\r',
+                         vm=vm)
 
 class QemuBaseTest(avocado.Test):
     def _get_unique_tag_val(self, tag_name):
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 03/14] vfio-user: build library
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
  2021-12-15 15:35 ` [PATCH v4 01/14] configure, meson: override C compiler for cmake Jagannathan Raman
  2021-12-15 15:35 ` [PATCH v4 02/14] tests/avocado: Specify target VM argument to helper routines Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-15 15:35 ` [PATCH v4 04/14] vfio-user: define vfio-user-server object Jagannathan Raman
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

add the libvfio-user library as a submodule. build it as a cmake
subproject.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 configure                                  | 19 +++++++++-
 meson.build                                | 44 +++++++++++++++++++++-
 .gitlab-ci.d/buildtest.yml                 |  2 +
 .gitmodules                                |  3 ++
 Kconfig.host                               |  4 ++
 MAINTAINERS                                |  1 +
 hw/remote/Kconfig                          |  4 ++
 hw/remote/meson.build                      |  2 +
 meson_options.txt                          |  2 +
 subprojects/libvfio-user                   |  1 +
 tests/docker/dockerfiles/centos8.docker    |  2 +
 tests/docker/dockerfiles/ubuntu2004.docker |  2 +
 12 files changed, 84 insertions(+), 2 deletions(-)
 create mode 160000 subprojects/libvfio-user

diff --git a/configure b/configure
index dd000ce299..7748802d11 100755
--- a/configure
+++ b/configure
@@ -366,6 +366,7 @@ ninja=""
 gio="$default_feature"
 skip_meson=no
 slirp_smbd="$default_feature"
+vfio_user_server="disabled"
 
 # The following Meson options are handled manually (still they
 # are included in the automatically generated help message)
@@ -1190,6 +1191,10 @@ for opt do
   ;;
   --disable-blobs) meson_option_parse --disable-install-blobs ""
   ;;
+  --enable-vfio-user-server) vfio_user_server="enabled"
+  ;;
+  --disable-vfio-user-server) vfio_user_server="disabled"
+  ;;
   --enable-tcmalloc) meson_option_parse --enable-malloc=tcmalloc tcmalloc
   ;;
   --enable-jemalloc) meson_option_parse --enable-malloc=jemalloc jemalloc
@@ -1470,6 +1475,7 @@ cat << EOF
   rng-none        dummy RNG, avoid using /dev/(u)random and getrandom()
   gio             libgio support
   slirp-smbd      use smbd (at path --smbd=*) in slirp networking
+  vfio-user-server    vfio-user server support
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -3250,6 +3256,17 @@ but not implemented on your system"
     fi
 fi
 
+##########################################
+# check for vfio_user_server
+
+case "$vfio_user_server" in
+  auto | enabled )
+    if test "$git_submodules_action" != "ignore"; then
+      git_submodules="${git_submodules} subprojects/libvfio-user"
+    fi
+    ;;
+esac
+
 ##########################################
 # End of CC checks
 # After here, no more $cc or $ld runs
@@ -3998,7 +4015,7 @@ if test "$skip_meson" = no; then
         -Db_pie=$(if test "$pie" = yes; then echo true; else echo false; fi) \
         -Db_coverage=$(if test "$gcov" = yes; then echo true; else echo false; fi) \
         -Db_lto=$lto -Dcfi=$cfi -Dtcg=$tcg -Dxen=$xen \
-        -Dcapstone=$capstone -Dfdt=$fdt -Dslirp=$slirp \
+        -Dcapstone=$capstone -Dfdt=$fdt -Dslirp=$slirp -Dvfio_user_server=$vfio_user_server \
         $(test -n "${LIB_FUZZING_ENGINE+xxx}" && echo "-Dfuzzing_engine=$LIB_FUZZING_ENGINE") \
         $(if test "$default_feature" = no; then echo "-Dauto_features=disabled"; fi) \
         "$@" $cross_arg "$PWD" "$source_path"
diff --git a/meson.build b/meson.build
index 9702fdce6d..339c28ee25 100644
--- a/meson.build
+++ b/meson.build
@@ -252,6 +252,11 @@ if targetos != 'linux' and get_option('multiprocess').enabled()
 endif
 multiprocess_allowed = targetos == 'linux' and not get_option('multiprocess').disabled()
 
+if targetos != 'linux' and get_option('vfio_user_server').enabled()
+  error('vfio-user server is supported only on Linux')
+endif
+vfio_user_server_allowed = targetos == 'linux' and not get_option('vfio_user_server').disabled()
+
 libm = cc.find_library('m', required: false)
 threads = dependency('threads')
 util = cc.find_library('util', required: false)
@@ -1824,7 +1829,8 @@ host_kconfig = \
   (have_virtfs ? ['CONFIG_VIRTFS=y'] : []) + \
   ('CONFIG_LINUX' in config_host ? ['CONFIG_LINUX=y'] : []) + \
   ('CONFIG_PVRDMA' in config_host ? ['CONFIG_PVRDMA=y'] : []) + \
-  (multiprocess_allowed ? ['CONFIG_MULTIPROCESS_ALLOWED=y'] : [])
+  (multiprocess_allowed ? ['CONFIG_MULTIPROCESS_ALLOWED=y'] : []) + \
+  (vfio_user_server_allowed ? ['CONFIG_VFIO_USER_SERVER_ALLOWED=y'] : [])
 
 ignored = [ 'TARGET_XML_FILES', 'TARGET_ABI_DIR', 'TARGET_ARCH' ]
 
@@ -2201,6 +2207,41 @@ if get_option('cfi') and slirp_opt == 'system'
          + ' Please configure with --enable-slirp=git')
 endif
 
+vfiouser = not_found
+if have_system and vfio_user_server_allowed
+  have_internal = fs.exists(meson.current_source_dir() / 'subprojects/libvfio-user/Makefile')
+
+  if not have_internal
+    error('libvfio-user source not found - please pull git submodule')
+  endif
+
+  json_c = dependency('json-c', required: false)
+  if not json_c.found()
+    json_c = dependency('libjson-c', required: false)
+  endif
+  if not json_c.found()
+    json_c = dependency('libjson-c-dev', required: false)
+  endif
+
+  if not json_c.found()
+    error('Unable to find json-c package')
+  endif
+
+  cmake = import('cmake')
+
+  vfiouser_subproj = cmake.subproject('libvfio-user')
+
+  vfiouser_sl = vfiouser_subproj.dependency('vfio-user-static')
+
+  # Although cmake links the json-c library with vfio-user-static
+  # target, that info is not available to meson via cmake.subproject.
+  # As such, we have to separately declare the json-c dependency here.
+  # This appears to be a current limitation of using cmake inside meson.
+  # libvfio-user is planning a switch to meson in the future, which
+  # would address this item automatically.
+  vfiouser = declare_dependency(dependencies: [vfiouser_sl, json_c])
+endif
+
 fdt = not_found
 fdt_opt = get_option('fdt')
 if have_system
@@ -3301,6 +3342,7 @@ summary_info += {'target list':       ' '.join(target_dirs)}
 if have_system
   summary_info += {'default devices':   get_option('default_devices')}
   summary_info += {'out of process emulation': multiprocess_allowed}
+  summary_info += {'vfio-user server': vfio_user_server_allowed}
 endif
 summary(summary_info, bool_yn: true, section: 'Targets and accelerators')
 
diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
index 71d0f407ad..e29f8c1f13 100644
--- a/.gitlab-ci.d/buildtest.yml
+++ b/.gitlab-ci.d/buildtest.yml
@@ -42,6 +42,7 @@ build-system-ubuntu:
   variables:
     IMAGE: ubuntu2004
     CONFIGURE_ARGS: --enable-docs --enable-fdt=system --enable-slirp=system
+                    --enable-vfio-user-server
     TARGETS: aarch64-softmmu alpha-softmmu cris-softmmu hppa-softmmu
       microblazeel-softmmu mips64el-softmmu
     MAKE_CHECK_ARGS: check-build
@@ -142,6 +143,7 @@ build-system-centos:
     IMAGE: centos8
     CONFIGURE_ARGS: --disable-nettle --enable-gcrypt --enable-fdt=system
                     --enable-modules --enable-trace-backends=dtrace
+                    --enable-vfio-user-server
     TARGETS: ppc64-softmmu or1k-softmmu s390x-softmmu
       x86_64-softmmu rx-softmmu sh4-softmmu nios2-softmmu
     MAKE_CHECK_ARGS: check-build
diff --git a/.gitmodules b/.gitmodules
index 08b1b48a09..cfeea7cf20 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -64,3 +64,6 @@
 [submodule "roms/vbootrom"]
 	path = roms/vbootrom
 	url = https://gitlab.com/qemu-project/vbootrom.git
+[submodule "subprojects/libvfio-user"]
+	path = subprojects/libvfio-user
+	url = https://github.com/nutanix/libvfio-user.git
diff --git a/Kconfig.host b/Kconfig.host
index 60b9c07b5e..f2da8bcf8a 100644
--- a/Kconfig.host
+++ b/Kconfig.host
@@ -45,3 +45,7 @@ config MULTIPROCESS_ALLOWED
 config FUZZ
     bool
     select SPARSE_MEM
+
+config VFIO_USER_SERVER_ALLOWED
+    bool
+    imply VFIO_USER_SERVER
diff --git a/MAINTAINERS b/MAINTAINERS
index a8fa61a20d..e0daf349ae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3464,6 +3464,7 @@ F: hw/remote/proxy-memory-listener.c
 F: include/hw/remote/proxy-memory-listener.h
 F: hw/remote/iohub.c
 F: include/hw/remote/iohub.h
+F: subprojects/libvfio-user
 
 EBPF:
 M: Jason Wang <jasowang@redhat.com>
diff --git a/hw/remote/Kconfig b/hw/remote/Kconfig
index 08c16e235f..2d6b4f4cf4 100644
--- a/hw/remote/Kconfig
+++ b/hw/remote/Kconfig
@@ -2,3 +2,7 @@ config MULTIPROCESS
     bool
     depends on PCI && PCI_EXPRESS && KVM
     select REMOTE_PCIHOST
+
+config VFIO_USER_SERVER
+    bool
+    depends on MULTIPROCESS
diff --git a/hw/remote/meson.build b/hw/remote/meson.build
index e6a5574242..dfea6b533b 100644
--- a/hw/remote/meson.build
+++ b/hw/remote/meson.build
@@ -7,6 +7,8 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('remote-obj.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iohub.c'))
 
+remote_ss.add(when: 'CONFIG_VFIO_USER_SERVER', if_true: vfiouser)
+
 specific_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('memory.c'))
 specific_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy-memory-listener.c'))
 
diff --git a/meson_options.txt b/meson_options.txt
index e740dce2a5..3c2d73481a 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -66,6 +66,8 @@ option('cfi_debug', type: 'boolean', value: 'false',
        description: 'Verbose errors in case of CFI violation')
 option('multiprocess', type: 'feature', value: 'auto',
        description: 'Out of process device emulation support')
+option('vfio_user_server', type: 'feature', value: 'auto',
+       description: 'vfio-user server support')
 
 option('attr', type : 'feature', value : 'auto',
        description: 'attr/xattr support')
diff --git a/subprojects/libvfio-user b/subprojects/libvfio-user
new file mode 160000
index 0000000000..7056525da5
--- /dev/null
+++ b/subprojects/libvfio-user
@@ -0,0 +1 @@
+Subproject commit 7056525da5399d00831e90bed4aedb4b8442c9b2
diff --git a/tests/docker/dockerfiles/centos8.docker b/tests/docker/dockerfiles/centos8.docker
index 46398c61ee..646abcda1f 100644
--- a/tests/docker/dockerfiles/centos8.docker
+++ b/tests/docker/dockerfiles/centos8.docker
@@ -12,6 +12,7 @@ ENV PACKAGES \
     capstone-devel \
     ccache \
     clang \
+    cmake \
     ctags \
     cyrus-sasl-devel \
     daxctl-devel \
@@ -32,6 +33,7 @@ ENV PACKAGES \
     gtk3-devel \
     hostname \
     jemalloc-devel \
+    json-c-devel \
     libaio-devel \
     libasan \
     libattr-devel \
diff --git a/tests/docker/dockerfiles/ubuntu2004.docker b/tests/docker/dockerfiles/ubuntu2004.docker
index 39de63d012..ca4dff0e6b 100644
--- a/tests/docker/dockerfiles/ubuntu2004.docker
+++ b/tests/docker/dockerfiles/ubuntu2004.docker
@@ -6,6 +6,7 @@ ENV PACKAGES \
     ca-certificates \
     ccache \
     clang \
+    cmake \
     dbus \
     debianutils \
     diffutils \
@@ -44,6 +45,7 @@ ENV PACKAGES \
     libiscsi-dev \
     libjemalloc-dev \
     libjpeg-turbo8-dev \
+    libjson-c-dev \
     liblttng-ust-dev \
     liblzo2-dev \
     libncursesw5-dev \
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 04/14] vfio-user: define vfio-user-server object
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
                   ` (2 preceding siblings ...)
  2021-12-15 15:35 ` [PATCH v4 03/14] vfio-user: build library Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-16  9:33   ` Stefan Hajnoczi
  2021-12-16  9:58   ` Stefan Hajnoczi
  2021-12-15 15:35 ` [PATCH v4 05/14] vfio-user: instantiate vfio-user context Jagannathan Raman
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

Define vfio-user object which is remote process server for QEMU. Setup
object initialization functions and properties necessary to instantiate
the object

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 qapi/qom.json             |  20 ++++-
 hw/remote/vfio-user-obj.c | 175 ++++++++++++++++++++++++++++++++++++++
 MAINTAINERS               |   1 +
 hw/remote/meson.build     |   1 +
 hw/remote/trace-events    |   3 +
 5 files changed, 198 insertions(+), 2 deletions(-)
 create mode 100644 hw/remote/vfio-user-obj.c

diff --git a/qapi/qom.json b/qapi/qom.json
index ccd1167808..6001a9b8f0 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -703,6 +703,20 @@
 { 'struct': 'RemoteObjectProperties',
   'data': { 'fd': 'str', 'devid': 'str' } }
 
+##
+# @VfioUserServerProperties:
+#
+# Properties for x-vfio-user-server objects.
+#
+# @socket: socket to be used by the libvfiouser library
+#
+# @device: the id of the device to be emulated at the server
+#
+# Since: 6.2
+##
+{ 'struct': 'VfioUserServerProperties',
+  'data': { 'socket': 'SocketAddress', 'device': 'str' } }
+
 ##
 # @RngProperties:
 #
@@ -837,7 +851,8 @@
     'tls-creds-psk',
     'tls-creds-x509',
     'tls-cipher-suites',
-    { 'name': 'x-remote-object', 'features': [ 'unstable' ] }
+    { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
+    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }
   ] }
 
 ##
@@ -900,7 +915,8 @@
       'tls-creds-psk':              'TlsCredsPskProperties',
       'tls-creds-x509':             'TlsCredsX509Properties',
       'tls-cipher-suites':          'TlsCredsProperties',
-      'x-remote-object':            'RemoteObjectProperties'
+      'x-remote-object':            'RemoteObjectProperties',
+      'x-vfio-user-server':         'VfioUserServerProperties'
   } }
 
 ##
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
new file mode 100644
index 0000000000..10296ef33c
--- /dev/null
+++ b/hw/remote/vfio-user-obj.c
@@ -0,0 +1,175 @@
+/**
+ * QEMU vfio-user-server server object
+ *
+ * Copyright © 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL-v2, version 2 or later.
+ *
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+/**
+ * Usage: add options:
+ *     -machine x-remote
+ *     -device <PCI-device>,id=<pci-dev-id>
+ *     -object x-vfio-user-server,id=<id>,type=unix,path=<socket-path>,
+ *             device=<pci-dev-id>
+ *
+ * Note that x-vfio-user-server object must be used with x-remote machine only.
+ * This server could only support PCI devices for now.
+ *
+ * type - SocketAddress type - presently "unix" alone is supported. Required
+ *        option
+ *
+ * path - named unix socket, it will be created by the server. It is
+ *        a required option
+ *
+ * device - id of a device on the server, a required option. PCI devices
+ *          alone are supported presently.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "qom/object.h"
+#include "qom/object_interfaces.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+#include "sysemu/runstate.h"
+#include "hw/boards.h"
+#include "hw/remote/machine.h"
+#include "qapi/error.h"
+#include "qapi/qapi-visit-sockets.h"
+
+#define TYPE_VFU_OBJECT "x-vfio-user-server"
+OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
+
+struct VfuObjectClass {
+    ObjectClass parent_class;
+
+    unsigned int nr_devs;
+
+    bool daemon;
+};
+
+struct VfuObject {
+    /* private */
+    Object parent;
+
+    SocketAddress *socket;
+
+    char *device;
+
+    Error *err;
+};
+
+static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
+                                  void *opaque, Error **errp)
+{
+    VfuObject *o = VFU_OBJECT(obj);
+
+    qapi_free_SocketAddress(o->socket);
+
+    o->socket = NULL;
+
+    visit_type_SocketAddress(v, name, &o->socket, errp);
+
+    if (o->socket->type != SOCKET_ADDRESS_TYPE_UNIX) {
+        qapi_free_SocketAddress(o->socket);
+        o->socket = NULL;
+        error_setg(errp, "vfu: Unsupported socket type - %s",
+                   o->socket->u.q_unix.path);
+        return;
+    }
+
+    trace_vfu_prop("socket", o->socket->u.q_unix.path);
+}
+
+static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
+{
+    VfuObject *o = VFU_OBJECT(obj);
+
+    g_free(o->device);
+
+    o->device = g_strdup(str);
+
+    trace_vfu_prop("device", str);
+}
+
+static void vfu_object_init(Object *obj)
+{
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(obj);
+    VfuObject *o = VFU_OBJECT(obj);
+
+    k->nr_devs++;
+
+    if (!object_dynamic_cast(OBJECT(current_machine), TYPE_REMOTE_MACHINE)) {
+        error_setg(&o->err, "vfu: %s only compatible with %s machine",
+                   TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
+        return;
+    }
+}
+
+static void vfu_object_finalize(Object *obj)
+{
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(obj);
+    VfuObject *o = VFU_OBJECT(obj);
+
+    k->nr_devs--;
+
+    qapi_free_SocketAddress(o->socket);
+
+    o->socket = NULL;
+
+    g_free(o->device);
+
+    o->device = NULL;
+
+    if (!k->nr_devs && !k->daemon) {
+        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
+    }
+}
+
+static void vfu_object_class_init(ObjectClass *klass, void *data)
+{
+    VfuObjectClass *k = VFU_OBJECT_CLASS(klass);
+
+    k->nr_devs = 0;
+
+    /* Later determine how to detect a daemon */
+    k->daemon = false;
+
+    object_class_property_add(klass, "socket", "SocketAddress", NULL,
+                              vfu_object_set_socket, NULL, NULL);
+    object_class_property_set_description(klass, "socket",
+                                          "SocketAddress "
+                                          "(ex: type=unix,path=/tmp/sock). "
+                                          "Only UNIX is presently supported");
+    object_class_property_add_str(klass, "device", NULL,
+                                  vfu_object_set_device);
+    object_class_property_set_description(klass, "device",
+                                          "device ID - only PCI devices "
+                                          "are presently supported");
+}
+
+static const TypeInfo vfu_object_info = {
+    .name = TYPE_VFU_OBJECT,
+    .parent = TYPE_OBJECT,
+    .instance_size = sizeof(VfuObject),
+    .instance_init = vfu_object_init,
+    .instance_finalize = vfu_object_finalize,
+    .class_size = sizeof(VfuObjectClass),
+    .class_init = vfu_object_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_USER_CREATABLE },
+        { }
+    }
+};
+
+static void vfu_register_types(void)
+{
+    type_register_static(&vfu_object_info);
+}
+
+type_init(vfu_register_types);
diff --git a/MAINTAINERS b/MAINTAINERS
index e0daf349ae..b5eb306662 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3465,6 +3465,7 @@ F: include/hw/remote/proxy-memory-listener.h
 F: hw/remote/iohub.c
 F: include/hw/remote/iohub.h
 F: subprojects/libvfio-user
+F: hw/remote/vfio-user-obj.c
 
 EBPF:
 M: Jason Wang <jasowang@redhat.com>
diff --git a/hw/remote/meson.build b/hw/remote/meson.build
index dfea6b533b..534ac5df79 100644
--- a/hw/remote/meson.build
+++ b/hw/remote/meson.build
@@ -6,6 +6,7 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('message.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('remote-obj.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iohub.c'))
+remote_ss.add(when: 'CONFIG_VFIO_USER_SERVER', if_true: files('vfio-user-obj.c'))
 
 remote_ss.add(when: 'CONFIG_VFIO_USER_SERVER', if_true: vfiouser)
 
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0b23974f90..7da12f0d96 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -2,3 +2,6 @@
 
 mpqemu_send_io_error(int cmd, int size, int nfds) "send command %d size %d, %d file descriptors to remote process"
 mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d, %d file descriptors to remote process"
+
+# vfio-user-obj.c
+vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 05/14] vfio-user: instantiate vfio-user context
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
                   ` (3 preceding siblings ...)
  2021-12-15 15:35 ` [PATCH v4 04/14] vfio-user: define vfio-user-server object Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-16  9:55   ` Stefan Hajnoczi
  2021-12-15 15:35 ` [PATCH v4 06/14] vfio-user: find and init PCI device Jagannathan Raman
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

create a context with the vfio-user library to run a PCI device

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 69 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 10296ef33c..f439b81787 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -41,6 +41,9 @@
 #include "hw/remote/machine.h"
 #include "qapi/error.h"
 #include "qapi/qapi-visit-sockets.h"
+#include "qemu/notify.h"
+#include "sysemu/sysemu.h"
+#include "libvfio-user.h"
 
 #define TYPE_VFU_OBJECT "x-vfio-user-server"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -62,8 +65,14 @@ struct VfuObject {
     char *device;
 
     Error *err;
+
+    Notifier machine_done;
+
+    vfu_ctx_t *vfu_ctx;
 };
 
+static void vfu_object_init_ctx(VfuObject *o, Error **errp);
+
 static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
                                   void *opaque, Error **errp)
 {
@@ -84,6 +93,8 @@ static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
     }
 
     trace_vfu_prop("socket", o->socket->u.q_unix.path);
+
+    vfu_object_init_ctx(o, errp);
 }
 
 static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
@@ -95,6 +106,50 @@ static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
     o->device = g_strdup(str);
 
     trace_vfu_prop("device", str);
+
+    vfu_object_init_ctx(o, errp);
+}
+
+/*
+ * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
+ * properties. It also depends on devices instantiated in QEMU. These
+ * dependencies are not available during the instance_init phase of this
+ * object's life-cycle. As such, the server is initialized after the
+ * machine is setup. machine_init_done_notifier notifies TYPE_VFU_OBJECT
+ * when the machine is setup, and the dependencies are available.
+ */
+static void vfu_object_machine_done(Notifier *notifier, void *data)
+{
+    VfuObject *o = container_of(notifier, VfuObject, machine_done);
+    Error *err = NULL;
+
+    vfu_object_init_ctx(o, &err);
+
+    if (err) {
+        error_propagate(&error_abort, err);
+    }
+}
+
+static void vfu_object_init_ctx(VfuObject *o, Error **errp)
+{
+    ERRP_GUARD();
+
+    if (o->vfu_ctx || !o->socket || !o->device ||
+            !phase_check(PHASE_MACHINE_READY)) {
+        return;
+    }
+
+    if (o->err) {
+        error_propagate(errp, o->err);
+        return;
+    }
+
+    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket->u.q_unix.path, 0,
+                                o, VFU_DEV_TYPE_PCI);
+    if (o->vfu_ctx == NULL) {
+        error_setg(errp, "vfu: Failed to create context - %s", strerror(errno));
+        return;
+    }
 }
 
 static void vfu_object_init(Object *obj)
@@ -104,6 +159,11 @@ static void vfu_object_init(Object *obj)
 
     k->nr_devs++;
 
+    if (!phase_check(PHASE_MACHINE_READY)) {
+        o->machine_done.notify = vfu_object_machine_done;
+        qemu_add_machine_init_done_notifier(&o->machine_done);
+    }
+
     if (!object_dynamic_cast(OBJECT(current_machine), TYPE_REMOTE_MACHINE)) {
         error_setg(&o->err, "vfu: %s only compatible with %s machine",
                    TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
@@ -122,6 +182,10 @@ static void vfu_object_finalize(Object *obj)
 
     o->socket = NULL;
 
+    if (o->vfu_ctx) {
+        vfu_destroy_ctx(o->vfu_ctx);
+    }
+
     g_free(o->device);
 
     o->device = NULL;
@@ -129,6 +193,11 @@ static void vfu_object_finalize(Object *obj)
     if (!k->nr_devs && !k->daemon) {
         qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
     }
+
+    if (o->machine_done.notify) {
+        qemu_remove_machine_init_done_notifier(&o->machine_done);
+        o->machine_done.notify = NULL;
+    }
 }
 
 static void vfu_object_class_init(ObjectClass *klass, void *data)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 06/14] vfio-user: find and init PCI device
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
                   ` (4 preceding siblings ...)
  2021-12-15 15:35 ` [PATCH v4 05/14] vfio-user: instantiate vfio-user context Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-16 10:39   ` Stefan Hajnoczi
  2021-12-15 15:35 ` [PATCH v4 07/14] vfio-user: run vfio-user context Jagannathan Raman
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

Find the PCI device with specified id. Initialize the device context
with the QEMU PCI device

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 41 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index f439b81787..bcbea59bf1 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -44,6 +44,8 @@
 #include "qemu/notify.h"
 #include "sysemu/sysemu.h"
 #include "libvfio-user.h"
+#include "hw/qdev-core.h"
+#include "hw/pci/pci.h"
 
 #define TYPE_VFU_OBJECT "x-vfio-user-server"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -69,6 +71,8 @@ struct VfuObject {
     Notifier machine_done;
 
     vfu_ctx_t *vfu_ctx;
+
+    PCIDevice *pci_dev;
 };
 
 static void vfu_object_init_ctx(VfuObject *o, Error **errp);
@@ -133,6 +137,9 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
 static void vfu_object_init_ctx(VfuObject *o, Error **errp)
 {
     ERRP_GUARD();
+    DeviceState *dev = NULL;
+    vfu_pci_type_t pci_type = VFU_PCI_TYPE_CONVENTIONAL;
+    int ret;
 
     if (o->vfu_ctx || !o->socket || !o->device ||
             !phase_check(PHASE_MACHINE_READY)) {
@@ -150,6 +157,38 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         error_setg(errp, "vfu: Failed to create context - %s", strerror(errno));
         return;
     }
+
+    dev = qdev_find_recursive(sysbus_get_default(), o->device);
+    if (dev == NULL) {
+        error_setg(errp, "vfu: Device %s not found", o->device);
+        goto fail;
+    }
+
+    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
+        error_setg(errp, "vfu: %s not a PCI device", o->device);
+        goto fail;
+    }
+
+    o->pci_dev = PCI_DEVICE(dev);
+
+    if (pci_is_express(o->pci_dev)) {
+        pci_type = VFU_PCI_TYPE_EXPRESS;
+    }
+
+    ret = vfu_pci_init(o->vfu_ctx, pci_type, PCI_HEADER_TYPE_NORMAL, 0);
+    if (ret < 0) {
+        error_setg(errp,
+                   "vfu: Failed to attach PCI device %s to context - %s",
+                   o->device, strerror(errno));
+        goto fail;
+    }
+
+    return;
+
+fail:
+    vfu_destroy_ctx(o->vfu_ctx);
+    o->vfu_ctx = NULL;
+    o->pci_dev = NULL;
 }
 
 static void vfu_object_init(Object *obj)
@@ -190,6 +229,8 @@ static void vfu_object_finalize(Object *obj)
 
     o->device = NULL;
 
+    o->pci_dev = NULL;
+
     if (!k->nr_devs && !k->daemon) {
         qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
     }
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 07/14] vfio-user: run vfio-user context
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
                   ` (5 preceding siblings ...)
  2021-12-15 15:35 ` [PATCH v4 06/14] vfio-user: find and init PCI device Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-16 11:17   ` Stefan Hajnoczi
  2021-12-15 15:35 ` [PATCH v4 08/14] vfio-user: handle PCI config space accesses Jagannathan Raman
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

Setup a handler to run vfio-user context. The context is driven by
messages to the file descriptor associated with it - get the fd for
the context and hook up the handler with it

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 80 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 1 deletion(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index bcbea59bf1..a01a0ad185 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -42,10 +42,12 @@
 #include "qapi/error.h"
 #include "qapi/qapi-visit-sockets.h"
 #include "qemu/notify.h"
+#include "qemu/thread.h"
 #include "sysemu/sysemu.h"
 #include "libvfio-user.h"
 #include "hw/qdev-core.h"
 #include "hw/pci/pci.h"
+#include "qemu/timer.h"
 
 #define TYPE_VFU_OBJECT "x-vfio-user-server"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -73,6 +75,8 @@ struct VfuObject {
     vfu_ctx_t *vfu_ctx;
 
     PCIDevice *pci_dev;
+
+    int vfu_poll_fd;
 };
 
 static void vfu_object_init_ctx(VfuObject *o, Error **errp);
@@ -114,6 +118,62 @@ static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
     vfu_object_init_ctx(o, errp);
 }
 
+static void vfu_object_ctx_run(void *opaque)
+{
+    VfuObject *o = opaque;
+    int ret = -1;
+
+    while (ret != 0) {
+        ret = vfu_run_ctx(o->vfu_ctx);
+        if (ret < 0) {
+            if (errno == EINTR) {
+                continue;
+            } else if (errno == ENOTCONN) {
+                qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
+                o->vfu_poll_fd = -1;
+                object_unparent(OBJECT(o));
+                break;
+            } else {
+                error_setg(&error_abort, "vfu: Failed to run device %s - %s",
+                           o->device, strerror(errno));
+                 break;
+            }
+        }
+    }
+}
+
+static void vfu_object_attach_ctx(void *opaque)
+{
+    VfuObject *o = opaque;
+    GPollFD pfds[1];
+    int ret;
+
+    qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
+
+    pfds[0].fd = o->vfu_poll_fd;
+    pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
+
+retry_attach:
+    ret = vfu_attach_ctx(o->vfu_ctx);
+    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
+        qemu_poll_ns(pfds, 1, 500 * (int64_t)SCALE_MS);
+        goto retry_attach;
+    } else if (ret < 0) {
+        error_setg(&error_abort,
+                   "vfu: Failed to attach device %s to context - %s",
+                   o->device, strerror(errno));
+        return;
+    }
+
+    o->vfu_poll_fd = vfu_get_poll_fd(o->vfu_ctx);
+    if (o->vfu_poll_fd < 0) {
+        error_setg(&error_abort, "vfu: Failed to get poll fd %s", o->device);
+        return;
+    }
+
+    qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_ctx_run, NULL, o);
+}
+
 /*
  * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
  * properties. It also depends on devices instantiated in QEMU. These
@@ -151,7 +211,8 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         return;
     }
 
-    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket->u.q_unix.path, 0,
+    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket->u.q_unix.path,
+                                LIBVFIO_USER_FLAG_ATTACH_NB,
                                 o, VFU_DEV_TYPE_PCI);
     if (o->vfu_ctx == NULL) {
         error_setg(errp, "vfu: Failed to create context - %s", strerror(errno));
@@ -183,6 +244,21 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         goto fail;
     }
 
+    ret = vfu_realize_ctx(o->vfu_ctx);
+    if (ret < 0) {
+        error_setg(errp, "vfu: Failed to realize device %s- %s",
+                   o->device, strerror(errno));
+        goto fail;
+    }
+
+    o->vfu_poll_fd = vfu_get_poll_fd(o->vfu_ctx);
+    if (o->vfu_poll_fd < 0) {
+        error_setg(errp, "vfu: Failed to get poll fd %s", o->device);
+        goto fail;
+    }
+
+    qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_attach_ctx, NULL, o);
+
     return;
 
 fail:
@@ -208,6 +284,8 @@ static void vfu_object_init(Object *obj)
                    TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
         return;
     }
+
+    o->vfu_poll_fd = -1;
 }
 
 static void vfu_object_finalize(Object *obj)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 08/14] vfio-user: handle PCI config space accesses
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
                   ` (6 preceding siblings ...)
  2021-12-15 15:35 ` [PATCH v4 07/14] vfio-user: run vfio-user context Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-16 11:30   ` Stefan Hajnoczi
  2021-12-15 15:35 ` [PATCH v4 09/14] vfio-user: handle DMA mappings Jagannathan Raman
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

Define and register handlers for PCI config space accesses

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 45 +++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  2 ++
 2 files changed, 47 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index a01a0ad185..c6d0c675b7 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -43,6 +43,7 @@
 #include "qapi/qapi-visit-sockets.h"
 #include "qemu/notify.h"
 #include "qemu/thread.h"
+#include "qemu/main-loop.h"
 #include "sysemu/sysemu.h"
 #include "libvfio-user.h"
 #include "hw/qdev-core.h"
@@ -174,6 +175,39 @@ retry_attach:
     qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_ctx_run, NULL, o);
 }
 
+static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
+                                     size_t count, loff_t offset,
+                                     const bool is_write)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    uint32_t pci_access_width = sizeof(uint32_t);
+    size_t bytes = count;
+    uint32_t val = 0;
+    char *ptr = buf;
+    int len;
+
+    while (bytes > 0) {
+        len = (bytes > pci_access_width) ? pci_access_width : bytes;
+        if (is_write) {
+            memcpy(&val, ptr, len);
+            pci_host_config_write_common(o->pci_dev, offset,
+                                         pci_config_size(o->pci_dev),
+                                         val, len);
+            trace_vfu_cfg_write(offset, val);
+        } else {
+            val = pci_host_config_read_common(o->pci_dev, offset,
+                                              pci_config_size(o->pci_dev), len);
+            memcpy(ptr, &val, len);
+            trace_vfu_cfg_read(offset, val);
+        }
+        offset += len;
+        ptr += len;
+        bytes -= len;
+    }
+
+    return count;
+}
+
 /*
  * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
  * properties. It also depends on devices instantiated in QEMU. These
@@ -244,6 +278,17 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         goto fail;
     }
 
+    ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_CFG_REGION_IDX,
+                           pci_config_size(o->pci_dev), &vfu_object_cfg_access,
+                           VFU_REGION_FLAG_RW | VFU_REGION_FLAG_ALWAYS_CB,
+                           NULL, 0, -1, 0);
+    if (ret < 0) {
+        error_setg(errp,
+                   "vfu: Failed to setup config space handlers for %s- %s",
+                   o->device, strerror(errno));
+        goto fail;
+    }
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(errp, "vfu: Failed to realize device %s- %s",
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 7da12f0d96..2ef7884346 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -5,3 +5,5 @@ mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d,
 
 # vfio-user-obj.c
 vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
+vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
+vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 09/14] vfio-user: handle DMA mappings
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
                   ` (7 preceding siblings ...)
  2021-12-15 15:35 ` [PATCH v4 08/14] vfio-user: handle PCI config space accesses Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-16 13:24   ` Stefan Hajnoczi
  2021-12-15 15:35 ` [PATCH v4 10/14] vfio-user: handle PCI BAR accesses Jagannathan Raman
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

Define and register callbacks to manage the RAM regions used for
device DMA

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/remote/vfio-user-obj.c | 48 +++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  2 ++
 2 files changed, 50 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index c6d0c675b7..46f2251a68 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -208,6 +208,47 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
     return count;
 }
 
+static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
+{
+    MemoryRegion *subregion = NULL;
+    g_autofree char *name = NULL;
+    static unsigned int suffix;
+    struct iovec *iov = &info->iova;
+
+    if (!info->vaddr) {
+        return;
+    }
+
+    name = g_strdup_printf("remote-mem-%u", suffix++);
+
+    subregion = g_new0(MemoryRegion, 1);
+
+    memory_region_init_ram_ptr(subregion, NULL, name,
+                               iov->iov_len, info->vaddr);
+
+    memory_region_add_subregion(get_system_memory(), (hwaddr)iov->iov_base,
+                                subregion);
+
+    trace_vfu_dma_register((uint64_t)iov->iov_base, iov->iov_len);
+}
+
+static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
+{
+    MemoryRegion *mr = NULL;
+    ram_addr_t offset;
+
+    mr = memory_region_from_host(info->vaddr, &offset);
+    if (!mr) {
+        return;
+    }
+
+    memory_region_del_subregion(get_system_memory(), mr);
+
+    object_unparent((OBJECT(mr)));
+
+    trace_vfu_dma_unregister((uint64_t)info->iova.iov_base);
+}
+
 /*
  * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
  * properties. It also depends on devices instantiated in QEMU. These
@@ -289,6 +330,13 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         goto fail;
     }
 
+    ret = vfu_setup_device_dma(o->vfu_ctx, &dma_register, &dma_unregister);
+    if (ret < 0) {
+        error_setg(errp, "vfu: Failed to setup DMA handlers for %s",
+                   o->device);
+        goto fail;
+    }
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(errp, "vfu: Failed to realize device %s- %s",
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 2ef7884346..f945c7e33b 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -7,3 +7,5 @@ mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d,
 vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
 vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
+vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
+vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 10/14] vfio-user: handle PCI BAR accesses
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
                   ` (8 preceding siblings ...)
  2021-12-15 15:35 ` [PATCH v4 09/14] vfio-user: handle DMA mappings Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-16 14:10   ` Stefan Hajnoczi
  2021-12-15 15:35 ` [PATCH v4 11/14] vfio-user: IOMMU support for remote device Jagannathan Raman
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

Determine the BARs used by the PCI device and register handlers to
manage the access to the same.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/remote/vfio-user-obj.c | 90 +++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  3 ++
 2 files changed, 93 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 46f2251a68..9399e87cbe 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -249,6 +249,94 @@ static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
     trace_vfu_dma_unregister((uint64_t)info->iova.iov_base);
 }
 
+static ssize_t vfu_object_bar_rw(PCIDevice *pci_dev, hwaddr addr, size_t count,
+                                 char * const buf, const bool is_write,
+                                 bool is_io)
+{
+    AddressSpace *as = NULL;
+    MemTxResult res;
+
+    if (is_io) {
+        as = &address_space_io;
+    } else {
+        as = pci_device_iommu_address_space(pci_dev);
+    }
+
+    trace_vfu_bar_rw_enter(is_write ? "Write" : "Read", (uint64_t)addr);
+
+    res = address_space_rw(as, addr, MEMTXATTRS_UNSPECIFIED, (void *)buf,
+                           (hwaddr)count, is_write);
+    if (res != MEMTX_OK) {
+        warn_report("vfu: failed to %s 0x%"PRIx64"",
+                    is_write ? "write to" : "read from",
+                    addr);
+        return -1;
+    }
+
+    trace_vfu_bar_rw_exit(is_write ? "Write" : "Read", (uint64_t)addr);
+
+    return count;
+}
+
+/**
+ * VFU_OBJECT_BAR_HANDLER - macro for defining handlers for PCI BARs.
+ *
+ * To create handler for BAR number 2, VFU_OBJECT_BAR_HANDLER(2) would
+ * define vfu_object_bar2_handler
+ */
+#define VFU_OBJECT_BAR_HANDLER(BAR_NO)                                         \
+    static ssize_t vfu_object_bar##BAR_NO##_handler(vfu_ctx_t *vfu_ctx,        \
+                                        char * const buf, size_t count,        \
+                                        loff_t offset, const bool is_write)    \
+    {                                                                          \
+        VfuObject *o = vfu_get_private(vfu_ctx);                               \
+        PCIDevice *pci_dev = o->pci_dev;                                       \
+        hwaddr addr = (hwaddr)(pci_get_bar_addr(pci_dev, BAR_NO) + offset);    \
+        bool is_io = !!(pci_dev->io_regions[BAR_NO].type &                     \
+                        PCI_BASE_ADDRESS_SPACE);                               \
+                                                                               \
+        return vfu_object_bar_rw(pci_dev, addr, count, buf, is_write, is_io);  \
+    }                                                                          \
+
+VFU_OBJECT_BAR_HANDLER(0)
+VFU_OBJECT_BAR_HANDLER(1)
+VFU_OBJECT_BAR_HANDLER(2)
+VFU_OBJECT_BAR_HANDLER(3)
+VFU_OBJECT_BAR_HANDLER(4)
+VFU_OBJECT_BAR_HANDLER(5)
+
+static vfu_region_access_cb_t *vfu_object_bar_handlers[PCI_NUM_REGIONS] = {
+    &vfu_object_bar0_handler,
+    &vfu_object_bar1_handler,
+    &vfu_object_bar2_handler,
+    &vfu_object_bar3_handler,
+    &vfu_object_bar4_handler,
+    &vfu_object_bar5_handler,
+};
+
+/**
+ * vfu_object_register_bars - Identify active BAR regions of pdev and setup
+ *                            callbacks to handle read/write accesses
+ */
+static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
+{
+    int i;
+
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        if (!pdev->io_regions[i].size) {
+            continue;
+        }
+
+        vfu_setup_region(vfu_ctx, VFU_PCI_DEV_BAR0_REGION_IDX + i,
+                         (size_t)pdev->io_regions[i].size,
+                         vfu_object_bar_handlers[i],
+                         VFU_REGION_FLAG_RW, NULL, 0, -1, 0);
+
+        trace_vfu_bar_register(i, pdev->io_regions[i].addr,
+                               pdev->io_regions[i].size);
+    }
+}
+
 /*
  * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
  * properties. It also depends on devices instantiated in QEMU. These
@@ -337,6 +425,8 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         goto fail;
     }
 
+    vfu_object_register_bars(o->vfu_ctx, o->pci_dev);
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(errp, "vfu: Failed to realize device %s- %s",
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index f945c7e33b..847d50d88f 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,3 +9,6 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 0x%"PRIx64" size 0x%"PRIx64""
+vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
+vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 11/14] vfio-user: IOMMU support for remote device
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
                   ` (9 preceding siblings ...)
  2021-12-15 15:35 ` [PATCH v4 10/14] vfio-user: handle PCI BAR accesses Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-16 14:40   ` Stefan Hajnoczi
  2021-12-15 15:35 ` [PATCH v4 12/14] vfio-user: handle device interrupts Jagannathan Raman
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

Assign separate address space for each device in the remote processes.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 include/hw/pci/pci.h      |   2 +
 include/hw/remote/iommu.h |  24 ++++++++
 hw/pci/pci.c              |   2 +-
 hw/remote/iommu.c         | 117 ++++++++++++++++++++++++++++++++++++++
 hw/remote/machine.c       |   5 ++
 hw/remote/vfio-user-obj.c |  20 ++++++-
 MAINTAINERS               |   2 +
 hw/remote/meson.build     |   1 +
 8 files changed, 169 insertions(+), 4 deletions(-)
 create mode 100644 include/hw/remote/iommu.h
 create mode 100644 hw/remote/iommu.c

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 5c4016b995..f2fc2d5375 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -734,6 +734,8 @@ void lsi53c8xx_handle_legacy_cmdline(DeviceState *lsi_dev);
 qemu_irq pci_allocate_irq(PCIDevice *pci_dev);
 void pci_set_irq(PCIDevice *pci_dev, int level);
 
+void pci_init_bus_master(PCIDevice *pci_dev);
+
 static inline void pci_irq_assert(PCIDevice *pci_dev)
 {
     pci_set_irq(pci_dev, 1);
diff --git a/include/hw/remote/iommu.h b/include/hw/remote/iommu.h
new file mode 100644
index 0000000000..42ce0ca383
--- /dev/null
+++ b/include/hw/remote/iommu.h
@@ -0,0 +1,24 @@
+/*
+ * IOMMU for remote device
+ *
+ * Copyright © 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef REMOTE_IOMMU_H
+#define REMOTE_IOMMU_H
+
+#include "hw/pci/pci_bus.h"
+
+void remote_iommu_free(PCIDevice *pci_dev);
+
+void remote_iommu_init(void);
+
+void remote_iommu_set(PCIBus *bus);
+
+MemoryRegion *remote_iommu_get_ram(PCIDevice *pci_dev);
+
+#endif
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 4a84e478ce..57d561cc03 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -95,7 +95,7 @@ static const VMStateDescription vmstate_pcibus = {
     }
 };
 
-static void pci_init_bus_master(PCIDevice *pci_dev)
+void pci_init_bus_master(PCIDevice *pci_dev)
 {
     AddressSpace *dma_as = pci_device_iommu_address_space(pci_dev);
 
diff --git a/hw/remote/iommu.c b/hw/remote/iommu.c
new file mode 100644
index 0000000000..30c866badb
--- /dev/null
+++ b/hw/remote/iommu.c
@@ -0,0 +1,117 @@
+/*
+ * Remote IOMMU
+ *
+ * Copyright © 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/remote/iommu.h"
+#include "hw/pci/pci_bus.h"
+#include "exec/memory.h"
+#include "exec/address-spaces.h"
+#include "trace.h"
+
+struct VFUIOMMU {
+    AddressSpace  as;
+    MemoryRegion  mr;
+};
+
+typedef struct VFUPciBus {
+    PCIBus           *bus;
+    struct VFUIOMMU  *iommu[];
+} VFUPciBus;
+
+GHashTable *remote_as_table;
+
+static AddressSpace *remote_iommu_get_as(PCIBus *bus, void *opaque, int devfn)
+{
+    VFUPciBus *vfu_pci_bus = NULL;
+    struct VFUIOMMU *iommu = NULL;
+
+    if (!remote_as_table) {
+        return &address_space_memory;
+    }
+
+    vfu_pci_bus = g_hash_table_lookup(remote_as_table, bus);
+
+    if (!vfu_pci_bus) {
+        vfu_pci_bus = g_malloc0(sizeof(VFUPciBus));
+        vfu_pci_bus->bus = bus;
+        g_hash_table_insert(remote_as_table, bus, vfu_pci_bus);
+    }
+
+    iommu = vfu_pci_bus->iommu[devfn];
+
+    if (!iommu) {
+        g_autofree char *mr_name = g_strdup_printf("vfu-ram-%d", devfn);
+        g_autofree char *as_name = g_strdup_printf("vfu-as-%d", devfn);
+
+        iommu = g_malloc0(sizeof(struct VFUIOMMU));
+
+        memory_region_init(&iommu->mr, NULL, mr_name, UINT64_MAX);
+        address_space_init(&iommu->as, &iommu->mr, as_name);
+
+        vfu_pci_bus->iommu[devfn] = iommu;
+    }
+
+    return &iommu->as;
+}
+
+void remote_iommu_free(PCIDevice *pci_dev)
+{
+    VFUPciBus *vfu_pci_bus = NULL;
+    struct VFUIOMMU *iommu = NULL;
+
+    if (!remote_as_table) {
+        return;
+    }
+
+    vfu_pci_bus = g_hash_table_lookup(remote_as_table, pci_get_bus(pci_dev));
+
+    if (!vfu_pci_bus) {
+        return;
+    }
+
+    iommu = vfu_pci_bus->iommu[pci_dev->devfn];
+
+    vfu_pci_bus->iommu[pci_dev->devfn] = NULL;
+
+    if (iommu) {
+        memory_region_unref(&iommu->mr);
+        address_space_destroy(&iommu->as);
+        g_free(iommu);
+    }
+}
+
+void remote_iommu_init(void)
+{
+    remote_as_table = g_hash_table_new_full(NULL, NULL, NULL, NULL);
+}
+
+void remote_iommu_set(PCIBus *bus)
+{
+    pci_setup_iommu(bus, remote_iommu_get_as, NULL);
+}
+
+MemoryRegion *remote_iommu_get_ram(PCIDevice *pci_dev)
+{
+    PCIBus *bus = pci_get_bus(pci_dev);
+    VFUPciBus *vfu_pci_bus;
+
+    if (!remote_as_table) {
+        return get_system_memory();
+    }
+
+    vfu_pci_bus = g_hash_table_lookup(remote_as_table, bus);
+    if (!vfu_pci_bus) {
+        return get_system_memory();
+    }
+
+    return &vfu_pci_bus->iommu[pci_dev->devfn]->mr;
+}
diff --git a/hw/remote/machine.c b/hw/remote/machine.c
index 952105eab5..023be0491e 100644
--- a/hw/remote/machine.c
+++ b/hw/remote/machine.c
@@ -21,6 +21,7 @@
 #include "qapi/error.h"
 #include "hw/pci/pci_host.h"
 #include "hw/remote/iohub.h"
+#include "hw/remote/iommu.h"
 
 static void remote_machine_init(MachineState *machine)
 {
@@ -52,6 +53,10 @@ static void remote_machine_init(MachineState *machine)
 
     remote_iohub_init(&s->iohub);
 
+    remote_iommu_init();
+
+    remote_iommu_set(pci_host->bus);
+
     pci_bus_irqs(pci_host->bus, remote_iohub_set_irq, remote_iohub_map_irq,
                  &s->iohub, REMOTE_IOHUB_NB_PIRQS);
 }
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 9399e87cbe..ae375e69b9 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -49,6 +49,7 @@
 #include "hw/qdev-core.h"
 #include "hw/pci/pci.h"
 #include "qemu/timer.h"
+#include "hw/remote/iommu.h"
 
 #define TYPE_VFU_OBJECT "x-vfio-user-server"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -210,6 +211,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
 
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
+    VfuObject *o = vfu_get_private(vfu_ctx);
     MemoryRegion *subregion = NULL;
     g_autofree char *name = NULL;
     static unsigned int suffix;
@@ -226,14 +228,15 @@ static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
     memory_region_init_ram_ptr(subregion, NULL, name,
                                iov->iov_len, info->vaddr);
 
-    memory_region_add_subregion(get_system_memory(), (hwaddr)iov->iov_base,
-                                subregion);
+    memory_region_add_subregion(remote_iommu_get_ram(o->pci_dev),
+                                (hwaddr)iov->iov_base, subregion);
 
     trace_vfu_dma_register((uint64_t)iov->iov_base, iov->iov_len);
 }
 
 static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
+    VfuObject *o = vfu_get_private(vfu_ctx);
     MemoryRegion *mr = NULL;
     ram_addr_t offset;
 
@@ -242,7 +245,7 @@ static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
         return;
     }
 
-    memory_region_del_subregion(get_system_memory(), mr);
+    memory_region_del_subregion(remote_iommu_get_ram(o->pci_dev), mr);
 
     object_unparent((OBJECT(mr)));
 
@@ -320,6 +323,7 @@ static vfu_region_access_cb_t *vfu_object_bar_handlers[PCI_NUM_REGIONS] = {
  */
 static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
 {
+    VfuObject *o = vfu_get_private(vfu_ctx);
     int i;
 
     for (i = 0; i < PCI_NUM_REGIONS; i++) {
@@ -332,6 +336,12 @@ static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
                          vfu_object_bar_handlers[i],
                          VFU_REGION_FLAG_RW, NULL, 0, -1, 0);
 
+        if ((o->pci_dev->io_regions[i].type & PCI_BASE_ADDRESS_SPACE) == 0) {
+            memory_region_unref(o->pci_dev->io_regions[i].address_space);
+            o->pci_dev->io_regions[i].address_space =
+                remote_iommu_get_ram(o->pci_dev);
+        }
+
         trace_vfu_bar_register(i, pdev->io_regions[i].addr,
                                pdev->io_regions[i].size);
     }
@@ -490,6 +500,10 @@ static void vfu_object_finalize(Object *obj)
 
     o->device = NULL;
 
+    if (o->pci_dev) {
+        remote_iommu_free(o->pci_dev);
+    }
+
     o->pci_dev = NULL;
 
     if (!k->nr_devs && !k->daemon) {
diff --git a/MAINTAINERS b/MAINTAINERS
index b5eb306662..5dc67d79a1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3466,6 +3466,8 @@ F: hw/remote/iohub.c
 F: include/hw/remote/iohub.h
 F: subprojects/libvfio-user
 F: hw/remote/vfio-user-obj.c
+F: include/hw/remote/iommu.h
+F: hw/remote/iommu.c
 
 EBPF:
 M: Jason Wang <jasowang@redhat.com>
diff --git a/hw/remote/meson.build b/hw/remote/meson.build
index 534ac5df79..bcef83c8cc 100644
--- a/hw/remote/meson.build
+++ b/hw/remote/meson.build
@@ -6,6 +6,7 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('message.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('remote-obj.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iohub.c'))
+remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iommu.c'))
 remote_ss.add(when: 'CONFIG_VFIO_USER_SERVER', if_true: files('vfio-user-obj.c'))
 
 remote_ss.add(when: 'CONFIG_VFIO_USER_SERVER', if_true: vfiouser)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 12/14] vfio-user: handle device interrupts
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
                   ` (10 preceding siblings ...)
  2021-12-15 15:35 ` [PATCH v4 11/14] vfio-user: IOMMU support for remote device Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-16 15:56   ` Stefan Hajnoczi
  2021-12-15 15:35 ` [PATCH v4 13/14] vfio-user: register handlers to facilitate migration Jagannathan Raman
  2021-12-15 15:35 ` [PATCH v4 14/14] vfio-user: avocado tests for vfio-user Jagannathan Raman
  13 siblings, 1 reply; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

Forward remote device's interrupts to the guest

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 include/hw/pci/pci.h      |  6 ++++
 include/hw/remote/iohub.h |  1 +
 hw/pci/msi.c              | 13 ++++++-
 hw/pci/msix.c             | 12 ++++++-
 hw/remote/iohub.c         |  7 ++++
 hw/remote/vfio-user-obj.c | 74 +++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  1 +
 7 files changed, 112 insertions(+), 2 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index f2fc2d5375..ffc030d9ca 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -128,6 +128,8 @@ typedef uint32_t PCIConfigReadFunc(PCIDevice *pci_dev,
 typedef void PCIMapIORegionFunc(PCIDevice *pci_dev, int region_num,
                                 pcibus_t addr, pcibus_t size, int type);
 typedef void PCIUnregisterFunc(PCIDevice *pci_dev);
+typedef void PCIMSINotify(PCIDevice *pci_dev, unsigned vector);
+typedef void PCIMSIxNotify(PCIDevice *pci_dev, unsigned vector);
 
 typedef struct PCIIORegion {
     pcibus_t addr; /* current PCI mapping address. -1 means not mapped */
@@ -321,6 +323,10 @@ struct PCIDevice {
     /* Space to store MSIX table & pending bit array */
     uint8_t *msix_table;
     uint8_t *msix_pba;
+
+    PCIMSINotify *msi_notify;
+    PCIMSIxNotify *msix_notify;
+
     /* MemoryRegion container for msix exclusive BAR setup */
     MemoryRegion msix_exclusive_bar;
     /* Memory Regions for MSIX table and pending bit entries. */
diff --git a/include/hw/remote/iohub.h b/include/hw/remote/iohub.h
index 0bf98e0d78..70d98b38d0 100644
--- a/include/hw/remote/iohub.h
+++ b/include/hw/remote/iohub.h
@@ -30,6 +30,7 @@ typedef struct RemoteIOHubState {
     unsigned int irq_level[REMOTE_IOHUB_NB_PIRQS];
     ResampleToken token[REMOTE_IOHUB_NB_PIRQS];
     QemuMutex irq_level_lock[REMOTE_IOHUB_NB_PIRQS];
+    void (*intx_notify)(int pirq, unsigned vector);
 } RemoteIOHubState;
 
 int remote_iohub_map_irq(PCIDevice *pci_dev, int intx);
diff --git a/hw/pci/msi.c b/hw/pci/msi.c
index 47d2b0f33c..93f5e400cc 100644
--- a/hw/pci/msi.c
+++ b/hw/pci/msi.c
@@ -51,6 +51,8 @@
  */
 bool msi_nonbroken;
 
+static void pci_msi_notify(PCIDevice *dev, unsigned int vector);
+
 /* If we get rid of cap allocator, we won't need this. */
 static inline uint8_t msi_cap_sizeof(uint16_t flags)
 {
@@ -225,6 +227,8 @@ int msi_init(struct PCIDevice *dev, uint8_t offset,
     dev->msi_cap = config_offset;
     dev->cap_present |= QEMU_PCI_CAP_MSI;
 
+    dev->msi_notify = pci_msi_notify;
+
     pci_set_word(dev->config + msi_flags_off(dev), flags);
     pci_set_word(dev->wmask + msi_flags_off(dev),
                  PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE);
@@ -307,7 +311,7 @@ bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
     return mask & (1U << vector);
 }
 
-void msi_notify(PCIDevice *dev, unsigned int vector)
+static void pci_msi_notify(PCIDevice *dev, unsigned int vector)
 {
     uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
     bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
@@ -332,6 +336,13 @@ void msi_notify(PCIDevice *dev, unsigned int vector)
     msi_send_message(dev, msg);
 }
 
+void msi_notify(PCIDevice *dev, unsigned int vector)
+{
+    if (dev->msi_notify) {
+        dev->msi_notify(dev, vector);
+    }
+}
+
 void msi_send_message(PCIDevice *dev, MSIMessage msg)
 {
     MemTxAttrs attrs = {};
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index ae9331cd0b..1c71e67f53 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -31,6 +31,8 @@
 #define MSIX_ENABLE_MASK (PCI_MSIX_FLAGS_ENABLE >> 8)
 #define MSIX_MASKALL_MASK (PCI_MSIX_FLAGS_MASKALL >> 8)
 
+static void pci_msix_notify(PCIDevice *dev, unsigned vector);
+
 MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
 {
     uint8_t *table_entry = dev->msix_table + vector * PCI_MSIX_ENTRY_SIZE;
@@ -334,6 +336,7 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries,
     dev->msix_table = g_malloc0(table_size);
     dev->msix_pba = g_malloc0(pba_size);
     dev->msix_entry_used = g_malloc0(nentries * sizeof *dev->msix_entry_used);
+    dev->msix_notify = pci_msix_notify;
 
     msix_mask_all(dev, nentries);
 
@@ -485,7 +488,7 @@ int msix_enabled(PCIDevice *dev)
 }
 
 /* Send an MSI-X message */
-void msix_notify(PCIDevice *dev, unsigned vector)
+static void pci_msix_notify(PCIDevice *dev, unsigned vector)
 {
     MSIMessage msg;
 
@@ -503,6 +506,13 @@ void msix_notify(PCIDevice *dev, unsigned vector)
     msi_send_message(dev, msg);
 }
 
+void msix_notify(PCIDevice *dev, unsigned vector)
+{
+    if (dev->msix_notify) {
+        dev->msix_notify(dev, vector);
+    }
+}
+
 void msix_reset(PCIDevice *dev)
 {
     if (!msix_present(dev)) {
diff --git a/hw/remote/iohub.c b/hw/remote/iohub.c
index 547d597f0f..d28d9f3ce2 100644
--- a/hw/remote/iohub.c
+++ b/hw/remote/iohub.c
@@ -17,7 +17,9 @@
 #include "qemu/thread.h"
 #include "hw/remote/machine.h"
 #include "hw/remote/iohub.h"
+#include "hw/pci/msi.h"
 #include "qemu/main-loop.h"
+#include "trace.h"
 
 void remote_iohub_init(RemoteIOHubState *iohub)
 {
@@ -32,6 +34,8 @@ void remote_iohub_init(RemoteIOHubState *iohub)
         event_notifier_init_fd(&iohub->irqfds[pirq], -1);
         event_notifier_init_fd(&iohub->resamplefds[pirq], -1);
     }
+
+    msi_nonbroken = true;
 }
 
 void remote_iohub_finalize(RemoteIOHubState *iohub)
@@ -62,6 +66,9 @@ void remote_iohub_set_irq(void *opaque, int pirq, int level)
     QEMU_LOCK_GUARD(&iohub->irq_level_lock[pirq]);
 
     if (level) {
+        if (iohub->intx_notify) {
+            iohub->intx_notify(pirq, 0);
+        }
         if (++iohub->irq_level[pirq] == 1) {
             event_notifier_set(&iohub->irqfds[pirq]);
         }
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index ae375e69b9..2b28d465d5 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -50,6 +50,9 @@
 #include "hw/pci/pci.h"
 #include "qemu/timer.h"
 #include "hw/remote/iommu.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "hw/remote/iohub.h"
 
 #define TYPE_VFU_OBJECT "x-vfio-user-server"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -81,6 +84,8 @@ struct VfuObject {
     int vfu_poll_fd;
 };
 
+static GHashTable *vfu_object_dev_table;
+
 static void vfu_object_init_ctx(VfuObject *o, Error **errp);
 
 static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
@@ -347,6 +352,54 @@ static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
     }
 }
 
+static void vfu_object_intx_notify(int pci_devfn, unsigned vector)
+{
+    vfu_ctx_t *vfu_ctx = g_hash_table_lookup(vfu_object_dev_table,
+                                             (void *)(uint64_t)pci_devfn);
+
+    if (vfu_ctx) {
+        vfu_irq_trigger(vfu_ctx, vector);
+    }
+}
+
+static void vfu_object_msi_notify(PCIDevice *pci_dev, unsigned vector)
+{
+    vfu_object_intx_notify(pci_dev->devfn, vector);
+}
+
+static int vfu_object_setup_irqs(vfu_ctx_t *vfu_ctx, PCIDevice *pci_dev)
+{
+    RemoteMachineState *machine = REMOTE_MACHINE(current_machine);
+    RemoteIOHubState *iohub = &machine->iohub;
+    int ret;
+
+    ret = vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_INTX_IRQ, 1);
+    if (ret < 0) {
+        return ret;
+    }
+
+    iohub->intx_notify = vfu_object_intx_notify;
+
+    ret = 0;
+    if (msix_nr_vectors_allocated(pci_dev)) {
+        ret = vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_MSIX_IRQ,
+                                       msix_nr_vectors_allocated(pci_dev));
+
+        pci_dev->msix_notify = vfu_object_msi_notify;
+    } else if (msi_nr_vectors_allocated(pci_dev)) {
+        ret = vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_MSI_IRQ,
+                                       msi_nr_vectors_allocated(pci_dev));
+
+        pci_dev->msi_notify = vfu_object_msi_notify;
+    }
+
+    if (ret < 0) {
+        return ret;
+    }
+
+    return 0;
+}
+
 /*
  * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
  * properties. It also depends on devices instantiated in QEMU. These
@@ -437,6 +490,13 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
 
     vfu_object_register_bars(o->vfu_ctx, o->pci_dev);
 
+    ret = vfu_object_setup_irqs(o->vfu_ctx, o->pci_dev);
+    if (ret < 0) {
+        error_setg(errp, "vfu: Failed to setup interrupts for %s",
+                   o->device);
+        goto fail;
+    }
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(errp, "vfu: Failed to realize device %s- %s",
@@ -450,6 +510,9 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         goto fail;
     }
 
+    g_hash_table_insert(vfu_object_dev_table,
+                        (void *)(uint64_t)o->pci_dev->devfn, o->vfu_ctx);
+
     qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_attach_ctx, NULL, o);
 
     return;
@@ -504,9 +567,18 @@ static void vfu_object_finalize(Object *obj)
         remote_iommu_free(o->pci_dev);
     }
 
+    if (o->pci_dev &&
+            g_hash_table_lookup(vfu_object_dev_table,
+                                (void *)(uint64_t)o->pci_dev->devfn)) {
+        g_hash_table_remove(vfu_object_dev_table,
+                            (void *)(uint64_t)o->pci_dev->devfn);
+    }
+
     o->pci_dev = NULL;
 
     if (!k->nr_devs && !k->daemon) {
+        g_hash_table_destroy(vfu_object_dev_table);
+        vfu_object_dev_table = NULL;
         qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
     }
 
@@ -525,6 +597,8 @@ static void vfu_object_class_init(ObjectClass *klass, void *data)
     /* Later determine how to detect a daemon */
     k->daemon = false;
 
+    vfu_object_dev_table = g_hash_table_new_full(NULL, NULL, NULL, NULL);
+
     object_class_property_add(klass, "socket", "SocketAddress", NULL,
                               vfu_object_set_socket, NULL, NULL);
     object_class_property_set_description(klass, "socket",
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 847d50d88f..c167b3c7a5 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -12,3 +12,4 @@ vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
+vfu_interrupt(int pirq) "vfu: sending interrupt to device - PIRQ %d"
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 13/14] vfio-user: register handlers to facilitate migration
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
                   ` (11 preceding siblings ...)
  2021-12-15 15:35 ` [PATCH v4 12/14] vfio-user: handle device interrupts Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  2021-12-15 15:35 ` [PATCH v4 14/14] vfio-user: avocado tests for vfio-user Jagannathan Raman
  13 siblings, 0 replies; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

Store and load the device's state during migration. use libvfio-user's
handlers for this purpose

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 include/migration/vmstate.h |   2 +
 migration/savevm.h          |   2 +
 hw/remote/vfio-user-obj.c   | 323 ++++++++++++++++++++++++++++++++++++
 migration/savevm.c          |  73 ++++++++
 migration/vmstate.c         |  19 +++
 5 files changed, 419 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 017c03675c..68bea576ea 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -1165,6 +1165,8 @@ extern const VMStateInfo vmstate_info_qlist;
 #define VMSTATE_END_OF_LIST()                                         \
     {}
 
+uint64_t vmstate_vmsd_size(PCIDevice *pci_dev);
+
 int vmstate_load_state(QEMUFile *f, const VMStateDescription *vmsd,
                        void *opaque, int version_id);
 int vmstate_save_state(QEMUFile *f, const VMStateDescription *vmsd,
diff --git a/migration/savevm.h b/migration/savevm.h
index 6461342cb4..8007064ff2 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -67,5 +67,7 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
         bool in_postcopy, bool inactivate_disks);
+int qemu_remote_savevm(QEMUFile *f, DeviceState *dev);
+int qemu_remote_loadvm(QEMUFile *f);
 
 #endif
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 2b28d465d5..cc0b82445b 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -53,6 +53,11 @@
 #include "hw/pci/msi.h"
 #include "hw/pci/msix.h"
 #include "hw/remote/iohub.h"
+#include "migration/qemu-file.h"
+#include "migration/savevm.h"
+#include "migration/vmstate.h"
+#include "migration/global_state.h"
+#include "block/block.h"
 
 #define TYPE_VFU_OBJECT "x-vfio-user-server"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -82,6 +87,35 @@ struct VfuObject {
     PCIDevice *pci_dev;
 
     int vfu_poll_fd;
+
+    /*
+     * vfu_mig_buf holds the migration data. In the remote server, this
+     * buffer replaces the role of an IO channel which links the source
+     * and the destination.
+     *
+     * Whenever the client QEMU process initiates migration, the remote
+     * server gets notified via libvfio-user callbacks. The remote server
+     * sets up a QEMUFile object using this buffer as backend. The remote
+     * server passes this object to its migration subsystem, which slurps
+     * the VMSD of the device ('devid' above) referenced by this object
+     * and stores the VMSD in this buffer.
+     *
+     * The client subsequetly asks the remote server for any data that
+     * needs to be moved over to the destination via libvfio-user
+     * library's vfu_migration_callbacks_t callbacks. The remote hands
+     * over this buffer as data at this time.
+     *
+     * A reverse of this process happens at the destination.
+     */
+    uint8_t *vfu_mig_buf;
+
+    uint64_t vfu_mig_buf_size;
+
+    uint64_t vfu_mig_buf_pending;
+
+    QEMUFile *vfu_mig_file;
+
+    vfu_migr_state_t vfu_state;
 };
 
 static GHashTable *vfu_object_dev_table;
@@ -125,6 +159,272 @@ static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
     vfu_object_init_ctx(o, errp);
 }
 
+/**
+ * Migration helper functions
+ *
+ * vfu_mig_buf_read & vfu_mig_buf_write are used by QEMU's migration
+ * subsystem - qemu_remote_loadvm & qemu_remote_savevm. loadvm/savevm
+ * call these functions via QEMUFileOps to load/save the VMSD of a
+ * device into vfu_mig_buf
+ *
+ */
+static ssize_t vfu_mig_buf_read(void *opaque, uint8_t *buf, int64_t pos,
+                                size_t size, Error **errp)
+{
+    VfuObject *o = opaque;
+
+    if (pos > o->vfu_mig_buf_size) {
+        size = 0;
+    } else if ((pos + size) > o->vfu_mig_buf_size) {
+        size = o->vfu_mig_buf_size - pos;
+    }
+
+    memcpy(buf, (o->vfu_mig_buf + pos), size);
+
+    return size;
+}
+
+static ssize_t vfu_mig_buf_write(void *opaque, struct iovec *iov, int iovcnt,
+                                 int64_t pos, Error **errp)
+{
+    VfuObject *o = opaque;
+    uint64_t end = pos + iov_size(iov, iovcnt);
+    int i;
+
+    if (end > o->vfu_mig_buf_size) {
+        o->vfu_mig_buf = g_realloc(o->vfu_mig_buf, end);
+    }
+
+    for (i = 0; i < iovcnt; i++) {
+        memcpy((o->vfu_mig_buf + o->vfu_mig_buf_size), iov[i].iov_base,
+               iov[i].iov_len);
+        o->vfu_mig_buf_size += iov[i].iov_len;
+        o->vfu_mig_buf_pending += iov[i].iov_len;
+    }
+
+    return iov_size(iov, iovcnt);
+}
+
+static int vfu_mig_buf_shutdown(void *opaque, bool rd, bool wr, Error **errp)
+{
+    VfuObject *o = opaque;
+
+    o->vfu_mig_buf_size = 0;
+
+    g_free(o->vfu_mig_buf);
+
+    o->vfu_mig_buf = NULL;
+
+    o->vfu_mig_buf_pending = 0;
+
+    return 0;
+}
+
+static const QEMUFileOps vfu_mig_fops_save = {
+    .writev_buffer  = vfu_mig_buf_write,
+    .shut_down      = vfu_mig_buf_shutdown,
+};
+
+static const QEMUFileOps vfu_mig_fops_load = {
+    .get_buffer     = vfu_mig_buf_read,
+    .shut_down      = vfu_mig_buf_shutdown,
+};
+
+/**
+ * handlers for vfu_migration_callbacks_t
+ *
+ * The libvfio-user library accesses these handlers to drive the migration
+ * at the remote end, and also to transport the data stored in vfu_mig_buf
+ *
+ */
+static void vfu_mig_state_stop_and_copy(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    int ret;
+
+    if (!o->vfu_mig_file) {
+        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_save, false);
+    }
+
+    ret = qemu_remote_savevm(o->vfu_mig_file, DEVICE(o->pci_dev));
+    if (ret) {
+        qemu_file_shutdown(o->vfu_mig_file);
+        o->vfu_mig_file = NULL;
+        return;
+    }
+
+    qemu_fflush(o->vfu_mig_file);
+}
+
+static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
+    static int migrated_devs;
+    Error *local_err = NULL;
+    int ret;
+
+    /**
+     * TODO: move to VFU_MIGR_STATE_RESUME handler. Presently, the
+     * VMSD data from source is not available at RESUME state.
+     * Working on a fix for this.
+     */
+    if (!o->vfu_mig_file) {
+        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
+    }
+
+    ret = qemu_remote_loadvm(o->vfu_mig_file);
+    if (ret) {
+        error_setg(&error_abort, "vfu: failed to restore device state");
+        return;
+    }
+
+    qemu_file_shutdown(o->vfu_mig_file);
+    o->vfu_mig_file = NULL;
+
+    /* VFU_MIGR_STATE_RUNNING begins here */
+    if (++migrated_devs == k->nr_devs) {
+        bdrv_invalidate_cache_all(&local_err);
+        if (local_err) {
+            error_report_err(local_err);
+            return;
+        }
+
+        vm_start();
+    }
+}
+
+static void vfu_mig_state_stop(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
+    static int migrated_devs;
+
+    /**
+     * note: calling bdrv_inactivate_all() is not the best approach.
+     *
+     *  Ideally, we would identify the block devices (if any) indirectly
+     *  linked (such as via a scsi-hd device) to each of the migrated devices,
+     *  and inactivate them individually. This is essential while operating
+     *  the server in a storage daemon mode, with devices from different VMs.
+     *
+     *  However, we currently don't have this capability. As such, we need to
+     *  inactivate all devices at the same time when migration is completed.
+     */
+    if (++migrated_devs == k->nr_devs) {
+        vm_stop(RUN_STATE_PAUSED);
+        bdrv_inactivate_all();
+    }
+}
+
+static int vfu_mig_transition(vfu_ctx_t *vfu_ctx, vfu_migr_state_t state)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    if (o->vfu_state == state) {
+        return 0;
+    }
+
+    switch (state) {
+    case VFU_MIGR_STATE_RESUME:
+        break;
+    case VFU_MIGR_STATE_STOP_AND_COPY:
+        vfu_mig_state_stop_and_copy(vfu_ctx);
+        break;
+    case VFU_MIGR_STATE_STOP:
+        vfu_mig_state_stop(vfu_ctx);
+        break;
+    case VFU_MIGR_STATE_PRE_COPY:
+        break;
+    case VFU_MIGR_STATE_RUNNING:
+        if (!runstate_is_running()) {
+            vfu_mig_state_running(vfu_ctx);
+        }
+        break;
+    default:
+        warn_report("vfu: Unknown migration state %d", state);
+    }
+
+    o->vfu_state = state;
+
+    return 0;
+}
+
+static uint64_t vfu_mig_get_pending_bytes(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    return o->vfu_mig_buf_pending;
+}
+
+static int vfu_mig_prepare_data(vfu_ctx_t *vfu_ctx, uint64_t *offset,
+                                uint64_t *size)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    if (offset) {
+        *offset = 0;
+    }
+
+    if (size) {
+        *size = o->vfu_mig_buf_size;
+    }
+
+    return 0;
+}
+
+static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
+                                 uint64_t size, uint64_t offset)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    if (offset > o->vfu_mig_buf_size) {
+        return -1;
+    }
+
+    if ((offset + size) > o->vfu_mig_buf_size) {
+        warn_report("vfu: buffer overflow - check pending_bytes");
+        size = o->vfu_mig_buf_size - offset;
+    }
+
+    memcpy(buf, (o->vfu_mig_buf + offset), size);
+
+    o->vfu_mig_buf_pending -= size;
+
+    return size;
+}
+
+static ssize_t vfu_mig_write_data(vfu_ctx_t *vfu_ctx, void *data,
+                                  uint64_t size, uint64_t offset)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    uint64_t end = offset + size;
+
+    if (end > o->vfu_mig_buf_size) {
+        o->vfu_mig_buf = g_realloc(o->vfu_mig_buf, end);
+        o->vfu_mig_buf_size = end;
+    }
+
+    memcpy((o->vfu_mig_buf + offset), data, size);
+
+    return size;
+}
+
+static int vfu_mig_data_written(vfu_ctx_t *vfu_ctx, uint64_t count)
+{
+    return 0;
+}
+
+static const vfu_migration_callbacks_t vfu_mig_cbs = {
+    .version = VFU_MIGR_CALLBACKS_VERS,
+    .transition = &vfu_mig_transition,
+    .get_pending_bytes = &vfu_mig_get_pending_bytes,
+    .prepare_data = &vfu_mig_prepare_data,
+    .read_data = &vfu_mig_read_data,
+    .data_written = &vfu_mig_data_written,
+    .write_data = &vfu_mig_write_data,
+};
+
 static void vfu_object_ctx_run(void *opaque)
 {
     VfuObject *o = opaque;
@@ -425,6 +725,7 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
     ERRP_GUARD();
     DeviceState *dev = NULL;
     vfu_pci_type_t pci_type = VFU_PCI_TYPE_CONVENTIONAL;
+    uint64_t migr_regs_size, migr_size;
     int ret;
 
     if (o->vfu_ctx || !o->socket || !o->device ||
@@ -497,6 +798,26 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         goto fail;
     }
 
+    migr_regs_size = vfu_get_migr_register_area_size();
+    migr_size = migr_regs_size + vmstate_vmsd_size(o->pci_dev);
+
+    ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_MIGR_REGION_IDX,
+                           migr_size, NULL,
+                           VFU_REGION_FLAG_RW, NULL, 0, -1, 0);
+    if (ret < 0) {
+        error_setg(errp, "vfu: Failed to register migration BAR %s- %s",
+                   o->device, strerror(errno));
+        goto fail;
+    }
+
+    ret = vfu_setup_device_migration_callbacks(o->vfu_ctx, &vfu_mig_cbs,
+                                               migr_regs_size);
+    if (ret < 0) {
+        error_setg(errp, "vfu: Failed to setup migration %s- %s",
+                   o->device, strerror(errno));
+        goto fail;
+    }
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(errp, "vfu: Failed to realize device %s- %s",
@@ -542,6 +863,8 @@ static void vfu_object_init(Object *obj)
     }
 
     o->vfu_poll_fd = -1;
+
+    o->vfu_state = VFU_MIGR_STATE_STOP;
 }
 
 static void vfu_object_finalize(Object *obj)
diff --git a/migration/savevm.c b/migration/savevm.c
index d59e976d50..69b7ea8b09 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1605,6 +1605,49 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
     return ret;
 }
 
+static SaveStateEntry *find_se_from_dev(DeviceState *dev)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (se->opaque == dev) {
+            return se;
+        }
+    }
+
+    return NULL;
+}
+
+int qemu_remote_savevm(QEMUFile *f, DeviceState *dev)
+{
+    SaveStateEntry *se;
+    int ret = 0;
+
+    se = find_se_from_dev(dev);
+    if (!se) {
+        return -ENODEV;
+    }
+
+    if (!se->vmsd || !vmstate_save_needed(se->vmsd, se->opaque)) {
+        return ret;
+    }
+
+    save_section_header(f, se, QEMU_VM_SECTION_FULL);
+
+    ret = vmstate_save(f, se, NULL);
+    if (ret) {
+        qemu_file_set_error(f, ret);
+        return ret;
+    }
+
+    save_section_footer(f, se);
+
+    qemu_put_byte(f, QEMU_VM_EOF);
+    qemu_fflush(f);
+
+    return 0;
+}
+
 void qemu_savevm_live_state(QEMUFile *f)
 {
     /* save QEMU_VM_SECTION_END section */
@@ -2445,6 +2488,36 @@ qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis)
     return 0;
 }
 
+int qemu_remote_loadvm(QEMUFile *f)
+{
+    uint8_t section_type;
+    int ret = 0;
+
+    while (true) {
+        section_type = qemu_get_byte(f);
+
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            break;
+        }
+
+        switch (section_type) {
+        case QEMU_VM_SECTION_FULL:
+            ret = qemu_loadvm_section_start_full(f, NULL);
+            if (ret < 0) {
+                break;
+            }
+            break;
+        case QEMU_VM_EOF:
+            return ret;
+        default:
+            return -EINVAL;
+        }
+    }
+
+    return ret;
+}
+
 static int
 qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis)
 {
diff --git a/migration/vmstate.c b/migration/vmstate.c
index 05f87cdddc..83f8562792 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -63,6 +63,25 @@ static int vmstate_size(void *opaque, const VMStateField *field)
     return size;
 }
 
+uint64_t vmstate_vmsd_size(PCIDevice *pci_dev)
+{
+    DeviceClass *dc = DEVICE_GET_CLASS(DEVICE(pci_dev));
+    const VMStateField *field = NULL;
+    uint64_t size = 0;
+
+    if (!dc->vmsd) {
+        return 0;
+    }
+
+    field = dc->vmsd->fields;
+    while (field && field->name) {
+        size += vmstate_size(pci_dev, field);
+        field++;
+    }
+
+    return size;
+}
+
 static void vmstate_handle_alloc(void *ptr, const VMStateField *field,
                                  void *opaque)
 {
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 14/14] vfio-user: avocado tests for vfio-user
  2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
                   ` (12 preceding siblings ...)
  2021-12-15 15:35 ` [PATCH v4 13/14] vfio-user: register handlers to facilitate migration Jagannathan Raman
@ 2021-12-15 15:35 ` Jagannathan Raman
  13 siblings, 0 replies; 49+ messages in thread
From: Jagannathan Raman @ 2021-12-15 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, jag.raman, bleal,
	swapnil.ingle, john.levon, philmd, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, stefanha, crosa, pbonzini,
	alex.bennee

Avocado tests for libvfio-user in QEMU - tests startup,
hotplug and migration of the server object

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 MAINTAINERS                |   1 +
 tests/avocado/vfio-user.py | 225 +++++++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+)
 create mode 100644 tests/avocado/vfio-user.py

diff --git a/MAINTAINERS b/MAINTAINERS
index 5dc67d79a1..385a4d8869 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3468,6 +3468,7 @@ F: subprojects/libvfio-user
 F: hw/remote/vfio-user-obj.c
 F: include/hw/remote/iommu.h
 F: hw/remote/iommu.c
+F: tests/avocado/vfio-user.py
 
 EBPF:
 M: Jason Wang <jasowang@redhat.com>
diff --git a/tests/avocado/vfio-user.py b/tests/avocado/vfio-user.py
new file mode 100644
index 0000000000..376c02c41f
--- /dev/null
+++ b/tests/avocado/vfio-user.py
@@ -0,0 +1,225 @@
+# vfio-user protocol sanity test
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later.  See the COPYING file in the top-level directory.
+
+
+import os
+import socket
+import uuid
+
+from avocado_qemu import QemuSystemTest
+from avocado_qemu import wait_for_console_pattern
+from avocado_qemu import exec_command
+from avocado_qemu import exec_command_and_wait_for_pattern
+
+from avocado.utils import network
+from avocado.utils import wait
+
+class VfioUser(QemuSystemTest):
+    """
+    :avocado: tags=vfiouser
+    """
+    KERNEL_COMMON_COMMAND_LINE = 'printk.time=0 '
+    timeout = 20
+
+    @staticmethod
+    def migration_finished(vm):
+        res = vm.command('query-migrate')
+        if 'status' in res:
+            return res['status'] in ('completed', 'failed')
+        else:
+            return False
+
+    def _get_free_port(self):
+        port = network.find_free_port()
+        if port is None:
+            self.cancel('Failed to find a free port')
+        return port
+
+    def validate_vm_launch(self, vm):
+        wait_for_console_pattern(self, 'as init process',
+                                 'Kernel panic - not syncing', vm=vm)
+        exec_command(self, 'mount -t sysfs sysfs /sys', vm=vm)
+        exec_command_and_wait_for_pattern(self,
+                                          'cat /sys/bus/pci/devices/*/uevent',
+                                          'PCI_ID=1000:0012', vm=vm)
+
+    def launch_server_startup(self, socket, *opts):
+        server_vm = self.get_vm()
+        server_vm.add_args('-machine', 'x-remote')
+        server_vm.add_args('-nodefaults')
+        server_vm.add_args('-device', 'lsi53c895a,id=lsi1')
+        server_vm.add_args('-object', 'x-vfio-user-server,id=vfioobj1,'
+                           'type=unix,path='+socket+',device=lsi1')
+        for opt in opts:
+            server_vm.add_args(opt)
+        server_vm.launch()
+        return server_vm
+
+    def launch_server_hotplug(self, socket):
+        server_vm = self.get_vm()
+        server_vm.add_args('-machine', 'x-remote')
+        server_vm.add_args('-nodefaults')
+        server_vm.add_args('-device', 'lsi53c895a,id=lsi1')
+        server_vm.launch()
+        server_vm.command('human-monitor-command',
+                          command_line='object_add x-vfio-user-server,'
+                                       'id=vfioobj,socket.type=unix,'
+                                       'socket.path='+socket+',device=lsi1')
+        return server_vm
+
+    def launch_client(self, kernel_path, initrd_path, kernel_command_line,
+                      machine_type, socket, *opts):
+        client_vm = self.get_vm()
+        client_vm.set_console()
+        client_vm.add_args('-machine', machine_type)
+        client_vm.add_args('-accel', 'kvm')
+        client_vm.add_args('-cpu', 'host')
+        client_vm.add_args('-object',
+                           'memory-backend-memfd,id=sysmem-file,size=2G')
+        client_vm.add_args('--numa', 'node,memdev=sysmem-file')
+        client_vm.add_args('-m', '2048')
+        client_vm.add_args('-kernel', kernel_path,
+                           '-initrd', initrd_path,
+                           '-append', kernel_command_line)
+        client_vm.add_args('-device',
+                           'vfio-user-pci,x-enable-migration=true,'
+                           'socket='+socket)
+        for opt in opts:
+            client_vm.add_args(opt)
+        client_vm.launch()
+        return client_vm
+
+    def do_test_startup(self, kernel_url, initrd_url, kernel_command_line,
+                machine_type):
+        self.require_accelerator('kvm')
+
+        kernel_path = self.fetch_asset(kernel_url)
+        initrd_path = self.fetch_asset(initrd_url)
+        socket = os.path.join('/tmp', str(uuid.uuid4()))
+        if os.path.exists(socket):
+            os.remove(socket)
+        self.launch_server_startup(socket)
+        client = self.launch_client(kernel_path, initrd_path,
+                                    kernel_command_line, machine_type, socket)
+        self.validate_vm_launch(client)
+
+    def do_test_hotplug(self, kernel_url, initrd_url, kernel_command_line,
+                machine_type):
+        self.require_accelerator('kvm')
+
+        kernel_path = self.fetch_asset(kernel_url)
+        initrd_path = self.fetch_asset(initrd_url)
+        socket = os.path.join('/tmp', str(uuid.uuid4()))
+        if os.path.exists(socket):
+            os.remove(socket)
+        self.launch_server_hotplug(socket)
+        client = self.launch_client(kernel_path, initrd_path,
+                                    kernel_command_line, machine_type, socket)
+        self.validate_vm_launch(client)
+
+    def do_test_migrate(self, kernel_url, initrd_url, kernel_command_line,
+                machine_type):
+        self.require_accelerator('kvm')
+
+        kernel_path = self.fetch_asset(kernel_url)
+        initrd_path = self.fetch_asset(initrd_url)
+        srv_socket = os.path.join('/tmp', str(uuid.uuid4()))
+        if os.path.exists(srv_socket):
+            os.remove(srv_socket)
+        dst_socket = os.path.join('/tmp', str(uuid.uuid4()))
+        if os.path.exists(dst_socket):
+            os.remove(dst_socket)
+        client_uri = 'tcp:localhost:%u' % self._get_free_port()
+        server_uri = 'tcp:localhost:%u' % self._get_free_port()
+
+        """ Launch destination VM """
+        self.launch_server_startup(dst_socket, '-incoming', server_uri)
+        dst_client = self.launch_client(kernel_path, initrd_path,
+                                        kernel_command_line, machine_type,
+                                        dst_socket, '-incoming', client_uri)
+
+        """ Launch source VM """
+        self.launch_server_startup(srv_socket)
+        src_client = self.launch_client(kernel_path, initrd_path,
+                                        kernel_command_line, machine_type,
+                                        srv_socket)
+        self.validate_vm_launch(src_client)
+
+        """ Kick off migration """
+        src_client.qmp('migrate', uri=client_uri)
+
+        wait.wait_for(self.migration_finished,
+                      timeout=self.timeout,
+                      step=0.1,
+                      args=(dst_client,))
+
+    def test_vfio_user_x86_64(self):
+        """
+        :avocado: tags=arch:x86_64
+        :avocado: tags=distro:centos
+        """
+        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/vmlinuz')
+        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/initrd.img')
+        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
+                               'console=ttyS0 rdinit=/bin/bash')
+        machine_type = 'pc'
+        self.do_test_startup(kernel_url, initrd_url, kernel_command_line,
+                             machine_type)
+
+    def test_vfio_user_aarch64(self):
+        """
+        :avocado: tags=arch:aarch64
+        :avocado: tags=distro:ubuntu
+        """
+        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/aarch64/os/images'
+                      '/pxeboot/vmlinuz')
+        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/aarch64/os/images'
+                      '/pxeboot/initrd.img')
+        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
+                               'rdinit=/bin/bash console=ttyAMA0')
+        machine_type = 'virt,gic-version=3'
+        self.do_test_startup(kernel_url, initrd_url, kernel_command_line,
+                             machine_type)
+
+    def test_vfio_user_hotplug_x86_64(self):
+        """
+        :avocado: tags=arch:x86_64
+        :avocado: tags=distro:centos
+        """
+        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/vmlinuz')
+        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/initrd.img')
+        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
+                               'console=ttyS0 rdinit=/bin/bash')
+        machine_type = 'pc'
+        self.do_test_hotplug(kernel_url, initrd_url, kernel_command_line,
+                             machine_type)
+
+    def test_vfio_user_migrate_x86_64(self):
+        """
+        :avocado: tags=arch:x86_64
+        :avocado: tags=distro:centos
+        """
+        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/vmlinuz')
+        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/initrd.img')
+        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
+                               'console=ttyS0 rdinit=/bin/bash')
+        machine_type = 'pc'
+        self.do_test_migrate(kernel_url, initrd_url, kernel_command_line,
+                             machine_type)
+
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 02/14] tests/avocado: Specify target VM argument to helper routines
  2021-12-15 15:35 ` [PATCH v4 02/14] tests/avocado: Specify target VM argument to helper routines Jagannathan Raman
@ 2021-12-15 15:54   ` Philippe Mathieu-Daudé
  2021-12-15 22:04   ` Beraldo Leal
  1 sibling, 0 replies; 49+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-12-15 15:54 UTC (permalink / raw)
  To: Jagannathan Raman, qemu-devel
  Cc: elena.ufimtseva, john.g.johnson, thuth, bleal, swapnil.ingle,
	john.levon, wainersm, alex.williamson, thanos.makatos,
	marcandre.lureau, stefanha, crosa, pbonzini, alex.bennee

On 12/15/21 16:35, Jagannathan Raman wrote:
> Specify target VM for exec_command and
> exec_command_and_wait_for_pattern routines
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  tests/avocado/avocado_qemu/__init__.py | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 02/14] tests/avocado: Specify target VM argument to helper routines
  2021-12-15 15:35 ` [PATCH v4 02/14] tests/avocado: Specify target VM argument to helper routines Jagannathan Raman
  2021-12-15 15:54   ` Philippe Mathieu-Daudé
@ 2021-12-15 22:04   ` Beraldo Leal
  2021-12-16 21:28     ` Jag Raman
  1 sibling, 1 reply; 49+ messages in thread
From: Beraldo Leal @ 2021-12-15 22:04 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, swapnil.ingle,
	john.levon, alex.bennee, qemu-devel, wainersm, alex.williamson,
	pbonzini, marcandre.lureau, stefanha, crosa, thanos.makatos,
	philmd

On Wed, Dec 15, 2021 at 10:35:26AM -0500, Jagannathan Raman wrote:
> Specify target VM for exec_command and
> exec_command_and_wait_for_pattern routines
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  tests/avocado/avocado_qemu/__init__.py | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py
> index 75063c0c30..26ac782f53 100644
> --- a/tests/avocado/avocado_qemu/__init__.py
> +++ b/tests/avocado/avocado_qemu/__init__.py
> @@ -198,7 +198,7 @@ def wait_for_console_pattern(test, success_message, failure_message=None,
>      """
>      _console_interaction(test, success_message, failure_message, None, vm=vm)
>  
> -def exec_command(test, command):
> +def exec_command(test, command, vm=None):

nitpick: if possible, it would be nice to update the docstring, by
adding this new argument.

>      """
>      Send a command to a console (appending CRLF characters), while logging
>      the content.
> @@ -208,10 +208,11 @@ def exec_command(test, command):
>      :param command: the command to send
>      :type command: str
>      """
> -    _console_interaction(test, None, None, command + '\r')
> +    _console_interaction(test, None, None, command + '\r', vm=vm)
>  
>  def exec_command_and_wait_for_pattern(test, command,
> -                                      success_message, failure_message=None):
> +                                      success_message, failure_message=None,
> +                                      vm=None):

Same here.

Other than that, lgtm.

Reviewed-by: Beraldo Leal <bleal@redhat.com>

--
Beraldo



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 04/14] vfio-user: define vfio-user-server object
  2021-12-15 15:35 ` [PATCH v4 04/14] vfio-user: define vfio-user-server object Jagannathan Raman
@ 2021-12-16  9:33   ` Stefan Hajnoczi
  2021-12-17  2:17     ` Jag Raman
  2021-12-16  9:58   ` Stefan Hajnoczi
  1 sibling, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-16  9:33 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, bleal, swapnil.ingle,
	john.levon, philmd, qemu-devel, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 1514 bytes --]

On Wed, Dec 15, 2021 at 10:35:28AM -0500, Jagannathan Raman wrote:
> diff --git a/qapi/qom.json b/qapi/qom.json
> index ccd1167808..6001a9b8f0 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -703,6 +703,20 @@
>  { 'struct': 'RemoteObjectProperties',
>    'data': { 'fd': 'str', 'devid': 'str' } }
>  
> +##
> +# @VfioUserServerProperties:
> +#
> +# Properties for x-vfio-user-server objects.
> +#
> +# @socket: socket to be used by the libvfiouser library
> +#
> +# @device: the id of the device to be emulated at the server
> +#
> +# Since: 6.2

6.2 has been released so the version number needs to be incremented.

> +struct VfuObjectClass {
> +    ObjectClass parent_class;
> +
> +    unsigned int nr_devs;
> +
> +    bool daemon;

I was wondering what this means. auto_shutdown might be a clearer name.

> +static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
> +                                  void *opaque, Error **errp)
> +{
> +    VfuObject *o = VFU_OBJECT(obj);
> +
> +    qapi_free_SocketAddress(o->socket);
> +
> +    o->socket = NULL;
> +
> +    visit_type_SocketAddress(v, name, &o->socket, errp);
> +
> +    if (o->socket->type != SOCKET_ADDRESS_TYPE_UNIX) {
> +        qapi_free_SocketAddress(o->socket);
> +        o->socket = NULL;
> +        error_setg(errp, "vfu: Unsupported socket type - %s",
> +                   o->socket->u.q_unix.path);

s/o->socket->u.q_unix.path/SocketAddressType_str(o->socket->type)/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 05/14] vfio-user: instantiate vfio-user context
  2021-12-15 15:35 ` [PATCH v4 05/14] vfio-user: instantiate vfio-user context Jagannathan Raman
@ 2021-12-16  9:55   ` Stefan Hajnoczi
  2021-12-16 21:32     ` Jag Raman
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-16  9:55 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, bleal, swapnil.ingle,
	john.levon, philmd, qemu-devel, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 426 bytes --]

On Wed, Dec 15, 2021 at 10:35:29AM -0500, Jagannathan Raman wrote:
> +static void vfu_object_init_ctx(VfuObject *o, Error **errp)
> +{
> +    ERRP_GUARD();
> +
> +    if (o->vfu_ctx || !o->socket || !o->device ||
> +            !phase_check(PHASE_MACHINE_READY)) {
> +        return;
> +    }
> +
> +    if (o->err) {
> +        error_propagate(errp, o->err);

Missing o->err = NULL because ownership has been passed to errp.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 04/14] vfio-user: define vfio-user-server object
  2021-12-15 15:35 ` [PATCH v4 04/14] vfio-user: define vfio-user-server object Jagannathan Raman
  2021-12-16  9:33   ` Stefan Hajnoczi
@ 2021-12-16  9:58   ` Stefan Hajnoczi
  2021-12-17  2:31     ` Jag Raman
  1 sibling, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-16  9:58 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, bleal, swapnil.ingle,
	john.levon, philmd, qemu-devel, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 1316 bytes --]

On Wed, Dec 15, 2021 at 10:35:28AM -0500, Jagannathan Raman wrote:
> +static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
> +                                  void *opaque, Error **errp)
> +{
> +    VfuObject *o = VFU_OBJECT(obj);
> +
> +    qapi_free_SocketAddress(o->socket);
> +
> +    o->socket = NULL;
> +
> +    visit_type_SocketAddress(v, name, &o->socket, errp);
> +
> +    if (o->socket->type != SOCKET_ADDRESS_TYPE_UNIX) {
> +        qapi_free_SocketAddress(o->socket);
> +        o->socket = NULL;
> +        error_setg(errp, "vfu: Unsupported socket type - %s",
> +                   o->socket->u.q_unix.path);
> +        return;
> +    }
> +
> +    trace_vfu_prop("socket", o->socket->u.q_unix.path);
> +}
> +
> +static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
> +{
> +    VfuObject *o = VFU_OBJECT(obj);
> +
> +    g_free(o->device);
> +
> +    o->device = g_strdup(str);
> +
> +    trace_vfu_prop("device", str);
> +}

It appears "socket" and "device" can be changed after the
vfio-user server has started. In the best case it just means the
properties contain values that do not reflect the actual socket/device
currently in use, which is confusing.

It's safer to refuse changing these properties once the vfio-user
server has started.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 06/14] vfio-user: find and init PCI device
  2021-12-15 15:35 ` [PATCH v4 06/14] vfio-user: find and init PCI device Jagannathan Raman
@ 2021-12-16 10:39   ` Stefan Hajnoczi
  2021-12-17  3:12     ` Jag Raman
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-16 10:39 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, bleal, swapnil.ingle,
	john.levon, philmd, qemu-devel, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 1720 bytes --]

On Wed, Dec 15, 2021 at 10:35:30AM -0500, Jagannathan Raman wrote:
> @@ -150,6 +157,38 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
> +    o->pci_dev = PCI_DEVICE(dev);
...
> @@ -190,6 +229,8 @@ static void vfu_object_finalize(Object *obj)
>  
>      o->device = NULL;
>  
> +    o->pci_dev = NULL;

We need to consider how this interacts with device_del hot unplug.
o->pci_dev is a pointer and we don't hold a refcount, so I think
o->pci_dev could disappear at any moment.

A pair of object_ref/unref(OBJECT(o->pci_dev)) calls would not be enough
because device_del will still unrealize the device that's in use by the
vfio-user server.

I suggest adding a check to qdev_unplug() that prevents unplug when the
device is in use by the vfio-user server. That's similar to the code in
that function for preventing unplug during migration.

One way to do that is by adding a new API:

  /*
   * Register an Error that is raised when unplug is attempted on a
   * device. If another blocker has already been registered then that
   * Error may be raised during unplug instead.
   *
   * qdev_del_unplug_blocker() must be called to remove this blocker.
   */
  void qdev_add_unplug_blocker(DeviceState *dev, Error *blocker);

  /*
   * Deregister an Error that was previously registered with
   * qdev_add_unplug_blocker().
   */
  void qdev_del_unplug_blocker(DeviceState *dev, Error *blocker);

The vfio-user server then needs to add an Error *unplug_blocker field to
VfuObject and call qdev_add/del_unplug_blocker() on the PCI device.

From a user perspective this means that device_del fails with "Device
currently in use by vfio-user server '%s'".

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 07/14] vfio-user: run vfio-user context
  2021-12-15 15:35 ` [PATCH v4 07/14] vfio-user: run vfio-user context Jagannathan Raman
@ 2021-12-16 11:17   ` Stefan Hajnoczi
  2021-12-17 17:59     ` Jag Raman
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-16 11:17 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, bleal, swapnil.ingle,
	john.levon, philmd, qemu-devel, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 3896 bytes --]

On Wed, Dec 15, 2021 at 10:35:31AM -0500, Jagannathan Raman wrote:
> @@ -114,6 +118,62 @@ static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
>      vfu_object_init_ctx(o, errp);
>  }
>  
> +static void vfu_object_ctx_run(void *opaque)
> +{
> +    VfuObject *o = opaque;
> +    int ret = -1;
> +
> +    while (ret != 0) {
> +        ret = vfu_run_ctx(o->vfu_ctx);
> +        if (ret < 0) {
> +            if (errno == EINTR) {
> +                continue;
> +            } else if (errno == ENOTCONN) {
> +                qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> +                o->vfu_poll_fd = -1;
> +                object_unparent(OBJECT(o));
> +                break;

If nothing else logs a message then I think that should be done here so
users know why their vfio-user server object disappeared.

> +            } else {
> +                error_setg(&error_abort, "vfu: Failed to run device %s - %s",
> +                           o->device, strerror(errno));

error_abort is equivalent to assuming !o->daemon. In the case where the
user doesn't want to automatically shut down the process we need to log
a message without aborting.

> +                 break;

Indentation is off.

> +            }
> +        }
> +    }
> +}
> +
> +static void vfu_object_attach_ctx(void *opaque)
> +{
> +    VfuObject *o = opaque;
> +    GPollFD pfds[1];
> +    int ret;
> +
> +    qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> +
> +    pfds[0].fd = o->vfu_poll_fd;
> +    pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> +
> +retry_attach:
> +    ret = vfu_attach_ctx(o->vfu_ctx);
> +    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
> +        qemu_poll_ns(pfds, 1, 500 * (int64_t)SCALE_MS);
> +        goto retry_attach;

This can block the thread indefinitely. Other events like monitor
commands are not handled in this loop. Please make this asynchronous
(set an fd handler and return from this function so we can try again
later).

The vfu_attach_ctx() implementation synchronously negotiates the
vfio-user connection :(. That's a shame because even if accept(2) is
handled asynchronously, the negotiation can still block. It would be
cleanest to have a fully async libvfio-user's vfu_attach_ctx() API to
avoid blocking. Is that possible?

If libvfio-user can't make vfu_attach_ctx() fully async then it may be
possible to spawn a thread just for the blocking vfu_attach_ctx() call
and report the result back to the event loop (e.g. using EventNotifier).
That adds a bunch of code to work around a blocking API though, so I
guess we can leave the blocking part if necessary.

At the very minimum, please make EAGAIN/EWOULDBLOCK async as mentioned
above and add a comment explaining the situation with the
partially-async vfu_attach_ctx() API so it's clear that this can block
(that way it's clear that you're aware of the issue and this isn't by
accident).

> +    } else if (ret < 0) {
> +        error_setg(&error_abort,
> +                   "vfu: Failed to attach device %s to context - %s",
> +                   o->device, strerror(errno));

error_abort assumes !o->daemon. Please handle the o->daemon == true
case by logging an error without aborting.

> +        return;
> +    }
> +
> +    o->vfu_poll_fd = vfu_get_poll_fd(o->vfu_ctx);
> +    if (o->vfu_poll_fd < 0) {
> +        error_setg(&error_abort, "vfu: Failed to get poll fd %s", o->device);

Same here.

> @@ -208,6 +284,8 @@ static void vfu_object_init(Object *obj)
>                     TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
>          return;
>      }
> +
> +    o->vfu_poll_fd = -1;
>  }

This must call qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL)
when o->vfu_poll_fd != -1 to avoid leaving a dangling fd handler
callback registered.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 08/14] vfio-user: handle PCI config space accesses
  2021-12-15 15:35 ` [PATCH v4 08/14] vfio-user: handle PCI config space accesses Jagannathan Raman
@ 2021-12-16 11:30   ` Stefan Hajnoczi
  2021-12-16 11:47     ` John Levon
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-16 11:30 UTC (permalink / raw)
  To: john.levon, thanos.makatos
  Cc: elena.ufimtseva, john.g.johnson, thuth, Jagannathan Raman, bleal,
	swapnil.ingle, philmd, qemu-devel, wainersm, alex.williamson,
	marcandre.lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 3097 bytes --]

On Wed, Dec 15, 2021 at 10:35:32AM -0500, Jagannathan Raman wrote:
> Define and register handlers for PCI config space accesses
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/remote/vfio-user-obj.c | 45 +++++++++++++++++++++++++++++++++++++++
>  hw/remote/trace-events    |  2 ++
>  2 files changed, 47 insertions(+)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index a01a0ad185..c6d0c675b7 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -43,6 +43,7 @@
>  #include "qapi/qapi-visit-sockets.h"
>  #include "qemu/notify.h"
>  #include "qemu/thread.h"
> +#include "qemu/main-loop.h"
>  #include "sysemu/sysemu.h"
>  #include "libvfio-user.h"
>  #include "hw/qdev-core.h"
> @@ -174,6 +175,39 @@ retry_attach:
>      qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_ctx_run, NULL, o);
>  }
>  
> +static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
> +                                     size_t count, loff_t offset,
> +                                     const bool is_write)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +    uint32_t pci_access_width = sizeof(uint32_t);
> +    size_t bytes = count;
> +    uint32_t val = 0;
> +    char *ptr = buf;
> +    int len;
> +
> +    while (bytes > 0) {
> +        len = (bytes > pci_access_width) ? pci_access_width : bytes;
> +        if (is_write) {
> +            memcpy(&val, ptr, len);
> +            pci_host_config_write_common(o->pci_dev, offset,
> +                                         pci_config_size(o->pci_dev),
> +                                         val, len);
> +            trace_vfu_cfg_write(offset, val);
> +        } else {
> +            val = pci_host_config_read_common(o->pci_dev, offset,
> +                                              pci_config_size(o->pci_dev), len);
> +            memcpy(ptr, &val, len);
> +            trace_vfu_cfg_read(offset, val);
> +        }
> +        offset += len;
> +        ptr += len;
> +        bytes -= len;
> +    }
> +
> +    return count;
> +}
> +
>  /*
>   * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
>   * properties. It also depends on devices instantiated in QEMU. These
> @@ -244,6 +278,17 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
>          goto fail;
>      }
>  
> +    ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_CFG_REGION_IDX,
> +                           pci_config_size(o->pci_dev), &vfu_object_cfg_access,
> +                           VFU_REGION_FLAG_RW | VFU_REGION_FLAG_ALWAYS_CB,
> +                           NULL, 0, -1, 0);

Thanos or John: QEMU's pci_host_config_read/write_common() works with
host-endian values. I don't know which endianness libvfio-user's region
callbacks expect. Does this patch look endian-safe to you (e.g. will it
work on a big-endian host)?

Thanks,
Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 08/14] vfio-user: handle PCI config space accesses
  2021-12-16 11:30   ` Stefan Hajnoczi
@ 2021-12-16 11:47     ` John Levon
  2021-12-16 16:00       ` Stefan Hajnoczi
  0 siblings, 1 reply; 49+ messages in thread
From: John Levon @ 2021-12-16 11:47 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: elena.ufimtseva, john.g.johnson, thuth, Jagannathan Raman, bleal,
	swapnil.ingle, alex.bennee, qemu-devel, wainersm,
	alex.williamson, pbonzini, marcandre.lureau, crosa,
	thanos.makatos, philmd

On Thu, Dec 16, 2021 at 11:30:20AM +0000, Stefan Hajnoczi wrote:

> > +    ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_CFG_REGION_IDX,
> > +                           pci_config_size(o->pci_dev), &vfu_object_cfg_access,
> > +                           VFU_REGION_FLAG_RW | VFU_REGION_FLAG_ALWAYS_CB,
> > +                           NULL, 0, -1, 0);
> 
> Thanos or John: QEMU's pci_host_config_read/write_common() works with
> host-endian values. I don't know which endianness libvfio-user's region
> callbacks expect. Does this patch look endian-safe to you (e.g. will it
> work on a big-endian host)?

https://github.com/nutanix/libvfio-user/issues/615

regards
john


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 09/14] vfio-user: handle DMA mappings
  2021-12-15 15:35 ` [PATCH v4 09/14] vfio-user: handle DMA mappings Jagannathan Raman
@ 2021-12-16 13:24   ` Stefan Hajnoczi
  2021-12-17 19:11     ` Jag Raman
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-16 13:24 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, bleal, swapnil.ingle,
	john.levon, philmd, qemu-devel, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 3636 bytes --]

On Wed, Dec 15, 2021 at 10:35:33AM -0500, Jagannathan Raman wrote:
> Define and register callbacks to manage the RAM regions used for
> device DMA
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  hw/remote/vfio-user-obj.c | 48 +++++++++++++++++++++++++++++++++++++++
>  hw/remote/trace-events    |  2 ++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index c6d0c675b7..46f2251a68 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -208,6 +208,47 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
>      return count;
>  }
>  
> +static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
> +{
> +    MemoryRegion *subregion = NULL;
> +    g_autofree char *name = NULL;
> +    static unsigned int suffix;
> +    struct iovec *iov = &info->iova;
> +
> +    if (!info->vaddr) {
> +        return;
> +    }
> +
> +    name = g_strdup_printf("remote-mem-%u", suffix++);
> +
> +    subregion = g_new0(MemoryRegion, 1);
> +
> +    memory_region_init_ram_ptr(subregion, NULL, name,
> +                               iov->iov_len, info->vaddr);
> +
> +    memory_region_add_subregion(get_system_memory(), (hwaddr)iov->iov_base,
> +                                subregion);
> +
> +    trace_vfu_dma_register((uint64_t)iov->iov_base, iov->iov_len);
> +}
> +
> +static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
> +{
> +    MemoryRegion *mr = NULL;
> +    ram_addr_t offset;
> +
> +    mr = memory_region_from_host(info->vaddr, &offset);
> +    if (!mr) {
> +        return;
> +    }
> +
> +    memory_region_del_subregion(get_system_memory(), mr);
> +
> +    object_unparent((OBJECT(mr)));
> +
> +    trace_vfu_dma_unregister((uint64_t)info->iova.iov_base);
> +}

This does not support hot unplug (memory regions pointing to memory
mapped by libvfio-user are left registered). The code should keep a list
(e.g. https://docs.gtk.org/glib/struct.SList.html) of memory regions and
automatically remove them before destroying the vfu context.

It also doesn't support multiple vfio-user server instances running in
the same QEMU process. get_system_memory() is global but the memory
regions provided by vfio-user are per-client (i.e. VM). If multiple VMs
are connected to one vfio-user server process then they conflict.

I don't know the best way to support multiple vfio-user server
instances, it would be straightforward if QEMU supported multiple
MachineStates and didn't use global get_system_memory()/get_io_memory()
APIs. It would be nice to solve that in the future.

Maybe it's too hard to change that, I haven't looked. An alternative is
to make the x-remote machine empty (it doesn't create any devices) and
instead create a new PCI bus, interrupt controller, memory MemoryRegion,
and io MemoryRegion in VfuObject. Stop using get_system_memory() and
instead use the per-VfuObject memory MemoryRegion.

In either of those approaches it's probably necessary to specify the PCI
bus ID in --device and device_add so it's clear which vfio-user server
the PCI device is associated with.

The multiple vfio-user server instance limitation doesn't need to be
solved now, but I wanted to share some ideas around it. Maybe someone
has better ideas or is aware of limitations preventing what I described.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 10/14] vfio-user: handle PCI BAR accesses
  2021-12-15 15:35 ` [PATCH v4 10/14] vfio-user: handle PCI BAR accesses Jagannathan Raman
@ 2021-12-16 14:10   ` Stefan Hajnoczi
  2021-12-17 19:12     ` Jag Raman
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-16 14:10 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, bleal, swapnil.ingle,
	john.levon, philmd, qemu-devel, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 586 bytes --]

On Wed, Dec 15, 2021 at 10:35:34AM -0500, Jagannathan Raman wrote:
> +static ssize_t vfu_object_bar_rw(PCIDevice *pci_dev, hwaddr addr, size_t count,
> +                                 char * const buf, const bool is_write,
> +                                 bool is_io)
> +{
> +    AddressSpace *as = NULL;
> +    MemTxResult res;
> +
> +    if (is_io) {
> +        as = &address_space_io;
> +    } else {
> +        as = pci_device_iommu_address_space(pci_dev);

This access is not initiated by the device, it's coming from the CPU. It
shouldn't go through the IOMMU address space.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 11/14] vfio-user: IOMMU support for remote device
  2021-12-15 15:35 ` [PATCH v4 11/14] vfio-user: IOMMU support for remote device Jagannathan Raman
@ 2021-12-16 14:40   ` Stefan Hajnoczi
  2021-12-17 20:00     ` Jag Raman
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-16 14:40 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, bleal, swapnil.ingle,
	john.levon, philmd, qemu-devel, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 10561 bytes --]

On Wed, Dec 15, 2021 at 10:35:35AM -0500, Jagannathan Raman wrote:
> Assign separate address space for each device in the remote processes.

If I understand correctly this isn't really an IOMMU. It's abusing the
IOMMU APIs to create isolated address spaces for each device. This way
memory regions added by the vfio-user client do not conflict when there
are multiple vfio-user servers.

Calling pci_root_bus_new() and keeping one PCI bus per VfuObject might
be a cleaner approach:
- Lets you isolate both PCI Memory Space and IO Space.
- Isolates the PCIDevices and their addresses on the bus.
- Isolates irqs.
- No more need to abuse the IOMMU API.

I might be missing something because I haven't investigated how to do
this myself.

> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  include/hw/pci/pci.h      |   2 +
>  include/hw/remote/iommu.h |  24 ++++++++
>  hw/pci/pci.c              |   2 +-
>  hw/remote/iommu.c         | 117 ++++++++++++++++++++++++++++++++++++++
>  hw/remote/machine.c       |   5 ++
>  hw/remote/vfio-user-obj.c |  20 ++++++-
>  MAINTAINERS               |   2 +
>  hw/remote/meson.build     |   1 +
>  8 files changed, 169 insertions(+), 4 deletions(-)
>  create mode 100644 include/hw/remote/iommu.h
>  create mode 100644 hw/remote/iommu.c
> 
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 5c4016b995..f2fc2d5375 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -734,6 +734,8 @@ void lsi53c8xx_handle_legacy_cmdline(DeviceState *lsi_dev);
>  qemu_irq pci_allocate_irq(PCIDevice *pci_dev);
>  void pci_set_irq(PCIDevice *pci_dev, int level);
>  
> +void pci_init_bus_master(PCIDevice *pci_dev);

This function isn't used in this patch. Why make it public?

> +
>  static inline void pci_irq_assert(PCIDevice *pci_dev)
>  {
>      pci_set_irq(pci_dev, 1);
> diff --git a/include/hw/remote/iommu.h b/include/hw/remote/iommu.h
> new file mode 100644
> index 0000000000..42ce0ca383
> --- /dev/null
> +++ b/include/hw/remote/iommu.h
> @@ -0,0 +1,24 @@
> +/*
> + * IOMMU for remote device
> + *
> + * Copyright © 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef REMOTE_IOMMU_H
> +#define REMOTE_IOMMU_H
> +
> +#include "hw/pci/pci_bus.h"
> +
> +void remote_iommu_free(PCIDevice *pci_dev);
> +
> +void remote_iommu_init(void);
> +
> +void remote_iommu_set(PCIBus *bus);
> +
> +MemoryRegion *remote_iommu_get_ram(PCIDevice *pci_dev);
> +
> +#endif
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 4a84e478ce..57d561cc03 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -95,7 +95,7 @@ static const VMStateDescription vmstate_pcibus = {
>      }
>  };
>  
> -static void pci_init_bus_master(PCIDevice *pci_dev)
> +void pci_init_bus_master(PCIDevice *pci_dev)
>  {
>      AddressSpace *dma_as = pci_device_iommu_address_space(pci_dev);
>  
> diff --git a/hw/remote/iommu.c b/hw/remote/iommu.c
> new file mode 100644
> index 0000000000..30c866badb
> --- /dev/null
> +++ b/hw/remote/iommu.c
> @@ -0,0 +1,117 @@
> +/*
> + * Remote IOMMU
> + *
> + * Copyright © 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +
> +#include "hw/remote/iommu.h"
> +#include "hw/pci/pci_bus.h"
> +#include "exec/memory.h"
> +#include "exec/address-spaces.h"
> +#include "trace.h"
> +
> +struct VFUIOMMU {
> +    AddressSpace  as;
> +    MemoryRegion  mr;

I guess this is the root MemoryRegion container? Calling it "root" or
"root_mr" instead of "mr" would make that clearer.

> +};
> +
> +typedef struct VFUPciBus {

There is no uppercase/lowercase consistency between VfuObject vs
VFUIOMMU vs VFUPciBus. Although the coding standard doesn't dictate ABC
vs Abc, please be consistent. I suggest following the VfuObject
convention started in the previous patches. The names would be VfuIommu
and VfuPciBus.

> +    PCIBus           *bus;
> +    struct VFUIOMMU  *iommu[];
> +} VFUPciBus;
> +
> +GHashTable *remote_as_table;
> +
> +static AddressSpace *remote_iommu_get_as(PCIBus *bus, void *opaque, int devfn)
> +{
> +    VFUPciBus *vfu_pci_bus = NULL;
> +    struct VFUIOMMU *iommu = NULL;
> +
> +    if (!remote_as_table) {
> +        return &address_space_memory;
> +    }
> +
> +    vfu_pci_bus = g_hash_table_lookup(remote_as_table, bus);
> +
> +    if (!vfu_pci_bus) {
> +        vfu_pci_bus = g_malloc0(sizeof(VFUPciBus));
> +        vfu_pci_bus->bus = bus;
> +        g_hash_table_insert(remote_as_table, bus, vfu_pci_bus);
> +    }
> +
> +    iommu = vfu_pci_bus->iommu[devfn];
> +
> +    if (!iommu) {
> +        g_autofree char *mr_name = g_strdup_printf("vfu-ram-%d", devfn);
> +        g_autofree char *as_name = g_strdup_printf("vfu-as-%d", devfn);
> +
> +        iommu = g_malloc0(sizeof(struct VFUIOMMU));
> +
> +        memory_region_init(&iommu->mr, NULL, mr_name, UINT64_MAX);
> +        address_space_init(&iommu->as, &iommu->mr, as_name);
> +
> +        vfu_pci_bus->iommu[devfn] = iommu;
> +    }
> +
> +    return &iommu->as;
> +}
> +
> +void remote_iommu_free(PCIDevice *pci_dev)
> +{
> +    VFUPciBus *vfu_pci_bus = NULL;
> +    struct VFUIOMMU *iommu = NULL;
> +
> +    if (!remote_as_table) {
> +        return;
> +    }
> +
> +    vfu_pci_bus = g_hash_table_lookup(remote_as_table, pci_get_bus(pci_dev));
> +
> +    if (!vfu_pci_bus) {
> +        return;
> +    }
> +
> +    iommu = vfu_pci_bus->iommu[pci_dev->devfn];
> +
> +    vfu_pci_bus->iommu[pci_dev->devfn] = NULL;
> +
> +    if (iommu) {
> +        memory_region_unref(&iommu->mr);
> +        address_space_destroy(&iommu->as);
> +        g_free(iommu);
> +    }
> +}
> +
> +void remote_iommu_init(void)
> +{
> +    remote_as_table = g_hash_table_new_full(NULL, NULL, NULL, NULL);
> +}
> +
> +void remote_iommu_set(PCIBus *bus)
> +{
> +    pci_setup_iommu(bus, remote_iommu_get_as, NULL);
> +}
> +
> +MemoryRegion *remote_iommu_get_ram(PCIDevice *pci_dev)
> +{
> +    PCIBus *bus = pci_get_bus(pci_dev);
> +    VFUPciBus *vfu_pci_bus;
> +
> +    if (!remote_as_table) {
> +        return get_system_memory();
> +    }
> +
> +    vfu_pci_bus = g_hash_table_lookup(remote_as_table, bus);
> +    if (!vfu_pci_bus) {
> +        return get_system_memory();
> +    }
> +
> +    return &vfu_pci_bus->iommu[pci_dev->devfn]->mr;
> +}
> diff --git a/hw/remote/machine.c b/hw/remote/machine.c
> index 952105eab5..023be0491e 100644
> --- a/hw/remote/machine.c
> +++ b/hw/remote/machine.c
> @@ -21,6 +21,7 @@
>  #include "qapi/error.h"
>  #include "hw/pci/pci_host.h"
>  #include "hw/remote/iohub.h"
> +#include "hw/remote/iommu.h"
>  
>  static void remote_machine_init(MachineState *machine)
>  {
> @@ -52,6 +53,10 @@ static void remote_machine_init(MachineState *machine)
>  
>      remote_iohub_init(&s->iohub);
>  
> +    remote_iommu_init();
> +
> +    remote_iommu_set(pci_host->bus);
> +
>      pci_bus_irqs(pci_host->bus, remote_iohub_set_irq, remote_iohub_map_irq,
>                   &s->iohub, REMOTE_IOHUB_NB_PIRQS);
>  }
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 9399e87cbe..ae375e69b9 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -49,6 +49,7 @@
>  #include "hw/qdev-core.h"
>  #include "hw/pci/pci.h"
>  #include "qemu/timer.h"
> +#include "hw/remote/iommu.h"
>  
>  #define TYPE_VFU_OBJECT "x-vfio-user-server"
>  OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
> @@ -210,6 +211,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
>  
>  static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
>  {
> +    VfuObject *o = vfu_get_private(vfu_ctx);
>      MemoryRegion *subregion = NULL;
>      g_autofree char *name = NULL;
>      static unsigned int suffix;
> @@ -226,14 +228,15 @@ static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
>      memory_region_init_ram_ptr(subregion, NULL, name,
>                                 iov->iov_len, info->vaddr);
>  
> -    memory_region_add_subregion(get_system_memory(), (hwaddr)iov->iov_base,
> -                                subregion);
> +    memory_region_add_subregion(remote_iommu_get_ram(o->pci_dev),
> +                                (hwaddr)iov->iov_base, subregion);
>  
>      trace_vfu_dma_register((uint64_t)iov->iov_base, iov->iov_len);
>  }
>  
>  static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
>  {
> +    VfuObject *o = vfu_get_private(vfu_ctx);
>      MemoryRegion *mr = NULL;
>      ram_addr_t offset;
>  
> @@ -242,7 +245,7 @@ static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
>          return;
>      }
>  
> -    memory_region_del_subregion(get_system_memory(), mr);
> +    memory_region_del_subregion(remote_iommu_get_ram(o->pci_dev), mr);
>  
>      object_unparent((OBJECT(mr)));
>  
> @@ -320,6 +323,7 @@ static vfu_region_access_cb_t *vfu_object_bar_handlers[PCI_NUM_REGIONS] = {
>   */
>  static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
>  {
> +    VfuObject *o = vfu_get_private(vfu_ctx);
>      int i;
>  
>      for (i = 0; i < PCI_NUM_REGIONS; i++) {
> @@ -332,6 +336,12 @@ static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
>                           vfu_object_bar_handlers[i],
>                           VFU_REGION_FLAG_RW, NULL, 0, -1, 0);
>  
> +        if ((o->pci_dev->io_regions[i].type & PCI_BASE_ADDRESS_SPACE) == 0) {
> +            memory_region_unref(o->pci_dev->io_regions[i].address_space);
> +            o->pci_dev->io_regions[i].address_space =
> +                remote_iommu_get_ram(o->pci_dev);
> +        }

This looks hacky. If you create a separate PCIHost for each device
instead then the BARs will be created in the MemoryRegion (confusingly
named "address_space" in the PCI code) of your choosing.

Also, why is PCI Memory Space isolated via VFUIOMMU but PCI IO Space is
not?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 12/14] vfio-user: handle device interrupts
  2021-12-15 15:35 ` [PATCH v4 12/14] vfio-user: handle device interrupts Jagannathan Raman
@ 2021-12-16 15:56   ` Stefan Hajnoczi
  0 siblings, 0 replies; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-16 15:56 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: elena.ufimtseva, john.g.johnson, thuth, bleal, swapnil.ingle,
	john.levon, philmd, qemu-devel, wainersm, alex.williamson,
	thanos.makatos, marcandre.lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 4933 bytes --]

On Wed, Dec 15, 2021 at 10:35:36AM -0500, Jagannathan Raman wrote:
> @@ -62,6 +66,9 @@ void remote_iohub_set_irq(void *opaque, int pirq, int level)
>      QEMU_LOCK_GUARD(&iohub->irq_level_lock[pirq]);
>  
>      if (level) {
> +        if (iohub->intx_notify) {
> +            iohub->intx_notify(pirq, 0);
> +        }
>          if (++iohub->irq_level[pirq] == 1) {
>              event_notifier_set(&iohub->irqfds[pirq]);
>          }

Does it make sense to use iohub.c in vfio-user since these irqfds are
unused?

Instead of adding iohub->intx_notify(), can vfio-user register its own
set_irq() callback for the PCI bus?

> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index ae375e69b9..2b28d465d5 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -50,6 +50,9 @@
>  #include "hw/pci/pci.h"
>  #include "qemu/timer.h"
>  #include "hw/remote/iommu.h"
> +#include "hw/pci/msi.h"
> +#include "hw/pci/msix.h"
> +#include "hw/remote/iohub.h"
>  
>  #define TYPE_VFU_OBJECT "x-vfio-user-server"
>  OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
> @@ -81,6 +84,8 @@ struct VfuObject {
>      int vfu_poll_fd;
>  };
>  
> +static GHashTable *vfu_object_dev_table;
> +
>  static void vfu_object_init_ctx(VfuObject *o, Error **errp);
>  
>  static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
> @@ -347,6 +352,54 @@ static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
>      }
>  }
>  
> +static void vfu_object_intx_notify(int pci_devfn, unsigned vector)
> +{
> +    vfu_ctx_t *vfu_ctx = g_hash_table_lookup(vfu_object_dev_table,
> +                                             (void *)(uint64_t)pci_devfn);

I'm not sure the hash table is necessary. The function arguments
currently don't contain the information we need, but that's easy to fix
because these functions are added by this patch. Add an opaque pointer
argument to ->intx_notify, ->msix_notify, and ->msi_notify in order to
pass along the VfuObject we need.

> +
> +    if (vfu_ctx) {
> +        vfu_irq_trigger(vfu_ctx, vector);
> +    }
> +}
> +
> +static void vfu_object_msi_notify(PCIDevice *pci_dev, unsigned vector)
> +{
> +    vfu_object_intx_notify(pci_dev->devfn, vector);
> +}
> +
> +static int vfu_object_setup_irqs(vfu_ctx_t *vfu_ctx, PCIDevice *pci_dev)
> +{
> +    RemoteMachineState *machine = REMOTE_MACHINE(current_machine);
> +    RemoteIOHubState *iohub = &machine->iohub;
> +    int ret;
> +
> +    ret = vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_INTX_IRQ, 1);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    iohub->intx_notify = vfu_object_intx_notify;
> +
> +    ret = 0;
> +    if (msix_nr_vectors_allocated(pci_dev)) {
> +        ret = vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_MSIX_IRQ,
> +                                       msix_nr_vectors_allocated(pci_dev));
> +
> +        pci_dev->msix_notify = vfu_object_msi_notify;
> +    } else if (msi_nr_vectors_allocated(pci_dev)) {
> +        ret = vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_MSI_IRQ,
> +                                       msi_nr_vectors_allocated(pci_dev));
> +
> +        pci_dev->msi_notify = vfu_object_msi_notify;
> +    }
> +
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
>  /*
>   * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
>   * properties. It also depends on devices instantiated in QEMU. These
> @@ -437,6 +490,13 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
>  
>      vfu_object_register_bars(o->vfu_ctx, o->pci_dev);
>  
> +    ret = vfu_object_setup_irqs(o->vfu_ctx, o->pci_dev);
> +    if (ret < 0) {
> +        error_setg(errp, "vfu: Failed to setup interrupts for %s",
> +                   o->device);
> +        goto fail;
> +    }
> +
>      ret = vfu_realize_ctx(o->vfu_ctx);
>      if (ret < 0) {
>          error_setg(errp, "vfu: Failed to realize device %s- %s",
> @@ -450,6 +510,9 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
>          goto fail;
>      }
>  
> +    g_hash_table_insert(vfu_object_dev_table,
> +                        (void *)(uint64_t)o->pci_dev->devfn, o->vfu_ctx);

vfu_object_dev_table seems like a misnomer since the values stored are
actually vfu_ctx_t, not VfuObjects. vfu_devfn_to_vfu_ctx_table?

> +
>      qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_attach_ctx, NULL, o);
>  
>      return;
> @@ -504,9 +567,18 @@ static void vfu_object_finalize(Object *obj)
>          remote_iommu_free(o->pci_dev);
>      }
>  
> +    if (o->pci_dev &&
> +            g_hash_table_lookup(vfu_object_dev_table,
> +                                (void *)(uint64_t)o->pci_dev->devfn)) {

lookup() is unnecessary since remove() is a nop if the key does not
exist.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 08/14] vfio-user: handle PCI config space accesses
  2021-12-16 11:47     ` John Levon
@ 2021-12-16 16:00       ` Stefan Hajnoczi
  0 siblings, 0 replies; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-16 16:00 UTC (permalink / raw)
  To: John Levon
  Cc: elena.ufimtseva, john.g.johnson, thuth, Jagannathan Raman, bleal,
	swapnil.ingle, alex.bennee, qemu-devel, wainersm,
	alex.williamson, pbonzini, marcandre.lureau, crosa,
	thanos.makatos, philmd

[-- Attachment #1: Type: text/plain, Size: 778 bytes --]

On Thu, Dec 16, 2021 at 11:47:40AM +0000, John Levon wrote:
> On Thu, Dec 16, 2021 at 11:30:20AM +0000, Stefan Hajnoczi wrote:
> 
> > > +    ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_CFG_REGION_IDX,
> > > +                           pci_config_size(o->pci_dev), &vfu_object_cfg_access,
> > > +                           VFU_REGION_FLAG_RW | VFU_REGION_FLAG_ALWAYS_CB,
> > > +                           NULL, 0, -1, 0);
> > 
> > Thanos or John: QEMU's pci_host_config_read/write_common() works with
> > host-endian values. I don't know which endianness libvfio-user's region
> > callbacks expect. Does this patch look endian-safe to you (e.g. will it
> > work on a big-endian host)?
> 
> https://github.com/nutanix/libvfio-user/issues/615

Thanks!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 02/14] tests/avocado: Specify target VM argument to helper routines
  2021-12-15 22:04   ` Beraldo Leal
@ 2021-12-16 21:28     ` Jag Raman
  0 siblings, 0 replies; 49+ messages in thread
From: Jag Raman @ 2021-12-16 21:28 UTC (permalink / raw)
  To: Beraldo Leal
  Cc: Elena Ufimtseva, John Johnson, thuth, swapnil.ingle, john.levon,
	alex.bennee, qemu-devel, wainersm, alex.williamson, pbonzini,
	marcandre.lureau, stefanha, crosa, thanos.makatos, philmd



> On Dec 15, 2021, at 5:04 PM, Beraldo Leal <bleal@redhat.com> wrote:
> 
> On Wed, Dec 15, 2021 at 10:35:26AM -0500, Jagannathan Raman wrote:
>> Specify target VM for exec_command and
>> exec_command_and_wait_for_pattern routines
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> tests/avocado/avocado_qemu/__init__.py | 10 ++++++----
>> 1 file changed, 6 insertions(+), 4 deletions(-)
>> 
>> diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py
>> index 75063c0c30..26ac782f53 100644
>> --- a/tests/avocado/avocado_qemu/__init__.py
>> +++ b/tests/avocado/avocado_qemu/__init__.py
>> @@ -198,7 +198,7 @@ def wait_for_console_pattern(test, success_message, failure_message=None,
>>     """
>>     _console_interaction(test, success_message, failure_message, None, vm=vm)
>> 
>> -def exec_command(test, command):
>> +def exec_command(test, command, vm=None):
> 
> nitpick: if possible, it would be nice to update the docstring, by
> adding this new argument.
> 
>>     """
>>     Send a command to a console (appending CRLF characters), while logging
>>     the content.
>> @@ -208,10 +208,11 @@ def exec_command(test, command):
>>     :param command: the command to send
>>     :type command: str
>>     """
>> -    _console_interaction(test, None, None, command + '\r')
>> +    _console_interaction(test, None, None, command + '\r', vm=vm)
>> 
>> def exec_command_and_wait_for_pattern(test, command,
>> -                                      success_message, failure_message=None):
>> +                                      success_message, failure_message=None,
>> +                                      vm=None):
> 
> Same here.

OK sure, will do. Thank you!
--
Jag

> 
> Other than that, lgtm.
> 
> Reviewed-by: Beraldo Leal <bleal@redhat.com>
> 
> --
> Beraldo
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 05/14] vfio-user: instantiate vfio-user context
  2021-12-16  9:55   ` Stefan Hajnoczi
@ 2021-12-16 21:32     ` Jag Raman
  0 siblings, 0 replies; 49+ messages in thread
From: Jag Raman @ 2021-12-16 21:32 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	john.levon, philmd, qemu-devel, wainersm, Alex Williamson,
	thanos.makatos, Marc-André Lureau, crosa, pbonzini,
	alex.bennee



> On Dec 16, 2021, at 4:55 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Dec 15, 2021 at 10:35:29AM -0500, Jagannathan Raman wrote:
>> +static void vfu_object_init_ctx(VfuObject *o, Error **errp)
>> +{
>> +    ERRP_GUARD();
>> +
>> +    if (o->vfu_ctx || !o->socket || !o->device ||
>> +            !phase_check(PHASE_MACHINE_READY)) {
>> +        return;
>> +    }
>> +
>> +    if (o->err) {
>> +        error_propagate(errp, o->err);
> 
> Missing o->err = NULL because ownership has been passed to errp.

OK, will do!

Thank you!
--
Jag



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 04/14] vfio-user: define vfio-user-server object
  2021-12-16  9:33   ` Stefan Hajnoczi
@ 2021-12-17  2:17     ` Jag Raman
  0 siblings, 0 replies; 49+ messages in thread
From: Jag Raman @ 2021-12-17  2:17 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	john.levon, Philippe Mathieu-Daudé,
	qemu-devel, wainersm, Alex Williamson, thanos.makatos,
	Marc-André Lureau, crosa, pbonzini, alex.bennee



> On Dec 16, 2021, at 4:33 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Dec 15, 2021 at 10:35:28AM -0500, Jagannathan Raman wrote:
>> diff --git a/qapi/qom.json b/qapi/qom.json
>> index ccd1167808..6001a9b8f0 100644
>> --- a/qapi/qom.json
>> +++ b/qapi/qom.json
>> @@ -703,6 +703,20 @@
>> { 'struct': 'RemoteObjectProperties',
>>   'data': { 'fd': 'str', 'devid': 'str' } }
>> 
>> +##
>> +# @VfioUserServerProperties:
>> +#
>> +# Properties for x-vfio-user-server objects.
>> +#
>> +# @socket: socket to be used by the libvfiouser library
>> +#
>> +# @device: the id of the device to be emulated at the server
>> +#
>> +# Since: 6.2
> 
> 6.2 has been released so the version number needs to be incremented.

OK, thanks!

> 
>> +struct VfuObjectClass {
>> +    ObjectClass parent_class;
>> +
>> +    unsigned int nr_devs;
>> +
>> +    bool daemon;
> 
> I was wondering what this means. auto_shutdown might be a clearer name.

Sure, will do. 

> 
>> +static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
>> +                                  void *opaque, Error **errp)
>> +{
>> +    VfuObject *o = VFU_OBJECT(obj);
>> +
>> +    qapi_free_SocketAddress(o->socket);
>> +
>> +    o->socket = NULL;
>> +
>> +    visit_type_SocketAddress(v, name, &o->socket, errp);
>> +
>> +    if (o->socket->type != SOCKET_ADDRESS_TYPE_UNIX) {
>> +        qapi_free_SocketAddress(o->socket);
>> +        o->socket = NULL;
>> +        error_setg(errp, "vfu: Unsupported socket type - %s",
>> +                   o->socket->u.q_unix.path);
> 
> s/o->socket->u.q_unix.path/SocketAddressType_str(o->socket->type)/

Ok, got it.

Thanks!
--
Jag



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 04/14] vfio-user: define vfio-user-server object
  2021-12-16  9:58   ` Stefan Hajnoczi
@ 2021-12-17  2:31     ` Jag Raman
  2021-12-17  8:28       ` Stefan Hajnoczi
  0 siblings, 1 reply; 49+ messages in thread
From: Jag Raman @ 2021-12-17  2:31 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	john.levon, Philippe Mathieu-Daudé,
	qemu-devel, wainersm, Alex Williamson, thanos.makatos,
	Marc-André Lureau, crosa, pbonzini, alex.bennee



> On Dec 16, 2021, at 4:58 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Dec 15, 2021 at 10:35:28AM -0500, Jagannathan Raman wrote:
>> +static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
>> +                                  void *opaque, Error **errp)
>> +{
>> +    VfuObject *o = VFU_OBJECT(obj);
>> +
>> +    qapi_free_SocketAddress(o->socket);
>> +
>> +    o->socket = NULL;
>> +
>> +    visit_type_SocketAddress(v, name, &o->socket, errp);
>> +
>> +    if (o->socket->type != SOCKET_ADDRESS_TYPE_UNIX) {
>> +        qapi_free_SocketAddress(o->socket);
>> +        o->socket = NULL;
>> +        error_setg(errp, "vfu: Unsupported socket type - %s",
>> +                   o->socket->u.q_unix.path);
>> +        return;
>> +    }
>> +
>> +    trace_vfu_prop("socket", o->socket->u.q_unix.path);
>> +}
>> +
>> +static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
>> +{
>> +    VfuObject *o = VFU_OBJECT(obj);
>> +
>> +    g_free(o->device);
>> +
>> +    o->device = g_strdup(str);
>> +
>> +    trace_vfu_prop("device", str);
>> +}
> 
> It appears "socket" and "device" can be changed after the
> vfio-user server has started. In the best case it just means the
> properties contain values that do not reflect the actual socket/device
> currently in use, which is confusing.

I’m not sure about the scenario in which that would occur, but will add that check.

Thank you!
--
Jag

> 
> It's safer to refuse changing these properties once the vfio-user
> server has started.
> 
> Stefan


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 06/14] vfio-user: find and init PCI device
  2021-12-16 10:39   ` Stefan Hajnoczi
@ 2021-12-17  3:12     ` Jag Raman
  0 siblings, 0 replies; 49+ messages in thread
From: Jag Raman @ 2021-12-17  3:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	john.levon, Philippe Mathieu-Daudé,
	qemu-devel, wainersm, Alex Williamson, thanos.makatos,
	Marc-André Lureau, crosa, Paolo Bonzini, alex.bennee



> On Dec 16, 2021, at 5:39 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Dec 15, 2021 at 10:35:30AM -0500, Jagannathan Raman wrote:
>> @@ -150,6 +157,38 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
>> +    o->pci_dev = PCI_DEVICE(dev);
> ...
>> @@ -190,6 +229,8 @@ static void vfu_object_finalize(Object *obj)
>> 
>>     o->device = NULL;
>> 
>> +    o->pci_dev = NULL;
> 
> We need to consider how this interacts with device_del hot unplug.
> o->pci_dev is a pointer and we don't hold a refcount, so I think
> o->pci_dev could disappear at any moment.
> 
> A pair of object_ref/unref(OBJECT(o->pci_dev)) calls would not be enough
> because device_del will still unrealize the device that's in use by the
> vfio-user server.
> 
> I suggest adding a check to qdev_unplug() that prevents unplug when the
> device is in use by the vfio-user server. That's similar to the code in
> that function for preventing unplug during migration.
> 
> One way to do that is by adding a new API:
> 
>  /*
>   * Register an Error that is raised when unplug is attempted on a
>   * device. If another blocker has already been registered then that
>   * Error may be raised during unplug instead.
>   *
>   * qdev_del_unplug_blocker() must be called to remove this blocker.
>   */
>  void qdev_add_unplug_blocker(DeviceState *dev, Error *blocker);
> 
>  /*
>   * Deregister an Error that was previously registered with
>   * qdev_add_unplug_blocker().
>   */
>  void qdev_del_unplug_blocker(DeviceState *dev, Error *blocker);
> 
> The vfio-user server then needs to add an Error *unplug_blocker field to
> VfuObject and call qdev_add/del_unplug_blocker() on the PCI device.
> 
> From a user perspective this means that device_del fails with "Device
> currently in use by vfio-user server '%s'".

OK sounds good, will add the above unplug blocker API for PCI devices.

Thank you!
--
Jag

> 
> Stefan



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 04/14] vfio-user: define vfio-user-server object
  2021-12-17  2:31     ` Jag Raman
@ 2021-12-17  8:28       ` Stefan Hajnoczi
  0 siblings, 0 replies; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-17  8:28 UTC (permalink / raw)
  To: Jag Raman
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	john.levon, Philippe Mathieu-Daudé,
	qemu-devel, wainersm, Alex Williamson, thanos.makatos,
	Marc-André Lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 1727 bytes --]

On Fri, Dec 17, 2021 at 02:31:14AM +0000, Jag Raman wrote:
> 
> 
> > On Dec 16, 2021, at 4:58 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Wed, Dec 15, 2021 at 10:35:28AM -0500, Jagannathan Raman wrote:
> >> +static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
> >> +                                  void *opaque, Error **errp)
> >> +{
> >> +    VfuObject *o = VFU_OBJECT(obj);
> >> +
> >> +    qapi_free_SocketAddress(o->socket);
> >> +
> >> +    o->socket = NULL;
> >> +
> >> +    visit_type_SocketAddress(v, name, &o->socket, errp);
> >> +
> >> +    if (o->socket->type != SOCKET_ADDRESS_TYPE_UNIX) {
> >> +        qapi_free_SocketAddress(o->socket);
> >> +        o->socket = NULL;
> >> +        error_setg(errp, "vfu: Unsupported socket type - %s",
> >> +                   o->socket->u.q_unix.path);
> >> +        return;
> >> +    }
> >> +
> >> +    trace_vfu_prop("socket", o->socket->u.q_unix.path);
> >> +}
> >> +
> >> +static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
> >> +{
> >> +    VfuObject *o = VFU_OBJECT(obj);
> >> +
> >> +    g_free(o->device);
> >> +
> >> +    o->device = g_strdup(str);
> >> +
> >> +    trace_vfu_prop("device", str);
> >> +}
> > 
> > It appears "socket" and "device" can be changed after the
> > vfio-user server has started. In the best case it just means the
> > properties contain values that do not reflect the actual socket/device
> > currently in use, which is confusing.
> 
> I’m not sure about the scenario in which that would occur, but will add that check.

Properties can be modified using the qom-set monitor command. This can
happen any time.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 07/14] vfio-user: run vfio-user context
  2021-12-16 11:17   ` Stefan Hajnoczi
@ 2021-12-17 17:59     ` Jag Raman
  2021-12-20  8:29       ` Stefan Hajnoczi
  2022-01-05 10:38       ` Thanos Makatos
  0 siblings, 2 replies; 49+ messages in thread
From: Jag Raman @ 2021-12-17 17:59 UTC (permalink / raw)
  To: Stefan Hajnoczi, John Levon, Thanos Makatos
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	john.levon, Philippe Mathieu-Daudé,
	qemu-devel, wainersm, Alex Williamson, thanos.makatos,
	Marc-André Lureau, crosa, pbonzini, alex.bennee



> On Dec 16, 2021, at 6:17 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Dec 15, 2021 at 10:35:31AM -0500, Jagannathan Raman wrote:
>> @@ -114,6 +118,62 @@ static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
>>     vfu_object_init_ctx(o, errp);
>> }
>> 
>> +static void vfu_object_ctx_run(void *opaque)
>> +{
>> +    VfuObject *o = opaque;
>> +    int ret = -1;
>> +
>> +    while (ret != 0) {
>> +        ret = vfu_run_ctx(o->vfu_ctx);
>> +        if (ret < 0) {
>> +            if (errno == EINTR) {
>> +                continue;
>> +            } else if (errno == ENOTCONN) {
>> +                qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
>> +                o->vfu_poll_fd = -1;
>> +                object_unparent(OBJECT(o));
>> +                break;
> 
> If nothing else logs a message then I think that should be done here so
> users know why their vfio-user server object disappeared.

Sure will do.

Do you prefer a trace, or a message to the console? Trace makes sense to me.
Presently, the client could unplug the vfio-user device which would trigger the
deletion of this object. This process could happen quietly.

> 
>> +            } else {
>> +                error_setg(&error_abort, "vfu: Failed to run device %s - %s",
>> +                           o->device, strerror(errno));
> 
> error_abort is equivalent to assuming !o->daemon. In the case where the
> user doesn't want to automatically shut down the process we need to log
> a message without aborting.

OK, makes sense.

> 
>> +                 break;
> 
> Indentation is off.
> 
>> +            }
>> +        }
>> +    }
>> +}
>> +
>> +static void vfu_object_attach_ctx(void *opaque)
>> +{
>> +    VfuObject *o = opaque;
>> +    GPollFD pfds[1];
>> +    int ret;
>> +
>> +    qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
>> +
>> +    pfds[0].fd = o->vfu_poll_fd;
>> +    pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>> +
>> +retry_attach:
>> +    ret = vfu_attach_ctx(o->vfu_ctx);
>> +    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
>> +        qemu_poll_ns(pfds, 1, 500 * (int64_t)SCALE_MS);
>> +        goto retry_attach;
> 
> This can block the thread indefinitely. Other events like monitor
> commands are not handled in this loop. Please make this asynchronous
> (set an fd handler and return from this function so we can try again
> later).
> 
> The vfu_attach_ctx() implementation synchronously negotiates the
> vfio-user connection :(. That's a shame because even if accept(2) is
> handled asynchronously, the negotiation can still block. It would be
> cleanest to have a fully async libvfio-user's vfu_attach_ctx() API to
> avoid blocking. Is that possible?

Thanos / John,

    Any thoughts on this?

> 
> If libvfio-user can't make vfu_attach_ctx() fully async then it may be
> possible to spawn a thread just for the blocking vfu_attach_ctx() call
> and report the result back to the event loop (e.g. using EventNotifier).
> That adds a bunch of code to work around a blocking API though, so I
> guess we can leave the blocking part if necessary.
> 
> At the very minimum, please make EAGAIN/EWOULDBLOCK async as mentioned
> above and add a comment explaining the situation with the
> partially-async vfu_attach_ctx() API so it's clear that this can block
> (that way it's clear that you're aware of the issue and this isn't by
> accident).

Sure, we could make the attach async at QEMU depending on how the
library prefers to do this.

> 
>> +    } else if (ret < 0) {
>> +        error_setg(&error_abort,
>> +                   "vfu: Failed to attach device %s to context - %s",
>> +                   o->device, strerror(errno));
> 
> error_abort assumes !o->daemon. Please handle the o->daemon == true
> case by logging an error without aborting.
> 
>> +        return;
>> +    }
>> +
>> +    o->vfu_poll_fd = vfu_get_poll_fd(o->vfu_ctx);
>> +    if (o->vfu_poll_fd < 0) {
>> +        error_setg(&error_abort, "vfu: Failed to get poll fd %s", o->device);
> 
> Same here.
> 
>> @@ -208,6 +284,8 @@ static void vfu_object_init(Object *obj)
>>                    TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
>>         return;
>>     }
>> +
>> +    o->vfu_poll_fd = -1;
>> }
> 
> This must call qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL)
> when o->vfu_poll_fd != -1 to avoid leaving a dangling fd handler
> callback registered.

This is during the init phase, and the FD handlers are not set. Do you mean
to add this at finalize?

I agree it’s good to explicitly add this at finalize. But vfu_destroy_ctx() should
trigger a ENOTCONN, which would do it anyway.

Thank you!
--
Jag


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 09/14] vfio-user: handle DMA mappings
  2021-12-16 13:24   ` Stefan Hajnoczi
@ 2021-12-17 19:11     ` Jag Raman
  0 siblings, 0 replies; 49+ messages in thread
From: Jag Raman @ 2021-12-17 19:11 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	john.levon, Philippe Mathieu-Daudé,
	qemu-devel, wainersm, Alex Williamson, thanos.makatos,
	Marc-André Lureau, crosa, pbonzini, alex.bennee



> On Dec 16, 2021, at 8:24 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Dec 15, 2021 at 10:35:33AM -0500, Jagannathan Raman wrote:
>> Define and register callbacks to manage the RAM regions used for
>> device DMA
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>> hw/remote/vfio-user-obj.c | 48 +++++++++++++++++++++++++++++++++++++++
>> hw/remote/trace-events    |  2 ++
>> 2 files changed, 50 insertions(+)
>> 
>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>> index c6d0c675b7..46f2251a68 100644
>> --- a/hw/remote/vfio-user-obj.c
>> +++ b/hw/remote/vfio-user-obj.c
>> @@ -208,6 +208,47 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
>>     return count;
>> }
>> 
>> +static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
>> +{
>> +    MemoryRegion *subregion = NULL;
>> +    g_autofree char *name = NULL;
>> +    static unsigned int suffix;
>> +    struct iovec *iov = &info->iova;
>> +
>> +    if (!info->vaddr) {
>> +        return;
>> +    }
>> +
>> +    name = g_strdup_printf("remote-mem-%u", suffix++);
>> +
>> +    subregion = g_new0(MemoryRegion, 1);
>> +
>> +    memory_region_init_ram_ptr(subregion, NULL, name,
>> +                               iov->iov_len, info->vaddr);
>> +
>> +    memory_region_add_subregion(get_system_memory(), (hwaddr)iov->iov_base,
>> +                                subregion);
>> +
>> +    trace_vfu_dma_register((uint64_t)iov->iov_base, iov->iov_len);
>> +}
>> +
>> +static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
>> +{
>> +    MemoryRegion *mr = NULL;
>> +    ram_addr_t offset;
>> +
>> +    mr = memory_region_from_host(info->vaddr, &offset);
>> +    if (!mr) {
>> +        return;
>> +    }
>> +
>> +    memory_region_del_subregion(get_system_memory(), mr);
>> +
>> +    object_unparent((OBJECT(mr)));
>> +
>> +    trace_vfu_dma_unregister((uint64_t)info->iova.iov_base);
>> +}
> 
> This does not support hot unplug (memory regions pointing to memory
> mapped by libvfio-user are left registered). The code should keep a list
> (e.g. https://docs.gtk.org/glib/struct.SList.html) of memory regions and
> automatically remove them before destroying the vfu context.
> 
> It also doesn't support multiple vfio-user server instances running in
> the same QEMU process. get_system_memory() is global but the memory
> regions provided by vfio-user are per-client (i.e. VM). If multiple VMs
> are connected to one vfio-user server process then they conflict.
> 
> I don't know the best way to support multiple vfio-user server
> instances, it would be straightforward if QEMU supported multiple
> MachineStates and didn't use global get_system_memory()/get_io_memory()
> APIs. It would be nice to solve that in the future.

We’ve addressed the multiple vfio-user-server instances in
"[PATCH v4 11/14] vfio-user: IOMMU support for remote device” patch
down the line. I see your comments there, will address them.

Thank you!
--
Jag

> 
> Maybe it's too hard to change that, I haven't looked. An alternative is
> to make the x-remote machine empty (it doesn't create any devices) and
> instead create a new PCI bus, interrupt controller, memory MemoryRegion,
> and io MemoryRegion in VfuObject. Stop using get_system_memory() and
> instead use the per-VfuObject memory MemoryRegion.
> 
> In either of those approaches it's probably necessary to specify the PCI
> bus ID in --device and device_add so it's clear which vfio-user server
> the PCI device is associated with.
> 
> The multiple vfio-user server instance limitation doesn't need to be
> solved now, but I wanted to share some ideas around it. Maybe someone
> has better ideas or is aware of limitations preventing what I described.
> 
> Stefan


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 10/14] vfio-user: handle PCI BAR accesses
  2021-12-16 14:10   ` Stefan Hajnoczi
@ 2021-12-17 19:12     ` Jag Raman
  0 siblings, 0 replies; 49+ messages in thread
From: Jag Raman @ 2021-12-17 19:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	john.levon, Philippe Mathieu-Daudé,
	qemu-devel, wainersm, Alex Williamson, thanos.makatos,
	Marc-André Lureau, crosa, pbonzini, alex.bennee



> On Dec 16, 2021, at 9:10 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Dec 15, 2021 at 10:35:34AM -0500, Jagannathan Raman wrote:
>> +static ssize_t vfu_object_bar_rw(PCIDevice *pci_dev, hwaddr addr, size_t count,
>> +                                 char * const buf, const bool is_write,
>> +                                 bool is_io)
>> +{
>> +    AddressSpace *as = NULL;
>> +    MemTxResult res;
>> +
>> +    if (is_io) {
>> +        as = &address_space_io;
>> +    } else {
>> +        as = pci_device_iommu_address_space(pci_dev);
> 
> This access is not initiated by the device, it's coming from the CPU. It
> shouldn't go through the IOMMU address space.

Got it, thank you!
--
Jag



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 11/14] vfio-user: IOMMU support for remote device
  2021-12-16 14:40   ` Stefan Hajnoczi
@ 2021-12-17 20:00     ` Jag Raman
  2021-12-20 14:36       ` Stefan Hajnoczi
  0 siblings, 1 reply; 49+ messages in thread
From: Jag Raman @ 2021-12-17 20:00 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	john.levon, Philippe Mathieu-Daudé,
	qemu-devel, wainersm, Alex Williamson, thanos.makatos,
	Marc-André Lureau, crosa, pbonzini, alex.bennee



> On Dec 16, 2021, at 9:40 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Dec 15, 2021 at 10:35:35AM -0500, Jagannathan Raman wrote:
>> Assign separate address space for each device in the remote processes.
> 
> If I understand correctly this isn't really an IOMMU. It's abusing the
> IOMMU APIs to create isolated address spaces for each device. This way
> memory regions added by the vfio-user client do not conflict when there
> are multiple vfio-user servers.

Like you already figured out, having isolated DMA address space alone is not
sufficient for this application, we also needed to isolate the sysmem/RAM address
space. As such, the available IOMMU APIs alone were not sufficient, so we had
to improvise.

> 
> Calling pci_root_bus_new() and keeping one PCI bus per VfuObject might
> be a cleaner approach:
> - Lets you isolate both PCI Memory Space and IO Space.
> - Isolates the PCIDevices and their addresses on the bus.
> - Isolates irqs.
> - No more need to abuse the IOMMU API.

I believe we would still need to have an IOMMU. It’s because, devices use the
pci_dma_read()/_write() functions. These functions look up the address in DMA
address space (via pci_get_address_space() -> PCIDevice->bus_master_as ->
PCIDevice->bus_master_enable_region -> PCIDevice->bus_master_container_region).
 bus_master_enable_region and bus_master_container_region are effectively aliases
to the DMA address space - without an IOMMU, the dma_as would be the shared
global sysmem/RAM space (address_space_mem, please see pci_init_bus_master())

> 
> I might be missing something because I haven't investigated how to do
> this myself.
> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> include/hw/pci/pci.h      |   2 +
>> include/hw/remote/iommu.h |  24 ++++++++
>> hw/pci/pci.c              |   2 +-
>> hw/remote/iommu.c         | 117 ++++++++++++++++++++++++++++++++++++++
>> hw/remote/machine.c       |   5 ++
>> hw/remote/vfio-user-obj.c |  20 ++++++-
>> MAINTAINERS               |   2 +
>> hw/remote/meson.build     |   1 +
>> 8 files changed, 169 insertions(+), 4 deletions(-)
>> create mode 100644 include/hw/remote/iommu.h
>> create mode 100644 hw/remote/iommu.c
>> 
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index 5c4016b995..f2fc2d5375 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -734,6 +734,8 @@ void lsi53c8xx_handle_legacy_cmdline(DeviceState *lsi_dev);
>> qemu_irq pci_allocate_irq(PCIDevice *pci_dev);
>> void pci_set_irq(PCIDevice *pci_dev, int level);
>> 
>> +void pci_init_bus_master(PCIDevice *pci_dev);
> 
> This function isn't used in this patch. Why make it public?

We were investigating updating the bus’s address space before the PCI device
initialized, but we dropped it as that would be an invasive change. This is cruft
from that effort, sorry - will remove.

> 
>> +
>> static inline void pci_irq_assert(PCIDevice *pci_dev)
>> {
>>     pci_set_irq(pci_dev, 1);
>> diff --git a/include/hw/remote/iommu.h b/include/hw/remote/iommu.h
>> new file mode 100644
>> index 0000000000..42ce0ca383
>> --- /dev/null
>> +++ b/include/hw/remote/iommu.h
>> @@ -0,0 +1,24 @@
>> +/*
>> + * IOMMU for remote device
>> + *
>> + * Copyright © 2021 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + *
>> + */
>> +
>> +#ifndef REMOTE_IOMMU_H
>> +#define REMOTE_IOMMU_H
>> +
>> +#include "hw/pci/pci_bus.h"
>> +
>> +void remote_iommu_free(PCIDevice *pci_dev);
>> +
>> +void remote_iommu_init(void);
>> +
>> +void remote_iommu_set(PCIBus *bus);
>> +
>> +MemoryRegion *remote_iommu_get_ram(PCIDevice *pci_dev);
>> +
>> +#endif
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 4a84e478ce..57d561cc03 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -95,7 +95,7 @@ static const VMStateDescription vmstate_pcibus = {
>>     }
>> };
>> 
>> -static void pci_init_bus_master(PCIDevice *pci_dev)
>> +void pci_init_bus_master(PCIDevice *pci_dev)
>> {
>>     AddressSpace *dma_as = pci_device_iommu_address_space(pci_dev);
>> 
>> diff --git a/hw/remote/iommu.c b/hw/remote/iommu.c
>> new file mode 100644
>> index 0000000000..30c866badb
>> --- /dev/null
>> +++ b/hw/remote/iommu.c
>> @@ -0,0 +1,117 @@
>> +/*
>> + * Remote IOMMU
>> + *
>> + * Copyright © 2021 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + *
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu-common.h"
>> +
>> +#include "hw/remote/iommu.h"
>> +#include "hw/pci/pci_bus.h"
>> +#include "exec/memory.h"
>> +#include "exec/address-spaces.h"
>> +#include "trace.h"
>> +
>> +struct VFUIOMMU {
>> +    AddressSpace  as;
>> +    MemoryRegion  mr;
> 
> I guess this is the root MemoryRegion container? Calling it "root" or
> "root_mr" instead of "mr" would make that clearer.

OK

> 
>> +};
>> +
>> +typedef struct VFUPciBus {
> 
> There is no uppercase/lowercase consistency between VfuObject vs
> VFUIOMMU vs VFUPciBus. Although the coding standard doesn't dictate ABC
> vs Abc, please be consistent. I suggest following the VfuObject
> convention started in the previous patches. The names would be VfuIommu
> and VfuPciBus.

Sounds good, thank you!

> 
>> +    PCIBus           *bus;
>> +    struct VFUIOMMU  *iommu[];
>> +} VFUPciBus;
>> +
>> +GHashTable *remote_as_table;
>> +
>> +static AddressSpace *remote_iommu_get_as(PCIBus *bus, void *opaque, int devfn)
>> +{
>> +    VFUPciBus *vfu_pci_bus = NULL;
>> +    struct VFUIOMMU *iommu = NULL;
>> +
>> +    if (!remote_as_table) {
>> +        return &address_space_memory;
>> +    }
>> +
>> +    vfu_pci_bus = g_hash_table_lookup(remote_as_table, bus);
>> +
>> +    if (!vfu_pci_bus) {
>> +        vfu_pci_bus = g_malloc0(sizeof(VFUPciBus));
>> +        vfu_pci_bus->bus = bus;
>> +        g_hash_table_insert(remote_as_table, bus, vfu_pci_bus);
>> +    }
>> +
>> +    iommu = vfu_pci_bus->iommu[devfn];
>> +
>> +    if (!iommu) {
>> +        g_autofree char *mr_name = g_strdup_printf("vfu-ram-%d", devfn);
>> +        g_autofree char *as_name = g_strdup_printf("vfu-as-%d", devfn);
>> +
>> +        iommu = g_malloc0(sizeof(struct VFUIOMMU));
>> +
>> +        memory_region_init(&iommu->mr, NULL, mr_name, UINT64_MAX);
>> +        address_space_init(&iommu->as, &iommu->mr, as_name);
>> +
>> +        vfu_pci_bus->iommu[devfn] = iommu;
>> +    }
>> +
>> +    return &iommu->as;
>> +}
>> +
>> +void remote_iommu_free(PCIDevice *pci_dev)
>> +{
>> +    VFUPciBus *vfu_pci_bus = NULL;
>> +    struct VFUIOMMU *iommu = NULL;
>> +
>> +    if (!remote_as_table) {
>> +        return;
>> +    }
>> +
>> +    vfu_pci_bus = g_hash_table_lookup(remote_as_table, pci_get_bus(pci_dev));
>> +
>> +    if (!vfu_pci_bus) {
>> +        return;
>> +    }
>> +
>> +    iommu = vfu_pci_bus->iommu[pci_dev->devfn];
>> +
>> +    vfu_pci_bus->iommu[pci_dev->devfn] = NULL;
>> +
>> +    if (iommu) {
>> +        memory_region_unref(&iommu->mr);
>> +        address_space_destroy(&iommu->as);
>> +        g_free(iommu);
>> +    }
>> +}
>> +
>> +void remote_iommu_init(void)
>> +{
>> +    remote_as_table = g_hash_table_new_full(NULL, NULL, NULL, NULL);
>> +}
>> +
>> +void remote_iommu_set(PCIBus *bus)
>> +{
>> +    pci_setup_iommu(bus, remote_iommu_get_as, NULL);
>> +}
>> +
>> +MemoryRegion *remote_iommu_get_ram(PCIDevice *pci_dev)
>> +{
>> +    PCIBus *bus = pci_get_bus(pci_dev);
>> +    VFUPciBus *vfu_pci_bus;
>> +
>> +    if (!remote_as_table) {
>> +        return get_system_memory();
>> +    }
>> +
>> +    vfu_pci_bus = g_hash_table_lookup(remote_as_table, bus);
>> +    if (!vfu_pci_bus) {
>> +        return get_system_memory();
>> +    }
>> +
>> +    return &vfu_pci_bus->iommu[pci_dev->devfn]->mr;
>> +}
>> diff --git a/hw/remote/machine.c b/hw/remote/machine.c
>> index 952105eab5..023be0491e 100644
>> --- a/hw/remote/machine.c
>> +++ b/hw/remote/machine.c
>> @@ -21,6 +21,7 @@
>> #include "qapi/error.h"
>> #include "hw/pci/pci_host.h"
>> #include "hw/remote/iohub.h"
>> +#include "hw/remote/iommu.h"
>> 
>> static void remote_machine_init(MachineState *machine)
>> {
>> @@ -52,6 +53,10 @@ static void remote_machine_init(MachineState *machine)
>> 
>>     remote_iohub_init(&s->iohub);
>> 
>> +    remote_iommu_init();
>> +
>> +    remote_iommu_set(pci_host->bus);
>> +
>>     pci_bus_irqs(pci_host->bus, remote_iohub_set_irq, remote_iohub_map_irq,
>>                  &s->iohub, REMOTE_IOHUB_NB_PIRQS);
>> }
>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>> index 9399e87cbe..ae375e69b9 100644
>> --- a/hw/remote/vfio-user-obj.c
>> +++ b/hw/remote/vfio-user-obj.c
>> @@ -49,6 +49,7 @@
>> #include "hw/qdev-core.h"
>> #include "hw/pci/pci.h"
>> #include "qemu/timer.h"
>> +#include "hw/remote/iommu.h"
>> 
>> #define TYPE_VFU_OBJECT "x-vfio-user-server"
>> OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
>> @@ -210,6 +211,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
>> 
>> static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
>> {
>> +    VfuObject *o = vfu_get_private(vfu_ctx);
>>     MemoryRegion *subregion = NULL;
>>     g_autofree char *name = NULL;
>>     static unsigned int suffix;
>> @@ -226,14 +228,15 @@ static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
>>     memory_region_init_ram_ptr(subregion, NULL, name,
>>                                iov->iov_len, info->vaddr);
>> 
>> -    memory_region_add_subregion(get_system_memory(), (hwaddr)iov->iov_base,
>> -                                subregion);
>> +    memory_region_add_subregion(remote_iommu_get_ram(o->pci_dev),
>> +                                (hwaddr)iov->iov_base, subregion);
>> 
>>     trace_vfu_dma_register((uint64_t)iov->iov_base, iov->iov_len);
>> }
>> 
>> static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
>> {
>> +    VfuObject *o = vfu_get_private(vfu_ctx);
>>     MemoryRegion *mr = NULL;
>>     ram_addr_t offset;
>> 
>> @@ -242,7 +245,7 @@ static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
>>         return;
>>     }
>> 
>> -    memory_region_del_subregion(get_system_memory(), mr);
>> +    memory_region_del_subregion(remote_iommu_get_ram(o->pci_dev), mr);
>> 
>>     object_unparent((OBJECT(mr)));
>> 
>> @@ -320,6 +323,7 @@ static vfu_region_access_cb_t *vfu_object_bar_handlers[PCI_NUM_REGIONS] = {
>>  */
>> static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
>> {
>> +    VfuObject *o = vfu_get_private(vfu_ctx);
>>     int i;
>> 
>>     for (i = 0; i < PCI_NUM_REGIONS; i++) {
>> @@ -332,6 +336,12 @@ static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
>>                          vfu_object_bar_handlers[i],
>>                          VFU_REGION_FLAG_RW, NULL, 0, -1, 0);
>> 
>> +        if ((o->pci_dev->io_regions[i].type & PCI_BASE_ADDRESS_SPACE) == 0) {
>> +            memory_region_unref(o->pci_dev->io_regions[i].address_space);
>> +            o->pci_dev->io_regions[i].address_space =
>> +                remote_iommu_get_ram(o->pci_dev);
>> +        }
> 
> This looks hacky. If you create a separate PCIHost for each device
> instead then the BARs will be created in the MemoryRegion (confusingly
> named "address_space" in the PCI code) of your choosing.

I was also not very comfortable with this - added it very grudgingly out of
necessity. Thank god this can go away with separate bus for each device.

> 
> Also, why is PCI Memory Space isolated via VFUIOMMU but PCI IO Space is
> not?

If I understand correctly, the IO address space translates sysmem address to
direct device access (such as I2C). Once we are inside a device, we already
have access to all parts of the device (unlike RAM which sits outside the device).
So didn’t think device would go via IOMMU to access IO. Also didn’t see any
other IOMMU translating IO address space accesses.

Thank you!
--
Jag


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 07/14] vfio-user: run vfio-user context
  2021-12-17 17:59     ` Jag Raman
@ 2021-12-20  8:29       ` Stefan Hajnoczi
  2021-12-21  3:04         ` Jag Raman
  2022-01-05 10:38       ` Thanos Makatos
  1 sibling, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-20  8:29 UTC (permalink / raw)
  To: Jag Raman
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	John Levon, alex.bennee, crosa, qemu-devel, wainersm,
	Alex Williamson, Marc-André Lureau, pbonzini,
	Thanos Makatos, Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 3360 bytes --]

On Fri, Dec 17, 2021 at 05:59:48PM +0000, Jag Raman wrote:
> 
> 
> > On Dec 16, 2021, at 6:17 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Wed, Dec 15, 2021 at 10:35:31AM -0500, Jagannathan Raman wrote:
> >> @@ -114,6 +118,62 @@ static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
> >>     vfu_object_init_ctx(o, errp);
> >> }
> >> 
> >> +static void vfu_object_ctx_run(void *opaque)
> >> +{
> >> +    VfuObject *o = opaque;
> >> +    int ret = -1;
> >> +
> >> +    while (ret != 0) {
> >> +        ret = vfu_run_ctx(o->vfu_ctx);
> >> +        if (ret < 0) {
> >> +            if (errno == EINTR) {
> >> +                continue;
> >> +            } else if (errno == ENOTCONN) {
> >> +                qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> >> +                o->vfu_poll_fd = -1;
> >> +                object_unparent(OBJECT(o));
> >> +                break;
> > 
> > If nothing else logs a message then I think that should be done here so
> > users know why their vfio-user server object disappeared.
> 
> Sure will do.
> 
> Do you prefer a trace, or a message to the console? Trace makes sense to me.
> Presently, the client could unplug the vfio-user device which would trigger the
> deletion of this object. This process could happen quietly.

If there is no way to differentiate graceful disconnect from unexpected
disconnect then logging might be too noisy.

Regarding the automatic deletion of the object, that might not be
desirable for two reasons:
1. It prevents reconnection or another client connecting.
2. Management tools are in the dark about it.

For #2 there are monitor events that QEMU emits to notify management
tools about state changes like disconnections.

It's worth thinking about current and future use cases before baking in
a policy like automatically deleting VfuObject on disconnect because
it's inflexible and would require a QEMU update in the future to support
a different policy.

One approach is to emit a disconnect event but leave the VfuObject in a
disconnected state. The management tool can then restart or clean up,
depending on its policy.

I'm not sure what's best because it depends on the use cases, but maybe
you and others have ideas here.

> >> @@ -208,6 +284,8 @@ static void vfu_object_init(Object *obj)
> >>                    TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
> >>         return;
> >>     }
> >> +
> >> +    o->vfu_poll_fd = -1;
> >> }
> > 
> > This must call qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL)
> > when o->vfu_poll_fd != -1 to avoid leaving a dangling fd handler
> > callback registered.
> 
> This is during the init phase, and the FD handlers are not set. Do you mean
> to add this at finalize?
> 
> I agree it’s good to explicitly add this at finalize. But vfu_destroy_ctx() should
> trigger a ENOTCONN, which would do it anyway.

I'm not sure my comment makes sense since this is the init function, not
finalize.

However, it's not clear to me that the o->vfu_poll_fd fd handler is
unregistered from the event loop when VfuObject is finalized (e.g. by
the object-del monitor command). You say vfu_destroy_ctx() triggers
ENOTCONN, but the VfuObject is freed after finalize returns so when is
the fd handler deregistered?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 11/14] vfio-user: IOMMU support for remote device
  2021-12-17 20:00     ` Jag Raman
@ 2021-12-20 14:36       ` Stefan Hajnoczi
  2021-12-21  4:32         ` Jag Raman
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2021-12-20 14:36 UTC (permalink / raw)
  To: Jag Raman
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	john.levon, Philippe Mathieu-Daudé,
	qemu-devel, wainersm, Alex Williamson, thanos.makatos,
	Marc-André Lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 5428 bytes --]

On Fri, Dec 17, 2021 at 08:00:35PM +0000, Jag Raman wrote:
> > On Dec 16, 2021, at 9:40 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Wed, Dec 15, 2021 at 10:35:35AM -0500, Jagannathan Raman wrote:
> >> Assign separate address space for each device in the remote processes.
> > 
> > If I understand correctly this isn't really an IOMMU. It's abusing the
> > IOMMU APIs to create isolated address spaces for each device. This way
> > memory regions added by the vfio-user client do not conflict when there
> > are multiple vfio-user servers.
> 
> Like you already figured out, having isolated DMA address space alone is not
> sufficient for this application, we also needed to isolate the sysmem/RAM address
> space. As such, the available IOMMU APIs alone were not sufficient, so we had
> to improvise.
> 
> > 
> > Calling pci_root_bus_new() and keeping one PCI bus per VfuObject might
> > be a cleaner approach:
> > - Lets you isolate both PCI Memory Space and IO Space.
> > - Isolates the PCIDevices and their addresses on the bus.
> > - Isolates irqs.
> > - No more need to abuse the IOMMU API.
> 
> I believe we would still need to have an IOMMU. It’s because, devices use the
> pci_dma_read()/_write() functions. These functions look up the address in DMA
> address space (via pci_get_address_space() -> PCIDevice->bus_master_as ->
> PCIDevice->bus_master_enable_region -> PCIDevice->bus_master_container_region).
>  bus_master_enable_region and bus_master_container_region are effectively aliases
> to the DMA address space - without an IOMMU, the dma_as would be the shared
> global sysmem/RAM space (address_space_mem, please see pci_init_bus_master())

Good point, that code assumes there is a global address space. Creating
a fake IOMMU works around that assumption but it seems cleaner to
eliminate it:

  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
  {
      ...
      if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
          return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
      }
      return &address_space_memory;
              ^^^^^^^^^^^^^^^^^^^^

When creating a PCI root bus an AddressSpace argument could be provided,
just like it already does for the address_space_memory and
address_space_io MemoryRegions. Then the hardcoded return can be
changed to something like:

  return bus->dma_address_space;

> >> @@ -332,6 +336,12 @@ static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
> >>                          vfu_object_bar_handlers[i],
> >>                          VFU_REGION_FLAG_RW, NULL, 0, -1, 0);
> >> 
> >> +        if ((o->pci_dev->io_regions[i].type & PCI_BASE_ADDRESS_SPACE) == 0) {
> >> +            memory_region_unref(o->pci_dev->io_regions[i].address_space);
> >> +            o->pci_dev->io_regions[i].address_space =
> >> +                remote_iommu_get_ram(o->pci_dev);
> >> +        }
> > 
> > This looks hacky. If you create a separate PCIHost for each device
> > instead then the BARs will be created in the MemoryRegion (confusingly
> > named "address_space" in the PCI code) of your choosing.
> 
> I was also not very comfortable with this - added it very grudgingly out of
> necessity. Thank god this can go away with separate bus for each device.

I talked to Kevin Wolf about having separate busses. qdev currently
requires each DeviceState to have a parent bus and each bus must have a
parent DeviceState. There is only one exception: a special check that
allows the global system bus (sysbus_get_default()) to be created
without a parent DeviceState.

This restriction probably needs to be loosened in order to support an
isolated PCIHost for each vfio-user server. The challenge is that
qdev_find_recursive() and monitor commands like device_add currently
only search the global system bus. Maybe new syntax is needed for the
multiple root bus case or the behavior of existing monitor commands
needs to be understood and extended without breaking anything.

> > 
> > Also, why is PCI Memory Space isolated via VFUIOMMU but PCI IO Space is
> > not?
> 
> If I understand correctly, the IO address space translates sysmem address to
> direct device access (such as I2C). Once we are inside a device, we already
> have access to all parts of the device (unlike RAM which sits outside the device).
> So didn’t think device would go via IOMMU to access IO. Also didn’t see any
> other IOMMU translating IO address space accesses.

I reviewed how BARs are configured with VFIO:

1. When the guest writes to the vfio-pci PCIDevice's Configuration Space
   the write is forwarded to the VFIO device (i.e. vfio-user or VFIO
   kernel ioctl).

2. The vfio-user server receives the Configuration Space write and
   forwards it to pci_dev (the PCIDevice we're serving up). BAR mappings
   are updated in the vfio-user server so the BAR MemoryRegions are
   mapped/unmapped at the locations given by the guest.

This applies for both Memory and IO Space accesses.

Because this patch series does not isolate IO Space between VfuObject
instances the MemoryRegions will collide when two guests map IO Space
BARs of different devices at the same IO Space address. In other words,
vfu_object_bar_rw() uses the global address_space_io and that means
collisions can occur.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 07/14] vfio-user: run vfio-user context
  2021-12-20  8:29       ` Stefan Hajnoczi
@ 2021-12-21  3:04         ` Jag Raman
  0 siblings, 0 replies; 49+ messages in thread
From: Jag Raman @ 2021-12-21  3:04 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	John Levon, alex.bennee, crosa, qemu-devel, wainersm,
	Alex Williamson, Marc-André Lureau, pbonzini,
	Thanos Makatos, Philippe Mathieu-Daudé



> On Dec 20, 2021, at 3:29 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Fri, Dec 17, 2021 at 05:59:48PM +0000, Jag Raman wrote:
>> 
>> 
>>> On Dec 16, 2021, at 6:17 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>> On Wed, Dec 15, 2021 at 10:35:31AM -0500, Jagannathan Raman wrote:
>>>> @@ -114,6 +118,62 @@ static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
>>>>    vfu_object_init_ctx(o, errp);
>>>> }
>>>> 
>>>> +static void vfu_object_ctx_run(void *opaque)
>>>> +{
>>>> +    VfuObject *o = opaque;
>>>> +    int ret = -1;
>>>> +
>>>> +    while (ret != 0) {
>>>> +        ret = vfu_run_ctx(o->vfu_ctx);
>>>> +        if (ret < 0) {
>>>> +            if (errno == EINTR) {
>>>> +                continue;
>>>> +            } else if (errno == ENOTCONN) {
>>>> +                qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
>>>> +                o->vfu_poll_fd = -1;
>>>> +                object_unparent(OBJECT(o));
>>>> +                break;
>>> 
>>> If nothing else logs a message then I think that should be done here so
>>> users know why their vfio-user server object disappeared.
>> 
>> Sure will do.
>> 
>> Do you prefer a trace, or a message to the console? Trace makes sense to me.
>> Presently, the client could unplug the vfio-user device which would trigger the
>> deletion of this object. This process could happen quietly.
> 
> If there is no way to differentiate graceful disconnect from unexpected
> disconnect then logging might be too noisy.

I think that’s what happens in the regular VFIO case also.
vfio_put_base_device() closes the FD used for ioctls.

> 
> Regarding the automatic deletion of the object, that might not be
> desirable for two reasons:
> 1. It prevents reconnection or another client connecting.
> 2. Management tools are in the dark about it.
> 
> For #2 there are monitor events that QEMU emits to notify management
> tools about state changes like disconnections.

This is very interesting. I suppose you’re referring to something like
‘BLOCK_JOB_COMPLETED’ event.

It’d be good to inform the management tools about disconnection. Not
used this before, will check it out to gather ideas on how to use it.

> 
> It's worth thinking about current and future use cases before baking in
> a policy like automatically deleting VfuObject on disconnect because
> it's inflexible and would require a QEMU update in the future to support
> a different policy.
> 
> One approach is to emit a disconnect event but leave the VfuObject in a
> disconnected state. The management tool can then restart or clean up,
> depending on its policy.
> 
> I'm not sure what's best because it depends on the use cases, but maybe
> you and others have ideas here.
> 
>>>> @@ -208,6 +284,8 @@ static void vfu_object_init(Object *obj)
>>>>                   TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
>>>>        return;
>>>>    }
>>>> +
>>>> +    o->vfu_poll_fd = -1;
>>>> }
>>> 
>>> This must call qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL)
>>> when o->vfu_poll_fd != -1 to avoid leaving a dangling fd handler
>>> callback registered.
>> 
>> This is during the init phase, and the FD handlers are not set. Do you mean
>> to add this at finalize?
>> 
>> I agree it’s good to explicitly add this at finalize. But vfu_destroy_ctx() should
>> trigger a ENOTCONN, which would do it anyway.
> 
> I'm not sure my comment makes sense since this is the init function, not
> finalize.
> 
> However, it's not clear to me that the o->vfu_poll_fd fd handler is
> unregistered from the event loop when VfuObject is finalized (e.g. by
> the object-del monitor command). You say vfu_destroy_ctx() triggers
> ENOTCONN, but the VfuObject is freed after finalize returns so when is
> the fd handler deregistered?

That is correct - will remove the FD handler in finalize also.

Thank you!
--
Jag

> 
> Stefan


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 11/14] vfio-user: IOMMU support for remote device
  2021-12-20 14:36       ` Stefan Hajnoczi
@ 2021-12-21  4:32         ` Jag Raman
  2022-01-06 13:10           ` Stefan Hajnoczi
  0 siblings, 1 reply; 49+ messages in thread
From: Jag Raman @ 2021-12-21  4:32 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	john.levon, Philippe Mathieu-Daudé,
	qemu-devel, wainersm, Alex Williamson, thanos.makatos,
	Marc-André Lureau, crosa, pbonzini, alex.bennee



> On Dec 20, 2021, at 9:36 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Fri, Dec 17, 2021 at 08:00:35PM +0000, Jag Raman wrote:
>>> On Dec 16, 2021, at 9:40 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>> On Wed, Dec 15, 2021 at 10:35:35AM -0500, Jagannathan Raman wrote:
>>>> Assign separate address space for each device in the remote processes.
>>> 
>>> If I understand correctly this isn't really an IOMMU. It's abusing the
>>> IOMMU APIs to create isolated address spaces for each device. This way
>>> memory regions added by the vfio-user client do not conflict when there
>>> are multiple vfio-user servers.
>> 
>> Like you already figured out, having isolated DMA address space alone is not
>> sufficient for this application, we also needed to isolate the sysmem/RAM address
>> space. As such, the available IOMMU APIs alone were not sufficient, so we had
>> to improvise.
>> 
>>> 
>>> Calling pci_root_bus_new() and keeping one PCI bus per VfuObject might
>>> be a cleaner approach:
>>> - Lets you isolate both PCI Memory Space and IO Space.
>>> - Isolates the PCIDevices and their addresses on the bus.
>>> - Isolates irqs.
>>> - No more need to abuse the IOMMU API.
>> 
>> I believe we would still need to have an IOMMU. It’s because, devices use the
>> pci_dma_read()/_write() functions. These functions look up the address in DMA
>> address space (via pci_get_address_space() -> PCIDevice->bus_master_as ->
>> PCIDevice->bus_master_enable_region -> PCIDevice->bus_master_container_region).
>> bus_master_enable_region and bus_master_container_region are effectively aliases
>> to the DMA address space - without an IOMMU, the dma_as would be the shared
>> global sysmem/RAM space (address_space_mem, please see pci_init_bus_master())
> 
> Good point, that code assumes there is a global address space. Creating
> a fake IOMMU works around that assumption but it seems cleaner to
> eliminate it:
> 
>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>  {
>      ...
>      if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
>          return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>      }
>      return &address_space_memory;
>              ^^^^^^^^^^^^^^^^^^^^
> 
> When creating a PCI root bus an AddressSpace argument could be provided,
> just like it already does for the address_space_memory and
> address_space_io MemoryRegions. Then the hardcoded return can be
> changed to something like:
> 
>  return bus->dma_address_space;

This approach should work when we are using separate PCIBus for each PCIDevice.

> 
>>>> @@ -332,6 +336,12 @@ static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
>>>>                         vfu_object_bar_handlers[i],
>>>>                         VFU_REGION_FLAG_RW, NULL, 0, -1, 0);
>>>> 
>>>> +        if ((o->pci_dev->io_regions[i].type & PCI_BASE_ADDRESS_SPACE) == 0) {
>>>> +            memory_region_unref(o->pci_dev->io_regions[i].address_space);
>>>> +            o->pci_dev->io_regions[i].address_space =
>>>> +                remote_iommu_get_ram(o->pci_dev);
>>>> +        }
>>> 
>>> This looks hacky. If you create a separate PCIHost for each device
>>> instead then the BARs will be created in the MemoryRegion (confusingly
>>> named "address_space" in the PCI code) of your choosing.
>> 
>> I was also not very comfortable with this - added it very grudgingly out of
>> necessity. Thank god this can go away with separate bus for each device.
> 
> I talked to Kevin Wolf about having separate busses. qdev currently
> requires each DeviceState to have a parent bus and each bus must have a
> parent DeviceState. There is only one exception: a special check that
> allows the global system bus (sysbus_get_default()) to be created
> without a parent DeviceState.
> 
> This restriction probably needs to be loosened in order to support an
> isolated PCIHost for each vfio-user server. The challenge is that
> qdev_find_recursive() and monitor commands like device_add currently
> only search the global system bus. Maybe new syntax is needed for the
> multiple root bus case or the behavior of existing monitor commands
> needs to be understood and extended without breaking anything.

Lemme check if it’s possible to create multiple PCIBuses within the global
system bus, something similar to what PCI expansion cards are doing. That
would help avoid the complexities you just mentioned.
> 
>>> 
>>> Also, why is PCI Memory Space isolated via VFUIOMMU but PCI IO Space is
>>> not?
>> 
>> If I understand correctly, the IO address space translates sysmem address to
>> direct device access (such as I2C). Once we are inside a device, we already
>> have access to all parts of the device (unlike RAM which sits outside the device).
>> So didn’t think device would go via IOMMU to access IO. Also didn’t see any
>> other IOMMU translating IO address space accesses.
> 
> I reviewed how BARs are configured with VFIO:
> 
> 1. When the guest writes to the vfio-pci PCIDevice's Configuration Space
>   the write is forwarded to the VFIO device (i.e. vfio-user or VFIO
>   kernel ioctl).
> 
> 2. The vfio-user server receives the Configuration Space write and
>   forwards it to pci_dev (the PCIDevice we're serving up). BAR mappings
>   are updated in the vfio-user server so the BAR MemoryRegions are
>   mapped/unmapped at the locations given by the guest.
> 
> This applies for both Memory and IO Space accesses.
> 
> Because this patch series does not isolate IO Space between VfuObject
> instances the MemoryRegions will collide when two guests map IO Space
> BARs of different devices at the same IO Space address. In other words,
> vfu_object_bar_rw() uses the global address_space_io and that means
> collisions can occur.

I agree that collision could occur from the CPU end. But I'm not if IOMMU
needs to translate IO space.

Thank you!
--
Jag

> 
> Stefan


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH v4 07/14] vfio-user: run vfio-user context
  2021-12-17 17:59     ` Jag Raman
  2021-12-20  8:29       ` Stefan Hajnoczi
@ 2022-01-05 10:38       ` Thanos Makatos
  2022-01-06 13:35         ` Stefan Hajnoczi
  1 sibling, 1 reply; 49+ messages in thread
From: Thanos Makatos @ 2022-01-05 10:38 UTC (permalink / raw)
  To: Jag Raman, Stefan Hajnoczi, John Levon
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, Swapnil Ingle,
	John Levon, Philippe Mathieu-Daudé,
	qemu-devel, wainersm, Alex Williamson, Marc-André Lureau,
	crosa, pbonzini, alex.bennee



> -----Original Message-----
> From: Jag Raman <jag.raman@oracle.com>
> Sent: 17 December 2021 18:00
> To: Stefan Hajnoczi <stefanha@redhat.com>; John Levon
> <john.levon@nutanix.com>; Thanos Makatos <thanos.makatos@nutanix.com>
> Cc: qemu-devel <qemu-devel@nongnu.org>; Alex Williamson
> <alex.williamson@redhat.com>; Marc-André Lureau
> <marcandre.lureau@gmail.com>; Philippe Mathieu-Daudé
> <philmd@redhat.com>; pbonzini@redhat.com; alex.bennee@linaro.org;
> thuth@redhat.com; crosa@redhat.com; wainersm@redhat.com;
> bleal@redhat.com; Elena Ufimtseva <elena.ufimtseva@oracle.com>; John
> Levon <john.levon@nutanix.com>; John Johnson
> <john.g.johnson@oracle.com>; Thanos Makatos
> <thanos.makatos@nutanix.com>; Swapnil Ingle <swapnil.ingle@nutanix.com>
> Subject: Re: [PATCH v4 07/14] vfio-user: run vfio-user context
> 
> 
> 
> > On Dec 16, 2021, at 6:17 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Wed, Dec 15, 2021 at 10:35:31AM -0500, Jagannathan Raman wrote:
> >> @@ -114,6 +118,62 @@ static void vfu_object_set_device(Object *obj,
> const char *str, Error **errp)
> >>     vfu_object_init_ctx(o, errp);
> >> }
> >>
> >> +static void vfu_object_ctx_run(void *opaque)
> >> +{
> >> +    VfuObject *o = opaque;
> >> +    int ret = -1;
> >> +
> >> +    while (ret != 0) {
> >> +        ret = vfu_run_ctx(o->vfu_ctx);
> >> +        if (ret < 0) {
> >> +            if (errno == EINTR) {
> >> +                continue;
> >> +            } else if (errno == ENOTCONN) {
> >> +                qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> >> +                o->vfu_poll_fd = -1;
> >> +                object_unparent(OBJECT(o));
> >> +                break;
> >
> > If nothing else logs a message then I think that should be done here so
> > users know why their vfio-user server object disappeared.
> 
> Sure will do.
> 
> Do you prefer a trace, or a message to the console? Trace makes sense to me.
> Presently, the client could unplug the vfio-user device which would trigger the
> deletion of this object. This process could happen quietly.
> 
> >
> >> +            } else {
> >> +                error_setg(&error_abort, "vfu: Failed to run device %s - %s",
> >> +                           o->device, strerror(errno));
> >
> > error_abort is equivalent to assuming !o->daemon. In the case where the
> > user doesn't want to automatically shut down the process we need to log
> > a message without aborting.
> 
> OK, makes sense.
> 
> >
> >> +                 break;
> >
> > Indentation is off.
> >
> >> +            }
> >> +        }
> >> +    }
> >> +}
> >> +
> >> +static void vfu_object_attach_ctx(void *opaque)
> >> +{
> >> +    VfuObject *o = opaque;
> >> +    GPollFD pfds[1];
> >> +    int ret;
> >> +
> >> +    qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> >> +
> >> +    pfds[0].fd = o->vfu_poll_fd;
> >> +    pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> >> +
> >> +retry_attach:
> >> +    ret = vfu_attach_ctx(o->vfu_ctx);
> >> +    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
> >> +        qemu_poll_ns(pfds, 1, 500 * (int64_t)SCALE_MS);
> >> +        goto retry_attach;
> >
> > This can block the thread indefinitely. Other events like monitor
> > commands are not handled in this loop. Please make this asynchronous
> > (set an fd handler and return from this function so we can try again
> > later).
> >
> > The vfu_attach_ctx() implementation synchronously negotiates the
> > vfio-user connection :(. That's a shame because even if accept(2) is
> > handled asynchronously, the negotiation can still block. It would be
> > cleanest to have a fully async libvfio-user's vfu_attach_ctx() API to
> > avoid blocking. Is that possible?
> 
> Thanos / John,
> 
>     Any thoughts on this?

I'm discussing this with John and FYI there are other places where libvfio-user can block, e.g. sending a response or receiving a command. Is it just the negotiation you want it to be asynchronous or _all_ libvfio-user operations? Making libvfio-user fully asynchronous might require a substantial API rewrite.

> 
> >
> > If libvfio-user can't make vfu_attach_ctx() fully async then it may be
> > possible to spawn a thread just for the blocking vfu_attach_ctx() call
> > and report the result back to the event loop (e.g. using EventNotifier).
> > That adds a bunch of code to work around a blocking API though, so I
> > guess we can leave the blocking part if necessary.
> >
> > At the very minimum, please make EAGAIN/EWOULDBLOCK async as
> mentioned
> > above and add a comment explaining the situation with the
> > partially-async vfu_attach_ctx() API so it's clear that this can block
> > (that way it's clear that you're aware of the issue and this isn't by
> > accident).
> 
> Sure, we could make the attach async at QEMU depending on how the
> library prefers to do this.
> 
> >
> >> +    } else if (ret < 0) {
> >> +        error_setg(&error_abort,
> >> +                   "vfu: Failed to attach device %s to context - %s",
> >> +                   o->device, strerror(errno));
> >
> > error_abort assumes !o->daemon. Please handle the o->daemon == true
> > case by logging an error without aborting.
> >
> >> +        return;
> >> +    }
> >> +
> >> +    o->vfu_poll_fd = vfu_get_poll_fd(o->vfu_ctx);
> >> +    if (o->vfu_poll_fd < 0) {
> >> +        error_setg(&error_abort, "vfu: Failed to get poll fd %s", o->device);
> >
> > Same here.
> >
> >> @@ -208,6 +284,8 @@ static void vfu_object_init(Object *obj)
> >>                    TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
> >>         return;
> >>     }
> >> +
> >> +    o->vfu_poll_fd = -1;
> >> }
> >
> > This must call qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL)
> > when o->vfu_poll_fd != -1 to avoid leaving a dangling fd handler
> > callback registered.
> 
> This is during the init phase, and the FD handlers are not set. Do you mean
> to add this at finalize?
> 
> I agree it’s good to explicitly add this at finalize. But vfu_destroy_ctx() should
> trigger a ENOTCONN, which would do it anyway.
> 
> Thank you!
> --
> Jag


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 11/14] vfio-user: IOMMU support for remote device
  2021-12-21  4:32         ` Jag Raman
@ 2022-01-06 13:10           ` Stefan Hajnoczi
  0 siblings, 0 replies; 49+ messages in thread
From: Stefan Hajnoczi @ 2022-01-06 13:10 UTC (permalink / raw)
  To: Jag Raman
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, swapnil.ingle,
	john.levon, Philippe Mathieu-Daudé,
	qemu-devel, wainersm, Alex Williamson, thanos.makatos,
	Marc-André Lureau, crosa, pbonzini, alex.bennee

[-- Attachment #1: Type: text/plain, Size: 2263 bytes --]

On Tue, Dec 21, 2021 at 04:32:05AM +0000, Jag Raman wrote:
> > On Dec 20, 2021, at 9:36 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Fri, Dec 17, 2021 at 08:00:35PM +0000, Jag Raman wrote:
> >>> On Dec 16, 2021, at 9:40 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>> On Wed, Dec 15, 2021 at 10:35:35AM -0500, Jagannathan Raman wrote:
> >>> Also, why is PCI Memory Space isolated via VFUIOMMU but PCI IO Space is
> >>> not?
> >> 
> >> If I understand correctly, the IO address space translates sysmem address to
> >> direct device access (such as I2C). Once we are inside a device, we already
> >> have access to all parts of the device (unlike RAM which sits outside the device).
> >> So didn’t think device would go via IOMMU to access IO. Also didn’t see any
> >> other IOMMU translating IO address space accesses.
> > 
> > I reviewed how BARs are configured with VFIO:
> > 
> > 1. When the guest writes to the vfio-pci PCIDevice's Configuration Space
> >   the write is forwarded to the VFIO device (i.e. vfio-user or VFIO
> >   kernel ioctl).
> > 
> > 2. The vfio-user server receives the Configuration Space write and
> >   forwards it to pci_dev (the PCIDevice we're serving up). BAR mappings
> >   are updated in the vfio-user server so the BAR MemoryRegions are
> >   mapped/unmapped at the locations given by the guest.
> > 
> > This applies for both Memory and IO Space accesses.
> > 
> > Because this patch series does not isolate IO Space between VfuObject
> > instances the MemoryRegions will collide when two guests map IO Space
> > BARs of different devices at the same IO Space address. In other words,
> > vfu_object_bar_rw() uses the global address_space_io and that means
> > collisions can occur.
> 
> I agree that collision could occur from the CPU end. But I'm not if IOMMU
> needs to translate IO space.

QEMU's IOMMUs do not translate IO Space addresses AFAIK.

IO Space just needs to be isolated between vfio-user server instances so
there is no collision when one client maps an IO Space BAR to the same
address as another client.

I think the cleanest way of achieving that is by creating a
per-vfio-user server PCI bus with an address_space_io MemoryRegion.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 07/14] vfio-user: run vfio-user context
  2022-01-05 10:38       ` Thanos Makatos
@ 2022-01-06 13:35         ` Stefan Hajnoczi
  2022-01-10 17:56           ` John Levon
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2022-01-06 13:35 UTC (permalink / raw)
  To: Thanos Makatos
  Cc: Elena Ufimtseva, John Johnson, thuth, Jag Raman, bleal,
	Swapnil Ingle, John Levon, alex.bennee, qemu-devel, wainersm,
	Alex Williamson, Marc-André Lureau, crosa, pbonzini,
	Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 5805 bytes --]

On Wed, Jan 05, 2022 at 10:38:10AM +0000, Thanos Makatos wrote:
> 
> 
> > -----Original Message-----
> > From: Jag Raman <jag.raman@oracle.com>
> > Sent: 17 December 2021 18:00
> > To: Stefan Hajnoczi <stefanha@redhat.com>; John Levon
> > <john.levon@nutanix.com>; Thanos Makatos <thanos.makatos@nutanix.com>
> > Cc: qemu-devel <qemu-devel@nongnu.org>; Alex Williamson
> > <alex.williamson@redhat.com>; Marc-André Lureau
> > <marcandre.lureau@gmail.com>; Philippe Mathieu-Daudé
> > <philmd@redhat.com>; pbonzini@redhat.com; alex.bennee@linaro.org;
> > thuth@redhat.com; crosa@redhat.com; wainersm@redhat.com;
> > bleal@redhat.com; Elena Ufimtseva <elena.ufimtseva@oracle.com>; John
> > Levon <john.levon@nutanix.com>; John Johnson
> > <john.g.johnson@oracle.com>; Thanos Makatos
> > <thanos.makatos@nutanix.com>; Swapnil Ingle <swapnil.ingle@nutanix.com>
> > Subject: Re: [PATCH v4 07/14] vfio-user: run vfio-user context
> > 
> > 
> > 
> > > On Dec 16, 2021, at 6:17 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Wed, Dec 15, 2021 at 10:35:31AM -0500, Jagannathan Raman wrote:
> > >> @@ -114,6 +118,62 @@ static void vfu_object_set_device(Object *obj,
> > const char *str, Error **errp)
> > >>     vfu_object_init_ctx(o, errp);
> > >> }
> > >>
> > >> +static void vfu_object_ctx_run(void *opaque)
> > >> +{
> > >> +    VfuObject *o = opaque;
> > >> +    int ret = -1;
> > >> +
> > >> +    while (ret != 0) {
> > >> +        ret = vfu_run_ctx(o->vfu_ctx);
> > >> +        if (ret < 0) {
> > >> +            if (errno == EINTR) {
> > >> +                continue;
> > >> +            } else if (errno == ENOTCONN) {
> > >> +                qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> > >> +                o->vfu_poll_fd = -1;
> > >> +                object_unparent(OBJECT(o));
> > >> +                break;
> > >
> > > If nothing else logs a message then I think that should be done here so
> > > users know why their vfio-user server object disappeared.
> > 
> > Sure will do.
> > 
> > Do you prefer a trace, or a message to the console? Trace makes sense to me.
> > Presently, the client could unplug the vfio-user device which would trigger the
> > deletion of this object. This process could happen quietly.
> > 
> > >
> > >> +            } else {
> > >> +                error_setg(&error_abort, "vfu: Failed to run device %s - %s",
> > >> +                           o->device, strerror(errno));
> > >
> > > error_abort is equivalent to assuming !o->daemon. In the case where the
> > > user doesn't want to automatically shut down the process we need to log
> > > a message without aborting.
> > 
> > OK, makes sense.
> > 
> > >
> > >> +                 break;
> > >
> > > Indentation is off.
> > >
> > >> +            }
> > >> +        }
> > >> +    }
> > >> +}
> > >> +
> > >> +static void vfu_object_attach_ctx(void *opaque)
> > >> +{
> > >> +    VfuObject *o = opaque;
> > >> +    GPollFD pfds[1];
> > >> +    int ret;
> > >> +
> > >> +    qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> > >> +
> > >> +    pfds[0].fd = o->vfu_poll_fd;
> > >> +    pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> > >> +
> > >> +retry_attach:
> > >> +    ret = vfu_attach_ctx(o->vfu_ctx);
> > >> +    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
> > >> +        qemu_poll_ns(pfds, 1, 500 * (int64_t)SCALE_MS);
> > >> +        goto retry_attach;
> > >
> > > This can block the thread indefinitely. Other events like monitor
> > > commands are not handled in this loop. Please make this asynchronous
> > > (set an fd handler and return from this function so we can try again
> > > later).
> > >
> > > The vfu_attach_ctx() implementation synchronously negotiates the
> > > vfio-user connection :(. That's a shame because even if accept(2) is
> > > handled asynchronously, the negotiation can still block. It would be
> > > cleanest to have a fully async libvfio-user's vfu_attach_ctx() API to
> > > avoid blocking. Is that possible?
> > 
> > Thanos / John,
> > 
> >     Any thoughts on this?
> 
> I'm discussing this with John and FYI there are other places where libvfio-user can block, e.g. sending a response or receiving a command. Is it just the negotiation you want it to be asynchronous or _all_ libvfio-user operations? Making libvfio-user fully asynchronous might require a substantial API rewrite.

I see at least two reasons for a fully async API:

1. The program wants to handle other events (e.g. a management REST API)
   from the same event loop thread that invokes libvfio-user. If
   libvfio-user blocks then the other events cannot be handled within a
   reasonable time frame.

   The workaround for this is to use multi-threading and ignore the
   event-driven architecture implied by vfu_get_poll_fd().

2. The program handles multiple clients that do not trust each other.
   This could be a software-defined network switch or storage appliance.
   A malicious client can cause a denial-of-service by making a
   libvfio-user call block.

   Again, the program needs separate threads instead of an event loop to
   work around this.

The downside to a sync approach is that programs that already have an
event loop require extra code to set up dedicated threads for
libvfio-user. That's a library integration/usability issue.

In some cases it's okay to block: when the program doesn't need to
handle other events. If most users of libvfio-user are expected to fall
into this category then there's no need to change the API.

Either way, the doc comments in libvfio-user.h aren't very clear.
Someone integrating this library may think vfu_get_poll_fd() allows for
fully async operation.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 07/14] vfio-user: run vfio-user context
  2022-01-06 13:35         ` Stefan Hajnoczi
@ 2022-01-10 17:56           ` John Levon
  2022-01-11  9:36             ` Stefan Hajnoczi
  0 siblings, 1 reply; 49+ messages in thread
From: John Levon @ 2022-01-10 17:56 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, Jag Raman, bleal,
	Swapnil Ingle, alex.bennee, crosa, qemu-devel, wainersm,
	Alex Williamson, Marc-André Lureau, pbonzini,
	Thanos Makatos, Philippe Mathieu-Daudé

On Thu, Jan 06, 2022 at 01:35:32PM +0000, Stefan Hajnoczi wrote:

> > > >> +static void vfu_object_attach_ctx(void *opaque)
> > > >> +{
> > > >> +    VfuObject *o = opaque;
> > > >> +    GPollFD pfds[1];
> > > >> +    int ret;
> > > >> +
> > > >> +    qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> > > >> +
> > > >> +    pfds[0].fd = o->vfu_poll_fd;
> > > >> +    pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> > > >> +
> > > >> +retry_attach:
> > > >> +    ret = vfu_attach_ctx(o->vfu_ctx);
> > > >> +    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
> > > >> +        qemu_poll_ns(pfds, 1, 500 * (int64_t)SCALE_MS);
> > > >> +        goto retry_attach;
> > > >
> > > > This can block the thread indefinitely. Other events like monitor
> > > > commands are not handled in this loop. Please make this asynchronous
> > > > (set an fd handler and return from this function so we can try again
> > > > later).
> > > >
> > > > The vfu_attach_ctx() implementation synchronously negotiates the
> > > > vfio-user connection :(. That's a shame because even if accept(2) is
> > > > handled asynchronously, the negotiation can still block. It would be
> > > > cleanest to have a fully async libvfio-user's vfu_attach_ctx() API to
> > > > avoid blocking. Is that possible?
> > > 
> > > Thanos / John,
> > > 
> > >     Any thoughts on this?
> > 
> > I'm discussing this with John and FYI there are other places where libvfio-user can block, e.g. sending a response or receiving a command. Is it just the negotiation you want it to be asynchronous or _all_ libvfio-user operations? Making libvfio-user fully asynchronous might require a substantial API rewrite.
> 
> I see at least two reasons for a fully async API:
> 
> 1. The program wants to handle other events (e.g. a management REST API)
>    from the same event loop thread that invokes libvfio-user. If
>    libvfio-user blocks then the other events cannot be handled within a
>    reasonable time frame.
> 
>    The workaround for this is to use multi-threading and ignore the
>    event-driven architecture implied by vfu_get_poll_fd().
> 
> 2. The program handles multiple clients that do not trust each other.
>    This could be a software-defined network switch or storage appliance.
>    A malicious client can cause a denial-of-service by making a
>    libvfio-user call block.
> 
>    Again, the program needs separate threads instead of an event loop to
>    work around this.

Hi Stefan, we're certainly aware of the rationale. Ultimately we just haven't
yet found resources to fix this: in practice, we don't really hit problems from
the parts that are still synchronous. Of course, that's not a good argument when
it comes to your second issue, and indeed the library is not currently suitable
for multiple untrusted clients.

For Jag but: patches are welcome. But it's not just negotiate(): all sorts of
things are potentially synchronous - for example, it's expected that the region
read/write callbacks are done synchronously. Any other client reply, or
server->client message, is also synchronous.

It's partly why we haven't yet stabilised the API.

regards
john

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 07/14] vfio-user: run vfio-user context
  2022-01-10 17:56           ` John Levon
@ 2022-01-11  9:36             ` Stefan Hajnoczi
  2022-01-11 13:12               ` Jag Raman
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2022-01-11  9:36 UTC (permalink / raw)
  To: John Levon
  Cc: Elena Ufimtseva, John Johnson, thuth, Jag Raman, bleal,
	Swapnil Ingle, alex.bennee, crosa, qemu-devel, wainersm,
	Alex Williamson, Marc-André Lureau, pbonzini,
	Thanos Makatos, Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 4119 bytes --]

On Mon, Jan 10, 2022 at 05:56:25PM +0000, John Levon wrote:
> On Thu, Jan 06, 2022 at 01:35:32PM +0000, Stefan Hajnoczi wrote:
> 
> > > > >> +static void vfu_object_attach_ctx(void *opaque)
> > > > >> +{
> > > > >> +    VfuObject *o = opaque;
> > > > >> +    GPollFD pfds[1];
> > > > >> +    int ret;
> > > > >> +
> > > > >> +    qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> > > > >> +
> > > > >> +    pfds[0].fd = o->vfu_poll_fd;
> > > > >> +    pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> > > > >> +
> > > > >> +retry_attach:
> > > > >> +    ret = vfu_attach_ctx(o->vfu_ctx);
> > > > >> +    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
> > > > >> +        qemu_poll_ns(pfds, 1, 500 * (int64_t)SCALE_MS);
> > > > >> +        goto retry_attach;
> > > > >
> > > > > This can block the thread indefinitely. Other events like monitor
> > > > > commands are not handled in this loop. Please make this asynchronous
> > > > > (set an fd handler and return from this function so we can try again
> > > > > later).
> > > > >
> > > > > The vfu_attach_ctx() implementation synchronously negotiates the
> > > > > vfio-user connection :(. That's a shame because even if accept(2) is
> > > > > handled asynchronously, the negotiation can still block. It would be
> > > > > cleanest to have a fully async libvfio-user's vfu_attach_ctx() API to
> > > > > avoid blocking. Is that possible?
> > > > 
> > > > Thanos / John,
> > > > 
> > > >     Any thoughts on this?
> > > 
> > > I'm discussing this with John and FYI there are other places where libvfio-user can block, e.g. sending a response or receiving a command. Is it just the negotiation you want it to be asynchronous or _all_ libvfio-user operations? Making libvfio-user fully asynchronous might require a substantial API rewrite.
> > 
> > I see at least two reasons for a fully async API:
> > 
> > 1. The program wants to handle other events (e.g. a management REST API)
> >    from the same event loop thread that invokes libvfio-user. If
> >    libvfio-user blocks then the other events cannot be handled within a
> >    reasonable time frame.
> > 
> >    The workaround for this is to use multi-threading and ignore the
> >    event-driven architecture implied by vfu_get_poll_fd().
> > 
> > 2. The program handles multiple clients that do not trust each other.
> >    This could be a software-defined network switch or storage appliance.
> >    A malicious client can cause a denial-of-service by making a
> >    libvfio-user call block.
> > 
> >    Again, the program needs separate threads instead of an event loop to
> >    work around this.
> 
> Hi Stefan, we're certainly aware of the rationale. Ultimately we just haven't
> yet found resources to fix this: in practice, we don't really hit problems from
> the parts that are still synchronous. Of course, that's not a good argument when
> it comes to your second issue, and indeed the library is not currently suitable
> for multiple untrusted clients.
> 
> For Jag but: patches are welcome. But it's not just negotiate(): all sorts of
> things are potentially synchronous - for example, it's expected that the region
> read/write callbacks are done synchronously. Any other client reply, or
> server->client message, is also synchronous.
> 
> It's partly why we haven't yet stabilised the API.

From the QEMU side it's okay to have blocking code in this experimental
feature if users are aware of the limitation (e.g. the monitor is
unresponsive if vfio-user blocks) and multiple untrusted clients are not
supported. The QEMU code should also make its limitations clear so
readers are aware of them and know the author chose this approach
intentionally rather than due to an oversight.

Jag: Please mention this in the documentation and add a comment to
vfu_object_attach_ctx() explaining that this code currently blocks.

I think in the long run it will be necessary to address it, e.g. in the
multi-client case. That can either be done with threads or by making
libvfio-user fully async.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 07/14] vfio-user: run vfio-user context
  2022-01-11  9:36             ` Stefan Hajnoczi
@ 2022-01-11 13:12               ` Jag Raman
  0 siblings, 0 replies; 49+ messages in thread
From: Jag Raman @ 2022-01-11 13:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John Johnson, thuth, bleal, Swapnil Ingle,
	John Levon, alex.bennee, crosa, qemu-devel, wainersm,
	Alex Williamson, Marc-André Lureau, pbonzini,
	Thanos Makatos, Philippe Mathieu-Daudé



> On Jan 11, 2022, at 4:36 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Mon, Jan 10, 2022 at 05:56:25PM +0000, John Levon wrote:
>> On Thu, Jan 06, 2022 at 01:35:32PM +0000, Stefan Hajnoczi wrote:
>> 
>>>>>>> +static void vfu_object_attach_ctx(void *opaque)
>>>>>>> +{
>>>>>>> +    VfuObject *o = opaque;
>>>>>>> +    GPollFD pfds[1];
>>>>>>> +    int ret;
>>>>>>> +
>>>>>>> +    qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
>>>>>>> +
>>>>>>> +    pfds[0].fd = o->vfu_poll_fd;
>>>>>>> +    pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>>>>>>> +
>>>>>>> +retry_attach:
>>>>>>> +    ret = vfu_attach_ctx(o->vfu_ctx);
>>>>>>> +    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
>>>>>>> +        qemu_poll_ns(pfds, 1, 500 * (int64_t)SCALE_MS);
>>>>>>> +        goto retry_attach;
>>>>>> 
>>>>>> This can block the thread indefinitely. Other events like monitor
>>>>>> commands are not handled in this loop. Please make this asynchronous
>>>>>> (set an fd handler and return from this function so we can try again
>>>>>> later).
>>>>>> 
>>>>>> The vfu_attach_ctx() implementation synchronously negotiates the
>>>>>> vfio-user connection :(. That's a shame because even if accept(2) is
>>>>>> handled asynchronously, the negotiation can still block. It would be
>>>>>> cleanest to have a fully async libvfio-user's vfu_attach_ctx() API to
>>>>>> avoid blocking. Is that possible?
>>>>> 
>>>>> Thanos / John,
>>>>> 
>>>>>    Any thoughts on this?
>>>> 
>>>> I'm discussing this with John and FYI there are other places where libvfio-user can block, e.g. sending a response or receiving a command. Is it just the negotiation you want it to be asynchronous or _all_ libvfio-user operations? Making libvfio-user fully asynchronous might require a substantial API rewrite.
>>> 
>>> I see at least two reasons for a fully async API:
>>> 
>>> 1. The program wants to handle other events (e.g. a management REST API)
>>>   from the same event loop thread that invokes libvfio-user. If
>>>   libvfio-user blocks then the other events cannot be handled within a
>>>   reasonable time frame.
>>> 
>>>   The workaround for this is to use multi-threading and ignore the
>>>   event-driven architecture implied by vfu_get_poll_fd().
>>> 
>>> 2. The program handles multiple clients that do not trust each other.
>>>   This could be a software-defined network switch or storage appliance.
>>>   A malicious client can cause a denial-of-service by making a
>>>   libvfio-user call block.
>>> 
>>>   Again, the program needs separate threads instead of an event loop to
>>>   work around this.
>> 
>> Hi Stefan, we're certainly aware of the rationale. Ultimately we just haven't
>> yet found resources to fix this: in practice, we don't really hit problems from
>> the parts that are still synchronous. Of course, that's not a good argument when
>> it comes to your second issue, and indeed the library is not currently suitable
>> for multiple untrusted clients.
>> 
>> For Jag but: patches are welcome. But it's not just negotiate(): all sorts of
>> things are potentially synchronous - for example, it's expected that the region
>> read/write callbacks are done synchronously. Any other client reply, or
>> server->client message, is also synchronous.
>> 
>> It's partly why we haven't yet stabilised the API.
> 
> From the QEMU side it's okay to have blocking code in this experimental
> feature if users are aware of the limitation (e.g. the monitor is
> unresponsive if vfio-user blocks) and multiple untrusted clients are not
> supported. The QEMU code should also make its limitations clear so
> readers are aware of them and know the author chose this approach
> intentionally rather than due to an oversight.
> 
> Jag: Please mention this in the documentation and add a comment to
> vfu_object_attach_ctx() explaining that this code currently blocks.

Sure, will do.

Thank you!
--
Jag

> 
> I think in the long run it will be necessary to address it, e.g. in the
> multi-client case. That can either be done with threads or by making
> libvfio-user fully async.
> 
> Stefan



^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2022-01-11 14:09 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-15 15:35 [PATCH v4 00/14] vfio-user server in QEMU Jagannathan Raman
2021-12-15 15:35 ` [PATCH v4 01/14] configure, meson: override C compiler for cmake Jagannathan Raman
2021-12-15 15:35 ` [PATCH v4 02/14] tests/avocado: Specify target VM argument to helper routines Jagannathan Raman
2021-12-15 15:54   ` Philippe Mathieu-Daudé
2021-12-15 22:04   ` Beraldo Leal
2021-12-16 21:28     ` Jag Raman
2021-12-15 15:35 ` [PATCH v4 03/14] vfio-user: build library Jagannathan Raman
2021-12-15 15:35 ` [PATCH v4 04/14] vfio-user: define vfio-user-server object Jagannathan Raman
2021-12-16  9:33   ` Stefan Hajnoczi
2021-12-17  2:17     ` Jag Raman
2021-12-16  9:58   ` Stefan Hajnoczi
2021-12-17  2:31     ` Jag Raman
2021-12-17  8:28       ` Stefan Hajnoczi
2021-12-15 15:35 ` [PATCH v4 05/14] vfio-user: instantiate vfio-user context Jagannathan Raman
2021-12-16  9:55   ` Stefan Hajnoczi
2021-12-16 21:32     ` Jag Raman
2021-12-15 15:35 ` [PATCH v4 06/14] vfio-user: find and init PCI device Jagannathan Raman
2021-12-16 10:39   ` Stefan Hajnoczi
2021-12-17  3:12     ` Jag Raman
2021-12-15 15:35 ` [PATCH v4 07/14] vfio-user: run vfio-user context Jagannathan Raman
2021-12-16 11:17   ` Stefan Hajnoczi
2021-12-17 17:59     ` Jag Raman
2021-12-20  8:29       ` Stefan Hajnoczi
2021-12-21  3:04         ` Jag Raman
2022-01-05 10:38       ` Thanos Makatos
2022-01-06 13:35         ` Stefan Hajnoczi
2022-01-10 17:56           ` John Levon
2022-01-11  9:36             ` Stefan Hajnoczi
2022-01-11 13:12               ` Jag Raman
2021-12-15 15:35 ` [PATCH v4 08/14] vfio-user: handle PCI config space accesses Jagannathan Raman
2021-12-16 11:30   ` Stefan Hajnoczi
2021-12-16 11:47     ` John Levon
2021-12-16 16:00       ` Stefan Hajnoczi
2021-12-15 15:35 ` [PATCH v4 09/14] vfio-user: handle DMA mappings Jagannathan Raman
2021-12-16 13:24   ` Stefan Hajnoczi
2021-12-17 19:11     ` Jag Raman
2021-12-15 15:35 ` [PATCH v4 10/14] vfio-user: handle PCI BAR accesses Jagannathan Raman
2021-12-16 14:10   ` Stefan Hajnoczi
2021-12-17 19:12     ` Jag Raman
2021-12-15 15:35 ` [PATCH v4 11/14] vfio-user: IOMMU support for remote device Jagannathan Raman
2021-12-16 14:40   ` Stefan Hajnoczi
2021-12-17 20:00     ` Jag Raman
2021-12-20 14:36       ` Stefan Hajnoczi
2021-12-21  4:32         ` Jag Raman
2022-01-06 13:10           ` Stefan Hajnoczi
2021-12-15 15:35 ` [PATCH v4 12/14] vfio-user: handle device interrupts Jagannathan Raman
2021-12-16 15:56   ` Stefan Hajnoczi
2021-12-15 15:35 ` [PATCH v4 13/14] vfio-user: register handlers to facilitate migration Jagannathan Raman
2021-12-15 15:35 ` [PATCH v4 14/14] vfio-user: avocado tests for vfio-user Jagannathan Raman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.