All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/9] hw/pci: SR-IOV related fixes and improvements
@ 2024-02-14  5:13 Akihiko Odaki
  2024-02-14  5:13 ` [PATCH v4 1/9] hw/pci: Use -1 as a default value for rombar Akihiko Odaki
                   ` (8 more replies)
  0 siblings, 9 replies; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14  5:13 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, Akihiko Odaki

I submitted a RFC series[1] to add support for SR-IOV emulation to
virtio-net-pci. During the development of the series, I fixed some
trivial bugs and made improvements that I think are independently
useful. This series extracts those fixes and improvements from the RFC
series. Below is an explanation of the patches:

Patch 1 adds a function to check if ROM BAR is explicitly enabled. It
is used in the RFC series to report an error if the user requests to
enable ROM BAR for SR-IOV VF. Patch 2 and 3 use it for vfio to remove
hacky device option dictionary inspection.

Patch 4 adds SR-IOV NumVFs validation to fix potential buffer overflow.

Patch 5 changes to realize SR-IOV VFs when the PF is being realized to
validate VF configuration.

Patch 6 fixes memory leak that occurs if a SR-IOV VF fails to realize.

[1]: https://patchew.org/QEMU/20231210-sriov-v2-0-b959e8a6dfaf@daynix.com/

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
Changes in v4:
- Reverted the change to pci_rom_bar_explicitly_enabled().
  (Michael S. Tsirkin)
- Added patch "pcie_sriov: Do not reset NumVFs after unregistering VFs".
- Added patch "hw/nvme: Refer to dev->exp.sriov_pf.num_vfs".
- Link to v3: https://lore.kernel.org/r/20240212-reuse-v3-0-8017b689ce7f@daynix.com

Changes in v3:
- Extracted patch "hw/pci: Use -1 as a default value for rombar" from
  patch "hw/pci: Determine if rombar is explicitly enabled"
  (Philippe Mathieu-Daudé)
- Added an audit result of PCIDevice::rom_bar to the message of patch
  "hw/pci: Use -1 as a default value for rombar"
  (Philippe Mathieu-Daudé)
- Link to v2: https://lore.kernel.org/r/20240210-reuse-v2-0-24ba2a502692@daynix.com

Changes in v2:
- Reset after enabling a function so that NVMe VF state gets updated.
- Link to v1: https://lore.kernel.org/r/20240203-reuse-v1-0-5be8c5ce6338@daynix.com

---
Akihiko Odaki (9):
      hw/pci: Use -1 as a default value for rombar
      hw/pci: Determine if rombar is explicitly enabled
      vfio: Avoid inspecting option QDict for rombar
      hw/qdev: Remove opts member
      pcie_sriov: Validate NumVFs
      pcie_sriov: Reuse SR-IOV VF device instances
      pcie_sriov: Release VFs failed to realize
      pcie_sriov: Do not reset NumVFs after unregistering VFs
      hw/nvme: Refer to dev->exp.sriov_pf.num_vfs

 docs/pcie_sriov.txt         |   8 ++--
 include/hw/pci/pci.h        |   2 +-
 include/hw/pci/pci_device.h |   7 ++-
 include/hw/pci/pcie_sriov.h |   6 +--
 include/hw/qdev-core.h      |   4 --
 hw/core/qdev.c              |   1 -
 hw/net/igb.c                |  13 ++++--
 hw/nvme/ctrl.c              |  29 +++++++-----
 hw/pci/pci.c                |  20 +++++----
 hw/pci/pci_host.c           |   4 +-
 hw/pci/pcie.c               |   4 +-
 hw/pci/pcie_sriov.c         | 106 +++++++++++++++++++++-----------------------
 hw/vfio/pci.c               |   3 +-
 system/qdev-monitor.c       |  12 ++---
 14 files changed, 118 insertions(+), 101 deletions(-)
---
base-commit: 5005aed8a7e728d028efb40e243ecfc2b4f3df3a
change-id: 20240129-reuse-faae22b11934

Best regards,
-- 
Akihiko Odaki <akihiko.odaki@daynix.com>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 1/9] hw/pci: Use -1 as a default value for rombar
  2024-02-14  5:13 [PATCH v4 0/9] hw/pci: SR-IOV related fixes and improvements Akihiko Odaki
@ 2024-02-14  5:13 ` Akihiko Odaki
  2024-02-14  5:13 ` [PATCH v4 2/9] hw/pci: Determine if rombar is explicitly enabled Akihiko Odaki
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14  5:13 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, Akihiko Odaki

Currently there is no way to distinguish the case that rombar is
explicitly specified as 1 and the case that rombar is not specified.

Set rombar -1 by default to distinguish these cases just as it is done
for addr and romsize. It was confirmed that changing the default value
to -1 will not change the behavior by looking at occurences of rom_bar.

$ git grep -w rom_bar
hw/display/qxl.c:328:    QXLRom *rom = memory_region_get_ram_ptr(&d->rom_bar);
hw/display/qxl.c:431:    qxl_set_dirty(&qxl->rom_bar, 0, qxl->rom_size);
hw/display/qxl.c:1048:    QXLRom *rom = memory_region_get_ram_ptr(&qxl->rom_bar);
hw/display/qxl.c:2131:    memory_region_init_rom(&qxl->rom_bar, OBJECT(qxl), "qxl.vrom",
hw/display/qxl.c:2154: PCI_BASE_ADDRESS_SPACE_MEMORY, &qxl->rom_bar);
hw/display/qxl.h:101:    MemoryRegion       rom_bar;
hw/pci/pci.c:74:    DEFINE_PROP_UINT32("rombar",  PCIDevice, rom_bar, 1),
hw/pci/pci.c:2329:    if (!pdev->rom_bar) {
hw/vfio/pci.c:1019:    if (vdev->pdev.romfile || !vdev->pdev.rom_bar) {
hw/xen/xen_pt_load_rom.c:29:    if (dev->romfile || !dev->rom_bar) {
include/hw/pci/pci_device.h:150:    uint32_t rom_bar;

rom_bar refers to a different variable in qxl. It is only tested if
the value is 0 or not in the other places.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 hw/pci/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 6496d027ca61..47f38375bb09 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -71,7 +71,7 @@ static Property pci_props[] = {
     DEFINE_PROP_PCI_DEVFN("addr", PCIDevice, devfn, -1),
     DEFINE_PROP_STRING("romfile", PCIDevice, romfile),
     DEFINE_PROP_UINT32("romsize", PCIDevice, romsize, -1),
-    DEFINE_PROP_UINT32("rombar",  PCIDevice, rom_bar, 1),
+    DEFINE_PROP_UINT32("rombar",  PCIDevice, rom_bar, -1),
     DEFINE_PROP_BIT("multifunction", PCIDevice, cap_present,
                     QEMU_PCI_CAP_MULTIFUNCTION_BITNR, false),
     DEFINE_PROP_BIT("x-pcie-lnksta-dllla", PCIDevice, cap_present,

-- 
2.43.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 2/9] hw/pci: Determine if rombar is explicitly enabled
  2024-02-14  5:13 [PATCH v4 0/9] hw/pci: SR-IOV related fixes and improvements Akihiko Odaki
  2024-02-14  5:13 ` [PATCH v4 1/9] hw/pci: Use -1 as a default value for rombar Akihiko Odaki
@ 2024-02-14  5:13 ` Akihiko Odaki
  2024-02-14  6:49   ` Michael S. Tsirkin
  2024-02-14  5:13 ` [PATCH v4 3/9] vfio: Avoid inspecting option QDict for rombar Akihiko Odaki
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14  5:13 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, Akihiko Odaki

vfio determines if rombar is explicitly enabled by inspecting QDict.
Inspecting QDict is not nice because QDict is untyped and depends on the
details on the external interface. Add an infrastructure to determine if
rombar is explicitly enabled to hw/pci.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 include/hw/pci/pci_device.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b273..54fa0676abf1 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -205,6 +205,11 @@ static inline uint16_t pci_get_bdf(PCIDevice *dev)
     return PCI_BUILD_BDF(pci_bus_num(pci_get_bus(dev)), dev->devfn);
 }
 
+static inline bool pci_rom_bar_explicitly_enabled(PCIDevice *dev)
+{
+    return dev->rom_bar && dev->rom_bar != -1;
+}
+
 uint16_t pci_requester_id(PCIDevice *dev);
 
 /* DMA access functions */

-- 
2.43.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 3/9] vfio: Avoid inspecting option QDict for rombar
  2024-02-14  5:13 [PATCH v4 0/9] hw/pci: SR-IOV related fixes and improvements Akihiko Odaki
  2024-02-14  5:13 ` [PATCH v4 1/9] hw/pci: Use -1 as a default value for rombar Akihiko Odaki
  2024-02-14  5:13 ` [PATCH v4 2/9] hw/pci: Determine if rombar is explicitly enabled Akihiko Odaki
@ 2024-02-14  5:13 ` Akihiko Odaki
  2024-02-14  5:13 ` [PATCH v4 4/9] hw/qdev: Remove opts member Akihiko Odaki
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14  5:13 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, Akihiko Odaki

Use pci_rom_bar_explicitly_enabled() to determine if rombar is explicitly
enabled.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 hw/vfio/pci.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 4fa387f0430d..647f15b2a060 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1012,7 +1012,6 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
 {
     uint32_t orig, size = cpu_to_le32((uint32_t)PCI_ROM_ADDRESS_MASK);
     off_t offset = vdev->config_offset + PCI_ROM_ADDRESS;
-    DeviceState *dev = DEVICE(vdev);
     char *name;
     int fd = vdev->vbasedev.fd;
 
@@ -1046,7 +1045,7 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
     }
 
     if (vfio_opt_rom_in_denylist(vdev)) {
-        if (dev->opts && qdict_haskey(dev->opts, "rombar")) {
+        if (pci_rom_bar_explicitly_enabled(&vdev->pdev)) {
             warn_report("Device at %s is known to cause system instability"
                         " issues during option rom execution",
                         vdev->vbasedev.name);

-- 
2.43.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 4/9] hw/qdev: Remove opts member
  2024-02-14  5:13 [PATCH v4 0/9] hw/pci: SR-IOV related fixes and improvements Akihiko Odaki
                   ` (2 preceding siblings ...)
  2024-02-14  5:13 ` [PATCH v4 3/9] vfio: Avoid inspecting option QDict for rombar Akihiko Odaki
@ 2024-02-14  5:13 ` Akihiko Odaki
  2024-02-14  5:13 ` [PATCH v4 5/9] pcie_sriov: Validate NumVFs Akihiko Odaki
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14  5:13 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, Akihiko Odaki

It is no longer used.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
---
 include/hw/qdev-core.h |  4 ----
 hw/core/qdev.c         |  1 -
 system/qdev-monitor.c  | 12 +++++++-----
 3 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 9228e96c87e9..5954404dcbfe 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -237,10 +237,6 @@ struct DeviceState {
      * @pending_deleted_expires_ms: optional timeout for deletion events
      */
     int64_t pending_deleted_expires_ms;
-    /**
-     * @opts: QDict of options for the device
-     */
-    QDict *opts;
     /**
      * @hotplugged: was device added after PHASE_MACHINE_READY?
      */
diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index c68d0f7c512f..7349c9a86be8 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -706,7 +706,6 @@ static void device_finalize(Object *obj)
         dev->canonical_path = NULL;
     }
 
-    qobject_unref(dev->opts);
     g_free(dev->id);
 }
 
diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
index a13db763e5dd..71c00f62ee38 100644
--- a/system/qdev-monitor.c
+++ b/system/qdev-monitor.c
@@ -625,6 +625,7 @@ DeviceState *qdev_device_add_from_qdict(const QDict *opts,
     char *id;
     DeviceState *dev = NULL;
     BusState *bus = NULL;
+    QDict *properties;
 
     driver = qdict_get_try_str(opts, "driver");
     if (!driver) {
@@ -705,13 +706,14 @@ DeviceState *qdev_device_add_from_qdict(const QDict *opts,
     }
 
     /* set properties */
-    dev->opts = qdict_clone_shallow(opts);
-    qdict_del(dev->opts, "driver");
-    qdict_del(dev->opts, "bus");
-    qdict_del(dev->opts, "id");
+    properties = qdict_clone_shallow(opts);
+    qdict_del(properties, "driver");
+    qdict_del(properties, "bus");
+    qdict_del(properties, "id");
 
-    object_set_properties_from_keyval(&dev->parent_obj, dev->opts, from_json,
+    object_set_properties_from_keyval(&dev->parent_obj, properties, from_json,
                                       errp);
+    qobject_unref(properties);
     if (*errp) {
         goto err_del_dev;
     }

-- 
2.43.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 5/9] pcie_sriov: Validate NumVFs
  2024-02-14  5:13 [PATCH v4 0/9] hw/pci: SR-IOV related fixes and improvements Akihiko Odaki
                   ` (3 preceding siblings ...)
  2024-02-14  5:13 ` [PATCH v4 4/9] hw/qdev: Remove opts member Akihiko Odaki
@ 2024-02-14  5:13 ` Akihiko Odaki
  2024-02-14  6:52   ` Michael S. Tsirkin
  2024-02-14  8:58   ` Michael Tokarev
  2024-02-14  5:13 ` [PATCH v4 6/9] pcie_sriov: Reuse SR-IOV VF device instances Akihiko Odaki
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14  5:13 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, Akihiko Odaki

The guest may write NumVFs greater than TotalVFs and that can lead
to buffer overflow in VF implementations.

Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)")
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 hw/pci/pcie_sriov.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index a1fe65f5d801..da209b7f47fd 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -176,6 +176,9 @@ static void register_vfs(PCIDevice *dev)
 
     assert(sriov_cap > 0);
     num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
+    if (num_vfs > pci_get_word(dev->config + sriov_cap + PCI_SRIOV_TOTAL_VF)) {
+        return;
+    }
 
     dev->exp.sriov_pf.vf = g_new(PCIDevice *, num_vfs);
 

-- 
2.43.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 6/9] pcie_sriov: Reuse SR-IOV VF device instances
  2024-02-14  5:13 [PATCH v4 0/9] hw/pci: SR-IOV related fixes and improvements Akihiko Odaki
                   ` (4 preceding siblings ...)
  2024-02-14  5:13 ` [PATCH v4 5/9] pcie_sriov: Validate NumVFs Akihiko Odaki
@ 2024-02-14  5:13 ` Akihiko Odaki
  2024-02-14  7:54   ` Michael S. Tsirkin
  2024-02-14  5:13 ` [PATCH v4 7/9] pcie_sriov: Release VFs failed to realize Akihiko Odaki
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14  5:13 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, Akihiko Odaki

Disable SR-IOV VF devices by reusing code to power down PCI devices
instead of removing them when the guest requests to disable VFs. This
allows to realize devices and report VF realization errors at PF
realization time.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 docs/pcie_sriov.txt         |   8 ++--
 include/hw/pci/pci.h        |   2 +-
 include/hw/pci/pci_device.h |   2 +-
 include/hw/pci/pcie_sriov.h |   6 +--
 hw/net/igb.c                |  13 ++++--
 hw/nvme/ctrl.c              |  24 +++++++----
 hw/pci/pci.c                |  18 ++++----
 hw/pci/pci_host.c           |   4 +-
 hw/pci/pcie.c               |   4 +-
 hw/pci/pcie_sriov.c         | 100 ++++++++++++++++++++------------------------
 10 files changed, 97 insertions(+), 84 deletions(-)

diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
index a47aad0bfab0..ab2142807f79 100644
--- a/docs/pcie_sriov.txt
+++ b/docs/pcie_sriov.txt
@@ -52,9 +52,11 @@ setting up a BAR for a VF.
       ...
 
       /* Add and initialize the SR/IOV capability */
-      pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
-                       vf_devid, initial_vfs, total_vfs,
-                       fun_offset, stride);
+      if (!pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
+                              vf_devid, initial_vfs, total_vfs,
+                              fun_offset, stride, errp)) {
+         return;
+      }
 
       /* Set up individual VF BARs (parameters as for normal BARs) */
       pcie_sriov_pf_init_vf_bar( ... )
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index eaa3fc99d884..442017b4865d 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -642,6 +642,6 @@ static inline void pci_irq_pulse(PCIDevice *pci_dev)
 }
 
 MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
-void pci_set_power(PCIDevice *pci_dev, bool state);
+void pci_set_enabled(PCIDevice *pci_dev, bool state);
 
 #endif
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index 54fa0676abf1..f5aba8ae2675 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -56,7 +56,7 @@ typedef struct PCIReqIDCache PCIReqIDCache;
 struct PCIDevice {
     DeviceState qdev;
     bool partially_hotplugged;
-    bool has_power;
+    bool is_enabled;
 
     /* PCI config space */
     uint8_t *config;
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index 095fb0c9edf9..d9a39daccac4 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -18,7 +18,6 @@
 struct PCIESriovPF {
     uint16_t num_vfs;   /* Number of virtual functions created */
     uint8_t vf_bar_type[PCI_NUM_REGIONS];   /* Store type for each VF bar */
-    const char *vfname; /* Reference to the device type used for the VFs */
     PCIDevice **vf;     /* Pointer to an array of num_vfs VF devices */
 };
 
@@ -27,10 +26,11 @@ struct PCIESriovVF {
     uint16_t vf_number; /* Logical VF number of this function */
 };
 
-void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
+bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
                         const char *vfname, uint16_t vf_dev_id,
                         uint16_t init_vfs, uint16_t total_vfs,
-                        uint16_t vf_offset, uint16_t vf_stride);
+                        uint16_t vf_offset, uint16_t vf_stride,
+                        Error **errp);
 void pcie_sriov_pf_exit(PCIDevice *dev);
 
 /* Set up a VF bar in the SR/IOV bar area */
diff --git a/hw/net/igb.c b/hw/net/igb.c
index 0b5c31a58bba..1079a33d4000 100644
--- a/hw/net/igb.c
+++ b/hw/net/igb.c
@@ -447,9 +447,16 @@ static void igb_pci_realize(PCIDevice *pci_dev, Error **errp)
 
     pcie_ari_init(pci_dev, 0x150);
 
-    pcie_sriov_pf_init(pci_dev, IGB_CAP_SRIOV_OFFSET, TYPE_IGBVF,
-        IGB_82576_VF_DEV_ID, IGB_MAX_VF_FUNCTIONS, IGB_MAX_VF_FUNCTIONS,
-        IGB_VF_OFFSET, IGB_VF_STRIDE);
+    if (!pcie_sriov_pf_init(pci_dev, IGB_CAP_SRIOV_OFFSET,
+                            TYPE_IGBVF, IGB_82576_VF_DEV_ID,
+                            IGB_MAX_VF_FUNCTIONS, IGB_MAX_VF_FUNCTIONS,
+                            IGB_VF_OFFSET, IGB_VF_STRIDE,
+                            errp)) {
+        pcie_cap_exit(pci_dev);
+        igb_cleanup_msix(s);
+        msi_uninit(pci_dev);
+        return;
+    }
 
     pcie_sriov_pf_init_vf_bar(pci_dev, IGBVF_MMIO_BAR_IDX,
         PCI_BASE_ADDRESS_MEM_TYPE_64 | PCI_BASE_ADDRESS_MEM_PREFETCH,
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f026245d1e9e..f8df622fe590 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -8031,7 +8031,8 @@ static uint64_t nvme_bar_size(unsigned total_queues, unsigned total_irqs,
     return bar_size;
 }
 
-static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset)
+static bool nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
+                            Error **errp)
 {
     uint16_t vf_dev_id = n->params.use_intel_id ?
                          PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME;
@@ -8040,12 +8041,17 @@ static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset)
                                       le16_to_cpu(cap->vifrsm),
                                       NULL, NULL);
 
-    pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
-                       n->params.sriov_max_vfs, n->params.sriov_max_vfs,
-                       NVME_VF_OFFSET, NVME_VF_STRIDE);
+    if (!pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
+                            n->params.sriov_max_vfs, n->params.sriov_max_vfs,
+                            NVME_VF_OFFSET, NVME_VF_STRIDE,
+                            errp)) {
+        return false;
+    }
 
     pcie_sriov_pf_init_vf_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
                               PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size);
+
+    return true;
 }
 
 static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
@@ -8124,6 +8130,12 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
         return false;
     }
 
+    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs &&
+        !nvme_init_sriov(n, pci_dev, 0x120, errp)) {
+        msix_uninit(pci_dev, &n->bar0, &n->bar0);
+        return false;
+    }
+
     nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
 
     if (n->params.cmb_size_mb) {
@@ -8134,10 +8146,6 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
         nvme_init_pmr(n, pci_dev);
     }
 
-    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
-        nvme_init_sriov(n, pci_dev, 0x120);
-    }
-
     return true;
 }
 
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 47f38375bb09..838c261c21f0 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1524,7 +1524,7 @@ static void pci_update_mappings(PCIDevice *d)
             continue;
 
         new_addr = pci_bar_address(d, i, r->type, r->size);
-        if (!d->has_power) {
+        if (!d->is_enabled) {
             new_addr = PCI_BAR_UNMAPPED;
         }
 
@@ -1612,7 +1612,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
         pci_update_irq_disabled(d, was_irq_disabled);
         memory_region_set_enabled(&d->bus_master_enable_region,
                                   (pci_get_word(d->config + PCI_COMMAND)
-                                   & PCI_COMMAND_MASTER) && d->has_power);
+                                   & PCI_COMMAND_MASTER) && d->is_enabled);
     }
 
     msi_write_config(d, addr, val_in, l);
@@ -2154,7 +2154,9 @@ static void pci_qdev_realize(DeviceState *qdev, Error **errp)
         return;
     }
 
-    pci_set_power(pci_dev, true);
+    if (!pci_is_vf(pci_dev)) {
+        pci_set_enabled(pci_dev, true);
+    }
 
     pci_dev->msi_trigger = pci_msi_trigger;
 }
@@ -2810,18 +2812,18 @@ MSIMessage pci_get_msi_message(PCIDevice *dev, int vector)
     return msg;
 }
 
-void pci_set_power(PCIDevice *d, bool state)
+void pci_set_enabled(PCIDevice *d, bool state)
 {
-    if (d->has_power == state) {
+    if (d->is_enabled == state) {
         return;
     }
 
-    d->has_power = state;
+    d->is_enabled = state;
     pci_update_mappings(d);
     memory_region_set_enabled(&d->bus_master_enable_region,
                               (pci_get_word(d->config + PCI_COMMAND)
-                               & PCI_COMMAND_MASTER) && d->has_power);
-    if (!d->has_power) {
+                               & PCI_COMMAND_MASTER) && d->is_enabled);
+    if (d->qdev.realized) {
         pci_device_reset(d);
     }
 }
diff --git a/hw/pci/pci_host.c b/hw/pci/pci_host.c
index dfe6fe618401..d7e13d72ce07 100644
--- a/hw/pci/pci_host.c
+++ b/hw/pci/pci_host.c
@@ -86,7 +86,7 @@ void pci_host_config_write_common(PCIDevice *pci_dev, uint32_t addr,
      * allowing direct removal of unexposed functions.
      */
     if ((pci_dev->qdev.hotplugged && !pci_get_function_0(pci_dev)) ||
-        !pci_dev->has_power || is_pci_dev_ejected(pci_dev)) {
+        !pci_dev->is_enabled || is_pci_dev_ejected(pci_dev)) {
         return;
     }
 
@@ -111,7 +111,7 @@ uint32_t pci_host_config_read_common(PCIDevice *pci_dev, uint32_t addr,
      * allowing direct removal of unexposed functions.
      */
     if ((pci_dev->qdev.hotplugged && !pci_get_function_0(pci_dev)) ||
-        !pci_dev->has_power || is_pci_dev_ejected(pci_dev)) {
+        !pci_dev->is_enabled || is_pci_dev_ejected(pci_dev)) {
         return ~0x0;
     }
 
diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 6db0cf69cd8a..f34c157e1fd3 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -394,7 +394,9 @@ static void pcie_set_power_device(PCIBus *bus, PCIDevice *dev, void *opaque)
 {
     bool *power = opaque;
 
-    pci_set_power(dev, *power);
+    if (!pci_is_vf(dev)) {
+        pci_set_enabled(dev, *power);
+    }
 }
 
 static void pcie_cap_update_power(PCIDevice *hotplug_dev)
diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index da209b7f47fd..9ba34cf8f8ed 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -20,15 +20,29 @@
 #include "qapi/error.h"
 #include "trace.h"
 
-static PCIDevice *register_vf(PCIDevice *pf, int devfn,
-                              const char *name, uint16_t vf_num);
-static void unregister_vfs(PCIDevice *dev);
+static void unrealize_vfs(PCIDevice *dev, uint16_t total_vfs)
+{
+    for (uint16_t i = 0; i < total_vfs; i++) {
+        Error *err = NULL;
+        PCIDevice *vf = dev->exp.sriov_pf.vf[i];
+        if (!object_property_set_bool(OBJECT(vf), "realized", false, &err)) {
+            error_reportf_err(err, "Failed to unplug: ");
+        }
+        object_unparent(OBJECT(vf));
+        object_unref(OBJECT(vf));
+    }
+    g_free(dev->exp.sriov_pf.vf);
+    dev->exp.sriov_pf.vf = NULL;
+}
 
-void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
+bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
                         const char *vfname, uint16_t vf_dev_id,
                         uint16_t init_vfs, uint16_t total_vfs,
-                        uint16_t vf_offset, uint16_t vf_stride)
+                        uint16_t vf_offset, uint16_t vf_stride,
+                        Error **errp)
 {
+    BusState *bus = qdev_get_parent_bus(&dev->qdev);
+    int32_t devfn = dev->devfn + vf_offset;
     uint8_t *cfg = dev->config + offset;
     uint8_t *wmask;
 
@@ -36,7 +50,6 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
                         offset, PCI_EXT_CAP_SRIOV_SIZEOF);
     dev->exp.sriov_cap = offset;
     dev->exp.sriov_pf.num_vfs = 0;
-    dev->exp.sriov_pf.vfname = g_strdup(vfname);
     dev->exp.sriov_pf.vf = NULL;
 
     pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
@@ -69,13 +82,35 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
     pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, 0x553);
 
     qdev_prop_set_bit(&dev->qdev, "multifunction", true);
+
+    dev->exp.sriov_pf.vf = g_new(PCIDevice *, total_vfs);
+
+    for (uint16_t i = 0; i < total_vfs; i++) {
+        PCIDevice *vf = pci_new(devfn, vfname);
+        vf->exp.sriov_vf.pf = dev;
+        vf->exp.sriov_vf.vf_number = i;
+
+        if (!qdev_realize(&vf->qdev, bus, errp)) {
+            unrealize_vfs(dev, i);
+            return false;
+        }
+
+        /* set vid/did according to sr/iov spec - they are not used */
+        pci_config_set_vendor_id(vf->config, 0xffff);
+        pci_config_set_device_id(vf->config, 0xffff);
+
+        dev->exp.sriov_pf.vf[i] = vf;
+        devfn += vf_stride;
+    }
+
+    return true;
 }
 
 void pcie_sriov_pf_exit(PCIDevice *dev)
 {
-    unregister_vfs(dev);
-    g_free((char *)dev->exp.sriov_pf.vfname);
-    dev->exp.sriov_pf.vfname = NULL;
+    uint8_t *cfg = dev->config + dev->exp.sriov_cap;
+
+    unrealize_vfs(dev, pci_get_word(cfg + PCI_SRIOV_TOTAL_VF));
 }
 
 void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
@@ -141,38 +176,11 @@ void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
     }
 }
 
-static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char *name,
-                              uint16_t vf_num)
-{
-    PCIDevice *dev = pci_new(devfn, name);
-    dev->exp.sriov_vf.pf = pf;
-    dev->exp.sriov_vf.vf_number = vf_num;
-    PCIBus *bus = pci_get_bus(pf);
-    Error *local_err = NULL;
-
-    qdev_realize(&dev->qdev, &bus->qbus, &local_err);
-    if (local_err) {
-        error_report_err(local_err);
-        return NULL;
-    }
-
-    /* set vid/did according to sr/iov spec - they are not used */
-    pci_config_set_vendor_id(dev->config, 0xffff);
-    pci_config_set_device_id(dev->config, 0xffff);
-
-    return dev;
-}
-
 static void register_vfs(PCIDevice *dev)
 {
     uint16_t num_vfs;
     uint16_t i;
     uint16_t sriov_cap = dev->exp.sriov_cap;
-    uint16_t vf_offset =
-        pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
-    uint16_t vf_stride =
-        pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
-    int32_t devfn = dev->devfn + vf_offset;
 
     assert(sriov_cap > 0);
     num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
@@ -180,18 +188,10 @@ static void register_vfs(PCIDevice *dev)
         return;
     }
 
-    dev->exp.sriov_pf.vf = g_new(PCIDevice *, num_vfs);
-
     trace_sriov_register_vfs(dev->name, PCI_SLOT(dev->devfn),
                              PCI_FUNC(dev->devfn), num_vfs);
     for (i = 0; i < num_vfs; i++) {
-        dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn,
-                                              dev->exp.sriov_pf.vfname, i);
-        if (!dev->exp.sriov_pf.vf[i]) {
-            num_vfs = i;
-            break;
-        }
-        devfn += vf_stride;
+        pci_set_enabled(dev->exp.sriov_pf.vf[i], true);
     }
     dev->exp.sriov_pf.num_vfs = num_vfs;
 }
@@ -204,16 +204,8 @@ static void unregister_vfs(PCIDevice *dev)
     trace_sriov_unregister_vfs(dev->name, PCI_SLOT(dev->devfn),
                                PCI_FUNC(dev->devfn), num_vfs);
     for (i = 0; i < num_vfs; i++) {
-        Error *err = NULL;
-        PCIDevice *vf = dev->exp.sriov_pf.vf[i];
-        if (!object_property_set_bool(OBJECT(vf), "realized", false, &err)) {
-            error_reportf_err(err, "Failed to unplug: ");
-        }
-        object_unparent(OBJECT(vf));
-        object_unref(OBJECT(vf));
+        pci_set_enabled(dev->exp.sriov_pf.vf[i], false);
     }
-    g_free(dev->exp.sriov_pf.vf);
-    dev->exp.sriov_pf.vf = NULL;
     dev->exp.sriov_pf.num_vfs = 0;
     pci_set_word(dev->config + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0);
 }

-- 
2.43.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 7/9] pcie_sriov: Release VFs failed to realize
  2024-02-14  5:13 [PATCH v4 0/9] hw/pci: SR-IOV related fixes and improvements Akihiko Odaki
                   ` (5 preceding siblings ...)
  2024-02-14  5:13 ` [PATCH v4 6/9] pcie_sriov: Reuse SR-IOV VF device instances Akihiko Odaki
@ 2024-02-14  5:13 ` Akihiko Odaki
  2024-02-14  7:54   ` Michael S. Tsirkin
  2024-02-14  5:13 ` [PATCH v4 8/9] pcie_sriov: Do not reset NumVFs after unregistering VFs Akihiko Odaki
  2024-02-14  5:13 ` [PATCH v4 9/9] hw/nvme: Refer to dev->exp.sriov_pf.num_vfs Akihiko Odaki
  8 siblings, 1 reply; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14  5:13 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, Akihiko Odaki

Release VFs failed to realize just as we do in unregister_vfs().

Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)")
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 hw/pci/pcie_sriov.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index 9ba34cf8f8ed..9d668b8d6c17 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -91,6 +91,8 @@ bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
         vf->exp.sriov_vf.vf_number = i;
 
         if (!qdev_realize(&vf->qdev, bus, errp)) {
+            object_unparent(OBJECT(vf));
+            object_unref(vf);
             unrealize_vfs(dev, i);
             return false;
         }

-- 
2.43.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 8/9] pcie_sriov: Do not reset NumVFs after unregistering VFs
  2024-02-14  5:13 [PATCH v4 0/9] hw/pci: SR-IOV related fixes and improvements Akihiko Odaki
                   ` (6 preceding siblings ...)
  2024-02-14  5:13 ` [PATCH v4 7/9] pcie_sriov: Release VFs failed to realize Akihiko Odaki
@ 2024-02-14  5:13 ` Akihiko Odaki
  2024-02-14  6:53   ` Michael S. Tsirkin
  2024-02-14  5:13 ` [PATCH v4 9/9] hw/nvme: Refer to dev->exp.sriov_pf.num_vfs Akihiko Odaki
  8 siblings, 1 reply; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14  5:13 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, Akihiko Odaki

I couldn't find such a behavior specified.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 hw/pci/pcie_sriov.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index 9d668b8d6c17..410bc090fc58 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -209,7 +209,6 @@ static void unregister_vfs(PCIDevice *dev)
         pci_set_enabled(dev->exp.sriov_pf.vf[i], false);
     }
     dev->exp.sriov_pf.num_vfs = 0;
-    pci_set_word(dev->config + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0);
 }
 
 void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,

-- 
2.43.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 9/9] hw/nvme: Refer to dev->exp.sriov_pf.num_vfs
  2024-02-14  5:13 [PATCH v4 0/9] hw/pci: SR-IOV related fixes and improvements Akihiko Odaki
                   ` (7 preceding siblings ...)
  2024-02-14  5:13 ` [PATCH v4 8/9] pcie_sriov: Do not reset NumVFs after unregistering VFs Akihiko Odaki
@ 2024-02-14  5:13 ` Akihiko Odaki
  2024-02-14  7:07   ` Michael S. Tsirkin
  8 siblings, 1 reply; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14  5:13 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, Akihiko Odaki

NumVFs may not equal to the current effective number of VFs because VF
Enable is cleared, NumVFs is set after VF Enable is set, or NumVFs is
greater than TotalVFs.

Fixes: 11871f53ef8e ("hw/nvme: Add support for the Virtualization Management command")
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 hw/nvme/ctrl.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f8df622fe590..daedda5d326f 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -8481,7 +8481,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
     NvmeSecCtrlEntry *sctrl;
     uint16_t sriov_cap = dev->exp.sriov_cap;
     uint32_t off = address - sriov_cap;
-    int i, num_vfs;
+    int i;
 
     if (!sriov_cap) {
         return;
@@ -8489,8 +8489,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
 
     if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
         if (!(val & PCI_SRIOV_CTRL_VFE)) {
-            num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
-            for (i = 0; i < num_vfs; i++) {
+            for (i = 0; i < dev->exp.sriov_pf.num_vfs; i++) {
                 sctrl = &n->sec_ctrl_list.sec[i];
                 nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
             }

-- 
2.43.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/9] hw/pci: Determine if rombar is explicitly enabled
  2024-02-14  5:13 ` [PATCH v4 2/9] hw/pci: Determine if rombar is explicitly enabled Akihiko Odaki
@ 2024-02-14  6:49   ` Michael S. Tsirkin
  0 siblings, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2024-02-14  6:49 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On Wed, Feb 14, 2024 at 02:13:40PM +0900, Akihiko Odaki wrote:
> vfio determines if rombar is explicitly enabled by inspecting QDict.
> Inspecting QDict is not nice because QDict is untyped and depends on the
> details on the external interface. Add an infrastructure to determine if
> rombar is explicitly enabled to hw/pci.
> 
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>

I frankly don't know what the issue with using qdict is.
Alex do you want to switch?

> ---
>  include/hw/pci/pci_device.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> index d3dd0f64b273..54fa0676abf1 100644
> --- a/include/hw/pci/pci_device.h
> +++ b/include/hw/pci/pci_device.h
> @@ -205,6 +205,11 @@ static inline uint16_t pci_get_bdf(PCIDevice *dev)
>      return PCI_BUILD_BDF(pci_bus_num(pci_get_bus(dev)), dev->devfn);
>  }
>  
> +static inline bool pci_rom_bar_explicitly_enabled(PCIDevice *dev)
> +{
> +    return dev->rom_bar && dev->rom_bar != -1;
> +}
> +
>  uint16_t pci_requester_id(PCIDevice *dev);
>  
>  /* DMA access functions */
> 
> -- 
> 2.43.0



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 5/9] pcie_sriov: Validate NumVFs
  2024-02-14  5:13 ` [PATCH v4 5/9] pcie_sriov: Validate NumVFs Akihiko Odaki
@ 2024-02-14  6:52   ` Michael S. Tsirkin
  2024-02-14 14:49     ` Akihiko Odaki
  2024-02-14  8:58   ` Michael Tokarev
  1 sibling, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2024-02-14  6:52 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On Wed, Feb 14, 2024 at 02:13:43PM +0900, Akihiko Odaki wrote:
> The guest may write NumVFs greater than TotalVFs and that can lead
> to buffer overflow in VF implementations.
> 
> Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)")
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> ---
>  hw/pci/pcie_sriov.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> index a1fe65f5d801..da209b7f47fd 100644
> --- a/hw/pci/pcie_sriov.c
> +++ b/hw/pci/pcie_sriov.c
> @@ -176,6 +176,9 @@ static void register_vfs(PCIDevice *dev)
>  
>      assert(sriov_cap > 0);
>      num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> +    if (num_vfs > pci_get_word(dev->config + sriov_cap + PCI_SRIOV_TOTAL_VF)) {
> +        return;
> +    }


yes but with your change PCI_SRIOV_NUM_VF no longer reflects the
number of registered VFs and that might lead to uninitialized
data read which is not better :(.

How about just forcing the PCI_SRIOV_NUM_VF register to be
below PCI_SRIOV_TOTAL_VF at all times?

>      dev->exp.sriov_pf.vf = g_new(PCIDevice *, num_vfs);
>  
> 
> -- 
> 2.43.0



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 8/9] pcie_sriov: Do not reset NumVFs after unregistering VFs
  2024-02-14  5:13 ` [PATCH v4 8/9] pcie_sriov: Do not reset NumVFs after unregistering VFs Akihiko Odaki
@ 2024-02-14  6:53   ` Michael S. Tsirkin
  2024-02-14 14:32     ` Akihiko Odaki
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2024-02-14  6:53 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On Wed, Feb 14, 2024 at 02:13:46PM +0900, Akihiko Odaki wrote:
> I couldn't find such a behavior specified.

Is it fixing a bug or just removing unnecessary code?
Is this guest visible at all?

> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> ---
>  hw/pci/pcie_sriov.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> index 9d668b8d6c17..410bc090fc58 100644
> --- a/hw/pci/pcie_sriov.c
> +++ b/hw/pci/pcie_sriov.c
> @@ -209,7 +209,6 @@ static void unregister_vfs(PCIDevice *dev)
>          pci_set_enabled(dev->exp.sriov_pf.vf[i], false);
>      }
>      dev->exp.sriov_pf.num_vfs = 0;
> -    pci_set_word(dev->config + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0);
>  }
>  
>  void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> 
> -- 
> 2.43.0



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 9/9] hw/nvme: Refer to dev->exp.sriov_pf.num_vfs
  2024-02-14  5:13 ` [PATCH v4 9/9] hw/nvme: Refer to dev->exp.sriov_pf.num_vfs Akihiko Odaki
@ 2024-02-14  7:07   ` Michael S. Tsirkin
  2024-02-14 14:09     ` Akihiko Odaki
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2024-02-14  7:07 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On Wed, Feb 14, 2024 at 02:13:47PM +0900, Akihiko Odaki wrote:
> NumVFs may not equal to the current effective number of VFs because VF
> Enable is cleared, NumVFs is set after VF Enable is set, or NumVFs is
> greater than TotalVFs.
> 
> Fixes: 11871f53ef8e ("hw/nvme: Add support for the Virtualization Management command")
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>

I don't get what this is saying about VF enable.
This code will not trigger on numVFs write when VF enable is set.
Generally this commit makes no sense on its own, squash it with
the pci core change pls.

> ---
>  hw/nvme/ctrl.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index f8df622fe590..daedda5d326f 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -8481,7 +8481,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
>      NvmeSecCtrlEntry *sctrl;
>      uint16_t sriov_cap = dev->exp.sriov_cap;
>      uint32_t off = address - sriov_cap;
> -    int i, num_vfs;
> +    int i;
>  
>      if (!sriov_cap) {
>          return;
> @@ -8489,8 +8489,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
>  
>      if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
>          if (!(val & PCI_SRIOV_CTRL_VFE)) {
> -            num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> -            for (i = 0; i < num_vfs; i++) {
> +            for (i = 0; i < dev->exp.sriov_pf.num_vfs; i++) {
>                  sctrl = &n->sec_ctrl_list.sec[i];
>                  nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
>              }
> 
> -- 
> 2.43.0



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 6/9] pcie_sriov: Reuse SR-IOV VF device instances
  2024-02-14  5:13 ` [PATCH v4 6/9] pcie_sriov: Reuse SR-IOV VF device instances Akihiko Odaki
@ 2024-02-14  7:54   ` Michael S. Tsirkin
  2024-02-14 14:44     ` Akihiko Odaki
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2024-02-14  7:54 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On Wed, Feb 14, 2024 at 02:13:44PM +0900, Akihiko Odaki wrote:
> Disable SR-IOV VF devices by reusing code to power down PCI devices
> instead of removing them when the guest requests to disable VFs. This
> allows to realize devices and report VF realization errors at PF
> realization time.
> 
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>

This patch does much more than what commit log says.
This really needs to be split up with each change you make
documented.

> ---
>  docs/pcie_sriov.txt         |   8 ++--
>  include/hw/pci/pci.h        |   2 +-
>  include/hw/pci/pci_device.h |   2 +-
>  include/hw/pci/pcie_sriov.h |   6 +--
>  hw/net/igb.c                |  13 ++++--
>  hw/nvme/ctrl.c              |  24 +++++++----
>  hw/pci/pci.c                |  18 ++++----
>  hw/pci/pci_host.c           |   4 +-
>  hw/pci/pcie.c               |   4 +-
>  hw/pci/pcie_sriov.c         | 100 ++++++++++++++++++++------------------------
>  10 files changed, 97 insertions(+), 84 deletions(-)
> 
> diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
> index a47aad0bfab0..ab2142807f79 100644
> --- a/docs/pcie_sriov.txt
> +++ b/docs/pcie_sriov.txt
> @@ -52,9 +52,11 @@ setting up a BAR for a VF.
>        ...
>  
>        /* Add and initialize the SR/IOV capability */
> -      pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
> -                       vf_devid, initial_vfs, total_vfs,
> -                       fun_offset, stride);
> +      if (!pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
> +                              vf_devid, initial_vfs, total_vfs,
> +                              fun_offset, stride, errp)) {
> +         return;
> +      }
>  
>        /* Set up individual VF BARs (parameters as for normal BARs) */
>        pcie_sriov_pf_init_vf_bar( ... )
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index eaa3fc99d884..442017b4865d 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -642,6 +642,6 @@ static inline void pci_irq_pulse(PCIDevice *pci_dev)
>  }
>  
>  MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
> -void pci_set_power(PCIDevice *pci_dev, bool state);
> +void pci_set_enabled(PCIDevice *pci_dev, bool state);

Not a good name. Does this set enabled state? Test whether
some kind of set is enabled?

>  
>  #endif
> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> index 54fa0676abf1..f5aba8ae2675 100644
> --- a/include/hw/pci/pci_device.h
> +++ b/include/hw/pci/pci_device.h
> @@ -56,7 +56,7 @@ typedef struct PCIReqIDCache PCIReqIDCache;
>  struct PCIDevice {
>      DeviceState qdev;
>      bool partially_hotplugged;
> -    bool has_power;
> +    bool is_enabled;
>  
>      /* PCI config space */
>      uint8_t *config;
> diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> index 095fb0c9edf9..d9a39daccac4 100644
> --- a/include/hw/pci/pcie_sriov.h
> +++ b/include/hw/pci/pcie_sriov.h
> @@ -18,7 +18,6 @@
>  struct PCIESriovPF {
>      uint16_t num_vfs;   /* Number of virtual functions created */
>      uint8_t vf_bar_type[PCI_NUM_REGIONS];   /* Store type for each VF bar */
> -    const char *vfname; /* Reference to the device type used for the VFs */
>      PCIDevice **vf;     /* Pointer to an array of num_vfs VF devices */
>  };
>  
> @@ -27,10 +26,11 @@ struct PCIESriovVF {
>      uint16_t vf_number; /* Logical VF number of this function */
>  };
>  
> -void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> +bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>                          const char *vfname, uint16_t vf_dev_id,
>                          uint16_t init_vfs, uint16_t total_vfs,
> -                        uint16_t vf_offset, uint16_t vf_stride);
> +                        uint16_t vf_offset, uint16_t vf_stride,
> +                        Error **errp);

Returning false to indicate an error is a bad idea -
most APIs we have return 0 on success, < 0 on error.
Or just use errp.


>  void pcie_sriov_pf_exit(PCIDevice *dev);
>  
>  /* Set up a VF bar in the SR/IOV bar area */
> diff --git a/hw/net/igb.c b/hw/net/igb.c
> index 0b5c31a58bba..1079a33d4000 100644
> --- a/hw/net/igb.c
> +++ b/hw/net/igb.c
> @@ -447,9 +447,16 @@ static void igb_pci_realize(PCIDevice *pci_dev, Error **errp)
>  
>      pcie_ari_init(pci_dev, 0x150);
>  
> -    pcie_sriov_pf_init(pci_dev, IGB_CAP_SRIOV_OFFSET, TYPE_IGBVF,
> -        IGB_82576_VF_DEV_ID, IGB_MAX_VF_FUNCTIONS, IGB_MAX_VF_FUNCTIONS,
> -        IGB_VF_OFFSET, IGB_VF_STRIDE);
> +    if (!pcie_sriov_pf_init(pci_dev, IGB_CAP_SRIOV_OFFSET,
> +                            TYPE_IGBVF, IGB_82576_VF_DEV_ID,
> +                            IGB_MAX_VF_FUNCTIONS, IGB_MAX_VF_FUNCTIONS,
> +                            IGB_VF_OFFSET, IGB_VF_STRIDE,
> +                            errp)) {
> +        pcie_cap_exit(pci_dev);
> +        igb_cleanup_msix(s);
> +        msi_uninit(pci_dev);
> +        return;
> +    }
>  
>      pcie_sriov_pf_init_vf_bar(pci_dev, IGBVF_MMIO_BAR_IDX,
>          PCI_BASE_ADDRESS_MEM_TYPE_64 | PCI_BASE_ADDRESS_MEM_PREFETCH,
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index f026245d1e9e..f8df622fe590 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -8031,7 +8031,8 @@ static uint64_t nvme_bar_size(unsigned total_queues, unsigned total_irqs,
>      return bar_size;
>  }
>  
> -static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset)
> +static bool nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
> +                            Error **errp)
>  {
>      uint16_t vf_dev_id = n->params.use_intel_id ?
>                           PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME;
> @@ -8040,12 +8041,17 @@ static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset)
>                                        le16_to_cpu(cap->vifrsm),
>                                        NULL, NULL);
>  
> -    pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
> -                       n->params.sriov_max_vfs, n->params.sriov_max_vfs,
> -                       NVME_VF_OFFSET, NVME_VF_STRIDE);
> +    if (!pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
> +                            n->params.sriov_max_vfs, n->params.sriov_max_vfs,
> +                            NVME_VF_OFFSET, NVME_VF_STRIDE,
> +                            errp)) {
> +        return false;
> +    }
>  
>      pcie_sriov_pf_init_vf_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
>                                PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size);
> +
> +    return true;
>  }
>  
>  static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
> @@ -8124,6 +8130,12 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
>          return false;
>      }
>  
> +    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs &&
> +        !nvme_init_sriov(n, pci_dev, 0x120, errp)) {
> +        msix_uninit(pci_dev, &n->bar0, &n->bar0);
> +        return false;
> +    }
> +
>      nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
>  
>      if (n->params.cmb_size_mb) {
> @@ -8134,10 +8146,6 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
>          nvme_init_pmr(n, pci_dev);
>      }
>  
> -    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
> -        nvme_init_sriov(n, pci_dev, 0x120);
> -    }
> -
>      return true;
>  }

why this change?

> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 47f38375bb09..838c261c21f0 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -1524,7 +1524,7 @@ static void pci_update_mappings(PCIDevice *d)
>              continue;
>  
>          new_addr = pci_bar_address(d, i, r->type, r->size);
> -        if (!d->has_power) {
> +        if (!d->is_enabled) {
>              new_addr = PCI_BAR_UNMAPPED;
>          }
>  
> @@ -1612,7 +1612,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
>          pci_update_irq_disabled(d, was_irq_disabled);
>          memory_region_set_enabled(&d->bus_master_enable_region,
>                                    (pci_get_word(d->config + PCI_COMMAND)
> -                                   & PCI_COMMAND_MASTER) && d->has_power);
> +                                   & PCI_COMMAND_MASTER) && d->is_enabled);
>      }
>  
>      msi_write_config(d, addr, val_in, l);
> @@ -2154,7 +2154,9 @@ static void pci_qdev_realize(DeviceState *qdev, Error **errp)
>          return;
>      }
>  
> -    pci_set_power(pci_dev, true);
> +    if (!pci_is_vf(pci_dev)) {
> +        pci_set_enabled(pci_dev, true);
> +    }

why are you not enabling vfs specifically? seems confusing. you also
duplicate this logic in a couple more places. needs a
comment at least. Or better, add sane APIs that do this correctly.
E.g. keep pci_set_power around?

>  
>      pci_dev->msi_trigger = pci_msi_trigger;
>  }
> @@ -2810,18 +2812,18 @@ MSIMessage pci_get_msi_message(PCIDevice *dev, int vector)
>      return msg;
>  }
>  
> -void pci_set_power(PCIDevice *d, bool state)
> +void pci_set_enabled(PCIDevice *d, bool state)
>  {
> -    if (d->has_power == state) {
> +    if (d->is_enabled == state) {
>          return;
>      }
>  
> -    d->has_power = state;
> +    d->is_enabled = state;
>      pci_update_mappings(d);
>      memory_region_set_enabled(&d->bus_master_enable_region,
>                                (pci_get_word(d->config + PCI_COMMAND)
> -                               & PCI_COMMAND_MASTER) && d->has_power);
> -    if (!d->has_power) {
> +                               & PCI_COMMAND_MASTER) && d->is_enabled);
> +    if (d->qdev.realized) {
>          pci_device_reset(d);
>      }
>  }
> diff --git a/hw/pci/pci_host.c b/hw/pci/pci_host.c
> index dfe6fe618401..d7e13d72ce07 100644
> --- a/hw/pci/pci_host.c
> +++ b/hw/pci/pci_host.c
> @@ -86,7 +86,7 @@ void pci_host_config_write_common(PCIDevice *pci_dev, uint32_t addr,
>       * allowing direct removal of unexposed functions.
>       */
>      if ((pci_dev->qdev.hotplugged && !pci_get_function_0(pci_dev)) ||
> -        !pci_dev->has_power || is_pci_dev_ejected(pci_dev)) {
> +        !pci_dev->is_enabled || is_pci_dev_ejected(pci_dev)) {
>          return;
>      }
>  
> @@ -111,7 +111,7 @@ uint32_t pci_host_config_read_common(PCIDevice *pci_dev, uint32_t addr,
>       * allowing direct removal of unexposed functions.
>       */
>      if ((pci_dev->qdev.hotplugged && !pci_get_function_0(pci_dev)) ||
> -        !pci_dev->has_power || is_pci_dev_ejected(pci_dev)) {
> +        !pci_dev->is_enabled || is_pci_dev_ejected(pci_dev)) {
>          return ~0x0;
>      }
>  
> diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> index 6db0cf69cd8a..f34c157e1fd3 100644
> --- a/hw/pci/pcie.c
> +++ b/hw/pci/pcie.c
> @@ -394,7 +394,9 @@ static void pcie_set_power_device(PCIBus *bus, PCIDevice *dev, void *opaque)
>  {
>      bool *power = opaque;
>  
> -    pci_set_power(dev, *power);
> +    if (!pci_is_vf(dev)) {
> +        pci_set_enabled(dev, *power);
> +    }
>  }
>  
>  static void pcie_cap_update_power(PCIDevice *hotplug_dev)
> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> index da209b7f47fd..9ba34cf8f8ed 100644
> --- a/hw/pci/pcie_sriov.c
> +++ b/hw/pci/pcie_sriov.c
> @@ -20,15 +20,29 @@
>  #include "qapi/error.h"
>  #include "trace.h"
>  
> -static PCIDevice *register_vf(PCIDevice *pf, int devfn,
> -                              const char *name, uint16_t vf_num);
> -static void unregister_vfs(PCIDevice *dev);
> +static void unrealize_vfs(PCIDevice *dev, uint16_t total_vfs)
> +{
> +    for (uint16_t i = 0; i < total_vfs; i++) {
> +        Error *err = NULL;
> +        PCIDevice *vf = dev->exp.sriov_pf.vf[i];
> +        if (!object_property_set_bool(OBJECT(vf), "realized", false, &err)) {
> +            error_reportf_err(err, "Failed to unplug: ");
> +        }
> +        object_unparent(OBJECT(vf));
> +        object_unref(OBJECT(vf));
> +    }
> +    g_free(dev->exp.sriov_pf.vf);
> +    dev->exp.sriov_pf.vf = NULL;
> +}
>  
> -void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> +bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>                          const char *vfname, uint16_t vf_dev_id,
>                          uint16_t init_vfs, uint16_t total_vfs,
> -                        uint16_t vf_offset, uint16_t vf_stride)
> +                        uint16_t vf_offset, uint16_t vf_stride,
> +                        Error **errp)
>  {
> +    BusState *bus = qdev_get_parent_bus(&dev->qdev);
> +    int32_t devfn = dev->devfn + vf_offset;
>      uint8_t *cfg = dev->config + offset;
>      uint8_t *wmask;
>  
> @@ -36,7 +50,6 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>                          offset, PCI_EXT_CAP_SRIOV_SIZEOF);
>      dev->exp.sriov_cap = offset;
>      dev->exp.sriov_pf.num_vfs = 0;
> -    dev->exp.sriov_pf.vfname = g_strdup(vfname);
>      dev->exp.sriov_pf.vf = NULL;
>  
>      pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
> @@ -69,13 +82,35 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>      pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, 0x553);
>  
>      qdev_prop_set_bit(&dev->qdev, "multifunction", true);
> +
> +    dev->exp.sriov_pf.vf = g_new(PCIDevice *, total_vfs);
> +
> +    for (uint16_t i = 0; i < total_vfs; i++) {
> +        PCIDevice *vf = pci_new(devfn, vfname);
> +        vf->exp.sriov_vf.pf = dev;
> +        vf->exp.sriov_vf.vf_number = i;
> +
> +        if (!qdev_realize(&vf->qdev, bus, errp)) {
> +            unrealize_vfs(dev, i);
> +            return false;
> +        }
> +
> +        /* set vid/did according to sr/iov spec - they are not used */
> +        pci_config_set_vendor_id(vf->config, 0xffff);
> +        pci_config_set_device_id(vf->config, 0xffff);
> +
> +        dev->exp.sriov_pf.vf[i] = vf;
> +        devfn += vf_stride;
> +    }
> +
> +    return true;
>  }
>  
>  void pcie_sriov_pf_exit(PCIDevice *dev)
>  {
> -    unregister_vfs(dev);
> -    g_free((char *)dev->exp.sriov_pf.vfname);
> -    dev->exp.sriov_pf.vfname = NULL;
> +    uint8_t *cfg = dev->config + dev->exp.sriov_cap;
> +
> +    unrealize_vfs(dev, pci_get_word(cfg + PCI_SRIOV_TOTAL_VF));
>  }
>  
>  void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> @@ -141,38 +176,11 @@ void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
>      }
>  }
>  
> -static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char *name,
> -                              uint16_t vf_num)
> -{
> -    PCIDevice *dev = pci_new(devfn, name);
> -    dev->exp.sriov_vf.pf = pf;
> -    dev->exp.sriov_vf.vf_number = vf_num;
> -    PCIBus *bus = pci_get_bus(pf);
> -    Error *local_err = NULL;
> -
> -    qdev_realize(&dev->qdev, &bus->qbus, &local_err);
> -    if (local_err) {
> -        error_report_err(local_err);
> -        return NULL;
> -    }
> -
> -    /* set vid/did according to sr/iov spec - they are not used */
> -    pci_config_set_vendor_id(dev->config, 0xffff);
> -    pci_config_set_device_id(dev->config, 0xffff);
> -
> -    return dev;
> -}
> -
>  static void register_vfs(PCIDevice *dev)
>  {
>      uint16_t num_vfs;
>      uint16_t i;
>      uint16_t sriov_cap = dev->exp.sriov_cap;
> -    uint16_t vf_offset =
> -        pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
> -    uint16_t vf_stride =
> -        pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
> -    int32_t devfn = dev->devfn + vf_offset;
>  
>      assert(sriov_cap > 0);
>      num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> @@ -180,18 +188,10 @@ static void register_vfs(PCIDevice *dev)
>          return;
>      }
>  
> -    dev->exp.sriov_pf.vf = g_new(PCIDevice *, num_vfs);
> -
>      trace_sriov_register_vfs(dev->name, PCI_SLOT(dev->devfn),
>                               PCI_FUNC(dev->devfn), num_vfs);
>      for (i = 0; i < num_vfs; i++) {
> -        dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn,
> -                                              dev->exp.sriov_pf.vfname, i);
> -        if (!dev->exp.sriov_pf.vf[i]) {
> -            num_vfs = i;
> -            break;
> -        }
> -        devfn += vf_stride;
> +        pci_set_enabled(dev->exp.sriov_pf.vf[i], true);
>      }
>      dev->exp.sriov_pf.num_vfs = num_vfs;
>  }
> @@ -204,16 +204,8 @@ static void unregister_vfs(PCIDevice *dev)
>      trace_sriov_unregister_vfs(dev->name, PCI_SLOT(dev->devfn),
>                                 PCI_FUNC(dev->devfn), num_vfs);
>      for (i = 0; i < num_vfs; i++) {
> -        Error *err = NULL;
> -        PCIDevice *vf = dev->exp.sriov_pf.vf[i];
> -        if (!object_property_set_bool(OBJECT(vf), "realized", false, &err)) {
> -            error_reportf_err(err, "Failed to unplug: ");
> -        }
> -        object_unparent(OBJECT(vf));
> -        object_unref(OBJECT(vf));
> +        pci_set_enabled(dev->exp.sriov_pf.vf[i], false);
>      }
> -    g_free(dev->exp.sriov_pf.vf);
> -    dev->exp.sriov_pf.vf = NULL;
>      dev->exp.sriov_pf.num_vfs = 0;
>      pci_set_word(dev->config + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0);
>  }
> 
> -- 
> 2.43.0



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/9] pcie_sriov: Release VFs failed to realize
  2024-02-14  5:13 ` [PATCH v4 7/9] pcie_sriov: Release VFs failed to realize Akihiko Odaki
@ 2024-02-14  7:54   ` Michael S. Tsirkin
  2024-02-14  7:58     ` Akihiko Odaki
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2024-02-14  7:54 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On Wed, Feb 14, 2024 at 02:13:45PM +0900, Akihiko Odaki wrote:
> Release VFs failed to realize just as we do in unregister_vfs().
> 
> Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)")
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>

Is this fixing an old bug or a bug introduced by a previous patch?


> ---
>  hw/pci/pcie_sriov.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> index 9ba34cf8f8ed..9d668b8d6c17 100644
> --- a/hw/pci/pcie_sriov.c
> +++ b/hw/pci/pcie_sriov.c
> @@ -91,6 +91,8 @@ bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>          vf->exp.sriov_vf.vf_number = i;
>  
>          if (!qdev_realize(&vf->qdev, bus, errp)) {
> +            object_unparent(OBJECT(vf));
> +            object_unref(vf);
>              unrealize_vfs(dev, i);
>              return false;
>          }
> 
> -- 
> 2.43.0



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/9] pcie_sriov: Release VFs failed to realize
  2024-02-14  7:54   ` Michael S. Tsirkin
@ 2024-02-14  7:58     ` Akihiko Odaki
  0 siblings, 0 replies; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14  7:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On 2024/02/14 16:54, Michael S. Tsirkin wrote:
> On Wed, Feb 14, 2024 at 02:13:45PM +0900, Akihiko Odaki wrote:
>> Release VFs failed to realize just as we do in unregister_vfs().
>>
>> Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)")
>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> 
> Is this fixing an old bug or a bug introduced by a previous patch?

Old bug, present since: commit 7c0fa8dff811 ("pcie: Add support for 
Single Root I/O Virtualization (SR/IOV)")


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 5/9] pcie_sriov: Validate NumVFs
  2024-02-14  5:13 ` [PATCH v4 5/9] pcie_sriov: Validate NumVFs Akihiko Odaki
  2024-02-14  6:52   ` Michael S. Tsirkin
@ 2024-02-14  8:58   ` Michael Tokarev
  2024-02-14 14:54     ` Akihiko Odaki
  1 sibling, 1 reply; 34+ messages in thread
From: Michael Tokarev @ 2024-02-14  8:58 UTC (permalink / raw)
  To: Akihiko Odaki, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, qemu-stable

14.02.2024 08:13, Akihiko Odaki wrote:
> The guest may write NumVFs greater than TotalVFs and that can lead
> to buffer overflow in VF implementations.

This seems to be stable-worthy (Cc'd), and maybe even CVE-worthy?

Thanks,

/mjt

> Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)")
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> ---
>   hw/pci/pcie_sriov.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> index a1fe65f5d801..da209b7f47fd 100644
> --- a/hw/pci/pcie_sriov.c
> +++ b/hw/pci/pcie_sriov.c
> @@ -176,6 +176,9 @@ static void register_vfs(PCIDevice *dev)
>   
>       assert(sriov_cap > 0);
>       num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> +    if (num_vfs > pci_get_word(dev->config + sriov_cap + PCI_SRIOV_TOTAL_VF)) {
> +        return;
> +    }
>   
>       dev->exp.sriov_pf.vf = g_new(PCIDevice *, num_vfs);
>   
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 9/9] hw/nvme: Refer to dev->exp.sriov_pf.num_vfs
  2024-02-14  7:07   ` Michael S. Tsirkin
@ 2024-02-14 14:09     ` Akihiko Odaki
  2024-02-14 15:46       ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14 14:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On 2024/02/14 16:07, Michael S. Tsirkin wrote:
> On Wed, Feb 14, 2024 at 02:13:47PM +0900, Akihiko Odaki wrote:
>> NumVFs may not equal to the current effective number of VFs because VF
>> Enable is cleared, NumVFs is set after VF Enable is set, or NumVFs is
>> greater than TotalVFs.
>>
>> Fixes: 11871f53ef8e ("hw/nvme: Add support for the Virtualization Management command")
>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> 
> I don't get what this is saying about VF enable.
> This code will not trigger on numVFs write when VF enable is set.
> Generally this commit makes no sense on its own, squash it with
> the pci core change pls.

This code is meant to run when it is clearing VF Enable, and its 
functionality is to change the state of VFs currently enabled so that we 
can disable them.

However, NumVFs does not necessarily represent VFs currently being 
enabled, and have a different value in the case described above. Such 
cases exist even before the earlier patches and this fix is 
independently meaningful.

> 
>> ---
>>   hw/nvme/ctrl.c | 5 ++---
>>   1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
>> index f8df622fe590..daedda5d326f 100644
>> --- a/hw/nvme/ctrl.c
>> +++ b/hw/nvme/ctrl.c
>> @@ -8481,7 +8481,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
>>       NvmeSecCtrlEntry *sctrl;
>>       uint16_t sriov_cap = dev->exp.sriov_cap;
>>       uint32_t off = address - sriov_cap;
>> -    int i, num_vfs;
>> +    int i;
>>   
>>       if (!sriov_cap) {
>>           return;
>> @@ -8489,8 +8489,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
>>   
>>       if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
>>           if (!(val & PCI_SRIOV_CTRL_VFE)) {
>> -            num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
>> -            for (i = 0; i < num_vfs; i++) {
>> +            for (i = 0; i < dev->exp.sriov_pf.num_vfs; i++) {
>>                   sctrl = &n->sec_ctrl_list.sec[i];
>>                   nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
>>               }
>>
>> -- 
>> 2.43.0
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 8/9] pcie_sriov: Do not reset NumVFs after unregistering VFs
  2024-02-14  6:53   ` Michael S. Tsirkin
@ 2024-02-14 14:32     ` Akihiko Odaki
  2024-02-14 15:51       ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14 14:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On 2024/02/14 15:53, Michael S. Tsirkin wrote:
> On Wed, Feb 14, 2024 at 02:13:46PM +0900, Akihiko Odaki wrote:
>> I couldn't find such a behavior specified.
> 
> Is it fixing a bug or just removing unnecessary code?
> Is this guest visible at all?

My intention is just to remove unnecessary code, but it is 
guest-visible. The original behavior causes a problem and it should be 
considered as a bug fix if a guest expects VFs can be restored by 
setting VF Enable after clearing it, but I don't know such an example.

> 
>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>> ---
>>   hw/pci/pcie_sriov.c | 1 -
>>   1 file changed, 1 deletion(-)
>>
>> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
>> index 9d668b8d6c17..410bc090fc58 100644
>> --- a/hw/pci/pcie_sriov.c
>> +++ b/hw/pci/pcie_sriov.c
>> @@ -209,7 +209,6 @@ static void unregister_vfs(PCIDevice *dev)
>>           pci_set_enabled(dev->exp.sriov_pf.vf[i], false);
>>       }
>>       dev->exp.sriov_pf.num_vfs = 0;
>> -    pci_set_word(dev->config + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0);
>>   }
>>   
>>   void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
>>
>> -- 
>> 2.43.0
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 6/9] pcie_sriov: Reuse SR-IOV VF device instances
  2024-02-14  7:54   ` Michael S. Tsirkin
@ 2024-02-14 14:44     ` Akihiko Odaki
  0 siblings, 0 replies; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14 14:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On 2024/02/14 16:54, Michael S. Tsirkin wrote:
> On Wed, Feb 14, 2024 at 02:13:44PM +0900, Akihiko Odaki wrote:
>> Disable SR-IOV VF devices by reusing code to power down PCI devices
>> instead of removing them when the guest requests to disable VFs. This
>> allows to realize devices and report VF realization errors at PF
>> realization time.
>>
>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> 
> This patch does much more than what commit log says.
> This really needs to be split up with each change you make
> documented.

I'll split into two patches; one for renaming "power" to "enabled", 
another for reusing the code for SR-IOV.

> 
>> ---
>>   docs/pcie_sriov.txt         |   8 ++--
>>   include/hw/pci/pci.h        |   2 +-
>>   include/hw/pci/pci_device.h |   2 +-
>>   include/hw/pci/pcie_sriov.h |   6 +--
>>   hw/net/igb.c                |  13 ++++--
>>   hw/nvme/ctrl.c              |  24 +++++++----
>>   hw/pci/pci.c                |  18 ++++----
>>   hw/pci/pci_host.c           |   4 +-
>>   hw/pci/pcie.c               |   4 +-
>>   hw/pci/pcie_sriov.c         | 100 ++++++++++++++++++++------------------------
>>   10 files changed, 97 insertions(+), 84 deletions(-)
>>
>> diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
>> index a47aad0bfab0..ab2142807f79 100644
>> --- a/docs/pcie_sriov.txt
>> +++ b/docs/pcie_sriov.txt
>> @@ -52,9 +52,11 @@ setting up a BAR for a VF.
>>         ...
>>   
>>         /* Add and initialize the SR/IOV capability */
>> -      pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
>> -                       vf_devid, initial_vfs, total_vfs,
>> -                       fun_offset, stride);
>> +      if (!pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
>> +                              vf_devid, initial_vfs, total_vfs,
>> +                              fun_offset, stride, errp)) {
>> +         return;
>> +      }
>>   
>>         /* Set up individual VF BARs (parameters as for normal BARs) */
>>         pcie_sriov_pf_init_vf_bar( ... )
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index eaa3fc99d884..442017b4865d 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -642,6 +642,6 @@ static inline void pci_irq_pulse(PCIDevice *pci_dev)
>>   }
>>   
>>   MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
>> -void pci_set_power(PCIDevice *pci_dev, bool state);
>> +void pci_set_enabled(PCIDevice *pci_dev, bool state);
> 
> Not a good name. Does this set enabled state? Test whether
> some kind of set is enabled?

I followed memory_region_set_enabled(), which is called from 
pci_set_enabled().

> 
>>   
>>   #endif
>> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
>> index 54fa0676abf1..f5aba8ae2675 100644
>> --- a/include/hw/pci/pci_device.h
>> +++ b/include/hw/pci/pci_device.h
>> @@ -56,7 +56,7 @@ typedef struct PCIReqIDCache PCIReqIDCache;
>>   struct PCIDevice {
>>       DeviceState qdev;
>>       bool partially_hotplugged;
>> -    bool has_power;
>> +    bool is_enabled;
>>   
>>       /* PCI config space */
>>       uint8_t *config;
>> diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
>> index 095fb0c9edf9..d9a39daccac4 100644
>> --- a/include/hw/pci/pcie_sriov.h
>> +++ b/include/hw/pci/pcie_sriov.h
>> @@ -18,7 +18,6 @@
>>   struct PCIESriovPF {
>>       uint16_t num_vfs;   /* Number of virtual functions created */
>>       uint8_t vf_bar_type[PCI_NUM_REGIONS];   /* Store type for each VF bar */
>> -    const char *vfname; /* Reference to the device type used for the VFs */
>>       PCIDevice **vf;     /* Pointer to an array of num_vfs VF devices */
>>   };
>>   
>> @@ -27,10 +26,11 @@ struct PCIESriovVF {
>>       uint16_t vf_number; /* Logical VF number of this function */
>>   };
>>   
>> -void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>> +bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>>                           const char *vfname, uint16_t vf_dev_id,
>>                           uint16_t init_vfs, uint16_t total_vfs,
>> -                        uint16_t vf_offset, uint16_t vf_stride);
>> +                        uint16_t vf_offset, uint16_t vf_stride,
>> +                        Error **errp);
> 
> Returning false to indicate an error is a bad idea -
> most APIs we have return 0 on success, < 0 on error.
> Or just use errp.

I followed a statement in include/qapi/error.h:
  * - Whenever practical, also return a value that indicates success /
  *   failure.  This can make the error checking more concise, and can
  *   avoid useless error object creation and destruction.  Note that
  *   we still have many functions returning void.  We recommend
  *   • bool-valued functions return true on success / false on failure,
  *   • pointer-valued functions return non-null / null pointer, and
  *   • integer-valued functions return non-negative / negative.

> 
> 
>>   void pcie_sriov_pf_exit(PCIDevice *dev);
>>   
>>   /* Set up a VF bar in the SR/IOV bar area */
>> diff --git a/hw/net/igb.c b/hw/net/igb.c
>> index 0b5c31a58bba..1079a33d4000 100644
>> --- a/hw/net/igb.c
>> +++ b/hw/net/igb.c
>> @@ -447,9 +447,16 @@ static void igb_pci_realize(PCIDevice *pci_dev, Error **errp)
>>   
>>       pcie_ari_init(pci_dev, 0x150);
>>   
>> -    pcie_sriov_pf_init(pci_dev, IGB_CAP_SRIOV_OFFSET, TYPE_IGBVF,
>> -        IGB_82576_VF_DEV_ID, IGB_MAX_VF_FUNCTIONS, IGB_MAX_VF_FUNCTIONS,
>> -        IGB_VF_OFFSET, IGB_VF_STRIDE);
>> +    if (!pcie_sriov_pf_init(pci_dev, IGB_CAP_SRIOV_OFFSET,
>> +                            TYPE_IGBVF, IGB_82576_VF_DEV_ID,
>> +                            IGB_MAX_VF_FUNCTIONS, IGB_MAX_VF_FUNCTIONS,
>> +                            IGB_VF_OFFSET, IGB_VF_STRIDE,
>> +                            errp)) {
>> +        pcie_cap_exit(pci_dev);
>> +        igb_cleanup_msix(s);
>> +        msi_uninit(pci_dev);
>> +        return;
>> +    }
>>   
>>       pcie_sriov_pf_init_vf_bar(pci_dev, IGBVF_MMIO_BAR_IDX,
>>           PCI_BASE_ADDRESS_MEM_TYPE_64 | PCI_BASE_ADDRESS_MEM_PREFETCH,
>> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
>> index f026245d1e9e..f8df622fe590 100644
>> --- a/hw/nvme/ctrl.c
>> +++ b/hw/nvme/ctrl.c
>> @@ -8031,7 +8031,8 @@ static uint64_t nvme_bar_size(unsigned total_queues, unsigned total_irqs,
>>       return bar_size;
>>   }
>>   
>> -static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset)
>> +static bool nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
>> +                            Error **errp)
>>   {
>>       uint16_t vf_dev_id = n->params.use_intel_id ?
>>                            PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME;
>> @@ -8040,12 +8041,17 @@ static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset)
>>                                         le16_to_cpu(cap->vifrsm),
>>                                         NULL, NULL);
>>   
>> -    pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
>> -                       n->params.sriov_max_vfs, n->params.sriov_max_vfs,
>> -                       NVME_VF_OFFSET, NVME_VF_STRIDE);
>> +    if (!pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
>> +                            n->params.sriov_max_vfs, n->params.sriov_max_vfs,
>> +                            NVME_VF_OFFSET, NVME_VF_STRIDE,
>> +                            errp)) {
>> +        return false;
>> +    }
>>   
>>       pcie_sriov_pf_init_vf_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
>>                                 PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size);
>> +
>> +    return true;
>>   }
>>   
>>   static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
>> @@ -8124,6 +8130,12 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
>>           return false;
>>       }
>>   
>> +    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs &&
>> +        !nvme_init_sriov(n, pci_dev, 0x120, errp)) {
>> +        msix_uninit(pci_dev, &n->bar0, &n->bar0);
>> +        return false;
>> +    }
>> +
>>       nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
>>   
>>       if (n->params.cmb_size_mb) {
>> @@ -8134,10 +8146,6 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
>>           nvme_init_pmr(n, pci_dev);
>>       }
>>   
>> -    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
>> -        nvme_init_sriov(n, pci_dev, 0x120);
>> -    }
>> -
>>       return true;
>>   }
> 
> why this change?

I moved nvme_init_sriov() because nvme_init_cmb() does some memory 
allocation, which will be leaked if we return false after that.

nvme_init_sriov() now follows immediately after msix_init(); it already 
has code to return for error handling so we shouldn't need more than 
msix_uninit().

> 
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 47f38375bb09..838c261c21f0 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -1524,7 +1524,7 @@ static void pci_update_mappings(PCIDevice *d)
>>               continue;
>>   
>>           new_addr = pci_bar_address(d, i, r->type, r->size);
>> -        if (!d->has_power) {
>> +        if (!d->is_enabled) {
>>               new_addr = PCI_BAR_UNMAPPED;
>>           }
>>   
>> @@ -1612,7 +1612,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
>>           pci_update_irq_disabled(d, was_irq_disabled);
>>           memory_region_set_enabled(&d->bus_master_enable_region,
>>                                     (pci_get_word(d->config + PCI_COMMAND)
>> -                                   & PCI_COMMAND_MASTER) && d->has_power);
>> +                                   & PCI_COMMAND_MASTER) && d->is_enabled);
>>       }
>>   
>>       msi_write_config(d, addr, val_in, l);
>> @@ -2154,7 +2154,9 @@ static void pci_qdev_realize(DeviceState *qdev, Error **errp)
>>           return;
>>       }
>>   
>> -    pci_set_power(pci_dev, true);
>> +    if (!pci_is_vf(pci_dev)) {
>> +        pci_set_enabled(pci_dev, true);
>> +    }
> 
> why are you not enabling vfs specifically? seems confusing. you also
> duplicate this logic in a couple more places. needs a
> comment at least. Or better, add sane APIs that do this correctly.
> E.g. keep pci_set_power around?

It makes sense. I'll add pci_set_power(), which calls pci_set_enabled() 
only for PFs and has a proper comment.

> 
>>   
>>       pci_dev->msi_trigger = pci_msi_trigger;
>>   }
>> @@ -2810,18 +2812,18 @@ MSIMessage pci_get_msi_message(PCIDevice *dev, int vector)
>>       return msg;
>>   }
>>   
>> -void pci_set_power(PCIDevice *d, bool state)
>> +void pci_set_enabled(PCIDevice *d, bool state)
>>   {
>> -    if (d->has_power == state) {
>> +    if (d->is_enabled == state) {
>>           return;
>>       }
>>   
>> -    d->has_power = state;
>> +    d->is_enabled = state;
>>       pci_update_mappings(d);
>>       memory_region_set_enabled(&d->bus_master_enable_region,
>>                                 (pci_get_word(d->config + PCI_COMMAND)
>> -                               & PCI_COMMAND_MASTER) && d->has_power);
>> -    if (!d->has_power) {
>> +                               & PCI_COMMAND_MASTER) && d->is_enabled);
>> +    if (d->qdev.realized) {
>>           pci_device_reset(d);
>>       }
>>   }
>> diff --git a/hw/pci/pci_host.c b/hw/pci/pci_host.c
>> index dfe6fe618401..d7e13d72ce07 100644
>> --- a/hw/pci/pci_host.c
>> +++ b/hw/pci/pci_host.c
>> @@ -86,7 +86,7 @@ void pci_host_config_write_common(PCIDevice *pci_dev, uint32_t addr,
>>        * allowing direct removal of unexposed functions.
>>        */
>>       if ((pci_dev->qdev.hotplugged && !pci_get_function_0(pci_dev)) ||
>> -        !pci_dev->has_power || is_pci_dev_ejected(pci_dev)) {
>> +        !pci_dev->is_enabled || is_pci_dev_ejected(pci_dev)) {
>>           return;
>>       }
>>   
>> @@ -111,7 +111,7 @@ uint32_t pci_host_config_read_common(PCIDevice *pci_dev, uint32_t addr,
>>        * allowing direct removal of unexposed functions.
>>        */
>>       if ((pci_dev->qdev.hotplugged && !pci_get_function_0(pci_dev)) ||
>> -        !pci_dev->has_power || is_pci_dev_ejected(pci_dev)) {
>> +        !pci_dev->is_enabled || is_pci_dev_ejected(pci_dev)) {
>>           return ~0x0;
>>       }
>>   
>> diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
>> index 6db0cf69cd8a..f34c157e1fd3 100644
>> --- a/hw/pci/pcie.c
>> +++ b/hw/pci/pcie.c
>> @@ -394,7 +394,9 @@ static void pcie_set_power_device(PCIBus *bus, PCIDevice *dev, void *opaque)
>>   {
>>       bool *power = opaque;
>>   
>> -    pci_set_power(dev, *power);
>> +    if (!pci_is_vf(dev)) {
>> +        pci_set_enabled(dev, *power);
>> +    }
>>   }
>>   
>>   static void pcie_cap_update_power(PCIDevice *hotplug_dev)
>> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
>> index da209b7f47fd..9ba34cf8f8ed 100644
>> --- a/hw/pci/pcie_sriov.c
>> +++ b/hw/pci/pcie_sriov.c
>> @@ -20,15 +20,29 @@
>>   #include "qapi/error.h"
>>   #include "trace.h"
>>   
>> -static PCIDevice *register_vf(PCIDevice *pf, int devfn,
>> -                              const char *name, uint16_t vf_num);
>> -static void unregister_vfs(PCIDevice *dev);
>> +static void unrealize_vfs(PCIDevice *dev, uint16_t total_vfs)
>> +{
>> +    for (uint16_t i = 0; i < total_vfs; i++) {
>> +        Error *err = NULL;
>> +        PCIDevice *vf = dev->exp.sriov_pf.vf[i];
>> +        if (!object_property_set_bool(OBJECT(vf), "realized", false, &err)) {
>> +            error_reportf_err(err, "Failed to unplug: ");
>> +        }
>> +        object_unparent(OBJECT(vf));
>> +        object_unref(OBJECT(vf));
>> +    }
>> +    g_free(dev->exp.sriov_pf.vf);
>> +    dev->exp.sriov_pf.vf = NULL;
>> +}
>>   
>> -void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>> +bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>>                           const char *vfname, uint16_t vf_dev_id,
>>                           uint16_t init_vfs, uint16_t total_vfs,
>> -                        uint16_t vf_offset, uint16_t vf_stride)
>> +                        uint16_t vf_offset, uint16_t vf_stride,
>> +                        Error **errp)
>>   {
>> +    BusState *bus = qdev_get_parent_bus(&dev->qdev);
>> +    int32_t devfn = dev->devfn + vf_offset;
>>       uint8_t *cfg = dev->config + offset;
>>       uint8_t *wmask;
>>   
>> @@ -36,7 +50,6 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>>                           offset, PCI_EXT_CAP_SRIOV_SIZEOF);
>>       dev->exp.sriov_cap = offset;
>>       dev->exp.sriov_pf.num_vfs = 0;
>> -    dev->exp.sriov_pf.vfname = g_strdup(vfname);
>>       dev->exp.sriov_pf.vf = NULL;
>>   
>>       pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
>> @@ -69,13 +82,35 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>>       pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, 0x553);
>>   
>>       qdev_prop_set_bit(&dev->qdev, "multifunction", true);
>> +
>> +    dev->exp.sriov_pf.vf = g_new(PCIDevice *, total_vfs);
>> +
>> +    for (uint16_t i = 0; i < total_vfs; i++) {
>> +        PCIDevice *vf = pci_new(devfn, vfname);
>> +        vf->exp.sriov_vf.pf = dev;
>> +        vf->exp.sriov_vf.vf_number = i;
>> +
>> +        if (!qdev_realize(&vf->qdev, bus, errp)) {
>> +            unrealize_vfs(dev, i);
>> +            return false;
>> +        }
>> +
>> +        /* set vid/did according to sr/iov spec - they are not used */
>> +        pci_config_set_vendor_id(vf->config, 0xffff);
>> +        pci_config_set_device_id(vf->config, 0xffff);
>> +
>> +        dev->exp.sriov_pf.vf[i] = vf;
>> +        devfn += vf_stride;
>> +    }
>> +
>> +    return true;
>>   }
>>   
>>   void pcie_sriov_pf_exit(PCIDevice *dev)
>>   {
>> -    unregister_vfs(dev);
>> -    g_free((char *)dev->exp.sriov_pf.vfname);
>> -    dev->exp.sriov_pf.vfname = NULL;
>> +    uint8_t *cfg = dev->config + dev->exp.sriov_cap;
>> +
>> +    unrealize_vfs(dev, pci_get_word(cfg + PCI_SRIOV_TOTAL_VF));
>>   }
>>   
>>   void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
>> @@ -141,38 +176,11 @@ void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
>>       }
>>   }
>>   
>> -static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char *name,
>> -                              uint16_t vf_num)
>> -{
>> -    PCIDevice *dev = pci_new(devfn, name);
>> -    dev->exp.sriov_vf.pf = pf;
>> -    dev->exp.sriov_vf.vf_number = vf_num;
>> -    PCIBus *bus = pci_get_bus(pf);
>> -    Error *local_err = NULL;
>> -
>> -    qdev_realize(&dev->qdev, &bus->qbus, &local_err);
>> -    if (local_err) {
>> -        error_report_err(local_err);
>> -        return NULL;
>> -    }
>> -
>> -    /* set vid/did according to sr/iov spec - they are not used */
>> -    pci_config_set_vendor_id(dev->config, 0xffff);
>> -    pci_config_set_device_id(dev->config, 0xffff);
>> -
>> -    return dev;
>> -}
>> -
>>   static void register_vfs(PCIDevice *dev)
>>   {
>>       uint16_t num_vfs;
>>       uint16_t i;
>>       uint16_t sriov_cap = dev->exp.sriov_cap;
>> -    uint16_t vf_offset =
>> -        pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
>> -    uint16_t vf_stride =
>> -        pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
>> -    int32_t devfn = dev->devfn + vf_offset;
>>   
>>       assert(sriov_cap > 0);
>>       num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
>> @@ -180,18 +188,10 @@ static void register_vfs(PCIDevice *dev)
>>           return;
>>       }
>>   
>> -    dev->exp.sriov_pf.vf = g_new(PCIDevice *, num_vfs);
>> -
>>       trace_sriov_register_vfs(dev->name, PCI_SLOT(dev->devfn),
>>                                PCI_FUNC(dev->devfn), num_vfs);
>>       for (i = 0; i < num_vfs; i++) {
>> -        dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn,
>> -                                              dev->exp.sriov_pf.vfname, i);
>> -        if (!dev->exp.sriov_pf.vf[i]) {
>> -            num_vfs = i;
>> -            break;
>> -        }
>> -        devfn += vf_stride;
>> +        pci_set_enabled(dev->exp.sriov_pf.vf[i], true);
>>       }
>>       dev->exp.sriov_pf.num_vfs = num_vfs;
>>   }
>> @@ -204,16 +204,8 @@ static void unregister_vfs(PCIDevice *dev)
>>       trace_sriov_unregister_vfs(dev->name, PCI_SLOT(dev->devfn),
>>                                  PCI_FUNC(dev->devfn), num_vfs);
>>       for (i = 0; i < num_vfs; i++) {
>> -        Error *err = NULL;
>> -        PCIDevice *vf = dev->exp.sriov_pf.vf[i];
>> -        if (!object_property_set_bool(OBJECT(vf), "realized", false, &err)) {
>> -            error_reportf_err(err, "Failed to unplug: ");
>> -        }
>> -        object_unparent(OBJECT(vf));
>> -        object_unref(OBJECT(vf));
>> +        pci_set_enabled(dev->exp.sriov_pf.vf[i], false);
>>       }
>> -    g_free(dev->exp.sriov_pf.vf);
>> -    dev->exp.sriov_pf.vf = NULL;
>>       dev->exp.sriov_pf.num_vfs = 0;
>>       pci_set_word(dev->config + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0);
>>   }
>>
>> -- 
>> 2.43.0
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 5/9] pcie_sriov: Validate NumVFs
  2024-02-14  6:52   ` Michael S. Tsirkin
@ 2024-02-14 14:49     ` Akihiko Odaki
  2024-02-14 15:54       ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14 14:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On 2024/02/14 15:52, Michael S. Tsirkin wrote:
> On Wed, Feb 14, 2024 at 02:13:43PM +0900, Akihiko Odaki wrote:
>> The guest may write NumVFs greater than TotalVFs and that can lead
>> to buffer overflow in VF implementations.
>>
>> Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)")
>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>> ---
>>   hw/pci/pcie_sriov.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
>> index a1fe65f5d801..da209b7f47fd 100644
>> --- a/hw/pci/pcie_sriov.c
>> +++ b/hw/pci/pcie_sriov.c
>> @@ -176,6 +176,9 @@ static void register_vfs(PCIDevice *dev)
>>   
>>       assert(sriov_cap > 0);
>>       num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
>> +    if (num_vfs > pci_get_word(dev->config + sriov_cap + PCI_SRIOV_TOTAL_VF)) {
>> +        return;
>> +    }
> 
> 
> yes but with your change PCI_SRIOV_NUM_VF no longer reflects the
> number of registered VFs and that might lead to uninitialized
> data read which is not better :(.
> 
> How about just forcing the PCI_SRIOV_NUM_VF register to be
> below PCI_SRIOV_TOTAL_VF at all times?

PCI_SRIOV_NUM_VF is already divergent from the number of registered VFs. 
It may have a number greater than the current registered VFs before 
setting VF Enable.

The guest may also change PCI_SRIOV_NUM_VF while VF Enable is set; the 
behavior is undefined in such a case but we still accept such a write. A 
value written in such a case won't be reflected to the actual number of 
enabled VFs.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 5/9] pcie_sriov: Validate NumVFs
  2024-02-14  8:58   ` Michael Tokarev
@ 2024-02-14 14:54     ` Akihiko Odaki
  2024-02-14 15:53       ` Michael Tokarev
  0 siblings, 1 reply; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14 14:54 UTC (permalink / raw)
  To: Michael Tokarev, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, qemu-stable

On 2024/02/14 17:58, Michael Tokarev wrote:
> 14.02.2024 08:13, Akihiko Odaki wrote:
>> The guest may write NumVFs greater than TotalVFs and that can lead
>> to buffer overflow in VF implementations.
> 
> This seems to be stable-worthy (Cc'd), and maybe even CVE-worthy?

Perhaps so. The scope of the bug is limited to emulated SR-IOV devices, 
and I think nobody use them except for development, but it may be still 
nice to have a CVE.

Can anyone help assign a CVE? I don't know the procedure.

Regards,
Akihiko Odaki

> 
> Thanks,
> 
> /mjt
> 
>> Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O 
>> Virtualization (SR/IOV)")
>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>> ---
>>   hw/pci/pcie_sriov.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
>> index a1fe65f5d801..da209b7f47fd 100644
>> --- a/hw/pci/pcie_sriov.c
>> +++ b/hw/pci/pcie_sriov.c
>> @@ -176,6 +176,9 @@ static void register_vfs(PCIDevice *dev)
>>       assert(sriov_cap > 0);
>>       num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
>> +    if (num_vfs > pci_get_word(dev->config + sriov_cap + 
>> PCI_SRIOV_TOTAL_VF)) {
>> +        return;
>> +    }
>>       dev->exp.sriov_pf.vf = g_new(PCIDevice *, num_vfs);
>>
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 9/9] hw/nvme: Refer to dev->exp.sriov_pf.num_vfs
  2024-02-14 14:09     ` Akihiko Odaki
@ 2024-02-14 15:46       ` Michael S. Tsirkin
  2024-02-14 16:07         ` Akihiko Odaki
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2024-02-14 15:46 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On Wed, Feb 14, 2024 at 11:09:50PM +0900, Akihiko Odaki wrote:
> On 2024/02/14 16:07, Michael S. Tsirkin wrote:
> > On Wed, Feb 14, 2024 at 02:13:47PM +0900, Akihiko Odaki wrote:
> > > NumVFs may not equal to the current effective number of VFs because VF
> > > Enable is cleared, NumVFs is set after VF Enable is set, or NumVFs is
> > > greater than TotalVFs.
> > > 
> > > Fixes: 11871f53ef8e ("hw/nvme: Add support for the Virtualization Management command")
> > > Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> > 
> > I don't get what this is saying about VF enable.
> > This code will not trigger on numVFs write when VF enable is set.
> > Generally this commit makes no sense on its own, squash it with
> > the pci core change pls.
> 
> This code is meant to run when it is clearing VF Enable, and its
> functionality is to change the state of VFs currently enabled so that we can
> disable them.
> 
> However, NumVFs does not necessarily represent VFs currently being enabled,
> and have a different value in the case described above.

Ah so in this case, if numvfs is changed and then VFs are disabled,
we will not call nvme_virt_set_state? OK, it should say this in commit log.
And then, what happens?

> Such cases exist
> even before the earlier patches and this fix is independently meaningful.

yes but the previous patch causes a regression without this one.
squash it.


> > 
> > > ---
> > >   hw/nvme/ctrl.c | 5 ++---
> > >   1 file changed, 2 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > index f8df622fe590..daedda5d326f 100644
> > > --- a/hw/nvme/ctrl.c
> > > +++ b/hw/nvme/ctrl.c
> > > @@ -8481,7 +8481,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
> > >       NvmeSecCtrlEntry *sctrl;
> > >       uint16_t sriov_cap = dev->exp.sriov_cap;
> > >       uint32_t off = address - sriov_cap;
> > > -    int i, num_vfs;
> > > +    int i;
> > >       if (!sriov_cap) {
> > >           return;
> > > @@ -8489,8 +8489,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
> > >       if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> > >           if (!(val & PCI_SRIOV_CTRL_VFE)) {
> > > -            num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> > > -            for (i = 0; i < num_vfs; i++) {
> > > +            for (i = 0; i < dev->exp.sriov_pf.num_vfs; i++) {

If the assumption you now make is that num_vfs only changes
when VFs are disabled, we should add a comment documenting this.
In fact, I think there's a nicer way to do this:

static void nvme_pci_write_config(PCIDevice *dev, uint32_t address,
                                  uint32_t val, int len)
{
    int old_num_vfs = dev->exp.sriov_pf.num_vfs;

    pci_default_write_config(dev, address, val, len);
    pcie_cap_flr_write_config(dev, address, val, len);
    nvme_sriov_pre_write_ctrl(dev, address, val, len, old_num_vfs);
}

and now, nvme_sriov_pre_write_ctrl can compare:

if (old_num_vfs && !dev->exp.sriov_pf.num_vfs)
	disable everything


this, without bothering with detail of SRIOV capability.
No?



> > >                   sctrl = &n->sec_ctrl_list.sec[i];
> > >                   nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
> > >               }
> > > 
> > > -- 
> > > 2.43.0
> > 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 8/9] pcie_sriov: Do not reset NumVFs after unregistering VFs
  2024-02-14 14:32     ` Akihiko Odaki
@ 2024-02-14 15:51       ` Michael S. Tsirkin
  2024-02-14 16:10         ` Akihiko Odaki
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2024-02-14 15:51 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On Wed, Feb 14, 2024 at 11:32:11PM +0900, Akihiko Odaki wrote:
> On 2024/02/14 15:53, Michael S. Tsirkin wrote:
> > On Wed, Feb 14, 2024 at 02:13:46PM +0900, Akihiko Odaki wrote:
> > > I couldn't find such a behavior specified.
> > 
> > Is it fixing a bug or just removing unnecessary code?
> > Is this guest visible at all?
> 
> My intention is just to remove unnecessary code, but it is guest-visible.
> The original behavior causes a problem and it should be considered as a bug
> fix if a guest expects VFs can be restored by setting VF Enable after
> clearing it, but I don't know such an example.

Ah ok.  So:

	According to the spec, PCI_SRIOV_NUM_VF isn't reset when
	VFs are disabled. Clearing it is guest visible and out of spec,
	even though guests mostly don't rely on this value being
	preserved, so we never noticed.

I wonder however what happens on reset.
How come we don't have a reset callback for sriov cap?
I suspect this hack serves as a work around to avoid leaking
this register, and we really should fix that too here?

> > 
> > > Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> > > ---
> > >   hw/pci/pcie_sriov.c | 1 -
> > >   1 file changed, 1 deletion(-)
> > > 
> > > diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> > > index 9d668b8d6c17..410bc090fc58 100644
> > > --- a/hw/pci/pcie_sriov.c
> > > +++ b/hw/pci/pcie_sriov.c
> > > @@ -209,7 +209,6 @@ static void unregister_vfs(PCIDevice *dev)
> > >           pci_set_enabled(dev->exp.sriov_pf.vf[i], false);
> > >       }
> > >       dev->exp.sriov_pf.num_vfs = 0;
> > > -    pci_set_word(dev->config + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0);
> > >   }
> > >   void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> > > 
> > > -- 
> > > 2.43.0
> > 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 5/9] pcie_sriov: Validate NumVFs
  2024-02-14 14:54     ` Akihiko Odaki
@ 2024-02-14 15:53       ` Michael Tokarev
  2024-02-14 15:55         ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Michael Tokarev @ 2024-02-14 15:53 UTC (permalink / raw)
  To: Akihiko Odaki, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Cédric Le Goater, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen
  Cc: qemu-devel, qemu-block, qemu-stable

14.02.2024 17:54, Akihiko Odaki wrote:
> On 2024/02/14 17:58, Michael Tokarev wrote:
>> 14.02.2024 08:13, Akihiko Odaki wrote:
>>> The guest may write NumVFs greater than TotalVFs and that can lead
>>> to buffer overflow in VF implementations.
>>
>> This seems to be stable-worthy (Cc'd), and maybe even CVE-worthy?
> 
> Perhaps so. The scope of the bug is limited to emulated SR-IOV devices, and I think nobody use them except for development, but it may be still nice 
> to have a CVE.
> 
> Can anyone help assign a CVE? I don't know the procedure.

Heh. Usually I ask exactly the opposite question: how to avoid assigning
a CVE# for a non-issue which they most likely think is a serious security
bug?  We've plenty of these in qemu, collecting dust for years...  For
example, for things like some actions by privileged guest process (or kernel)
which leads to qemu dying with assertion failure, which, on a real HW, will
cause hardware lockup.

Nope, I don't remember how to request a CVE ;)

/mjt


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 5/9] pcie_sriov: Validate NumVFs
  2024-02-14 14:49     ` Akihiko Odaki
@ 2024-02-14 15:54       ` Michael S. Tsirkin
  2024-02-14 16:22         ` Akihiko Odaki
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2024-02-14 15:54 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On Wed, Feb 14, 2024 at 11:49:52PM +0900, Akihiko Odaki wrote:
> On 2024/02/14 15:52, Michael S. Tsirkin wrote:
> > On Wed, Feb 14, 2024 at 02:13:43PM +0900, Akihiko Odaki wrote:
> > > The guest may write NumVFs greater than TotalVFs and that can lead
> > > to buffer overflow in VF implementations.
> > > 
> > > Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)")
> > > Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> > > ---
> > >   hw/pci/pcie_sriov.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> > > index a1fe65f5d801..da209b7f47fd 100644
> > > --- a/hw/pci/pcie_sriov.c
> > > +++ b/hw/pci/pcie_sriov.c
> > > @@ -176,6 +176,9 @@ static void register_vfs(PCIDevice *dev)
> > >       assert(sriov_cap > 0);
> > >       num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> > > +    if (num_vfs > pci_get_word(dev->config + sriov_cap + PCI_SRIOV_TOTAL_VF)) {
> > > +        return;
> > > +    }
> > 
> > 
> > yes but with your change PCI_SRIOV_NUM_VF no longer reflects the
> > number of registered VFs and that might lead to uninitialized
> > data read which is not better :(.
> > 
> > How about just forcing the PCI_SRIOV_NUM_VF register to be
> > below PCI_SRIOV_TOTAL_VF at all times?
> 
> PCI_SRIOV_NUM_VF is already divergent from the number of registered VFs. It
> may have a number greater than the current registered VFs before setting VF
> Enable.
> 
> The guest may also change PCI_SRIOV_NUM_VF while VF Enable is set; the
> behavior is undefined in such a case but we still accept such a write. A
> value written in such a case won't be reflected to the actual number of
> enabled VFs.

OK then let's add a comment near num_vfs explaining all this and saying
only to use this value. I also would prefer to have this if
just where we set num_vfs. And maybe after all do set num_vfs to
PCI_SRIOV_TOTAL_VF? Less of a chance of breaking something I feel...

-- 
MST



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 5/9] pcie_sriov: Validate NumVFs
  2024-02-14 15:53       ` Michael Tokarev
@ 2024-02-14 15:55         ` Michael S. Tsirkin
  2024-02-17 11:32           ` Akihiko Odaki
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2024-02-14 15:55 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Akihiko Odaki, Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block, qemu-stable

On Wed, Feb 14, 2024 at 06:53:43PM +0300, Michael Tokarev wrote:
> Nope, I don't remember how to request a CVE ;)

https://www.qemu.org/contribute/security-process/



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 9/9] hw/nvme: Refer to dev->exp.sriov_pf.num_vfs
  2024-02-14 15:46       ` Michael S. Tsirkin
@ 2024-02-14 16:07         ` Akihiko Odaki
  2024-02-14 16:34           ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14 16:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On 2024/02/15 0:46, Michael S. Tsirkin wrote:
> On Wed, Feb 14, 2024 at 11:09:50PM +0900, Akihiko Odaki wrote:
>> On 2024/02/14 16:07, Michael S. Tsirkin wrote:
>>> On Wed, Feb 14, 2024 at 02:13:47PM +0900, Akihiko Odaki wrote:
>>>> NumVFs may not equal to the current effective number of VFs because VF
>>>> Enable is cleared, NumVFs is set after VF Enable is set, or NumVFs is
>>>> greater than TotalVFs.
>>>>
>>>> Fixes: 11871f53ef8e ("hw/nvme: Add support for the Virtualization Management command")
>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>
>>> I don't get what this is saying about VF enable.
>>> This code will not trigger on numVFs write when VF enable is set.
>>> Generally this commit makes no sense on its own, squash it with
>>> the pci core change pls.
>>
>> This code is meant to run when it is clearing VF Enable, and its
>> functionality is to change the state of VFs currently enabled so that we can
>> disable them.
>>
>> However, NumVFs does not necessarily represent VFs currently being enabled,
>> and have a different value in the case described above.
> 
> Ah so in this case, if numvfs is changed and then VFs are disabled,
> we will not call nvme_virt_set_state? OK, it should say this in commit log.
> And then, what happens?

We will call nvme_virt_set_state() but only for VFs already enabled.

> 
>> Such cases exist
>> even before the earlier patches and this fix is independently meaningful.
> 
> yes but the previous patch causes a regression without this one.
> squash it.

I'll move this patch before the previous patch.

> 
> 
>>>
>>>> ---
>>>>    hw/nvme/ctrl.c | 5 ++---
>>>>    1 file changed, 2 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
>>>> index f8df622fe590..daedda5d326f 100644
>>>> --- a/hw/nvme/ctrl.c
>>>> +++ b/hw/nvme/ctrl.c
>>>> @@ -8481,7 +8481,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
>>>>        NvmeSecCtrlEntry *sctrl;
>>>>        uint16_t sriov_cap = dev->exp.sriov_cap;
>>>>        uint32_t off = address - sriov_cap;
>>>> -    int i, num_vfs;
>>>> +    int i;
>>>>        if (!sriov_cap) {
>>>>            return;
>>>> @@ -8489,8 +8489,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
>>>>        if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
>>>>            if (!(val & PCI_SRIOV_CTRL_VFE)) {
>>>> -            num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
>>>> -            for (i = 0; i < num_vfs; i++) {
>>>> +            for (i = 0; i < dev->exp.sriov_pf.num_vfs; i++) {
> 
> If the assumption you now make is that num_vfs only changes
> when VFs are disabled, we should add a comment documenting this.
> In fact, I think there's a nicer way to do this:
> 
> static void nvme_pci_write_config(PCIDevice *dev, uint32_t address,
>                                    uint32_t val, int len)
> {
>      int old_num_vfs = dev->exp.sriov_pf.num_vfs;
> 
>      pci_default_write_config(dev, address, val, len);
>      pcie_cap_flr_write_config(dev, address, val, len);
>      nvme_sriov_pre_write_ctrl(dev, address, val, len, old_num_vfs);
> }
> 
> and now, nvme_sriov_pre_write_ctrl can compare:
> 
> if (old_num_vfs && !dev->exp.sriov_pf.num_vfs)
> 	disable everything
> 
> 
> this, without bothering with detail of SRIOV capability.
> No?

It looks better. I'll do so in the next version.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 8/9] pcie_sriov: Do not reset NumVFs after unregistering VFs
  2024-02-14 15:51       ` Michael S. Tsirkin
@ 2024-02-14 16:10         ` Akihiko Odaki
  0 siblings, 0 replies; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14 16:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On 2024/02/15 0:51, Michael S. Tsirkin wrote:
> On Wed, Feb 14, 2024 at 11:32:11PM +0900, Akihiko Odaki wrote:
>> On 2024/02/14 15:53, Michael S. Tsirkin wrote:
>>> On Wed, Feb 14, 2024 at 02:13:46PM +0900, Akihiko Odaki wrote:
>>>> I couldn't find such a behavior specified.
>>>
>>> Is it fixing a bug or just removing unnecessary code?
>>> Is this guest visible at all?
>>
>> My intention is just to remove unnecessary code, but it is guest-visible.
>> The original behavior causes a problem and it should be considered as a bug
>> fix if a guest expects VFs can be restored by setting VF Enable after
>> clearing it, but I don't know such an example.
> 
> Ah ok.  So:
> 
> 	According to the spec, PCI_SRIOV_NUM_VF isn't reset when
> 	VFs are disabled. Clearing it is guest visible and out of spec,
> 	even though guests mostly don't rely on this value being
> 	preserved, so we never noticed.
> 
> I wonder however what happens on reset.
> How come we don't have a reset callback for sriov cap?
> I suspect this hack serves as a work around to avoid leaking
> this register, and we really should fix that too here?

pcie_sriov_pf_disable_vfs() is meant to serve for that purpose, but it 
just disable VFs and does not update registers. I'll change it to update 
registers and rename it to pcie_sriov_pf_reset().

> 
>>>
>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>> ---
>>>>    hw/pci/pcie_sriov.c | 1 -
>>>>    1 file changed, 1 deletion(-)
>>>>
>>>> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
>>>> index 9d668b8d6c17..410bc090fc58 100644
>>>> --- a/hw/pci/pcie_sriov.c
>>>> +++ b/hw/pci/pcie_sriov.c
>>>> @@ -209,7 +209,6 @@ static void unregister_vfs(PCIDevice *dev)
>>>>            pci_set_enabled(dev->exp.sriov_pf.vf[i], false);
>>>>        }
>>>>        dev->exp.sriov_pf.num_vfs = 0;
>>>> -    pci_set_word(dev->config + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0);
>>>>    }
>>>>    void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
>>>>
>>>> -- 
>>>> 2.43.0
>>>
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 5/9] pcie_sriov: Validate NumVFs
  2024-02-14 15:54       ` Michael S. Tsirkin
@ 2024-02-14 16:22         ` Akihiko Odaki
  0 siblings, 0 replies; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14 16:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On 2024/02/15 0:54, Michael S. Tsirkin wrote:
> On Wed, Feb 14, 2024 at 11:49:52PM +0900, Akihiko Odaki wrote:
>> On 2024/02/14 15:52, Michael S. Tsirkin wrote:
>>> On Wed, Feb 14, 2024 at 02:13:43PM +0900, Akihiko Odaki wrote:
>>>> The guest may write NumVFs greater than TotalVFs and that can lead
>>>> to buffer overflow in VF implementations.
>>>>
>>>> Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)")
>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>> ---
>>>>    hw/pci/pcie_sriov.c | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
>>>> index a1fe65f5d801..da209b7f47fd 100644
>>>> --- a/hw/pci/pcie_sriov.c
>>>> +++ b/hw/pci/pcie_sriov.c
>>>> @@ -176,6 +176,9 @@ static void register_vfs(PCIDevice *dev)
>>>>        assert(sriov_cap > 0);
>>>>        num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
>>>> +    if (num_vfs > pci_get_word(dev->config + sriov_cap + PCI_SRIOV_TOTAL_VF)) {
>>>> +        return;
>>>> +    }
>>>
>>>
>>> yes but with your change PCI_SRIOV_NUM_VF no longer reflects the
>>> number of registered VFs and that might lead to uninitialized
>>> data read which is not better :(.
>>>
>>> How about just forcing the PCI_SRIOV_NUM_VF register to be
>>> below PCI_SRIOV_TOTAL_VF at all times?
>>
>> PCI_SRIOV_NUM_VF is already divergent from the number of registered VFs. It
>> may have a number greater than the current registered VFs before setting VF
>> Enable.
>>
>> The guest may also change PCI_SRIOV_NUM_VF while VF Enable is set; the
>> behavior is undefined in such a case but we still accept such a write. A
>> value written in such a case won't be reflected to the actual number of
>> enabled VFs.
> 
> OK then let's add a comment near num_vfs explaining all this and saying
> only to use this value. I also would prefer to have this if
> just where we set num_vfs. And maybe after all do set num_vfs to
> PCI_SRIOV_TOTAL_VF? Less of a chance of breaking something I feel...

I don't think we need a comment for this. In general, it should be the 
last thing to do to read the PCI configuration space, which is writable 
by the guest, without proper validation. And such a validation should 
happen in the internal of the capability implementation.

I think the core of the problem in nvme is that it decided to take a 
look into SR-IOV configurations. It should have never looked at 
PCI_SRIOV_CTRL or PCI_SRIOV_NUM_VF.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 9/9] hw/nvme: Refer to dev->exp.sriov_pf.num_vfs
  2024-02-14 16:07         ` Akihiko Odaki
@ 2024-02-14 16:34           ` Michael S. Tsirkin
  2024-02-14 16:51             ` Akihiko Odaki
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2024-02-14 16:34 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On Thu, Feb 15, 2024 at 01:07:29AM +0900, Akihiko Odaki wrote:
> On 2024/02/15 0:46, Michael S. Tsirkin wrote:
> > On Wed, Feb 14, 2024 at 11:09:50PM +0900, Akihiko Odaki wrote:
> > > On 2024/02/14 16:07, Michael S. Tsirkin wrote:
> > > > On Wed, Feb 14, 2024 at 02:13:47PM +0900, Akihiko Odaki wrote:
> > > > > NumVFs may not equal to the current effective number of VFs because VF
> > > > > Enable is cleared, NumVFs is set after VF Enable is set, or NumVFs is
> > > > > greater than TotalVFs.
> > > > > 
> > > > > Fixes: 11871f53ef8e ("hw/nvme: Add support for the Virtualization Management command")
> > > > > Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> > > > 
> > > > I don't get what this is saying about VF enable.
> > > > This code will not trigger on numVFs write when VF enable is set.
> > > > Generally this commit makes no sense on its own, squash it with
> > > > the pci core change pls.
> > > 
> > > This code is meant to run when it is clearing VF Enable, and its
> > > functionality is to change the state of VFs currently enabled so that we can
> > > disable them.
> > > 
> > > However, NumVFs does not necessarily represent VFs currently being enabled,
> > > and have a different value in the case described above.
> > 
> > Ah so in this case, if numvfs is changed and then VFs are disabled,
> > we will not call nvme_virt_set_state? OK, it should say this in commit log.
> > And then, what happens?
> 
> We will call nvme_virt_set_state() but only for VFs already enabled.

And? What does it cause? memory leak? some buffer is overrun?
the guest behaviour is illegal so it does not really
matter what we do as long as nothing too bad happens.

> > 
> > > Such cases exist
> > > even before the earlier patches and this fix is independently meaningful.
> > 
> > yes but the previous patch causes a regression without this one.
> > squash it.
> 
> I'll move this patch before the previous patch.
> 
> > 
> > 
> > > > 
> > > > > ---
> > > > >    hw/nvme/ctrl.c | 5 ++---
> > > > >    1 file changed, 2 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > > > index f8df622fe590..daedda5d326f 100644
> > > > > --- a/hw/nvme/ctrl.c
> > > > > +++ b/hw/nvme/ctrl.c
> > > > > @@ -8481,7 +8481,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
> > > > >        NvmeSecCtrlEntry *sctrl;
> > > > >        uint16_t sriov_cap = dev->exp.sriov_cap;
> > > > >        uint32_t off = address - sriov_cap;
> > > > > -    int i, num_vfs;
> > > > > +    int i;
> > > > >        if (!sriov_cap) {
> > > > >            return;
> > > > > @@ -8489,8 +8489,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
> > > > >        if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> > > > >            if (!(val & PCI_SRIOV_CTRL_VFE)) {
> > > > > -            num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> > > > > -            for (i = 0; i < num_vfs; i++) {
> > > > > +            for (i = 0; i < dev->exp.sriov_pf.num_vfs; i++) {
> > 
> > If the assumption you now make is that num_vfs only changes
> > when VFs are disabled, we should add a comment documenting this.
> > In fact, I think there's a nicer way to do this:
> > 
> > static void nvme_pci_write_config(PCIDevice *dev, uint32_t address,
> >                                    uint32_t val, int len)
> > {
> >      int old_num_vfs = dev->exp.sriov_pf.num_vfs;
> > 
> >      pci_default_write_config(dev, address, val, len);
> >      pcie_cap_flr_write_config(dev, address, val, len);
> >      nvme_sriov_pre_write_ctrl(dev, address, val, len, old_num_vfs);
> > }
> > 
> > and now, nvme_sriov_pre_write_ctrl can compare:
> > 
> > if (old_num_vfs && !dev->exp.sriov_pf.num_vfs)
> > 	disable everything
> > 
> > 
> > this, without bothering with detail of SRIOV capability.
> > No?
> 
> It looks better. I'll do so in the next version.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 9/9] hw/nvme: Refer to dev->exp.sriov_pf.num_vfs
  2024-02-14 16:34           ` Michael S. Tsirkin
@ 2024-02-14 16:51             ` Akihiko Odaki
  0 siblings, 0 replies; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-14 16:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block

On 2024/02/15 1:34, Michael S. Tsirkin wrote:
> On Thu, Feb 15, 2024 at 01:07:29AM +0900, Akihiko Odaki wrote:
>> On 2024/02/15 0:46, Michael S. Tsirkin wrote:
>>> On Wed, Feb 14, 2024 at 11:09:50PM +0900, Akihiko Odaki wrote:
>>>> On 2024/02/14 16:07, Michael S. Tsirkin wrote:
>>>>> On Wed, Feb 14, 2024 at 02:13:47PM +0900, Akihiko Odaki wrote:
>>>>>> NumVFs may not equal to the current effective number of VFs because VF
>>>>>> Enable is cleared, NumVFs is set after VF Enable is set, or NumVFs is
>>>>>> greater than TotalVFs.
>>>>>>
>>>>>> Fixes: 11871f53ef8e ("hw/nvme: Add support for the Virtualization Management command")
>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>>>
>>>>> I don't get what this is saying about VF enable.
>>>>> This code will not trigger on numVFs write when VF enable is set.
>>>>> Generally this commit makes no sense on its own, squash it with
>>>>> the pci core change pls.
>>>>
>>>> This code is meant to run when it is clearing VF Enable, and its
>>>> functionality is to change the state of VFs currently enabled so that we can
>>>> disable them.
>>>>
>>>> However, NumVFs does not necessarily represent VFs currently being enabled,
>>>> and have a different value in the case described above.
>>>
>>> Ah so in this case, if numvfs is changed and then VFs are disabled,
>>> we will not call nvme_virt_set_state? OK, it should say this in commit log.
>>> And then, what happens?
>>
>> We will call nvme_virt_set_state() but only for VFs already enabled.
> 
> And? What does it cause? memory leak? some buffer is overrun?
> the guest behaviour is illegal so it does not really
> matter what we do as long as nothing too bad happens.

nvme_sriov_pre_write_ctrl() is intended to free resources allocated to 
VFs. Previously, nvme_sriov_pre_write_ctrl() used NumVFs to iterate VFs 
with resources allocated. If NumVFs is changed and then VFs are 
disabled, this iteration resulted in buffer overrun.

With this patch, the changed value of NumVFs will be ignored, and 
nvme_sriov_pre_write_ctrl() only frees resources allocated to VFs 
actually enabled, thus no buffer overrun happens.

> 
>>>
>>>> Such cases exist
>>>> even before the earlier patches and this fix is independently meaningful.
>>>
>>> yes but the previous patch causes a regression without this one.
>>> squash it.
>>
>> I'll move this patch before the previous patch.
>>
>>>
>>>
>>>>>
>>>>>> ---
>>>>>>     hw/nvme/ctrl.c | 5 ++---
>>>>>>     1 file changed, 2 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
>>>>>> index f8df622fe590..daedda5d326f 100644
>>>>>> --- a/hw/nvme/ctrl.c
>>>>>> +++ b/hw/nvme/ctrl.c
>>>>>> @@ -8481,7 +8481,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
>>>>>>         NvmeSecCtrlEntry *sctrl;
>>>>>>         uint16_t sriov_cap = dev->exp.sriov_cap;
>>>>>>         uint32_t off = address - sriov_cap;
>>>>>> -    int i, num_vfs;
>>>>>> +    int i;
>>>>>>         if (!sriov_cap) {
>>>>>>             return;
>>>>>> @@ -8489,8 +8489,7 @@ static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
>>>>>>         if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
>>>>>>             if (!(val & PCI_SRIOV_CTRL_VFE)) {
>>>>>> -            num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
>>>>>> -            for (i = 0; i < num_vfs; i++) {
>>>>>> +            for (i = 0; i < dev->exp.sriov_pf.num_vfs; i++) {
>>>
>>> If the assumption you now make is that num_vfs only changes
>>> when VFs are disabled, we should add a comment documenting this.
>>> In fact, I think there's a nicer way to do this:
>>>
>>> static void nvme_pci_write_config(PCIDevice *dev, uint32_t address,
>>>                                     uint32_t val, int len)
>>> {
>>>       int old_num_vfs = dev->exp.sriov_pf.num_vfs;
>>>
>>>       pci_default_write_config(dev, address, val, len);
>>>       pcie_cap_flr_write_config(dev, address, val, len);
>>>       nvme_sriov_pre_write_ctrl(dev, address, val, len, old_num_vfs);
>>> }
>>>
>>> and now, nvme_sriov_pre_write_ctrl can compare:
>>>
>>> if (old_num_vfs && !dev->exp.sriov_pf.num_vfs)
>>> 	disable everything
>>>
>>>
>>> this, without bothering with detail of SRIOV capability.
>>> No?
>>
>> It looks better. I'll do so in the next version.
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 5/9] pcie_sriov: Validate NumVFs
  2024-02-14 15:55         ` Michael S. Tsirkin
@ 2024-02-17 11:32           ` Akihiko Odaki
  0 siblings, 0 replies; 34+ messages in thread
From: Akihiko Odaki @ 2024-02-17 11:32 UTC (permalink / raw)
  To: Michael S. Tsirkin, Michael Tokarev
  Cc: Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Sriram Yagnaraman, Jason Wang, Keith Busch,
	Klaus Jensen, qemu-devel, qemu-block, qemu-stable

On 2024/02/15 0:55, Michael S. Tsirkin wrote:
> On Wed, Feb 14, 2024 at 06:53:43PM +0300, Michael Tokarev wrote:
>> Nope, I don't remember how to request a CVE ;)
> 
> https://www.qemu.org/contribute/security-process/

Thanks, I requested CVEs with the form.

QEMU doesn't have any list of features without security guarantee so I 
think it's "kind" to have CVEs for any features though I would certainly 
add igb and NVMe's SR-IOV feature to such a list if there is.

It should be harmless to have extra CVEs at least. Well, some may think 
all CVEs serious but that's not my fault.


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2024-02-17 11:33 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-14  5:13 [PATCH v4 0/9] hw/pci: SR-IOV related fixes and improvements Akihiko Odaki
2024-02-14  5:13 ` [PATCH v4 1/9] hw/pci: Use -1 as a default value for rombar Akihiko Odaki
2024-02-14  5:13 ` [PATCH v4 2/9] hw/pci: Determine if rombar is explicitly enabled Akihiko Odaki
2024-02-14  6:49   ` Michael S. Tsirkin
2024-02-14  5:13 ` [PATCH v4 3/9] vfio: Avoid inspecting option QDict for rombar Akihiko Odaki
2024-02-14  5:13 ` [PATCH v4 4/9] hw/qdev: Remove opts member Akihiko Odaki
2024-02-14  5:13 ` [PATCH v4 5/9] pcie_sriov: Validate NumVFs Akihiko Odaki
2024-02-14  6:52   ` Michael S. Tsirkin
2024-02-14 14:49     ` Akihiko Odaki
2024-02-14 15:54       ` Michael S. Tsirkin
2024-02-14 16:22         ` Akihiko Odaki
2024-02-14  8:58   ` Michael Tokarev
2024-02-14 14:54     ` Akihiko Odaki
2024-02-14 15:53       ` Michael Tokarev
2024-02-14 15:55         ` Michael S. Tsirkin
2024-02-17 11:32           ` Akihiko Odaki
2024-02-14  5:13 ` [PATCH v4 6/9] pcie_sriov: Reuse SR-IOV VF device instances Akihiko Odaki
2024-02-14  7:54   ` Michael S. Tsirkin
2024-02-14 14:44     ` Akihiko Odaki
2024-02-14  5:13 ` [PATCH v4 7/9] pcie_sriov: Release VFs failed to realize Akihiko Odaki
2024-02-14  7:54   ` Michael S. Tsirkin
2024-02-14  7:58     ` Akihiko Odaki
2024-02-14  5:13 ` [PATCH v4 8/9] pcie_sriov: Do not reset NumVFs after unregistering VFs Akihiko Odaki
2024-02-14  6:53   ` Michael S. Tsirkin
2024-02-14 14:32     ` Akihiko Odaki
2024-02-14 15:51       ` Michael S. Tsirkin
2024-02-14 16:10         ` Akihiko Odaki
2024-02-14  5:13 ` [PATCH v4 9/9] hw/nvme: Refer to dev->exp.sriov_pf.num_vfs Akihiko Odaki
2024-02-14  7:07   ` Michael S. Tsirkin
2024-02-14 14:09     ` Akihiko Odaki
2024-02-14 15:46       ` Michael S. Tsirkin
2024-02-14 16:07         ` Akihiko Odaki
2024-02-14 16:34           ` Michael S. Tsirkin
2024-02-14 16:51             ` Akihiko Odaki

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.