qemu-devel.nongnu.org archive mirror
* [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
@ 2021-10-07 16:23 Lukasz Maniak
  2021-10-07 16:23 ` [PATCH 01/15] pcie: Set default and supported MaxReadReq to 512 Lukasz Maniak
                   ` (16 more replies)
  0 siblings, 17 replies; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:23 UTC (permalink / raw)
  To: qemu-devel; +Cc: Łukasz Gieryk, Lukasz Maniak, qemu-block

Hi,

This series of patches is an attempt to add support for the following
sections of the NVMe specification, revision 1.4:

8.5 Virtualization Enhancements (Optional)
    8.5.1 VQ Resource Definition
    8.5.2 VI Resource Definition
    8.5.3 Secondary Controller States and Resource Configuration
    8.5.4 Single Root I/O Virtualization and Sharing (SR-IOV)

The NVMe controller's Single Root I/O Virtualization and Sharing
implementation is based on patches introducing SR-IOV support for PCI
Express proposed by Knut Omang:
https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05155.html

However, based on what I was able to find in the list archives, Knut's
patches have not been merged into QEMU yet, since no working example
device had been available up to this point:
https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg02722.html

In terms of design, the Physical Function controller and the Virtual
Function controllers are almost independent, with a few exceptions:
the PF handles flexible resource allocation for all its children (VFs
have read-only access to this data), and reset (the PF explicitly calls
it on its VFs). Since MMIO access is serialized, no extra precautions
are required to handle concurrent resets, and access to the secondary
controller state does not need to be atomic.

A controller with full SR-IOV support must be capable of handling the
Namespace Management command. Since a series adding this functionality
is already under review, this patch set does not duplicate that effort.
The NS management patches are, however, not required to test the SR-IOV
support.

We tested the patches on Ubuntu 20.04.3 LTS with kernel 5.4.0. We hit
various issues with nvme-cli (the list and virt-mgmt commands) in
releases between version 1.09 and master, so we chose a golden nvme-cli
commit for testing: a50a0c1.
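
For reference, a minimal smoke test of the SR-IOV path from within the
guest could look as follows (the PCI address 0000:01:00.0 and the VF
count are illustrative; the sysfs interface is described in
docs/pcie_sriov.txt added by this series):

    # enable 2 VFs on the emulated NVMe physical function
    echo 2 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

    # the VF controllers should now be visible
    lspci | grep Non-Volatile
    nvme list

    # disable the VFs again
    echo 0 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs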

The implementation is not 100% finished and certainly not bug free,
since we are already aware of some issues, e.g. the AER-related
interaction with namespaces, or unexpected (?) kernel behavior in more
complex reset scenarios. However, our SR-IOV implementation is already
able to support the typical SR-IOV use cases, so we believe the patches
are ready to share with the community.

We hope you can find some time to review our work and share your
thoughts.

Kind regards,
Lukasz

Knut Omang (3):
  pcie: Set default and supported MaxReadReq to 512
  pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt

Lukasz Maniak (5):
  pcie: Add callback preceding SR-IOV VFs update
  hw/nvme: Add support for SR-IOV
  hw/nvme: Add support for Primary Controller Capabilities
  hw/nvme: Add support for Secondary Controller List
  docs: Add documentation for SR-IOV and Virtualization Enhancements

Łukasz Gieryk (7):
  pcie: Add 1.2 version token for the Power Management Capability
  hw/nvme: Implement the Function Level Reset
  hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  hw/nvme: Calculate BAR attributes in a function
  hw/nvme: Initialize capability structures for primary/secondary
    controllers
  pcie: Add helpers to the SR/IOV API
  hw/nvme: Add support for the Virtualization Management command

 docs/pcie_sriov.txt          | 115 +++++++
 docs/system/devices/nvme.rst |  27 ++
 hw/nvme/ctrl.c               | 589 ++++++++++++++++++++++++++++++++---
 hw/nvme/ns.c                 |   2 +-
 hw/nvme/nvme.h               |  47 ++-
 hw/nvme/subsys.c             |  74 ++++-
 hw/nvme/trace-events         |   6 +
 hw/pci/meson.build           |   1 +
 hw/pci/pci.c                 |  97 ++++--
 hw/pci/pcie.c                |  10 +-
 hw/pci/pcie_sriov.c          | 313 +++++++++++++++++++
 hw/pci/trace-events          |   5 +
 include/block/nvme.h         |  65 ++++
 include/hw/pci/pci.h         |  12 +-
 include/hw/pci/pci_ids.h     |   1 +
 include/hw/pci/pci_regs.h    |   1 +
 include/hw/pci/pcie.h        |   6 +
 include/hw/pci/pcie_sriov.h  |  81 +++++
 include/qemu/typedefs.h      |   2 +
 19 files changed, 1369 insertions(+), 85 deletions(-)
 create mode 100644 docs/pcie_sriov.txt
 create mode 100644 hw/pci/pcie_sriov.c
 create mode 100644 include/hw/pci/pcie_sriov.h

-- 
2.25.1




* [PATCH 01/15] pcie: Set default and supported MaxReadReq to 512
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
@ 2021-10-07 16:23 ` Lukasz Maniak
  2021-10-07 22:12   ` Michael S. Tsirkin
  2021-10-07 16:23 ` [PATCH 02/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Lukasz Maniak
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:23 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk, Knut Omang,
	Lukasz Maniak, Knut Omang

From: Knut Omang <knut.omang@oracle.com>

Make the default PCI Express Capability for PCIe devices set
MaxReadReq to 512. Typical modern devices that people would want to
emulate or simulate expect this. The previous value caused warnings
from the root port driver on some kernels.

Signed-off-by: Knut Omang <knuto@ifi.uio.no>
---
 hw/pci/pcie.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 6e95d82903..c1a12f3744 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -62,8 +62,9 @@ pcie_cap_v1_fill(PCIDevice *dev, uint8_t port, uint8_t type, uint8_t version)
      * Functions conforming to the ECN, PCI Express Base
      * Specification, Revision 1.1., or subsequent PCI Express Base
      * Specification revisions.
+     *  + set max payload size to 256, which seems to be a common value
      */
-    pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER);
+    pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER | (0x1 & PCI_EXP_DEVCAP_PAYLOAD));
 
     pci_set_long(exp_cap + PCI_EXP_LNKCAP,
                  (port << PCI_EXP_LNKCAP_PN_SHIFT) |
@@ -179,6 +180,8 @@ int pcie_cap_init(PCIDevice *dev, uint8_t offset,
     pci_set_long(exp_cap + PCI_EXP_DEVCAP2,
                  PCI_EXP_DEVCAP2_EFF | PCI_EXP_DEVCAP2_EETLPP);
 
+    pci_set_word(exp_cap + PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_READRQ_256B);
+
     pci_set_word(dev->wmask + pos + PCI_EXP_DEVCTL2, PCI_EXP_DEVCTL2_EETLPPB);
 
     if (dev->cap_present & QEMU_PCIE_EXTCAP_INIT) {
-- 
2.25.1




* [PATCH 02/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
  2021-10-07 16:23 ` [PATCH 01/15] pcie: Set default and supported MaxReadReq to 512 Lukasz Maniak
@ 2021-10-07 16:23 ` Lukasz Maniak
  2021-10-07 16:23 ` [PATCH 03/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt Lukasz Maniak
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:23 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk, Knut Omang,
	Lukasz Maniak, Knut Omang

From: Knut Omang <knut.omang@oracle.com>

This patch provides the building blocks for creating an SR/IOV
PCIe Extended Capability header and for registering and unregistering
SR/IOV Virtual Functions.
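
A rough sketch of how a device model is expected to use these building
blocks (the capability offset 0x200, the "my-dev-vf" type name and the
other parameters are placeholders; see the documentation added in the
next patch for the full flow):

    /* in the PF realize function */
    pcie_sriov_pf_init(pci_dev, 0x200, "my-dev-vf", vf_dev_id,
                       initial_vfs, total_vfs, vf_offset, vf_stride);
    pcie_sriov_pf_init_vf_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
                              PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size);

    /* in the VF realize function, instead of pci_register_bar() */
    pcie_sriov_vf_register_bar(pci_dev, 0, &vf_mmio_region);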

Signed-off-by: Knut Omang <knuto@ifi.uio.no>
---
 hw/pci/meson.build          |   1 +
 hw/pci/pci.c                |  97 +++++++++---
 hw/pci/pcie.c               |   5 +
 hw/pci/pcie_sriov.c         | 287 ++++++++++++++++++++++++++++++++++++
 hw/pci/trace-events         |   5 +
 include/hw/pci/pci.h        |  12 +-
 include/hw/pci/pcie.h       |   6 +
 include/hw/pci/pcie_sriov.h |  67 +++++++++
 include/qemu/typedefs.h     |   2 +
 9 files changed, 456 insertions(+), 26 deletions(-)
 create mode 100644 hw/pci/pcie_sriov.c
 create mode 100644 include/hw/pci/pcie_sriov.h

diff --git a/hw/pci/meson.build b/hw/pci/meson.build
index 5c4bbac817..bcc9c75919 100644
--- a/hw/pci/meson.build
+++ b/hw/pci/meson.build
@@ -5,6 +5,7 @@ pci_ss.add(files(
   'pci.c',
   'pci_bridge.c',
   'pci_host.c',
+  'pcie_sriov.c',
   'shpc.c',
   'slotid_cap.c'
 ))
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 186758ee11..1ad647f78e 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -239,6 +239,9 @@ int pci_bar(PCIDevice *d, int reg)
 {
     uint8_t type;
 
+    /* PCIe virtual functions do not have their own BARs */
+    assert(!pci_is_vf(d));
+
     if (reg != PCI_ROM_SLOT)
         return PCI_BASE_ADDRESS_0 + reg * 4;
 
@@ -304,10 +307,30 @@ void pci_device_deassert_intx(PCIDevice *dev)
     }
 }
 
-static void pci_do_device_reset(PCIDevice *dev)
+static void pci_reset_regions(PCIDevice *dev)
 {
     int r;
+    if (pci_is_vf(dev)) {
+        return;
+    }
+
+    for (r = 0; r < PCI_NUM_REGIONS; ++r) {
+        PCIIORegion *region = &dev->io_regions[r];
+        if (!region->size) {
+            continue;
+        }
+
+        if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
+            region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+            pci_set_quad(dev->config + pci_bar(dev, r), region->type);
+        } else {
+            pci_set_long(dev->config + pci_bar(dev, r), region->type);
+        }
+    }
+}
 
+static void pci_do_device_reset(PCIDevice *dev)
+{
     pci_device_deassert_intx(dev);
     assert(dev->irq_state == 0);
 
@@ -323,19 +346,7 @@ static void pci_do_device_reset(PCIDevice *dev)
                               pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
                               pci_get_word(dev->w1cmask + PCI_INTERRUPT_LINE));
     dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
-    for (r = 0; r < PCI_NUM_REGIONS; ++r) {
-        PCIIORegion *region = &dev->io_regions[r];
-        if (!region->size) {
-            continue;
-        }
-
-        if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
-            region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
-            pci_set_quad(dev->config + pci_bar(dev, r), region->type);
-        } else {
-            pci_set_long(dev->config + pci_bar(dev, r), region->type);
-        }
-    }
+    pci_reset_regions(dev);
     pci_update_mappings(dev);
 
     msi_reset(dev);
@@ -884,6 +895,15 @@ static void pci_init_multifunction(PCIBus *bus, PCIDevice *dev, Error **errp)
         dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
     }
 
+    /* With SR/IOV and ARI, a device at function 0 need not be a multifunction
+     * device, as it may just be a VF that ended up with function 0 in
+     * the legacy PCI interpretation. Avoid failing in such cases:
+     */
+    if (pci_is_vf(dev) &&
+        dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+        return;
+    }
+
     /*
      * multifunction bit is interpreted in two ways as follows.
      *   - all functions must set the bit to 1.
@@ -1083,6 +1103,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
                    bus->devices[devfn]->name);
         return NULL;
     } else if (dev->hotplugged &&
+               !pci_is_vf(pci_dev) &&
                pci_get_function_0(pci_dev)) {
         error_setg(errp, "PCI: slot %d function 0 already occupied by %s,"
                    " new func %s cannot be exposed to guest.",
@@ -1191,6 +1212,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
     pcibus_t size = memory_region_size(memory);
     uint8_t hdr_type;
 
+    assert(!pci_is_vf(pci_dev)); /* VFs must use pcie_sriov_vf_register_bar */
     assert(region_num >= 0);
     assert(region_num < PCI_NUM_REGIONS);
     assert(is_power_of_2(size));
@@ -1294,11 +1316,43 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num)
     return pci_dev->io_regions[region_num].addr;
 }
 
-static pcibus_t pci_bar_address(PCIDevice *d,
-                                int reg, uint8_t type, pcibus_t size)
+static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
+                                        uint8_t type, pcibus_t size)
+{
+    pcibus_t new_addr;
+    if (!pci_is_vf(d)) {
+        int bar = pci_bar(d, reg);
+        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+            new_addr = pci_get_quad(d->config + bar);
+        } else {
+            new_addr = pci_get_long(d->config + bar);
+        }
+    } else {
+        PCIDevice *pf = d->exp.sriov_vf.pf;
+        uint16_t sriov_cap = pf->exp.sriov_cap;
+        int bar = sriov_cap + PCI_SRIOV_BAR + reg * 4;
+        uint16_t vf_offset = pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
+        uint16_t vf_stride = pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
+        uint32_t vf_num = (d->devfn - (pf->devfn + vf_offset)) / vf_stride;
+
+        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+            new_addr = pci_get_quad(pf->config + bar);
+        } else {
+            new_addr = pci_get_long(pf->config + bar);
+        }
+        new_addr += vf_num * size;
+    }
+    if (reg != PCI_ROM_SLOT) {
+        /* Preserve the rom enable bit */
+        new_addr &= ~(size - 1);
+    }
+    return new_addr;
+}
+
+pcibus_t pci_bar_address(PCIDevice *d,
+                         int reg, uint8_t type, pcibus_t size)
 {
     pcibus_t new_addr, last_addr;
-    int bar = pci_bar(d, reg);
     uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
     Object *machine = qdev_get_machine();
     ObjectClass *oc = object_get_class(machine);
@@ -1309,7 +1363,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
         if (!(cmd & PCI_COMMAND_IO)) {
             return PCI_BAR_UNMAPPED;
         }
-        new_addr = pci_get_long(d->config + bar) & ~(size - 1);
+        new_addr = pci_config_get_bar_addr(d, reg, type, size);
         last_addr = new_addr + size - 1;
         /* Check if 32 bit BAR wraps around explicitly.
          * TODO: make priorities correct and remove this work around.
@@ -1324,11 +1378,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
     if (!(cmd & PCI_COMMAND_MEMORY)) {
         return PCI_BAR_UNMAPPED;
     }
-    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
-        new_addr = pci_get_quad(d->config + bar);
-    } else {
-        new_addr = pci_get_long(d->config + bar);
-    }
+    new_addr = pci_config_get_bar_addr(d, reg, type, size);
     /* the ROM slot has a specific enable bit */
     if (reg == PCI_ROM_SLOT && !(new_addr & PCI_ROM_ADDRESS_ENABLE)) {
         return PCI_BAR_UNMAPPED;
@@ -1470,6 +1520,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
 
     msi_write_config(d, addr, val_in, l);
     msix_write_config(d, addr, val_in, l);
+    pcie_sriov_config_write(d, addr, val_in, l);
 }
 
 /***********************************************************/
diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index c1a12f3744..8c6982d03c 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -426,6 +426,11 @@ void pcie_cap_slot_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
     PCIDevice *pci_dev = PCI_DEVICE(dev);
     uint32_t lnkcap = pci_get_long(exp_cap + PCI_EXP_LNKCAP);
 
+    if(pci_is_vf(pci_dev)) {
+        /* We don't want to change any state in hotplug_dev for SR/IOV virtual functions */
+        return;
+    }
+
     /* Don't send event when device is enabled during qemu machine creation:
      * it is present on boot, no hotplug event is necessary. We do send an
      * event when the device is disabled later. */
diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
new file mode 100644
index 0000000000..501a1ff433
--- /dev/null
+++ b/hw/pci/pcie_sriov.c
@@ -0,0 +1,287 @@
+/*
+ * pcie_sriov.c:
+ *
+ * Implementation of SR/IOV emulation support.
+ *
+ * Copyright (c) 2015-2017 Knut Omang <knut.omang@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "hw/pci/pci.h"
+#include "hw/pci/pcie.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/qdev-properties.h"
+#include "qemu/error-report.h"
+#include "qemu/range.h"
+#include "qapi/error.h"
+#include "trace.h"
+
+#define SRIOV_ID(dev) \
+    (dev)->name, PCI_SLOT((dev)->devfn), PCI_FUNC((dev)->devfn)
+
+static PCIDevice *register_vf(PCIDevice *pf, int devfn,
+                              const char *name, uint16_t vf_num);
+static void unregister_vfs(PCIDevice *dev);
+
+void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
+                        const char *vfname, uint16_t vf_dev_id,
+                        uint16_t init_vfs, uint16_t total_vfs,
+                        uint16_t vf_offset, uint16_t vf_stride)
+{
+    uint8_t *cfg = dev->config + offset;
+    uint8_t *wmask;
+
+    pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
+                        offset, PCI_EXT_CAP_SRIOV_SIZEOF);
+    dev->exp.sriov_cap = offset;
+    dev->exp.sriov_pf.num_vfs = 0;
+    dev->exp.sriov_pf.vfname = g_strdup(vfname);
+    dev->exp.sriov_pf.vf = NULL;
+
+    pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
+    pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
+
+    /* Mandatory page sizes to support.
+     * Device implementations can call pcie_sriov_pf_add_sup_pgsize()
+     * to set more bits:
+     */
+    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, SRIOV_SUP_PGSIZE_MINREQ);
+
+    /* Default is to use 4K pages, software can modify it
+     * to any of the supported bits
+     */
+    pci_set_word(cfg + PCI_SRIOV_SYS_PGSIZE, 0x1);
+
+    /* Set up device ID and initial/total number of VFs available */
+    pci_set_word(cfg + PCI_SRIOV_VF_DID, vf_dev_id);
+    pci_set_word(cfg + PCI_SRIOV_INITIAL_VF, init_vfs);
+    pci_set_word(cfg + PCI_SRIOV_TOTAL_VF, total_vfs);
+    pci_set_word(cfg + PCI_SRIOV_NUM_VF, 0);
+
+    /* Write enable control bits */
+    wmask = dev->wmask + offset;
+    pci_set_word(wmask + PCI_SRIOV_CTRL,
+                 PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI);
+    pci_set_word(wmask + PCI_SRIOV_NUM_VF, 0xffff);
+    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, 0x553);
+
+    qdev_prop_set_bit(&dev->qdev, "multifunction", true);
+}
+
+void pcie_sriov_pf_exit(PCIDevice *dev)
+{
+    unregister_vfs(dev);
+    g_free((char *)dev->exp.sriov_pf.vfname);
+    dev->exp.sriov_pf.vfname = NULL;
+}
+
+void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
+                               uint8_t type, dma_addr_t size)
+{
+    uint32_t addr;
+    uint64_t wmask;
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+
+    assert(sriov_cap > 0);
+    assert(region_num >= 0);
+    assert(region_num < PCI_NUM_REGIONS);
+    assert(region_num != PCI_ROM_SLOT);
+
+    wmask = ~(size - 1);
+    addr = sriov_cap + PCI_SRIOV_BAR + region_num * 4;
+
+    pci_set_long(dev->config + addr, type);
+    if (!(type & PCI_BASE_ADDRESS_SPACE_IO) &&
+        type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+        pci_set_quad(dev->wmask + addr, wmask);
+        pci_set_quad(dev->cmask + addr, ~0ULL);
+    } else {
+        pci_set_long(dev->wmask + addr, wmask & 0xffffffff);
+        pci_set_long(dev->cmask + addr, 0xffffffff);
+    }
+    dev->exp.sriov_pf.vf_bar_type[region_num] = type;
+}
+
+void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
+                                MemoryRegion *memory)
+{
+    PCIIORegion *r;
+    PCIBus *bus = pci_get_bus(dev);
+    uint8_t type;
+    pcibus_t size = memory_region_size(memory);
+
+    assert(pci_is_vf(dev)); /* PFs must use pci_register_bar */
+    assert(region_num >= 0);
+    assert(region_num < PCI_NUM_REGIONS);
+    type = dev->exp.sriov_vf.pf->exp.sriov_pf.vf_bar_type[region_num];
+
+    if (!is_power_of_2(size)) {
+        error_report("%s: PCI region size must be a power"
+                     " of two - type=0x%x, size=0x%"FMT_PCIBUS,
+                     __func__, type, size);
+        exit(1);
+    }
+
+    r = &dev->io_regions[region_num];
+    r->memory = memory;
+    r->address_space =
+        type & PCI_BASE_ADDRESS_SPACE_IO
+        ? bus->address_space_io
+        : bus->address_space_mem;
+    r->size = size;
+    r->type = type;
+
+    r->addr = pci_bar_address(dev, region_num, r->type, r->size);
+    if (r->addr != PCI_BAR_UNMAPPED) {
+        memory_region_add_subregion_overlap(r->address_space,
+                                            r->addr, r->memory, 1);
+    }
+}
+
+static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char *name,
+                              uint16_t vf_num)
+{
+    PCIDevice *dev = pci_new(devfn, name);
+    dev->exp.sriov_vf.pf = pf;
+    dev->exp.sriov_vf.vf_number = vf_num;
+    PCIBus* bus = pci_get_bus(pf);
+    Error *local_err = NULL;
+
+    qdev_realize(&dev->qdev, &bus->qbus, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        return NULL;
+    }
+
+    /* set vid/did according to sr/iov spec - they are not used */
+    pci_config_set_vendor_id(dev->config, 0xffff);
+    pci_config_set_device_id(dev->config, 0xffff);
+
+    return dev;
+}
+
+static void register_vfs(PCIDevice *dev)
+{
+    uint16_t num_vfs;
+    uint16_t i;
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+    uint16_t vf_offset = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
+    uint16_t vf_stride = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
+    int32_t devfn = dev->devfn + vf_offset;
+
+    assert(sriov_cap > 0);
+    num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
+
+    dev->exp.sriov_pf.vf = g_malloc(sizeof(PCIDevice *) * num_vfs);
+    assert(dev->exp.sriov_pf.vf);
+
+    trace_sriov_register_vfs(SRIOV_ID(dev), num_vfs);
+    for (i = 0; i < num_vfs; i++) {
+        dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev->exp.sriov_pf.vfname, i);
+        if (!dev->exp.sriov_pf.vf[i]) {
+            num_vfs = i;
+            break;
+        }
+        devfn += vf_stride;
+    }
+    dev->exp.sriov_pf.num_vfs = num_vfs;
+}
+
+static void unregister_vfs(PCIDevice *dev)
+{
+    Error *local_err = NULL;
+    uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;
+    uint16_t i;
+
+    trace_sriov_unregister_vfs(SRIOV_ID(dev), num_vfs);
+    for (i = 0; i < num_vfs; i++) {
+        PCIDevice *vf = dev->exp.sriov_pf.vf[i];
+        object_property_set_bool(OBJECT(vf), "realized", false, &local_err);
+        if (local_err) {
+            fprintf(stderr, "Failed to unplug: %s\n",
+                    error_get_pretty(local_err));
+            error_free(local_err);
+        }
+        object_unparent(OBJECT(vf));
+    }
+    g_free(dev->exp.sriov_pf.vf);
+    dev->exp.sriov_pf.vf = NULL;
+    dev->exp.sriov_pf.num_vfs = 0;
+    pci_set_word(dev->config + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0);
+}
+
+void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
+{
+    uint32_t off;
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+
+    if (!sriov_cap || address < sriov_cap) {
+        return;
+    }
+    off = address - sriov_cap;
+    if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
+        return;
+    }
+
+    trace_sriov_config_write(SRIOV_ID(dev), off, val, len);
+
+    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
+        if (dev->exp.sriov_pf.num_vfs) {
+            if (!(val & PCI_SRIOV_CTRL_VFE)) {
+                unregister_vfs(dev);
+            }
+        } else {
+            if (val & PCI_SRIOV_CTRL_VFE) {
+                register_vfs(dev);
+            }
+        }
+    }
+}
+
+
+/* Reset SR/IOV VF Enable bit to trigger an unregister of all VFs */
+void pcie_sriov_pf_disable_vfs(PCIDevice *dev)
+{
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+    if (sriov_cap) {
+        uint32_t val = pci_get_byte(dev->config + sriov_cap + PCI_SRIOV_CTRL);
+        if (val & PCI_SRIOV_CTRL_VFE) {
+            val &= ~PCI_SRIOV_CTRL_VFE;
+            pcie_sriov_config_write(dev, sriov_cap + PCI_SRIOV_CTRL, val, 1);
+        }
+    }
+}
+
+/* Add optional supported page sizes to the mask of supported page sizes */
+void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize)
+{
+    uint8_t *cfg = dev->config + dev->exp.sriov_cap;
+    uint8_t *wmask = dev->wmask + dev->exp.sriov_cap;
+
+    uint16_t sup_pgsize = pci_get_word(cfg + PCI_SRIOV_SUP_PGSIZE);
+
+    sup_pgsize |= opt_sup_pgsize;
+
+    /* Make sure the new bits are set, and that system page size
+     * also can be set to any of the new values according to spec:
+     */
+    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, sup_pgsize);
+    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, sup_pgsize);
+}
+
+
+uint16_t pcie_sriov_vf_number(PCIDevice *dev)
+{
+    assert(pci_is_vf(dev));
+    return dev->exp.sriov_vf.vf_number;
+}
+
+
+PCIDevice *pcie_sriov_get_pf(PCIDevice *dev)
+{
+    return dev->exp.sriov_vf.pf;
+}
diff --git a/hw/pci/trace-events b/hw/pci/trace-events
index fc777d0b5e..bd92cf4a25 100644
--- a/hw/pci/trace-events
+++ b/hw/pci/trace-events
@@ -10,3 +10,8 @@ pci_cfg_write(const char *dev, unsigned devid, unsigned fnid, unsigned offs, uns
 
 # msix.c
 msix_write_config(char *name, bool enabled, bool masked) "dev %s enabled %d masked %d"
+
+# hw/pci/pcie_sriov.c
+sriov_register_vfs(const char *name, int slot, int function, int num_vfs) "%s %02x:%x: creating %d vf devs"
+sriov_unregister_vfs(const char *name, int slot, int function, int num_vfs) "%s %02x:%x: Unregistering %d vf devs"
+sriov_config_write(const char *name, int slot, int fun, uint32_t offset, uint32_t val, uint32_t len) "%s %02x:%x: sriov offset 0x%x val 0x%x len %d"
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 7fc90132cf..d1d242c93a 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -7,9 +7,6 @@
 /* PCI includes legacy ISA access.  */
 #include "hw/isa/isa.h"
 
-#include "hw/pci/pcie.h"
-#include "qom/object.h"
-
 extern bool pci_available;
 
 /* PCI bus */
@@ -156,6 +153,7 @@ enum {
 #define QEMU_PCI_VGA_IO_HI_SIZE 0x20
 
 #include "hw/pci/pci_regs.h"
+#include "hw/pci/pcie.h"
 
 /* PCI HEADER_TYPE */
 #define  PCI_HEADER_TYPE_MULTI_FUNCTION 0x80
@@ -493,6 +491,9 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
 
+pcibus_t pci_bar_address(PCIDevice *d,
+                         int reg, uint8_t type, pcibus_t size);
+
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
 {
@@ -768,6 +769,11 @@ static inline int pci_is_express_downstream_port(const PCIDevice *d)
     return type == PCI_EXP_TYPE_DOWNSTREAM || type == PCI_EXP_TYPE_ROOT_PORT;
 }
 
+static inline int pci_is_vf(const PCIDevice *d)
+{
+    return d->exp.sriov_vf.pf != NULL;
+}
+
 static inline uint32_t pci_config_size(const PCIDevice *d)
 {
     return pci_is_express(d) ? PCIE_CONFIG_SPACE_SIZE : PCI_CONFIG_SPACE_SIZE;
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index 6063bee0ec..168950a83b 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -24,6 +24,7 @@
 #include "hw/pci/pci_regs.h"
 #include "hw/pci/pcie_regs.h"
 #include "hw/pci/pcie_aer.h"
+#include "hw/pci/pcie_sriov.h"
 #include "hw/hotplug.h"
 
 typedef enum {
@@ -81,6 +82,11 @@ struct PCIExpressDevice {
 
     /* ACS */
     uint16_t acs_cap;
+
+    /* SR/IOV */
+    uint16_t sriov_cap;
+    PCIESriovPF sriov_pf;
+    PCIESriovVF sriov_vf;
 };
 
 #define COMPAT_PROP_PCP "power_controller_present"
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
new file mode 100644
index 0000000000..0974f00054
--- /dev/null
+++ b/include/hw/pci/pcie_sriov.h
@@ -0,0 +1,67 @@
+/*
+ * pcie_sriov.h:
+ *
+ * Implementation of SR/IOV emulation support.
+ *
+ * Copyright (c) 2015 Knut Omang <knut.omang@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_PCIE_SRIOV_H
+#define QEMU_PCIE_SRIOV_H
+
+struct PCIESriovPF {
+    uint16_t num_vfs;           /* Number of virtual functions created */
+    uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each VF bar */
+    const char *vfname;         /* Reference to the device type used for the VFs */
+    PCIDevice **vf;             /* Pointer to an array of num_vfs VF devices */
+};
+
+struct PCIESriovVF {
+    PCIDevice *pf;              /* Pointer back to owner physical function */
+    uint16_t vf_number;		/* Logical VF number of this function */
+};
+
+void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
+                        const char *vfname, uint16_t vf_dev_id,
+                        uint16_t init_vfs, uint16_t total_vfs,
+                        uint16_t vf_offset, uint16_t vf_stride);
+void pcie_sriov_pf_exit(PCIDevice *dev);
+
+/* Set up a VF bar in the SR/IOV bar area */
+void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
+                               uint8_t type, dma_addr_t size);
+
+/* Instantiate a bar for a VF */
+void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
+                                MemoryRegion *memory);
+
+/* Default (minimal) page size support values as required by the SR/IOV standard:
+ * 0x553 << 12 = 0x553000 = 4K + 8K + 64K + 256K + 1M + 4M
+ */
+#define SRIOV_SUP_PGSIZE_MINREQ 0x553
+
+/* Optionally add supported page sizes to the mask of supported page sizes
+ * Page size values are interpreted as opt_sup_pgsize << 12.
+ */
+void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize);
+
+/* SR/IOV capability config write handler */
+void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
+                             uint32_t val, int len);
+
+/* Reset SR/IOV VF Enable bit to unregister all VFs */
+void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
+
+/* Get logical VF number of a VF - only valid for VFs */
+uint16_t pcie_sriov_vf_number(PCIDevice *dev);
+
+/* Get the physical function that owns this VF.
+ * Returns NULL if dev is not a virtual function
+ */
+PCIDevice *pcie_sriov_get_pf(PCIDevice *dev);
+
+#endif /* QEMU_PCIE_SRIOV_H */
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index ee60eb3de4..5b302cb214 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -86,6 +86,8 @@ typedef struct PCIDevice PCIDevice;
 typedef struct PCIEAERErr PCIEAERErr;
 typedef struct PCIEAERLog PCIEAERLog;
 typedef struct PCIEAERMsg PCIEAERMsg;
+typedef struct PCIESriovPF PCIESriovPF;
+typedef struct PCIESriovVF PCIESriovVF;
 typedef struct PCIEPort PCIEPort;
 typedef struct PCIESlot PCIESlot;
 typedef struct PCIExpressDevice PCIExpressDevice;
-- 
2.25.1




* [PATCH 03/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
  2021-10-07 16:23 ` [PATCH 01/15] pcie: Set default and supported MaxReadReq to 512 Lukasz Maniak
  2021-10-07 16:23 ` [PATCH 02/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Lukasz Maniak
@ 2021-10-07 16:23 ` Lukasz Maniak
  2021-10-07 16:23 ` [PATCH 04/15] pcie: Add callback preceding SR-IOV VFs update Lukasz Maniak
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:23 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk, Knut Omang,
	Lukasz Maniak, Knut Omang

From: Knut Omang <knut.omang@oracle.com>

Add a small intro + minimal documentation for how to
implement SR/IOV support for an emulated device.

Signed-off-by: Knut Omang <knuto@ifi.uio.no>
---
 docs/pcie_sriov.txt | 115 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)
 create mode 100644 docs/pcie_sriov.txt

diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
new file mode 100644
index 0000000000..f5e891e1d4
--- /dev/null
+++ b/docs/pcie_sriov.txt
@@ -0,0 +1,115 @@
+PCI SR/IOV EMULATION SUPPORT
+============================
+
+Description
+===========
+SR/IOV (Single Root I/O Virtualization) is an optional extended capability
+of a PCI Express device. It allows a single physical function (PF) to appear as multiple
+virtual functions (VFs) for the main purpose of eliminating software
+overhead in I/O from virtual machines.
+
+Qemu now implements the basic common functionality to enable an emulated device
+to support SR/IOV. Yet no fully implemented devices exists in Qemu, but a
+proof-of-concept hack of the Intel igb can be found here:
+
+git://github.com/knuto/qemu.git sriov_patches_v5
+
+Implementation
+==============
+Implementing emulation of an SR/IOV capable device typically consists of
+implementing support for two types of device classes; the "normal" physical device
+(PF) and the virtual device (VF). From Qemu's perspective, the VFs are just
+like other devices, except that some of their properties are derived from
+the PF.
+
+A virtual function is different from a physical function in that the BAR
+space for all VFs are defined by the BAR registers in the PFs SR/IOV
+capability. All VFs have the same BARs and BAR sizes.
+
+Accesses to these virtual BARs then is computed as
+
+   <VF BAR start> + <VF number> * <BAR sz> + <offset>
+
+From our emulation perspective this means that there is a separate call for
+setting up a BAR for a VF.
+
+1) To enable SR/IOV support in the PF, it must be a PCI Express device so
+   you would need to add a PCI Express capability in the normal PCI
+   capability list. You might also want to add an ARI (Alternative
+   Routing-ID Interpretation) capability to indicate that your device
+   supports functions beyond it's "own" function space (0-7),
+   which is necessary to support more than 7 functions, or
+   if functions extends beyond offset 7 because they are placed at an
+   offset > 1 or have stride > 1.
+
+   ...
+   #include "hw/pci/pcie.h"
+   #include "hw/pci/pcie_sriov.h"
+
+   pci_your_pf_dev_realize( ... )
+   {
+      ...
+      int ret = pcie_endpoint_cap_init(d, 0x70);
+      ...
+      pcie_ari_init(d, 0x100, 1);
+      ...
+
+      /* Add and initialize the SR/IOV capability */
+      pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
+                       vf_devid, initial_vfs, total_vfs,
+                       fun_offset, stride);
+
+      /* Set up individual VF BARs (parameters as for normal BARs) */
+      pcie_sriov_pf_init_vf_bar( ... )
+      ...
+   }
+
+   For cleanup, you simply call:
+
+      pcie_sriov_pf_exit(device);
+
+   which will delete all the virtual functions and associated resources.
+
+2) Similarly in the implementation of the virtual function, you need to
+   make it a PCI Express device and add a similar set of capabilities
+   except for the SR/IOV capability. Then you need to set up the VF BARs as
+   subregions of the PFs SR/IOV VF BARs by calling
+   pcie_sriov_vf_register_bar() instead of the normal pci_register_bar() call:
+
+   pci_your_vf_dev_realize( ... )
+   {
+      ...
+      int ret = pcie_endpoint_cap_init(d, 0x60);
+      ...
+      pcie_ari_init(d, 0x100, 1);
+      ...
+      memory_region_init(mr, ... )
+      pcie_sriov_vf_register_bar(d, bar_nr, mr);
+      ...
+   }
+
+Testing on Linux guest
+======================
+The easiest is if your device driver supports sysfs based SR/IOV
+enabling. Support for this was added in kernel v.3.8, so not all drivers
+support it yet.
+
+To enable 4 VFs for a device at 01:00.0:
+
+	modprobe yourdriver
+	echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
+
+You should now see 4 VFs with lspci.
+To turn SR/IOV off again - the standard requires you to turn it off before you can enable
+another VF count, and the emulation enforces this:
+
+	echo 0 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
+
+Older drivers typically provide a max_vfs module parameter
+to enable it at load time:
+
+	modprobe yourdriver max_vfs=4
+
+To disable the VFs again then, you simply have to unload the driver:
+
+	rmmod yourdriver
-- 
2.25.1




* [PATCH 04/15] pcie: Add callback preceding SR-IOV VFs update
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (2 preceding siblings ...)
  2021-10-07 16:23 ` [PATCH 03/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt Lukasz Maniak
@ 2021-10-07 16:23 ` Lukasz Maniak
  2021-10-12  7:25   ` Michael S. Tsirkin
  2021-10-07 16:23 ` [PATCH 05/15] hw/nvme: Add support for SR-IOV Lukasz Maniak
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:23 UTC (permalink / raw)
  To: qemu-devel
  Cc: Łukasz Gieryk, Lukasz Maniak, qemu-block, Michael S. Tsirkin

PCIe devices implementing SR-IOV may need to perform certain actions
before the VFs are realized, and likewise before they are unrealized.
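
A minimal sketch of how a device might use the new hook (the callback
name and its body are hypothetical; the signature follows the
SriovVfsUpdate typedef introduced here):

    static void my_dev_pre_vfs_update(PCIDevice *dev, uint16_t prev_num_vfs,
                                      uint16_t num_vfs)
    {
        /* quiesce or resize per-VF resources before the VF count
         * changes from prev_num_vfs to num_vfs */
    }

    ...

    pcie_sriov_pf_init(dev, 0x200, "my-dev-vf", vf_dev_id,
                       initial_vfs, total_vfs, vf_offset, vf_stride,
                       my_dev_pre_vfs_update);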

Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
---
 docs/pcie_sriov.txt         |  2 +-
 hw/pci/pcie_sriov.c         | 14 +++++++++++++-
 include/hw/pci/pcie_sriov.h |  8 +++++++-
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
index f5e891e1d4..63ca1a7b8e 100644
--- a/docs/pcie_sriov.txt
+++ b/docs/pcie_sriov.txt
@@ -57,7 +57,7 @@ setting up a BAR for a VF.
       /* Add and initialize the SR/IOV capability */
       pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
                        vf_devid, initial_vfs, total_vfs,
-                       fun_offset, stride);
+                       fun_offset, stride, pre_vfs_update_cb);
 
       /* Set up individual VF BARs (parameters as for normal BARs) */
       pcie_sriov_pf_init_vf_bar( ... )
diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index 501a1ff433..cac2aee061 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -30,7 +30,8 @@ static void unregister_vfs(PCIDevice *dev);
 void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
                         const char *vfname, uint16_t vf_dev_id,
                         uint16_t init_vfs, uint16_t total_vfs,
-                        uint16_t vf_offset, uint16_t vf_stride)
+                        uint16_t vf_offset, uint16_t vf_stride,
+                        SriovVfsUpdate pre_vfs_update)
 {
     uint8_t *cfg = dev->config + offset;
     uint8_t *wmask;
@@ -41,6 +42,7 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
     dev->exp.sriov_pf.num_vfs = 0;
     dev->exp.sriov_pf.vfname = g_strdup(vfname);
     dev->exp.sriov_pf.vf = NULL;
+    dev->exp.sriov_pf.pre_vfs_update = pre_vfs_update;
 
     pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
     pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
@@ -180,6 +182,11 @@ static void register_vfs(PCIDevice *dev)
     assert(dev->exp.sriov_pf.vf);
 
     trace_sriov_register_vfs(SRIOV_ID(dev), num_vfs);
+
+    if (dev->exp.sriov_pf.pre_vfs_update) {
+        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, num_vfs);
+    }
+
     for (i = 0; i < num_vfs; i++) {
         dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev->exp.sriov_pf.vfname, i);
         if (!dev->exp.sriov_pf.vf[i]) {
@@ -198,6 +205,11 @@ static void unregister_vfs(PCIDevice *dev)
     uint16_t i;
 
     trace_sriov_unregister_vfs(SRIOV_ID(dev), num_vfs);
+
+    if (dev->exp.sriov_pf.pre_vfs_update) {
+        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, 0);
+    }
+
     for (i = 0; i < num_vfs; i++) {
         PCIDevice *vf = dev->exp.sriov_pf.vf[i];
         object_property_set_bool(OBJECT(vf), "realized", false, &local_err);
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index 0974f00054..9ab48b79c0 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -13,11 +13,16 @@
 #ifndef QEMU_PCIE_SRIOV_H
 #define QEMU_PCIE_SRIOV_H
 
+typedef void (*SriovVfsUpdate)(PCIDevice *dev, uint16_t prev_num_vfs,
+                               uint16_t num_vfs);
+
 struct PCIESriovPF {
     uint16_t num_vfs;           /* Number of virtual functions created */
     uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each VF bar */
     const char *vfname;         /* Reference to the device type used for the VFs */
     PCIDevice **vf;             /* Pointer to an array of num_vfs VF devices */
+
+    SriovVfsUpdate pre_vfs_update;  /* Callback preceding VFs count change */
 };
 
 struct PCIESriovVF {
@@ -28,7 +33,8 @@ struct PCIESriovVF {
 void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
                         const char *vfname, uint16_t vf_dev_id,
                         uint16_t init_vfs, uint16_t total_vfs,
-                        uint16_t vf_offset, uint16_t vf_stride);
+                        uint16_t vf_offset, uint16_t vf_stride,
+                        SriovVfsUpdate pre_vfs_update);
 void pcie_sriov_pf_exit(PCIDevice *dev);
 
 /* Set up a VF bar in the SR/IOV bar area */
-- 
2.25.1




* [PATCH 05/15] hw/nvme: Add support for SR-IOV
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (3 preceding siblings ...)
  2021-10-07 16:23 ` [PATCH 04/15] pcie: Add callback preceding SR-IOV VFs update Lukasz Maniak
@ 2021-10-07 16:23 ` Lukasz Maniak
  2021-10-20 19:07   ` Klaus Jensen
  2021-11-02 14:33   ` Klaus Jensen
  2021-10-07 16:23 ` [PATCH 06/15] hw/nvme: Add support for Primary Controller Capabilities Lukasz Maniak
                   ` (11 subsequent siblings)
  16 siblings, 2 replies; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:23 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk,
	Lukasz Maniak, Klaus Jensen, Keith Busch

This patch implements initial support for Single Root I/O
Virtualization on an NVMe device.

Essentially, it allows defining the maximum number of virtual functions
supported by the NVMe controller via the sriov_max_vfs parameter.

Passing a non-zero value to sriov_max_vfs triggers reporting of the
SR-IOV capability by the physical controller and of the ARI capability
by both the physical and virtual function devices.

NVMe controllers created via virtual functions mirror the functionality
of the physical controller, which may not always be the desired
behavior, so some consideration is needed on how to limit the
capabilities of the VF.

An NVMe subsystem is required for the use of SR-IOV.
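
For illustration, a hypothetical command line enabling four VFs (the
id and serial values are placeholders; the mandatory subsystem is set
up first):

    -device nvme-subsys,id=subsys0
    -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=4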

Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
---
 hw/nvme/ctrl.c           | 74 ++++++++++++++++++++++++++++++++++++++--
 hw/nvme/nvme.h           |  1 +
 include/hw/pci/pci_ids.h |  1 +
 3 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 6a571d18cf..ad79ff0c00 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -35,6 +35,7 @@
  *              mdts=<N[optional]>,vsl=<N[optional]>, \
  *              zoned.zasl=<N[optional]>, \
  *              zoned.auto_transition=<on|off[optional]>, \
+ *              sriov_max_vfs=<N[optional]> \
  *              subsys=<subsys_id>
  *      -device nvme-ns,drive=<drive_id>,bus=<bus_name>,nsid=<nsid>,\
  *              zoned=<true|false[optional]>, \
@@ -106,6 +107,12 @@
  *   transitioned to zone state closed for resource management purposes.
  *   Defaults to 'on'.
  *
+ * - `sriov_max_vfs`
+ *   Indicates the maximum number of PCIe virtual functions supported
+ *   by the controller. The default value is 0. Specifying a non-zero value
+ *   enables reporting of both SR-IOV and ARI capabilities by the NVMe device.
+ *   Virtual function controllers will not report SR-IOV capability.
+ *
  * nvme namespace device parameters
  * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  * - `shared`
@@ -160,6 +167,7 @@
 #include "sysemu/block-backend.h"
 #include "sysemu/hostmem.h"
 #include "hw/pci/msix.h"
+#include "hw/pci/pcie_sriov.h"
 #include "migration/vmstate.h"
 
 #include "nvme.h"
@@ -175,6 +183,9 @@
 #define NVME_TEMPERATURE_CRITICAL 0x175
 #define NVME_NUM_FW_SLOTS 1
 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
+#define NVME_MAX_VFS 127
+#define NVME_VF_OFFSET 0x1
+#define NVME_VF_STRIDE 1
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
     do { \
@@ -5583,6 +5594,10 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
         g_free(event);
     }
 
+    if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
+        pcie_sriov_pf_disable_vfs(&n->parent_obj);
+    }
+
     n->aer_queued = 0;
     n->outstanding_aers = 0;
     n->qs_created = false;
@@ -6264,6 +6279,19 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
         error_setg(errp, "vsl must be non-zero");
         return;
     }
+
+    if (params->sriov_max_vfs) {
+        if (!n->subsys) {
+            error_setg(errp, "subsystem is required for the use of SR-IOV");
+            return;
+        }
+
+        if (params->sriov_max_vfs > NVME_MAX_VFS) {
+            error_setg(errp, "sriov_max_vfs must be between 0 and %d",
+                       NVME_MAX_VFS);
+            return;
+        }
+    }
 }
 
 static void nvme_init_state(NvmeCtrl *n)
@@ -6321,6 +6349,20 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
     memory_region_set_enabled(&n->pmr.dev->mr, false);
 }
 
+static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
+                            uint64_t bar_size)
+{
+    uint16_t vf_dev_id = n->params.use_intel_id ?
+                         PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME;
+
+    pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
+                       n->params.sriov_max_vfs, n->params.sriov_max_vfs,
+                       NVME_VF_OFFSET, NVME_VF_STRIDE, NULL);
+
+    pcie_sriov_pf_init_vf_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
+                              PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size);
+}
+
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
     uint8_t *pci_conf = pci_dev->config;
@@ -6335,7 +6377,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 
     if (n->params.use_intel_id) {
         pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
-        pci_config_set_device_id(pci_conf, 0x5845);
+        pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_INTEL_NVME);
     } else {
         pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_REDHAT);
         pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_REDHAT_NVME);
@@ -6343,6 +6385,9 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 
     pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
     pcie_endpoint_cap_init(pci_dev, 0x80);
+    if (n->params.sriov_max_vfs) {
+        pcie_ari_init(pci_dev, 0x100, 1);
+    }
 
     bar_size = QEMU_ALIGN_UP(n->reg_size, 4 * KiB);
     msix_table_offset = bar_size;
@@ -6361,8 +6406,12 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
                           n->reg_size);
     memory_region_add_subregion(&n->bar0, 0, &n->iomem);
 
-    pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
-                     PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
+    if (pci_is_vf(pci_dev)) {
+        pcie_sriov_vf_register_bar(pci_dev, 0, &n->bar0);
+    } else {
+        pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
+                         PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
+    }
     ret = msix_init(pci_dev, n->params.msix_qsize,
                     &n->bar0, 0, msix_table_offset,
                     &n->bar0, 0, msix_pba_offset, 0, &err);
@@ -6383,6 +6432,10 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
         nvme_init_pmr(n, pci_dev);
     }
 
+    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
+        nvme_init_sriov(n, pci_dev, 0x120, bar_size);
+    }
+
     return 0;
 }
 
@@ -6532,6 +6585,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     NvmeCtrl *n = NVME(pci_dev);
     NvmeNamespace *ns;
     Error *local_err = NULL;
+    NvmeCtrl *pn = NVME(pcie_sriov_get_pf(pci_dev));
+
+    if (pci_is_vf(pci_dev)) {
+        /* VFs derive settings from the parent. PF's lifespan exceeds
+         * that of VF's, so it's safe to share params.serial.
+         */
+        memcpy(&n->params, &pn->params, sizeof(NvmeParams));
+        n->subsys = pn->subsys;
+    }
 
     nvme_check_constraints(n, &local_err);
     if (local_err) {
@@ -6596,6 +6658,11 @@ static void nvme_exit(PCIDevice *pci_dev)
     if (n->pmr.dev) {
         host_memory_backend_set_mapped(n->pmr.dev, false);
     }
+
+    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
+        pcie_sriov_pf_exit(pci_dev);
+    }
+
     msix_uninit(pci_dev, &n->bar0, &n->bar0);
     memory_region_del_subregion(&n->bar0, &n->iomem);
 }
@@ -6620,6 +6687,7 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT8("zoned.zasl", NvmeCtrl, params.zasl, 0),
     DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
                      params.auto_transition_zones, true),
+    DEFINE_PROP_UINT8("sriov_max_vfs", NvmeCtrl, params.sriov_max_vfs, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 83ffabade4..4331f5da1f 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -391,6 +391,7 @@ typedef struct NvmeParams {
     uint8_t  zasl;
     bool     auto_transition_zones;
     bool     legacy_cmb;
+    uint8_t  sriov_max_vfs;
 } NvmeParams;
 
 typedef struct NvmeCtrl {
diff --git a/include/hw/pci/pci_ids.h b/include/hw/pci/pci_ids.h
index 11abe22d46..992426768e 100644
--- a/include/hw/pci/pci_ids.h
+++ b/include/hw/pci/pci_ids.h
@@ -237,6 +237,7 @@
 #define PCI_DEVICE_ID_INTEL_82801BA_11   0x244e
 #define PCI_DEVICE_ID_INTEL_82801D       0x24CD
 #define PCI_DEVICE_ID_INTEL_ESB_9        0x25ab
+#define PCI_DEVICE_ID_INTEL_NVME         0x5845
 #define PCI_DEVICE_ID_INTEL_82371SB_0    0x7000
 #define PCI_DEVICE_ID_INTEL_82371SB_1    0x7010
 #define PCI_DEVICE_ID_INTEL_82371SB_2    0x7020
-- 
2.25.1




* [PATCH 06/15] hw/nvme: Add support for Primary Controller Capabilities
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (4 preceding siblings ...)
  2021-10-07 16:23 ` [PATCH 05/15] hw/nvme: Add support for SR-IOV Lukasz Maniak
@ 2021-10-07 16:23 ` Lukasz Maniak
  2021-11-02 14:34   ` Klaus Jensen
  2021-10-07 16:23 ` [PATCH 07/15] hw/nvme: Add support for Secondary Controller List Lukasz Maniak
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:23 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Łukasz Gieryk,
	Lukasz Maniak, Klaus Jensen, Hanna Reitz, Stefan Hajnoczi,
	Keith Busch, Philippe Mathieu-Daudé

Implementation of the Primary Controller Capabilities data
structure (Identify command with CNS value of 14h).

Currently, the command returns only the ID of the primary controller.
Handling of the remaining fields is added in subsequent patches
implementing the virtualization enhancements.
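
Assuming a sufficiently recent nvme-cli in the guest (subcommand
availability varies between versions), the new structure can be
inspected with:

    nvme primary-ctrl-caps /dev/nvme0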

Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
---
 hw/nvme/ctrl.c       | 22 +++++++++++++++++-----
 hw/nvme/nvme.h       |  2 ++
 hw/nvme/trace-events |  1 +
 include/block/nvme.h | 23 +++++++++++++++++++++++
 4 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index ad79ff0c00..d2fde3dd07 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4538,6 +4538,13 @@ static uint16_t nvme_identify_ctrl_list(NvmeCtrl *n, NvmeRequest *req,
     return nvme_c2h(n, (uint8_t *)list, sizeof(list), req);
 }
 
+static uint16_t nvme_identify_pri_ctrl_cap(NvmeCtrl *n, NvmeRequest *req)
+{
+    trace_pci_nvme_identify_pri_ctrl_cap(le16_to_cpu(n->pri_ctrl_cap.cntlid));
+
+    return nvme_c2h(n, (uint8_t *)&n->pri_ctrl_cap, sizeof(NvmePriCtrlCap), req);
+}
+
 static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
                                      bool active)
 {
@@ -4756,6 +4763,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
         return nvme_identify_ctrl_list(n, req, true);
     case NVME_ID_CNS_CTRL_LIST:
         return nvme_identify_ctrl_list(n, req, false);
+    case NVME_ID_CNS_PRIMARY_CTRL_CAP:
+        return nvme_identify_pri_ctrl_cap(n, req);
     case NVME_ID_CNS_CS_NS:
         return nvme_identify_ns_csi(n, req, true);
     case NVME_ID_CNS_CS_NS_PRESENT:
@@ -6296,6 +6305,8 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 
 static void nvme_init_state(NvmeCtrl *n)
 {
+    NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
+
     /* add one to max_ioqpairs to account for the admin queue pair */
     n->reg_size = pow2ceil(sizeof(NvmeBar) +
                            2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
@@ -6305,6 +6316,8 @@ static void nvme_init_state(NvmeCtrl *n)
     n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
     n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
+
+    cap->cntlid = cpu_to_le16(n->cntlid);
 }
 
 static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
@@ -6604,15 +6617,14 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS,
               &pci_dev->qdev, n->parent_obj.qdev.id);
 
-    nvme_init_state(n);
-    if (nvme_init_pci(n, pci_dev, errp)) {
-        return;
-    }
-
     if (nvme_init_subsys(n, errp)) {
         error_propagate(errp, local_err);
         return;
     }
+    nvme_init_state(n);
+    if (nvme_init_pci(n, pci_dev, errp)) {
+        return;
+    }
     nvme_init_ctrl(n, pci_dev);
 
     /* setup a namespace if the controller drive property was given */
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 4331f5da1f..479817f66e 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -461,6 +461,8 @@ typedef struct NvmeCtrl {
         };
         uint32_t    async_config;
     } features;
+
+    NvmePriCtrlCap  pri_ctrl_cap;
 } NvmeCtrl;
 
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index ff6cafd520..1014ebceb6 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -52,6 +52,7 @@ pci_nvme_identify_ctrl(void) "identify controller"
 pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
 pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_ctrl_list(uint8_t cns, uint16_t cntid) "cns 0x%"PRIx8" cntid %"PRIu16""
+pci_nvme_identify_pri_ctrl_cap(uint16_t cntlid) "identify primary controller capabilities cntlid=%"PRIu16""
 pci_nvme_identify_ns_csi(uint32_t ns, uint8_t csi) "nsid=%"PRIu32", csi=0x%"PRIx8""
 pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "nsid=%"PRIu16", csi=0x%"PRIx8""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index e3bd47bf76..f69bd1d14f 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1017,6 +1017,7 @@ enum NvmeIdCns {
     NVME_ID_CNS_NS_PRESENT            = 0x11,
     NVME_ID_CNS_NS_ATTACHED_CTRL_LIST = 0x12,
     NVME_ID_CNS_CTRL_LIST             = 0x13,
+    NVME_ID_CNS_PRIMARY_CTRL_CAP      = 0x14,
     NVME_ID_CNS_CS_NS_PRESENT_LIST    = 0x1a,
     NVME_ID_CNS_CS_NS_PRESENT         = 0x1b,
     NVME_ID_CNS_IO_COMMAND_SET        = 0x1c,
@@ -1465,6 +1466,27 @@ typedef enum NvmeZoneState {
     NVME_ZONE_STATE_OFFLINE          = 0x0f,
 } NvmeZoneState;
 
+typedef struct QEMU_PACKED NvmePriCtrlCap {
+    uint16_t    cntlid;
+    uint16_t    portid;
+    uint8_t     crt;
+    uint8_t     rsvd5[27];
+    uint32_t    vqfrt;
+    uint32_t    vqrfa;
+    uint16_t    vqrfap;
+    uint16_t    vqprt;
+    uint16_t    vqfrsm;
+    uint16_t    vqgran;
+    uint8_t     rsvd48[16];
+    uint32_t    vifrt;
+    uint32_t    virfa;
+    uint16_t    virfap;
+    uint16_t    viprt;
+    uint16_t    vifrsm;
+    uint16_t    vigran;
+    uint8_t     rsvd80[4016];
+} NvmePriCtrlCap;
+
 static inline void _nvme_check_size(void)
 {
     QEMU_BUILD_BUG_ON(sizeof(NvmeBar) != 4096);
@@ -1497,5 +1519,6 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
     QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescr) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeDifTuple) != 8);
+    QEMU_BUILD_BUG_ON(sizeof(NvmePriCtrlCap) != 4096);
 }
 #endif
-- 
2.25.1




* [PATCH 07/15] hw/nvme: Add support for Secondary Controller List
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (5 preceding siblings ...)
  2021-10-07 16:23 ` [PATCH 06/15] hw/nvme: Add support for Primary Controller Capabilities Lukasz Maniak
@ 2021-10-07 16:23 ` Lukasz Maniak
  2021-11-02 14:35   ` Klaus Jensen
  2021-10-07 16:23 ` [PATCH 08/15] pcie: Add 1.2 version token for the Power Management Capability Lukasz Maniak
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:23 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Łukasz Gieryk,
	Lukasz Maniak, Klaus Jensen, Hanna Reitz, Stefan Hajnoczi,
	Keith Busch, Philippe Mathieu-Daudé

Introduce handling for Secondary Controller List (Identify command with
CNS value of 15h).

Secondary controller IDs are unique within the subsystem, so the
subsystem reserves them, up to the number of sriov_max_vfs, when the
primary controller is initialized.

ID reservation requires an intermediate controller slot state: a
reserved slot holds the sentinel address 0xFFFF.
A secondary controller is in the reserved state when it has no virtual
function assigned but its primary controller is realized.
Secondary controller reservations are released back to NULL when the primary
controller is unregistered.
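
For clarity, a minimal sketch (not part of the patch) of the slot
lifecycle this introduces; code walking subsys->ctrls[] has to treat the
reserved sentinel like an empty slot, and the helper name below is
hypothetical:

    /* NULL -> SUBSYS_SLOT_RSVD (reserved for a future VF) -> live NvmeCtrl *,
     * and back to SUBSYS_SLOT_RSVD / NULL when the controllers unregister. */
    static inline bool nvme_subsys_ctrl_is_live(NvmeCtrl *ctrl)
    {
        return ctrl && ctrl != SUBSYS_SLOT_RSVD;
    }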

Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
---
 hw/nvme/ctrl.c       | 42 ++++++++++++++++++++++++-
 hw/nvme/ns.c         |  2 +-
 hw/nvme/nvme.h       | 16 +++++++++-
 hw/nvme/subsys.c     | 74 ++++++++++++++++++++++++++++++++++++++------
 hw/nvme/trace-events |  1 +
 include/block/nvme.h | 20 ++++++++++++
 6 files changed, 143 insertions(+), 12 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index d2fde3dd07..9687a7322c 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4545,6 +4545,14 @@ static uint16_t nvme_identify_pri_ctrl_cap(NvmeCtrl *n, NvmeRequest *req)
     return nvme_c2h(n, (uint8_t *)&n->pri_ctrl_cap, sizeof(NvmePriCtrlCap), req);
 }
 
+static uint16_t nvme_identify_sec_ctrl_list(NvmeCtrl *n, NvmeRequest *req)
+{
+    trace_pci_nvme_identify_sec_ctrl_list(le16_to_cpu(n->pri_ctrl_cap.cntlid),
+                                          n->sec_ctrl_list.numcntl);
+
+    return nvme_c2h(n, (uint8_t *)&n->sec_ctrl_list, sizeof(NvmeSecCtrlList), req);
+}
+
 static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
                                      bool active)
 {
@@ -4765,6 +4773,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
         return nvme_identify_ctrl_list(n, req, false);
     case NVME_ID_CNS_PRIMARY_CTRL_CAP:
         return nvme_identify_pri_ctrl_cap(n, req);
+    case NVME_ID_CNS_SECONDARY_CTRL_LIST:
+        return nvme_identify_sec_ctrl_list(n, req);
     case NVME_ID_CNS_CS_NS:
         return nvme_identify_ns_csi(n, req, true);
     case NVME_ID_CNS_CS_NS_PRESENT:
@@ -6306,6 +6316,9 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 static void nvme_init_state(NvmeCtrl *n)
 {
     NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
+    NvmeSecCtrlList *list = &n->sec_ctrl_list;
+    NvmeSecCtrlEntry *sctrl;
+    int i;
 
     /* add one to max_ioqpairs to account for the admin queue pair */
     n->reg_size = pow2ceil(sizeof(NvmeBar) +
@@ -6317,6 +6330,12 @@ static void nvme_init_state(NvmeCtrl *n)
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
     n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
 
+    list->numcntl = cpu_to_le16(n->params.sriov_max_vfs);
+    for (i = 0; i < n->params.sriov_max_vfs; i++) {
+        sctrl = &list->sec[i];
+        sctrl->pcid = cpu_to_le16(n->cntlid);
+    }
+
     cap->cntlid = cpu_to_le16(n->cntlid);
 }
 
@@ -6362,6 +6381,27 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
     memory_region_set_enabled(&n->pmr.dev->mr, false);
 }
 
+static void nvme_update_vfs(PCIDevice *pci_dev, uint16_t prev_num_vfs,
+                            uint16_t num_vfs)
+{
+    NvmeCtrl *n = NVME(pci_dev);
+    uint16_t num_active_vfs = MAX(prev_num_vfs, num_vfs);
+    bool vf_enable = (prev_num_vfs < num_vfs);
+    uint16_t i;
+
+    /*
+     * As per SR-IOV design,
+     * VF count can only go from 0 to a set value and vice versa.
+     */
+    for (i = 0; i < num_active_vfs; i++) {
+        if (vf_enable) {
+            n->sec_ctrl_list.sec[i].vfn = cpu_to_le16(i + 1);
+        } else {
+            n->sec_ctrl_list.sec[i].vfn = 0;
+        }
+    }
+}
+
 static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
                             uint64_t bar_size)
 {
@@ -6370,7 +6410,7 @@ static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
 
     pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
                        n->params.sriov_max_vfs, n->params.sriov_max_vfs,
-                       NVME_VF_OFFSET, NVME_VF_STRIDE, NULL);
+                       NVME_VF_OFFSET, NVME_VF_STRIDE, nvme_update_vfs);
 
     pcie_sriov_pf_init_vf_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
                               PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size);
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index b7cf1494e7..c70aed8c66 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -517,7 +517,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
             for (i = 0; i < ARRAY_SIZE(subsys->ctrls); i++) {
                 NvmeCtrl *ctrl = subsys->ctrls[i];
 
-                if (ctrl) {
+                if (ctrl && ctrl != SUBSYS_SLOT_RSVD) {
                     nvme_attach_ns(ctrl, ns);
                 }
             }
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 479817f66e..fd229f06f0 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -24,7 +24,7 @@
 
 #include "block/nvme.h"
 
-#define NVME_MAX_CONTROLLERS 32
+#define NVME_MAX_CONTROLLERS 256
 #define NVME_MAX_NAMESPACES  256
 #define NVME_EUI64_DEFAULT ((uint64_t)0x5254000000000000)
 
@@ -43,6 +43,7 @@ typedef struct NvmeBus {
 #define TYPE_NVME_SUBSYS "nvme-subsys"
 #define NVME_SUBSYS(obj) \
     OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
+#define SUBSYS_SLOT_RSVD (void *)0xFFFF
 
 typedef struct NvmeSubsystem {
     DeviceState parent_obj;
@@ -463,6 +464,7 @@ typedef struct NvmeCtrl {
     } features;
 
     NvmePriCtrlCap  pri_ctrl_cap;
+    NvmeSecCtrlList sec_ctrl_list;
 } NvmeCtrl;
 
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
@@ -497,6 +499,18 @@ static inline uint16_t nvme_cid(NvmeRequest *req)
     return le16_to_cpu(req->cqe.cid);
 }
 
+static inline NvmeSecCtrlEntry *nvme_sctrl(NvmeCtrl *n)
+{
+    PCIDevice *pci_dev = &n->parent_obj;
+    NvmeCtrl *pf = NVME(pcie_sriov_get_pf(pci_dev));
+
+    if (pci_is_vf(pci_dev)) {
+        return &pf->sec_ctrl_list.sec[pcie_sriov_vf_number(pci_dev)];
+    }
+
+    return NULL;
+}
+
 void nvme_attach_ns(NvmeCtrl *n, NvmeNamespace *ns);
 uint16_t nvme_bounce_data(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
                           NvmeTxDirection dir, NvmeRequest *req);
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 495dcff5eb..43c295056f 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -11,20 +11,71 @@
 
 #include "nvme.h"
 
-int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
+static int nvme_subsys_reserve_cntlids(NvmeCtrl *n, int start, int num)
 {
     NvmeSubsystem *subsys = n->subsys;
-    int cntlid;
+    NvmeSecCtrlList *list = &n->sec_ctrl_list;
+    NvmeSecCtrlEntry *sctrl;
+    int i, cnt = 0;
+
+    for (i = start; i < ARRAY_SIZE(subsys->ctrls) && cnt < num; i++) {
+        if (!subsys->ctrls[i]) {
+            sctrl = &list->sec[cnt];
+            sctrl->scid = cpu_to_le16(i);
+            subsys->ctrls[i] = SUBSYS_SLOT_RSVD;
+            cnt++;
+        }
+    }
+
+    return cnt;
+}
 
-    for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
-        if (!subsys->ctrls[cntlid]) {
-            break;
+static void nvme_subsys_unreserve_cntlids(NvmeCtrl *n)
+{
+    NvmeSubsystem *subsys = n->subsys;
+    NvmeSecCtrlList *list = &n->sec_ctrl_list;
+    NvmeSecCtrlEntry *sctrl;
+    int i, cntlid;
+
+    for (i = 0; i < n->params.sriov_max_vfs; i++) {
+        sctrl = &list->sec[i];
+        cntlid = le16_to_cpu(sctrl->scid);
+
+        if (cntlid) {
+            assert(subsys->ctrls[cntlid] == SUBSYS_SLOT_RSVD);
+            subsys->ctrls[cntlid] = NULL;
+            sctrl->scid = 0;
         }
     }
+}
 
-    if (cntlid == ARRAY_SIZE(subsys->ctrls)) {
-        error_setg(errp, "no more free controller id");
-        return -1;
+int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
+{
+    NvmeSubsystem *subsys = n->subsys;
+    NvmeSecCtrlEntry *sctrl = nvme_sctrl(n);
+    int cntlid, num_rsvd, num_vfs = n->params.sriov_max_vfs;
+
+    if (pci_is_vf(&n->parent_obj)) {
+        cntlid = le16_to_cpu(sctrl->scid);
+    } else {
+        for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
+            if (!subsys->ctrls[cntlid]) {
+                break;
+            }
+        }
+
+        if (cntlid == ARRAY_SIZE(subsys->ctrls)) {
+            error_setg(errp, "no more free controller id");
+            return -1;
+        }
+
+        num_rsvd = nvme_subsys_reserve_cntlids(n, cntlid + 1, num_vfs);
+        if (num_rsvd != num_vfs) {
+            nvme_subsys_unreserve_cntlids(n);
+            error_setg(errp,
+                       "no more free controller ids for secondary controllers");
+            return -1;
+        }
     }
 
     subsys->ctrls[cntlid] = n;
@@ -34,7 +85,12 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
 
 void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n)
 {
-    subsys->ctrls[n->cntlid] = NULL;
+    if (pci_is_vf(&n->parent_obj)) {
+        subsys->ctrls[n->cntlid] = SUBSYS_SLOT_RSVD;
+    } else {
+        subsys->ctrls[n->cntlid] = NULL;
+        nvme_subsys_unreserve_cntlids(n);
+    }
 }
 
 static void nvme_subsys_setup(NvmeSubsystem *subsys)
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index 1014ebceb6..dd2aac3418 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -53,6 +53,7 @@ pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
 pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_ctrl_list(uint8_t cns, uint16_t cntid) "cns 0x%"PRIx8" cntid %"PRIu16""
 pci_nvme_identify_pri_ctrl_cap(uint16_t cntlid) "identify primary controller capabilities cntlid=%"PRIu16""
+pci_nvme_identify_sec_ctrl_list(uint16_t cntlid, uint8_t numcntl) "identify secondary controller list cntlid=%"PRIu16" numcntl=%"PRIu8""
 pci_nvme_identify_ns_csi(uint32_t ns, uint8_t csi) "nsid=%"PRIu32", csi=0x%"PRIx8""
 pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "nsid=%"PRIu16", csi=0x%"PRIx8""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index f69bd1d14f..96595ea8f1 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1018,6 +1018,7 @@ enum NvmeIdCns {
     NVME_ID_CNS_NS_ATTACHED_CTRL_LIST = 0x12,
     NVME_ID_CNS_CTRL_LIST             = 0x13,
     NVME_ID_CNS_PRIMARY_CTRL_CAP      = 0x14,
+    NVME_ID_CNS_SECONDARY_CTRL_LIST   = 0x15,
     NVME_ID_CNS_CS_NS_PRESENT_LIST    = 0x1a,
     NVME_ID_CNS_CS_NS_PRESENT         = 0x1b,
     NVME_ID_CNS_IO_COMMAND_SET        = 0x1c,
@@ -1487,6 +1488,23 @@ typedef struct QEMU_PACKED NvmePriCtrlCap {
     uint8_t     rsvd80[4016];
 } NvmePriCtrlCap;
 
+typedef struct QEMU_PACKED NvmeSecCtrlEntry {
+    uint16_t    scid;
+    uint16_t    pcid;
+    uint8_t     scs;
+    uint8_t     rsvd5[3];
+    uint16_t    vfn;
+    uint16_t    nvq;
+    uint16_t    nvi;
+    uint8_t     rsvd14[18];
+} NvmeSecCtrlEntry;
+
+typedef struct QEMU_PACKED NvmeSecCtrlList {
+    uint8_t             numcntl;
+    uint8_t             rsvd1[31];
+    NvmeSecCtrlEntry    sec[127];
+} NvmeSecCtrlList;
+
 static inline void _nvme_check_size(void)
 {
     QEMU_BUILD_BUG_ON(sizeof(NvmeBar) != 4096);
@@ -1520,5 +1538,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescr) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeDifTuple) != 8);
     QEMU_BUILD_BUG_ON(sizeof(NvmePriCtrlCap) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeSecCtrlEntry) != 32);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeSecCtrlList) != 4096);
 }
 #endif
-- 
2.25.1




* [PATCH 08/15] pcie: Add 1.2 version token for the Power Management Capability
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (6 preceding siblings ...)
  2021-10-07 16:23 ` [PATCH 07/15] hw/nvme: Add support for Secondary Controller List Lukasz Maniak
@ 2021-10-07 16:23 ` Lukasz Maniak
  2021-10-07 16:24 ` [PATCH 09/15] hw/nvme: Implement the Function Level Reset Lukasz Maniak
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:23 UTC (permalink / raw)
  To: qemu-devel
  Cc: Łukasz Gieryk, Lukasz Maniak, qemu-block, Michael S. Tsirkin

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
---
 include/hw/pci/pci_regs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
index 77ba64b931..a590140962 100644
--- a/include/hw/pci/pci_regs.h
+++ b/include/hw/pci/pci_regs.h
@@ -4,5 +4,6 @@
 #include "standard-headers/linux/pci_regs.h"
 
 #define  PCI_PM_CAP_VER_1_1     0x0002  /* PCI PM spec ver. 1.1 */
+#define  PCI_PM_CAP_VER_1_2     0x0003  /* PCI PM spec ver. 1.2 */
 
 #endif
-- 
2.25.1




* [PATCH 09/15] hw/nvme: Implement the Function Level Reset
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (7 preceding siblings ...)
  2021-10-07 16:23 ` [PATCH 08/15] pcie: Add 1.2 version token for the Power Management Capability Lukasz Maniak
@ 2021-10-07 16:24 ` Lukasz Maniak
  2021-11-02 14:35   ` Klaus Jensen
  2021-10-07 16:24 ` [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime Lukasz Maniak
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Keith Busch, Łukasz Gieryk, Klaus Jensen, Lukasz Maniak, qemu-block

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

This patch implements FLR, a feature currently not implemented for the
NVMe device, although it is listed as mandatory ("shall") in the 1.4 spec.

The implementation reuses FLR-related building blocks defined for the
pci-bridge module, and follows the same logic:
    - FLR capability is advertised in the PCIE config,
    - custom pci_write_config callback detects a write to the trigger
      register and performs the PCI reset,
    - which, eventually, calls the custom dc->reset handler.

Depending on reset type, parts of the state should (or should not) be
cleared. To distinguish the type of reset, an additional parameter is
passed to the reset function.

This patch also enables advertisement of the Power Management PCI
capability. The main reason behind it is to announce the no_soft_reset=1
bit, to signal SR/IOV support where each VF can be reset individually.

The implementation purposely ignores writes to the PMCS.PS register,
as even such naïve behavior is enough to correctly handle the D3->D0
transition.

It's worth noting that the power state transition back to D3, with
all the corresponding side effects, wasn't and still isn't handled
properly.
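
For reference, a short summary of how the two reset types end up being
used (reconstructed from the hunks below):

    /*
     * CC.EN 1 -> 0 (controller reset) : nvme_ctrl_reset(n, NVME_RESET_CONTROLLER)
     * FLR via dc->reset, device exit  : nvme_ctrl_reset(n, NVME_RESET_FUNCTION)
     *
     * Only the function-level variant disables the VFs of an SR-IOV-capable
     * PF; a plain controller reset leaves them enabled.
     */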

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
---
 hw/nvme/ctrl.c       | 52 ++++++++++++++++++++++++++++++++++++++++----
 hw/nvme/nvme.h       |  5 +++++
 hw/nvme/trace-events |  1 +
 3 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 9687a7322c..b04cf5eae9 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -5582,7 +5582,7 @@ static void nvme_process_sq(void *opaque)
     }
 }
 
-static void nvme_ctrl_reset(NvmeCtrl *n)
+static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
 {
     NvmeNamespace *ns;
     int i;
@@ -5614,7 +5614,9 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
     }
 
     if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
-        pcie_sriov_pf_disable_vfs(&n->parent_obj);
+        if (rst != NVME_RESET_CONTROLLER) {
+            pcie_sriov_pf_disable_vfs(&n->parent_obj);
+        }
     }
 
     n->aer_queued = 0;
@@ -5848,7 +5850,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
             }
         } else if (!NVME_CC_EN(data) && NVME_CC_EN(cc)) {
             trace_pci_nvme_mmio_stopped();
-            nvme_ctrl_reset(n);
+            nvme_ctrl_reset(n, NVME_RESET_CONTROLLER);
             cc = 0;
             csts &= ~NVME_CSTS_READY;
         }
@@ -6416,6 +6418,28 @@ static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
                               PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size);
 }
 
+static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
+{
+    Error *err = NULL;
+    int ret;
+
+    ret = pci_add_capability(pci_dev, PCI_CAP_ID_PM, offset,
+                             PCI_PM_SIZEOF, &err);
+    if (err) {
+        error_report_err(err);
+        return ret;
+    }
+
+    pci_set_word(pci_dev->config + offset + PCI_PM_PMC,
+                 PCI_PM_CAP_VER_1_2);
+    pci_set_word(pci_dev->config + offset + PCI_PM_CTRL,
+                 PCI_PM_CTRL_NO_SOFT_RESET);
+    pci_set_word(pci_dev->wmask + offset + PCI_PM_CTRL,
+                 PCI_PM_CTRL_STATE_MASK);
+
+    return 0;
+}
+
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
     uint8_t *pci_conf = pci_dev->config;
@@ -6437,7 +6461,9 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
     }
 
     pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
+    nvme_add_pm_capability(pci_dev, 0x60);
     pcie_endpoint_cap_init(pci_dev, 0x80);
+    pcie_cap_flr_init(pci_dev);
     if (n->params.sriov_max_vfs) {
         pcie_ari_init(pci_dev, 0x100, 1);
     }
@@ -6686,7 +6712,7 @@ static void nvme_exit(PCIDevice *pci_dev)
     NvmeNamespace *ns;
     int i;
 
-    nvme_ctrl_reset(n);
+    nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
 
     if (n->subsys) {
         for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
@@ -6785,6 +6811,22 @@ static void nvme_set_smart_warning(Object *obj, Visitor *v, const char *name,
     }
 }
 
+static void nvme_pci_reset(DeviceState *qdev)
+{
+    PCIDevice *pci_dev = PCI_DEVICE(qdev);
+    NvmeCtrl *n = NVME(pci_dev);
+
+    trace_pci_nvme_pci_reset();
+    nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
+}
+
+static void nvme_pci_write_config(PCIDevice *dev, uint32_t address,
+                                  uint32_t val, int len)
+{
+    pci_default_write_config(dev, address, val, len);
+    pcie_cap_flr_write_config(dev, address, val, len);
+}
+
 static const VMStateDescription nvme_vmstate = {
     .name = "nvme",
     .unmigratable = 1,
@@ -6796,6 +6838,7 @@ static void nvme_class_init(ObjectClass *oc, void *data)
     PCIDeviceClass *pc = PCI_DEVICE_CLASS(oc);
 
     pc->realize = nvme_realize;
+    pc->config_write = nvme_pci_write_config;
     pc->exit = nvme_exit;
     pc->class_id = PCI_CLASS_STORAGE_EXPRESS;
     pc->revision = 2;
@@ -6804,6 +6847,7 @@ static void nvme_class_init(ObjectClass *oc, void *data)
     dc->desc = "Non-Volatile Memory Express";
     device_class_set_props(dc, nvme_props);
     dc->vmsd = &nvme_vmstate;
+    dc->reset = nvme_pci_reset;
 }
 
 static void nvme_instance_init(Object *obj)
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index fd229f06f0..9fbb0a70b5 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -467,6 +467,11 @@ typedef struct NvmeCtrl {
     NvmeSecCtrlList sec_ctrl_list;
 } NvmeCtrl;
 
+typedef enum NvmeResetType {
+    NVME_RESET_FUNCTION   = 0,
+    NVME_RESET_CONTROLLER = 1,
+} NvmeResetType;
+
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
 {
     if (!nsid || nsid > NVME_MAX_NAMESPACES) {
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index dd2aac3418..88678fc21e 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -105,6 +105,7 @@ pci_nvme_set_descriptor_extension(uint64_t slba, uint32_t zone_idx) "set zone de
 pci_nvme_zd_extension_set(uint32_t zone_idx) "set descriptor extension for zone_idx=%"PRIu32""
 pci_nvme_clear_ns_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Closed state"
 pci_nvme_clear_ns_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Empty state"
+pci_nvme_pci_reset(void) "PCI Function Level Reset"
 
 # error conditions
 pci_nvme_err_mdts(size_t len) "len %zu"
-- 
2.25.1




* [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (8 preceding siblings ...)
  2021-10-07 16:24 ` [PATCH 09/15] hw/nvme: Implement the Function Level Reset Lukasz Maniak
@ 2021-10-07 16:24 ` Lukasz Maniak
  2021-10-18 10:06   ` Philippe Mathieu-Daudé
                     ` (2 more replies)
  2021-10-07 16:24 ` [PATCH 11/15] hw/nvme: Calculate BAR atributes in a function Lukasz Maniak
                   ` (6 subsequent siblings)
  16 siblings, 3 replies; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Keith Busch, Łukasz Gieryk, Klaus Jensen, Lukasz Maniak, qemu-block

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

The Nvme device defines two properties: max_ioqpairs, msix_qsize. Having
them as constants is problematic for SR-IOV support.

The SR-IOV feature introduces virtual resources (queues, interrupts)
that can be assigned to PF and its dependent VFs. Each device, following
a reset, should work with the configured number of queues. A single
constant is no longer sufficient to hold the whole state.

This patch tries to solve the problem by introducing additional
variables in NvmeCtrl’s state. The variables for, e.g., managing queues
are therefore organized as:

 - n->params.max_ioqpairs – no changes, constant set by the user.

 - n->max_ioqpairs - (new) value derived from n->params.* in realize();
                     constant through device’s lifetime.

 - n->(mutable_state) – (not a part of this patch) user-configurable,
                        specifies number of queues available _after_
                        reset.

 - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’
                      n->params.max_ioqpairs; initialized in realize()
                      and updated during reset() to reflect user’s
                      changes to the mutable state.

Since the number of available I/O queues and interrupts can change at
runtime, the buffers for SQs/CQs and the MSI-X-related structures are
allocated large enough to cover the limits, which avoids complicated
reallocation altogether. A helper function (nvme_update_msixcap_ts)
updates the corresponding capability register to signal configuration
changes.
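
A minimal sketch of how the new fields relate (names as in the diff
below; per-VF values only start to differ in later patches of this
series):

    /* realize():
     *     n->max_ioqpairs    = n->params.max_ioqpairs;   // allocation limit
     *     n->conf_ioqpairs   = n->max_ioqpairs;          // currently usable
     *     n->max_msix_qsize  = n->params.msix_qsize;
     *     n->conf_msix_qsize = n->max_msix_qsize;
     *
     * reset():
     *     later patches lower n->conf_* to the currently assigned resources;
     *     nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
     *
     * Queue-id validation then checks the configured limit, e.g.:
     */
    static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
    {
        return sqid < n->conf_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1;
    }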

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
---
 hw/nvme/ctrl.c | 62 +++++++++++++++++++++++++++++++++-----------------
 hw/nvme/nvme.h |  4 ++++
 2 files changed, 45 insertions(+), 21 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index b04cf5eae9..5d9166d66f 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -416,12 +416,12 @@ static bool nvme_nsid_valid(NvmeCtrl *n, uint32_t nsid)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-    return sqid < n->params.max_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1;
+    return sqid < n->conf_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-    return cqid < n->params.max_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1;
+    return cqid < n->conf_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -4034,8 +4034,7 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
         trace_pci_nvme_err_invalid_create_sq_cqid(cqid);
         return NVME_INVALID_CQID | NVME_DNR;
     }
-    if (unlikely(!sqid || sqid > n->params.max_ioqpairs ||
-        n->sq[sqid] != NULL)) {
+    if (unlikely(!sqid || sqid > n->conf_ioqpairs || n->sq[sqid] != NULL)) {
         trace_pci_nvme_err_invalid_create_sq_sqid(sqid);
         return NVME_INVALID_QID | NVME_DNR;
     }
@@ -4382,8 +4381,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
     trace_pci_nvme_create_cq(prp1, cqid, vector, qsize, qflags,
                              NVME_CQ_FLAGS_IEN(qflags) != 0);
 
-    if (unlikely(!cqid || cqid > n->params.max_ioqpairs ||
-        n->cq[cqid] != NULL)) {
+    if (unlikely(!cqid || cqid > n->conf_ioqpairs || n->cq[cqid] != NULL)) {
         trace_pci_nvme_err_invalid_create_cq_cqid(cqid);
         return NVME_INVALID_QID | NVME_DNR;
     }
@@ -4399,7 +4397,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
         trace_pci_nvme_err_invalid_create_cq_vector(vector);
         return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
     }
-    if (unlikely(vector >= n->params.msix_qsize)) {
+    if (unlikely(vector >= n->conf_msix_qsize)) {
         trace_pci_nvme_err_invalid_create_cq_vector(vector);
         return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
     }
@@ -4980,13 +4978,12 @@ defaults:
 
         break;
     case NVME_NUMBER_OF_QUEUES:
-        result = (n->params.max_ioqpairs - 1) |
-            ((n->params.max_ioqpairs - 1) << 16);
+        result = (n->conf_ioqpairs - 1) | ((n->conf_ioqpairs - 1) << 16);
         trace_pci_nvme_getfeat_numq(result);
         break;
     case NVME_INTERRUPT_VECTOR_CONF:
         iv = dw11 & 0xffff;
-        if (iv >= n->params.max_ioqpairs + 1) {
+        if (iv >= n->conf_ioqpairs + 1) {
             return NVME_INVALID_FIELD | NVME_DNR;
         }
 
@@ -5141,10 +5138,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
 
         trace_pci_nvme_setfeat_numq((dw11 & 0xffff) + 1,
                                     ((dw11 >> 16) & 0xffff) + 1,
-                                    n->params.max_ioqpairs,
-                                    n->params.max_ioqpairs);
-        req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
-                                      ((n->params.max_ioqpairs - 1) << 16));
+                                    n->conf_ioqpairs,
+                                    n->conf_ioqpairs);
+        req->cqe.result = cpu_to_le32((n->conf_ioqpairs - 1) |
+                                      ((n->conf_ioqpairs - 1) << 16));
         break;
     case NVME_ASYNCHRONOUS_EVENT_CONF:
         n->features.async_config = dw11;
@@ -5582,8 +5579,21 @@ static void nvme_process_sq(void *opaque)
     }
 }
 
+static void nvme_update_msixcap_ts(PCIDevice *pci_dev, uint32_t table_size)
+{
+    uint8_t *config;
+
+    assert(pci_dev->msix_cap);
+    assert(table_size <= pci_dev->msix_entries_nr);
+
+    config = pci_dev->config + pci_dev->msix_cap;
+    pci_set_word_by_mask(config + PCI_MSIX_FLAGS, PCI_MSIX_FLAGS_QSIZE,
+                         table_size - 1);
+}
+
 static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
 {
+    PCIDevice *pci_dev = &n->parent_obj;
     NvmeNamespace *ns;
     int i;
 
@@ -5596,12 +5606,12 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
         nvme_ns_drain(ns);
     }
 
-    for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
+    for (i = 0; i < n->max_ioqpairs + 1; i++) {
         if (n->sq[i] != NULL) {
             nvme_free_sq(n->sq[i], n);
         }
     }
-    for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
+    for (i = 0; i < n->max_ioqpairs + 1; i++) {
         if (n->cq[i] != NULL) {
             nvme_free_cq(n->cq[i], n);
         }
@@ -5613,15 +5623,17 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
         g_free(event);
     }
 
-    if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
+    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
         if (rst != NVME_RESET_CONTROLLER) {
-            pcie_sriov_pf_disable_vfs(&n->parent_obj);
+            pcie_sriov_pf_disable_vfs(pci_dev);
         }
     }
 
     n->aer_queued = 0;
     n->outstanding_aers = 0;
     n->qs_created = false;
+
+    nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
 }
 
 static void nvme_ctrl_shutdown(NvmeCtrl *n)
@@ -6322,11 +6334,17 @@ static void nvme_init_state(NvmeCtrl *n)
     NvmeSecCtrlEntry *sctrl;
     int i;
 
+    n->max_ioqpairs = n->params.max_ioqpairs;
+    n->conf_ioqpairs = n->max_ioqpairs;
+
+    n->max_msix_qsize = n->params.msix_qsize;
+    n->conf_msix_qsize = n->max_msix_qsize;
+
     /* add one to max_ioqpairs to account for the admin queue pair */
     n->reg_size = pow2ceil(sizeof(NvmeBar) +
                            2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
-    n->sq = g_new0(NvmeSQueue *, n->params.max_ioqpairs + 1);
-    n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
+    n->sq = g_new0(NvmeSQueue *, n->max_ioqpairs + 1);
+    n->cq = g_new0(NvmeCQueue *, n->max_ioqpairs + 1);
     n->temperature = NVME_TEMPERATURE;
     n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
@@ -6491,7 +6509,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
         pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
                          PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
     }
-    ret = msix_init(pci_dev, n->params.msix_qsize,
+    ret = msix_init(pci_dev, n->max_msix_qsize,
                     &n->bar0, 0, msix_table_offset,
                     &n->bar0, 0, msix_pba_offset, 0, &err);
     if (ret < 0) {
@@ -6503,6 +6521,8 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
         }
     }
 
+    nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
+
     if (n->params.cmb_size_mb) {
         nvme_init_cmb(n, pci_dev);
     }
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 9fbb0a70b5..65383e495c 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -420,6 +420,10 @@ typedef struct NvmeCtrl {
     uint64_t    starttime_ms;
     uint16_t    temperature;
     uint8_t     smart_critical_warning;
+    uint32_t    max_msix_qsize;                 /* Derived from params.msix.qsize */
+    uint32_t    conf_msix_qsize;                /* Configured limit */
+    uint32_t    max_ioqpairs;                   /* Derived from params.max_ioqpairs */
+    uint32_t    conf_ioqpairs;                  /* Configured limit */
 
     struct {
         MemoryRegion mem;
-- 
2.25.1




* [PATCH 11/15] hw/nvme: Calculate BAR atributes in a function
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (9 preceding siblings ...)
  2021-10-07 16:24 ` [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime Lukasz Maniak
@ 2021-10-07 16:24 ` Lukasz Maniak
  2021-10-18  9:52   ` Philippe Mathieu-Daudé
  2021-10-07 16:24 ` [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers Lukasz Maniak
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Keith Busch, Łukasz Gieryk, Klaus Jensen, Lukasz Maniak, qemu-block

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

An Nvme device with SR-IOV capability calculates the BAR size
differently for PF and VF, so it makes sense to extract the common code
to a separate function.

Also, the n->reg_size field seems to unnecessarily split the BAR
size calculation in two phases; removed to simplify the code.
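
A worked example, assuming the current device defaults max_ioqpairs=64
and msix_qsize=65, the 4-byte doorbell stride (NVME_DB_SIZE) and the
standard 16-byte MSI-X table entry:

    /* nvme_bar_size(65, 65, &msix_table_offset, &msix_pba_offset):
     *
     *   registers : 4096 + 2 * 65 * 4               = 4616  -> pow2ceil -> 8192
     *   MSI-X tbl : 8192 + 16 * 65                  = 9232  -> 4K align -> 12288
     *   MSI-X PBA : 12288 + QEMU_ALIGN_UP(65, 64)/8 = 12304 -> pow2ceil -> 16384
     *
     * i.e. a 16 KiB BAR with msix_table_offset=8192 and msix_pba_offset=12288,
     * the same values the open-coded calculation produced for these defaults.
     */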

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
---
 hw/nvme/ctrl.c | 52 +++++++++++++++++++++++++++++++++-----------------
 hw/nvme/nvme.h |  1 -
 2 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 5d9166d66f..425fbf2c73 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6339,10 +6339,6 @@ static void nvme_init_state(NvmeCtrl *n)
 
     n->max_msix_qsize = n->params.msix_qsize;
     n->conf_msix_qsize = n->max_msix_qsize;
-
-    /* add one to max_ioqpairs to account for the admin queue pair */
-    n->reg_size = pow2ceil(sizeof(NvmeBar) +
-                           2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
     n->sq = g_new0(NvmeSQueue *, n->max_ioqpairs + 1);
     n->cq = g_new0(NvmeCQueue *, n->max_ioqpairs + 1);
     n->temperature = NVME_TEMPERATURE;
@@ -6401,6 +6397,36 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
     memory_region_set_enabled(&n->pmr.dev->mr, false);
 }
 
+static uint64_t nvme_bar_size(unsigned total_queues, unsigned total_irqs,
+                              unsigned *msix_table_offset,
+                              unsigned *msix_pba_offset)
+{
+    uint64_t bar_size, msix_table_size, msix_pba_size;
+
+    bar_size = sizeof(NvmeBar);
+    bar_size += 2 * total_queues * NVME_DB_SIZE;
+    bar_size = pow2ceil(bar_size);
+    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
+
+    if (msix_table_offset) {
+        *msix_table_offset = bar_size;
+    }
+
+    msix_table_size = PCI_MSIX_ENTRY_SIZE * total_irqs;
+    bar_size += msix_table_size;
+    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
+
+    if (msix_pba_offset) {
+        *msix_pba_offset = bar_size;
+    }
+
+    msix_pba_size = QEMU_ALIGN_UP(total_irqs, 64) / 8;
+    bar_size += msix_pba_size;
+
+    bar_size = pow2ceil(bar_size);
+    return bar_size;
+}
+
 static void nvme_update_vfs(PCIDevice *pci_dev, uint16_t prev_num_vfs,
                             uint16_t num_vfs)
 {
@@ -6461,7 +6487,7 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
     uint8_t *pci_conf = pci_dev->config;
-    uint64_t bar_size, msix_table_size, msix_pba_size;
+    uint64_t bar_size;
     unsigned msix_table_offset, msix_pba_offset;
     int ret;
 
@@ -6486,21 +6512,13 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
         pcie_ari_init(pci_dev, 0x100, 1);
     }
 
-    bar_size = QEMU_ALIGN_UP(n->reg_size, 4 * KiB);
-    msix_table_offset = bar_size;
-    msix_table_size = PCI_MSIX_ENTRY_SIZE * n->params.msix_qsize;
-
-    bar_size += msix_table_size;
-    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
-    msix_pba_offset = bar_size;
-    msix_pba_size = QEMU_ALIGN_UP(n->params.msix_qsize, 64) / 8;
-
-    bar_size += msix_pba_size;
-    bar_size = pow2ceil(bar_size);
+    /* add one to max_ioqpairs to account for the admin queue pair */
+    bar_size = nvme_bar_size(n->max_ioqpairs + 1, n->max_msix_qsize,
+                             &msix_table_offset, &msix_pba_offset);
 
     memory_region_init(&n->bar0, OBJECT(n), "nvme-bar0", bar_size);
     memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
-                          n->reg_size);
+                          msix_table_offset);
     memory_region_add_subregion(&n->bar0, 0, &n->iomem);
 
     if (pci_is_vf(pci_dev)) {
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 65383e495c..a8eded4713 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -410,7 +410,6 @@ typedef struct NvmeCtrl {
     uint16_t    max_prp_ents;
     uint16_t    cqe_size;
     uint16_t    sqe_size;
-    uint32_t    reg_size;
     uint32_t    max_q_ents;
     uint8_t     outstanding_aers;
     uint32_t    irq_status;
-- 
2.25.1




* [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (10 preceding siblings ...)
  2021-10-07 16:24 ` [PATCH 11/15] hw/nvme: Calculate BAR atributes in a function Lukasz Maniak
@ 2021-10-07 16:24 ` Lukasz Maniak
  2021-11-03 12:07   ` Klaus Jensen
  2021-10-07 16:24 ` [PATCH 13/15] pcie: Add helpers to the SR/IOV API Lukasz Maniak
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Łukasz Gieryk,
	Lukasz Maniak, Klaus Jensen, Hanna Reitz, Stefan Hajnoczi,
	Keith Busch, Philippe Mathieu-Daudé

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) one
can configure the maximum number of virtual queues and interrupts
assignable to a single virtual device. The primary and secondary
controller capability structures are initialized accordingly.

Since the number of available queues (interrupts) now varies between
VF/PF, BAR size calculation is also adjusted.
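
An illustrative (untested) command line, following the usage comment the
patch extends; the values are arbitrary, but they have to satisfy the
constraint checks added to nvme_check_constraints() below:

    -device nvme-subsys,id=subsys0
    -device nvme,serial=deadbeef,subsys=subsys0,max_ioqpairs=4,msix_qsize=5, \
            sriov_max_vfs=2,sriov_max_vq_per_vf=3,sriov_max_vi_per_vf=2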

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
---
 hw/nvme/ctrl.c       | 110 +++++++++++++++++++++++++++++++++++++++----
 hw/nvme/nvme.h       |   2 +
 include/block/nvme.h |   5 ++
 3 files changed, 108 insertions(+), 9 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 425fbf2c73..67c7210d7e 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -36,6 +36,8 @@
  *              zoned.zasl=<N[optional]>, \
  *              zoned.auto_transition=<on|off[optional]>, \
  *              sriov_max_vfs=<N[optional]> \
+ *              sriov_max_vi_per_vf=<N[optional]> \
+ *              sriov_max_vq_per_vf=<N[optional]> \
  *              subsys=<subsys_id>
  *      -device nvme-ns,drive=<drive_id>,bus=<bus_name>,nsid=<nsid>,\
  *              zoned=<true|false[optional]>, \
@@ -113,6 +115,18 @@
  *   enables reporting of both SR-IOV and ARI capabilities by the NVMe device.
  *   Virtual function controllers will not report SR-IOV capability.
  *
+ * - `sriov_max_vi_per_vf`
+ *   Indicates the maximum number of virtual interrupt resources assignable
+ *   to a secondary controller. Must be explicitly set if sriov_max_vfs != 0.
+ *   The parameter affects VFs similarly to how msix_qsize affects PF, i.e.,
+ *   determines the number of interrupts available to all queues (admin, io).
+ *
+ * - `sriov_max_vq_per_vf`
+ *   Indicates the maximum number of virtual queue resources assignable to
+ *   a secondary controller. Must be explicitly set if sriov_max_vfs != 0.
+ *   The parameter affects VFs similarly to how max_ioqpairs affects PF,
+ *   except the number of flexible queues includes the admin queue.
+ *
  * nvme namespace device parameters
  * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  * - `shared`
@@ -184,6 +198,7 @@
 #define NVME_NUM_FW_SLOTS 1
 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
 #define NVME_MAX_VFS 127
+#define NVME_VF_RES_GRANULARITY 1
 #define NVME_VF_OFFSET 0x1
 #define NVME_VF_STRIDE 1
 
@@ -6254,6 +6269,7 @@ static const MemoryRegionOps nvme_cmb_ops = {
 static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 {
     NvmeParams *params = &n->params;
+    int msix_total;
 
     if (params->num_queues) {
         warn_report("num_queues is deprecated; please use max_ioqpairs "
@@ -6324,6 +6340,30 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
                        NVME_MAX_VFS);
             return;
         }
+
+        if (params->sriov_max_vi_per_vf < 1 ||
+            (params->sriov_max_vi_per_vf - 1) % NVME_VF_RES_GRANULARITY) {
+            error_setg(errp, "sriov_max_vi_per_vf must meet:"
+                       " (X - 1) %% %d == 0 and X >= 1",
+                       NVME_VF_RES_GRANULARITY);
+            return;
+        }
+
+        if (params->sriov_max_vq_per_vf < 2 ||
+            (params->sriov_max_vq_per_vf - 1) % NVME_VF_RES_GRANULARITY) {
+            error_setg(errp, "sriov_max_vq_per_vf must meet:"
+                       " (X - 1) %% %d == 0 and X >= 2",
+                       NVME_VF_RES_GRANULARITY);
+            return;
+        }
+
+        msix_total = params->msix_qsize +
+                     params->sriov_max_vfs * params->sriov_max_vi_per_vf;
+        if (msix_total > PCI_MSIX_FLAGS_QSIZE + 1) {
+            error_setg(errp, "sriov_max_vi_per_vf is too big for max_vfs=%d",
+                       params->sriov_max_vfs);
+            return;
+        }
     }
 }
 
@@ -6332,13 +6372,35 @@ static void nvme_init_state(NvmeCtrl *n)
     NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
     NvmeSecCtrlList *list = &n->sec_ctrl_list;
     NvmeSecCtrlEntry *sctrl;
+    uint8_t max_vfs;
+    uint32_t total_vq, total_vi;
     int i;
 
-    n->max_ioqpairs = n->params.max_ioqpairs;
-    n->conf_ioqpairs = n->max_ioqpairs;
+    if (pci_is_vf(&n->parent_obj)) {
+        sctrl = nvme_sctrl(n);
+
+        max_vfs = 0;
+
+        n->max_ioqpairs = n->params.sriov_max_vq_per_vf - 1;
+        n->conf_ioqpairs = sctrl->nvq ? le16_to_cpu(sctrl->nvq) - 1 : 0;
+
+        n->max_msix_qsize = n->params.sriov_max_vi_per_vf;
+        n->conf_msix_qsize = sctrl->nvi ? le16_to_cpu(sctrl->nvi) : 1;
+    } else {
+        max_vfs = n->params.sriov_max_vfs;
+
+        n->max_ioqpairs = n->params.max_ioqpairs +
+                          max_vfs * n->params.sriov_max_vq_per_vf;
+        n->conf_ioqpairs = n->max_ioqpairs;
+
+        n->max_msix_qsize = n->params.msix_qsize +
+                            max_vfs * n->params.sriov_max_vi_per_vf;
+        n->conf_msix_qsize = n->max_msix_qsize;
+    }
+
+    total_vq = n->params.sriov_max_vq_per_vf * max_vfs;
+    total_vi = n->params.sriov_max_vi_per_vf * max_vfs;
 
-    n->max_msix_qsize = n->params.msix_qsize;
-    n->conf_msix_qsize = n->max_msix_qsize;
     n->sq = g_new0(NvmeSQueue *, n->max_ioqpairs + 1);
     n->cq = g_new0(NvmeCQueue *, n->max_ioqpairs + 1);
     n->temperature = NVME_TEMPERATURE;
@@ -6346,13 +6408,34 @@ static void nvme_init_state(NvmeCtrl *n)
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
     n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
 
-    list->numcntl = cpu_to_le16(n->params.sriov_max_vfs);
-    for (i = 0; i < n->params.sriov_max_vfs; i++) {
+    list->numcntl = cpu_to_le16(max_vfs);
+    for (i = 0; i < max_vfs; i++) {
         sctrl = &list->sec[i];
         sctrl->pcid = cpu_to_le16(n->cntlid);
     }
 
     cap->cntlid = cpu_to_le16(n->cntlid);
+    cap->crt = NVME_CRT_VQ | NVME_CRT_VI;
+
+    cap->vqfrt = cpu_to_le32(total_vq);
+    cap->vqrfap = cpu_to_le32(total_vq);
+    if (pci_is_vf(&n->parent_obj)) {
+        cap->vqprt = cpu_to_le16(n->conf_ioqpairs + 1);
+    } else {
+        cap->vqprt = cpu_to_le16(n->params.max_ioqpairs + 1);
+        cap->vqfrsm = cpu_to_le16(n->params.sriov_max_vq_per_vf);
+        cap->vqgran = cpu_to_le16(NVME_VF_RES_GRANULARITY);
+    }
+
+    cap->vifrt = cpu_to_le32(total_vi);
+    cap->virfap = cpu_to_le32(total_vi);
+    if (pci_is_vf(&n->parent_obj)) {
+        cap->viprt = cpu_to_le16(n->conf_msix_qsize);
+    } else {
+        cap->viprt = cpu_to_le16(n->params.msix_qsize);
+        cap->vifrsm = cpu_to_le16(n->params.sriov_max_vi_per_vf);
+        cap->vigran = cpu_to_le16(NVME_VF_RES_GRANULARITY);
+    }
 }
 
 static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
@@ -6448,11 +6531,13 @@ static void nvme_update_vfs(PCIDevice *pci_dev, uint16_t prev_num_vfs,
     }
 }
 
-static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
-                            uint64_t bar_size)
+static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset)
 {
     uint16_t vf_dev_id = n->params.use_intel_id ?
                          PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME;
+    uint64_t bar_size = nvme_bar_size(n->params.sriov_max_vq_per_vf,
+                                      n->params.sriov_max_vi_per_vf,
+                                      NULL, NULL);
 
     pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
                        n->params.sriov_max_vfs, n->params.sriov_max_vfs,
@@ -6550,7 +6635,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
     }
 
     if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
-        nvme_init_sriov(n, pci_dev, 0x120, bar_size);
+        nvme_init_sriov(n, pci_dev, 0x120);
     }
 
     return 0;
@@ -6574,6 +6659,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NvmeIdCtrl *id = &n->id_ctrl;
     uint8_t *pci_conf = pci_dev->config;
     uint64_t cap = ldq_le_p(&n->bar.cap);
+    NvmeSecCtrlEntry *sctrl = nvme_sctrl(n);
 
     id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
     id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
@@ -6665,6 +6751,10 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
 
     stl_le_p(&n->bar.vs, NVME_SPEC_VER);
     n->bar.intmc = n->bar.intms = 0;
+
+    if (pci_is_vf(&n->parent_obj) && !sctrl->scs) {
+        stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
+    }
 }
 
 static int nvme_init_subsys(NvmeCtrl *n, Error **errp)
@@ -6804,6 +6894,8 @@ static Property nvme_props[] = {
     DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
                      params.auto_transition_zones, true),
     DEFINE_PROP_UINT8("sriov_max_vfs", NvmeCtrl, params.sriov_max_vfs, 0),
+    DEFINE_PROP_UINT8("sriov_max_vi_per_vf", NvmeCtrl, params.sriov_max_vi_per_vf, 0),
+    DEFINE_PROP_UINT8("sriov_max_vq_per_vf", NvmeCtrl, params.sriov_max_vq_per_vf, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index a8eded4713..43609c979a 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -393,6 +393,8 @@ typedef struct NvmeParams {
     bool     auto_transition_zones;
     bool     legacy_cmb;
     uint8_t  sriov_max_vfs;
+    uint8_t  sriov_max_vq_per_vf;
+    uint8_t  sriov_max_vi_per_vf;
 } NvmeParams;
 
 typedef struct NvmeCtrl {
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 96595ea8f1..26672d0a31 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1488,6 +1488,11 @@ typedef struct QEMU_PACKED NvmePriCtrlCap {
     uint8_t     rsvd80[4016];
 } NvmePriCtrlCap;
 
+typedef enum NvmePriCtrlCapCrt {
+    NVME_CRT_VQ             = 1 << 0,
+    NVME_CRT_VI             = 1 << 1,
+} NvmePriCtrlCapCrt;
+
 typedef struct QEMU_PACKED NvmeSecCtrlEntry {
     uint16_t    scid;
     uint16_t    pcid;
-- 
2.25.1




* [PATCH 13/15] pcie: Add helpers to the SR/IOV API
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (11 preceding siblings ...)
  2021-10-07 16:24 ` [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers Lukasz Maniak
@ 2021-10-07 16:24 ` Lukasz Maniak
  2021-10-26 16:57   ` Knut Omang
  2021-10-07 16:24 ` [PATCH 14/15] hw/nvme: Add support for the Virtualization Management command Lukasz Maniak
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Łukasz Gieryk, Lukasz Maniak, qemu-block, Michael S. Tsirkin

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

Two convenience functions for retrieving:
 - the total number of VFs,
 - the PCIDevice object of the N-th VF.
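
A short usage sketch for the two helpers (the loop body is hypothetical):

    static void pf_for_each_vf(PCIDevice *pf_dev)
    {
        uint16_t i, total = pcie_sriov_vf_number_total(pf_dev);

        for (i = 0; i < total; i++) {
            PCIDevice *vf = pcie_sriov_get_vf_at_index(pf_dev, i);
            if (vf) {
                /* e.g. reset or reconfigure the VF's device state */
            }
        }
    }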

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
---
 hw/pci/pcie_sriov.c         | 14 ++++++++++++++
 include/hw/pci/pcie_sriov.h |  8 ++++++++
 2 files changed, 22 insertions(+)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index cac2aee061..5a8e92d5ab 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -292,8 +292,22 @@ uint16_t pcie_sriov_vf_number(PCIDevice *dev)
     return dev->exp.sriov_vf.vf_number;
 }
 
+uint16_t pcie_sriov_vf_number_total(PCIDevice *dev)
+{
+    assert(!pci_is_vf(dev));
+    return dev->exp.sriov_pf.num_vfs;
+}
 
 PCIDevice *pcie_sriov_get_pf(PCIDevice *dev)
 {
     return dev->exp.sriov_vf.pf;
 }
+
+PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n)
+{
+    assert(!pci_is_vf(dev));
+    if (n < dev->exp.sriov_pf.num_vfs) {
+        return dev->exp.sriov_pf.vf[n];
+    }
+    return NULL;
+}
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index 9ab48b79c0..d1f39b7223 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -65,9 +65,17 @@ void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
 /* Get logical VF number of a VF - only valid for VFs */
 uint16_t pcie_sriov_vf_number(PCIDevice *dev);
 
+/* Get the total number of VFs - only valid for PF */
+uint16_t pcie_sriov_vf_number_total(PCIDevice *dev);
+
 /* Get the physical function that owns this VF.
  * Returns NULL if dev is not a virtual function
  */
 PCIDevice *pcie_sriov_get_pf(PCIDevice *dev);
 
+/* Get the n-th VF of this physical function - only valid for PF.
+ * Returns NULL if index is invalid
+ */
+PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n);
+
 #endif /* QEMU_PCIE_SRIOV_H */
-- 
2.25.1




* [PATCH 14/15] hw/nvme: Add support for the Virtualization Management command
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (12 preceding siblings ...)
  2021-10-07 16:24 ` [PATCH 13/15] pcie: Add helpers to the SR/IOV API Lukasz Maniak
@ 2021-10-07 16:24 ` Lukasz Maniak
  2021-10-07 16:24 ` [PATCH 15/15] docs: Add documentation for SR-IOV and Virtualization Enhancements Lukasz Maniak
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Łukasz Gieryk,
	Lukasz Maniak, Klaus Jensen, Hanna Reitz, Stefan Hajnoczi,
	Keith Busch, Philippe Mathieu-Daudé

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

With the new command one can:
 - assign flexible resources (queues, interrupts) to primary and
   secondary controllers,
 - toggle the online/offline state of a given controller.
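
For reference, a sketch of the command encoding as decoded by the
nvme_virt_mngmt() handler added below:

    /* Virtualization Management (admin opcode NVME_ADM_CMD_VIRT_MNGMT):
     *
     *   cdw10[3:0]    act    - action: assign to secondary, allocate to
     *                          primary, secondary online, secondary offline
     *   cdw10[10:8]   rt     - resource type: 0 = VQ, 1 = VI
     *   cdw10[31:16]  cntlid - target controller id
     *   cdw11[15:0]   nr     - number of flexible resources
     */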

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
---
 hw/nvme/ctrl.c       | 207 ++++++++++++++++++++++++++++++++++++++++++-
 hw/nvme/nvme.h       |  16 ++++
 hw/nvme/trace-events |   3 +
 include/block/nvme.h |  17 ++++
 4 files changed, 241 insertions(+), 2 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 67c7210d7e..0c44d9b23a 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -246,6 +246,7 @@ static const uint32_t nvme_cse_acs[256] = {
     [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_NS_ATTACHMENT]    = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
+    [NVME_ADM_CMD_VIRT_MNGMT]       = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_FORMAT_NVM]       = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 };
 
@@ -277,6 +278,7 @@ static const uint32_t nvme_cse_iocs_zoned[256] = {
 };
 
 static void nvme_process_sq(void *opaque);
+static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst);
 
 static uint16_t nvme_sqid(NvmeRequest *req)
 {
@@ -5506,6 +5508,163 @@ out:
     return status;
 }
 
+static void nvme_get_virt_res_num(NvmeCtrl *n, uint8_t rt, int *num_total,
+                                  int *num_prim, int *num_sec)
+{
+    *num_total = le32_to_cpu(rt ? n->pri_ctrl_cap.vifrt : n->pri_ctrl_cap.vqfrt);
+    *num_prim = le16_to_cpu(rt ? n->pri_ctrl_cap.virfap : n->pri_ctrl_cap.vqrfap);
+    *num_sec = le16_to_cpu(rt ? n->pri_ctrl_cap.virfa : n->pri_ctrl_cap.vqrfa);
+}
+
+static uint16_t nvme_assign_virt_res_to_prim(NvmeCtrl *n, NvmeRequest *req,
+                                             uint16_t cntlid, uint8_t rt, int nr)
+{
+    int num_total, num_prim, num_sec;
+
+    if (cntlid != n->cntlid) {
+        return NVME_INVALID_CTRL_ID;
+    }
+
+    nvme_get_virt_res_num(n, rt, &num_total, &num_prim, &num_sec);
+
+    if (nr > num_total) {
+        return NVME_INVALID_NUM_RESOURCES;
+    }
+
+    if (nr > num_total - num_sec) {
+        return NVME_INVALID_RESOURCE_ID;
+    }
+
+    if (rt) {
+        n->pri_ctrl_cap.virfap = cpu_to_le16(nr);
+    } else {
+        n->pri_ctrl_cap.vqrfap = cpu_to_le16(nr);
+    }
+
+    req->cqe.result = cpu_to_le32(nr);
+    return req->status;
+}
+
+static void nvme_update_virt_res(NvmeCtrl *n, NvmeSecCtrlEntry *sctrl,
+                                 uint8_t rt, int nr)
+{
+    int prev_nr, prev_total;
+
+    if (rt) {
+        prev_nr = le16_to_cpu(sctrl->nvi);
+        prev_total = le32_to_cpu(n->pri_ctrl_cap.virfa);
+        sctrl->nvi = cpu_to_le16(nr);
+        n->pri_ctrl_cap.virfa = cpu_to_le32(prev_total + nr - prev_nr);
+    } else {
+        prev_nr = le16_to_cpu(sctrl->nvq);
+        prev_total = le32_to_cpu(n->pri_ctrl_cap.vqrfa);
+        sctrl->nvq = cpu_to_le16(nr);
+        n->pri_ctrl_cap.vqrfa = cpu_to_le32(prev_total + nr - prev_nr);
+    }
+}
+
+static uint16_t nvme_assign_virt_res_to_sec(NvmeCtrl *n, NvmeRequest *req,
+                                            uint16_t cntlid, uint8_t rt, int nr)
+{
+    int limit = rt ? n->params.sriov_max_vi_per_vf :
+                     n->params.sriov_max_vq_per_vf;
+    int num_total, num_prim, num_sec, num_free, diff;
+    NvmeSecCtrlEntry *sctrl;
+
+    sctrl = nvme_sctrl_for_cntlid(n, cntlid);
+    if (!sctrl) {
+        return NVME_INVALID_CTRL_ID;
+    }
+
+    if (sctrl->scs) {
+        return NVME_INVALID_SEC_CTRL_STATE;
+    }
+
+    if (nr > limit) {
+        return NVME_INVALID_NUM_RESOURCES;
+    }
+
+    nvme_get_virt_res_num(n, rt, &num_total, &num_prim, &num_sec);
+    num_free = num_total - num_prim - num_sec;
+    diff = nr - le16_to_cpu(rt ? sctrl->nvi : sctrl->nvq);
+
+    if (diff > num_free) {
+        return NVME_INVALID_RESOURCE_ID;
+    }
+
+    nvme_update_virt_res(n, sctrl, rt, nr);
+    req->cqe.result = cpu_to_le32(nr);
+
+    return req->status;
+}
+
+static uint16_t nvme_virt_set_state(NvmeCtrl *n, uint16_t cntlid, bool online)
+{
+    NvmeCtrl *sn = NULL;
+    NvmeSecCtrlEntry *sctrl;
+
+    sctrl = nvme_sctrl_for_cntlid(n, cntlid);
+    if (!sctrl) {
+        return NVME_INVALID_CTRL_ID;
+    }
+
+    if (sctrl->vfn) {
+        sn = NVME(pcie_sriov_get_vf_at_index(&n->parent_obj,
+                                             le16_to_cpu(sctrl->vfn) - 1));
+    }
+
+    if (online) {
+        if (!NVME_CC_EN(ldl_le_p(&n->bar.cc)) || !sctrl->nvi ||
+            (le16_to_cpu(sctrl->nvq) < 2)) {
+            return NVME_INVALID_SEC_CTRL_STATE;
+        }
+
+        if (!sctrl->scs) {
+            sctrl->scs = 0x1;
+            if (sn) {
+                nvme_ctrl_reset(sn, NVME_RESET_CONTROLLER);
+            }
+        }
+    } else {
+        if (sctrl->scs) {
+            sctrl->scs = 0x0;
+            if (sn) {
+                nvme_ctrl_reset(sn, NVME_RESET_CONTROLLER);
+            }
+        }
+
+        nvme_update_virt_res(n, sctrl, NVME_VIRT_RES_INTERRUPT, 0);
+        nvme_update_virt_res(n, sctrl, NVME_VIRT_RES_QUEUE, 0);
+    }
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_virt_mngmt(NvmeCtrl *n, NvmeRequest *req)
+{
+    uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
+    uint32_t dw11 = le32_to_cpu(req->cmd.cdw11);
+    uint8_t act = dw10 & 0xf;
+    uint8_t rt = (dw10 >> 8) & 0x7;
+    uint16_t cntlid = (dw10 >> 16) & 0xffff;
+    int nr = dw11 & 0xffff;
+
+    trace_pci_nvme_virt_mngmt(nvme_cid(req), act, cntlid, rt ? "VI" : "VQ", nr);
+
+    switch (act) {
+    case NVME_VIRT_MNGMT_ACTION_SEC_ASSIGN:
+        return nvme_assign_virt_res_to_sec(n, req, cntlid, rt, nr);
+    case NVME_VIRT_MNGMT_ACTION_PRM_ALLOC:
+        return nvme_assign_virt_res_to_prim(n, req, cntlid, rt, nr);
+    case NVME_VIRT_MNGMT_ACTION_SEC_ONLINE:
+        return nvme_virt_set_state(n, cntlid, true);
+    case NVME_VIRT_MNGMT_ACTION_SEC_OFFLINE:
+        return nvme_virt_set_state(n, cntlid, false);
+    default:
+        return NVME_INVALID_FIELD;
+    }
+}
+
 static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
     trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req->cmd.opcode,
@@ -5548,6 +5707,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
         return nvme_aer(n, req);
     case NVME_ADM_CMD_NS_ATTACHMENT:
         return nvme_ns_attachment(n, req);
+    case NVME_ADM_CMD_VIRT_MNGMT:
+        return nvme_virt_mngmt(n, req);
     case NVME_ADM_CMD_FORMAT_NVM:
         return nvme_format(n, req);
     default:
@@ -5609,6 +5770,7 @@ static void nvme_update_msixcap_ts(PCIDevice *pci_dev, uint32_t table_size)
 static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
 {
     PCIDevice *pci_dev = &n->parent_obj;
+    NvmeSecCtrlEntry *sctrl;
     NvmeNamespace *ns;
     int i;
 
@@ -5639,6 +5801,11 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
     }
 
     if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
+        for (i = 0; i < n->sec_ctrl_list.numcntl; i++) {
+            sctrl = &n->sec_ctrl_list.sec[i];
+            nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
+        }
+
         if (rst != NVME_RESET_CONTROLLER) {
             pcie_sriov_pf_disable_vfs(pci_dev);
         }
@@ -5648,6 +5815,19 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
     n->outstanding_aers = 0;
     n->qs_created = false;
 
+    if (pci_is_vf(pci_dev)) {
+        sctrl = nvme_sctrl(n);
+        n->conf_ioqpairs = sctrl->nvq ? le16_to_cpu(sctrl->nvq) - 1 : 0;
+        n->conf_msix_qsize = sctrl->nvi ? le16_to_cpu(sctrl->nvi) : 1;
+        stl_le_p(&n->bar.csts, sctrl->scs ? 0 : NVME_CSTS_FAILED);
+    } else {
+        n->conf_ioqpairs = n->params.max_ioqpairs +
+                           le16_to_cpu(n->pri_ctrl_cap.vqrfap);
+        n->conf_msix_qsize = n->params.msix_qsize +
+                             le16_to_cpu(n->pri_ctrl_cap.virfap);
+        stl_le_p(&n->bar.csts, 0);
+    }
+
     nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
 }
 
@@ -5694,7 +5874,14 @@ static int nvme_start_ctrl(NvmeCtrl *n)
     uint64_t acq = ldq_le_p(&n->bar.acq);
     uint32_t page_bits = NVME_CC_MPS(cc) + 12;
     uint32_t page_size = 1 << page_bits;
+    NvmeSecCtrlEntry *sctrl = nvme_sctrl(n);
 
+    if (pci_is_vf(&n->parent_obj) && !sctrl->scs) {
+        trace_pci_nvme_err_startfail_virt_state(le16_to_cpu(sctrl->nvi),
+                                                le16_to_cpu(sctrl->nvq),
+                                                sctrl->scs ? "ONLINE" : "OFFLINE");
+        return -1;
+    }
     if (unlikely(n->cq[0])) {
         trace_pci_nvme_err_startfail_cq();
         return -1;
@@ -6077,6 +6264,12 @@ static uint64_t nvme_mmio_read(void *opaque, hwaddr addr, unsigned size)
         return 0;
     }
 
+    if (pci_is_vf(&n->parent_obj) && !nvme_sctrl(n)->scs &&
+        addr != NVME_REG_CSTS) {
+        trace_pci_nvme_err_ignored_mmio_vf_offline(addr, size);
+        return 0;
+    }
+
     /*
      * When PMRWBM bit 1 is set then read from
      * from PMRSTS should ensure prior writes
@@ -6226,6 +6419,12 @@ static void nvme_mmio_write(void *opaque, hwaddr addr, uint64_t data,
 
     trace_pci_nvme_mmio_write(addr, data, size);
 
+    if (pci_is_vf(&n->parent_obj) && !nvme_sctrl(n)->scs &&
+        addr != NVME_REG_CSTS) {
+        trace_pci_nvme_err_ignored_mmio_vf_offline(addr, size);
+        return;
+    }
+
     if (addr < sizeof(n->bar)) {
         nvme_write_bar(n, addr, data, size);
     } else {
@@ -6516,6 +6715,7 @@ static void nvme_update_vfs(PCIDevice *pci_dev, uint16_t prev_num_vfs,
     NvmeCtrl *n = NVME(pci_dev);
     uint16_t num_active_vfs = MAX(prev_num_vfs, num_vfs);
     bool vf_enable = (prev_num_vfs < num_vfs);
+    NvmeSecCtrlEntry *sctrl;
     uint16_t i;
 
     /*
@@ -6523,10 +6723,13 @@ static void nvme_update_vfs(PCIDevice *pci_dev, uint16_t prev_num_vfs,
      * VF count can only go from 0 to a set value and vice versa.
      */
     for (i = 0; i < num_active_vfs; i++) {
+        sctrl = &n->sec_ctrl_list.sec[i];
+
         if (vf_enable) {
-            n->sec_ctrl_list.sec[i].vfn = cpu_to_le16(i + 1);
+            sctrl->vfn = cpu_to_le16(i + 1);
         } else {
-            n->sec_ctrl_list.sec[i].vfn = 0;
+            sctrl->vfn = 0;
+            nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
         }
     }
 }
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 43609c979a..79667af635 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -321,6 +321,7 @@ static inline const char *nvme_adm_opc_str(uint8_t opc)
     case NVME_ADM_CMD_GET_FEATURES:     return "NVME_ADM_CMD_GET_FEATURES";
     case NVME_ADM_CMD_ASYNC_EV_REQ:     return "NVME_ADM_CMD_ASYNC_EV_REQ";
     case NVME_ADM_CMD_NS_ATTACHMENT:    return "NVME_ADM_CMD_NS_ATTACHMENT";
+    case NVME_ADM_CMD_VIRT_MNGMT:       return "NVME_ADM_CMD_VIRT_MNGMT";
     case NVME_ADM_CMD_FORMAT_NVM:       return "NVME_ADM_CMD_FORMAT_NVM";
     default:                            return "NVME_ADM_CMD_UNKNOWN";
     }
@@ -521,6 +522,21 @@ static inline NvmeSecCtrlEntry *nvme_sctrl(NvmeCtrl *n)
     return NULL;
 }
 
+static inline NvmeSecCtrlEntry *nvme_sctrl_for_cntlid(NvmeCtrl *n,
+                                                      uint16_t cntlid)
+{
+    NvmeSecCtrlList *list = &n->sec_ctrl_list;
+    uint8_t i;
+
+    for (i = 0; i < list->numcntl; i++) {
+        if (le16_to_cpu(list->sec[i].scid) == cntlid) {
+            return &list->sec[i];
+        }
+    }
+
+    return NULL;
+}
+
 void nvme_attach_ns(NvmeCtrl *n, NvmeNamespace *ns);
 uint16_t nvme_bounce_data(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
                           NvmeTxDirection dir, NvmeRequest *req);
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index 88678fc21e..aab70cc5bd 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -106,6 +106,7 @@ pci_nvme_zd_extension_set(uint32_t zone_idx) "set descriptor extension for zone_
 pci_nvme_clear_ns_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Closed state"
 pci_nvme_clear_ns_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Empty state"
 pci_nvme_pci_reset(void) "PCI Function Level Reset"
+pci_nvme_virt_mngmt(uint16_t cid, uint16_t act, uint16_t cntlid, const char *rt, uint16_t nr) "cid %"PRIu16", act=0x%"PRIx16", cntlid=%"PRIu16" %s nr=%"PRIu16""
 
 # error conditions
 pci_nvme_err_mdts(size_t len) "len %zu"
@@ -175,7 +176,9 @@ pci_nvme_err_startfail_asqent_sz_zero(void) "nvme_start_ctrl failed because the
 pci_nvme_err_startfail_acqent_sz_zero(void) "nvme_start_ctrl failed because the admin completion queue size is zero"
 pci_nvme_err_startfail_zasl_too_small(uint32_t zasl, uint32_t pagesz) "nvme_start_ctrl failed because zone append size limit %"PRIu32" is too small, needs to be >= %"PRIu32""
 pci_nvme_err_startfail(void) "setting controller enable bit failed"
+pci_nvme_err_startfail_virt_state(uint16_t vi, uint16_t vq, const char *state) "nvme_start_ctrl failed due to ctrl state: vi=%"PRIu16" vq=%"PRIu16" %s"
 pci_nvme_err_invalid_mgmt_action(uint8_t action) "action=0x%"PRIx8""
+pci_nvme_err_ignored_mmio_vf_offline(uint64_t addr, unsigned size) "addr 0x%"PRIx64" size %u"
 
 # undefined behavior
 pci_nvme_ub_mmiowr_misaligned32(uint64_t offset) "MMIO write not 32-bit aligned, offset=0x%"PRIx64""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 26672d0a31..320f43d186 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -595,6 +595,7 @@ enum NvmeAdminCommands {
     NVME_ADM_CMD_ACTIVATE_FW    = 0x10,
     NVME_ADM_CMD_DOWNLOAD_FW    = 0x11,
     NVME_ADM_CMD_NS_ATTACHMENT  = 0x15,
+    NVME_ADM_CMD_VIRT_MNGMT     = 0x1c,
     NVME_ADM_CMD_FORMAT_NVM     = 0x80,
     NVME_ADM_CMD_SECURITY_SEND  = 0x81,
     NVME_ADM_CMD_SECURITY_RECV  = 0x82,
@@ -886,6 +887,10 @@ enum NvmeStatusCodes {
     NVME_NS_PRIVATE             = 0x0119,
     NVME_NS_NOT_ATTACHED        = 0x011a,
     NVME_NS_CTRL_LIST_INVALID   = 0x011c,
+    NVME_INVALID_CTRL_ID        = 0x011f,
+    NVME_INVALID_SEC_CTRL_STATE = 0x0120,
+    NVME_INVALID_NUM_RESOURCES  = 0x0121,
+    NVME_INVALID_RESOURCE_ID    = 0x0122,
     NVME_CONFLICTING_ATTRS      = 0x0180,
     NVME_INVALID_PROT_INFO      = 0x0181,
     NVME_WRITE_TO_RO            = 0x0182,
@@ -1510,6 +1515,18 @@ typedef struct QEMU_PACKED NvmeSecCtrlList {
     NvmeSecCtrlEntry    sec[127];
 } NvmeSecCtrlList;
 
+typedef enum NvmeVirtMngmtAction {
+    NVME_VIRT_MNGMT_ACTION_PRM_ALLOC    = 0x01,
+    NVME_VIRT_MNGMT_ACTION_SEC_OFFLINE  = 0x07,
+    NVME_VIRT_MNGMT_ACTION_SEC_ASSIGN   = 0x08,
+    NVME_VIRT_MNGMT_ACTION_SEC_ONLINE   = 0x09,
+} NvmeVirtMngmtAct;
+
+typedef enum NvmeVirtResType {
+    NVME_VIRT_RES_QUEUE         = 0x00,
+    NVME_VIRT_RES_INTERRUPT     = 0x01,
+} NvmeVirtualResourceType;
+
 static inline void _nvme_check_size(void)
 {
     QEMU_BUILD_BUG_ON(sizeof(NvmeBar) != 4096);
-- 
2.25.1
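
For readers mapping the decoding in nvme_virt_mngmt() above back to the
command a host would issue, here is a minimal sketch of the CDW10/CDW11
layout; the helper name is purely illustrative, only the bit positions are
taken from the patch.

/*
 * Illustrative only: build CDW10/CDW11 for Virtualization Management
 * the same way nvme_virt_mngmt() decodes them.
 *   CDW10[3:0]   action (e.g. 0x8 = assign resources to a secondary)
 *   CDW10[10:8]  resource type (0 = VQ, 1 = VI)
 *   CDW10[31:16] controller identifier (CNTLID)
 *   CDW11[15:0]  number of resources
 */
#include <stdint.h>

static inline void encode_virt_mngmt(uint32_t *cdw10, uint32_t *cdw11,
                                     uint8_t act, uint8_t rt,
                                     uint16_t cntlid, uint16_t nr)
{
    *cdw10 = (act & 0xf) | ((uint32_t)(rt & 0x7) << 8) |
             ((uint32_t)cntlid << 16);
    *cdw11 = nr;
}

As a worked example of the reset path above: a secondary controller that
has been assigned nvq=4 and nvi=2 comes out of nvme_ctrl_reset() with
conf_ioqpairs=3 (one of the flexible queues backs the admin queue) and
conf_msix_qsize=2.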



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 15/15] docs: Add documentation for SR-IOV and Virtualization Enhancements
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (13 preceding siblings ...)
  2021-10-07 16:24 ` [PATCH 14/15] hw/nvme: Add support for the Virtualization Management command Lukasz Maniak
@ 2021-10-07 16:24 ` Lukasz Maniak
  2021-10-08  6:31 ` [PATCH 00/15] hw/nvme: SR-IOV with " Klaus Jensen
  2021-10-26 18:20 ` Klaus Jensen
  16 siblings, 0 replies; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-07 16:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: Łukasz Gieryk, Lukasz Maniak, qemu-block

Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
---
 docs/system/devices/nvme.rst | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
index bff72d1c24..904fd7290c 100644
--- a/docs/system/devices/nvme.rst
+++ b/docs/system/devices/nvme.rst
@@ -235,3 +235,30 @@ The virtual namespace device supports DIF- and DIX-based protection information
   to ``1`` to transfer protection information as the first eight bytes of
   metadata. Otherwise, the protection information is transferred as the last
   eight bytes.
+
+Virtualization Enhancements and SR-IOV
+--------------------------------------
+
+The ``nvme`` device supports Single Root I/O Virtualization and Sharing
+along with Virtualization Enhancements. The controller has to be linked to
+an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.
+
+A number of parameters are present:
+
+``sriov_max_vfs`` (default: ``0``)
+  Indicates the maximum number of PCIe virtual functions supported
+  by the controller. Specifying a non-zero value enables reporting of both
+  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
+  by the NVMe device. Virtual function controllers will not report SR-IOV.
+
+``sriov_max_vi_per_vf``
+  Indicates the maximum number of virtual interrupt resources assignable
+  to a secondary controller. Must be explicitly set if ``sriov_max_vfs`` != 0.
+  The parameter affects VFs similarly to how ``msix_qsize`` affects the PF,
+  i.e., it determines the number of interrupts available to all queues (admin and I/O).
+
+``sriov_max_vq_per_vf``
+  Indicates the maximum number of virtual queue resources assignable to
+  a secondary controller. Must be explicitly set if ``sriov_max_vfs`` != 0.
+  The parameter affects VFs similarly to how ``max_ioqpairs`` affects the PF,
+  except that the number of flexible queues also includes the admin queue.
-- 
2.25.1
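
To make the parameters above concrete, an invocation along these lines
(following the style of the existing examples in nvme.rst; the IDs, the
values and the guest PCI address are placeholders) should expose an
SR-IOV capable PF:

   -device nvme-subsys,id=subsys0 \
   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=2,sriov_max_vi_per_vf=2,sriov_max_vq_per_vf=4

The guest then instantiates VFs through the standard SR-IOV sysfs knob,
e.g. "echo 2 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs", and each
secondary controller must have flexible VQ/VI resources assigned and be
set online via the Virtualization Management command (e.g. nvme-cli's
virt-mgmt) before its driver can enable it.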



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/15] pcie: Set default and supported MaxReadReq to 512
  2021-10-07 16:23 ` [PATCH 01/15] pcie: Set default and supported MaxReadReq to 512 Lukasz Maniak
@ 2021-10-07 22:12   ` Michael S. Tsirkin
  2021-10-26 14:36     ` Lukasz Maniak
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-07 22:12 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: qemu-block, Łukasz Gieryk, Knut Omang, qemu-devel, Knut Omang

On Thu, Oct 07, 2021 at 06:23:52PM +0200, Lukasz Maniak wrote:
> From: Knut Omang <knut.omang@oracle.com>
> 
> Make the default PCI Express Capability for PCIe devices set
> MaxReadReq to 512.

code says 256

> Tyipcal modern devices people would want to


typo

> emulate or simulate would want this. The previous value would
> cause warnings from the root port driver on some kernels.


which specifically?

> 
> Signed-off-by: Knut Omang <knuto@ifi.uio.no>

we can't make changes like this unconditionally, this will
break migration across versions.
Pls tie this to a machine version.

Thanks!
> ---
>  hw/pci/pcie.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> index 6e95d82903..c1a12f3744 100644
> --- a/hw/pci/pcie.c
> +++ b/hw/pci/pcie.c
> @@ -62,8 +62,9 @@ pcie_cap_v1_fill(PCIDevice *dev, uint8_t port, uint8_t type, uint8_t version)
>       * Functions conforming to the ECN, PCI Express Base
>       * Specification, Revision 1.1., or subsequent PCI Express Base
>       * Specification revisions.
> +     *  + set max payload size to 256, which seems to be a common value
>       */
> -    pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER);
> +    pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER | (0x1 & PCI_EXP_DEVCAP_PAYLOAD));
>  
>      pci_set_long(exp_cap + PCI_EXP_LNKCAP,
>                   (port << PCI_EXP_LNKCAP_PN_SHIFT) |
> @@ -179,6 +180,8 @@ int pcie_cap_init(PCIDevice *dev, uint8_t offset,
>      pci_set_long(exp_cap + PCI_EXP_DEVCAP2,
>                   PCI_EXP_DEVCAP2_EFF | PCI_EXP_DEVCAP2_EETLPP);
>  
> +    pci_set_word(exp_cap + PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_READRQ_256B);
> +
>      pci_set_word(dev->wmask + pos + PCI_EXP_DEVCTL2, PCI_EXP_DEVCTL2_EETLPPB);
>  
>      if (dev->cap_present & QEMU_PCIE_EXTCAP_INIT) {
> -- 
> 2.25.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (14 preceding siblings ...)
  2021-10-07 16:24 ` [PATCH 15/15] docs: Add documentation for SR-IOV and Virtualization Enhancements Lukasz Maniak
@ 2021-10-08  6:31 ` Klaus Jensen
  2021-10-26 18:20 ` Klaus Jensen
  16 siblings, 0 replies; 55+ messages in thread
From: Klaus Jensen @ 2021-10-08  6:31 UTC (permalink / raw)
  To: Lukasz Maniak; +Cc: kbusch, Łukasz Gieryk, qemu-devel, qemu-block


On Oct  7 18:23, Lukasz Maniak wrote:
> Hi,
> 
> This series of patches is an attempt to add support for the following
> sections of NVMe specification revision 1.4:
> 
> 8.5 Virtualization Enhancements (Optional)
>     8.5.1 VQ Resource Definition
>     8.5.2 VI Resource Definition
>     8.5.3 Secondary Controller States and Resource Configuration
>     8.5.4 Single Root I/O Virtualization and Sharing (SR-IOV)
> 
> The NVMe controller's Single Root I/O Virtualization and Sharing
> implementation is based on patches introducing SR-IOV support for PCI
> Express proposed by Knut Omang:
> https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05155.html
> 
> However, based on what I was able to find historically, Knut's patches
> have not yet been pulled into QEMU due to no example of a working device
> up to this point:
> https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg02722.html
> 
> In terms of design, the Physical Function controller and the Virtual
> Function controllers are almost independent, with few exceptions:
> PF handles flexible resource allocation for all its children (VFs have
> read-only access to this data), and reset (PF explicitly calls it on VFs).
> Since the MMIO access is serialized, no extra precautions are required
> to handle concurrent resets, as well as the secondary controller state
> access doesn't need to be atomic.
> 
> A controller with full SR-IOV support must be capable of handling the
> Namespace Management command. As there is a pending review with this
> functionality, this patch list is not duplicating efforts.
> Yet, NS management patches are not required to test the SR-IOV support.
> 
> We tested the patches on Ubuntu 20.04.3 LTS with kernel 5.4.0. We have
> hit various issues with NVMe CLI (list and virt-mgmt commands) between
> releases from version 1.09 to master, thus we chose this golden NVMe CLI
> hash for testing: a50a0c1.
> 
> The implementation is not 100% finished and certainly not bug free,
> since we are already aware of some issues e.g. interaction with
> namespaces related to AER, or unexpected (?) kernel behavior in more
> complex reset scenarios. However, our SR-IOV implementation is already
> able to support typical SR-IOV use cases, so we believe the patches are
> ready to share with the community.
> 
> Hope you find some time to review the work we did, and share your
> thoughts.
> 
> Kind regards,
> Lukasz
> 

Hi all,

This is super interesting. I was looking at Knut's patches the other day
and considered hw/nvme to be an ideal candidate as the device to
implement it. And then this shows up with perfect timing! :)

I'll need to set aside some time to go through this, but I should have
comments for you by the end of next week :)


Thanks!


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/15] pcie: Add callback preceding SR-IOV VFs update
  2021-10-07 16:23 ` [PATCH 04/15] pcie: Add callback preceding SR-IOV VFs update Lukasz Maniak
@ 2021-10-12  7:25   ` Michael S. Tsirkin
  2021-10-12 16:06     ` Lukasz Maniak
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-12  7:25 UTC (permalink / raw)
  To: Lukasz Maniak; +Cc: Łukasz Gieryk, qemu-devel, qemu-block

On Thu, Oct 07, 2021 at 06:23:55PM +0200, Lukasz Maniak wrote:
> PCIe devices implementing SR-IOV may need to perform certain actions
> before the VFs are unrealized or vice versa.
> 
> Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>

Callbacks are annoying and easy to misuse though.
VFs are enabled through a config cycle, we generally just
have devices invoke the capability handler.
E.g.

static void pci_bridge_dev_write_config(PCIDevice *d,
                                        uint32_t address, uint32_t val, int len)
{
    pci_bridge_write_config(d, address, val, len);
    if (msi_present(d)) {
        msi_write_config(d, address, val, len);
    }
}

this makes it easy to do whatever you want before/after
the write. You can also add a helper to check
that SRIOV is being enabled/disabled if necessary.

> ---
>  docs/pcie_sriov.txt         |  2 +-
>  hw/pci/pcie_sriov.c         | 14 +++++++++++++-
>  include/hw/pci/pcie_sriov.h |  8 +++++++-
>  3 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
> index f5e891e1d4..63ca1a7b8e 100644
> --- a/docs/pcie_sriov.txt
> +++ b/docs/pcie_sriov.txt
> @@ -57,7 +57,7 @@ setting up a BAR for a VF.
>        /* Add and initialize the SR/IOV capability */
>        pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
>                         vf_devid, initial_vfs, total_vfs,
> -                       fun_offset, stride);
> +                       fun_offset, stride, pre_vfs_update_cb);
>  
>        /* Set up individual VF BARs (parameters as for normal BARs) */
>        pcie_sriov_pf_init_vf_bar( ... )
> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> index 501a1ff433..cac2aee061 100644
> --- a/hw/pci/pcie_sriov.c
> +++ b/hw/pci/pcie_sriov.c
> @@ -30,7 +30,8 @@ static void unregister_vfs(PCIDevice *dev);
>  void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>                          const char *vfname, uint16_t vf_dev_id,
>                          uint16_t init_vfs, uint16_t total_vfs,
> -                        uint16_t vf_offset, uint16_t vf_stride)
> +                        uint16_t vf_offset, uint16_t vf_stride,
> +                        SriovVfsUpdate pre_vfs_update)
>  {
>      uint8_t *cfg = dev->config + offset;
>      uint8_t *wmask;
> @@ -41,6 +42,7 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>      dev->exp.sriov_pf.num_vfs = 0;
>      dev->exp.sriov_pf.vfname = g_strdup(vfname);
>      dev->exp.sriov_pf.vf = NULL;
> +    dev->exp.sriov_pf.pre_vfs_update = pre_vfs_update;
>  
>      pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
>      pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
> @@ -180,6 +182,11 @@ static void register_vfs(PCIDevice *dev)
>      assert(dev->exp.sriov_pf.vf);
>  
>      trace_sriov_register_vfs(SRIOV_ID(dev), num_vfs);
> +
> +    if (dev->exp.sriov_pf.pre_vfs_update) {
> +        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, num_vfs);
> +    }
> +
>      for (i = 0; i < num_vfs; i++) {
>          dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev->exp.sriov_pf.vfname, i);
>          if (!dev->exp.sriov_pf.vf[i]) {
> @@ -198,6 +205,11 @@ static void unregister_vfs(PCIDevice *dev)
>      uint16_t i;
>  
>      trace_sriov_unregister_vfs(SRIOV_ID(dev), num_vfs);
> +
> +    if (dev->exp.sriov_pf.pre_vfs_update) {
> +        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, 0);
> +    }
> +
>      for (i = 0; i < num_vfs; i++) {
>          PCIDevice *vf = dev->exp.sriov_pf.vf[i];
>          object_property_set_bool(OBJECT(vf), "realized", false, &local_err);
> diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> index 0974f00054..9ab48b79c0 100644
> --- a/include/hw/pci/pcie_sriov.h
> +++ b/include/hw/pci/pcie_sriov.h
> @@ -13,11 +13,16 @@
>  #ifndef QEMU_PCIE_SRIOV_H
>  #define QEMU_PCIE_SRIOV_H
>  
> +typedef void (*SriovVfsUpdate)(PCIDevice *dev, uint16_t prev_num_vfs,
> +                               uint16_t num_vfs);
> +
>  struct PCIESriovPF {
>      uint16_t num_vfs;           /* Number of virtual functions created */
>      uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each VF bar */
>      const char *vfname;         /* Reference to the device type used for the VFs */
>      PCIDevice **vf;             /* Pointer to an array of num_vfs VF devices */
> +
> +    SriovVfsUpdate pre_vfs_update;  /* Callback preceding VFs count change */
>  };
>  
>  struct PCIESriovVF {
> @@ -28,7 +33,8 @@ struct PCIESriovVF {
>  void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
>                          const char *vfname, uint16_t vf_dev_id,
>                          uint16_t init_vfs, uint16_t total_vfs,
> -                        uint16_t vf_offset, uint16_t vf_stride);
> +                        uint16_t vf_offset, uint16_t vf_stride,
> +                        SriovVfsUpdate pre_vfs_update);
>  void pcie_sriov_pf_exit(PCIDevice *dev);
>  
>  /* Set up a VF bar in the SR/IOV bar area */
> -- 
> 2.25.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/15] pcie: Add callback preceding SR-IOV VFs update
  2021-10-12  7:25   ` Michael S. Tsirkin
@ 2021-10-12 16:06     ` Lukasz Maniak
  2021-10-13  9:10       ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-12 16:06 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Łukasz Gieryk, Knut Omang, qemu-devel, qemu-block

On Tue, Oct 12, 2021 at 03:25:12AM -0400, Michael S. Tsirkin wrote:
> On Thu, Oct 07, 2021 at 06:23:55PM +0200, Lukasz Maniak wrote:
> > PCIe devices implementing SR-IOV may need to perform certain actions
> > before the VFs are unrealized or vice versa.
> > 
> > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> 
> Callbacks are annoying and easy to misuse though.
> VFs are enabled through a config cycle, we generally just
> have devices invoke the capability handler.
> E.g.
> 
> static void pci_bridge_dev_write_config(PCIDevice *d,
>                                         uint32_t address, uint32_t val, int len)
> {
>     pci_bridge_write_config(d, address, val, len);
>     if (msi_present(d)) {
>         msi_write_config(d, address, val, len);
>     }
> }
> 
> this makes it easy to do whatever you want before/after
> the write. You can also add a helper to check
> that SRIOV is being enabled/disabled if necessary.
> 
> > ---
> >  docs/pcie_sriov.txt         |  2 +-
> >  hw/pci/pcie_sriov.c         | 14 +++++++++++++-
> >  include/hw/pci/pcie_sriov.h |  8 +++++++-
> >  3 files changed, 21 insertions(+), 3 deletions(-)
> > 
> > diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
> > index f5e891e1d4..63ca1a7b8e 100644
> > --- a/docs/pcie_sriov.txt
> > +++ b/docs/pcie_sriov.txt
> > @@ -57,7 +57,7 @@ setting up a BAR for a VF.
> >        /* Add and initialize the SR/IOV capability */
> >        pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
> >                         vf_devid, initial_vfs, total_vfs,
> > -                       fun_offset, stride);
> > +                       fun_offset, stride, pre_vfs_update_cb);
> >  
> >        /* Set up individual VF BARs (parameters as for normal BARs) */
> >        pcie_sriov_pf_init_vf_bar( ... )
> > diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> > index 501a1ff433..cac2aee061 100644
> > --- a/hw/pci/pcie_sriov.c
> > +++ b/hw/pci/pcie_sriov.c
> > @@ -30,7 +30,8 @@ static void unregister_vfs(PCIDevice *dev);
> >  void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> >                          const char *vfname, uint16_t vf_dev_id,
> >                          uint16_t init_vfs, uint16_t total_vfs,
> > -                        uint16_t vf_offset, uint16_t vf_stride)
> > +                        uint16_t vf_offset, uint16_t vf_stride,
> > +                        SriovVfsUpdate pre_vfs_update)
> >  {
> >      uint8_t *cfg = dev->config + offset;
> >      uint8_t *wmask;
> > @@ -41,6 +42,7 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> >      dev->exp.sriov_pf.num_vfs = 0;
> >      dev->exp.sriov_pf.vfname = g_strdup(vfname);
> >      dev->exp.sriov_pf.vf = NULL;
> > +    dev->exp.sriov_pf.pre_vfs_update = pre_vfs_update;
> >  
> >      pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
> >      pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
> > @@ -180,6 +182,11 @@ static void register_vfs(PCIDevice *dev)
> >      assert(dev->exp.sriov_pf.vf);
> >  
> >      trace_sriov_register_vfs(SRIOV_ID(dev), num_vfs);
> > +
> > +    if (dev->exp.sriov_pf.pre_vfs_update) {
> > +        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, num_vfs);
> > +    }
> > +
> >      for (i = 0; i < num_vfs; i++) {
> >          dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev->exp.sriov_pf.vfname, i);
> >          if (!dev->exp.sriov_pf.vf[i]) {
> > @@ -198,6 +205,11 @@ static void unregister_vfs(PCIDevice *dev)
> >      uint16_t i;
> >  
> >      trace_sriov_unregister_vfs(SRIOV_ID(dev), num_vfs);
> > +
> > +    if (dev->exp.sriov_pf.pre_vfs_update) {
> > +        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, 0);
> > +    }
> > +
> >      for (i = 0; i < num_vfs; i++) {
> >          PCIDevice *vf = dev->exp.sriov_pf.vf[i];
> >          object_property_set_bool(OBJECT(vf), "realized", false, &local_err);
> > diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> > index 0974f00054..9ab48b79c0 100644
> > --- a/include/hw/pci/pcie_sriov.h
> > +++ b/include/hw/pci/pcie_sriov.h
> > @@ -13,11 +13,16 @@
> >  #ifndef QEMU_PCIE_SRIOV_H
> >  #define QEMU_PCIE_SRIOV_H
> >  
> > +typedef void (*SriovVfsUpdate)(PCIDevice *dev, uint16_t prev_num_vfs,
> > +                               uint16_t num_vfs);
> > +
> >  struct PCIESriovPF {
> >      uint16_t num_vfs;           /* Number of virtual functions created */
> >      uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each VF bar */
> >      const char *vfname;         /* Reference to the device type used for the VFs */
> >      PCIDevice **vf;             /* Pointer to an array of num_vfs VF devices */
> > +
> > +    SriovVfsUpdate pre_vfs_update;  /* Callback preceding VFs count change */
> >  };
> >  
> >  struct PCIESriovVF {
> > @@ -28,7 +33,8 @@ struct PCIESriovVF {
> >  void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> >                          const char *vfname, uint16_t vf_dev_id,
> >                          uint16_t init_vfs, uint16_t total_vfs,
> > -                        uint16_t vf_offset, uint16_t vf_stride);
> > +                        uint16_t vf_offset, uint16_t vf_stride,
> > +                        SriovVfsUpdate pre_vfs_update);
> >  void pcie_sriov_pf_exit(PCIDevice *dev);
> >  
> >  /* Set up a VF bar in the SR/IOV bar area */
> > -- 
> > 2.25.1
>

Hi Michael,

A custom config_write callback was the first approach we used.
However, once implemented, we realized it looks the same as the
pcie_sriov_config_write function. To avoid code duplication and
interfering with the internal SR-IOV structures for purposes of NVMe,
we opted for this callback prior to the VFs update.
After all, we have callbacks in both approaches, config_write and the
added pre_vfs_update, so both are prone to misuse.

But I agree it may not be a good moment yet to add a new API
specifically for SR-IOV functionality, as NVMe will be the first device
to use it.

CCing Knut, perhaps as the author of SR-IOV you have some thoughts on
how the device notification of an upcoming VFs update would be handled.

Thanks,
Lukasz


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/15] pcie: Add callback preceding SR-IOV VFs update
  2021-10-12 16:06     ` Lukasz Maniak
@ 2021-10-13  9:10       ` Michael S. Tsirkin
  2021-10-15 16:24         ` Lukasz Maniak
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-13  9:10 UTC (permalink / raw)
  To: Lukasz Maniak; +Cc: Łukasz Gieryk, Knut Omang, qemu-devel, qemu-block

On Tue, Oct 12, 2021 at 06:06:46PM +0200, Lukasz Maniak wrote:
> On Tue, Oct 12, 2021 at 03:25:12AM -0400, Michael S. Tsirkin wrote:
> > On Thu, Oct 07, 2021 at 06:23:55PM +0200, Lukasz Maniak wrote:
> > > PCIe devices implementing SR-IOV may need to perform certain actions
> > > before the VFs are unrealized or vice versa.
> > > 
> > > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> > 
> > Callbacks are annoying and easy to misuse though.
> > VFs are enabled through a config cycle, we generally just
> > have devices invoke the capability handler.
> > E.g.
> > 
> > static void pci_bridge_dev_write_config(PCIDevice *d,
> >                                         uint32_t address, uint32_t val, int len)
> > {
> >     pci_bridge_write_config(d, address, val, len);
> >     if (msi_present(d)) {
> >         msi_write_config(d, address, val, len);
> >     }
> > }
> > 
> > this makes it easy to do whatever you want before/after
> > the write. You can also add a helper to check
> > that SRIOV is being enabled/disabled if necessary.
> > 
> > > ---
> > >  docs/pcie_sriov.txt         |  2 +-
> > >  hw/pci/pcie_sriov.c         | 14 +++++++++++++-
> > >  include/hw/pci/pcie_sriov.h |  8 +++++++-
> > >  3 files changed, 21 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
> > > index f5e891e1d4..63ca1a7b8e 100644
> > > --- a/docs/pcie_sriov.txt
> > > +++ b/docs/pcie_sriov.txt
> > > @@ -57,7 +57,7 @@ setting up a BAR for a VF.
> > >        /* Add and initialize the SR/IOV capability */
> > >        pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
> > >                         vf_devid, initial_vfs, total_vfs,
> > > -                       fun_offset, stride);
> > > +                       fun_offset, stride, pre_vfs_update_cb);
> > >  
> > >        /* Set up individual VF BARs (parameters as for normal BARs) */
> > >        pcie_sriov_pf_init_vf_bar( ... )
> > > diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> > > index 501a1ff433..cac2aee061 100644
> > > --- a/hw/pci/pcie_sriov.c
> > > +++ b/hw/pci/pcie_sriov.c
> > > @@ -30,7 +30,8 @@ static void unregister_vfs(PCIDevice *dev);
> > >  void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > >                          const char *vfname, uint16_t vf_dev_id,
> > >                          uint16_t init_vfs, uint16_t total_vfs,
> > > -                        uint16_t vf_offset, uint16_t vf_stride)
> > > +                        uint16_t vf_offset, uint16_t vf_stride,
> > > +                        SriovVfsUpdate pre_vfs_update)
> > >  {
> > >      uint8_t *cfg = dev->config + offset;
> > >      uint8_t *wmask;
> > > @@ -41,6 +42,7 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > >      dev->exp.sriov_pf.num_vfs = 0;
> > >      dev->exp.sriov_pf.vfname = g_strdup(vfname);
> > >      dev->exp.sriov_pf.vf = NULL;
> > > +    dev->exp.sriov_pf.pre_vfs_update = pre_vfs_update;
> > >  
> > >      pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
> > >      pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
> > > @@ -180,6 +182,11 @@ static void register_vfs(PCIDevice *dev)
> > >      assert(dev->exp.sriov_pf.vf);
> > >  
> > >      trace_sriov_register_vfs(SRIOV_ID(dev), num_vfs);
> > > +
> > > +    if (dev->exp.sriov_pf.pre_vfs_update) {
> > > +        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, num_vfs);
> > > +    }
> > > +
> > >      for (i = 0; i < num_vfs; i++) {
> > >          dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev->exp.sriov_pf.vfname, i);
> > >          if (!dev->exp.sriov_pf.vf[i]) {
> > > @@ -198,6 +205,11 @@ static void unregister_vfs(PCIDevice *dev)
> > >      uint16_t i;
> > >  
> > >      trace_sriov_unregister_vfs(SRIOV_ID(dev), num_vfs);
> > > +
> > > +    if (dev->exp.sriov_pf.pre_vfs_update) {
> > > +        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, 0);
> > > +    }
> > > +
> > >      for (i = 0; i < num_vfs; i++) {
> > >          PCIDevice *vf = dev->exp.sriov_pf.vf[i];
> > >          object_property_set_bool(OBJECT(vf), "realized", false, &local_err);
> > > diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> > > index 0974f00054..9ab48b79c0 100644
> > > --- a/include/hw/pci/pcie_sriov.h
> > > +++ b/include/hw/pci/pcie_sriov.h
> > > @@ -13,11 +13,16 @@
> > >  #ifndef QEMU_PCIE_SRIOV_H
> > >  #define QEMU_PCIE_SRIOV_H
> > >  
> > > +typedef void (*SriovVfsUpdate)(PCIDevice *dev, uint16_t prev_num_vfs,
> > > +                               uint16_t num_vfs);
> > > +
> > >  struct PCIESriovPF {
> > >      uint16_t num_vfs;           /* Number of virtual functions created */
> > >      uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each VF bar */
> > >      const char *vfname;         /* Reference to the device type used for the VFs */
> > >      PCIDevice **vf;             /* Pointer to an array of num_vfs VF devices */
> > > +
> > > +    SriovVfsUpdate pre_vfs_update;  /* Callback preceding VFs count change */
> > >  };
> > >  
> > >  struct PCIESriovVF {
> > > @@ -28,7 +33,8 @@ struct PCIESriovVF {
> > >  void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > >                          const char *vfname, uint16_t vf_dev_id,
> > >                          uint16_t init_vfs, uint16_t total_vfs,
> > > -                        uint16_t vf_offset, uint16_t vf_stride);
> > > +                        uint16_t vf_offset, uint16_t vf_stride,
> > > +                        SriovVfsUpdate pre_vfs_update);
> > >  void pcie_sriov_pf_exit(PCIDevice *dev);
> > >  
> > >  /* Set up a VF bar in the SR/IOV bar area */
> > > -- 
> > > 2.25.1
> >
> 
> Hi Michael,
> 
> A custom config_write callback was the first approach we used.
> However, once implemented, we realized it looks the same as the
> pcie_sriov_config_write function. To avoid code duplication and
> interfering with the internal SR-IOV structures for purposes of NVMe,
> we opted for this callback prior to the VFs update.
> After all, we have callbacks in both approaches, config_write and the
> added pre_vfs_update, so both are prone to misuse.
> 
> But I agree it may not be a good moment yet to add a new API
> specifically for SR-IOV functionality, as NVMe will be the first device
> to use it.
> 
> CCing Knut, perhaps as the author of SR-IOV you have some thoughts on
> how the device notification of an upcoming VFs update would be handled.
> 
> Thanks,
> Lukasz

So just split it up?

void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
{
    uint32_t off;
    uint16_t sriov_cap = dev->exp.sriov_cap;

    if (!sriov_cap || address < sriov_cap) {
        return;
    }
    off = address - sriov_cap;
    if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
        return;
    }
    
    trace_sriov_config_write(SRIOV_ID(dev), off, val, len);
        
    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
        if (dev->exp.sriov_pf.num_vfs) {
            if (!(val & PCI_SRIOV_CTRL_VFE)) {
                unregister_vfs(dev);
            }
        } else {
            if (val & PCI_SRIOV_CTRL_VFE) {
                register_vfs(dev);
            }
        }
    }
}


Would become:

bool
 pcie_sriov_is_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
{
    uint32_t off;
    uint16_t sriov_cap = dev->exp.sriov_cap;

    if (!sriov_cap || address < sriov_cap) {
        return false;
    }
    off = address - sriov_cap;
    if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
        return false;
    }
}

bool
 pcie_sriov_do_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
{
    trace_sriov_config_write(SRIOV_ID(dev), off, val, len);
        
    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
        if (dev->exp.sriov_pf.num_vfs) {
            if (!(val & PCI_SRIOV_CTRL_VFE)) {
                unregister_vfs(dev);
            }
        } else {
            if (val & PCI_SRIOV_CTRL_VFE) {
                register_vfs(dev);
            }
        }
    }
}



void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
{
	if (pcie_sriov_is_config_write(dev, address, val, len)) {
		pcie_sriov_do_config_write(dev, address, val, len);
	}
    
}


Now  pcie_sriov_is_config_write and pcie_sriov_do_config_write
can be reused by NVME.

-- 
MST



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/15] pcie: Add callback preceding SR-IOV VFs update
  2021-10-13  9:10       ` Michael S. Tsirkin
@ 2021-10-15 16:24         ` Lukasz Maniak
  2021-10-15 17:30           ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-15 16:24 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Łukasz Gieryk, Knut Omang, qemu-devel, qemu-block

On Wed, Oct 13, 2021 at 05:10:35AM -0400, Michael S. Tsirkin wrote:
> On Tue, Oct 12, 2021 at 06:06:46PM +0200, Lukasz Maniak wrote:
> > On Tue, Oct 12, 2021 at 03:25:12AM -0400, Michael S. Tsirkin wrote:
> > > On Thu, Oct 07, 2021 at 06:23:55PM +0200, Lukasz Maniak wrote:
> > > > PCIe devices implementing SR-IOV may need to perform certain actions
> > > > before the VFs are unrealized or vice versa.
> > > > 
> > > > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> > > 
> > > Callbacks are annoying and easy to misuse though.
> > > VFs are enabled through a config cycle, we generally just
> > > have devices invoke the capability handler.
> > > E.g.
> > > 
> > > static void pci_bridge_dev_write_config(PCIDevice *d,
> > >                                         uint32_t address, uint32_t val, int len)
> > > {
> > >     pci_bridge_write_config(d, address, val, len);
> > >     if (msi_present(d)) {
> > >         msi_write_config(d, address, val, len);
> > >     }
> > > }
> > > 
> > > this makes it easy to do whatever you want before/after
> > > the write. You can also add a helper to check
> > > that SRIOV is being enabled/disabled if necessary.
> > > 
> > > > ---
> > > >  docs/pcie_sriov.txt         |  2 +-
> > > >  hw/pci/pcie_sriov.c         | 14 +++++++++++++-
> > > >  include/hw/pci/pcie_sriov.h |  8 +++++++-
> > > >  3 files changed, 21 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
> > > > index f5e891e1d4..63ca1a7b8e 100644
> > > > --- a/docs/pcie_sriov.txt
> > > > +++ b/docs/pcie_sriov.txt
> > > > @@ -57,7 +57,7 @@ setting up a BAR for a VF.
> > > >        /* Add and initialize the SR/IOV capability */
> > > >        pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
> > > >                         vf_devid, initial_vfs, total_vfs,
> > > > -                       fun_offset, stride);
> > > > +                       fun_offset, stride, pre_vfs_update_cb);
> > > >  
> > > >        /* Set up individual VF BARs (parameters as for normal BARs) */
> > > >        pcie_sriov_pf_init_vf_bar( ... )
> > > > diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> > > > index 501a1ff433..cac2aee061 100644
> > > > --- a/hw/pci/pcie_sriov.c
> > > > +++ b/hw/pci/pcie_sriov.c
> > > > @@ -30,7 +30,8 @@ static void unregister_vfs(PCIDevice *dev);
> > > >  void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > > >                          const char *vfname, uint16_t vf_dev_id,
> > > >                          uint16_t init_vfs, uint16_t total_vfs,
> > > > -                        uint16_t vf_offset, uint16_t vf_stride)
> > > > +                        uint16_t vf_offset, uint16_t vf_stride,
> > > > +                        SriovVfsUpdate pre_vfs_update)
> > > >  {
> > > >      uint8_t *cfg = dev->config + offset;
> > > >      uint8_t *wmask;
> > > > @@ -41,6 +42,7 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > > >      dev->exp.sriov_pf.num_vfs = 0;
> > > >      dev->exp.sriov_pf.vfname = g_strdup(vfname);
> > > >      dev->exp.sriov_pf.vf = NULL;
> > > > +    dev->exp.sriov_pf.pre_vfs_update = pre_vfs_update;
> > > >  
> > > >      pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
> > > >      pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
> > > > @@ -180,6 +182,11 @@ static void register_vfs(PCIDevice *dev)
> > > >      assert(dev->exp.sriov_pf.vf);
> > > >  
> > > >      trace_sriov_register_vfs(SRIOV_ID(dev), num_vfs);
> > > > +
> > > > +    if (dev->exp.sriov_pf.pre_vfs_update) {
> > > > +        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, num_vfs);
> > > > +    }
> > > > +
> > > >      for (i = 0; i < num_vfs; i++) {
> > > >          dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev->exp.sriov_pf.vfname, i);
> > > >          if (!dev->exp.sriov_pf.vf[i]) {
> > > > @@ -198,6 +205,11 @@ static void unregister_vfs(PCIDevice *dev)
> > > >      uint16_t i;
> > > >  
> > > >      trace_sriov_unregister_vfs(SRIOV_ID(dev), num_vfs);
> > > > +
> > > > +    if (dev->exp.sriov_pf.pre_vfs_update) {
> > > > +        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, 0);
> > > > +    }
> > > > +
> > > >      for (i = 0; i < num_vfs; i++) {
> > > >          PCIDevice *vf = dev->exp.sriov_pf.vf[i];
> > > >          object_property_set_bool(OBJECT(vf), "realized", false, &local_err);
> > > > diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> > > > index 0974f00054..9ab48b79c0 100644
> > > > --- a/include/hw/pci/pcie_sriov.h
> > > > +++ b/include/hw/pci/pcie_sriov.h
> > > > @@ -13,11 +13,16 @@
> > > >  #ifndef QEMU_PCIE_SRIOV_H
> > > >  #define QEMU_PCIE_SRIOV_H
> > > >  
> > > > +typedef void (*SriovVfsUpdate)(PCIDevice *dev, uint16_t prev_num_vfs,
> > > > +                               uint16_t num_vfs);
> > > > +
> > > >  struct PCIESriovPF {
> > > >      uint16_t num_vfs;           /* Number of virtual functions created */
> > > >      uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each VF bar */
> > > >      const char *vfname;         /* Reference to the device type used for the VFs */
> > > >      PCIDevice **vf;             /* Pointer to an array of num_vfs VF devices */
> > > > +
> > > > +    SriovVfsUpdate pre_vfs_update;  /* Callback preceding VFs count change */
> > > >  };
> > > >  
> > > >  struct PCIESriovVF {
> > > > @@ -28,7 +33,8 @@ struct PCIESriovVF {
> > > >  void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > > >                          const char *vfname, uint16_t vf_dev_id,
> > > >                          uint16_t init_vfs, uint16_t total_vfs,
> > > > -                        uint16_t vf_offset, uint16_t vf_stride);
> > > > +                        uint16_t vf_offset, uint16_t vf_stride,
> > > > +                        SriovVfsUpdate pre_vfs_update);
> > > >  void pcie_sriov_pf_exit(PCIDevice *dev);
> > > >  
> > > >  /* Set up a VF bar in the SR/IOV bar area */
> > > > -- 
> > > > 2.25.1
> > >
> > 
> > Hi Michael,
> > 
> > A custom config_write callback was the first approach we used.
> > However, once implemented, we realized it looks the same as the
> > pcie_sriov_config_write function. To avoid code duplication and
> > interfering with the internal SR-IOV structures for purposes of NVMe,
> > we opted for this callback prior to the VFs update.
> > After all, we have callbacks in both approaches, config_write and the
> > added pre_vfs_update, so both are prone to misuse.
> > 
> > But I agree it may not be a good moment yet to add a new API
> > specifically for SR-IOV functionality, as NVMe will be the first device
> > to use it.
> > 
> > CCing Knut, perhaps as the author of SR-IOV you have some thoughts on
> > how the device notification of an upcoming VFs update would be handled.
> > 
> > Thanks,
> > Lukasz
> 
> So just split it up?
> 
> void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> {
>     uint32_t off;
>     uint16_t sriov_cap = dev->exp.sriov_cap;
> 
>     if (!sriov_cap || address < sriov_cap) {
>         return;
>     }
>     off = address - sriov_cap;
>     if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
>         return;
>     }
>     
>     trace_sriov_config_write(SRIOV_ID(dev), off, val, len);
>         
>     if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
>         if (dev->exp.sriov_pf.num_vfs) {
>             if (!(val & PCI_SRIOV_CTRL_VFE)) {
>                 unregister_vfs(dev);
>             }
>         } else {
>             if (val & PCI_SRIOV_CTRL_VFE) {
>                 register_vfs(dev);
>             }
>         }
>     }
> }
> 
> 
> Would become:
> 
> bool
>  pcie_sriov_is_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> {
>     uint32_t off;
>     uint16_t sriov_cap = dev->exp.sriov_cap;
> 
>     if (!sriov_cap || address < sriov_cap) {
>         return false;
>     }
>     off = address - sriov_cap;
>     if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
>         return false;
>     }
> }
> 
> bool
>  pcie_sriov_do_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> {
>     trace_sriov_config_write(SRIOV_ID(dev), off, val, len);
>         
>     if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
>         if (dev->exp.sriov_pf.num_vfs) {
>             if (!(val & PCI_SRIOV_CTRL_VFE)) {
>                 unregister_vfs(dev);
>             }
>         } else {
>             if (val & PCI_SRIOV_CTRL_VFE) {
>                 register_vfs(dev);
>             }
>         }
>     }
> }
> 
> 
> 
> void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> {
> 	if (pcie_sriov_is_config_write(dev, address, val, len)) {
> 		pcie_sriov_do_config_write(dev, address, val, len);
> 	}
>     
> }
> 
> 
> Now  pcie_sriov_is_config_write and pcie_sriov_do_config_write
> can be reused by NVME.
> 
> -- 
> MST
> 

Hi Michael,

I extracted one condition to the helper, what do you think?

bool pcie_sriov_is_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
{
    uint32_t off;
    uint16_t sriov_cap = dev->exp.sriov_cap;

    if (!sriov_cap || address < sriov_cap) {
        return false;
    }
    off = address - sriov_cap;
    if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
        return false;
    }

    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
        return true;
    }

    return false;
}

static void pcie_sriov_do_config_write(PCIDevice *dev, uint32_t address,
                                       uint32_t val, int len)
{
    uint32_t off = address - dev->exp.sriov_cap;
    trace_sriov_config_write(SRIOV_ID(dev), off, val, len);

    if (dev->exp.sriov_pf.num_vfs) {
        if (!(val & PCI_SRIOV_CTRL_VFE)) {
            unregister_vfs(dev);
        }
    } else {
        if (val & PCI_SRIOV_CTRL_VFE) {
            register_vfs(dev);
        }
    }
}

void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
{
    if (pcie_sriov_is_config_write(dev, address, val, len)) {
        pcie_sriov_do_config_write(dev, address, val, len);
    }
}

--
Lukasz
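
For orientation, a sketch of how a device model could consume the helper
above from its config space handler; nvme_prepare_vf_change() is a
hypothetical hook and the overall wiring is an assumption, not part of
the series:

static void nvme_pci_write_config(PCIDevice *dev, uint32_t address,
                                  uint32_t val, int len)
{
    if (pcie_sriov_is_config_write(dev, address, val, len)) {
        /*
         * The write covers PCI_SRIOV_CTRL, so the VF count may be about
         * to change: run device-specific preparation first.
         */
        nvme_prepare_vf_change(NVME(dev));   /* hypothetical */
    }

    pci_default_write_config(dev, address, val, len);
}

The actual registration/unregistration of VFs stays in
pcie_sriov_config_write(), which keeps calling the now-static
pcie_sriov_do_config_write() exactly as before the split.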


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/15] pcie: Add callback preceding SR-IOV VFs update
  2021-10-15 16:24         ` Lukasz Maniak
@ 2021-10-15 17:30           ` Michael S. Tsirkin
  2021-10-20 13:30             ` Lukasz Maniak
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2021-10-15 17:30 UTC (permalink / raw)
  To: Lukasz Maniak; +Cc: Łukasz Gieryk, Knut Omang, qemu-devel, qemu-block

On Fri, Oct 15, 2021 at 06:24:14PM +0200, Lukasz Maniak wrote:
> On Wed, Oct 13, 2021 at 05:10:35AM -0400, Michael S. Tsirkin wrote:
> > On Tue, Oct 12, 2021 at 06:06:46PM +0200, Lukasz Maniak wrote:
> > > On Tue, Oct 12, 2021 at 03:25:12AM -0400, Michael S. Tsirkin wrote:
> > > > On Thu, Oct 07, 2021 at 06:23:55PM +0200, Lukasz Maniak wrote:
> > > > > PCIe devices implementing SR-IOV may need to perform certain actions
> > > > > before the VFs are unrealized or vice versa.
> > > > > 
> > > > > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> > > > 
> > > > Callbacks are annoying and easy to misuse though.
> > > > VFs are enabled through a config cycle, we generally just
> > > > have devices invoke the capability handler.
> > > > E.g.
> > > > 
> > > > static void pci_bridge_dev_write_config(PCIDevice *d,
> > > >                                         uint32_t address, uint32_t val, int len)
> > > > {
> > > >     pci_bridge_write_config(d, address, val, len);
> > > >     if (msi_present(d)) {
> > > >         msi_write_config(d, address, val, len);
> > > >     }
> > > > }
> > > > 
> > > > this makes it easy to do whatever you want before/after
> > > > the write. You can also add a helper to check
> > > > that SRIOV is being enabled/disabled if necessary.
> > > > 
> > > > > ---
> > > > >  docs/pcie_sriov.txt         |  2 +-
> > > > >  hw/pci/pcie_sriov.c         | 14 +++++++++++++-
> > > > >  include/hw/pci/pcie_sriov.h |  8 +++++++-
> > > > >  3 files changed, 21 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
> > > > > index f5e891e1d4..63ca1a7b8e 100644
> > > > > --- a/docs/pcie_sriov.txt
> > > > > +++ b/docs/pcie_sriov.txt
> > > > > @@ -57,7 +57,7 @@ setting up a BAR for a VF.
> > > > >        /* Add and initialize the SR/IOV capability */
> > > > >        pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
> > > > >                         vf_devid, initial_vfs, total_vfs,
> > > > > -                       fun_offset, stride);
> > > > > +                       fun_offset, stride, pre_vfs_update_cb);
> > > > >  
> > > > >        /* Set up individual VF BARs (parameters as for normal BARs) */
> > > > >        pcie_sriov_pf_init_vf_bar( ... )
> > > > > diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> > > > > index 501a1ff433..cac2aee061 100644
> > > > > --- a/hw/pci/pcie_sriov.c
> > > > > +++ b/hw/pci/pcie_sriov.c
> > > > > @@ -30,7 +30,8 @@ static void unregister_vfs(PCIDevice *dev);
> > > > >  void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > > > >                          const char *vfname, uint16_t vf_dev_id,
> > > > >                          uint16_t init_vfs, uint16_t total_vfs,
> > > > > -                        uint16_t vf_offset, uint16_t vf_stride)
> > > > > +                        uint16_t vf_offset, uint16_t vf_stride,
> > > > > +                        SriovVfsUpdate pre_vfs_update)
> > > > >  {
> > > > >      uint8_t *cfg = dev->config + offset;
> > > > >      uint8_t *wmask;
> > > > > @@ -41,6 +42,7 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > > > >      dev->exp.sriov_pf.num_vfs = 0;
> > > > >      dev->exp.sriov_pf.vfname = g_strdup(vfname);
> > > > >      dev->exp.sriov_pf.vf = NULL;
> > > > > +    dev->exp.sriov_pf.pre_vfs_update = pre_vfs_update;
> > > > >  
> > > > >      pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
> > > > >      pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
> > > > > @@ -180,6 +182,11 @@ static void register_vfs(PCIDevice *dev)
> > > > >      assert(dev->exp.sriov_pf.vf);
> > > > >  
> > > > >      trace_sriov_register_vfs(SRIOV_ID(dev), num_vfs);
> > > > > +
> > > > > +    if (dev->exp.sriov_pf.pre_vfs_update) {
> > > > > +        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, num_vfs);
> > > > > +    }
> > > > > +
> > > > >      for (i = 0; i < num_vfs; i++) {
> > > > >          dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev->exp.sriov_pf.vfname, i);
> > > > >          if (!dev->exp.sriov_pf.vf[i]) {
> > > > > @@ -198,6 +205,11 @@ static void unregister_vfs(PCIDevice *dev)
> > > > >      uint16_t i;
> > > > >  
> > > > >      trace_sriov_unregister_vfs(SRIOV_ID(dev), num_vfs);
> > > > > +
> > > > > +    if (dev->exp.sriov_pf.pre_vfs_update) {
> > > > > +        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, 0);
> > > > > +    }
> > > > > +
> > > > >      for (i = 0; i < num_vfs; i++) {
> > > > >          PCIDevice *vf = dev->exp.sriov_pf.vf[i];
> > > > >          object_property_set_bool(OBJECT(vf), "realized", false, &local_err);
> > > > > diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> > > > > index 0974f00054..9ab48b79c0 100644
> > > > > --- a/include/hw/pci/pcie_sriov.h
> > > > > +++ b/include/hw/pci/pcie_sriov.h
> > > > > @@ -13,11 +13,16 @@
> > > > >  #ifndef QEMU_PCIE_SRIOV_H
> > > > >  #define QEMU_PCIE_SRIOV_H
> > > > >  
> > > > > +typedef void (*SriovVfsUpdate)(PCIDevice *dev, uint16_t prev_num_vfs,
> > > > > +                               uint16_t num_vfs);
> > > > > +
> > > > >  struct PCIESriovPF {
> > > > >      uint16_t num_vfs;           /* Number of virtual functions created */
> > > > >      uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each VF bar */
> > > > >      const char *vfname;         /* Reference to the device type used for the VFs */
> > > > >      PCIDevice **vf;             /* Pointer to an array of num_vfs VF devices */
> > > > > +
> > > > > +    SriovVfsUpdate pre_vfs_update;  /* Callback preceding VFs count change */
> > > > >  };
> > > > >  
> > > > >  struct PCIESriovVF {
> > > > > @@ -28,7 +33,8 @@ struct PCIESriovVF {
> > > > >  void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > > > >                          const char *vfname, uint16_t vf_dev_id,
> > > > >                          uint16_t init_vfs, uint16_t total_vfs,
> > > > > -                        uint16_t vf_offset, uint16_t vf_stride);
> > > > > +                        uint16_t vf_offset, uint16_t vf_stride,
> > > > > +                        SriovVfsUpdate pre_vfs_update);
> > > > >  void pcie_sriov_pf_exit(PCIDevice *dev);
> > > > >  
> > > > >  /* Set up a VF bar in the SR/IOV bar area */
> > > > > -- 
> > > > > 2.25.1
> > > >
> > > 
> > > Hi Michael,
> > > 
> > > A custom config_write callback was the first approach we used.
> > > However, once implemented, we realized it looks the same as the
> > > pcie_sriov_config_write function. To avoid code duplication and
> > > interfering with the internal SR-IOV structures for purposes of NVMe,
> > > we opted for this callback prior to the VFs update.
> > > After all, we have callbacks in both approaches, config_write and the
> > > added pre_vfs_update, so both are prone to misuse.
> > > 
> > > But I agree it may not be a good moment yet to add a new API
> > > specifically for SR-IOV functionality, as NVMe will be the first device
> > > to use it.
> > > 
> > > CCing Knut, perhaps as the author of SR-IOV you have some thoughts on
> > > how the device notification of an upcoming VFs update would be handled.
> > > 
> > > Thanks,
> > > Lukasz
> > 
> > So just split it up?
> > 
> > void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> > {
> >     uint32_t off;
> >     uint16_t sriov_cap = dev->exp.sriov_cap;
> > 
> >     if (!sriov_cap || address < sriov_cap) {
> >         return;
> >     }
> >     off = address - sriov_cap;
> >     if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> >         return;
> >     }
> >     
> >     trace_sriov_config_write(SRIOV_ID(dev), off, val, len);
> >         
> >     if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> >         if (dev->exp.sriov_pf.num_vfs) {
> >             if (!(val & PCI_SRIOV_CTRL_VFE)) {
> >                 unregister_vfs(dev);
> >             }
> >         } else {
> >             if (val & PCI_SRIOV_CTRL_VFE) {
> >                 register_vfs(dev);
> >             }
> >         }
> >     }
> > }
> > 
> > 
> > Would become:
> > 
> > bool
> >  pcie_sriov_is_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> > {
> >     uint32_t off;
> >     uint16_t sriov_cap = dev->exp.sriov_cap;
> > 
> >     if (!sriov_cap || address < sriov_cap) {
> >         return false;
> >     }
> >     off = address - sriov_cap;
> >     if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> >         return false;
> >     }
> > }
> > 
> > bool
> >  pcie_sriov_do_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> > {
> >     trace_sriov_config_write(SRIOV_ID(dev), off, val, len);
> >         
> >     if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> >         if (dev->exp.sriov_pf.num_vfs) {
> >             if (!(val & PCI_SRIOV_CTRL_VFE)) {
> >                 unregister_vfs(dev);
> >             }
> >         } else {
> >             if (val & PCI_SRIOV_CTRL_VFE) {
> >                 register_vfs(dev);
> >             }
> >         }
> >     }
> > }
> > 
> > 
> > 
> > void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> > {
> > 	if (pcie_sriov_is_config_write(dev, address, val, len)) {
> > 		pcie_sriov_do_config_write(dev, address, val, len);
> > 	}
> >     
> > }
> > 
> > 
> > Now  pcie_sriov_is_config_write and pcie_sriov_do_config_write
> > can be reused by NVME.
> > 
> > -- 
> > MST
> > 
> 
> Hi Michael,
> 
> I extracted one condition to the helper, what do you think?
> 
> bool pcie_sriov_is_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> {
>     uint32_t off;
>     uint16_t sriov_cap = dev->exp.sriov_cap;
> 
>     if (!sriov_cap || address < sriov_cap) {
>         return false;
>     }
>     off = address - sriov_cap;
>     if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
>         return false;
>     }
> 
>     if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
>         return true;
>     }
> 
>     return false;
> }
> 
> static void pcie_sriov_do_config_write(PCIDevice *dev, uint32_t address,
>                                        uint32_t val, int len)
> {
>     uint32_t off = address - dev->exp.sriov_cap;
>     trace_sriov_config_write(SRIOV_ID(dev), off, val, len);
> 
>     if (dev->exp.sriov_pf.num_vfs) {
>         if (!(val & PCI_SRIOV_CTRL_VFE)) {
>             unregister_vfs(dev);
>         }
>     } else {
>         if (val & PCI_SRIOV_CTRL_VFE) {
>             register_vfs(dev);
>         }
>     }
> }
> 
> void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> {
>     if (pcie_sriov_is_config_write(dev, address, val, len)) {
>         pcie_sriov_do_config_write(dev, address, val, len);
>     }
> }

ok

> --
> Lukasz



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 11/15] hw/nvme: Calculate BAR atributes in a function
  2021-10-07 16:24 ` [PATCH 11/15] hw/nvme: Calculate BAR atributes in a function Lukasz Maniak
@ 2021-10-18  9:52   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 55+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-10-18  9:52 UTC (permalink / raw)
  To: Lukasz Maniak, qemu-devel
  Cc: Keith Busch, Łukasz Gieryk, qemu-block, Klaus Jensen

Hi Łukasz,

On 10/7/21 18:24, Lukasz Maniak wrote:
> From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> An Nvme device with SR-IOV capability calculates the BAR size
> differently for PF and VF, so it makes sense to extract the common code
> to a separate function.
> 
> Also: it seems the n->reg_size parameter unnecessarily splits the BAR
> size calculation in two phases; removed to simplify the code.

Preferably split this into 2 patches: the simplification first, the new
function second.
> Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> ---
>  hw/nvme/ctrl.c | 52 +++++++++++++++++++++++++++++++++-----------------
>  hw/nvme/nvme.h |  1 -
>  2 files changed, 35 insertions(+), 18 deletions(-)



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  2021-10-07 16:24 ` [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime Lukasz Maniak
@ 2021-10-18 10:06   ` Philippe Mathieu-Daudé
  2021-10-18 15:53     ` Łukasz Gieryk
  2021-10-20 19:06   ` Klaus Jensen
  2021-10-20 19:26   ` Klaus Jensen
  2 siblings, 1 reply; 55+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-10-18 10:06 UTC (permalink / raw)
  To: Lukasz Maniak, qemu-devel
  Cc: Keith Busch, Łukasz Gieryk, qemu-block, Klaus Jensen

On 10/7/21 18:24, Lukasz Maniak wrote:
> From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> The Nvme device defines two properties: max_ioqpairs, msix_qsize. Having
> them as constants is problematic for SR-IOV support.
> 
> The SR-IOV feature introduces virtual resources (queues, interrupts)
> that can be assigned to PF and its dependent VFs. Each device, following
> a reset, should work with the configured number of queues. A single
> constant is no longer sufficient to hold the whole state.
> 
> This patch tries to solve the problem by introducing additional
> variables in NvmeCtrl’s state. The variables for, e.g., managing queues
> are therefore organized as:
> 
>  - n->params.max_ioqpairs – no changes, constant set by the user.
> 
>  - n->max_ioqpairs - (new) value derived from n->params.* in realize();
>                      constant through device’s lifetime.
> 
>  - n->(mutable_state) – (not a part of this patch) user-configurable,
>                         specifies number of queues available _after_
>                         reset.
> 
>  - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’
>                       n->params.max_ioqpairs; initialized in realize()
>                       and updated during reset() to reflect user’s
>                       changes to the mutable state.
> 
> Since the number of available i/o queues and interrupts can change in
> runtime, buffers for sq/cqs and the MSIX-related structures are
> allocated big enough to handle the limits, to completely avoid the
> complicated reallocation. A helper function (nvme_update_msixcap_ts)
> updates the corresponding capability register, to signal configuration
> changes.
> 
> Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> ---
>  hw/nvme/ctrl.c | 62 +++++++++++++++++++++++++++++++++-----------------
>  hw/nvme/nvme.h |  4 ++++
>  2 files changed, 45 insertions(+), 21 deletions(-)

> @@ -6322,11 +6334,17 @@ static void nvme_init_state(NvmeCtrl *n)
>      NvmeSecCtrlEntry *sctrl;
>      int i;
>  
> +    n->max_ioqpairs = n->params.max_ioqpairs;
> +    n->conf_ioqpairs = n->max_ioqpairs;
> +
> +    n->max_msix_qsize = n->params.msix_qsize;
> +    n->conf_msix_qsize = n->max_msix_qsize;

From a developer perspective, the API becomes confusing.
Most fields from NvmeParams are exposed via QMP, such as max_ioqpairs.

I'm not sure we need 2 distinct fields. Maybe simply reorganize
to not reset these values in the DeviceReset handler?

Also, with this series we should consider implementing the migration
state (nvme_vmstate).
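
Just to sketch what I mean (reusing the field names from this patch; this
is nowhere near a complete migration description, the queue and MSI-X
state would need much more thought):

static const VMStateDescription nvme_vmstate = {
    .name = "nvme",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT32(conf_ioqpairs, NvmeCtrl),
        VMSTATE_UINT32(conf_msix_qsize, NvmeCtrl),
        VMSTATE_END_OF_LIST()
    }
};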

> diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> index 9fbb0a70b5..65383e495c 100644
> --- a/hw/nvme/nvme.h
> +++ b/hw/nvme/nvme.h
> @@ -420,6 +420,10 @@ typedef struct NvmeCtrl {
>      uint64_t    starttime_ms;
>      uint16_t    temperature;
>      uint8_t     smart_critical_warning;
> +    uint32_t    max_msix_qsize;                 /* Derived from params.msix.qsize */
> +    uint32_t    conf_msix_qsize;                /* Configured limit */
> +    uint32_t    max_ioqpairs;                   /* Derived from params.max_ioqpairs */
> +    uint32_t    conf_ioqpairs;                  /* Configured limit */
>  



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  2021-10-18 10:06   ` Philippe Mathieu-Daudé
@ 2021-10-18 15:53     ` Łukasz Gieryk
  0 siblings, 0 replies; 55+ messages in thread
From: Łukasz Gieryk @ 2021-10-18 15:53 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Keith Busch, Klaus Jensen, Lukasz Maniak, qemu-block, qemu-devel

On Mon, Oct 18, 2021 at 12:06:22PM +0200, Philippe Mathieu-Daudé wrote:
> On 10/7/21 18:24, Lukasz Maniak wrote:
> > From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > 
> > The Nvme device defines two properties: max_ioqpairs, msix_qsize. Having
> > them as constants is problematic for SR-IOV support.
> > 
> > The SR-IOV feature introduces virtual resources (queues, interrupts)
> > that can be assigned to PF and its dependent VFs. Each device, following
> > a reset, should work with the configured number of queues. A single
> > constant is no longer sufficient to hold the whole state.
> > 
> > This patch tries to solve the problem by introducing additional
> > variables in NvmeCtrl’s state. The variables for, e.g., managing queues
> > are therefore organized as:
> > 
> >  - n->params.max_ioqpairs – no changes, constant set by the user.
> > 
> >  - n->max_ioqpairs - (new) value derived from n->params.* in realize();
> >                      constant through device’s lifetime.
> > 
> >  - n->(mutable_state) – (not a part of this patch) user-configurable,
> >                         specifies number of queues available _after_
> >                         reset.
> > 
> >  - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’
> >                       n->params.max_ioqpairs; initialized in realize()
> >                       and updated during reset() to reflect user’s
> >                       changes to the mutable state.
> > 
> > Since the number of available i/o queues and interrupts can change in
> > runtime, buffers for sq/cqs and the MSIX-related structures are
> > allocated big enough to handle the limits, to completely avoid the
> > complicated reallocation. A helper function (nvme_update_msixcap_ts)
> > updates the corresponding capability register, to signal configuration
> > changes.
> > 
> > Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > ---
> >  hw/nvme/ctrl.c | 62 +++++++++++++++++++++++++++++++++-----------------
> >  hw/nvme/nvme.h |  4 ++++
> >  2 files changed, 45 insertions(+), 21 deletions(-)
> 
> > @@ -6322,11 +6334,17 @@ static void nvme_init_state(NvmeCtrl *n)
> >      NvmeSecCtrlEntry *sctrl;
> >      int i;
> >  
> > +    n->max_ioqpairs = n->params.max_ioqpairs;
> > +    n->conf_ioqpairs = n->max_ioqpairs;
> > +
> > +    n->max_msix_qsize = n->params.msix_qsize;
> > +    n->conf_msix_qsize = n->max_msix_qsize;
> 
> From a developer perspective, the API becomes confusing.
> Most fields from NvmeParams are exposed via QMP, such as max_ioqpairs.

Hi Philippe,

I’m not sure I understand your concern. NvmeParams stays as it was,
so the interaction with QMP stays unchanged. Sure, if QMP allows
updating NvmeParams at runtime (I’m guessing here, as I’m not really
familiar with that feature), then the Nvme device will no longer
respond to those changes. But n->conf_ioqpairs is not meant to be
altered via QEMU’s interfaces, but rather through the NVMe protocol, by
the guest OS kernel/user.

Could you explain how the changes are going to break (or make more
confusing) the interaction with QMP?

> I'm not sure we need 2 distinct fields. Maybe simply reorganize
> to not reset these values in the DeviceReset handler?

The idea was to calculate the max value once and use it in multiple
places later. The actual calculations are in the following 12/15 patch
(I’m also including the code below), so indeed, the intended use case
is not so obvious.

if (pci_is_vf(&n->parent_obj)) {
    n->max_ioqpairs = n->params.sriov_max_vq_per_vf - 1;
} else {
    n->max_ioqpairs = n->params.max_ioqpairs +
                      n->params.sriov_max_vfs * n->params.sriov_max_vq_per_vf;
}

But as I think more about the problem, indeed the max_* fields may not
be necessary. I could calculate max_msix_qsize in the only place it’s
used, and turn the above snippet for max_ioqpairs into a function. The
downside is that the code for calculating the maximums is no longer
grouped together.
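
Roughly, such a helper could look like this (just a sketch; the function
name is a placeholder):

static uint32_t nvme_max_ioqpairs(NvmeCtrl *n)
{
    if (pci_is_vf(&n->parent_obj)) {
        return n->params.sriov_max_vq_per_vf - 1;
    }

    return n->params.max_ioqpairs +
           n->params.sriov_max_vfs * n->params.sriov_max_vq_per_vf;
}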

> Also, with this series we should consider implementing the migration
> state (nvme_vmstate).

I wasn’t aware of this feature. I have to do the readings first.

> > diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> > index 9fbb0a70b5..65383e495c 100644
> > --- a/hw/nvme/nvme.h
> > +++ b/hw/nvme/nvme.h
> > @@ -420,6 +420,10 @@ typedef struct NvmeCtrl {
> >      uint64_t    starttime_ms;
> >      uint16_t    temperature;
> >      uint8_t     smart_critical_warning;
> > +    uint32_t    max_msix_qsize;                 /* Derived from params.msix.qsize */
> > +    uint32_t    conf_msix_qsize;                /* Configured limit */
> > +    uint32_t    max_ioqpairs;                   /* Derived from params.max_ioqpairs */
> > +    uint32_t    conf_ioqpairs;                  /* Configured limit */
> >  
> 

-- 
Regards,
Łukasz Gieryk



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/15] pcie: Add callback preceding SR-IOV VFs update
  2021-10-15 17:30           ` Michael S. Tsirkin
@ 2021-10-20 13:30             ` Lukasz Maniak
  0 siblings, 0 replies; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-20 13:30 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Łukasz Gieryk, Knut Omang, qemu-devel, qemu-block

On Fri, Oct 15, 2021 at 01:30:32PM -0400, Michael S. Tsirkin wrote:
> On Fri, Oct 15, 2021 at 06:24:14PM +0200, Lukasz Maniak wrote:
> > On Wed, Oct 13, 2021 at 05:10:35AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Oct 12, 2021 at 06:06:46PM +0200, Lukasz Maniak wrote:
> > > > On Tue, Oct 12, 2021 at 03:25:12AM -0400, Michael S. Tsirkin wrote:
> > > > > On Thu, Oct 07, 2021 at 06:23:55PM +0200, Lukasz Maniak wrote:
> > > > > > PCIe devices implementing SR-IOV may need to perform certain actions
> > > > > > before the VFs are unrealized or vice versa.
> > > > > > 
> > > > > > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> > > > > 
> > > > > Callbacks are annoying and easy to misuse though.
> > > > > VFs are enabled through a config cycle, we generally just
> > > > > have devices invoke the capability handler.
> > > > > E.g.
> > > > > 
> > > > > static void pci_bridge_dev_write_config(PCIDevice *d,
> > > > >                                         uint32_t address, uint32_t val, int len)
> > > > > {
> > > > >     pci_bridge_write_config(d, address, val, len);
> > > > >     if (msi_present(d)) {
> > > > >         msi_write_config(d, address, val, len);
> > > > >     }
> > > > > }
> > > > > 
> > > > > this makes it easy to do whatever you want before/after
> > > > > the write. You can also add a helper to check
> > > > > that SRIOV is being enabled/disabled if necessary.
> > > > > 
> > > > > > ---
> > > > > >  docs/pcie_sriov.txt         |  2 +-
> > > > > >  hw/pci/pcie_sriov.c         | 14 +++++++++++++-
> > > > > >  include/hw/pci/pcie_sriov.h |  8 +++++++-
> > > > > >  3 files changed, 21 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
> > > > > > index f5e891e1d4..63ca1a7b8e 100644
> > > > > > --- a/docs/pcie_sriov.txt
> > > > > > +++ b/docs/pcie_sriov.txt
> > > > > > @@ -57,7 +57,7 @@ setting up a BAR for a VF.
> > > > > >        /* Add and initialize the SR/IOV capability */
> > > > > >        pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
> > > > > >                         vf_devid, initial_vfs, total_vfs,
> > > > > > -                       fun_offset, stride);
> > > > > > +                       fun_offset, stride, pre_vfs_update_cb);
> > > > > >  
> > > > > >        /* Set up individual VF BARs (parameters as for normal BARs) */
> > > > > >        pcie_sriov_pf_init_vf_bar( ... )
> > > > > > diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> > > > > > index 501a1ff433..cac2aee061 100644
> > > > > > --- a/hw/pci/pcie_sriov.c
> > > > > > +++ b/hw/pci/pcie_sriov.c
> > > > > > @@ -30,7 +30,8 @@ static void unregister_vfs(PCIDevice *dev);
> > > > > >  void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > > > > >                          const char *vfname, uint16_t vf_dev_id,
> > > > > >                          uint16_t init_vfs, uint16_t total_vfs,
> > > > > > -                        uint16_t vf_offset, uint16_t vf_stride)
> > > > > > +                        uint16_t vf_offset, uint16_t vf_stride,
> > > > > > +                        SriovVfsUpdate pre_vfs_update)
> > > > > >  {
> > > > > >      uint8_t *cfg = dev->config + offset;
> > > > > >      uint8_t *wmask;
> > > > > > @@ -41,6 +42,7 @@ void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > > > > >      dev->exp.sriov_pf.num_vfs = 0;
> > > > > >      dev->exp.sriov_pf.vfname = g_strdup(vfname);
> > > > > >      dev->exp.sriov_pf.vf = NULL;
> > > > > > +    dev->exp.sriov_pf.pre_vfs_update = pre_vfs_update;
> > > > > >  
> > > > > >      pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
> > > > > >      pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
> > > > > > @@ -180,6 +182,11 @@ static void register_vfs(PCIDevice *dev)
> > > > > >      assert(dev->exp.sriov_pf.vf);
> > > > > >  
> > > > > >      trace_sriov_register_vfs(SRIOV_ID(dev), num_vfs);
> > > > > > +
> > > > > > +    if (dev->exp.sriov_pf.pre_vfs_update) {
> > > > > > +        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, num_vfs);
> > > > > > +    }
> > > > > > +
> > > > > >      for (i = 0; i < num_vfs; i++) {
> > > > > >          dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev->exp.sriov_pf.vfname, i);
> > > > > >          if (!dev->exp.sriov_pf.vf[i]) {
> > > > > > @@ -198,6 +205,11 @@ static void unregister_vfs(PCIDevice *dev)
> > > > > >      uint16_t i;
> > > > > >  
> > > > > >      trace_sriov_unregister_vfs(SRIOV_ID(dev), num_vfs);
> > > > > > +
> > > > > > +    if (dev->exp.sriov_pf.pre_vfs_update) {
> > > > > > +        dev->exp.sriov_pf.pre_vfs_update(dev, dev->exp.sriov_pf.num_vfs, 0);
> > > > > > +    }
> > > > > > +
> > > > > >      for (i = 0; i < num_vfs; i++) {
> > > > > >          PCIDevice *vf = dev->exp.sriov_pf.vf[i];
> > > > > >          object_property_set_bool(OBJECT(vf), "realized", false, &local_err);
> > > > > > diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> > > > > > index 0974f00054..9ab48b79c0 100644
> > > > > > --- a/include/hw/pci/pcie_sriov.h
> > > > > > +++ b/include/hw/pci/pcie_sriov.h
> > > > > > @@ -13,11 +13,16 @@
> > > > > >  #ifndef QEMU_PCIE_SRIOV_H
> > > > > >  #define QEMU_PCIE_SRIOV_H
> > > > > >  
> > > > > > +typedef void (*SriovVfsUpdate)(PCIDevice *dev, uint16_t prev_num_vfs,
> > > > > > +                               uint16_t num_vfs);
> > > > > > +
> > > > > >  struct PCIESriovPF {
> > > > > >      uint16_t num_vfs;           /* Number of virtual functions created */
> > > > > >      uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each VF bar */
> > > > > >      const char *vfname;         /* Reference to the device type used for the VFs */
> > > > > >      PCIDevice **vf;             /* Pointer to an array of num_vfs VF devices */
> > > > > > +
> > > > > > +    SriovVfsUpdate pre_vfs_update;  /* Callback preceding VFs count change */
> > > > > >  };
> > > > > >  
> > > > > >  struct PCIESriovVF {
> > > > > > @@ -28,7 +33,8 @@ struct PCIESriovVF {
> > > > > >  void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > > > > >                          const char *vfname, uint16_t vf_dev_id,
> > > > > >                          uint16_t init_vfs, uint16_t total_vfs,
> > > > > > -                        uint16_t vf_offset, uint16_t vf_stride);
> > > > > > +                        uint16_t vf_offset, uint16_t vf_stride,
> > > > > > +                        SriovVfsUpdate pre_vfs_update);
> > > > > >  void pcie_sriov_pf_exit(PCIDevice *dev);
> > > > > >  
> > > > > >  /* Set up a VF bar in the SR/IOV bar area */
> > > > > > -- 
> > > > > > 2.25.1
> > > > >
> > > > 
> > > > Hi Michael,
> > > > 
> > > > A custom config_write callback was the first approach we used.
> > > > However, once implemented, we realized it looks the same as the
> > > > pcie_sriov_config_write function. To avoid code duplication and
> > > > interfering with the internal SR-IOV structures for purposes of NVMe,
> > > > we opted for this callback prior to the VFs update.
> > > > After all, we have callbacks in both approaches, config_write and the
> > > > added pre_vfs_update, so both are prone to misuse.
> > > > 
> > > > But I agree it may not be a good moment yet to add a new API
> > > > specifically for SR-IOV functionality, as NVMe will be the first device
> > > > to use it.
> > > > 
> > > > CCing Knut, perhaps as the author of SR-IOV you have some thoughts on
> > > > how the device notification of an upcoming VFs update would be handled.
> > > > 
> > > > Thanks,
> > > > Lukasz
> > > 
> > > So just split it up?
> > > 
> > > void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> > > {
> > >     uint32_t off;
> > >     uint16_t sriov_cap = dev->exp.sriov_cap;
> > > 
> > >     if (!sriov_cap || address < sriov_cap) {
> > >         return;
> > >     }
> > >     off = address - sriov_cap;
> > >     if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> > >         return;
> > >     }
> > >     
> > >     trace_sriov_config_write(SRIOV_ID(dev), off, val, len);
> > >         
> > >     if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> > >         if (dev->exp.sriov_pf.num_vfs) {
> > >             if (!(val & PCI_SRIOV_CTRL_VFE)) {
> > >                 unregister_vfs(dev);
> > >             }
> > >         } else {
> > >             if (val & PCI_SRIOV_CTRL_VFE) {
> > >                 register_vfs(dev);
> > >             }
> > >         }
> > >     }
> > > }
> > > 
> > > 
> > > Would become:
> > > 
> > > bool
> > >  pcie_sriov_is_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> > > {
> > >     uint32_t off;
> > >     uint16_t sriov_cap = dev->exp.sriov_cap;
> > > 
> > >     if (!sriov_cap || address < sriov_cap) {
> > >         return false;
> > >     }
> > >     off = address - sriov_cap;
> > >     if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> > >         return false;
> > >     }
> > > }
> > > 
> > > bool
> > >  pcie_sriov_do_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> > > {
> > >     trace_sriov_config_write(SRIOV_ID(dev), off, val, len);
> > >         
> > >     if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> > >         if (dev->exp.sriov_pf.num_vfs) {
> > >             if (!(val & PCI_SRIOV_CTRL_VFE)) {
> > >                 unregister_vfs(dev);
> > >             }
> > >         } else {
> > >             if (val & PCI_SRIOV_CTRL_VFE) {
> > >                 register_vfs(dev);
> > >             }
> > >         }
> > >     }
> > > }
> > > 
> > > 
> > > 
> > > void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> > > {
> > > 	if (pcie_sriov_is_config_write(dev, address, val, len)) {
> > > 		pcie_sriov_do_config_write(dev, address, val, len);
> > > 	}
> > >     
> > > }
> > > 
> > > 
> > > Now  pcie_sriov_is_config_write and pcie_sriov_do_config_write
> > > can be reused by NVME.
> > > 
> > > -- 
> > > MST
> > > 
> > 
> > Hi Michael,
> > 
> > I extracted one condition to the helper, what do you think?
> > 
> > bool pcie_sriov_is_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> > {
> >     uint32_t off;
> >     uint16_t sriov_cap = dev->exp.sriov_cap;
> > 
> >     if (!sriov_cap || address < sriov_cap) {
> >         return false;
> >     }
> >     off = address - sriov_cap;
> >     if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> >         return false;
> >     }
> > 
> >     if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> >         return true;
> >     }
> > 
> >     return false;
> > }
> > 
> > static void pcie_sriov_do_config_write(PCIDevice *dev, uint32_t address,
> >                                        uint32_t val, int len)
> > {
> >     uint32_t off = address - dev->exp.sriov_cap;
> >     trace_sriov_config_write(SRIOV_ID(dev), off, val, len);
> > 
> >     if (dev->exp.sriov_pf.num_vfs) {
> >         if (!(val & PCI_SRIOV_CTRL_VFE)) {
> >             unregister_vfs(dev);
> >         }
> >     } else {
> >         if (val & PCI_SRIOV_CTRL_VFE) {
> >             register_vfs(dev);
> >         }
> >     }
> > }
> > 
> > void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> > {
> >     if (pcie_sriov_is_config_write(dev, address, val, len)) {
> >         pcie_sriov_do_config_write(dev, address, val, len);
> >     }
> > }
> 
> ok
> 
> > --
> > Lukasz
> 

Hi Michael,

After more investigation, we concluded that both pre_vfs_update callback
and pcie_sriov_config_write split are not required to perform needed
device-side actions prior to the SR-IOV state change.

Hence, we decided to drop this patch for v2.
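
For completeness, the direction we are leaning towards is a plain
device-level config_write wrapper, roughly along these lines (only a
sketch, assuming the SR-IOV handling is reached via
pci_default_write_config as in this series; names are not final):

static void nvme_pci_write_config(PCIDevice *dev, uint32_t address,
                                  uint32_t val, int len)
{
    uint16_t old_num_vfs = pci_is_vf(dev) ? 0 : dev->exp.sriov_pf.num_vfs;

    pci_default_write_config(dev, address, val, len);

    if (!pci_is_vf(dev) && old_num_vfs != dev->exp.sriov_pf.num_vfs) {
        /* device-side reaction to VFs being enabled/disabled goes here */
    }
}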

Kind regards,
Lukasz


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  2021-10-07 16:24 ` [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime Lukasz Maniak
  2021-10-18 10:06   ` Philippe Mathieu-Daudé
@ 2021-10-20 19:06   ` Klaus Jensen
  2021-10-21 13:40     ` Łukasz Gieryk
  2021-10-20 19:26   ` Klaus Jensen
  2 siblings, 1 reply; 55+ messages in thread
From: Klaus Jensen @ 2021-10-20 19:06 UTC (permalink / raw)
  To: Lukasz Maniak; +Cc: Keith Busch, Łukasz Gieryk, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 2105 bytes --]

On Oct  7 18:24, Lukasz Maniak wrote:
> From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> The Nvme device defines two properties: max_ioqpairs, msix_qsize. Having
> them as constants is problematic for SR-IOV support.
> 
> The SR-IOV feature introduces virtual resources (queues, interrupts)
> that can be assigned to PF and its dependent VFs. Each device, following
> a reset, should work with the configured number of queues. A single
> constant is no longer sufficient to hold the whole state.
> 
> This patch tries to solve the problem by introducing additional
> variables in NvmeCtrl’s state. The variables for, e.g., managing queues
> are therefore organized as:
> 
>  - n->params.max_ioqpairs – no changes, constant set by the user.
> 
>  - n->max_ioqpairs - (new) value derived from n->params.* in realize();
>                      constant through device’s lifetime.
> 
>  - n->(mutable_state) – (not a part of this patch) user-configurable,
>                         specifies number of queues available _after_
>                         reset.
> 
>  - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’
>                       n->params.max_ioqpairs; initialized in realize()
>                       and updated during reset() to reflect user’s
>                       changes to the mutable state.
> 
> Since the number of available i/o queues and interrupts can change in
> runtime, buffers for sq/cqs and the MSIX-related structures are
> allocated big enough to handle the limits, to completely avoid the
> complicated reallocation. A helper function (nvme_update_msixcap_ts)
> updates the corresponding capability register, to signal configuration
> changes.
> 
> Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

Instead of this, how about adding new parameters, say, sriov_vi_private
and sriov_vq_private? Then, max_ioqpairs and msix_qsize are still the
"physical" limits and the new parameters just reserve some for the
primary controller, the rest being available for flexible resources.
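
I.e., roughly (names and widths only indicative):

    DEFINE_PROP_UINT16("sriov_vq_private", NvmeCtrl, params.sriov_vq_private, 0),
    DEFINE_PROP_UINT16("sriov_vi_private", NvmeCtrl, params.sriov_vi_private, 0),

The flexible pool would then simply be max_ioqpairs - sriov_vq_private
queue pairs and msix_qsize - sriov_vi_private interrupt vectors.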

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 05/15] hw/nvme: Add support for SR-IOV
  2021-10-07 16:23 ` [PATCH 05/15] hw/nvme: Add support for SR-IOV Lukasz Maniak
@ 2021-10-20 19:07   ` Klaus Jensen
  2021-10-21 14:33     ` Lukasz Maniak
  2021-11-02 14:33   ` Klaus Jensen
  1 sibling, 1 reply; 55+ messages in thread
From: Klaus Jensen @ 2021-10-20 19:07 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk, qemu-devel,
	Keith Busch

[-- Attachment #1: Type: text/plain, Size: 4473 bytes --]

On Oct  7 18:23, Lukasz Maniak wrote:
> This patch implements initial support for Single Root I/O Virtualization
> on an NVMe device.
> 
> Essentially, it allows to define the maximum number of virtual functions
> supported by the NVMe controller via sriov_max_vfs parameter.
> 
> Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV
> capability by a physical controller and ARI capability by both the
> physical and virtual function devices.
> 
> NVMe controllers created via virtual functions mirror functionally
> the physical controller, which may not entirely be the case, thus
> consideration would be needed on the way to limit the capabilities of
> the VF.
> 
> NVMe subsystem is required for the use of SR-IOV.
> 
> Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> ---
>  hw/nvme/ctrl.c           | 74 ++++++++++++++++++++++++++++++++++++++--
>  hw/nvme/nvme.h           |  1 +
>  include/hw/pci/pci_ids.h |  1 +
>  3 files changed, 73 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 6a571d18cf..ad79ff0c00 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -35,6 +35,7 @@
>   *              mdts=<N[optional]>,vsl=<N[optional]>, \
>   *              zoned.zasl=<N[optional]>, \
>   *              zoned.auto_transition=<on|off[optional]>, \
> + *              sriov_max_vfs=<N[optional]> \
>   *              subsys=<subsys_id>
>   *      -device nvme-ns,drive=<drive_id>,bus=<bus_name>,nsid=<nsid>,\
>   *              zoned=<true|false[optional]>, \
> @@ -106,6 +107,12 @@
>   *   transitioned to zone state closed for resource management purposes.
>   *   Defaults to 'on'.
>   *
> + * - `sriov_max_vfs`
> + *   Indicates the maximum number of PCIe virtual functions supported
> + *   by the controller. The default value is 0. Specifying a non-zero value
> + *   enables reporting of both SR-IOV and ARI capabilities by the NVMe device.
> + *   Virtual function controllers will not report SR-IOV capability.
> + *
>   * nvme namespace device parameters
>   * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>   * - `shared`
> @@ -160,6 +167,7 @@
>  #include "sysemu/block-backend.h"
>  #include "sysemu/hostmem.h"
>  #include "hw/pci/msix.h"
> +#include "hw/pci/pcie_sriov.h"
>  #include "migration/vmstate.h"
>  
>  #include "nvme.h"
> @@ -175,6 +183,9 @@
>  #define NVME_TEMPERATURE_CRITICAL 0x175
>  #define NVME_NUM_FW_SLOTS 1
>  #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
> +#define NVME_MAX_VFS 127
> +#define NVME_VF_OFFSET 0x1
> +#define NVME_VF_STRIDE 1
>  
>  #define NVME_GUEST_ERR(trace, fmt, ...) \
>      do { \
> @@ -5583,6 +5594,10 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
>          g_free(event);
>      }
>  
> +    if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
> +        pcie_sriov_pf_disable_vfs(&n->parent_obj);
> +    }
> +
>      n->aer_queued = 0;
>      n->outstanding_aers = 0;
>      n->qs_created = false;
> @@ -6264,6 +6279,19 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
>          error_setg(errp, "vsl must be non-zero");
>          return;
>      }
> +
> +    if (params->sriov_max_vfs) {
> +        if (!n->subsys) {
> +            error_setg(errp, "subsystem is required for the use of SR-IOV");
> +            return;
> +        }
> +
> +        if (params->sriov_max_vfs > NVME_MAX_VFS) {
> +            error_setg(errp, "sriov_max_vfs must be between 0 and %d",
> +                       NVME_MAX_VFS);
> +            return;
> +        }
> +    }
>  }
>  
>  static void nvme_init_state(NvmeCtrl *n)
> @@ -6321,6 +6349,20 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
>      memory_region_set_enabled(&n->pmr.dev->mr, false);
>  }
>  
> +static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
> +                            uint64_t bar_size)
> +{
> +    uint16_t vf_dev_id = n->params.use_intel_id ?
> +                         PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME;
> +
> +    pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
> +                       n->params.sriov_max_vfs, n->params.sriov_max_vfs,
> +                       NVME_VF_OFFSET, NVME_VF_STRIDE, NULL);

Did you consider adding a new device for the virtual function device,
"nvmevf"?

Down the road it might help with the variations in capabilities that you
describe.
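
The QOM boilerplate could be as small as something like this (a sketch of
the idea only; nvmevf_class_init is a made-up hook where the VF-specific
capability trimming would live):

static void nvmevf_class_init(ObjectClass *oc, void *data)
{
    /* hypothetical: restrict the capabilities reported by VF controllers */
}

static const TypeInfo nvmevf_info = {
    .name          = "nvmevf",
    .parent        = TYPE_NVME,
    .instance_size = sizeof(NvmeCtrl),
    .class_init    = nvmevf_class_init,
};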

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  2021-10-07 16:24 ` [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime Lukasz Maniak
  2021-10-18 10:06   ` Philippe Mathieu-Daudé
  2021-10-20 19:06   ` Klaus Jensen
@ 2021-10-20 19:26   ` Klaus Jensen
  2 siblings, 0 replies; 55+ messages in thread
From: Klaus Jensen @ 2021-10-20 19:26 UTC (permalink / raw)
  To: Lukasz Maniak; +Cc: Keith Busch, Łukasz Gieryk, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 247 bytes --]

On Oct  7 18:24, Lukasz Maniak wrote:
> +static void nvme_update_msixcap_ts(PCIDevice *pci_dev, uint32_t table_size)
> +{
> +    uint8_t *config;
> +
> +    assert(pci_dev->msix_cap);

Not all platforms support msix, so an assert() is not right.
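
An early return would be safer, e.g. something like:

static void nvme_update_msixcap_ts(PCIDevice *pci_dev, uint32_t table_size)
{
    uint8_t *config;

    if (!msix_present(pci_dev)) {
        return;
    }

    /* ... rest of the function unchanged ... */
}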


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  2021-10-20 19:06   ` Klaus Jensen
@ 2021-10-21 13:40     ` Łukasz Gieryk
  2021-11-03 12:11       ` Klaus Jensen
  0 siblings, 1 reply; 55+ messages in thread
From: Łukasz Gieryk @ 2021-10-21 13:40 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: Keith Busch, Lukasz Maniak, qemu-block, qemu-devel

On Wed, Oct 20, 2021 at 09:06:06PM +0200, Klaus Jensen wrote:
> On Oct  7 18:24, Lukasz Maniak wrote:
> > From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > 
> > The Nvme device defines two properties: max_ioqpairs, msix_qsize. Having
> > them as constants is problematic for SR-IOV support.
> > 
> > The SR-IOV feature introduces virtual resources (queues, interrupts)
> > that can be assigned to PF and its dependent VFs. Each device, following
> > a reset, should work with the configured number of queues. A single
> > constant is no longer sufficient to hold the whole state.
> > 
> > This patch tries to solve the problem by introducing additional
> > variables in NvmeCtrl’s state. The variables for, e.g., managing queues
> > are therefore organized as:
> > 
> >  - n->params.max_ioqpairs – no changes, constant set by the user.
> > 
> >  - n->max_ioqpairs - (new) value derived from n->params.* in realize();
> >                      constant through device’s lifetime.
> > 
> >  - n->(mutable_state) – (not a part of this patch) user-configurable,
> >                         specifies number of queues available _after_
> >                         reset.
> > 
> >  - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’
> >                       n->params.max_ioqpairs; initialized in realize()
> >                       and updated during reset() to reflect user’s
> >                       changes to the mutable state.
> > 
> > Since the number of available i/o queues and interrupts can change in
> > runtime, buffers for sq/cqs and the MSIX-related structures are
> > allocated big enough to handle the limits, to completely avoid the
> > complicated reallocation. A helper function (nvme_update_msixcap_ts)
> > updates the corresponding capability register, to signal configuration
> > changes.
> > 
> > Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> Instead of this, how about adding new parameters, say, sriov_vi_private
> and sriov_vq_private? Then, max_ioqpairs and msix_qsize are still the
> "physical" limits and the new parameters just reserve some for the
> primary controller, the rest being available for flexible resources.

Compare your configuration:

    max_ioqpairs     = 26
    sriov_max_vfs    = 4
    sriov_vq_private = 10

with mine:

    max_ioqpairs        = 10
    sriov_max_vfs       = 4
    sriov_max_vq_per_vf = 4

In your version, if I wanted to change max_vfs but keep the same number
of flexible resources per VF, I would have to do some math and update
max_ioqpairs. I would then also have to adjust the other
interrupt-related parameter, as it is affected as well. In my opinion
that is quite inconvenient.
 
Now, even if I changed the semantics of the params, I would still need
most of this patch. (Let’s keep the discussion about whether the max_*
fields are necessary in the other thread.)

Without virtualization, the maximum number of queues is constant. The
user (i.e., the nvme kernel driver) can only query this value (e.g., 10)
and has to stay within this limit.

With virtualization, the flexible resources kick in. Let's continue with
the sample numbers defined earlier (10 private + 16 flexible resources).

1) The device boots, all 16 flexible queues are assigned to the primary
   controller.
2) Nvme kernel driver queries for the limit (10+16=26) and can create/use
   up to this many queues. 
3) User via the virtualization management command unbinds some (let's
   say 2) of the flexible queues from the primary controller and assigns
   them to a secondary controller.
4) After reset, the Physical Function device reports a different limit
   (24), and when the Virtual Function device shows up, it reports 1
   (the admin queue consumed the other resource).

So I need an additional variable in the state to store the intermediate
limit (24 or 1), as none of the existing params holds the correct value,
and all the places that validate the limits must work with that value.
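
To spell out the arithmetic for the example above:

    step 2: PF limit = 10 private + 16 flexible       = 26 I/O queue pairs
    step 3: 2 flexible resources are moved to a secondary controller
    step 4: PF limit = 10 + (16 - 2)                  = 24 I/O queue pairs
            VF limit = 2 - 1 (admin queue uses one)   = 1 I/O queue pair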



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 05/15] hw/nvme: Add support for SR-IOV
  2021-10-20 19:07   ` Klaus Jensen
@ 2021-10-21 14:33     ` Lukasz Maniak
  0 siblings, 0 replies; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-21 14:33 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk, qemu-devel,
	Keith Busch

On Wed, Oct 20, 2021 at 09:07:47PM +0200, Klaus Jensen wrote:
> On Oct  7 18:23, Lukasz Maniak wrote:
> > This patch implements initial support for Single Root I/O Virtualization
> > on an NVMe device.
> > 
> > Essentially, it allows to define the maximum number of virtual functions
> > supported by the NVMe controller via sriov_max_vfs parameter.
> > 
> > Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV
> > capability by a physical controller and ARI capability by both the
> > physical and virtual function devices.
> > 
> > NVMe controllers created via virtual functions mirror functionally
> > the physical controller, which may not entirely be the case, thus
> > consideration would be needed on the way to limit the capabilities of
> > the VF.
> > 
> > NVMe subsystem is required for the use of SR-IOV.
> > 
> > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> > ---
> >  hw/nvme/ctrl.c           | 74 ++++++++++++++++++++++++++++++++++++++--
> >  hw/nvme/nvme.h           |  1 +
> >  include/hw/pci/pci_ids.h |  1 +
> >  3 files changed, 73 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > index 6a571d18cf..ad79ff0c00 100644
> > --- a/hw/nvme/ctrl.c
> > +++ b/hw/nvme/ctrl.c
> > @@ -35,6 +35,7 @@
> >   *              mdts=<N[optional]>,vsl=<N[optional]>, \
> >   *              zoned.zasl=<N[optional]>, \
> >   *              zoned.auto_transition=<on|off[optional]>, \
> > + *              sriov_max_vfs=<N[optional]> \
> >   *              subsys=<subsys_id>
> >   *      -device nvme-ns,drive=<drive_id>,bus=<bus_name>,nsid=<nsid>,\
> >   *              zoned=<true|false[optional]>, \
> > @@ -106,6 +107,12 @@
> >   *   transitioned to zone state closed for resource management purposes.
> >   *   Defaults to 'on'.
> >   *
> > + * - `sriov_max_vfs`
> > + *   Indicates the maximum number of PCIe virtual functions supported
> > + *   by the controller. The default value is 0. Specifying a non-zero value
> > + *   enables reporting of both SR-IOV and ARI capabilities by the NVMe device.
> > + *   Virtual function controllers will not report SR-IOV capability.
> > + *
> >   * nvme namespace device parameters
> >   * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >   * - `shared`
> > @@ -160,6 +167,7 @@
> >  #include "sysemu/block-backend.h"
> >  #include "sysemu/hostmem.h"
> >  #include "hw/pci/msix.h"
> > +#include "hw/pci/pcie_sriov.h"
> >  #include "migration/vmstate.h"
> >  
> >  #include "nvme.h"
> > @@ -175,6 +183,9 @@
> >  #define NVME_TEMPERATURE_CRITICAL 0x175
> >  #define NVME_NUM_FW_SLOTS 1
> >  #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
> > +#define NVME_MAX_VFS 127
> > +#define NVME_VF_OFFSET 0x1
> > +#define NVME_VF_STRIDE 1
> >  
> >  #define NVME_GUEST_ERR(trace, fmt, ...) \
> >      do { \
> > @@ -5583,6 +5594,10 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
> >          g_free(event);
> >      }
> >  
> > +    if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
> > +        pcie_sriov_pf_disable_vfs(&n->parent_obj);
> > +    }
> > +
> >      n->aer_queued = 0;
> >      n->outstanding_aers = 0;
> >      n->qs_created = false;
> > @@ -6264,6 +6279,19 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
> >          error_setg(errp, "vsl must be non-zero");
> >          return;
> >      }
> > +
> > +    if (params->sriov_max_vfs) {
> > +        if (!n->subsys) {
> > +            error_setg(errp, "subsystem is required for the use of SR-IOV");
> > +            return;
> > +        }
> > +
> > +        if (params->sriov_max_vfs > NVME_MAX_VFS) {
> > +            error_setg(errp, "sriov_max_vfs must be between 0 and %d",
> > +                       NVME_MAX_VFS);
> > +            return;
> > +        }
> > +    }
> >  }
> >  
> >  static void nvme_init_state(NvmeCtrl *n)
> > @@ -6321,6 +6349,20 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
> >      memory_region_set_enabled(&n->pmr.dev->mr, false);
> >  }
> >  
> > +static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
> > +                            uint64_t bar_size)
> > +{
> > +    uint16_t vf_dev_id = n->params.use_intel_id ?
> > +                         PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME;
> > +
> > +    pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
> > +                       n->params.sriov_max_vfs, n->params.sriov_max_vfs,
> > +                       NVME_VF_OFFSET, NVME_VF_STRIDE, NULL);
> 
> Did you consider adding a new device for the virtual function device,
> "nvmevf"?
> 
> Down the road it might help with the variations in capabilities that you
> describe.

Hi Klaus,

A separate nvmevf device was actually the first approach I tried.
In the end, it came down to copying the nvme device functions just for
a few changes that can be covered with conditions.

As for limiting VF capabilities, the problem comes down to a modest
restriction of the command set supported by the VF controller. Thus,
using nvmevf for this purpose sounds like overkill.

Concerning the restriction of the supported command set, an actual real
device would reduce the VF's ability to use namespace attachment,
namespace management, virtualization enhancements, and the
corresponding identify commands. However, since implementing secure
virtualization in QEMU would be complex and is not required, it can be
skipped for now.

Kind regards,
Lukasz


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/15] pcie: Set default and supported MaxReadReq to 512
  2021-10-07 22:12   ` Michael S. Tsirkin
@ 2021-10-26 14:36     ` Lukasz Maniak
  2021-10-26 15:37       ` Knut Omang
  0 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-26 14:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-block, Łukasz Gieryk, Knut Omang, qemu-devel, Knut Omang

On Thu, Oct 07, 2021 at 06:12:41PM -0400, Michael S. Tsirkin wrote:
> On Thu, Oct 07, 2021 at 06:23:52PM +0200, Lukasz Maniak wrote:
> > From: Knut Omang <knut.omang@oracle.com>
> > 
> > Make the default PCI Express Capability for PCIe devices set
> > MaxReadReq to 512.
> 
> code says 256
> 
> > Tyipcal modern devices people would want to
> 
> 
> typo
> 
> > emulate or simulate would want this. The previous value would
> > cause warnings from the root port driver on some kernels.
> 
> 
> which specifically?
> 
> > 
> > Signed-off-by: Knut Omang <knuto@ifi.uio.no>
> 
> we can't make changes like this unconditionally, this will
> break migration across versions.
> Pls tie this to a machine version.
> 
> Thanks!
> > ---
> >  hw/pci/pcie.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index 6e95d82903..c1a12f3744 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -62,8 +62,9 @@ pcie_cap_v1_fill(PCIDevice *dev, uint8_t port, uint8_t type, uint8_t version)
> >       * Functions conforming to the ECN, PCI Express Base
> >       * Specification, Revision 1.1., or subsequent PCI Express Base
> >       * Specification revisions.
> > +     *  + set max payload size to 256, which seems to be a common value
> >       */
> > -    pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER);
> > +    pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER | (0x1 & PCI_EXP_DEVCAP_PAYLOAD));
> >  
> >      pci_set_long(exp_cap + PCI_EXP_LNKCAP,
> >                   (port << PCI_EXP_LNKCAP_PN_SHIFT) |
> > @@ -179,6 +180,8 @@ int pcie_cap_init(PCIDevice *dev, uint8_t offset,
> >      pci_set_long(exp_cap + PCI_EXP_DEVCAP2,
> >                   PCI_EXP_DEVCAP2_EFF | PCI_EXP_DEVCAP2_EETLPP);
> >  
> > +    pci_set_word(exp_cap + PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_READRQ_256B);
> > +
> >      pci_set_word(dev->wmask + pos + PCI_EXP_DEVCTL2, PCI_EXP_DEVCTL2_EETLPPB);
> >  
> >      if (dev->cap_present & QEMU_PCIE_EXTCAP_INIT) {
> > -- 
> > 2.25.1
> 

Hi Michael,

The reason Knut keeps rebasing this fix along with the SR-IOV patch is
not clear to us.

Since we tested the NVMe device without this fix on kernel 5.4.0 and
did not notice any of the issues mentioned by Knut, we decided to drop
it for v2.

However, I have posted your comments on this patch to Knut's GitHub so
they can be addressed in case Knut decides to resubmit it later.

Thanks,
Lukasz


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/15] pcie: Set default and supported MaxReadReq to 512
  2021-10-26 14:36     ` Lukasz Maniak
@ 2021-10-26 15:37       ` Knut Omang
  0 siblings, 0 replies; 55+ messages in thread
From: Knut Omang @ 2021-10-26 15:37 UTC (permalink / raw)
  To: Lukasz Maniak, Michael S. Tsirkin
  Cc: Łukasz Gieryk, Knut Omang, qemu-devel, qemu-block

On Tue, 2021-10-26 at 16:36 +0200, Lukasz Maniak wrote:
> On Thu, Oct 07, 2021 at 06:12:41PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Oct 07, 2021 at 06:23:52PM +0200, Lukasz Maniak wrote:
> > > From: Knut Omang <knut.omang@oracle.com>
> > > 
> > > Make the default PCI Express Capability for PCIe devices set
> > > MaxReadReq to 512.
> > 
> > code says 256
> > 
> > > Tyipcal modern devices people would want to
> > 
> > 
> > typo
> > 
> > > emulate or simulate would want this. The previous value would
> > > cause warnings from the root port driver on some kernels.
> > 
> > 
> > which specifically?
> > 
> > > 
> > > Signed-off-by: Knut Omang <knuto@ifi.uio.no>
> > 
> > we can't make changes like this unconditionally, this will
> > break migration across versions.
> > Pls tie this to a machine version.
> > 
> > Thanks!
> > > ---
> > >   hw/pci/pcie.c | 5 ++++-
> > >   1 file changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > > index 6e95d82903..c1a12f3744 100644
> > > --- a/hw/pci/pcie.c
> > > +++ b/hw/pci/pcie.c
> > > @@ -62,8 +62,9 @@ pcie_cap_v1_fill(PCIDevice *dev, uint8_t port, uint8_t type, uint8_t
> > > version)
> > >        * Functions conforming to the ECN, PCI Express Base
> > >        * Specification, Revision 1.1., or subsequent PCI Express Base
> > >        * Specification revisions.
> > > +     *  + set max payload size to 256, which seems to be a common value
> > >        */
> > > -    pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER);
> > > +    pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER | (0x1 &
> > > PCI_EXP_DEVCAP_PAYLOAD));
> > >   
> > >       pci_set_long(exp_cap + PCI_EXP_LNKCAP,
> > >                    (port << PCI_EXP_LNKCAP_PN_SHIFT) |
> > > @@ -179,6 +180,8 @@ int pcie_cap_init(PCIDevice *dev, uint8_t offset,
> > >       pci_set_long(exp_cap + PCI_EXP_DEVCAP2,
> > >                    PCI_EXP_DEVCAP2_EFF | PCI_EXP_DEVCAP2_EETLPP);
> > >   
> > > +    pci_set_word(exp_cap + PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_READRQ_256B);
> > > +
> > >       pci_set_word(dev->wmask + pos + PCI_EXP_DEVCTL2, PCI_EXP_DEVCTL2_EETLPPB);
> > >   
> > >       if (dev->cap_present & QEMU_PCIE_EXTCAP_INIT) {
> > > -- 
> > > 2.25.1
> > 
> 
> Hi Michael,
> 
> The reason Knut keeps rebasing this fix along with the SR-IOV patch is
> not clear to us.

Sorry for the slow response - I seem to have messed up my mail filters so this
thread slipped past my attention.

> Since we tested the NVMe device without this fix on kernel 5.4.0 and
> did not notice any of the issues mentioned by Knut, we decided to drop
> it for v2.

I agree, let's just drop it.

It was most likely with the 3.x kernels I had to work with back then,
probably discovered on Oracle Linux, given that I did not point to a
specific kernel range at the time.

> However, I have posted your comments on this patch to Knut's GitHub so
> they can be addressed in case Knut decides to resubmit it later.

Thanks for that ping, Lukasz, and great to see the patch finally being used in a
functional device!

Knut

> Thanks,
> Lukasz




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] pcie: Add helpers to the SR/IOV API
  2021-10-07 16:24 ` [PATCH 13/15] pcie: Add helpers to the SR/IOV API Lukasz Maniak
@ 2021-10-26 16:57   ` Knut Omang
  0 siblings, 0 replies; 55+ messages in thread
From: Knut Omang @ 2021-10-26 16:57 UTC (permalink / raw)
  To: Lukasz Maniak, qemu-devel
  Cc: Łukasz Gieryk, qemu-block, Michael S. Tsirkin

On Thu, 2021-10-07 at 18:24 +0200, Lukasz Maniak wrote:
> From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> Two convenience functions for retrieving:
>  - the total number of VFs,
>  - the PCIDevice object of the N-th VF.
> 
> Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> ---
>  hw/pci/pcie_sriov.c         | 14 ++++++++++++++
>  include/hw/pci/pcie_sriov.h |  8 ++++++++
>  2 files changed, 22 insertions(+)
> 
> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> index cac2aee061..5a8e92d5ab 100644
> --- a/hw/pci/pcie_sriov.c
> +++ b/hw/pci/pcie_sriov.c
> @@ -292,8 +292,22 @@ uint16_t pcie_sriov_vf_number(PCIDevice *dev)
>      return dev->exp.sriov_vf.vf_number;
>  }
>  
> +uint16_t pcie_sriov_vf_number_total(PCIDevice *dev)
> +{
> +    assert(!pci_is_vf(dev));
> +    return dev->exp.sriov_pf.num_vfs;
> +}
>  
>  PCIDevice *pcie_sriov_get_pf(PCIDevice *dev)
>  {
>      return dev->exp.sriov_vf.pf;
>  }
> +
> +PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n)
> +{
> +    assert(!pci_is_vf(dev));
> +    if (n < dev->exp.sriov_pf.num_vfs) {
> +        return dev->exp.sriov_pf.vf[n];
> +    }
> +    return NULL;
> +}
> diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> index 9ab48b79c0..d1f39b7223 100644
> --- a/include/hw/pci/pcie_sriov.h
> +++ b/include/hw/pci/pcie_sriov.h
> @@ -65,9 +65,17 @@ void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
>  /* Get logical VF number of a VF - only valid for VFs */
>  uint16_t pcie_sriov_vf_number(PCIDevice *dev);
>  
> +/* Get the total number of VFs - only valid for PF */
> +uint16_t pcie_sriov_vf_number_total(PCIDevice *dev);
> +
>  /* Get the physical function that owns this VF.
>   * Returns NULL if dev is not a virtual function
>   */
>  PCIDevice *pcie_sriov_get_pf(PCIDevice *dev);
>  
> +/* Get the n-th VF of this physical function - only valid for PF.
> + * Returns NULL if index is invalid
> + */
> +PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n);
> +
>  #endif /* QEMU_PCIE_SRIOV_H */


These look like natural improvements to me, thanks!
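
E.g., iterating over the currently enabled VFs on the PF side becomes
straightforward (illustrative snippet only, assuming an NvmeCtrl *n for
the PF):

    PCIDevice *pf = &n->parent_obj;
    uint16_t i;

    for (i = 0; i < pcie_sriov_vf_number_total(pf); i++) {
        PCIDevice *vf = pcie_sriov_get_vf_at_index(pf, i);
        if (vf) {
            /* e.g. propagate a controller reset to this VF */
        }
    }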

Reviewed-by: Knut Omang <knuto@ifi.uio.no>




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
  2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (15 preceding siblings ...)
  2021-10-08  6:31 ` [PATCH 00/15] hw/nvme: SR-IOV with " Klaus Jensen
@ 2021-10-26 18:20 ` Klaus Jensen
  2021-10-27 16:49   ` Lukasz Maniak
  16 siblings, 1 reply; 55+ messages in thread
From: Klaus Jensen @ 2021-10-26 18:20 UTC (permalink / raw)
  To: Lukasz Maniak; +Cc: Łukasz Gieryk, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 2798 bytes --]

On Oct  7 18:23, Lukasz Maniak wrote:
> Hi,
> 
> This series of patches is an attempt to add support for the following
> sections of NVMe specification revision 1.4:
> 
> 8.5 Virtualization Enhancements (Optional)
>     8.5.1 VQ Resource Definition
>     8.5.2 VI Resource Definition
>     8.5.3 Secondary Controller States and Resource Configuration
>     8.5.4 Single Root I/O Virtualization and Sharing (SR-IOV)
> 
> The NVMe controller's Single Root I/O Virtualization and Sharing
> implementation is based on patches introducing SR-IOV support for PCI
> Express proposed by Knut Omang:
> https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05155.html
> 
> However, based on what I was able to find historically, Knut's patches
> have not yet been pulled into QEMU due to no example of a working device
> up to this point:
> https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg02722.html
> 
> In terms of design, the Physical Function controller and the Virtual
> Function controllers are almost independent, with few exceptions:
> PF handles flexible resource allocation for all its children (VFs have
> read-only access to this data), and reset (PF explicitly calls it on VFs).
> Since the MMIO access is serialized, no extra precautions are required
> to handle concurrent resets, as well as the secondary controller state
> access doesn't need to be atomic.
> 
> A controller with full SR-IOV support must be capable of handling the
> Namespace Management command. As there is a pending review with this
> functionality, this patch list is not duplicating efforts.
> Yet, NS management patches are not required to test the SR-IOV support.
> 
> We tested the patches on Ubuntu 20.04.3 LTS with kernel 5.4.0. We have
> hit various issues with NVMe CLI (list and virt-mgmt commands) between
> releases from version 1.09 to master, thus we chose this golden NVMe CLI
> hash for testing: a50a0c1.
> 
> The implementation is not 100% finished and certainly not bug free,
> since we are already aware of some issues e.g. interaction with
> namespaces related to AER, or unexpected (?) kernel behavior in more
> complex reset scenarios. However, our SR-IOV implementation is already
> able to support typical SR-IOV use cases, so we believe the patches are
> ready to share with the community.
> 
> Hope you find some time to review the work we did, and share your
> thoughts.
> 
> Kind regards,
> Lukasz

Hi Lukasz,

If possible, can you share a brief guide on testing this? I keep hitting
an assert

  qemu-system-x86_64: ../hw/pci/pci.c:1215: pci_register_bar: Assertion `!pci_is_vf(pci_dev)' failed.

when I try to modify sriov_numvfs. This should be fixed anyway, but I
might be doing something wrong in the first place.
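
For reference, I am simply toggling it through sysfs from inside the
guest, i.e. something like (the BDF is of course specific to my setup):

    echo 0 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
    echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs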

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
  2021-10-26 18:20 ` Klaus Jensen
@ 2021-10-27 16:49   ` Lukasz Maniak
  2021-11-02  7:24     ` Klaus Jensen
  0 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-10-27 16:49 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: Łukasz Gieryk, qemu-devel, qemu-block

On Tue, Oct 26, 2021 at 08:20:12PM +0200, Klaus Jensen wrote:
> On Oct  7 18:23, Lukasz Maniak wrote:
> > Hi,
> > 
> > This series of patches is an attempt to add support for the following
> > sections of NVMe specification revision 1.4:
> > 
> > 8.5 Virtualization Enhancements (Optional)
> >     8.5.1 VQ Resource Definition
> >     8.5.2 VI Resource Definition
> >     8.5.3 Secondary Controller States and Resource Configuration
> >     8.5.4 Single Root I/O Virtualization and Sharing (SR-IOV)
> > 
> > The NVMe controller's Single Root I/O Virtualization and Sharing
> > implementation is based on patches introducing SR-IOV support for PCI
> > Express proposed by Knut Omang:
> > https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05155.html
> > 
> > However, based on what I was able to find historically, Knut's patches
> > have not yet been pulled into QEMU due to no example of a working device
> > up to this point:
> > https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg02722.html
> > 
> > In terms of design, the Physical Function controller and the Virtual
> > Function controllers are almost independent, with few exceptions:
> > PF handles flexible resource allocation for all its children (VFs have
> > read-only access to this data), and reset (PF explicitly calls it on VFs).
> > Since the MMIO access is serialized, no extra precautions are required
> > to handle concurrent resets, as well as the secondary controller state
> > access doesn't need to be atomic.
> > 
> > A controller with full SR-IOV support must be capable of handling the
> > Namespace Management command. As there is a pending review with this
> > functionality, this patch list is not duplicating efforts.
> > Yet, NS management patches are not required to test the SR-IOV support.
> > 
> > We tested the patches on Ubuntu 20.04.3 LTS with kernel 5.4.0. We have
> > hit various issues with NVMe CLI (list and virt-mgmt commands) between
> > releases from version 1.09 to master, thus we chose this golden NVMe CLI
> > hash for testing: a50a0c1.
> > 
> > The implementation is not 100% finished and certainly not bug free,
> > since we are already aware of some issues e.g. interaction with
> > namespaces related to AER, or unexpected (?) kernel behavior in more
> > complex reset scenarios. However, our SR-IOV implementation is already
> > able to support typical SR-IOV use cases, so we believe the patches are
> > ready to share with the community.
> > 
> > Hope you find some time to review the work we did, and share your
> > thoughts.
> > 
> > Kind regards,
> > Lukasz
> 
> Hi Lukasz,
> 
> If possible, can you share a brief guide on testing this? I keep hitting
> an assert
> 
>   qemu-system-x86_64: ../hw/pci/pci.c:1215: pci_register_bar: Assertion `!pci_is_vf(pci_dev)' failed.
> 
> when I try to modify sriov_numvfs. This should be fixed anyway, but I
> might be doing something wrong in the first place.

Hi Klaus,

Let me share all the details of the steps I took to run 7 fully
functional VF controllers; I was not able to reproduce the assert.

I rebased the v1 patches onto the current master (931ce30859) to rule
out any recent regression.

I configured the build as follows:
./configure \
--target-list=x86_64-softmmu \
--enable-kvm

Then I launched QEMU using these options:
./qemu-system-x86_64 \
-m 4096 \
-smp 4 \
-drive file=qemu-os.qcow2 \
-nographic \
-enable-kvm \
-machine q35 \
-device pcie-root-port,slot=0,id=rp0 \
-device nvme-subsys,id=subsys0 \
-device nvme,serial=1234,id=nvme0,subsys=subsys0,bus=rp0,sriov_max_vfs=127,sriov_max_vq_per_vf=2,sriov_max_vi_per_vf=1

Next, I issued the commands below as root to configure the VFs:
nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0
nvme reset /dev/nvme0
echo 1 > /sys/bus/pci/rescan

nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0
nvme virt-mgmt /dev/nvme0 -c 2 -r 1 -a 8 -n 1
nvme virt-mgmt /dev/nvme0 -c 2 -r 0 -a 8 -n 2
nvme virt-mgmt /dev/nvme0 -c 2 -r 0 -a 9 -n 0
nvme virt-mgmt /dev/nvme0 -c 3 -r 1 -a 8 -n 1
nvme virt-mgmt /dev/nvme0 -c 3 -r 0 -a 8 -n 2
nvme virt-mgmt /dev/nvme0 -c 3 -r 0 -a 9 -n 0
nvme virt-mgmt /dev/nvme0 -c 4 -r 1 -a 8 -n 1
nvme virt-mgmt /dev/nvme0 -c 4 -r 0 -a 8 -n 2
nvme virt-mgmt /dev/nvme0 -c 4 -r 0 -a 9 -n 0
nvme virt-mgmt /dev/nvme0 -c 5 -r 1 -a 8 -n 1
nvme virt-mgmt /dev/nvme0 -c 5 -r 0 -a 8 -n 2
nvme virt-mgmt /dev/nvme0 -c 5 -r 0 -a 9 -n 0
nvme virt-mgmt /dev/nvme0 -c 6 -r 1 -a 8 -n 1
nvme virt-mgmt /dev/nvme0 -c 6 -r 0 -a 8 -n 2
nvme virt-mgmt /dev/nvme0 -c 6 -r 0 -a 9 -n 0
nvme virt-mgmt /dev/nvme0 -c 7 -r 1 -a 8 -n 1
nvme virt-mgmt /dev/nvme0 -c 7 -r 0 -a 8 -n 2
nvme virt-mgmt /dev/nvme0 -c 7 -r 0 -a 9 -n 0

echo 7 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

If you apply the series only up to patch 05 inclusive, then this single
command should do the whole job:
echo 7 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

The OS I used for the host and guest was Ubuntu 20.04.3 LTS.

Can you share the full call stack for the assert, or the configuration
you are trying to run?

Thanks,
Lukasz


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
  2021-10-27 16:49   ` Lukasz Maniak
@ 2021-11-02  7:24     ` Klaus Jensen
  0 siblings, 0 replies; 55+ messages in thread
From: Klaus Jensen @ 2021-11-02  7:24 UTC (permalink / raw)
  To: Lukasz Maniak; +Cc: Łukasz Gieryk, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 5903 bytes --]

On Oct 27 18:49, Lukasz Maniak wrote:
> On Tue, Oct 26, 2021 at 08:20:12PM +0200, Klaus Jensen wrote:
> > On Oct  7 18:23, Lukasz Maniak wrote:
> > > Hi,
> > > 
> > > This series of patches is an attempt to add support for the following
> > > sections of NVMe specification revision 1.4:
> > > 
> > > 8.5 Virtualization Enhancements (Optional)
> > >     8.5.1 VQ Resource Definition
> > >     8.5.2 VI Resource Definition
> > >     8.5.3 Secondary Controller States and Resource Configuration
> > >     8.5.4 Single Root I/O Virtualization and Sharing (SR-IOV)
> > > 
> > > The NVMe controller's Single Root I/O Virtualization and Sharing
> > > implementation is based on patches introducing SR-IOV support for PCI
> > > Express proposed by Knut Omang:
> > > https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05155.html
> > > 
> > > However, based on what I was able to find historically, Knut's patches
> > > have not yet been pulled into QEMU due to no example of a working device
> > > up to this point:
> > > https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg02722.html
> > > 
> > > In terms of design, the Physical Function controller and the Virtual
> > > Function controllers are almost independent, with few exceptions:
> > > PF handles flexible resource allocation for all its children (VFs have
> > > read-only access to this data), and reset (PF explicitly calls it on VFs).
> > > Since the MMIO access is serialized, no extra precautions are required
> > > to handle concurrent resets, as well as the secondary controller state
> > > access doesn't need to be atomic.
> > > 
> > > A controller with full SR-IOV support must be capable of handling the
> > > Namespace Management command. As there is a pending review with this
> > > functionality, this patch list is not duplicating efforts.
> > > Yet, NS management patches are not required to test the SR-IOV support.
> > > 
> > > We tested the patches on Ubuntu 20.04.3 LTS with kernel 5.4.0. We have
> > > hit various issues with NVMe CLI (list and virt-mgmt commands) between
> > > releases from version 1.09 to master, thus we chose this golden NVMe CLI
> > > hash for testing: a50a0c1.
> > > 
> > > The implementation is not 100% finished and certainly not bug free,
> > > since we are already aware of some issues e.g. interaction with
> > > namespaces related to AER, or unexpected (?) kernel behavior in more
> > > complex reset scenarios. However, our SR-IOV implementation is already
> > > able to support typical SR-IOV use cases, so we believe the patches are
> > > ready to share with the community.
> > > 
> > > Hope you find some time to review the work we did, and share your
> > > thoughts.
> > > 
> > > Kind regards,
> > > Lukasz
> > 
> > Hi Lukasz,
> > 
> > If possible, can you share a brief guide on testing this? I keep hitting
> > an assert
> > 
> >   qemu-system-x86_64: ../hw/pci/pci.c:1215: pci_register_bar: Assertion `!pci_is_vf(pci_dev)' failed.
> > 
> > when I try to modify sriov_numvfs. This should be fixed anyway, but I
> > might be doing something wrong in the first place.
> 
> Hi Klaus,
> 
> Let me share all the details about the steps I did to run 7 fully
> functional VF controllers and failed to reproduce the assert.
> 
> I rebased v1 patches to eliminate any recent regression onto the current
> master 931ce30859.
> 
> I configured build as follows:
> ./configure \
> --target-list=x86_64-softmmu \
> --enable-kvm
> 
> Then I launched QEMU using these options:
> ./qemu-system-x86_64 \
> -m 4096 \
> -smp 4 \
> -drive file=qemu-os.qcow2 \
> -nographic \
> -enable-kvm \
> -machine q35 \
> -device pcie-root-port,slot=0,id=rp0 \
> -device nvme-subsys,id=subsys0 \
> -device nvme,serial=1234,id=nvme0,subsys=subsys0,bus=rp0,sriov_max_vfs=127,sriov_max_vq_per_vf=2,sriov_max_vi_per_vf=1
> 
> Next, I issued below commands as root to configure VFs:
> nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
> nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0
> nvme reset /dev/nvme0
> echo 1 > /sys/bus/pci/rescan
> 
> nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
> nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
> nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0
> nvme virt-mgmt /dev/nvme0 -c 2 -r 1 -a 8 -n 1
> nvme virt-mgmt /dev/nvme0 -c 2 -r 0 -a 8 -n 2
> nvme virt-mgmt /dev/nvme0 -c 2 -r 0 -a 9 -n 0
> nvme virt-mgmt /dev/nvme0 -c 3 -r 1 -a 8 -n 1
> nvme virt-mgmt /dev/nvme0 -c 3 -r 0 -a 8 -n 2
> nvme virt-mgmt /dev/nvme0 -c 3 -r 0 -a 9 -n 0
> nvme virt-mgmt /dev/nvme0 -c 4 -r 1 -a 8 -n 1
> nvme virt-mgmt /dev/nvme0 -c 4 -r 0 -a 8 -n 2
> nvme virt-mgmt /dev/nvme0 -c 4 -r 0 -a 9 -n 0
> nvme virt-mgmt /dev/nvme0 -c 5 -r 1 -a 8 -n 1
> nvme virt-mgmt /dev/nvme0 -c 5 -r 0 -a 8 -n 2
> nvme virt-mgmt /dev/nvme0 -c 5 -r 0 -a 9 -n 0
> nvme virt-mgmt /dev/nvme0 -c 6 -r 1 -a 8 -n 1
> nvme virt-mgmt /dev/nvme0 -c 6 -r 0 -a 8 -n 2
> nvme virt-mgmt /dev/nvme0 -c 6 -r 0 -a 9 -n 0
> nvme virt-mgmt /dev/nvme0 -c 7 -r 1 -a 8 -n 1
> nvme virt-mgmt /dev/nvme0 -c 7 -r 0 -a 8 -n 2
> nvme virt-mgmt /dev/nvme0 -c 7 -r 0 -a 9 -n 0
> 
> echo 7 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
> 
> If you use only up to patch 05 inclusive then this command should do all
> the job:
> echo 7 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
> 
> The OS I used for the host and guest was Ubuntu 20.04.3 LTS.
> 
> Can you share more call stack for assert or the configuration you are
> trying to run?
> 
> Thanks,
> Lukasz
> 

Hi Lukasz,

Thanks, this all works for me and in general it all looks pretty good to
me. I don't have any big reservations about this series (the hw/nvme
parts).

However, the assert.

I did follow the right procedure, but if the device has a CMB, then
changing sriov_numvfs asserts QEMU, i.e. with `cmb_size_mb=64` added to
the controller parameters.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 05/15] hw/nvme: Add support for SR-IOV
  2021-10-07 16:23 ` [PATCH 05/15] hw/nvme: Add support for SR-IOV Lukasz Maniak
  2021-10-20 19:07   ` Klaus Jensen
@ 2021-11-02 14:33   ` Klaus Jensen
  2021-11-02 17:33     ` Lukasz Maniak
  1 sibling, 1 reply; 55+ messages in thread
From: Klaus Jensen @ 2021-11-02 14:33 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk, qemu-devel,
	Keith Busch

[-- Attachment #1: Type: text/plain, Size: 1926 bytes --]

On Oct  7 18:23, Lukasz Maniak wrote:
> This patch implements initial support for Single Root I/O Virtualization
> on an NVMe device.
> 
> Essentially, it allows to define the maximum number of virtual functions
> supported by the NVMe controller via sriov_max_vfs parameter.
> 
> Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV
> capability by a physical controller and ARI capability by both the
> physical and virtual function devices.
> 
> NVMe controllers created via virtual functions mirror functionally
> the physical controller, which may not entirely be the case, thus
> consideration would be needed on the way to limit the capabilities of
> the VF.
> 
> NVMe subsystem is required for the use of SR-IOV.
> 
> Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> ---
>  hw/nvme/ctrl.c           | 74 ++++++++++++++++++++++++++++++++++++++--
>  hw/nvme/nvme.h           |  1 +
>  include/hw/pci/pci_ids.h |  1 +
>  3 files changed, 73 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 6a571d18cf..ad79ff0c00 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -6361,8 +6406,12 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
>                            n->reg_size);
>      memory_region_add_subregion(&n->bar0, 0, &n->iomem);
>  
> -    pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> -                     PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
> +    if (pci_is_vf(pci_dev)) {
> +        pcie_sriov_vf_register_bar(pci_dev, 0, &n->bar0);
> +    } else {
> +        pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> +                         PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
> +    }

I assume that the assert we are seeing means that the pci_register_bars
in nvme_init_cmb and nvme_init_pmr must be changed similarly to this.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 06/15] hw/nvme: Add support for Primary Controller Capabilities
  2021-10-07 16:23 ` [PATCH 06/15] hw/nvme: Add support for Primary Controller Capabilities Lukasz Maniak
@ 2021-11-02 14:34   ` Klaus Jensen
  0 siblings, 0 replies; 55+ messages in thread
From: Klaus Jensen @ 2021-11-02 14:34 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Łukasz Gieryk,
	qemu-devel, Hanna Reitz, Stefan Hajnoczi, Keith Busch,
	Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 388 bytes --]

On Oct  7 18:23, Lukasz Maniak wrote:
> Implementation of Primary Controller Capabilities data
> structure (Identify command with CNS value of 14h).
> 
> Currently, the command returns only the ID of the primary controller.
> Handling of the remaining fields is added in subsequent patches
> implementing virtualization enhancements.
> 

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/15] hw/nvme: Add support for Secondary Controller List
  2021-10-07 16:23 ` [PATCH 07/15] hw/nvme: Add support for Secondary Controller List Lukasz Maniak
@ 2021-11-02 14:35   ` Klaus Jensen
  0 siblings, 0 replies; 55+ messages in thread
From: Klaus Jensen @ 2021-11-02 14:35 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Łukasz Gieryk,
	qemu-devel, Hanna Reitz, Stefan Hajnoczi, Keith Busch,
	Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 817 bytes --]

On Oct  7 18:23, Lukasz Maniak wrote:
> Introduce handling for Secondary Controller List (Identify command with
> CNS value of 15h).
> 
> Secondary controller IDs are unique in the subsystem; hence they are
> reserved by it, up to the number of sriov_max_vfs, upon initialization
> of the primary controller.
> 
> ID reservation requires the addition of an intermediate controller slot
> state, so the reserved controller has the address 0xFFFF.
> A secondary controller is in the reserved state when it has no virtual
> function assigned, but its primary controller is realized.
> Secondary controller reservations are released to NULL when their
> primary controller is unregistered.
> 

If I understood correctly, you want to change the callback thing in v2,
so I'll wait for v2 with my review on this.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 09/15] hw/nvme: Implement the Function Level Reset
  2021-10-07 16:24 ` [PATCH 09/15] hw/nvme: Implement the Function Level Reset Lukasz Maniak
@ 2021-11-02 14:35   ` Klaus Jensen
  0 siblings, 0 replies; 55+ messages in thread
From: Klaus Jensen @ 2021-11-02 14:35 UTC (permalink / raw)
  To: Lukasz Maniak; +Cc: Keith Busch, Łukasz Gieryk, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 1471 bytes --]

On Oct  7 18:24, Lukasz Maniak wrote:
> From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> This patch implements the FLR, a feature currently not implemented for
> the Nvme device even though it is listed as mandatory ("shall") in the
> 1.4 spec.
> 
> The implementation reuses FLR-related building blocks defined for the
> pci-bridge module, and follows the same logic:
>     - FLR capability is advertised in the PCIE config,
>     - custom pci_write_config callback detects a write to the trigger
>       register and performs the PCI reset,
>     - which, eventually, calls the custom dc->reset handler.
> 
> Depending on reset type, parts of the state should (or should not) be
> cleared. To distinguish the type of reset, an additional parameter is
> passed to the reset function.
> 
> This patch also enables advertisement of the Power Management PCI
> capability. The main reason behind it is to announce the no_soft_reset=1
> bit, to signal SR/IOV support where each VF can be reset individually.
> 
> The implementation purposely ignores writes to the PMCS.PS register,
> as even such naïve behavior is enough to correctly handle the D3->D0
> transition.
> 
> It’s worth noting that the power state transition back to D3, with
> all the corresponding side effects, wasn't and still isn't handled
> properly.
> 
> Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 05/15] hw/nvme: Add support for SR-IOV
  2021-11-02 14:33   ` Klaus Jensen
@ 2021-11-02 17:33     ` Lukasz Maniak
  2021-11-04 14:30       ` Lukasz Maniak
  0 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-11-02 17:33 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk, qemu-devel,
	Keith Busch

On Tue, Nov 02, 2021 at 03:33:15PM +0100, Klaus Jensen wrote:
> On Oct  7 18:23, Lukasz Maniak wrote:
> > This patch implements initial support for Single Root I/O Virtualization
> > on an NVMe device.
> > 
> > Essentially, it allows to define the maximum number of virtual functions
> > supported by the NVMe controller via sriov_max_vfs parameter.
> > 
> > Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV
> > capability by a physical controller and ARI capability by both the
> > physical and virtual function devices.
> > 
> > NVMe controllers created via virtual functions mirror functionally
> > the physical controller, which may not entirely be the case, thus
> > consideration would be needed on the way to limit the capabilities of
> > the VF.
> > 
> > NVMe subsystem is required for the use of SR-IOV.
> > 
> > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> > ---
> >  hw/nvme/ctrl.c           | 74 ++++++++++++++++++++++++++++++++++++++--
> >  hw/nvme/nvme.h           |  1 +
> >  include/hw/pci/pci_ids.h |  1 +
> >  3 files changed, 73 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > index 6a571d18cf..ad79ff0c00 100644
> > --- a/hw/nvme/ctrl.c
> > +++ b/hw/nvme/ctrl.c
> > @@ -6361,8 +6406,12 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
> >                            n->reg_size);
> >      memory_region_add_subregion(&n->bar0, 0, &n->iomem);
> >  
> > -    pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> > -                     PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
> > +    if (pci_is_vf(pci_dev)) {
> > +        pcie_sriov_vf_register_bar(pci_dev, 0, &n->bar0);
> > +    } else {
> > +        pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> > +                         PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
> > +    }
> 
> I assume that the assert we are seeing means that the pci_register_bars
> in nvme_init_cmb and nvme_init_pmr must be changed similarly to this.

The assert will only arise for CMB, as the VF params are initialized
with the PF params.

@@ -6532,6 +6585,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     NvmeCtrl *n = NVME(pci_dev);
     NvmeNamespace *ns;
     Error *local_err = NULL;
+    NvmeCtrl *pn = NVME(pcie_sriov_get_pf(pci_dev));
+
+    if (pci_is_vf(pci_dev)) {
+        /* VFs derive settings from the parent. PF's lifespan exceeds
+         * that of VF's, so it's safe to share params.serial.
+         */
+        memcpy(&n->params, &pn->params, sizeof(NvmeParams));
+        n->subsys = pn->subsys;
+    }
 
     nvme_check_constraints(n, &local_err);
     if (local_err) {

The following simple fix both fixes the assert and allows each VF to
have its own CMB of the size defined for the PF.

---
 hw/nvme/ctrl.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 19b32dd4da..99daa6290c 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6837,10 +6837,15 @@ static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
     n->cmb.buf = g_malloc0(cmb_size);
     memory_region_init_io(&n->cmb.mem, OBJECT(n), &nvme_cmb_ops, n,
                           "nvme-cmb", cmb_size);
-    pci_register_bar(pci_dev, NVME_CMB_BIR,
-                     PCI_BASE_ADDRESS_SPACE_MEMORY |
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 |
-                     PCI_BASE_ADDRESS_MEM_PREFETCH, &n->cmb.mem);
+
+    if (pci_is_vf(pci_dev)) {
+        pcie_sriov_vf_register_bar(pci_dev, NVME_CMB_BIR, &n->cmb.mem);
+    } else {
+        pci_register_bar(pci_dev, NVME_CMB_BIR,
+                        PCI_BASE_ADDRESS_SPACE_MEMORY |
+                        PCI_BASE_ADDRESS_MEM_TYPE_64 |
+                        PCI_BASE_ADDRESS_MEM_PREFETCH, &n->cmb.mem);
+    }
 
     NVME_CAP_SET_CMBS(cap, 1);
     stq_le_p(&n->bar.cap, cap);

As for PMR, it is currently only available on the PF, since only the PF
is capable of specifying the memory-backend-file object to use with PMR.
Otherwise, either the VFs would have to share the PMR with their PF, or
a memory-backend-file object would have to be defined for each VF.


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers
  2021-10-07 16:24 ` [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers Lukasz Maniak
@ 2021-11-03 12:07   ` Klaus Jensen
  2021-11-04 15:48     ` Łukasz Gieryk
  0 siblings, 1 reply; 55+ messages in thread
From: Klaus Jensen @ 2021-11-03 12:07 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Łukasz Gieryk,
	qemu-devel, Hanna Reitz, Stefan Hajnoczi, Keith Busch,
	Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 1935 bytes --]

On Oct  7 18:24, Lukasz Maniak wrote:
> From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) one
> can configure the maximum number of virtual queues and interrupts
> assignable to a single virtual device. The primary and secondary
> controller capability structures are initialized accordingly.
> 
> Since the number of available queues (interrupts) now varies between
> VF/PF, BAR size calculation is also adjusted.
> 

While this patch allows configuring the VQFRSM and VIFRSM fields, it
implicitly sets VQFRT and VIFRT (i.e. by setting them to the product of
sriov_max_vi_per_vf and max_vfs). That just sets them to an upper bound,
which removes a testable case for host software (e.g. requesting more
flexible resources than are currently available).

This patch also requires that these parameters are set if sriov_max_vfs
is. I think we can provide better defaults.

How about,

1. if only sriov_max_vfs is set, then all VFs get private resources
   equal to max_ioqpairs. Like before this patch. This limits the number
   of parameters required to get a basic setup going.

2. if sriov_v{q,i}_private is set (I suggested this parameter in patch
   10), the difference between that and max_ioqpairs becomes flexible
   resources. Also, I'd be just as fine with having sriov_v{q,i}_flexible
   instead and making the difference become the private resources.
   Potato/potato.

   a. in the absence of sriov_max_v{q,i}_per_vf, set them to the number
      of calculated flexible resources.

This probably smells a bit like bikeshedding, but I think this gives
more flexibility and better defaults, which helps with verifying host
software.
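
To make the defaults concrete, here is a rough sketch of the derivation
I have in mind (standalone C just for illustration; the parameter names
sriov_vq_private and sriov_max_vq_per_vf are suggestions only and do not
exist in the current code):

#include <stdint.h>

struct sriov_params {
    uint32_t max_ioqpairs;        /* the "physical" limit, as today */
    uint32_t sriov_max_vfs;
    uint32_t sriov_vq_private;    /* 0 means "not set" */
    uint32_t sriov_max_vq_per_vf; /* 0 means "not set" */
};

void derive_vq_defaults(struct sriov_params *p,
                        uint32_t *vq_private, uint32_t *vq_flexible)
{
    if (!p->sriov_vq_private) {
        /* 1. only sriov_max_vfs is set: each controller keeps
         *    max_ioqpairs private resources, no flexible pool */
        *vq_private = p->max_ioqpairs;
        *vq_flexible = 0;
    } else {
        /* 2. reserve sriov_vq_private for the PF; the remainder of
         *    max_ioqpairs becomes the flexible pool */
        *vq_private = p->sriov_vq_private;
        *vq_flexible = p->max_ioqpairs - p->sriov_vq_private;
    }

    if (!p->sriov_max_vq_per_vf) {
        /* 2a. default the per-VF maximum to the flexible pool size */
        p->sriov_max_vq_per_vf = *vq_flexible;
    }
}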

If we can't agree on this now, I suggest we could go ahead and merge the
base functionality (i.e. private resources only) and ruminate some more
about these parameters.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  2021-10-21 13:40     ` Łukasz Gieryk
@ 2021-11-03 12:11       ` Klaus Jensen
  0 siblings, 0 replies; 55+ messages in thread
From: Klaus Jensen @ 2021-11-03 12:11 UTC (permalink / raw)
  To: Łukasz Gieryk; +Cc: Keith Busch, Lukasz Maniak, qemu-block, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4806 bytes --]

On Oct 21 15:40, Łukasz Gieryk wrote:
> On Wed, Oct 20, 2021 at 09:06:06PM +0200, Klaus Jensen wrote:
> > On Oct  7 18:24, Lukasz Maniak wrote:
> > > From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > > 
> > > The Nvme device defines two properties: max_ioqpairs, msix_qsize. Having
> > > them as constants is problematic for SR-IOV support.
> > > 
> > > The SR-IOV feature introduces virtual resources (queues, interrupts)
> > > that can be assigned to PF and its dependent VFs. Each device, following
> > > a reset, should work with the configured number of queues. A single
> > > constant is no longer sufficient to hold the whole state.
> > > 
> > > This patch tries to solve the problem by introducing additional
> > > variables in NvmeCtrl’s state. The variables for, e.g., managing queues
> > > are therefore organized as:
> > > 
> > >  - n->params.max_ioqpairs – no changes, constant set by the user.
> > > 
> > >  - n->max_ioqpairs - (new) value derived from n->params.* in realize();
> > >                      constant through device’s lifetime.
> > > 
> > >  - n->(mutable_state) – (not a part of this patch) user-configurable,
> > >                         specifies number of queues available _after_
> > >                         reset.
> > > 
> > >  - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’
> > >                       n->params.max_ioqpairs; initialized in realize()
> > >                       and updated during reset() to reflect user’s
> > >                       changes to the mutable state.
> > > 
> > > Since the number of available i/o queues and interrupts can change in
> > > runtime, buffers for sq/cqs and the MSIX-related structures are
> > > allocated big enough to handle the limits, to completely avoid the
> > > complicated reallocation. A helper function (nvme_update_msixcap_ts)
> > > updates the corresponding capability register, to signal configuration
> > > changes.
> > > 
> > > Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > 
> > Instead of this, how about adding new parameters, say, sriov_vi_private
> > and sriov_vq_private. Then, max_ioqpairs and msix_qsize are still the
> > "physical" limits and the new parameters just reserve some for the
> > primary controller, the rest being available for flexsible resources.
> 
> Compare your configuration:
> 
>     max_ioqpairs     = 26
>     sriov_max_vfs    = 4
>     sriov_vq_private = 10
> 
> with mine:
> 
>     max_ioqpairs        = 10
>     sriov_max_vfs       = 4
>     sriov_max_vq_per_vf = 4
> 
> In your version, if I wanted to change max_vfs but keep the same number
> of flexible resources per VF, then I would have to do some math and
> update max_ioqpairs. And then I also would have to adjust the other
> interrupt-related parameter, as it's also affected. In my opinion
> it's quite inconvenient.

True, that is probably inconvenient, but we have tools to do this math
for us. I very much prefer to be explicit in these parameters.

Also, see my comment on patch 12. If we keep this meaning of
max_ioqpairs, then we have reasonable defaults for the number of private
resources (if no flexible resources are required) and I think we can
control all parameters in the capabilities structures (with a little
math).

>  
> Now, even if I changed the semantic of params, I would still need most
> of this patch. (Let’s keep the discussion regarding if max_* fields are
> necessary in the other thread).
> 
> Without virtualization, the maximum number of queues is constant. User
> (i.e., nvme kernel driver) can only query this value (e.g., 10) and
> needs to follow this limit.
> 
> With virtualization, the flexible resources kick in. Let's continue with
> the sample numbers defined earlier (10 private + 16 flexible resources).
> 
> 1) The device boots, all 16 flexible queues are assigned to the primary
>    controller.
> 2) Nvme kernel driver queries for the limit (10+16=26) and can create/use
>    up to this many queues. 
> 3) User via the virtualization management command unbinds some (let's
>    say 2) of the flexible queues from the primary controller and assigns
>    them to a secondary controller.
> 4) After reset, the Physical Function Device reports different limit
>    (24), and when the Virtual Device shows up, it will report 1 (adminQ
>    consumed the other resource). 
> 
> So I need additional variable in the state to store the intermediate
> limit (24 or 1), as none of the existing params has the correct value,
> and all the places that validate limits must work on the value.
> 

I do not contest that you need additional state to keep track of
assigned resources. That seems totally reasonable.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 05/15] hw/nvme: Add support for SR-IOV
  2021-11-02 17:33     ` Lukasz Maniak
@ 2021-11-04 14:30       ` Lukasz Maniak
  2021-11-08  7:56         ` Klaus Jensen
  0 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-11-04 14:30 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk, qemu-devel,
	Keith Busch

On Tue, Nov 02, 2021 at 06:33:31PM +0100, Lukasz Maniak wrote:
> On Tue, Nov 02, 2021 at 03:33:15PM +0100, Klaus Jensen wrote:
> > On Oct  7 18:23, Lukasz Maniak wrote:
> > > This patch implements initial support for Single Root I/O Virtualization
> > > on an NVMe device.
> > > 
> > > Essentially, it allows to define the maximum number of virtual functions
> > > supported by the NVMe controller via sriov_max_vfs parameter.
> > > 
> > > Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV
> > > capability by a physical controller and ARI capability by both the
> > > physical and virtual function devices.
> > > 
> > > NVMe controllers created via virtual functions mirror functionally
> > > the physical controller, which may not entirely be the case, thus
> > > consideration would be needed on the way to limit the capabilities of
> > > the VF.
> > > 
> > > NVMe subsystem is required for the use of SR-IOV.
> > > 
> > > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> > > ---
> > >  hw/nvme/ctrl.c           | 74 ++++++++++++++++++++++++++++++++++++++--
> > >  hw/nvme/nvme.h           |  1 +
> > >  include/hw/pci/pci_ids.h |  1 +
> > >  3 files changed, 73 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > index 6a571d18cf..ad79ff0c00 100644
> > > --- a/hw/nvme/ctrl.c
> > > +++ b/hw/nvme/ctrl.c
> > > @@ -6361,8 +6406,12 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
> > >                            n->reg_size);
> > >      memory_region_add_subregion(&n->bar0, 0, &n->iomem);
> > >  
> > > -    pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> > > -                     PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
> > > +    if (pci_is_vf(pci_dev)) {
> > > +        pcie_sriov_vf_register_bar(pci_dev, 0, &n->bar0);
> > > +    } else {
> > > +        pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> > > +                         PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
> > > +    }
> > 
> > I assume that the assert we are seeing means that the pci_register_bars
> > in nvme_init_cmb and nvme_init_pmr must be changed similarly to this.
> 
> Assert will only arise for CMB as VF params are initialized with PF
> params.
> 
> @@ -6532,6 +6585,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      NvmeCtrl *n = NVME(pci_dev);
>      NvmeNamespace *ns;
>      Error *local_err = NULL;
> +    NvmeCtrl *pn = NVME(pcie_sriov_get_pf(pci_dev));
> +
> +    if (pci_is_vf(pci_dev)) {
> +        /* VFs derive settings from the parent. PF's lifespan exceeds
> +         * that of VF's, so it's safe to share params.serial.
> +         */
> +        memcpy(&n->params, &pn->params, sizeof(NvmeParams));
> +        n->subsys = pn->subsys;
> +    }
>  
>      nvme_check_constraints(n, &local_err);
>      if (local_err) {
> 
> The following simple fix will both fix assert and also allow
> each VF to have its own CMB of the size defined for PF.
> 
> ---
>  hw/nvme/ctrl.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 19b32dd4da..99daa6290c 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -6837,10 +6837,15 @@ static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
>      n->cmb.buf = g_malloc0(cmb_size);
>      memory_region_init_io(&n->cmb.mem, OBJECT(n), &nvme_cmb_ops, n,
>                            "nvme-cmb", cmb_size);
> -    pci_register_bar(pci_dev, NVME_CMB_BIR,
> -                     PCI_BASE_ADDRESS_SPACE_MEMORY |
> -                     PCI_BASE_ADDRESS_MEM_TYPE_64 |
> -                     PCI_BASE_ADDRESS_MEM_PREFETCH, &n->cmb.mem);
> +
> +    if (pci_is_vf(pci_dev)) {
> +        pcie_sriov_vf_register_bar(pci_dev, NVME_CMB_BIR, &n->cmb.mem);
> +    } else {
> +        pci_register_bar(pci_dev, NVME_CMB_BIR,
> +                        PCI_BASE_ADDRESS_SPACE_MEMORY |
> +                        PCI_BASE_ADDRESS_MEM_TYPE_64 |
> +                        PCI_BASE_ADDRESS_MEM_PREFETCH, &n->cmb.mem);
> +    }
>  
>      NVME_CAP_SET_CMBS(cap, 1);
>      stq_le_p(&n->bar.cap, cap);
> 
> As for PMR, it is currently only available on PF, as only PF is capable
> of specifying the memory-backend-file object to use with PMR.
> Otherwise, either VFs would have to share the PMR with its PF, or there
> would be a requirement to define a memory-backend-file object for each VF.

Hi Klaus,

After some discussion, we decided that V2 will prohibit the use of CMB
and PMR in combination with SR-IOV.

While the implementation of CMB with SR-IOV is relatively
straightforward, PMR is not, and we want the CMB and PMR designs to stay
consistent with each other in relation to SR-IOV. So we considered it
best to disable both features for now and implement them in separate
patches.
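
For V2 we have something along these lines in mind for
nvme_check_constraints() (just a sketch, not the final code or final
error wording):

    if (params->sriov_max_vfs) {
        /* sketch: reject CMB/PMR together with SR-IOV, as planned for V2 */
        if (params->cmb_size_mb) {
            error_setg(errp, "CMB is not supported with SR-IOV");
            return;
        }
        if (n->pmr.dev) {
            error_setg(errp, "PMR is not supported with SR-IOV");
            return;
        }
    }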

Kind regards,
Lukasz


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers
  2021-11-03 12:07   ` Klaus Jensen
@ 2021-11-04 15:48     ` Łukasz Gieryk
  2021-11-05  8:46       ` Łukasz Gieryk
  0 siblings, 1 reply; 55+ messages in thread
From: Łukasz Gieryk @ 2021-11-04 15:48 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Lukasz Maniak, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, Keith Busch,
	Philippe Mathieu-Daudé

On Wed, Nov 03, 2021 at 01:07:31PM +0100, Klaus Jensen wrote:
> On Oct  7 18:24, Lukasz Maniak wrote:
> > From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > 
> > With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) one
> > can configure the maximum number of virtual queues and interrupts
> > assignable to a single virtual device. The primary and secondary
> > controller capability structures are initialized accordingly.
> > 
> > Since the number of available queues (interrupts) now varies between
> > VF/PF, BAR size calculation is also adjusted.
> > 
> 
> While this patch allows configuring the VQFRSM and VIFRSM fields, it
> implicitly sets VQFRT and VIFRT (i.e. by setting them to the product of
> sriov_max_vi_pervf and max_vfs). Which is just setting it to an upper
> bound and this removes a testable case for host software (e.g.
> requesting more flexible resources than what is currently available).
> 
> This patch also requires that these parameters are set if sriov_max_vfs
> is. I think we can provide better defaults.
> 

Originally I considered more params, but ended up coding the simplest,
user-friendly solution, because I did not like the mess of so many
parameters and the flexibility wasn't needed for my use cases. But I do
agree: others may need the flexibility. The case (FRT < max_vfs * FRSM)
is valid and resembles an actual device.

> How about,
> 
> 1. if only sriov_max_vfs is set, then all VFs get private resources
>    equal to max_ioqpairs. Like before this patch. This limits the number
>    of parameters required to get a basic setup going.
> 
> 2. if sriov_v{q,i}_private is set (I suggested this parameter in patch
>    10), the difference between that and max_ioqpairs become flexible
>    resources. Also, I'd be just fine with having sriov_v{q,i}_flexible
>    instead and just make the difference become private resources.
>    Potato/potato.
> 
>    a. in the absence of sriov_max_v{q,i}_per_vf, set them to the number
>       of calculated flexible resources.
> 
> This probably smells a bit like bikeshedding, but I think this gives
> more flexibility and better defaults, which helps with verifying host
> software.
> 
> If we can't agree on this now, I suggest we could go ahead and merge the
> base functionality (i.e. private resources only) and ruminate some more
> about these parameters.

The problem is that the spec allows VFs to support either only private
or only flexible resources.

At this point I have to admit that, since my use cases for
QEMU/NVMe/SR-IOV require flexible resources, I haven’t paid much
attention to the case of VFs having private resources. So this SR/IOV
implementation doesn’t even support such a case (max_vX_per_vf != 0).

Let me summarize the possible config space, and how the current
parameters (could) map to these (interrupt-related ones omitted):

Flexible resources not supported (not implemented):
 - Private resources for PF     = max_ioqpairs
 - Private resources per VF     = ?
 - (error if flexible resources are configured)

With flexible resources:
 - VQPRT, private resources for PF      = max_ioqpairs
 - VQFRT, total flexible resources      = max_vq_per_vf * num_vfs
 - VQFRSM, maximum assignable per VF    = max_vq_per_vf
 - VQGRAN, granularity                  = #define constant
 - (error if private resources per VF are configured)

Since I don’t want to misunderstand your suggestion: could you provide a
similar map with your parameters and formulas, and explain how to
determine whether flexible resources are active? I want to be sure we
are on the same page.

-- 
Regards,
Łukasz


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers
  2021-11-04 15:48     ` Łukasz Gieryk
@ 2021-11-05  8:46       ` Łukasz Gieryk
  2021-11-05 14:04         ` Łukasz Gieryk
  0 siblings, 1 reply; 55+ messages in thread
From: Łukasz Gieryk @ 2021-11-05  8:46 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Lukasz Maniak, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, Keith Busch,
	Philippe Mathieu-Daudé

On Thu, Nov 04, 2021 at 04:48:43PM +0100, Łukasz Gieryk wrote:
> On Wed, Nov 03, 2021 at 01:07:31PM +0100, Klaus Jensen wrote:
> > On Oct  7 18:24, Lukasz Maniak wrote:
> > > From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > > 
> > > With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) one
> > > can configure the maximum number of virtual queues and interrupts
> > > assignable to a single virtual device. The primary and secondary
> > > controller capability structures are initialized accordingly.
> > > 
> > > Since the number of available queues (interrupts) now varies between
> > > VF/PF, BAR size calculation is also adjusted.
> > > 
> > 
> > While this patch allows configuring the VQFRSM and VIFRSM fields, it
> > implicitly sets VQFRT and VIFRT (i.e. by setting them to the product of
> > sriov_max_vi_pervf and max_vfs). Which is just setting it to an upper
> > bound and this removes a testable case for host software (e.g.
> > requesting more flexible resources than what is currently available).
> > 
> > This patch also requires that these parameters are set if sriov_max_vfs
> > is. I think we can provide better defaults.
> > 
> 
> Originally I considered more params, but ended up coding the simplest,
> user-friendly solution, because I did not like the mess with so many
> parameters, and the flexibility wasn't needed for my use cases. But I do
> agree: others may need the flexibility. Case (FRT < max_vfs * FRSM) is
> valid and resembles an actual device.
> 
> > How about,
> > 
> > 1. if only sriov_max_vfs is set, then all VFs get private resources
> >    equal to max_ioqpairs. Like before this patch. This limits the number
> >    of parameters required to get a basic setup going.
> > 
> > 2. if sriov_v{q,i}_private is set (I suggested this parameter in patch
> >    10), the difference between that and max_ioqpairs become flexible
> >    resources. Also, I'd be just fine with having sriov_v{q,i}_flexible
> >    instead and just make the difference become private resources.
> >    Potato/potato.
> > 
> >    a. in the absence of sriov_max_v{q,i}_per_vf, set them to the number
> >       of calculated flexible resources.
> > 
> > This probably smells a bit like bikeshedding, but I think this gives
> > more flexibility and better defaults, which helps with verifying host
> > software.
> > 
> > If we can't agree on this now, I suggest we could go ahead and merge the
> > base functionality (i.e. private resources only) and ruminate some more
> > about these parameters.
> 
> The problem is that the spec allows VFs to support either only private,
> or only flexible resources.
> 
> At this point I have to admit, that since my use cases for
> QEMU/Nvme/SRIOV require flexible resources, I haven’t paid much
> attention to the case with VFs having private resources. So this SR/IOV
> implementation doesn’t even support such case (max_vX_per_vf != 0).
> 
> Let me summarize the possible config space, and how the current
> parameters (could) map to these (interrupt-related ones omitted):
> 
> Flexible resources not supported (not implemented):
>  - Private resources for PF     = max_ioqpairs
>  - Private resources per VF     = ?
>  - (error if flexible resources are configured)
> 
> With flexible resources:
>  - VQPRT, private resources for PF      = max_ioqpairs
>  - VQFRT, total flexible resources      = max_vq_per_vf * num_vfs
>  - VQFRSM, maximum assignable per VF    = max_vq_per_vf
>  - VQGRAN, granularity                  = #define constant
>  - (error if private resources per VF are configured)
> 
> Since I don’t want to misunderstand your suggestion: could you provide a
> similar map with your parameters, formulas, and explain how to determine
> if flexible resources are active? I want to be sure we are on the
> same page.
> 

I’ve just re-read my email and decided that some bits need
clarification.

This implementation supports the “Flexible”-resources-only flavor of
SR/IOV, while the “Private” flavor could also be supported. Some effort
is required to support both, and I cannot afford that (at least I cannot
commit to it today, nor can the other Lukasz).

While I’m ready to rework the Flexible config and prepare it to be
extended later to handle the Private variant, the 2nd version of these
patches will still support the Flexible flavor only.

I will include appropriate TODO/open in the next cover letter.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers
  2021-11-05  8:46       ` Łukasz Gieryk
@ 2021-11-05 14:04         ` Łukasz Gieryk
  2021-11-08  8:25           ` Klaus Jensen
  0 siblings, 1 reply; 55+ messages in thread
From: Łukasz Gieryk @ 2021-11-05 14:04 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Lukasz Maniak, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, Keith Busch,
	Philippe Mathieu-Daudé

On Fri, Nov 05, 2021 at 09:46:28AM +0100, Łukasz Gieryk wrote:
> On Thu, Nov 04, 2021 at 04:48:43PM +0100, Łukasz Gieryk wrote:
> > On Wed, Nov 03, 2021 at 01:07:31PM +0100, Klaus Jensen wrote:
> > > On Oct  7 18:24, Lukasz Maniak wrote:
> > > > From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > > > 
> > > > With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) one
> > > > can configure the maximum number of virtual queues and interrupts
> > > > assignable to a single virtual device. The primary and secondary
> > > > controller capability structures are initialized accordingly.
> > > > 
> > > > Since the number of available queues (interrupts) now varies between
> > > > VF/PF, BAR size calculation is also adjusted.
> > > > 
> > > 
> > > While this patch allows configuring the VQFRSM and VIFRSM fields, it
> > > implicitly sets VQFRT and VIFRT (i.e. by setting them to the product of
> > > sriov_max_vi_pervf and max_vfs). Which is just setting it to an upper
> > > bound and this removes a testable case for host software (e.g.
> > > requesting more flexible resources than what is currently available).
> > > 
> > > This patch also requires that these parameters are set if sriov_max_vfs
> > > is. I think we can provide better defaults.
> > > 
> > 
> > Originally I considered more params, but ended up coding the simplest,
> > user-friendly solution, because I did not like the mess with so many
> > parameters, and the flexibility wasn't needed for my use cases. But I do
> > agree: others may need the flexibility. Case (FRT < max_vfs * FRSM) is
> > valid and resembles an actual device.
> > 
> > > How about,
> > > 
> > > 1. if only sriov_max_vfs is set, then all VFs get private resources
> > >    equal to max_ioqpairs. Like before this patch. This limits the number
> > >    of parameters required to get a basic setup going.
> > > 
> > > 2. if sriov_v{q,i}_private is set (I suggested this parameter in patch
> > >    10), the difference between that and max_ioqpairs become flexible
> > >    resources. Also, I'd be just fine with having sriov_v{q,i}_flexible
> > >    instead and just make the difference become private resources.
> > >    Potato/potato.
> > > 
> > >    a. in the absence of sriov_max_v{q,i}_per_vf, set them to the number
> > >       of calculated flexible resources.
> > > 
> > > This probably smells a bit like bikeshedding, but I think this gives
> > > more flexibility and better defaults, which helps with verifying host
> > > software.
> > > 
> > > If we can't agree on this now, I suggest we could go ahead and merge the
> > > base functionality (i.e. private resources only) and ruminate some more
> > > about these parameters.
> > 
> > The problem is that the spec allows VFs to support either only private,
> > or only flexible resources.
> > 
> > At this point I have to admit, that since my use cases for
> > QEMU/Nvme/SRIOV require flexible resources, I haven’t paid much
> > attention to the case with VFs having private resources. So this SR/IOV
> > implementation doesn’t even support such case (max_vX_per_vf != 0).
> > 
> > Let me summarize the possible config space, and how the current
> > parameters (could) map to these (interrupt-related ones omitted):
> > 
> > Flexible resources not supported (not implemented):
> >  - Private resources for PF     = max_ioqpairs
> >  - Private resources per VF     = ?
> >  - (error if flexible resources are configured)
> > 
> > With flexible resources:
> >  - VQPRT, private resources for PF      = max_ioqpairs
> >  - VQFRT, total flexible resources      = max_vq_per_vf * num_vfs
> >  - VQFRSM, maximum assignable per VF    = max_vq_per_vf
> >  - VQGRAN, granularity                  = #define constant
> >  - (error if private resources per VF are configured)
> > 
> > Since I don’t want to misunderstand your suggestion: could you provide a
> > similar map with your parameters, formulas, and explain how to determine
> > if flexible resources are active? I want to be sure we are on the
> > same page.
> > 
> 
> I’ve just re-read through my email and decided that some bits need
> clarification.
> 
> This implementation supports the “Flexible”-resources-only flavor of
> SR/IOV, while the “Private” also could be supported. Some effort is
> required to support both, and I cannot afford that (at least I cannot
> commit today, neither the other Lukasz).
> 
> While I’m ready to rework the Flexible config and prepare it to be
> extended later to handle the Private variant, the 2nd version of these
> patches will still support the Flexible flavor only.
> 
> I will include appropriate TODO/open in the next cover letter.
> 

A summary of my thoughts so far:
- I'm going to introduce sriov_v{q,i}_flexible and better defaults,
  according to your suggestion (as far as I understand your intentions,
  please correct me if I've missed something).
- The Private SR/IOV flavor, if it's ever implemented, could introduce
  sriov_vq_private_per_vf.
- The updated formulas are listed below.

Flexible resources not supported (not implemented):
 - Private resources for PF     = max_ioqpairs
 - Private resources per VF     = sriov_vq_private_per_vf
 - (error if sriov_vq_flexible is set)

With flexible resources:
 - VQPRT, private resources for PF      = max_ioqpairs - sriov_vq_flexible
 - VQFRT, total flexible resources      = sriov_vq_flexible (if set, or)
                                          VQPRT * num_vfs
 - VQFRSM, maximum assignable per VF    = sriov_max_vq_per_vf (if set, or)
                                          VQPRT
 - VQGRAN, granularity                  = #define constant
 - (error if sriov_vq_private_per_vf is set)

Is this version acceptable?
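
Expressed as code, the flexible-resources case would boil down to
something like the following standalone sketch (the parameter names
sriov_vq_flexible and sriov_max_vq_per_vf are the proposed ones, not
final, and the granularity constant is just a placeholder):

#include <stdint.h>

#define NVME_VQ_GRAN 1  /* placeholder for the #define'd granularity */

typedef struct {
    uint32_t max_ioqpairs;
    uint32_t sriov_max_vfs;
    uint32_t sriov_vq_flexible;    /* 0 means "not set" */
    uint32_t sriov_max_vq_per_vf;  /* 0 means "not set" */
} Params;

typedef struct {
    uint32_t vqprt;   /* private resources for the PF */
    uint32_t vqfrt;   /* total flexible resources */
    uint32_t vqfrsm;  /* maximum flexible resources assignable per VF */
    uint32_t vqgran;  /* allocation granularity */
} VqCaps;

VqCaps derive_vq_caps(const Params *p)
{
    VqCaps c;

    c.vqprt = p->max_ioqpairs - p->sriov_vq_flexible;
    c.vqfrt = p->sriov_vq_flexible ? p->sriov_vq_flexible
                                   : c.vqprt * p->sriov_max_vfs;
    c.vqfrsm = p->sriov_max_vq_per_vf ? p->sriov_max_vq_per_vf : c.vqprt;
    c.vqgran = NVME_VQ_GRAN;

    return c;
}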



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 05/15] hw/nvme: Add support for SR-IOV
  2021-11-04 14:30       ` Lukasz Maniak
@ 2021-11-08  7:56         ` Klaus Jensen
  2021-11-10 13:42           ` Lukasz Maniak
  0 siblings, 1 reply; 55+ messages in thread
From: Klaus Jensen @ 2021-11-08  7:56 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk, qemu-devel,
	Keith Busch

[-- Attachment #1: Type: text/plain, Size: 5444 bytes --]

On Nov  4 15:30, Lukasz Maniak wrote:
> On Tue, Nov 02, 2021 at 06:33:31PM +0100, Lukasz Maniak wrote:
> > On Tue, Nov 02, 2021 at 03:33:15PM +0100, Klaus Jensen wrote:
> > > On Oct  7 18:23, Lukasz Maniak wrote:
> > > > This patch implements initial support for Single Root I/O Virtualization
> > > > on an NVMe device.
> > > > 
> > > > Essentially, it allows to define the maximum number of virtual functions
> > > > supported by the NVMe controller via sriov_max_vfs parameter.
> > > > 
> > > > Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV
> > > > capability by a physical controller and ARI capability by both the
> > > > physical and virtual function devices.
> > > > 
> > > > NVMe controllers created via virtual functions mirror functionally
> > > > the physical controller, which may not entirely be the case, thus
> > > > consideration would be needed on the way to limit the capabilities of
> > > > the VF.
> > > > 
> > > > NVMe subsystem is required for the use of SR-IOV.
> > > > 
> > > > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> > > > ---
> > > >  hw/nvme/ctrl.c           | 74 ++++++++++++++++++++++++++++++++++++++--
> > > >  hw/nvme/nvme.h           |  1 +
> > > >  include/hw/pci/pci_ids.h |  1 +
> > > >  3 files changed, 73 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > > index 6a571d18cf..ad79ff0c00 100644
> > > > --- a/hw/nvme/ctrl.c
> > > > +++ b/hw/nvme/ctrl.c
> > > > @@ -6361,8 +6406,12 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
> > > >                            n->reg_size);
> > > >      memory_region_add_subregion(&n->bar0, 0, &n->iomem);
> > > >  
> > > > -    pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> > > > -                     PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
> > > > +    if (pci_is_vf(pci_dev)) {
> > > > +        pcie_sriov_vf_register_bar(pci_dev, 0, &n->bar0);
> > > > +    } else {
> > > > +        pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> > > > +                         PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
> > > > +    }
> > > 
> > > I assume that the assert we are seeing means that the pci_register_bars
> > > in nvme_init_cmb and nvme_init_pmr must be changed similarly to this.
> > 
> > Assert will only arise for CMB as VF params are initialized with PF
> > params.
> > 
> > @@ -6532,6 +6585,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >      NvmeCtrl *n = NVME(pci_dev);
> >      NvmeNamespace *ns;
> >      Error *local_err = NULL;
> > +    NvmeCtrl *pn = NVME(pcie_sriov_get_pf(pci_dev));
> > +
> > +    if (pci_is_vf(pci_dev)) {
> > +        /* VFs derive settings from the parent. PF's lifespan exceeds
> > +         * that of VF's, so it's safe to share params.serial.
> > +         */
> > +        memcpy(&n->params, &pn->params, sizeof(NvmeParams));
> > +        n->subsys = pn->subsys;
> > +    }
> >  
> >      nvme_check_constraints(n, &local_err);
> >      if (local_err) {
> > 
> > The following simple fix will both fix assert and also allow
> > each VF to have its own CMB of the size defined for PF.
> > 
> > ---
> >  hw/nvme/ctrl.c | 13 +++++++++----
> >  1 file changed, 9 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > index 19b32dd4da..99daa6290c 100644
> > --- a/hw/nvme/ctrl.c
> > +++ b/hw/nvme/ctrl.c
> > @@ -6837,10 +6837,15 @@ static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> >      n->cmb.buf = g_malloc0(cmb_size);
> >      memory_region_init_io(&n->cmb.mem, OBJECT(n), &nvme_cmb_ops, n,
> >                            "nvme-cmb", cmb_size);
> > -    pci_register_bar(pci_dev, NVME_CMB_BIR,
> > -                     PCI_BASE_ADDRESS_SPACE_MEMORY |
> > -                     PCI_BASE_ADDRESS_MEM_TYPE_64 |
> > -                     PCI_BASE_ADDRESS_MEM_PREFETCH, &n->cmb.mem);
> > +
> > +    if (pci_is_vf(pci_dev)) {
> > +        pcie_sriov_vf_register_bar(pci_dev, NVME_CMB_BIR, &n->cmb.mem);
> > +    } else {
> > +        pci_register_bar(pci_dev, NVME_CMB_BIR,
> > +                        PCI_BASE_ADDRESS_SPACE_MEMORY |
> > +                        PCI_BASE_ADDRESS_MEM_TYPE_64 |
> > +                        PCI_BASE_ADDRESS_MEM_PREFETCH, &n->cmb.mem);
> > +    }
> >  
> >      NVME_CAP_SET_CMBS(cap, 1);
> >      stq_le_p(&n->bar.cap, cap);
> > 
> > As for PMR, it is currently only available on PF, as only PF is capable
> > of specifying the memory-backend-file object to use with PMR.
> > Otherwise, either VFs would have to share the PMR with its PF, or there
> > would be a requirement to define a memory-backend-file object for each VF.
> 
> Hi Klaus,
> 
> After some discussion, we decided to prohibit in V2 the use of CMB and
> PMR in combination with SR-IOV.
> 
> While the implementation of CMB with SR-IOV is relatively
> straightforward, PMR is not. We are committed to consistency in CMB and
> PMR design in association with SR-IOV. So we considered it best to
> disable both features and implement them in separate patches.
> 

I am completely fine with that. However, since we are copying the
parameters verbatim, it would be nice if `info qtree` reflected this
difference (i.e. that parameters such as cmb_size_mb read 0 for the
virtual controllers).

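As an illustration of the above (a sketch only, not code from the series; the
names follow the snippet quoted earlier in this mail), the verbatim copy could
be followed by clearing the fields a VF does not actually expose:

    if (pci_is_vf(pci_dev)) {
        /* VFs derive settings from the parent; the PF's lifespan exceeds
         * that of the VFs, so sharing params.serial is safe.
         */
        memcpy(&n->params, &pn->params, sizeof(NvmeParams));
        /* Hypothetical: zero out features the VF does not expose, so that
         * `info qtree` reports cmb_size_mb=0 for the virtual controllers.
         */
        n->params.cmb_size_mb = 0;
        n->subsys = pn->subsys;
    }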


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers
  2021-11-05 14:04         ` Łukasz Gieryk
@ 2021-11-08  8:25           ` Klaus Jensen
  2021-11-08 13:57             ` Łukasz Gieryk
  0 siblings, 1 reply; 55+ messages in thread
From: Klaus Jensen @ 2021-11-08  8:25 UTC (permalink / raw)
  To: Łukasz Gieryk
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Lukasz Maniak, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, Keith Busch,
	Philippe Mathieu-Daudé


On Nov  5 15:04, Łukasz Gieryk wrote:
> On Fri, Nov 05, 2021 at 09:46:28AM +0100, Łukasz Gieryk wrote:
> > On Thu, Nov 04, 2021 at 04:48:43PM +0100, Łukasz Gieryk wrote:
> > > On Wed, Nov 03, 2021 at 01:07:31PM +0100, Klaus Jensen wrote:
> > > > On Oct  7 18:24, Lukasz Maniak wrote:
> > > > > From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > > > > 
> > > > > With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) one
> > > > > can configure the maximum number of virtual queues and interrupts
> > > > > assignable to a single virtual device. The primary and secondary
> > > > > controller capability structures are initialized accordingly.
> > > > > 
> > > > > Since the number of available queues (interrupts) now varies between
> > > > > VF/PF, BAR size calculation is also adjusted.
> > > > > 
> > > > 
> > > > While this patch allows configuring the VQFRSM and VIFRSM fields, it
> > > > implicitly sets VQFRT and VIFRT (i.e. by setting them to the product of
> > > > sriov_max_vi_pervf and max_vfs). Which is just setting it to an upper
> > > > bound and this removes a testable case for host software (e.g.
> > > > requesting more flexible resources than what is currently available).
> > > > 
> > > > This patch also requires that these parameters are set if sriov_max_vfs
> > > > is. I think we can provide better defaults.
> > > > 
> > > 
> > > Originally I considered more params, but ended up coding the simplest,
> > > user-friendly solution, because I did not like the mess with so many
> > > parameters, and the flexibility wasn't needed for my use cases. But I do
> > > agree: others may need the flexibility. Case (FRT < max_vfs * FRSM) is
> > > valid and resembles an actual device.
> > > 
> > > > How about,
> > > > 
> > > > 1. if only sriov_max_vfs is set, then all VFs get private resources
> > > >    equal to max_ioqpairs. Like before this patch. This limits the number
> > > >    of parameters required to get a basic setup going.
> > > > 
> > > > 2. if sriov_v{q,i}_private is set (I suggested this parameter in patch
> > > >    10), the difference between that and max_ioqpairs become flexible
> > > >    resources. Also, I'd be just fine with having sriov_v{q,i}_flexible
> > > >    instead and just make the difference become private resources.
> > > >    Potato/potato.
> > > > 
> > > >    a. in the absence of sriov_max_v{q,i}_per_vf, set them to the number
> > > >       of calculated flexible resources.
> > > > 
> > > > This probably smells a bit like bikeshedding, but I think this gives
> > > > more flexibility and better defaults, which helps with verifying host
> > > > software.
> > > > 
> > > > If we can't agree on this now, I suggest we could go ahead and merge the
> > > > base functionality (i.e. private resources only) and ruminate some more
> > > > about these parameters.
> > > 
> > > The problem is that the spec allows VFs to support either only private,
> > > or only flexible resources.
> > > 
> > > At this point I have to admit, that since my use cases for
> > > QEMU/Nvme/SRIOV require flexible resources, I haven’t paid much
> > > attention to the case with VFs having private resources. So this SR/IOV
> > > implementation doesn’t even support such case (max_vX_per_vf != 0).
> > > 
> > > Let me summarize the possible config space, and how the current
> > > parameters (could) map to these (interrupt-related ones omitted):
> > > 
> > > Flexible resources not supported (not implemented):
> > >  - Private resources for PF     = max_ioqpairs
> > >  - Private resources per VF     = ?
> > >  - (error if flexible resources are configured)
> > > 
> > > With flexible resources:
> > >  - VQPRT, private resources for PF      = max_ioqpairs
> > >  - VQFRT, total flexible resources      = max_vq_per_vf * num_vfs
> > >  - VQFRSM, maximum assignable per VF    = max_vq_per_vf
> > >  - VQGRAN, granularity                  = #define constant
> > >  - (error if private resources per VF are configured)
> > > 
> > > Since I don’t want to misunderstand your suggestion: could you provide a
> > > similar map with your parameters, formulas, and explain how to determine
> > > if flexible resources are active? I want to be sure we are on the
> > > same page.
> > > 
> > 
> > I’ve just re-read through my email and decided that some bits need
> > clarification.
> > 
> > This implementation supports the “Flexible”-resources-only flavor of
> > SR/IOV, while the “Private” also could be supported. Some effort is
> > required to support both, and I cannot afford that (at least I cannot
> > commit today, neither the other Lukasz).
> > 
> > While I’m ready to rework the Flexible config and prepare it to be
> > extended later to handle the Private variant, the 2nd version of these
> > patches will still support the Flexible flavor only.
> > 
> > I will include appropriate TODO/open in the next cover letter.
> > 
> 
> The summary of my thoughts, so far:
> - I'm going to introduce sriov_v{q,i}_flexible and better defaults,
>   according to your suggestion (as far as I understand your intentions,
>   please correct me if I've missed something).
> - The Private SR/IOV flavor, if it's ever implemented, could introduce
>   sriov_vq_private_per_vf.
> - The updated formulas are listed below.
> 
> Flexible resources not supported (not implemented):
>  - Private resources for PF     = max_ioqpairs
>  - Private resources per VF     = sriov_vq_private_per_vf

I would just keep it simple and say, if sriov_v{q,i}_flexible is not
set, then each VF gets max_ioqpairs private resources.

>  - (error if sriov_vq_flexible is set)
> 
> With flexible resources:
>  - VQPRT, private resources for PF      = max_ioqpairs - sriov_vq_flexible
>  - VQFRT, total flexible resources      = sriov_vq_flexible (if set, or)
>                                           VQPRT * num_vfs
>  - VQFRSM, maximum assignable per VF    = sriov_max_vq_per_vf (if set, or)
>                                           VQPRT

You mean VQFRT here, right?

>  - VQGRAN, granularity                  = #define constant

Yeah, 1 seems pretty reasonable here.

>  - (error if sriov_vq_private_per_vf is set)
> 
> Is this version acceptable?
> 

Sounds good to me. The only one I am not too happy about is the default
of VQPRT * num_vfs (i.e. max_ioqpairs * num_vfs) when sriov_vq_flexible is
not set. I think this is the case where we should default to private
resources. If you don't want to work with private resources right now,
can we instead have it bug out and complain that sriov_vq_flexible must
be set? We can then later lift that restriction and implement private
resources.

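To make the above concrete, here is a small standalone sketch of the derivation
being discussed, including the "bug out" case (all parameter names are
illustrative assumptions, not necessarily the final V2 names):

#include <stdio.h>

struct sriov_params {
    unsigned max_ioqpairs;        /* PF queue pairs, including the flexible pool */
    unsigned sriov_max_vfs;       /* number of virtual functions */
    unsigned sriov_vq_flexible;   /* total flexible VQ resources (0 = not set) */
    unsigned sriov_max_vq_per_vf; /* optional VQFRSM override (0 = not set) */
};

int main(void)
{
    struct sriov_params p = {
        .max_ioqpairs = 64, .sriov_max_vfs = 4,
        .sriov_vq_flexible = 32, .sriov_max_vq_per_vf = 0,
    };

    if (p.sriov_max_vfs && !p.sriov_vq_flexible) {
        /* the error case suggested above, until private resources are implemented */
        fprintf(stderr, "sriov_vq_flexible must be set when sriov_max_vfs > 0\n");
        return 1;
    }

    unsigned vqprt  = p.max_ioqpairs - p.sriov_vq_flexible;  /* VQPRT */
    unsigned vqfrt  = p.sriov_vq_flexible;                    /* VQFRT */
    unsigned vqfrsm = p.sriov_max_vq_per_vf ? p.sriov_max_vq_per_vf
                                            : vqprt;          /* VQFRSM; default under discussion (VQPRT vs. VQFRT) */

    printf("VQPRT=%u VQFRT=%u VQFRSM=%u VQGRAN=1\n", vqprt, vqfrt, vqfrsm);
    return 0;
}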

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers
  2021-11-08  8:25           ` Klaus Jensen
@ 2021-11-08 13:57             ` Łukasz Gieryk
  2021-11-09 12:22               ` Klaus Jensen
  0 siblings, 1 reply; 55+ messages in thread
From: Łukasz Gieryk @ 2021-11-08 13:57 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Lukasz Maniak, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, Keith Busch,
	Philippe Mathieu-Daudé

On Mon, Nov 08, 2021 at 09:25:58AM +0100, Klaus Jensen wrote:
> On Nov  5 15:04, Łukasz Gieryk wrote:
> > On Fri, Nov 05, 2021 at 09:46:28AM +0100, Łukasz Gieryk wrote:
> > > On Thu, Nov 04, 2021 at 04:48:43PM +0100, Łukasz Gieryk wrote:
> > > > On Wed, Nov 03, 2021 at 01:07:31PM +0100, Klaus Jensen wrote:
> > > > > On Oct  7 18:24, Lukasz Maniak wrote:
> > > > > > From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > > > > > 
> > > > > > With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) one
> > > > > > can configure the maximum number of virtual queues and interrupts
> > > > > > assignable to a single virtual device. The primary and secondary
> > > > > > controller capability structures are initialized accordingly.
> > > > > > 
> > > > > > Since the number of available queues (interrupts) now varies between
> > > > > > VF/PF, BAR size calculation is also adjusted.
> > > > > > 
> > > > > 
> > > > > While this patch allows configuring the VQFRSM and VIFRSM fields, it
> > > > > implicitly sets VQFRT and VIFRT (i.e. by setting them to the product of
> > > > > sriov_max_vi_pervf and max_vfs). Which is just setting it to an upper
> > > > > bound and this removes a testable case for host software (e.g.
> > > > > requesting more flexible resources than what is currently available).
> > > > > 
> > > > > This patch also requires that these parameters are set if sriov_max_vfs
> > > > > is. I think we can provide better defaults.
> > > > > 
> > > > 
> > > > Originally I considered more params, but ended up coding the simplest,
> > > > user-friendly solution, because I did not like the mess with so many
> > > > parameters, and the flexibility wasn't needed for my use cases. But I do
> > > > agree: others may need the flexibility. Case (FRT < max_vfs * FRSM) is
> > > > valid and resembles an actual device.
> > > > 
> > > > > How about,
> > > > > 
> > > > > 1. if only sriov_max_vfs is set, then all VFs get private resources
> > > > >    equal to max_ioqpairs. Like before this patch. This limits the number
> > > > >    of parameters required to get a basic setup going.
> > > > > 
> > > > > 2. if sriov_v{q,i}_private is set (I suggested this parameter in patch
> > > > >    10), the difference between that and max_ioqpairs become flexible
> > > > >    resources. Also, I'd be just fine with having sriov_v{q,i}_flexible
> > > > >    instead and just make the difference become private resources.
> > > > >    Potato/potato.
> > > > > 
> > > > >    a. in the absence of sriov_max_v{q,i}_per_vf, set them to the number
> > > > >       of calculated flexible resources.
> > > > > 
> > > > > This probably smells a bit like bikeshedding, but I think this gives
> > > > > more flexibility and better defaults, which helps with verifying host
> > > > > software.
> > > > > 
> > > > > If we can't agree on this now, I suggest we could go ahead and merge the
> > > > > base functionality (i.e. private resources only) and ruminate some more
> > > > > about these parameters.
> > > > 
> > > > The problem is that the spec allows VFs to support either only private,
> > > > or only flexible resources.
> > > > 
> > > > At this point I have to admit, that since my use cases for
> > > > QEMU/Nvme/SRIOV require flexible resources, I haven’t paid much
> > > > attention to the case with VFs having private resources. So this SR/IOV
> > > > implementation doesn’t even support such case (max_vX_per_vf != 0).
> > > > 
> > > > Let me summarize the possible config space, and how the current
> > > > parameters (could) map to these (interrupt-related ones omitted):
> > > > 
> > > > Flexible resources not supported (not implemented):
> > > >  - Private resources for PF     = max_ioqpairs
> > > >  - Private resources per VF     = ?
> > > >  - (error if flexible resources are configured)
> > > > 
> > > > With flexible resources:
> > > >  - VQPRT, private resources for PF      = max_ioqpairs
> > > >  - VQFRT, total flexible resources      = max_vq_per_vf * num_vfs
> > > >  - VQFRSM, maximum assignable per VF    = max_vq_per_vf
> > > >  - VQGRAN, granularity                  = #define constant
> > > >  - (error if private resources per VF are configured)
> > > > 
> > > > Since I don’t want to misunderstand your suggestion: could you provide a
> > > > similar map with your parameters, formulas, and explain how to determine
> > > > if flexible resources are active? I want to be sure we are on the
> > > > same page.
> > > > 
> > > 
> > > I’ve just re-read through my email and decided that some bits need
> > > clarification.
> > > 
> > > This implementation supports the “Flexible”-resources-only flavor of
> > > SR/IOV, while the “Private” also could be supported. Some effort is
> > > required to support both, and I cannot afford that (at least I cannot
> > > commit today, neither the other Lukasz).
> > > 
> > > While I’m ready to rework the Flexible config and prepare it to be
> > > extended later to handle the Private variant, the 2nd version of these
> > > patches will still support the Flexible flavor only.
> > > 
> > > I will include appropriate TODO/open in the next cover letter.
> > > 
> > 
> > The summary of my thoughts, so far:
> > - I'm going to introduce sriov_v{q,i}_flexible and better defaults,
> >   according to your suggestion (as far as I understand your intentions,
> >   please correct me if I've missed something).
> > - The Private SR/IOV flavor, if it's ever implemented, could introduce
> >   sriov_vq_private_per_vf.
> > - The updated formulas are listed below.
> > 
> > Flexible resources not supported (not implemented):
> >  - Private resources for PF     = max_ioqpairs
> >  - Private resources per VF     = sriov_vq_private_per_vf
> 
> I would just keep it simple and say, if sriov_v{q,i}_flexible is not
> set, then each VF gets max_ioqpairs private resources.
> 

Since you did request more tuning knobs for the Flexible variant, the
Private one should follow that and allow full configuration. A device
where PF.priv=64 and each VF.priv=4 makes sense, and I couldn’t
configure it if sriov_v{q,i}_flexible=0 enabled the Private mode.

> >  - (error if sriov_vq_flexible is set)
> > 
> > With flexible resources:
> >  - VQPRT, private resources for PF      = max_ioqpairs - sriov_vq_flexible
> >  - VQFRT, total flexible resources      = sriov_vq_flexible (if set, or)
> >                                           VQPRT * num_vfs
> >  - VQFRSM, maximum assignable per VF    = sriov_max_vq_per_vf (if set, or)
> >                                           VQPRT
> 
> You mean VQFRT here, right?
> 

VQPRT is right, and – in my opinion – makes a better default than VQFRT.

E.g., configuring a device:

(max_vfs=32, PF.priv=VQPRT=X, PF.flex_total=VQFRT=256)

as (num_vfs=1, VF0.flex=256) doesn’t make much sense. Virtualization is
not needed in such a case, and the user should probably use the PF directly.
On the other hand, VQPRT is probably tuned to offer most (if not all) of
the performance and functionality, and thus serves as a sane default.

> >  - VQGRAN, granularity                  = #define constant
> 
> Yeah, 1 seems pretty reasonable here.
> 
> >  - (error if sriov_vq_private_per_vf is set)
> > 
> > Is this version acceptable?
> > 
> 
> Sounds good to me. The only one I am not too happy about is the default
> of VQPRT * num_vfs. (i.e. max_ioqpairs * num_vfs) when vq_flexible is
> not set. I think this is the case where we should default to private
> resources. If you don't want to work with private resources right now,
> can we instead have it bug out and complain that sriov_vq_flexible must
> be set? We can then later lift that restriction and implement private
> resources.

I would prefer reserving sriov_v{q,i}_flexible=0 for now. That's my current
plan for V2.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers
  2021-11-08 13:57             ` Łukasz Gieryk
@ 2021-11-09 12:22               ` Klaus Jensen
  0 siblings, 0 replies; 55+ messages in thread
From: Klaus Jensen @ 2021-11-09 12:22 UTC (permalink / raw)
  To: Łukasz Gieryk
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Lukasz Maniak, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, Keith Busch,
	Philippe Mathieu-Daudé


On Nov  8 14:57, Łukasz Gieryk wrote:
> On Mon, Nov 08, 2021 at 09:25:58AM +0100, Klaus Jensen wrote:
> > On Nov  5 15:04, Łukasz Gieryk wrote:
> > > On Fri, Nov 05, 2021 at 09:46:28AM +0100, Łukasz Gieryk wrote:
> > > > On Thu, Nov 04, 2021 at 04:48:43PM +0100, Łukasz Gieryk wrote:
> > > > > On Wed, Nov 03, 2021 at 01:07:31PM +0100, Klaus Jensen wrote:
> > > > > > On Oct  7 18:24, Lukasz Maniak wrote:
> > > > > > > From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > > > > > > 
> > > > > > > With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) one
> > > > > > > can configure the maximum number of virtual queues and interrupts
> > > > > > > assignable to a single virtual device. The primary and secondary
> > > > > > > controller capability structures are initialized accordingly.
> > > > > > > 
> > > > > > > Since the number of available queues (interrupts) now varies between
> > > > > > > VF/PF, BAR size calculation is also adjusted.
> > > > > > > 
> > > > > > 
> > > > > > While this patch allows configuring the VQFRSM and VIFRSM fields, it
> > > > > > implicitly sets VQFRT and VIFRT (i.e. by setting them to the product of
> > > > > > sriov_max_vi_pervf and max_vfs). Which is just setting it to an upper
> > > > > > bound and this removes a testable case for host software (e.g.
> > > > > > requesting more flexible resources than what is currently available).
> > > > > > 
> > > > > > This patch also requires that these parameters are set if sriov_max_vfs
> > > > > > is. I think we can provide better defaults.
> > > > > > 
> > > > > 
> > > > > Originally I considered more params, but ended up coding the simplest,
> > > > > user-friendly solution, because I did not like the mess with so many
> > > > > parameters, and the flexibility wasn't needed for my use cases. But I do
> > > > > agree: others may need the flexibility. Case (FRT < max_vfs * FRSM) is
> > > > > valid and resembles an actual device.
> > > > > 
> > > > > > How about,
> > > > > > 
> > > > > > 1. if only sriov_max_vfs is set, then all VFs get private resources
> > > > > >    equal to max_ioqpairs. Like before this patch. This limits the number
> > > > > >    of parameters required to get a basic setup going.
> > > > > > 
> > > > > > 2. if sriov_v{q,i}_private is set (I suggested this parameter in patch
> > > > > >    10), the difference between that and max_ioqpairs become flexible
> > > > > >    resources. Also, I'd be just fine with having sriov_v{q,i}_flexible
> > > > > >    instead and just make the difference become private resources.
> > > > > >    Potato/potato.
> > > > > > 
> > > > > >    a. in the absence of sriov_max_v{q,i}_per_vf, set them to the number
> > > > > >       of calculated flexible resources.
> > > > > > 
> > > > > > This probably smells a bit like bikeshedding, but I think this gives
> > > > > > more flexibility and better defaults, which helps with verifying host
> > > > > > software.
> > > > > > 
> > > > > > If we can't agree on this now, I suggest we could go ahead and merge the
> > > > > > base functionality (i.e. private resources only) and ruminate some more
> > > > > > about these parameters.
> > > > > 
> > > > > The problem is that the spec allows VFs to support either only private,
> > > > > or only flexible resources.
> > > > > 
> > > > > At this point I have to admit, that since my use cases for
> > > > > QEMU/Nvme/SRIOV require flexible resources, I haven’t paid much
> > > > > attention to the case with VFs having private resources. So this SR/IOV
> > > > > implementation doesn’t even support such case (max_vX_per_vf != 0).
> > > > > 
> > > > > Let me summarize the possible config space, and how the current
> > > > > parameters (could) map to these (interrupt-related ones omitted):
> > > > > 
> > > > > Flexible resources not supported (not implemented):
> > > > >  - Private resources for PF     = max_ioqpairs
> > > > >  - Private resources per VF     = ?
> > > > >  - (error if flexible resources are configured)
> > > > > 
> > > > > With flexible resources:
> > > > >  - VQPRT, private resources for PF      = max_ioqpairs
> > > > >  - VQFRT, total flexible resources      = max_vq_per_vf * num_vfs
> > > > >  - VQFRSM, maximum assignable per VF    = max_vq_per_vf
> > > > >  - VQGRAN, granularity                  = #define constant
> > > > >  - (error if private resources per VF are configured)
> > > > > 
> > > > > Since I don’t want to misunderstand your suggestion: could you provide a
> > > > > similar map with your parameters, formulas, and explain how to determine
> > > > > if flexible resources are active? I want to be sure we are on the
> > > > > same page.
> > > > > 
> > > > 
> > > > I’ve just re-read through my email and decided that some bits need
> > > > clarification.
> > > > 
> > > > This implementation supports the “Flexible”-resources-only flavor of
> > > > SR/IOV, while the “Private” also could be supported. Some effort is
> > > > required to support both, and I cannot afford that (at least I cannot
> > > > commit today, neither the other Lukasz).
> > > > 
> > > > While I’m ready to rework the Flexible config and prepare it to be
> > > > extended later to handle the Private variant, the 2nd version of these
> > > > patches will still support the Flexible flavor only.
> > > > 
> > > > I will include appropriate TODO/open in the next cover letter.
> > > > 
> > > 
> > > The summary of my thoughts, so far:
> > > - I'm going to introduce sriov_v{q,i}_flexible and better defaults,
> > >   according to your suggestion (as far as I understand your intentions,
> > >   please correct me if I've missed something).
> > > - The Private SR/IOV flavor, if it's ever implemented, could introduce
> > >   sriov_vq_private_per_vf.
> > > - The updated formulas are listed below.
> > > 
> > > Flexible resources not supported (not implemented):
> > >  - Private resources for PF     = max_ioqpairs
> > >  - Private resources per VF     = sriov_vq_private_per_vf
> > 
> > I would just keep it simple and say, if sriov_v{q,i}_flexible is not
> > set, then each VF gets max_ioqpairs private resources.
> > 
> 
> Since you did request more tuning knobs for the Flexible variant, the
> Private one should follow that and allow full configuration. A device
> where PF.priv=64 and each VF.priv=4 makes sense, and I couldn’t
> configure it if sriov_v{q,i}_flexible=0 enabled the Private mode.
> 

It was just to simplify, I am just fine with having
`sriov_vq_private_per_vf` :)

> > >  - (error if sriov_vq_flexible is set)
> > > 
> > > With flexible resources:
> > >  - VQPRT, private resources for PF      = max_ioqpairs - sriov_vq_flexible
> > >  - VQFRT, total flexible resources      = sriov_vq_flexible (if set, or)
> > >                                           VQPRT * num_vfs
> > >  - VQFRSM, maximum assignable per VF    = sriov_max_vq_per_vf (if set, or)
> > >                                           VQPRT
> > 
> > You mean VQFRT here, right?
> > 
> 
> VQPRT is right, and – in my opinion – makes a better default than VQFRT.
> 
> E.g., configuring a device:
> 
> (max_vfs=32, PF.priv=VQPRT=X, PF.flex_total=VQFRT=256)
> 
> as (num_vfs=1, VF0.flex=256) doesn’t make much sense. Virtualization is
> not needed in such case, and user should probably use PF directly. On
> the other hand, VQPRT is probably tuned to offer most (if not all) of
> the performance and functionality; thus serves as a sane default.
> 

Alright.

> > >  - VQGRAN, granularity                  = #define constant
> > 
> > Yeah, 1 seems pretty reasonable here.
> > 
> > >  - (error if sriov_vq_private_per_vf is set)
> > > 
> > > Is this version acceptable?
> > > 
> > 
> > Sounds good to me. The only one I am not too happy about is the default
> > of VQPRT * num_vfs. (i.e. max_ioqpairs * num_vfs) when vq_flexible is
> > not set. I think this is the case where we should default to private
> > resources. If you don't want to work with private resources right now,
> > can we instead have it bug out and complain that sriov_vq_flexible must
> > be set? We can then later lift that restriction and implement private
> > resources.
> 
> I would prefer reserving sriov_v{q,i}_flexible=0 for now. That's my current
> plan for V2.
> 

Alright.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 05/15] hw/nvme: Add support for SR-IOV
  2021-11-08  7:56         ` Klaus Jensen
@ 2021-11-10 13:42           ` Lukasz Maniak
  2021-11-10 16:39             ` Klaus Jensen
  0 siblings, 1 reply; 55+ messages in thread
From: Lukasz Maniak @ 2021-11-10 13:42 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk, qemu-devel,
	Keith Busch

On Mon, Nov 08, 2021 at 08:56:43AM +0100, Klaus Jensen wrote:
> On Nov  4 15:30, Lukasz Maniak wrote:
> > On Tue, Nov 02, 2021 at 06:33:31PM +0100, Lukasz Maniak wrote:
> > > On Tue, Nov 02, 2021 at 03:33:15PM +0100, Klaus Jensen wrote:
> > > > On Oct  7 18:23, Lukasz Maniak wrote:
> > > > > This patch implements initial support for Single Root I/O Virtualization
> > > > > on an NVMe device.
> > > > > 
> > > > > Essentially, it allows to define the maximum number of virtual functions
> > > > > supported by the NVMe controller via sriov_max_vfs parameter.
> > > > > 
> > > > > Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV
> > > > > capability by a physical controller and ARI capability by both the
> > > > > physical and virtual function devices.
> > > > > 
> > > > > NVMe controllers created via virtual functions mirror functionally
> > > > > the physical controller, which may not entirely be the case, thus
> > > > > consideration would be needed on the way to limit the capabilities of
> > > > > the VF.
> > > > > 
> > > > > NVMe subsystem is required for the use of SR-IOV.
> > > > > 
> > > > > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> > > > > ---
> > > > >  hw/nvme/ctrl.c           | 74 ++++++++++++++++++++++++++++++++++++++--
> > > > >  hw/nvme/nvme.h           |  1 +
> > > > >  include/hw/pci/pci_ids.h |  1 +
> > > > >  3 files changed, 73 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > > > index 6a571d18cf..ad79ff0c00 100644
> > > > > --- a/hw/nvme/ctrl.c
> > > > > +++ b/hw/nvme/ctrl.c
> > > > > @@ -6361,8 +6406,12 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
> > > > >                            n->reg_size);
> > > > >      memory_region_add_subregion(&n->bar0, 0, &n->iomem);
> > > > >  
> > > > > -    pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> > > > > -                     PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
> > > > > +    if (pci_is_vf(pci_dev)) {
> > > > > +        pcie_sriov_vf_register_bar(pci_dev, 0, &n->bar0);
> > > > > +    } else {
> > > > > +        pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> > > > > +                         PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
> > > > > +    }
> > > > 
> > > > I assume that the assert we are seeing means that the pci_register_bars
> > > > in nvme_init_cmb and nvme_init_pmr must be changed similarly to this.
> > > 
> > > Assert will only arise for CMB as VF params are initialized with PF
> > > params.
> > > 
> > > @@ -6532,6 +6585,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> > >      NvmeCtrl *n = NVME(pci_dev);
> > >      NvmeNamespace *ns;
> > >      Error *local_err = NULL;
> > > +    NvmeCtrl *pn = NVME(pcie_sriov_get_pf(pci_dev));
> > > +
> > > +    if (pci_is_vf(pci_dev)) {
> > > +        /* VFs derive settings from the parent. PF's lifespan exceeds
> > > +         * that of VF's, so it's safe to share params.serial.
> > > +         */
> > > +        memcpy(&n->params, &pn->params, sizeof(NvmeParams));
> > > +        n->subsys = pn->subsys;
> > > +    }
> > >  
> > >      nvme_check_constraints(n, &local_err);
> > >      if (local_err) {
> > > 
> > > The following simple fix will both fix assert and also allow
> > > each VF to have its own CMB of the size defined for PF.
> > > 
> > > ---
> > >  hw/nvme/ctrl.c | 13 +++++++++----
> > >  1 file changed, 9 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > index 19b32dd4da..99daa6290c 100644
> > > --- a/hw/nvme/ctrl.c
> > > +++ b/hw/nvme/ctrl.c
> > > @@ -6837,10 +6837,15 @@ static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > >      n->cmb.buf = g_malloc0(cmb_size);
> > >      memory_region_init_io(&n->cmb.mem, OBJECT(n), &nvme_cmb_ops, n,
> > >                            "nvme-cmb", cmb_size);
> > > -    pci_register_bar(pci_dev, NVME_CMB_BIR,
> > > -                     PCI_BASE_ADDRESS_SPACE_MEMORY |
> > > -                     PCI_BASE_ADDRESS_MEM_TYPE_64 |
> > > -                     PCI_BASE_ADDRESS_MEM_PREFETCH, &n->cmb.mem);
> > > +
> > > +    if (pci_is_vf(pci_dev)) {
> > > +        pcie_sriov_vf_register_bar(pci_dev, NVME_CMB_BIR, &n->cmb.mem);
> > > +    } else {
> > > +        pci_register_bar(pci_dev, NVME_CMB_BIR,
> > > +                        PCI_BASE_ADDRESS_SPACE_MEMORY |
> > > +                        PCI_BASE_ADDRESS_MEM_TYPE_64 |
> > > +                        PCI_BASE_ADDRESS_MEM_PREFETCH, &n->cmb.mem);
> > > +    }
> > >  
> > >      NVME_CAP_SET_CMBS(cap, 1);
> > >      stq_le_p(&n->bar.cap, cap);
> > > 
> > > As for PMR, it is currently only available on PF, as only PF is capable
> > > of specifying the memory-backend-file object to use with PMR.
> > > Otherwise, either VFs would have to share the PMR with its PF, or there
> > > would be a requirement to define a memory-backend-file object for each VF.
> > 
> > Hi Klaus,
> > 
> > After some discussion, we decided to prohibit in V2 the use of CMB and
> > PMR in combination with SR-IOV.
> > 
> > While the implementation of CMB with SR-IOV is relatively
> > straightforward, PMR is not. We are committed to consistency in CMB and
> > PMR design in association with SR-IOV. So we considered it best to
> > disable both features and implement them in separate patches.
> > 
> 
> I am completely fine with that. However, since we are copying the
> parameters verbatim, it would be nice if `info qtree` reflected this
> difference (i.e. that parameters such as cmb_size_mb read 0 for the
> virtual controllers).
> 

Hi Klaus,

Literal copying will still be correct and there will be no difference
between PF and VF, since by "prohibit" we mean disabling the interaction
between SR-IOV functionality and CMB/PMR for the PF as well.

if (params->sriov_max_vfs) {
    if (!n->subsys) {
        error_setg(errp, "subsystem is required for the use of SR-IOV");
        return;
    }

    if (params->sriov_max_vfs > NVME_MAX_VFS) {
        error_setg(errp, "sriov_max_vfs must be between 0 and %d",
                   NVME_MAX_VFS);
        return;
    }

    if (params->cmb_size_mb) {
        error_setg(errp, "CMB is not supported with SR-IOV");
        return;
    }

    if (n->pmr.dev) {
        error_setg(errp, "PMR is not supported with SR-IOV");
        return;
    }

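For reference, a hypothetical invocation that the checks above would reject at
realize time (device and parameter names as used in this series, values made up):

    -device nvme-subsys,id=subsys0 \
    -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=2,cmb_size_mb=64

would now fail with "CMB is not supported with SR-IOV" rather than silently
copying cmb_size_mb to the VFs.
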
Regards,
Lukasz


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 05/15] hw/nvme: Add support for SR-IOV
  2021-11-10 13:42           ` Lukasz Maniak
@ 2021-11-10 16:39             ` Klaus Jensen
  0 siblings, 0 replies; 55+ messages in thread
From: Klaus Jensen @ 2021-11-10 16:39 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk, qemu-devel,
	Keith Busch


On Nov 10 14:42, Lukasz Maniak wrote:
> On Mon, Nov 08, 2021 at 08:56:43AM +0100, Klaus Jensen wrote:
> > On Nov  4 15:30, Lukasz Maniak wrote:
> > > On Tue, Nov 02, 2021 at 06:33:31PM +0100, Lukasz Maniak wrote:
> > > > On Tue, Nov 02, 2021 at 03:33:15PM +0100, Klaus Jensen wrote:
> > > > > On Oct  7 18:23, Lukasz Maniak wrote:
> > > > > > This patch implements initial support for Single Root I/O Virtualization
> > > > > > on an NVMe device.
> > > > > > 
> > > > > > Essentially, it allows to define the maximum number of virtual functions
> > > > > > supported by the NVMe controller via sriov_max_vfs parameter.
> > > > > > 
> > > > > > Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV
> > > > > > capability by a physical controller and ARI capability by both the
> > > > > > physical and virtual function devices.
> > > > > > 
> > > > > > NVMe controllers created via virtual functions mirror functionally
> > > > > > the physical controller, which may not entirely be the case, thus
> > > > > > consideration would be needed on the way to limit the capabilities of
> > > > > > the VF.
> > > > > > 
> > > > > > NVMe subsystem is required for the use of SR-IOV.
> > > > > > 
> > > > > > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> > > > > > ---
> > > > > >  hw/nvme/ctrl.c           | 74 ++++++++++++++++++++++++++++++++++++++--
> > > > > >  hw/nvme/nvme.h           |  1 +
> > > > > >  include/hw/pci/pci_ids.h |  1 +
> > > > > >  3 files changed, 73 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > > > > index 6a571d18cf..ad79ff0c00 100644
> > > > > > --- a/hw/nvme/ctrl.c
> > > > > > +++ b/hw/nvme/ctrl.c
> > > > > > @@ -6361,8 +6406,12 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
> > > > > >                            n->reg_size);
> > > > > >      memory_region_add_subregion(&n->bar0, 0, &n->iomem);
> > > > > >  
> > > > > > -    pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> > > > > > -                     PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
> > > > > > +    if (pci_is_vf(pci_dev)) {
> > > > > > +        pcie_sriov_vf_register_bar(pci_dev, 0, &n->bar0);
> > > > > > +    } else {
> > > > > > +        pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
> > > > > > +                         PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
> > > > > > +    }
> > > > > 
> > > > > I assume that the assert we are seeing means that the pci_register_bars
> > > > > in nvme_init_cmb and nvme_init_pmr must be changed similarly to this.
> > > > 
> > > > Assert will only arise for CMB as VF params are initialized with PF
> > > > params.
> > > > 
> > > > @@ -6532,6 +6585,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> > > >      NvmeCtrl *n = NVME(pci_dev);
> > > >      NvmeNamespace *ns;
> > > >      Error *local_err = NULL;
> > > > +    NvmeCtrl *pn = NVME(pcie_sriov_get_pf(pci_dev));
> > > > +
> > > > +    if (pci_is_vf(pci_dev)) {
> > > > +        /* VFs derive settings from the parent. PF's lifespan exceeds
> > > > +         * that of VF's, so it's safe to share params.serial.
> > > > +         */
> > > > +        memcpy(&n->params, &pn->params, sizeof(NvmeParams));
> > > > +        n->subsys = pn->subsys;
> > > > +    }
> > > >  
> > > >      nvme_check_constraints(n, &local_err);
> > > >      if (local_err) {
> > > > 
> > > > The following simple fix will both fix assert and also allow
> > > > each VF to have its own CMB of the size defined for PF.
> > > > 
> > > > ---
> > > >  hw/nvme/ctrl.c | 13 +++++++++----
> > > >  1 file changed, 9 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > > index 19b32dd4da..99daa6290c 100644
> > > > --- a/hw/nvme/ctrl.c
> > > > +++ b/hw/nvme/ctrl.c
> > > > @@ -6837,10 +6837,15 @@ static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > > >      n->cmb.buf = g_malloc0(cmb_size);
> > > >      memory_region_init_io(&n->cmb.mem, OBJECT(n), &nvme_cmb_ops, n,
> > > >                            "nvme-cmb", cmb_size);
> > > > -    pci_register_bar(pci_dev, NVME_CMB_BIR,
> > > > -                     PCI_BASE_ADDRESS_SPACE_MEMORY |
> > > > -                     PCI_BASE_ADDRESS_MEM_TYPE_64 |
> > > > -                     PCI_BASE_ADDRESS_MEM_PREFETCH, &n->cmb.mem);
> > > > +
> > > > +    if (pci_is_vf(pci_dev)) {
> > > > +        pcie_sriov_vf_register_bar(pci_dev, NVME_CMB_BIR, &n->cmb.mem);
> > > > +    } else {
> > > > +        pci_register_bar(pci_dev, NVME_CMB_BIR,
> > > > +                        PCI_BASE_ADDRESS_SPACE_MEMORY |
> > > > +                        PCI_BASE_ADDRESS_MEM_TYPE_64 |
> > > > +                        PCI_BASE_ADDRESS_MEM_PREFETCH, &n->cmb.mem);
> > > > +    }
> > > >  
> > > >      NVME_CAP_SET_CMBS(cap, 1);
> > > >      stq_le_p(&n->bar.cap, cap);
> > > > 
> > > > As for PMR, it is currently only available on PF, as only PF is capable
> > > > of specifying the memory-backend-file object to use with PMR.
> > > > Otherwise, either VFs would have to share the PMR with its PF, or there
> > > > would be a requirement to define a memory-backend-file object for each VF.
> > > 
> > > Hi Klaus,
> > > 
> > > After some discussion, we decided to prohibit in V2 the use of CMB and
> > > PMR in combination with SR-IOV.
> > > 
> > > While the implementation of CMB with SR-IOV is relatively
> > > straightforward, PMR is not. We are committed to consistency in CMB and
> > > PMR design in association with SR-IOV. So we considered it best to
> > > disable both features and implement them in separate patches.
> > > 
> > 
> > I am completely fine with that. However, since we are copying the
> > parameters verbatim, it would be nice if `info qtree` reflected this
> > difference (i.e. that parameters such as cmb_size_mb read 0 for the
> > virtual controllers).
> > 
> 
> Hi Klaus,
> 
> Literal copying will still be correct and there will be no difference
> between PF and VF, since by "prohibit" we mean disabling the interaction
> between SR-IOV functionality and CMB/PMR for the PF as well.
> 
> if (params->sriov_max_vfs) {
>     if (!n->subsys) {
>         error_setg(errp, "subsystem is required for the use of SR-IOV");
>         return;
>     }
> 
>     if (params->sriov_max_vfs > NVME_MAX_VFS) {
>         error_setg(errp, "sriov_max_vfs must be between 0 and %d",
>                    NVME_MAX_VFS);
>         return;
>     }
> 
>     if (params->cmb_size_mb) {
>         error_setg(errp, "CMB is not supported with SR-IOV");
>         return;
>     }
> 
>     if (n->pmr.dev) {
>         error_setg(errp, "PMR is not supported with SR-IOV");
>         return;
>     }
> 

Right. Understood.


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2021-11-10 16:45 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-07 16:23 [PATCH 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
2021-10-07 16:23 ` [PATCH 01/15] pcie: Set default and supported MaxReadReq to 512 Lukasz Maniak
2021-10-07 22:12   ` Michael S. Tsirkin
2021-10-26 14:36     ` Lukasz Maniak
2021-10-26 15:37       ` Knut Omang
2021-10-07 16:23 ` [PATCH 02/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Lukasz Maniak
2021-10-07 16:23 ` [PATCH 03/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt Lukasz Maniak
2021-10-07 16:23 ` [PATCH 04/15] pcie: Add callback preceding SR-IOV VFs update Lukasz Maniak
2021-10-12  7:25   ` Michael S. Tsirkin
2021-10-12 16:06     ` Lukasz Maniak
2021-10-13  9:10       ` Michael S. Tsirkin
2021-10-15 16:24         ` Lukasz Maniak
2021-10-15 17:30           ` Michael S. Tsirkin
2021-10-20 13:30             ` Lukasz Maniak
2021-10-07 16:23 ` [PATCH 05/15] hw/nvme: Add support for SR-IOV Lukasz Maniak
2021-10-20 19:07   ` Klaus Jensen
2021-10-21 14:33     ` Lukasz Maniak
2021-11-02 14:33   ` Klaus Jensen
2021-11-02 17:33     ` Lukasz Maniak
2021-11-04 14:30       ` Lukasz Maniak
2021-11-08  7:56         ` Klaus Jensen
2021-11-10 13:42           ` Lukasz Maniak
2021-11-10 16:39             ` Klaus Jensen
2021-10-07 16:23 ` [PATCH 06/15] hw/nvme: Add support for Primary Controller Capabilities Lukasz Maniak
2021-11-02 14:34   ` Klaus Jensen
2021-10-07 16:23 ` [PATCH 07/15] hw/nvme: Add support for Secondary Controller List Lukasz Maniak
2021-11-02 14:35   ` Klaus Jensen
2021-10-07 16:23 ` [PATCH 08/15] pcie: Add 1.2 version token for the Power Management Capability Lukasz Maniak
2021-10-07 16:24 ` [PATCH 09/15] hw/nvme: Implement the Function Level Reset Lukasz Maniak
2021-11-02 14:35   ` Klaus Jensen
2021-10-07 16:24 ` [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime Lukasz Maniak
2021-10-18 10:06   ` Philippe Mathieu-Daudé
2021-10-18 15:53     ` Łukasz Gieryk
2021-10-20 19:06   ` Klaus Jensen
2021-10-21 13:40     ` Łukasz Gieryk
2021-11-03 12:11       ` Klaus Jensen
2021-10-20 19:26   ` Klaus Jensen
2021-10-07 16:24 ` [PATCH 11/15] hw/nvme: Calculate BAR atributes in a function Lukasz Maniak
2021-10-18  9:52   ` Philippe Mathieu-Daudé
2021-10-07 16:24 ` [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers Lukasz Maniak
2021-11-03 12:07   ` Klaus Jensen
2021-11-04 15:48     ` Łukasz Gieryk
2021-11-05  8:46       ` Łukasz Gieryk
2021-11-05 14:04         ` Łukasz Gieryk
2021-11-08  8:25           ` Klaus Jensen
2021-11-08 13:57             ` Łukasz Gieryk
2021-11-09 12:22               ` Klaus Jensen
2021-10-07 16:24 ` [PATCH 13/15] pcie: Add helpers to the SR/IOV API Lukasz Maniak
2021-10-26 16:57   ` Knut Omang
2021-10-07 16:24 ` [PATCH 14/15] hw/nvme: Add support for the Virtualization Management command Lukasz Maniak
2021-10-07 16:24 ` [PATCH 15/15] docs: Add documentation for SR-IOV and Virtualization Enhancements Lukasz Maniak
2021-10-08  6:31 ` [PATCH 00/15] hw/nvme: SR-IOV with " Klaus Jensen
2021-10-26 18:20 ` Klaus Jensen
2021-10-27 16:49   ` Lukasz Maniak
2021-11-02  7:24     ` Klaus Jensen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).