* [PATCH] PCI: Add Intel remapped NVMe device support
From: Daniel Drake @ 2019-06-10  7:44 UTC (permalink / raw)
  To: bhelgaas, axboe, kbusch, hch
  Cc: linux-pci, linux-ide, sagi, linux-nvme, linux

Consumer products that are configured by default to run the Intel SATA AHCI
controller in "RAID" or "Intel RST Premium With Intel Optane System
Acceleration" mode are becoming increasingly prevalent.

Under this mode, NVMe devices are remapped into the SATA device and become
hidden from the PCI bus, which means that Linux users cannot access their
storage devices unless they go into the firmware setup menu to revert back
to AHCI mode - assuming such an option is available. Lack of support for this
mode is also causing complications for vendors who distribute Linux.

Add support for the remapped NVMe mode by creating a virtual PCI bus,
where the AHCI and NVMe devices are presented separately, allowing the
ahci and nvme drivers to bind in the normal way.

Unfortunately the NVMe device configuration space is inaccessible under
this scheme, so we provide a fake one, and hope that no DeviceID-based
quirks are needed. The interrupt is shared between the AHCI and NVMe
devices.

The existing ahci driver is modified to not claim devices where remapped
NVMe devices are present, allowing this new driver to step in.

The details of the remapping scheme came from patches previously
posted by Dan Williams and the resulting discussion.

Signed-off-by: Daniel Drake <drake@endlessm.com>
---
 drivers/ata/ahci.c                        |  25 +-
 drivers/pci/controller/Kconfig            |  16 +
 drivers/pci/controller/Makefile           |   1 +
 drivers/pci/controller/intel-nvme-remap.c | 461 ++++++++++++++++++++++
 4 files changed, 493 insertions(+), 10 deletions(-)
 create mode 100644 drivers/pci/controller/intel-nvme-remap.c

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index f7652baa6337..75c5733cbae9 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1499,7 +1499,7 @@ static irqreturn_t ahci_thunderx_irq_handler(int irq, void *dev_instance)
 }
 #endif
 
-static void ahci_remap_check(struct pci_dev *pdev, int bar,
+static int ahci_remap_check(struct pci_dev *pdev, int bar,
 		struct ahci_host_priv *hpriv)
 {
 	int i, count = 0;
@@ -1512,7 +1512,7 @@ static void ahci_remap_check(struct pci_dev *pdev, int bar,
 	    pci_resource_len(pdev, bar) < SZ_512K ||
 	    bar != AHCI_PCI_BAR_STANDARD ||
 	    !(readl(hpriv->mmio + AHCI_VSCAP) & 1))
-		return;
+		return 0;
 
 	cap = readq(hpriv->mmio + AHCI_REMAP_CAP);
 	for (i = 0; i < AHCI_MAX_REMAP; i++) {
@@ -1527,17 +1527,20 @@ static void ahci_remap_check(struct pci_dev *pdev, int bar,
 	}
 
 	if (!count)
-		return;
+		return 0;
+
+	/* Abort probe, allowing intel-nvme-remap to step in when available */
+	if (IS_ENABLED(CONFIG_INTEL_NVME_REMAP)) {
+		dev_info(&pdev->dev,
+			 "Device will be handled by intel-nvme-remap.\n");
+		return -ENODEV;
+	}
 
 	dev_warn(&pdev->dev, "Found %d remapped NVMe devices.\n", count);
 	dev_warn(&pdev->dev,
-		 "Switch your BIOS from RAID to AHCI mode to use them.\n");
+		 "Enable intel-nvme-remap or switch your BIOS to AHCI mode to use them.\n");
 
-	/*
-	 * Don't rely on the msi-x capability in the remap case,
-	 * share the legacy interrupt across ahci and remapped devices.
-	 */
-	hpriv->flags |= AHCI_HFLAG_NO_MSI;
+	return 0;
 }
 
 static int ahci_get_irq_vector(struct ata_host *host, int port)
@@ -1717,7 +1720,9 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	hpriv->mmio = pcim_iomap_table(pdev)[ahci_pci_bar];
 
 	/* detect remapped nvme devices */
-	ahci_remap_check(pdev, ahci_pci_bar, hpriv);
+	rc = ahci_remap_check(pdev, ahci_pci_bar, hpriv);
+	if (rc)
+		return rc;
 
 	/* must set flag prior to save config in order to take effect */
 	if (ahci_broken_devslp(pdev))
diff --git a/drivers/pci/controller/Kconfig b/drivers/pci/controller/Kconfig
index 011c57cae4b0..20bf2b528c5f 100644
--- a/drivers/pci/controller/Kconfig
+++ b/drivers/pci/controller/Kconfig
@@ -265,6 +265,22 @@ config PCIE_TANGO_SMP8759
 	  This can lead to data corruption if drivers perform concurrent
 	  config and MMIO accesses.
 
+config INTEL_NVME_REMAP
+	tristate "Support Intel NVMe remap via AHCI"
+	depends on X86_64
+	help
+	  Adds support for hidden NVMe devices remapped into AHCI memory
+	  on Intel platforms.
+
+	  As an alternative to this driver, it is sometimes possible to
+	  make such devices appear in the normal way by configuring your
+	  SATA controller to AHCI mode in the firmware setup menu.
+
+	  Say Y here if you want to access the remapped devices.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called intel-nvme-remap.
+
 config VMD
 	depends on PCI_MSI && X86_64 && SRCU
 	select X86_DEV_DMA_OPS
diff --git a/drivers/pci/controller/Makefile b/drivers/pci/controller/Makefile
index d56a507495c5..d4386025b0d1 100644
--- a/drivers/pci/controller/Makefile
+++ b/drivers/pci/controller/Makefile
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_INTEL_NVME_REMAP) += intel-nvme-remap.o
 obj-$(CONFIG_PCIE_CADENCE) += pcie-cadence.o
 obj-$(CONFIG_PCIE_CADENCE_HOST) += pcie-cadence-host.o
 obj-$(CONFIG_PCIE_CADENCE_EP) += pcie-cadence-ep.o
diff --git a/drivers/pci/controller/intel-nvme-remap.c b/drivers/pci/controller/intel-nvme-remap.c
new file mode 100644
index 000000000000..c5ac712443aa
--- /dev/null
+++ b/drivers/pci/controller/intel-nvme-remap.c
@@ -0,0 +1,461 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Intel remapped NVMe device support.
+ *
+ * Copyright (c) 2019 Endless Mobile, Inc.
+ * Author: Daniel Drake <drake@endlessm.com>
+ *
+ * Some products ship by default with the SATA controller in "RAID" or
+ * "Intel RST Premium With Intel Optane System Acceleration" mode. Under this
+ * mode, which we refer to as "remapped NVMe" mode, any installed NVMe
+ * devices disappear from the PCI bus, and instead their I/O memory becomes
+ * available within the AHCI device BARs.
+ *
+ * This scheme is understood to be a way of avoiding usage of the standard
+ * Windows NVMe driver under that OS, instead mandating usage of Intel's
+ * own driver, which has better power management, and presumably offers
+ * some RAID/disk-caching solutions too.
+ *
+ * Here in this driver, we support the remapped NVMe mode by claiming the
+ * AHCI device and creating a fake PCIe root port. On the new bus, the
+ * original AHCI device is exposed with only minor tweaks. Then, fake PCI
+ * devices corresponding to the remapped NVMe devices are created. The usual
+ * ahci and nvme drivers are then expected to bind to these devices and
+ * operate as normal.
+ *
+ * The PCI configuration space for the NVMe devices is completely
+ * unavailable, so we fake a minimal one and hope for the best.
+ *
+ * Interrupts are shared between the AHCI and NVMe devices. For simplicity,
+ * we only support the legacy interrupt here, although MSI support
+ * could potentially be added later.
+ */
+
+#define MODULE_NAME "intel-nvme-remap"
+
+#include <linux/ahci-remap.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+
+#define AHCI_PCI_BAR_STANDARD 5
+
+struct nvme_remap_dev {
+	struct pci_dev		*dev;		/* AHCI device */
+	struct pci_bus		*bus;		/* our fake PCI bus */
+	struct pci_sysdata	sysdata;
+	int			irq_base;	/* our fake interrupts */
+
+	/*
+	 * When we detect an all-ones write to a BAR register, this flag
+	 * is set, so that we return the BAR size on the next read (a
+	 * standard PCI behaviour).
+	 * This includes the assumption that an all-ones BAR write is
+	 * immediately followed by a read of the same register.
+	 */
+	bool			bar_sizing;
+
+	/*
+	 * Resources copied from the AHCI device, to be regarded as
+	 * resources on our fake bus.
+	 */
+	struct resource		ahci_resources[PCI_NUM_RESOURCES];
+
+	/* Resources corresponding to the NVMe devices. */
+	struct resource		remapped_dev_mem[AHCI_MAX_REMAP];
+
+	/* Number of remapped NVMe devices found. */
+	int			num_remapped_devices;
+};
+
+static inline struct nvme_remap_dev *nrdev_from_bus(struct pci_bus *bus)
+{
+	return container_of(bus->sysdata, struct nvme_remap_dev, sysdata);
+}
+
+
+/******** PCI configuration space **********/
+
+/*
+ * Helper macros for tweaking returned contents of PCI configuration space.
+ *
+ * value contains len bytes of data read from reg.
+ * If fixup_reg is included in that range, fix up the contents of that
+ * register to fixed_value.
+ */
+#define NR_FIX8(fixup_reg, fixed_value) do { \
+		if (reg <= fixup_reg && fixup_reg < reg + len) \
+			((u8 *) value)[fixup_reg - reg] = (u8) (fixed_value); \
+	} while (0)
+
+#define NR_FIX16(fixup_reg, fixed_value) do { \
+		NR_FIX8(fixup_reg, fixed_value); \
+		NR_FIX8(fixup_reg + 1, fixed_value >> 8); \
+	} while (0)
+
+#define NR_FIX24(fixup_reg, fixed_value) do { \
+		NR_FIX8(fixup_reg, fixed_value); \
+		NR_FIX8(fixup_reg + 1, fixed_value >> 8); \
+		NR_FIX8(fixup_reg + 2, fixed_value >> 16); \
+	} while (0)
+
+#define NR_FIX32(fixup_reg, fixed_value) do { \
+		NR_FIX16(fixup_reg, (u16) fixed_value); \
+		NR_FIX16(fixup_reg + 2, fixed_value >> 16); \
+	} while (0)
+
+/*
+ * Read PCI config space of the slot 0 (AHCI) device.
+ * We pass through the read request to the underlying device, but
+ * tweak the results in some cases.
+ */
+static int nvme_remap_pci_read_slot0(struct pci_bus *bus, int reg,
+				     int len, u32 *value)
+{
+	struct nvme_remap_dev *nrdev = nrdev_from_bus(bus);
+	struct pci_bus *ahci_dev_bus = nrdev->dev->bus;
+	int ret;
+
+	ret = ahci_dev_bus->ops->read(ahci_dev_bus, nrdev->dev->devfn,
+				      reg, len, value);
+	if (ret)
+		return ret;
+
+	/*
+	 * Adjust the device class, to prevent this driver from attempting to
+	 * additionally probe the device we're simulating here.
+	 */
+	NR_FIX24(PCI_CLASS_PROG, PCI_CLASS_STORAGE_SATA_AHCI);
+
+	/*
+	 * Unset interrupt pin, otherwise ACPI tries to find routing
+	 * info for our virtual IRQ, fails, and complains.
+	 */
+	NR_FIX8(PCI_INTERRUPT_PIN, 0);
+
+	/*
+	 * Truncate the AHCI BAR to not include the region that covers the
+	 * hidden devices. This will cause the ahci driver to successfully
+	 * probe the new device (instead of handing it over to this driver).
+	 */
+	if (nrdev->bar_sizing) {
+		NR_FIX32(PCI_BASE_ADDRESS_5, ~(SZ_16K - 1));
+		nrdev->bar_sizing = false;
+	}
+
+	return PCIBIOS_SUCCESSFUL;
+}
+
+/*
+ * Read PCI config space of a remapped device.
+ * Since the original PCI config space is inaccessible, we provide a minimal,
+ * fake config space instead.
+ */
+static int nvme_remap_pci_read_remapped(struct pci_bus *bus, unsigned int port,
+					int reg, int len, u32 *value)
+{
+	struct nvme_remap_dev *nrdev = nrdev_from_bus(bus);
+	struct resource *remapped_mem;
+
+	if (port > nrdev->num_remapped_devices)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
+	*value = 0;
+	remapped_mem = &nrdev->remapped_dev_mem[port - 1];
+
+	/* Set a Vendor ID, otherwise Linux assumes no device is present */
+	NR_FIX16(PCI_VENDOR_ID, PCI_VENDOR_ID_INTEL);
+
+	/* Always appear on & bus mastering */
+	NR_FIX16(PCI_COMMAND, PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER);
+
+	/* Set class so that nvme driver probes us */
+	NR_FIX24(PCI_CLASS_PROG, PCI_CLASS_STORAGE_EXPRESS);
+
+	if (nrdev->bar_sizing) {
+		NR_FIX32(PCI_BASE_ADDRESS_0,
+			 ~(resource_size(remapped_mem) - 1));
+		nrdev->bar_sizing = false;
+	} else {
+		resource_size_t mem_start = remapped_mem->start;
+
+		mem_start |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+		NR_FIX32(PCI_BASE_ADDRESS_0, mem_start);
+		mem_start >>= 32;
+		NR_FIX32(PCI_BASE_ADDRESS_1, mem_start);
+	}
+
+	return PCIBIOS_SUCCESSFUL;
+}
+
+/* Read PCI configuration space. */
+static int nvme_remap_pci_read(struct pci_bus *bus, unsigned int devfn,
+			       int reg, int len, u32 *value)
+{
+	if (PCI_SLOT(devfn) == 0)
+		return nvme_remap_pci_read_slot0(bus, reg, len, value);
+	else
+		return nvme_remap_pci_read_remapped(bus, PCI_SLOT(devfn),
+						    reg, len, value);
+}
+
+/*
+ * Write PCI config space of the slot 0 (AHCI) device.
+ * Apart from the special case of BAR sizing, we disable all writes.
+ * Otherwise, the ahci driver could make changes (e.g. unset PCI bus master)
+ * that would affect the operation of the NVMe devices.
+ */
+static int nvme_remap_pci_write_slot0(struct pci_bus *bus, int reg,
+				      int len, u32 value)
+{
+	struct nvme_remap_dev *nrdev = nrdev_from_bus(bus);
+	struct pci_bus *ahci_dev_bus = nrdev->dev->bus;
+
+	if (reg >= PCI_BASE_ADDRESS_0 && reg <= PCI_BASE_ADDRESS_5) {
+		/*
+		 * Writing all-ones to a BAR means that the size of the
+		 * memory region is being checked. Flag this so that we can
+		 * reply with an appropriate size on the next read.
+		 */
+		if (value == ~0)
+			nrdev->bar_sizing = true;
+
+		return ahci_dev_bus->ops->write(ahci_dev_bus,
+						nrdev->dev->devfn,
+						reg, len, value);
+	}
+
+	return PCIBIOS_SET_FAILED;
+}
+
+/*
+ * Write PCI config space of a remapped device.
+ * Since the original PCI config space is inaccessible, we reject all
+ * writes, except for the special case of BAR probing.
+ */
+static int nvme_remap_pci_write_remapped(struct pci_bus *bus,
+					 unsigned int port,
+					 int reg, int len, u32 value)
+{
+	struct nvme_remap_dev *nrdev = nrdev_from_bus(bus);
+
+	if (port > nrdev->num_remapped_devices)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
+	/*
+	 * Writing all-ones to a BAR means that the size of the memory
+	 * region is being checked. Flag this so that we can reply with
+	 * an appropriate size on the next read.
+	 */
+	if (value == ~0 && reg >= PCI_BASE_ADDRESS_0
+			&& reg <= PCI_BASE_ADDRESS_5) {
+		nrdev->bar_sizing = true;
+		return PCIBIOS_SUCCESSFUL;
+	}
+
+	return PCIBIOS_SET_FAILED;
+}
+
+/* Write PCI configuration space. */
+static int nvme_remap_pci_write(struct pci_bus *bus, unsigned int devfn,
+				int reg, int len, u32 value)
+{
+	if (PCI_SLOT(devfn) == 0)
+		return nvme_remap_pci_write_slot0(bus, reg, len, value);
+	else
+		return nvme_remap_pci_write_remapped(bus, PCI_SLOT(devfn),
+						     reg, len, value);
+}
+
+static struct pci_ops nvme_remap_pci_ops = {
+	.read	= nvme_remap_pci_read,
+	.write	= nvme_remap_pci_write,
+};
+
+
+/******** Initialization & exit **********/
+
+/*
+ * Find a PCI domain ID to use for our fake bus.
+ * Start at 0x10000 to not clash with ACPI _SEG domains (16 bits).
+ */
+static int find_free_domain(void)
+{
+	int domain = 0xffff;
+	struct pci_bus *bus = NULL;
+
+	while ((bus = pci_find_next_bus(bus)) != NULL)
+		domain = max_t(int, domain, pci_domain_nr(bus));
+
+	return domain + 1;
+}
+
+static int find_remapped_devices(struct nvme_remap_dev *nrdev,
+				 struct list_head *resources)
+{
+	void __iomem *mmio;
+	int i, count = 0;
+	u32 cap;
+
+	mmio = pcim_iomap(nrdev->dev, AHCI_PCI_BAR_STANDARD,
+			  pci_resource_len(nrdev->dev,
+					   AHCI_PCI_BAR_STANDARD));
+	if (!mmio)
+		return -ENODEV;
+
+	/* Check if this device might have remapped nvme devices. */
+	if (pci_resource_len(nrdev->dev, AHCI_PCI_BAR_STANDARD) < SZ_512K ||
+	    !(readl(mmio + AHCI_VSCAP) & 1))
+		return -ENODEV;
+
+	cap = readq(mmio + AHCI_REMAP_CAP);
+	for (i = 0; i < AHCI_MAX_REMAP; i++) {
+		struct resource *remapped_mem;
+
+		if ((cap & (1 << i)) == 0)
+			continue;
+		if (readl(mmio + ahci_remap_dcc(i))
+				!= PCI_CLASS_STORAGE_EXPRESS)
+			continue;
+
+		/* We've found a remapped device */
+		remapped_mem = &nrdev->remapped_dev_mem[count++];
+		remapped_mem->start =
+			pci_resource_start(nrdev->dev, AHCI_PCI_BAR_STANDARD)
+			+ ahci_remap_base(i);
+		remapped_mem->end = remapped_mem->start
+			+ AHCI_REMAP_N_SIZE - 1;
+		remapped_mem->flags = IORESOURCE_MEM | IORESOURCE_PCI_FIXED;
+		pci_add_resource(resources, remapped_mem);
+	}
+
+	pcim_iounmap(nrdev->dev, mmio);
+
+	if (count == 0)
+		return -ENODEV;
+
+	nrdev->num_remapped_devices = count;
+	dev_info(&nrdev->dev->dev, "Found %d remapped NVMe devices\n",
+		 nrdev->num_remapped_devices);
+	return 0;
+}
+
+static void nvme_remap_remove_root_bus(void *data)
+{
+	struct pci_bus *bus = data;
+
+	pci_stop_root_bus(bus);
+	pci_remove_root_bus(bus);
+}
+
+static int nvme_remap_probe(struct pci_dev *dev,
+			    const struct pci_device_id *id)
+{
+	struct nvme_remap_dev *nrdev;
+	LIST_HEAD(resources);
+	int i;
+	int ret;
+	struct pci_dev *child;
+
+	nrdev = devm_kzalloc(&dev->dev, sizeof(*nrdev), GFP_KERNEL);
+	nrdev->sysdata.domain = find_free_domain();
+	nrdev->dev = dev;
+	pci_set_drvdata(dev, nrdev);
+
+	ret = pcim_enable_device(dev);
+	if (ret < 0)
+		return ret;
+
+	pci_set_master(dev);
+
+	ret = find_remapped_devices(nrdev, &resources);
+	if (ret)
+		return ret;
+
+	/* Add resources from the original AHCI device */
+	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+		struct resource *res = &dev->resource[i];
+
+		if (res->start) {
+			struct resource *nr_res = &nrdev->ahci_resources[i];
+
+			nr_res->start = res->start;
+			nr_res->end = res->end;
+			nr_res->flags = res->flags;
+			pci_add_resource(&resources, nr_res);
+		}
+	}
+
+	/* Create virtual interrupts */
+	nrdev->irq_base = devm_irq_alloc_descs(&dev->dev, -1, 0,
+					       nrdev->num_remapped_devices + 1,
+					       0);
+	if (nrdev->irq_base < 0)
+		return nrdev->irq_base;
+
+	/* Create and populate PCI bus */
+	nrdev->bus = pci_create_root_bus(&dev->dev, 0, &nvme_remap_pci_ops,
+					 &nrdev->sysdata, &resources);
+	if (!nrdev->bus)
+		return -ENODEV;
+
+	if (devm_add_action_or_reset(&dev->dev, nvme_remap_remove_root_bus,
+				     nrdev->bus))
+		return -ENOMEM;
+
+	/* We don't support sharing MSI interrupts between these devices */
+	nrdev->bus->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
+
+	pci_scan_child_bus(nrdev->bus);
+
+	list_for_each_entry(child, &nrdev->bus->devices, bus_list) {
+		/*
+		 * Prevent PCI core from trying to move memory BARs around.
+		 * The hidden NVMe devices are at fixed locations.
+		 */
+		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+			struct resource *res = &child->resource[i];
+
+			if (res->flags & IORESOURCE_MEM)
+				res->flags |= IORESOURCE_PCI_FIXED;
+		}
+
+		/* Share the legacy IRQ between all devices */
+		child->irq = dev->irq;
+	}
+
+	pci_assign_unassigned_bus_resources(nrdev->bus);
+	pci_bus_add_devices(nrdev->bus);
+
+	return 0;
+}
+
+static const struct pci_device_id nvme_remap_ids[] = {
+	/*
+	 * Match all Intel RAID controllers.
+	 *
+	 * There's overlap here with the set of devices detected by the ahci
+	 * driver, but ahci will only successfully probe when there
+	 * *aren't* any remapped NVMe devices, and this driver will only
+	 * successfully probe when there *are* remapped NVMe devices that
+	 * need handling.
+	 */
+	{
+		PCI_VDEVICE(INTEL, PCI_ANY_ID),
+		.class = PCI_CLASS_STORAGE_RAID << 8,
+		.class_mask = 0xffffff00,
+	},
+	{0,}
+};
+MODULE_DEVICE_TABLE(pci, nvme_remap_ids);
+
+static struct pci_driver nvme_remap_drv = {
+	.name		= MODULE_NAME,
+	.id_table	= nvme_remap_ids,
+	.probe		= nvme_remap_probe,
+};
+module_pci_driver(nvme_remap_drv);
+
+MODULE_AUTHOR("Daniel Drake <drake@endlessm.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.20.1



* Re: [PATCH] PCI: Add Intel remapped NVMe device support
From: Keith Busch @ 2019-06-10 16:00 UTC (permalink / raw)
  To: Daniel Drake
  Cc: bhelgaas, Jens Axboe, Keith Busch, Christoph Hellwig, linux-pci,
	linux-ide, Sagi Grimberg, linux-nvme, linux

On Mon, Jun 10, 2019 at 1:45 AM Daniel Drake <drake@endlessm.com> wrote:
> +       /* We don't support sharing MSI interrupts between these devices */
> +       nrdev->bus->bus_flags |= PCI_BUS_FLAGS_NO_MSI;

And this is a problem, isn't it? Since we don't have an option to open
the MSI implementation in RAID mode, your experience will be much
better if you disable this mode when using Linux, as per the current
recommendation, rather than limping along with a legacy IRQ.


* Re: [PATCH] PCI: Add Intel remapped NVMe device support
From: Bjorn Helgaas @ 2019-06-10 21:16 UTC (permalink / raw)
  To: Daniel Drake
  Cc: axboe, kbusch, hch, linux-pci, linux, sagi, linux-nvme,
	linux-ide, Dan Williams, Alex Williamson

[+cc Dan, Alex]

On Mon, Jun 10, 2019 at 03:44:56PM +0800, Daniel Drake wrote:
> Consumer products that are configured by default to run the Intel SATA AHCI
> controller in "RAID" or "Intel RST Premium With Intel Optane System
> Acceleration" mode are becoming increasingly prevalent.
> 
> Under this mode, NVMe devices are remapped into the SATA device and become
> hidden from the PCI bus, which means that Linux users cannot access their
> storage devices unless they go into the firmware setup menu to revert back
> to AHCI mode - assuming such an option is available. Lack of support for this
> mode is also causing complications for vendors who distribute Linux.

Ugh.  Is there a spec that details what's actually going on here?
"Remapping" doesn't describe much other than to say "something magic
is happening here."

I'm guessing this is related to these:

  bfa9cb3e110c ("ahci-remap.h: add ahci remapping definitions")
  aecec8b60422 ("ahci: warn about remapped NVMe devices")

This driver makes a lot of assumptions about how this works, e.g.,
apparently there's an AHCI BAR that covers "hidden devices" plus some
other stuff of some magic size, whatever is special about device 0,
etc, but I don't see the source of those assumptions.

Why can't we use the device in "RAID" or "Intel RST Premium With Intel
Optane System Acceleration" mode?  Why doesn't Intel make a Linux
driver that works in that mode?

What do users see in that mode, i.e., how would they recognize that
they need this patch?

I'm not really keen on the precedent this sets about pretending things
are PCI when they're not.  This seems like a bit of a kludge that
might happen to work now but could easily break in the future because
it's not based on any spec we can rely on.  Plus it makes future PCI
maintenance harder because we have to worry about how these differ
from real PCI devices.

> Add support for the remapped NVMe mode by creating a virtual PCI bus,
> where the AHCI and NVMe devices are presented separately, allowing the
> ahci and nvme drivers to bind in the normal way.
> 
> Unfortunately the NVMe device configuration space is inaccessible under
> this scheme, so we provide a fake one, and hope that no DeviceID-based
> quirks are needed. The interrupt is shared between the AHCI and NVMe
> devices.
> 
> The existing ahci driver is modified to not claim devices where remapped
> NVMe devices are present, allowing this new driver to step in.
> 
> The details of the remapping scheme came from patches previously
> posted by Dan Williams and the resulting discussion.

Maybe these details answer some of my questions.  Please include the
URL (in addition to addressing the questions in the commit log).

> Signed-off-by: Daniel Drake <drake@endlessm.com>
> ---
>  drivers/ata/ahci.c                        |  25 +-
>  drivers/pci/controller/Kconfig            |  16 +
>  drivers/pci/controller/Makefile           |   1 +
>  drivers/pci/controller/intel-nvme-remap.c | 461 ++++++++++++++++++++++
>  4 files changed, 493 insertions(+), 10 deletions(-)
>  create mode 100644 drivers/pci/controller/intel-nvme-remap.c
> 
> diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> index f7652baa6337..75c5733cbae9 100644
> --- a/drivers/ata/ahci.c
> +++ b/drivers/ata/ahci.c
> @@ -1499,7 +1499,7 @@ static irqreturn_t ahci_thunderx_irq_handler(int irq, void *dev_instance)
>  }
>  #endif
>  
> -static void ahci_remap_check(struct pci_dev *pdev, int bar,
> +static int ahci_remap_check(struct pci_dev *pdev, int bar,
>  		struct ahci_host_priv *hpriv)
>  {
>  	int i, count = 0;
> @@ -1512,7 +1512,7 @@ static void ahci_remap_check(struct pci_dev *pdev, int bar,
>  	    pci_resource_len(pdev, bar) < SZ_512K ||
>  	    bar != AHCI_PCI_BAR_STANDARD ||
>  	    !(readl(hpriv->mmio + AHCI_VSCAP) & 1))
> -		return;
> +		return 0;
>  
>  	cap = readq(hpriv->mmio + AHCI_REMAP_CAP);
>  	for (i = 0; i < AHCI_MAX_REMAP; i++) {
> @@ -1527,17 +1527,20 @@ static void ahci_remap_check(struct pci_dev *pdev, int bar,
>  	}
>  
>  	if (!count)
> -		return;
> +		return 0;
> +
> +	/* Abort probe, allowing intel-nvme-remap to step in when available */
> +	if (IS_ENABLED(CONFIG_INTEL_NVME_REMAP)) {
> +		dev_info(&pdev->dev,
> +			 "Device will be handled by intel-nvme-remap.\n");
> +		return -ENODEV;
> +	}
>  
>  	dev_warn(&pdev->dev, "Found %d remapped NVMe devices.\n", count);
>  	dev_warn(&pdev->dev,
> -		 "Switch your BIOS from RAID to AHCI mode to use them.\n");
> +		 "Enable intel-nvme-remap or switch your BIOS to AHCI mode to use them.\n");

I don't think users will know how to "enable intel-nvme-remap".  I
think the message would have to mention a config option and rebuilding
the kernel.
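
For example, something along these lines (wording purely illustrative, not
a requirement on the exact text):

	dev_warn(&pdev->dev,
		 "Enable CONFIG_INTEL_NVME_REMAP and rebuild the kernel, or switch your BIOS from RAID to AHCI mode, to use them.\n");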

> -	/*
> -	 * Don't rely on the msi-x capability in the remap case,
> -	 * share the legacy interrupt across ahci and remapped devices.
> -	 */
> -	hpriv->flags |= AHCI_HFLAG_NO_MSI;
> +	return 0;
>  }
>  
>  static int ahci_get_irq_vector(struct ata_host *host, int port)
> @@ -1717,7 +1720,9 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	hpriv->mmio = pcim_iomap_table(pdev)[ahci_pci_bar];
>  
>  	/* detect remapped nvme devices */
> -	ahci_remap_check(pdev, ahci_pci_bar, hpriv);
> +	rc = ahci_remap_check(pdev, ahci_pci_bar, hpriv);
> +	if (rc)
> +		return rc;
>  
>  	/* must set flag prior to save config in order to take effect */
>  	if (ahci_broken_devslp(pdev))
> diff --git a/drivers/pci/controller/Kconfig b/drivers/pci/controller/Kconfig
> index 011c57cae4b0..20bf2b528c5f 100644
> --- a/drivers/pci/controller/Kconfig
> +++ b/drivers/pci/controller/Kconfig
> @@ -265,6 +265,22 @@ config PCIE_TANGO_SMP8759
>  	  This can lead to data corruption if drivers perform concurrent
>  	  config and MMIO accesses.
>  
> +config INTEL_NVME_REMAP
> +	tristate "Support Intel NVMe remap via AHCI"
> +	depends on X86_64
> +	help
> +	  Adds support for hidden NVMe devices remapped into AHCI memory
> +	  on Intel platforms.

As a user, I don't know what "hidden NVMe devices" means.  I think
this needs to say something about RAID and/or the other Intel
marketing-speak.
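
For instance, wording roughly along these lines (illustrative only, not a
proposed final text):

	  Adds support for NVMe devices that are hidden from the PCI bus
	  when the Intel SATA controller is configured in "RAID" or
	  "Intel RST Premium With Intel Optane System Acceleration" mode,
	  which remaps their MMIO into the AHCI device's BAR.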

> +	  As an alternative to this driver, it is sometimes possible to
> +	  make such devices appear in the normal way by configuring your
> +	  SATA controller to AHCI mode in the firmware setup menu.
> +
> +	  Say Y here if you want to access the remapped devices.
> +
> +	  To compile this driver as a module, choose M here: the
> +	  module will be called intel-nvme-remap.
> +
>  config VMD
>  	depends on PCI_MSI && X86_64 && SRCU
>  	select X86_DEV_DMA_OPS
> diff --git a/drivers/pci/controller/Makefile b/drivers/pci/controller/Makefile
> index d56a507495c5..d4386025b0d1 100644
> --- a/drivers/pci/controller/Makefile
> +++ b/drivers/pci/controller/Makefile
> @@ -1,4 +1,5 @@
>  # SPDX-License-Identifier: GPL-2.0
> +obj-$(CONFIG_INTEL_NVME_REMAP) += intel-nvme-remap.o
>  obj-$(CONFIG_PCIE_CADENCE) += pcie-cadence.o
>  obj-$(CONFIG_PCIE_CADENCE_HOST) += pcie-cadence-host.o
>  obj-$(CONFIG_PCIE_CADENCE_EP) += pcie-cadence-ep.o
> diff --git a/drivers/pci/controller/intel-nvme-remap.c b/drivers/pci/controller/intel-nvme-remap.c
> new file mode 100644
> index 000000000000..c5ac712443aa
> --- /dev/null
> +++ b/drivers/pci/controller/intel-nvme-remap.c
> @@ -0,0 +1,461 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Intel remapped NVMe device support.
> + *
> + * Copyright (c) 2019 Endless Mobile, Inc.
> + * Author: Daniel Drake <drake@endlessm.com>
> + *
> + * Some products ship by default with the SATA controller in "RAID" or
> + * "Intel RST Premium With Intel Optane System Acceleration" mode. Under this
> + * mode, which we refer to as "remapped NVMe" mode, any installed NVMe
> + * devices disappear from the PCI bus, and instead their I/O memory becomes
> + * available within the AHCI device BARs.
> + *
> + * This scheme is understood to be a way of avoiding usage of the standard
> + * Windows NVMe driver under that OS, instead mandating usage of Intel's
> + * own driver, which has better power management, and presumably offers
> + * some RAID/disk-caching solutions too.
> + *
> + * Here in this driver, we support the remapped NVMe mode by claiming the
> + * AHCI device and creating a fake PCIe root port. On the new bus, the
> + * original AHCI device is exposed with only minor tweaks. Then, fake PCI
> + * devices corresponding to the remapped NVMe devices are created. The usual
> + * ahci and nvme drivers are then expected to bind to these devices and
> + * operate as normal.

I think this creates a fake PCI host bridge, but not an actual PCIe
Root Port, right?  I.e., "lspci" doesn't show a new Root Port device,
does it?

But I suppose "lspci" *does* show new NVMe devices that seem to be
PCIe endpoints?  But they probably don't *work* like PCIe endpoints,
e.g., we can't control ASPM, can't use AER, etc?

> + * The PCI configuration space for the NVMe devices is completely
> + * unavailable, so we fake a minimal one and hope for the best.
> + *
> + * Interrupts are shared between the AHCI and NVMe devices. For simplicity,
> + * we only support the legacy interrupt here, although MSI support
> + * could potentially be added later.
> + */
> +
> +#define MODULE_NAME "intel-nvme-remap"
> +
> +#include <linux/ahci-remap.h>
> +#include <linux/irq.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +
> +#define AHCI_PCI_BAR_STANDARD 5
> +
> +struct nvme_remap_dev {
> +	struct pci_dev		*dev;		/* AHCI device */
> +	struct pci_bus		*bus;		/* our fake PCI bus */
> +	struct pci_sysdata	sysdata;
> +	int			irq_base;	/* our fake interrupts */
> +
> +	/*
> +	 * When we detect an all-ones write to a BAR register, this flag
> +	 * is set, so that we return the BAR size on the next read (a
> +	 * standard PCI behaviour).
> +	 * This includes the assumption that an all-ones BAR write is
> +	 * immediately followed by a read of the same register.
> +	 */
> +	bool			bar_sizing;
> +
> +	/*
> +	 * Resources copied from the AHCI device, to be regarded as
> +	 * resources on our fake bus.
> +	 */
> +	struct resource		ahci_resources[PCI_NUM_RESOURCES];
> +
> +	/* Resources corresponding to the NVMe devices. */
> +	struct resource		remapped_dev_mem[AHCI_MAX_REMAP];
> +
> +	/* Number of remapped NVMe devices found. */
> +	int			num_remapped_devices;
> +};
> +
> +static inline struct nvme_remap_dev *nrdev_from_bus(struct pci_bus *bus)
> +{
> +	return container_of(bus->sysdata, struct nvme_remap_dev, sysdata);
> +}
> +
> +
> +/******** PCI configuration space **********/
> +
> +/*
> + * Helper macros for tweaking returned contents of PCI configuration space.
> + *
> + * value contains len bytes of data read from reg.
> + * If fixup_reg is included in that range, fix up the contents of that
> + * register to fixed_value.
> + */
> +#define NR_FIX8(fixup_reg, fixed_value) do { \
> +		if (reg <= fixup_reg && fixup_reg < reg + len) \
> +			((u8 *) value)[fixup_reg - reg] = (u8) (fixed_value); \
> +	} while (0)

These implicitly depend on local variables named "reg", "len",
"value".  Can you make that explicit?

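Something like this would make it explicit (untested sketch, parameter
names chosen only for illustration):

#define NR_FIX8(reg, len, value, fixup_reg, fixed_value) do { \
		if ((reg) <= (fixup_reg) && (fixup_reg) < (reg) + (len)) \
			((u8 *)(value))[(fixup_reg) - (reg)] = (u8)(fixed_value); \
	} while (0)

with NR_FIX16() and friends passing the same three arguments through.
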
> +#define NR_FIX16(fixup_reg, fixed_value) do { \
> +		NR_FIX8(fixup_reg, fixed_value); \
> +		NR_FIX8(fixup_reg + 1, fixed_value >> 8); \
> +	} while (0)
> +
> +#define NR_FIX24(fixup_reg, fixed_value) do { \
> +		NR_FIX8(fixup_reg, fixed_value); \
> +		NR_FIX8(fixup_reg + 1, fixed_value >> 8); \
> +		NR_FIX8(fixup_reg + 2, fixed_value >> 16); \
> +	} while (0)
> +
> +#define NR_FIX32(fixup_reg, fixed_value) do { \
> +		NR_FIX16(fixup_reg, (u16) fixed_value); \
> +		NR_FIX16(fixup_reg + 2, fixed_value >> 16); \
> +	} while (0)
> +
> +/*
> + * Read PCI config space of the slot 0 (AHCI) device.
> + * We pass through the read request to the underlying device, but
> + * tweak the results in some cases.
> + */
> +static int nvme_remap_pci_read_slot0(struct pci_bus *bus, int reg,
> +				     int len, u32 *value)
> +{
> +	struct nvme_remap_dev *nrdev = nrdev_from_bus(bus);
> +	struct pci_bus *ahci_dev_bus = nrdev->dev->bus;
> +	int ret;
> +
> +	ret = ahci_dev_bus->ops->read(ahci_dev_bus, nrdev->dev->devfn,
> +				      reg, len, value);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Adjust the device class, to prevent this driver from attempting to
> +	 * additionally probe the device we're simulating here.
> +	 */
> +	NR_FIX24(PCI_CLASS_PROG, PCI_CLASS_STORAGE_SATA_AHCI);
> +
> +	/*
> +	 * Unset interrupt pin, otherwise ACPI tries to find routing
> +	 * info for our virtual IRQ, fails, and complains.
> +	 */
> +	NR_FIX8(PCI_INTERRUPT_PIN, 0);
> +
> +	/*
> +	 * Truncate the AHCI BAR to not include the region that covers the
> +	 * hidden devices. This will cause the ahci driver to successfully
> +	 * probe the new device (instead of handing it over to this driver).
> +	 */
> +	if (nrdev->bar_sizing) {
> +		NR_FIX32(PCI_BASE_ADDRESS_5, ~(SZ_16K - 1));
> +		nrdev->bar_sizing = false;
> +	}
> +
> +	return PCIBIOS_SUCCESSFUL;
> +}
> +
> +/*
> + * Read PCI config space of a remapped device.
> + * Since the original PCI config space is inaccessible, we provide a minimal,
> + * fake config space instead.
> + */
> +static int nvme_remap_pci_read_remapped(struct pci_bus *bus, unsigned int port,
> +					int reg, int len, u32 *value)
> +{
> +	struct nvme_remap_dev *nrdev = nrdev_from_bus(bus);
> +	struct resource *remapped_mem;
> +
> +	if (port > nrdev->num_remapped_devices)
> +		return PCIBIOS_DEVICE_NOT_FOUND;
> +
> +	*value = 0;
> +	remapped_mem = &nrdev->remapped_dev_mem[port - 1];
> +
> +	/* Set a Vendor ID, otherwise Linux assumes no device is present */
> +	NR_FIX16(PCI_VENDOR_ID, PCI_VENDOR_ID_INTEL);
> +
> +	/* Always appear on & bus mastering */
> +	NR_FIX16(PCI_COMMAND, PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER);
> +
> +	/* Set class so that nvme driver probes us */
> +	NR_FIX24(PCI_CLASS_PROG, PCI_CLASS_STORAGE_EXPRESS);
> +
> +	if (nrdev->bar_sizing) {
> +		NR_FIX32(PCI_BASE_ADDRESS_0,
> +			 ~(resource_size(remapped_mem) - 1));
> +		nrdev->bar_sizing = false;
> +	} else {
> +		resource_size_t mem_start = remapped_mem->start;
> +
> +		mem_start |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> +		NR_FIX32(PCI_BASE_ADDRESS_0, mem_start);
> +		mem_start >>= 32;
> +		NR_FIX32(PCI_BASE_ADDRESS_1, mem_start);
> +	}
> +
> +	return PCIBIOS_SUCCESSFUL;
> +}
> +
> +/* Read PCI configuration space. */
> +static int nvme_remap_pci_read(struct pci_bus *bus, unsigned int devfn,
> +			       int reg, int len, u32 *value)
> +{
> +	if (PCI_SLOT(devfn) == 0)
> +		return nvme_remap_pci_read_slot0(bus, reg, len, value);
> +	else
> +		return nvme_remap_pci_read_remapped(bus, PCI_SLOT(devfn),
> +						    reg, len, value);
> +}
> +
> +/*
> + * Write PCI config space of the slot 0 (AHCI) device.
> + * Apart from the special case of BAR sizing, we disable all writes.
> + * Otherwise, the ahci driver could make changes (e.g. unset PCI bus master)
> + * that would affect the operation of the NVMe devices.
> + */
> +static int nvme_remap_pci_write_slot0(struct pci_bus *bus, int reg,
> +				      int len, u32 value)
> +{
> +	struct nvme_remap_dev *nrdev = nrdev_from_bus(bus);
> +	struct pci_bus *ahci_dev_bus = nrdev->dev->bus;
> +
> +	if (reg >= PCI_BASE_ADDRESS_0 && reg <= PCI_BASE_ADDRESS_5) {
> +		/*
> +		 * Writing all-ones to a BAR means that the size of the
> +		 * memory region is being checked. Flag this so that we can
> +		 * reply with an appropriate size on the next read.
> +		 */
> +		if (value == ~0)
> +			nrdev->bar_sizing = true;
> +
> +		return ahci_dev_bus->ops->write(ahci_dev_bus,
> +						nrdev->dev->devfn,
> +						reg, len, value);
> +	}
> +
> +	return PCIBIOS_SET_FAILED;
> +}
> +
> +/*
> + * Write PCI config space of a remapped device.
> + * Since the original PCI config space is inaccessible, we reject all
> + * writes, except for the special case of BAR probing.
> + */
> +static int nvme_remap_pci_write_remapped(struct pci_bus *bus,
> +					 unsigned int port,
> +					 int reg, int len, u32 value)
> +{
> +	struct nvme_remap_dev *nrdev = nrdev_from_bus(bus);
> +
> +	if (port > nrdev->num_remapped_devices)
> +		return PCIBIOS_DEVICE_NOT_FOUND;
> +
> +	/*
> +	 * Writing all-ones to a BAR means that the size of the memory
> +	 * region is being checked. Flag this so that we can reply with
> +	 * an appropriate size on the next read.
> +	 */
> +	if (value == ~0 && reg >= PCI_BASE_ADDRESS_0
> +			&& reg <= PCI_BASE_ADDRESS_5) {
> +		nrdev->bar_sizing = true;
> +		return PCIBIOS_SUCCESSFUL;
> +	}
> +
> +	return PCIBIOS_SET_FAILED;
> +}
> +
> +/* Write PCI configuration space. */
> +static int nvme_remap_pci_write(struct pci_bus *bus, unsigned int devfn,
> +				int reg, int len, u32 value)
> +{
> +	if (PCI_SLOT(devfn) == 0)
> +		return nvme_remap_pci_write_slot0(bus, reg, len, value);
> +	else
> +		return nvme_remap_pci_write_remapped(bus, PCI_SLOT(devfn),
> +						     reg, len, value);
> +}
> +
> +static struct pci_ops nvme_remap_pci_ops = {
> +	.read	= nvme_remap_pci_read,
> +	.write	= nvme_remap_pci_write,
> +};
> +
> +
> +/******** Initialization & exit **********/
> +
> +/*
> + * Find a PCI domain ID to use for our fake bus.
> + * Start at 0x10000 to not clash with ACPI _SEG domains (16 bits).
> + */
> +static int find_free_domain(void)
> +{
> +	int domain = 0xffff;
> +	struct pci_bus *bus = NULL;
> +
> +	while ((bus = pci_find_next_bus(bus)) != NULL)
> +		domain = max_t(int, domain, pci_domain_nr(bus));
> +
> +	return domain + 1;
> +}

This is the same as vmd_find_free_domain().  There might not be a
conflict now, but both places should call the same interface so future
changes can take both callers into account.
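
E.g. both could call a small shared helper in the PCI core (name and
placement hypothetical, just to illustrate the idea):

int pci_bus_find_free_domain_nr(void)
{
	int domain = 0xffff;
	struct pci_bus *bus = NULL;

	while ((bus = pci_find_next_bus(bus)) != NULL)
		domain = max_t(int, domain, pci_domain_nr(bus));

	return domain + 1;
}
EXPORT_SYMBOL_GPL(pci_bus_find_free_domain_nr);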

> +static int find_remapped_devices(struct nvme_remap_dev *nrdev,
> +				 struct list_head *resources)
> +{
> +	void __iomem *mmio;
> +	int i, count = 0;
> +	u32 cap;
> +
> +	mmio = pcim_iomap(nrdev->dev, AHCI_PCI_BAR_STANDARD,
> +			  pci_resource_len(nrdev->dev,
> +					   AHCI_PCI_BAR_STANDARD));
> +	if (!mmio)
> +		return -ENODEV;
> +
> +	/* Check if this device might have remapped nvme devices. */
> +	if (pci_resource_len(nrdev->dev, AHCI_PCI_BAR_STANDARD) < SZ_512K ||
> +	    !(readl(mmio + AHCI_VSCAP) & 1))
> +		return -ENODEV;
> +
> +	cap = readq(mmio + AHCI_REMAP_CAP);

Can you provide a spec reference for AHCI_VSCAP and AHCI_REMAP_CAP?
It might not need to be in the code, but it would be useful in the
commit log at least.

> +	for (i = 0; i < AHCI_MAX_REMAP; i++) {
> +		struct resource *remapped_mem;
> +
> +		if ((cap & (1 << i)) == 0)
> +			continue;
> +		if (readl(mmio + ahci_remap_dcc(i))
> +				!= PCI_CLASS_STORAGE_EXPRESS)
> +			continue;
> +
> +		/* We've found a remapped device */
> +		remapped_mem = &nrdev->remapped_dev_mem[count++];
> +		remapped_mem->start =
> +			pci_resource_start(nrdev->dev, AHCI_PCI_BAR_STANDARD)
> +			+ ahci_remap_base(i);
> +		remapped_mem->end = remapped_mem->start
> +			+ AHCI_REMAP_N_SIZE - 1;
> +		remapped_mem->flags = IORESOURCE_MEM | IORESOURCE_PCI_FIXED;
> +		pci_add_resource(resources, remapped_mem);
> +	}
> +
> +	pcim_iounmap(nrdev->dev, mmio);
> +
> +	if (count == 0)
> +		return -ENODEV;
> +
> +	nrdev->num_remapped_devices = count;
> +	dev_info(&nrdev->dev->dev, "Found %d remapped NVMe devices\n",
> +		 nrdev->num_remapped_devices);
> +	return 0;
> +}
> +
> +static void nvme_remap_remove_root_bus(void *data)
> +{
> +	struct pci_bus *bus = data;
> +
> +	pci_stop_root_bus(bus);
> +	pci_remove_root_bus(bus);
> +}
> +
> +static int nvme_remap_probe(struct pci_dev *dev,
> +			    const struct pci_device_id *id)
> +{
> +	struct nvme_remap_dev *nrdev;
> +	LIST_HEAD(resources);
> +	int i;
> +	int ret;
> +	struct pci_dev *child;
> +
> +	nrdev = devm_kzalloc(&dev->dev, sizeof(*nrdev), GFP_KERNEL);
> +	nrdev->sysdata.domain = find_free_domain();
> +	nrdev->dev = dev;
> +	pci_set_drvdata(dev, nrdev);
> +
> +	ret = pcim_enable_device(dev);
> +	if (ret < 0)
> +		return ret;
> +
> +	pci_set_master(dev);
> +
> +	ret = find_remapped_devices(nrdev, &resources);
> +	if (ret)
> +		return ret;
> +
> +	/* Add resources from the original AHCI device */
> +	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> +		struct resource *res = &dev->resource[i];
> +
> +		if (res->start) {
> +			struct resource *nr_res = &nrdev->ahci_resources[i];
> +
> +			nr_res->start = res->start;
> +			nr_res->end = res->end;
> +			nr_res->flags = res->flags;
> +			pci_add_resource(&resources, nr_res);
> +		}
> +	}
> +
> +	/* Create virtual interrupts */
> +	nrdev->irq_base = devm_irq_alloc_descs(&dev->dev, -1, 0,
> +					       nrdev->num_remapped_devices + 1,
> +					       0);
> +	if (nrdev->irq_base < 0)
> +		return nrdev->irq_base;
> +
> +	/* Create and populate PCI bus */
> +	nrdev->bus = pci_create_root_bus(&dev->dev, 0, &nvme_remap_pci_ops,
> +					 &nrdev->sysdata, &resources);
> +	if (!nrdev->bus)
> +		return -ENODEV;
> +
> +	if (devm_add_action_or_reset(&dev->dev, nvme_remap_remove_root_bus,
> +				     nrdev->bus))
> +		return -ENOMEM;
> +
> +	/* We don't support sharing MSI interrupts between these devices */
> +	nrdev->bus->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
> +
> +	pci_scan_child_bus(nrdev->bus);
> +
> +	list_for_each_entry(child, &nrdev->bus->devices, bus_list) {
> +		/*
> +		 * Prevent PCI core from trying to move memory BARs around.
> +		 * The hidden NVMe devices are at fixed locations.
> +		 */
> +		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> +			struct resource *res = &child->resource[i];
> +
> +			if (res->flags & IORESOURCE_MEM)
> +				res->flags |= IORESOURCE_PCI_FIXED;
> +		}
> +
> +		/* Share the legacy IRQ between all devices */
> +		child->irq = dev->irq;
> +	}
> +
> +	pci_assign_unassigned_bus_resources(nrdev->bus);
> +	pci_bus_add_devices(nrdev->bus);
> +
> +	return 0;
> +}
> +
> +static const struct pci_device_id nvme_remap_ids[] = {
> +	/*
> +	 * Match all Intel RAID controllers.
> +	 *
> +	 * There's overlap here with the set of devices detected by the ahci
> +	 * driver, but ahci will only successfully probe when there
> +	 * *aren't* any remapped NVMe devices, and this driver will only
> +	 * successfully probe when there *are* remapped NVMe devices that
> +	 * need handling.
> +	 */
> +	{
> +		PCI_VDEVICE(INTEL, PCI_ANY_ID),
> +		.class = PCI_CLASS_STORAGE_RAID << 8,
> +		.class_mask = 0xffffff00,
> +	},
> +	{0,}
> +};
> +MODULE_DEVICE_TABLE(pci, nvme_remap_ids);
> +
> +static struct pci_driver nvme_remap_drv = {
> +	.name		= MODULE_NAME,
> +	.id_table	= nvme_remap_ids,
> +	.probe		= nvme_remap_probe,
> +};
> +module_pci_driver(nvme_remap_drv);
> +
> +MODULE_AUTHOR("Daniel Drake <drake@endlessm.com>");
> +MODULE_LICENSE("GPL v2");
> -- 
> 2.20.1
> 
> 


* Re: [PATCH] PCI: Add Intel remapped NVMe device support
From: Daniel Drake @ 2019-06-11  2:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Jens Axboe, Keith Busch, Christoph Hellwig,
	Linux PCI, linux-ide, Sagi Grimberg, linux-nvme,
	Linux Upstreaming Team

On Tue, Jun 11, 2019 at 12:00 AM Keith Busch <keith.busch@gmail.com> wrote:
>
> On Mon, Jun 10, 2019 at 1:45 AM Daniel Drake <drake@endlessm.com> wrote:
> > +       /* We don't support sharing MSI interrupts between these devices */
> > +       nrdev->bus->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
>
> And this is a problem, isn't it? Since we don't have an option to open
> the MSI implementation in RAID mode your experience will be much
> better if you disable this mode when using Linux, as per the current
> recommendation, rather than limping along with a legacy IRQ.

What's the specific problem that you see here? Is it that the
interrupt delivery mechanism is legacy wire instead of MSI, or is the
problem that the interrupt is shared over the whole set of storage
devices?

I installed Windows 10 on this product in RAID mode and it is using
the legacy interrupt too. Also, on Linux, MSI interrupts have already
been disabled on the AHCI part of such setups for a good while now:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f723fa4e69920f6a5dd5fa0d10ce90e2f14d189c

The earlier patches from Dan Williams also had the design of sharing
the legacy interrupt:
https://marc.info/?l=linux-ide&m=147709610621480&w=2

I think some kind of MSI support may be possible, perhaps something
similar to what is done by drivers/pci/controller/vmd.c, but it needs
a bit more thought, and I was hoping that we could get the base device
support in place before investigating MSI as a separate step. However,
if the concern you are raising is regarding the sharing of interrupts,
I think that cannot change because the NVMe devices' PCI config space
is totally inaccessible when in this mode. That means there is no way
we can configure a per-device MSI message, so the interrupt will
continue to be shared regardless of delivery mechanism.

Thanks,
Daniel


* Re: [PATCH] PCI: Add Intel remapped NVMe device support
From: Daniel Drake @ 2019-06-11  3:25 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Linux PCI,
	Linux Upstreaming Team, Sagi Grimberg, linux-nvme, linux-ide,
	Dan Williams, Alex Williamson

Hi Bjorn,

Thanks for the quick feedback. You raise some good questions that I'll
be sure to clarify in the next revision. To focus on some of the
pending details here:

On Tue, Jun 11, 2019 at 5:16 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> Ugh.  Is there a spec that details what's actually going on here?

Unfortunately there isn't a great spec to go on.
https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/100-series-chipset-datasheet-vol-2.pdf
has some details on the VS_CAP register (section 14.2.10).
Beyond that, Intel contributed patches to enable support for these
devices previously:
https://marc.info/?l=linux-ide&m=147709610621480&w=2
and stated that "The patch contents are [the spec]".
https://marc.info/?l=linux-ide&m=147733119300691&w=2
Later in the thread it was also stated unequivocally that the
interrupt is shared & the original NVMe dev config space is
unavailable.
I'll add references to these details in the next revision.

> This driver makes a lot of assumptions about how this works, e.g.,
> apparently there's an AHCI BAR that covers "hidden devices" plus some
> other stuff of some magic size, whatever is special about device 0,
> etc, but I don't see the source of those assumptions.

The AHCI BAR covering hidden devices is sort-of covered in the VS_CAP
spec so I can at least reference that.

> I'm not really keen on the precedent this sets about pretending things
> are PCI when they're not.  This seems like a bit of a kludge that
> might happen to work now but could easily break in the future because
> it's not based on any spec we can rely on.  Plus it makes future PCI
> maintenance harder because we have to worry about how these differ
> from real PCI devices.
>
> I think this creates a fake PCI host bridge, but not an actual PCIe
> Root Port, right?  I.e., "lspci" doesn't show a new Root Port device,
> does it?
>
> But I suppose "lspci" *does* show new NVMe devices that seem to be
> PCIe endpoints?  But they probably don't *work* like PCIe endpoints,
> e.g., we can't control ASPM, can't use AER, etc?

I appreciate your input here as I don't frequently go down to this
level of detail with PCI. I'm trying to follow the previous
suggestions from Christoph Hellwig, and further clarification on the
most appropriate way to do this would be appreciated:

https://marc.info/?l=linux-ide&m=147923593001525&w=2
"implementing a bridge driver like VMD"
http://lists.infradead.org/pipermail/linux-nvme/2017-October/013325.html
"The right way to do this would be to expose a fake PCIe root port
that both the AHCI and NVMe driver bind to."

I'm not completely clear regarding the difference between a PCI host
bridge and a PCIe root port, but indeed, after my patch, when running
lspci, you see:

1. The original RAID controller, now claimed by this new intel-nvme-remap driver

0000:00:17.0 RAID bus controller: Intel Corporation 82801 Mobile SATA
Controller [RAID mode] (rev 30)
    Subsystem: ASUSTeK Computer Inc. 82801 Mobile SATA Controller [RAID mode]
    Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 16
    Memory at b4390000 (32-bit, non-prefetchable) [size=32K]
    Memory at b43aa000 (32-bit, non-prefetchable) [size=256]
    I/O ports at 4090 [size=8]
    I/O ports at 4080 [size=4]
    I/O ports at 4060 [size=32]
    Memory at b4300000 (32-bit, non-prefetchable) [size=512K]
    Capabilities: [d0] MSI-X: Enable- Count=20 Masked-
    Capabilities: [70] Power Management version 3
    Capabilities: [a8] SATA HBA v1.0
    Kernel driver in use: intel-nvme-remap

2. The RAID controller presented by intel-nvme-remap on a new bus,
with the cfg space tweaked in a way that it gets probed & accepted by
the ahci driver:

10000:00:00.0 SATA controller: Intel Corporation 82801 Mobile SATA
Controller [RAID mode] (rev 30) (prog-if 01 [AHCI 1.0])
    Subsystem: ASUSTeK Computer Inc. 82801 Mobile SATA Controller [RAID mode]
    Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 16, NUMA node 0
    Memory at b4390000 (32-bit, non-prefetchable) [size=32K]
    Memory at b43aa000 (32-bit, non-prefetchable) [size=256]
    I/O ports at 4090 [size=8]
    I/O ports at 4080 [size=4]
    I/O ports at 4060 [size=32]
    Memory at b4300000 (32-bit, non-prefetchable) [size=16K]
    Capabilities: [d0] MSI-X: Enable- Count=20 Masked-
    Capabilities: [70] Power Management version 3
    Capabilities: [a8] SATA HBA v1.0
    Kernel driver in use: ahci

3. The (previously inaccessible) NVMe device as presented on the new
bus by intel-nvme-remap, probed by the nvme driver

10000:00:01.0 Non-Volatile memory controller: Intel Corporation Device
0000 (prog-if 02 [NVM Express])
    Flags: bus master, fast Back2Back, fast devsel, latency 0, IRQ 16,
NUMA node 0
    Memory at b430c000 (64-bit, non-prefetchable) [size=16K]
    Kernel driver in use: nvme

I think Christoph's suggestion does ultimately require us to do some
PCI pretending in some form, but let me know if there are more
acceptable ways to do this. If you'd like to see this appear more like
a PCIe root port then I guess I can use pci-bridge-emul.c to do this,
although having a fake root bridge appear in lspci output feels like
I'd be doing even more pretending.

Also happy to experiment with alternative approaches if you have any
suggestions? With the decreasing cost of NVMe SSDs, we're seeing an
influx of upcoming consumer PC products that will ship with the NVMe
disk being the only storage device, combined with the BIOS default of
"RST Optane" mode which will prevent Linux from seeing it at all, so
I'm really keen to swiftly find a way forward here.

Thanks!
Daniel


* Re: [PATCH] PCI: Add Intel remapped NVMe device support
From: Bjorn Helgaas @ 2019-06-11 19:52 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Linux PCI,
	Linux Upstreaming Team, Sagi Grimberg, linux-nvme, linux-ide,
	Dan Williams, Alex Williamson

On Tue, Jun 11, 2019 at 11:25:55AM +0800, Daniel Drake wrote:
> On Tue, Jun 11, 2019 at 5:16 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > Ugh.  Is there a spec that details what's actually going on here?
> 
> Unfortunately there isn't a great spec to go on.
> https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/100-series-chipset-datasheet-vol-2.pdf
> has some details on the VS_CAP register (section 14.2.10).
> Beyond that, Intel contributed patches to enable support for these
> devices previously:
> https://marc.info/?l=linux-ide&m=147709610621480&w=2
> and stated that "The patch contents are [the spec]".
> https://marc.info/?l=linux-ide&m=147733119300691&w=2

It also said (three years ago) that there was some hope of opening the
specs.  But I guess that hasn't happened.

I'd much prefer lore.kernel.org links, but unfortunately lore doesn't
seem to have linux-ide archives.  If marc.info is the best we can do,
maybe at least include Message-IDs so there's some useful info in the
event marc.info disappears.

> > I think this creates a fake PCI host bridge, but not an actual PCIe
> > Root Port, right?  I.e., "lspci" doesn't show a new Root Port device,
> > does it?
> > ...
> 
> I appreciate your input here as I don't frequently go down to this
> level of detail with PCI. I'm trying to follow the previous
> suggestions from Christoph Hellwig, and further clarification on the
> most appropriate way to do this would be appreciated:
> 
> https://marc.info/?l=linux-ide&m=147923593001525&w=2
> "implementing a bridge driver like VMD"
> http://lists.infradead.org/pipermail/linux-nvme/2017-October/013325.html
> "The right way to do this would be to expose a fake PCIe root port
> that both the AHCI and NVMe driver bind to."
> 
> I'm not completely clear regarding the difference between a PCI host
> bridge and a PCIe root port, but indeed, after my patch, when running
> lspci, you see:
> 
> 1. The original RAID controller, now claimed by this new intel-nvme-remap driver
> 
> 0000:00:17.0 RAID bus controller: Intel Corporation 82801 Mobile SATA
> Controller [RAID mode] (rev 30)
>     Memory at b4390000 (32-bit, non-prefetchable) [size=32K]

> 2. The RAID controller presented by intel-nvme-remap on a new bus,
> with the cfg space tweaked in a way that it gets probed & accepted by
> the ahci driver:
> 
> 10000:00:00.0 SATA controller: Intel Corporation 82801 Mobile SATA
> Controller [RAID mode] (rev 30) (prog-if 01 [AHCI 1.0])
>     Memory at b4390000 (32-bit, non-prefetchable) [size=32K]

Exposing the same device in two different places (0000:00:17.0 and
10000:00:00.0) is definitely an architectural issue.  Logically we're
saying that accesses to b4390000 are claimed by two different devices.

> 3. The (previously inaccessible) NVMe device as presented on the new
> bus by intel-nvme-remap, probed by the nvme driver
> 
> 10000:00:01.0 Non-Volatile memory controller: Intel Corporation Device
> 0000 (prog-if 02 [NVM Express])
>     Memory at b430c000 (64-bit, non-prefetchable) [size=16K]

From a hardware point of view, I think it *was* previously accessible.
Maybe not in a convenient, driver-bindable way, but I don't think your
patch flips any PCI_COMMAND or similar register enable bits.
Everything should have been accessible before if you knew where to
look.

> I think Christoph's suggestion does ultimately require us to do some
> PCI pretending in some form, but let me know if there are more
> accepable ways to do this. If you'd like to see this appear more like
> a PCIe root port then I guess I can use pci-bridge-emul.c to do this,
> although having a fake root bridge appear in lspci output feels like
> I'd be doing even more pretending.

Maybe exposing a Root Port would help rationalize some of the issues,
but I wasn't suggesting that you *need* to expose a Root Port.  I was
just trying to point out that the comment inaccurately claimed you
were.

> Also happy to experiment with alternative approaches if you have any
> suggestions? 

Why do you need these to be PCI devices?  It looks like the main thing
you get is a hook to bind the driver to.  Could you accomplish
something similar by doing some coordination between the ahci and nvme
drivers directly, without involving PCI?

I assume that whatever magic Intel is doing with this "RST Optane"
mode, the resulting platform topology is at least compliant with the
PCI spec, so all the standard things in the spec like AER, DPC, power
management, etc, still work.

> With the decreasing cost of NVMe SSDs, we're seeing an
> influx of upcoming consumer PC products that will ship with the NVMe
> disk being the only storage device, combined with the BIOS default of
> "RST Optane" mode which will prevent Linux from seeing it at all, 
> so I'm really keen to swiftly find a way forward here.

This all sounds urgent, but without details of what this "RST Optane"
mode actually means, I don't know what to do with it.  I want to
avoid the voodoo programming of "we don't know *why* we're doing this,
but it seems to work."

Bjorn

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-11 19:52     ` Bjorn Helgaas
@ 2019-06-12  3:16       ` Daniel Drake
  2019-06-12 13:49         ` Bjorn Helgaas
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Drake @ 2019-06-12  3:16 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Linux PCI,
	Linux Upstreaming Team, Sagi Grimberg, linux-nvme, linux-ide,
	Dan Williams, Alex Williamson, mjg59

On Wed, Jun 12, 2019 at 3:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> It also said (three years ago) that there was some hope of opening the
> specs.  But I guess that hasn't happened.

I think the brief spec I already linked to may have been published as
a result of the discussion there:
https://marc.info/?l=linux-ide&m=147734288604783&w=2

Either way I'm not aware of any more detailed information having been
published since then.

> > 2. The RAID controller presented by intel-nvme-remap on a new bus,
> > with the cfg space tweaked in a way that it gets probed & accepted by
> > the ahci driver:
> >
> > 10000:00:00.0 SATA controller: Intel Corporation 82801 Mobile SATA
> > Controller [RAID mode] (rev 30) (prog-if 01 [AHCI 1.0])
> >     Memory at b4390000 (32-bit, non-prefetchable) [size=32K]
>
> Exposing the same device in two different places (0000:00:17.0 and
> 10000:00:00.0) is definitely an architectural issue.  Logically we're
> saying that accesses to b4390000 are claimed by two different devices.

I guess intel-nvme-remap could tweak the 0000:00:17.0 device to remove
those BARs so that they ultimately only appear under 10000:00:00.0.
But that doesn't sound particularly nice either.

If we continue down this road, another possibility is to leave the
0000:00:17.0 device untouched, claimed and driven by the ahci driver
as it is now, and rather than have intel-nvme-remap be a separate
driver that claims the PCI device, just have it as a kind of library
that gets called into from ahci. intel-nvme-remap would then create
the "fake" PCI bus but only expose the NVMe devs there (not the AHCI
one). This would deviate a little from the original suggestion of
"expose a fake PCIe root port that both the AHCI and NVMe driver bind
to.".

> > 3. The (previously inaccessible) NVMe device as presented on the new
> > bus by intel-nvme-remap, probed by the nvme driver
> >
> > 10000:00:01.0 Non-Volatile memory controller: Intel Corporation Device
> > 0000 (prog-if 02 [NVM Express])
> >     Memory at b430c000 (64-bit, non-prefetchable) [size=16K]
>
> From a hardware point of view, I think it *was* previously accessible.
> Maybe not in a convenient, driver-bindable way, but I don't think your
> patch flips any PCI_COMMAND or similar register enable bits.
> Everything should have been accessible before if you knew where to
> look.

Pretty much, but in addition to fishing out the NVMe memory address
from the AHCI BAR, you also have to know to share the interrupt with
AHCI, and the PCI_COMMAND_MEMORY and PCI_COMMAND_MASTER bits must be
set on the AHCI device for the NVMe devices to work.
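
For reference, roughly this kind of thing (untested sketch; "ahci_pdev"
stands for the 0000:00:17.0 AHCI function, and the helper name is made
up for illustration):

#include <linux/pci.h>

/*
 * The remapped NVMe BAR decodes through the AHCI function, so memory
 * decode and bus mastering have to be enabled on the AHCI device before
 * the remapped NVMe controller can be reached via MMIO or do DMA.
 */
static void enable_remapped_nvme_access(struct pci_dev *ahci_pdev)
{
	u16 cmd;

	pci_read_config_word(ahci_pdev, PCI_COMMAND, &cmd);
	cmd |= PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER;
	pci_write_config_word(ahci_pdev, PCI_COMMAND, cmd);
}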

> Why do you need these to be PCI devices?

I don't have a particular preference, but was trying to explore the
suggestions from the last round of review:

https://marc.info/?l=linux-ide&m=147923593001525&w=2
"implementing a bridge driver like VMD"
http://lists.infradead.org/pipermail/linux-nvme/2017-October/013325.html
"The right way to do this would be to expose a fake PCIe root port
that both the AHCI and NVMe driver bind to."

> It looks like the main thing
> you get is a hook to bind the driver to.  Could you accomplish
> something similar by doing some coordination between the ahci and nvme
> drivers directly, without involving PCI?

That's basically what Dan Williams originally proposed, and Christoph
Hellwig was not particularly excited by it...

Can you take a quick look at the original patches and see what you think?
https://marc.info/?l=linux-ide&m=147709611121482&w=2
https://marc.info/?l=linux-ide&m=147709611621483&w=2
https://marc.info/?l=linux-ide&m=147709612221484&w=2
https://marc.info/?l=linux-ide&m=147709612721485&w=2
https://marc.info/?l=linux-ide&m=147709613221487&w=2

> I assume that whatever magic Intel is doing with this "RST Optane"
> mode, the resulting platform topology is at least compliant with the
> PCI spec, so all the standard things in the spec like AER, DPC, power
> management, etc, still work.

That would also be my expectation - those standard things you
configure on the AHCI device would also affect the mode of operation
of the hidden NVMe devices, in the same way that the
PCI_COMMAND_MASTER AHCI bit affects NVMe device access.

> This all sounds urgent, but without details of what this "RST Optane"
> mode actually means, I don't know what to do with it.  I want to
> avoid the voodoo programming of "we don't know *why* we're doing this,
> but it seems to work."

From the user perspective, we're doing it so that they get access to
their storage device.

But I guess you meant more from the technical architecture
perspective. My understanding comes from
https://mjg59.dreamwidth.org/44694.html : this is a game of Windows
driver politics.
Intel doesn't want the standard Windows NVMe driver to bind to the
NVMe devices, because that driver is power hungry and makes Intel
platforms look bad. So they came up with this scheme to hide the NVMe
devices from view, and then only the power-efficient Intel Windows
driver knows how to find them.

The implementation follows patches, emails and the VS_CAP spec, all
authored by Intel. I'm not confident that we'll get any more than
that. The 2016 patches only appeared 5 months after numerous Lenovo
customers had reported their inability to access their disk on Linux
(they didn't even have a BIOS configuration option at that point).
While some information was then shared in patches and emails, as you
have seen Intel wasn't very forthcoming in providing a decent spec.
Intel's last comment in the 2016 thread wasn't exactly positive:
https://marc.info/?l=linux-ide&m=147953592417285&w=2
and there was no response from Intel to my 2017 thread:
http://lists.infradead.org/pipermail/linux-nvme/2017-October/013323.html

So at this point I'd advocate for just piecing together the parts of
the puzzle that we do have (I'll work to reference the details better
in the next patch revision), accepting that we're working with an
architecture that doesn't seem well thought out, and then figuring out
the least painful way to support it.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-12  3:16       ` Daniel Drake
@ 2019-06-12 13:49         ` Bjorn Helgaas
  0 siblings, 0 replies; 19+ messages in thread
From: Bjorn Helgaas @ 2019-06-12 13:49 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Linux PCI,
	Linux Upstreaming Team, Sagi Grimberg, linux-nvme, linux-ide,
	Dan Williams, Alex Williamson, mjg59

On Wed, Jun 12, 2019 at 11:16:03AM +0800, Daniel Drake wrote:
> On Wed, Jun 12, 2019 at 3:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:

> > Why do you need these to be PCI devices?
> 
> I don't have a particular preference, but was trying to explore the
> suggestions from the last round of review:
> 
> https://marc.info/?l=linux-ide&m=147923593001525&w=2
> "implementing a bridge driver like VMD"
> http://lists.infradead.org/pipermail/linux-nvme/2017-October/013325.html
> "The right way to do this would be to expose a fake PCIe root port
> that both the AHCI and NVMe driver bind to."
> 
> > It looks like the main thing
> > you get is a hook to bind the driver to.  Could you accomplish
> > something similar by doing some coordination between the ahci and nvme
> > drivers directly, without involving PCI?
> 
> That's basically what Dan Williams originally proposed, and Christoph
> Hellwig was not particularly excited by it...
> 
> Can you take a quick look at the original patches and see what you think?
> https://marc.info/?l=linux-ide&m=147709611121482&w=2
> https://marc.info/?l=linux-ide&m=147709611621483&w=2
> https://marc.info/?l=linux-ide&m=147709612221484&w=2
> https://marc.info/?l=linux-ide&m=147709612721485&w=2
> https://marc.info/?l=linux-ide&m=147709613221487&w=2

I see Christoph's objections starting at
https://marc.info/?l=linux-ide&m=147711904724908&w=2
and I agree that this AHCI/NVMe melding is ugly.

But given the existence of this ugly hardware, my opinion is that
Dan's original patch series (above) is actually a nice way to deal
with it.  That's exactly the sort of thing I was proposing.

Part of Christoph's objection was the issue of how reset works, and
that objection absolutely makes sense to me.  But IMO adding a fake
PCI host bridge and fake PCI devices that really don't work because
they have read-only config space just smears the issue over
PCI/VFIO/etc in addition to AHCI and NVMe.

Bjorn

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-11  2:46   ` Daniel Drake
@ 2019-06-12 14:32     ` Keith Busch
  2019-06-13  8:54       ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Keith Busch @ 2019-06-12 14:32 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Bjorn Helgaas, Jens Axboe, Keith Busch, Christoph Hellwig,
	Linux PCI, linux-ide, Sagi Grimberg, linux-nvme,
	Linux Upstreaming Team

On Mon, Jun 10, 2019 at 8:46 PM Daniel Drake <drake@endlessm.com> wrote:
> What's the specific problem that you see here?

Performance. Have you had a chance to benchmark these storage devices
comparing legacy vs MSI interrupts? I don't think anyone would choose
the slower one on purpose. These platforms have an option to disable
raid mode, so the kernel's current recommendation should be the user's
best option.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-12 14:32     ` Keith Busch
@ 2019-06-13  8:54       ` Christoph Hellwig
  2019-06-14  2:26         ` Daniel Drake
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2019-06-13  8:54 UTC (permalink / raw)
  To: Keith Busch
  Cc: Daniel Drake, Bjorn Helgaas, Jens Axboe, Keith Busch,
	Christoph Hellwig, Linux PCI, linux-ide, Sagi Grimberg,
	linux-nvme, Linux Upstreaming Team

On Wed, Jun 12, 2019 at 08:32:46AM -0600, Keith Busch wrote:
> On Mon, Jun 10, 2019 at 8:46 PM Daniel Drake <drake@endlessm.com> wrote:
> > What's the specific problem that you see here?
> 
> Performance. Have you had a chance to benchmark these storage devices
> comparing legacy vs MSI interrupts? I don't think anyone would choose
> the slower one on purpose. These platforms have an option to disable
> raid mode, so the kernel's current recommendation should be the user's
> best option.

And it isn't just performance.  I really don't understand how

 a) quirks on the PCI ID
 b) reset handling, including the PCI device removal as the last
    escalation step
 c) SR-IOV VFs and their management
 d) power management

and probably various other bits I didn't even think of are going to
work.

So until we get very clear and good documentation from Intel on that
I don't think any form of upstream support will fly.  And given that
Dan who submitted the original patch can't even talk about this thing
any more and apparently got a gag order doesn't really give me confidence
any of this will ever work.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-13  8:54       ` Christoph Hellwig
@ 2019-06-14  2:26         ` Daniel Drake
  2019-06-14 19:36           ` Keith Busch
  2019-06-18  7:46           ` Hannes Reinecke
  0 siblings, 2 replies; 19+ messages in thread
From: Daniel Drake @ 2019-06-14  2:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Bjorn Helgaas, Jens Axboe, Keith Busch, Linux PCI,
	linux-ide, Sagi Grimberg, linux-nvme, Linux Upstreaming Team

On Thu, Jun 13, 2019 at 4:54 PM Christoph Hellwig <hch@lst.de> wrote:
> So until we get very clear and good documentation from Intel on that
> I don't think any form of upstream support will fly.  And given that
> Dan who submitted the original patch can't even talk about this thing
> any more and apparently got a gag order doesn't really give me confidence
> any of this will ever work.

I realise the architecture here seems badly thought out, and the lack
of a decent spec makes the situation worse, but I'd encourage you to
reconsider this from the perspectives of:
 - Are the patches really more ugly than the underlying architecture?
 - We strive to make Linux work well on common platforms and sometimes
have to accept that hardware vendors do questionable things & do not
fully cooperate
 - It works out of the box on Windows

As you said years ago:
https://marc.info/?l=linux-ide&m=147923593001525&w=2
"It seems devices supporting this "slow down the devices and make life
hell for the OS" mode are getting more common, so we'll have to do
something about it."

The frequency of appearance of this configuration looks poised to
grow even more significantly at this point. There appears to be a
significant increase in consumer laptops in development that have an
NVMe disk as their only storage device and come with the BIOS option
on by default. When these reach the point of sale, expect to see a
whole bunch more Linux users who struggle with this. We also have
indications that
vendors are unwilling to deal with the logistics headache of having
different BIOS settings for Linux, so the lack of support here is
potentially going to stop those vendors from shipping Linux at all.

Even with a spec I don't imagine that we can meet the feature parity
of having the real NVMe PCI device available. Can we just accept the
compromises & start by focusing on the simple case of a consumer
home/family PC?

>  a) quirks on the PCI ID

Intel stated unequivocally that the PCI config space is not available.
So this isn't going to happen, spec or not.
https://marc.info/?l=linux-ide&m=147734288604783&w=2

If we run into a case where we absolutely need quirks, we could look
at keying them off the disk identification data available over the
NVMe protocol (e.g. vendor & model name).
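
To illustrate, the nvme core already has a core_quirks[] table in
drivers/nvme/host/core.c keyed on the vendor ID and model string from
Identify Controller; a hypothetical entry (all values made up, purely
to show the shape) would look like:

static const struct nvme_core_quirk_entry core_quirks[] = {
	/* ... existing entries ... */
	{
		/*
		 * Hypothetical entry: vid and mn are invented here. They
		 * come from Identify Controller data rather than PCI config
		 * space, so they stay visible even in remapped mode.
		 */
		.vid = 0x1234,
		.mn = "EXAMPLE NVME MODEL",
		.quirks = NVME_QUIRK_NO_DEEPEST_PS,
	},
};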

>  b) reset handling, including the PCI device removal as the last
>     escalation step

Apparently can't be supported, but it's not clear that this actually
matters for a home PC...

https://marc.info/?l=linux-ide&m=147733119300691&w=2
"The driver seems to already comprehend instances where the
device does not support nvme_reset_subsystem() requests."

https://marc.info/?l=linux-ide&m=147734288604783&w=2
"Talking with Keith, subsystem-resets are a feature of enterprise-class
NVMe devices.  I think those features are out of scope for the class
of devices that will find themselves in a platform with this
configuration, same for hot-plug."

>  c) SR-IOV VFs and their management

This seems like a server/virtualization topic. I don't see any issues
in not supporting this in the context of a consumer PC.
It seems reasonable to expect people interested in this to be required
to read the kernel logs (to see the message) and proceed with changing
the BIOS setting.

>  d) power management

If there is a way to control the NVMe device power separately from the
AHCI device that would of course be nice, but this seems secondary to
the larger problem of users not being able to access their storage
device.

I'm hopeful that after years of waiting for the situation to improve
without any positive developments, we can find a way to go with the
code we have now, and if we do get a spec from Intel at any point,
make any relevant code improvements when that happens.

I'll work on refreshing Dan's patches & clarifying the knowledge we
have within there, plus the limitations.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-14  2:26         ` Daniel Drake
@ 2019-06-14 19:36           ` Keith Busch
  2019-06-14 20:05             ` Bjorn Helgaas
  2019-06-18  7:46           ` Hannes Reinecke
  1 sibling, 1 reply; 19+ messages in thread
From: Keith Busch @ 2019-06-14 19:36 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Christoph Hellwig, Bjorn Helgaas, Jens Axboe, Keith Busch,
	Linux PCI, linux-ide, Sagi Grimberg, linux-nvme,
	Linux Upstreaming Team

On Thu, Jun 13, 2019 at 8:26 PM Daniel Drake <drake@endlessm.com> wrote:
> On Thu, Jun 13, 2019 at 4:54 PM Christoph Hellwig <hch@lst.de> wrote:
> >  b) reset handling, including the PCI device removal as the last
> >     escalation step
>
> Apparently can't be supported, but it's not clear that this actually
> matters for a home PC...
>
> https://marc.info/?l=linux-ide&m=147733119300691&w=2
> "The driver seems to already comprehend instances where the
> device does not support nvme_reset_subsystem() requests."
>
> https://marc.info/?l=linux-ide&m=147734288604783&w=2
> "Talking with Keith, subsystem-resets are a feature of enterprise-class
> NVMe devices.  I think those features are out of scope for the class
> of devices that will find themselves in a platform with this
> configuration, same for hot-plug."

NVMe Subsystem resets are not the same thing as conventional PCI
resets. We still have use for the latter in client space.

Even if you wish to forgo the standard features and management
capabilities, you're still having to deal with legacy IRQ, which
delivers IOPS at a fraction of what the hardware achieves with MSI.
I do not believe that is what users want out of their purchased
hardware, so your best option is still to switch to AHCI mode for Linux
for this reason alone, and vendors should be providing this option in
BIOS.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-14 19:36           ` Keith Busch
@ 2019-06-14 20:05             ` Bjorn Helgaas
  2019-06-14 21:05               ` Keith Busch
  0 siblings, 1 reply; 19+ messages in thread
From: Bjorn Helgaas @ 2019-06-14 20:05 UTC (permalink / raw)
  To: Keith Busch
  Cc: Daniel Drake, Jens Axboe, Sagi Grimberg, Linux PCI, linux-nvme,
	Keith Busch, linux-ide, Linux Upstreaming Team,
	Christoph Hellwig

On Fri, Jun 14, 2019 at 01:36:07PM -0600, Keith Busch wrote:

> Even if you wish to forgo the standard features and management
> capabilities, you're still having to deal with legacy IRQ, which
> delivers IOPS at a fraction of what the hardware achieves with MSI.

> ... your best option is still to switch to AHCI mode for Linux
> for this reason alone, and vendors should be providing this option in
> BIOS.

Ugh.  Are you saying the installation instructions for Linux will say
"change the BIOS setting to AHCI"?  That's an unpleasant user
experience, especially if the installation fails if the user hasn't
read the instructions.

Bjorn

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-14 20:05             ` Bjorn Helgaas
@ 2019-06-14 21:05               ` Keith Busch
  2019-06-18  7:48                 ` Hannes Reinecke
  0 siblings, 1 reply; 19+ messages in thread
From: Keith Busch @ 2019-06-14 21:05 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Daniel Drake, Jens Axboe, Sagi Grimberg, Linux PCI, linux-nvme,
	Keith Busch, linux-ide, Linux Upstreaming Team,
	Christoph Hellwig

On Fri, Jun 14, 2019 at 2:06 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> Ugh.  Are you saying the installation instructions for Linux will say
> "change the BIOS setting to AHCI"?  That's an unpleasant user
> experience, especially if the installation fails if the user hasn't
> read the instructions.

:) That is essentially what the dmesg currently says when Linux
detects this, from drivers/ata/ahci.c:ahci_remap_check()

  "Switch your BIOS from RAID to AHCI mode to use them (remapped NVMe devices)"

Unpleasant, yes, but what's the lesser of two evils? Recommend the
supported, full-featured mode, or try to support it with degraded
capabilities?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-14  2:26         ` Daniel Drake
  2019-06-14 19:36           ` Keith Busch
@ 2019-06-18  7:46           ` Hannes Reinecke
  2019-06-18  8:06             ` Daniel Drake
  1 sibling, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2019-06-18  7:46 UTC (permalink / raw)
  To: Daniel Drake, Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Linux PCI, Keith Busch, Keith Busch,
	linux-ide, linux-nvme, Bjorn Helgaas, Linux Upstreaming Team

On 6/14/19 4:26 AM, Daniel Drake wrote:
> On Thu, Jun 13, 2019 at 4:54 PM Christoph Hellwig <hch@lst.de> wrote:
>> So until we get very clear and good documentation from Intel on that
>> I don't think any form of upstream support will fly.  And given that
>> Dan who submitted the original patch can't even talk about this thing
>> any more and apparently got a gag order doesn't really give me confidence
>> any of this will ever work.
> 
> I realise the architecture here seems badly thought out, and the lack
> of a decent spec makes the situation worse, but I'd encourage you to
> reconsider this from the perspectives of:
>  - Are the patches really more ugly than the underlying architecture?
>  - We strive to make Linux work well on common platforms and sometimes
> have to accept that hardware vendors do questionable things & do not
> fully cooperate
>  - It works out of the box on Windows
> 
Actually, there _is_ a register description:

https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/300-series-chipset-pch-datasheet-vol-2.pdf

Look for section 15: Intel RST for PCIe Storage.

That gives you a reasonable description of the various registers etc.
You'll have to translate from Intel-speak, but that should be manageable.

In general, I am _quite_ in favour of having Linux support for these
kinds of devices.
I fully agree the interface is ugly, and badly thought out.
But these devices exist and have been sold for quite some time now, so
there is no way we can influence the design on those boxes.

And we really have come a long way from the original Linux idea of
"hey, that's weird hardware, let's hack together a driver for it" to the
flat rejection of hardware we don't like.
I, for one, prefer the old way.

Looking at the spec I think that Daniel's approach of exposing an
additional PCI device is the correct way of going about it (as the
interface design can be thought of as a messed-up SR-IOV interface), but
I'll defer to Bjorn here.
But we really should have a driver to get these boxes rolling.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-14 21:05               ` Keith Busch
@ 2019-06-18  7:48                 ` Hannes Reinecke
  0 siblings, 0 replies; 19+ messages in thread
From: Hannes Reinecke @ 2019-06-18  7:48 UTC (permalink / raw)
  To: Keith Busch, Bjorn Helgaas
  Cc: Jens Axboe, Sagi Grimberg, Linux PCI, linux-nvme, Daniel Drake,
	linux-ide, Keith Busch, Linux Upstreaming Team,
	Christoph Hellwig

On 6/14/19 11:05 PM, Keith Busch wrote:
> On Fri, Jun 14, 2019 at 2:06 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>> Ugh.  Are you saying the installation instructions for Linux will say
>> "change the BIOS setting to AHCI"?  That's an unpleasant user
>> experience, especially if the installation fails if the user hasn't
>> read the instructions.
> 
> :) That is essentially what the dmesg currently says when Linux
> detects this, from drivers/ata/ahci.c:ahci_remap_check()
> 
>   "Switch your BIOS from RAID to AHCI mode to use them (remapped NVMe devices)"
> 
> Unpleasant, yes, but what's the lesser of two evils? Recommend the
> supported, full-featured mode, or try to support it with degraded
> capabilities?
> 
... and we already had several customer calls because of this.
Quite tedious, and completely unnecessary IMO.
The time for having the interface changed has passed, and we have to
live with it.
I still do prefer to have a driver for it, just to smooth the initial
user experience.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-18  7:46           ` Hannes Reinecke
@ 2019-06-18  8:06             ` Daniel Drake
  2019-06-18 15:15               ` Hannes Reinecke
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Drake @ 2019-06-18  8:06 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Linux PCI,
	Keith Busch, Keith Busch, linux-ide, linux-nvme, Bjorn Helgaas,
	Linux Upstreaming Team

On Tue, Jun 18, 2019 at 3:46 PM Hannes Reinecke <hare@suse.de> wrote:
> On 6/14/19 4:26 AM, Daniel Drake wrote:
> > On Thu, Jun 13, 2019 at 4:54 PM Christoph Hellwig <hch@lst.de> wrote:
> >> So until we get very clear and good documentation from Intel on that
> >> I don't think any form of upstream support will fly.  And given that
> >> Dan who submitted the original patch can't even talk about this thing
> >> any more and apparently got a gag order doesn't really give me confidence
> >> any of this will ever work.
> >
> > I realise the architecture here seems badly thought out, and the lack
> > of a decent spec makes the situation worse, but I'd encourage you to
> > reconsider this from the perspectives of:
> >  - Are the patches really more ugly than the underlying architecture?
> >  - We strive to make Linux work well on common platforms and sometimes
> > have to accept that hardware vendors do questionable things & do not
> > fully cooperate
> >  - It works out of the box on Windows
> >
> Actually, there _is_ a register description:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/300-series-chipset-pch-datasheet-vol-2.pdf
>
> Look for section 15: Intel RST for PCIe Storage.
>
> That gives you a reasonable description of the various registers etc.

Thanks for your email! I also spotted it for the first time earlier today.

Section 15 there (D24:F0) describes a special/hidden PCI device which
I can't figure out how to access from Linux (I believe it would be at
D18:F0 in the cases where the 300 PCH is integrated into the SoC, as
it is on the Whiskey Lake platform I have here). That's probably not
important because even if we had access, all the values are probably
read-only, as the BIOS will lock them all down during early boot, as
is documented. But the docs give some interesting insights into the
design.
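
(The naive approach would be to read it like any other function on bus
0, as in the sketch below, where the device number is only my guess;
presumably a hidden function simply doesn't answer and reads back as
all-ones.)

	struct pci_bus *bus = pci_find_bus(0, 0);
	u32 val = ~0;

	/* D18:F0 is only a guess for where the RST device sits on this SoC */
	if (bus)
		pci_bus_read_config_dword(bus, PCI_DEVFN(18, 0),
					  PCI_VENDOR_ID, &val);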

Section 15.2 is potentially more relevant, as it describes registers
within the AHCI BAR and we do have access to that. Some of these
registers are already used by the current code to determine the
presence of remapped devices. It might be nice to use Device Memory
BAR Length (DMBL_1) but I can't figure out what is meant by "A 1 in
the bit location indicates the corresponding lower memory BAR bit for
the PCIe SSD device is a Read/Write (RW) bit." The value is 0x3fff on
the platform I have here.

We can probably also use these registers for MSI support. I started to
experiment, doesn't quite work but I'll keep poking. The doc suggests
there is a single MSI-X vector for the AHCI SATA device, and AHCI
MSI-X Starting Vector (AMXV) has value 0x140 on this platform. No idea
how to interpret that value. From experimentation, the AHCI SATA disk
generates interrupts on vector 0.

Then there are multiple vectors for the remapped NVMe devices. Device
MSI-X Configuration (DMXC_L_1) is set up to assign vectors 1 to 19 to
NVMe on this platform. But it says "This field is only valid when
DMXC.ID indicates interrupt delivery using MSI-X" but what/where is
DMXC.ID? So far I can get NVMe-related interrupts on vector 1 but
apparently not enough of them; the driver hangs during probe.

I've nearly finished refreshing & extending Dan Williams' patches and
will send them for more discussion soon.

Daniel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-18  8:06             ` Daniel Drake
@ 2019-06-18 15:15               ` Hannes Reinecke
  2019-06-19 13:52                 ` Bjorn Helgaas
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2019-06-18 15:15 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Jens Axboe, Sagi Grimberg, Linux PCI, Linux Upstreaming Team,
	Keith Busch, linux-ide, linux-nvme, Keith Busch, Bjorn Helgaas,
	Christoph Hellwig

On 6/18/19 10:06 AM, Daniel Drake wrote:
> On Tue, Jun 18, 2019 at 3:46 PM Hannes Reinecke <hare@suse.de> wrote:
>> On 6/14/19 4:26 AM, Daniel Drake wrote:
>>> On Thu, Jun 13, 2019 at 4:54 PM Christoph Hellwig <hch@lst.de> wrote:
>>>> So until we get very clear and good documentation from Intel on that
>>>> I don't think any form of upstream support will fly.  And given that
>>>> Dan who submitted the original patch can't even talk about this thing
>>>> any more and apparently got a gag order doesn't really give me confidence
>>>> any of this will ever work.
>>>
>>> I realise the architecture here seems badly thought out, and the lack
>>> of a decent spec makes the situation worse, but I'd encourage you to
>>> reconsider this from the perspectives of:
>>>  - Are the patches really more ugly than the underlying architecture?
>>>  - We strive to make Linux work well on common platforms and sometimes
>>> have to accept that hardware vendors do questionable things & do not
>>> fully cooperate
>>>  - It works out of the box on Windows
>>>
>> Actually, there _is_ a register description:
>>
>> https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/300-series-chipset-pch-datasheet-vol-2.pdf
>>
>> Look for section 15: Intel RST for PCIe Storage.
>>
>> That gives you a reasonable description of the various registers etc.
> 
> Thanks for your email! I also spotted it for the first time earlier today.
> 
> Section 15 there (D24:F0) describes a special/hidden PCI device which
> I can't figure out how to access from Linux (I believe it would be at
> D18:F0 in the cases where the 300 PCH is integrated into the SoC, as
> it is on the Whiskey Lake platform I have here). That's probably not
> important because even if we had access, all the values are probably
> read-only, as the BIOS will lock them all down during early boot, as
> is documented. But the docs give some interesting insights into the
> design.
> 
> Section 15.2 is potentially more relevant, as it describes registers
> within the AHCI BAR and we do have access to that. Some of these
> registers are already used by the current code to determine the
> presence of remapped devices. It might be nice to use Device Memory
> BAR Length (DMBL_1) but I can't figure out what is meant by "A 1 in
> the bit location indicates the corresponding lower memory BAR bit for
> the PCIe SSD device is a Read/Write (RW) bit." The value is 0x3fff on
> the platform I have here.
> 
> We can probably also use these registers for MSI support. I started to
> experiment, doesn't quite work but I'll keep poking. The doc suggests
> there is a single MSI-X vector for the AHCI SATA device, and AHCI
> MSI-X Starting Vector (AMXV) has value 0x140 on this platform. No idea
> how to interpret that value. From experimentation, the AHCI SATA disk
> generates interrupts on vector 0.
> 
The 0x140 is probably the offset into the PCI config space where the
AHCI MSI-X vector table can be found ...

> Then there are multiple vectors for the remapped NVMe devices. Device
> MSI-X Configuration (DMXC_L_1) is set up to assign vectors 1 to 19 to
> NVMe on this platform. But it says "This field is only valid when
> DMXC.ID indicates interrupt delivery using MSI-X" but what/where is
> DMXC.ID? So far I can get NVMe-related interrupts on vector 1 but
> apparently not enough of them; the driver hangs during probe.
> 
I _think_ the problem here is the automatic interrupt affinity we're
doing; we probably have to exclude the AHCI vector when setting up
interrupt affinity.
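
i.e. roughly something like this when allocating the vectors (just a
sketch; "pdev" and "nr_io_queues" stand for whatever device and queue
count end up applying here):

	/*
	 * Reserve vector 0 for AHCI so that the affinity spreading for
	 * the NVMe queues only covers the remaining vectors.
	 */
	struct irq_affinity affd = {
		.pre_vectors = 1,
	};
	int nr;

	nr = pci_alloc_irq_vectors_affinity(pdev, 2, nr_io_queues + 1,
					    PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					    &affd);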

> I've nearly finished refreshing & extending Dan Williams' patches and
> will send them for more discussion soon.
> 
THX.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] PCI: Add Intel remapped NVMe device support
  2019-06-18 15:15               ` Hannes Reinecke
@ 2019-06-19 13:52                 ` Bjorn Helgaas
  0 siblings, 0 replies; 19+ messages in thread
From: Bjorn Helgaas @ 2019-06-19 13:52 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Drake, Jens Axboe, Sagi Grimberg, Linux PCI, Keith Busch,
	linux-ide, linux-nvme, Keith Busch, Linux Upstreaming Team,
	Christoph Hellwig

On Tue, Jun 18, 2019 at 05:15:52PM +0200, Hannes Reinecke wrote:
> On 6/18/19 10:06 AM, Daniel Drake wrote:

> > We can probably also use these registers for MSI support. I
> > started to experiment, doesn't quite work but I'll keep poking.
> > The doc suggests there is a single MSI-X vector for the AHCI SATA
> > device, and AHCI MSI-X Starting Vector (AMXV) has value 0x140 on
> > this platform. No idea how to interpret that value. From
> > experimentation, the AHCI SATA disk generates interrupts on vector
> > 0.
> > 
> The 0x140 is probably the offset into the PCI config space where the
> AHCI MSI-X vector table can be found ...

An MSI-X vector table is in memory space, not config space.  You'd
have to look at PCI_MSIX_TABLE_BIR to find which BAR maps it, and then
add PCI_MSIX_TABLE_OFFSET to the BAR value.
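
Roughly, for an ordinary MSI-X-capable device (sketch only; the
function name is made up):

#include <linux/pci.h>

/* Locate the MSI-X vector table of an MSI-X-capable device. */
static void __iomem *locate_msix_table(struct pci_dev *pdev)
{
	int pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
	void __iomem *base;
	u32 table;
	u8 bir;
	u32 offset;

	if (!pos)
		return NULL;	/* no MSI-X capability */

	pci_read_config_dword(pdev, pos + PCI_MSIX_TABLE, &table);
	bir = table & PCI_MSIX_TABLE_BIR;	/* BAR that maps the table */
	offset = table & PCI_MSIX_TABLE_OFFSET;	/* offset within that BAR */

	/* entries are 16 bytes: addr low, addr high, data, vector control */
	base = pci_iomap(pdev, bir, 0);
	return base ? base + offset : NULL;
}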

Bjorn

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-06-19 13:52 UTC | newest]

Thread overview: 19+ messages
2019-06-10  7:44 [PATCH] PCI: Add Intel remapped NVMe device support Daniel Drake
2019-06-10 16:00 ` Keith Busch
2019-06-11  2:46   ` Daniel Drake
2019-06-12 14:32     ` Keith Busch
2019-06-13  8:54       ` Christoph Hellwig
2019-06-14  2:26         ` Daniel Drake
2019-06-14 19:36           ` Keith Busch
2019-06-14 20:05             ` Bjorn Helgaas
2019-06-14 21:05               ` Keith Busch
2019-06-18  7:48                 ` Hannes Reinecke
2019-06-18  7:46           ` Hannes Reinecke
2019-06-18  8:06             ` Daniel Drake
2019-06-18 15:15               ` Hannes Reinecke
2019-06-19 13:52                 ` Bjorn Helgaas
2019-06-10 21:16 ` Bjorn Helgaas
2019-06-11  3:25   ` Daniel Drake
2019-06-11 19:52     ` Bjorn Helgaas
2019-06-12  3:16       ` Daniel Drake
2019-06-12 13:49         ` Bjorn Helgaas
