linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] Add support for Block Passthrough Endpoint function driver
@ 2024-02-24 21:03 Wadim Mueller
  2024-02-24 21:04 ` [PATCH 1/3] PCI: Add PCI Endpoint function driver for Block-device passthrough Wadim Mueller
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Wadim Mueller @ 2024-02-24 21:03 UTC (permalink / raw)
  Cc: Wadim Mueller, Bjorn Helgaas, Jonathan Corbet,
	Manivannan Sadhasivam, Krzysztof Wilczyński,
	Kishon Vijay Abraham I, Jens Axboe, Lorenzo Pieralisi,
	Damien Le Moal, Shunsuke Mie, linux-pci, linux-doc, linux-kernel,
	linux-block

Hello,

This series adds support for the Block Passthrough PCI(e) Endpoint function.
PCI Block Device Passthrough allows a Linux device running in EP mode to expose its block devices to the PCI(e) host (RC). The device can export either the full disk or just certain partitions.
Exporting in read-only mode is also possible. This is useful if you want to share the same block device between different SoCs, providing each SoC with its own partition(s).


Block Passthrough
==================
The PCI Block Passthrough can be a useful feature if you have multiple SoCs in your system connected
through a PCI(e) link, one running in RC mode, the other in EP mode.
It allows block devices connected to one SoC (SoC2 in EP mode in the diagram below) to be accessed
from the other SoC (SoC1 in RC mode below) without any direct connection to
those block devices (e.g. if you want to share an NVMe between two SoCs). A simple example of such a configuration is shown below:


                                                           +-------------+
                                                           |             |
                                                           |   SD Card   |
                                                           |             |
                                                           +------^------+
                                                                  |
                                                                  |
    +--------------------------+                +-----------------v----------------+
    |                          |      PCI(e)    |                                  |
    |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
    | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
    |                          |                |                                  |
    +--------------------------+                +-----------------^----------------+
                                                                  |
                                                                  |
                                                           +------v------+
                                                           |             |
                                                           |    NVMe     |
                                                           |             |
                                                           +-------------+


To a certain extent this is functionality similar to what NBD exposes over the network, but on the PCI(e) bus, utilizing the EPC/EPF kernel framework.

The Endpoint function driver creates parallel queues which run on separate CPU cores using percpu structures. The number of parallel queues is limited
by the number of CPUs on the EP device. The actual number of queues is configurable (as are all other features of the driver) through configfs.
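As with other endpoint functions, the exported devices are described through the pci_ep configfs hierarchy before the endpoint is started. A minimal sketch of such a configuration is shown below; the generic pci_ep layout is standard, but the function name `pci_epf_block_passthru` and the per-device attributes (`device_path`, `read_only`) are assumptions here — the binding documentation added by this series defines the actual interface:

```shell
# Illustrative only: function name and per-device attribute names
# (device_path, read_only) are assumptions; see the binding
# documentation in this series for the real interface.
# $EPC is the platform-specific endpoint controller name.
cd /sys/kernel/config/pci_ep
mkdir functions/pci_epf_block_passthru/disk0
echo /dev/nvme0n1p1 > functions/pci_epf_block_passthru/disk0/device_path
echo 1 > functions/pci_epf_block_passthru/disk0/read_only
ln -s functions/pci_epf_block_passthru/disk0 controllers/"$EPC"/
echo 1 > controllers/"$EPC"/start
```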

Documentation with a functional description, as well as a user guide showing how both drivers can be configured, is part of this series.

Test setup
==========

This series has been tested on an NXP S32G2 SoC running in Endpoint mode with a direct connection to an ARM64 host machine.

A performance measurement on the described setup shows good performance metrics. The S32G2 SoC has a 2xGen3 link with a maximum bandwidth of ~2GiB/s.
With the explained setup, a read data rate of 1.3GiB/s (with DMA; without DMA the speed saturated at ~200MiB/s) was achieved using a 512GiB Kingston NVMe
when accessing the NVMe from the ARM64 (SoC1) host. The local read data rate when accessing the NVMe directly from the S32G2 (SoC2) was around 1.5GiB/s.
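As a plausibility check, the ~2GiB/s figure matches a back-of-the-envelope calculation from the PCIe Gen3 signaling rate (8 GT/s per lane with 128b/130b line encoding), ignoring TLP/DLLP protocol overhead:

```shell
# Rough usable bandwidth of a 2-lane PCIe Gen3 link,
# ignoring TLP header and flow-control overhead.
awk 'BEGIN {
	gts = 8e9          # Gen3 raw signaling rate per lane (transfers/s)
	enc = 128 / 130    # 128b/130b line-encoding efficiency
	lanes = 2
	bytes = gts * enc * lanes / 8
	printf "%.2f GiB/s\n", bytes / (1024 ^ 3)
}'
# prints: 1.83 GiB/s
```

Protocol overhead reduces the usable rate further, so the measured 1.3GiB/s with DMA is a reasonable fraction of the raw link capacity.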

The measurement was done with the fio tool [1] using 4kiB blocks.

[1] https://linux.die.net/man/1/fio
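For reference, a fio job roughly equivalent to the sequential 4kiB read measurement could look as follows; the job parameters are a plausible reconstruction, not the exact ones used, and the target device path is an assumption:

```shell
# Hypothetical fio job for the 4kiB sequential-read measurement
# (parameter values and the device path are assumptions).
cat > seqread-4k.fio <<'EOF'
[seqread-4k]
rw=read
bs=4k
direct=1
ioengine=libaio
iodepth=32
filename=/dev/nvme0n1
runtime=30
time_based=1
EOF
# run with: fio seqread-4k.fio
```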

Wadim Mueller (3):
  PCI: Add PCI Endpoint function driver for Block-device passthrough
  PCI: Add PCI driver for a PCI EP remote Blockdevice
  Documentation: PCI: Add documentation for the PCI Block Passthrough

 .../function/binding/pci-block-passthru.rst   |   24 +
 Documentation/PCI/endpoint/index.rst          |    3 +
 .../pci-endpoint-block-passthru-function.rst  |  331 ++++
 .../pci-endpoint-block-passthru-howto.rst     |  158 ++
 MAINTAINERS                                   |    8 +
 drivers/block/Kconfig                         |   14 +
 drivers/block/Makefile                        |    1 +
 drivers/block/pci-remote-disk.c               | 1047 +++++++++++++
 drivers/pci/endpoint/functions/Kconfig        |   12 +
 drivers/pci/endpoint/functions/Makefile       |    1 +
 .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
 include/linux/pci-epf-block-passthru.h        |   77 +
 12 files changed, 3069 insertions(+)
 create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
 create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
 create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
 create mode 100644 drivers/block/pci-remote-disk.c
 create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
 create mode 100644 include/linux/pci-epf-block-passthru.h

-- 
2.25.1



* [PATCH 1/3] PCI: Add PCI Endpoint function driver for Block-device passthrough
  2024-02-24 21:03 [PATCH 0/3] Add support for Block Passthrough Endpoint function driver Wadim Mueller
@ 2024-02-24 21:04 ` Wadim Mueller
  2024-02-24 21:04 ` [PATCH 2/3] PCI: Add PCI driver for a PCI EP remote Blockdevice Wadim Mueller
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Wadim Mueller @ 2024-02-24 21:04 UTC (permalink / raw)
  Cc: Wadim Mueller, Bjorn Helgaas, Jonathan Corbet,
	Manivannan Sadhasivam, Krzysztof Wilczyński,
	Kishon Vijay Abraham I, Jens Axboe, Lorenzo Pieralisi,
	Shunsuke Mie, Damien Le Moal, linux-pci, linux-doc, linux-kernel,
	linux-block

Add a PCI Block Device Passthrough endpoint function driver. This
driver implements the block device function over PCI(e) on the
endpoint device.

The driver implements a simple register interface which is
configured by the host (RC) to export a certain block device attached
to the device acting as an endpoint.

Which devices are exposed and can be attached to from the host side is
configurable through configfs. Exporting in read-only mode is
possible, as is exporting only certain partitions of a block device.

The driver is further responsible for carrying out all PCI(e)-related
activities, such as mapping host memory, transferring the requested
block sectors to the host, and triggering MSIs on completion.

Signed-off-by: Wadim Mueller <wafgo01@gmail.com>
---
 drivers/pci/endpoint/functions/Kconfig        |   12 +
 drivers/pci/endpoint/functions/Makefile       |    1 +
 .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
 include/linux/pci-epf-block-passthru.h        |   77 +
 4 files changed, 1483 insertions(+)
 create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
 create mode 100644 include/linux/pci-epf-block-passthru.h

diff --git a/drivers/pci/endpoint/functions/Kconfig b/drivers/pci/endpoint/functions/Kconfig
index 0c9cea0698d7..3e7d1666642a 100644
--- a/drivers/pci/endpoint/functions/Kconfig
+++ b/drivers/pci/endpoint/functions/Kconfig
@@ -47,3 +47,15 @@ config PCI_EPF_MHI
 	   devices such as SDX55.
 
 	   If in doubt, say "N" to disable Endpoint driver for MHI bus.
+
+config PCI_EPF_BLOCK_PASSTHROUGH
+	tristate "PCI Endpoint Block Passthrough driver"
+	depends on PCI_ENDPOINT
+	select CONFIGFS_FS
+	help
+	  Select this configuration option to enable the Block Device Passthrough functionality.
+	  This driver can pass through any block device available on the system on which it is loaded.
+	  Which device is exposed as a PCI Endpoint function has to be configured through configfs.
+
+	  If in doubt, say "N" to disable the Endpoint Block Passthrough driver.
+
diff --git a/drivers/pci/endpoint/functions/Makefile b/drivers/pci/endpoint/functions/Makefile
index 696473fce50e..a2564d817762 100644
--- a/drivers/pci/endpoint/functions/Makefile
+++ b/drivers/pci/endpoint/functions/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_PCI_EPF_TEST)		+= pci-epf-test.o
 obj-$(CONFIG_PCI_EPF_NTB)		+= pci-epf-ntb.o
 obj-$(CONFIG_PCI_EPF_VNTB) 		+= pci-epf-vntb.o
 obj-$(CONFIG_PCI_EPF_MHI)		+= pci-epf-mhi.o
+obj-$(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)	+= pci-epf-block-passthru.o
diff --git a/drivers/pci/endpoint/functions/pci-epf-block-passthru.c b/drivers/pci/endpoint/functions/pci-epf-block-passthru.c
new file mode 100644
index 000000000000..44c993530484
--- /dev/null
+++ b/drivers/pci/endpoint/functions/pci-epf-block-passthru.c
@@ -0,0 +1,1393 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+* Block Device Passthrough as an Endpoint Function driver
+*
+* Author: Wadim Mueller <wafgo01@gmail.com>
+*
+* PCI Block Device Passthrough allows one Linux Device to expose its Block devices to the PCI(e) host.
+* The device can export either the full disk or just certain partitions.
+* The PCI Block Passthrough function driver is the part running on SoC2 from the diagram below.
+*
+*                                               +-------------+
+*                                               |             |
+*                                               |   SD Card   |
+*                                               |             |
+*                                               +------^------+
+*                                                      |
+*                                                      |
+*+---------------------+                +--------------v------------+       +---------+
+*|                     |                |                           |       |         |
+*|      SoC1 (RC)      |<-------------->|        SoC2 (EP)          |<----->|  eMMC   |
+*|  (pci-remote-disk)  |                | (pci-epf-block-passthru)  |       |         |
+*|                     |                |                           |       +---------+
+*+---------------------+                +--------------^------------+
+*                                                      |
+*                                                      |
+*                                               +------v------+
+*                                               |             |
+*                                               |    NVMe     |
+*                                               |             |
+*                                               +-------------+
+*
+*/
+
+#include "linux/dev_printk.h"
+#include "linux/jiffies.h"
+#include <linux/delay.h>
+#include <linux/dmaengine.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/pci_ids.h>
+#include <linux/pci-epc.h>
+#include <linux/pci-epf.h>
+#include <linux/pci_regs.h>
+#include <linux/bvec.h>
+#include <linux/kernel.h>
+#include <linux/device.h>
+#include <linux/blk-mq.h>
+#include <linux/workqueue.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/pci.h>
+#include <linux/hdreg.h>
+#include <linux/kthread.h>
+#include <linux/pci-epf-block-passthru.h>
+
+#define blockpt_readb(_x) readb(_x)
+#define blockpt_readw(_x) cpu_to_le16(readw(_x))
+#define blockpt_readl(_x) cpu_to_le32(readl(_x))
+#define blockpt_readq(_x) cpu_to_le64(readq(_x))
+
+#define blockpt_writeb(v, _x) writeb(v, _x)
+#define blockpt_writew(v, _x) writew(cpu_to_le16(v), _x)
+#define blockpt_writel(v, _x) writel(cpu_to_le32(v), _x)
+#define blockpt_writeq(v, _x) writeq(cpu_to_le64(v), _x)
+
+static struct workqueue_struct *kpciblockpt_wq;
+
+struct pci_blockpt_device_common;
+
+struct pci_epf_blockpt_queue {
+	struct pci_epf_blockpt_descr __iomem *descr;
+	dma_addr_t descr_addr;
+	u32 descr_size;
+	struct pci_blockpt_driver_ring __iomem *driver_ring;
+	struct pci_blockpt_device_ring __iomem *device_ring;
+	u32 drv_idx;
+	u32 dev_idx;
+	u32 num_desc;
+	struct task_struct *complete_thr;
+	struct task_struct *submit_thr;
+	struct list_head proc_list;
+	spinlock_t proc_lock;
+	int irq;
+	atomic_t raised_irqs;
+	struct dma_chan *dma_chan;
+	struct semaphore proc_sem;
+	struct pci_epf_blockpt_device *bpt_dev;
+};
+
+struct pci_epf_blockpt_device {
+	struct list_head node;
+	struct pci_blockpt_device_common *dcommon;
+	struct pci_epf_blockpt_queue __percpu *q;
+	struct config_group cfg_grp;
+	char *cfs_disk_name;
+	struct file *bdev_file;
+	struct block_device *bd;
+	int dev_tag;
+	int max_queue;
+	char *device_path;
+	char *dev_name;
+	bool read_only;
+	bool attached;
+	spinlock_t nm_lock;
+};
+
+struct pci_blockpt_device_common {
+	struct pci_epf_blockpt_reg __iomem *bpt_regs;
+	void __iomem *queue_base;
+	struct pci_epf *epf;
+	enum pci_barno blockpt_reg_bar;
+	size_t msix_table_offset;
+	struct delayed_work cmd_handler;
+	struct list_head devices;
+	const struct pci_epc_features *epc_features;
+	int next_disc_idx;
+	size_t queue_offset;
+	size_t queue_size;
+};
+
+static bool no_dma = false;
+static LIST_HEAD(exportable_bds);
+
+static struct pci_epf_header pci_blockpt_header = {
+	.vendorid = PCI_ANY_ID,
+	.deviceid = PCI_ANY_ID,
+	.baseclass_code = PCI_CLASS_OTHERS,
+};
+
+struct pci_epf_blockpt_info {
+	struct list_head node;
+	struct pci_epf_blockpt_queue *queue;
+	struct page *page;
+	size_t page_order;
+	size_t size;
+	struct bio *bio;
+	dma_addr_t dma_addr;
+	struct completion dma_transfer_complete;
+	struct pci_epf_blockpt_descr __iomem *descr;
+	int descr_idx;
+	void __iomem *addr;
+	phys_addr_t phys_addr;
+	enum dma_data_direction dma_dir;
+};
+
+#define blockpt_retry_delay() usleep_range(100, 500)
+#define blockpt_poll_delay() usleep_range(500, 1000)
+
+static int pci_blockpt_rq_completer(void *);
+static int pci_blockpt_rq_submitter(void *);
+
+static void
+pci_epf_blockpt_set_invalid_id_error(struct pci_blockpt_device_common *dcommon,
+				     struct pci_epf_blockpt_reg *reg)
+{
+	struct pci_epf *epf = dcommon->epf;
+	struct device *dev = &epf->dev;
+
+	dev_err(dev, "Could not find device with id: %i\n",
+		blockpt_readb(&reg->dev_idx));
+	blockpt_writel(BPT_STATUS_ERROR, &reg->status);
+}
+
+static struct pci_epf_blockpt_device *
+pci_epf_blockpt_get_device_by_id(struct pci_blockpt_device_common *dcom, u8 id)
+{
+	struct list_head *lh;
+	struct pci_epf_blockpt_device *bpt_dev;
+
+	list_for_each(lh, &exportable_bds) {
+		bpt_dev = list_entry(lh, struct pci_epf_blockpt_device, node);
+		if (bpt_dev->dev_tag == id)
+			return bpt_dev;
+	}
+
+	list_for_each(lh, &dcom->devices) {
+		bpt_dev = list_entry(lh, struct pci_epf_blockpt_device, node);
+		if (bpt_dev->dev_tag == id)
+			return bpt_dev;
+	}
+
+	return NULL;
+}
+
+static void
+move_bpt_device_to_active_list(struct pci_epf_blockpt_device *bpt_dev)
+{
+	spin_lock(&bpt_dev->nm_lock);
+	list_del(&bpt_dev->node);
+	INIT_LIST_HEAD(&bpt_dev->node);
+	list_add_tail(&bpt_dev->node, &bpt_dev->dcommon->devices);
+	spin_unlock(&bpt_dev->nm_lock);
+}
+
+static void
+move_bpt_device_to_exportable_list(struct pci_epf_blockpt_device *bpt_dev)
+{
+	spin_lock(&bpt_dev->nm_lock);
+	list_del(&bpt_dev->node);
+	INIT_LIST_HEAD(&bpt_dev->node);
+	list_add_tail(&bpt_dev->node, &exportable_bds);
+	spin_unlock(&bpt_dev->nm_lock);
+}
+
+static void free_pci_blockpt_info(struct pci_epf_blockpt_info *info)
+{
+	struct pci_blockpt_device_common *dcommon =
+		info->queue->bpt_dev->dcommon;
+	struct device *dev = &dcommon->epf->dev;
+	struct device *dma_dev = dcommon->epf->epc->dev.parent;
+	spinlock_t *lock = &info->queue->proc_lock;
+
+	dma_unmap_single(dma_dev, info->dma_addr, info->size, info->dma_dir);
+	if (info->bio->bi_opf == REQ_OP_READ) {
+		pci_epc_unmap_addr(dcommon->epf->epc, dcommon->epf->func_no,
+				   dcommon->epf->vfunc_no, info->phys_addr);
+		pci_epc_mem_free_addr(dcommon->epf->epc, info->phys_addr,
+				      info->addr, info->size);
+	}
+
+	__free_pages(info->page, info->page_order);
+
+	spin_lock_irq(lock);
+	list_del(&info->node);
+	spin_unlock_irq(lock);
+
+	bio_put(info->bio);
+	devm_kfree(dev, info);
+}
+
+static struct pci_epf_blockpt_info *
+alloc_pci_epf_blockpt_info(struct pci_epf_blockpt_queue *queue, size_t size,
+			   struct pci_epf_blockpt_descr __iomem *descr,
+			   int descr_idx, blk_opf_t opf)
+{
+	struct pci_epf_blockpt_info *binfo;
+	struct pci_blockpt_device_common *dcommon = queue->bpt_dev->dcommon;
+	struct bio *bio;
+	struct device *dev = &dcommon->epf->dev;
+	struct page *page;
+	struct device *dma_dev = dcommon->epf->epc->dev.parent;
+	dma_addr_t dma_addr;
+	struct block_device *bdev = queue->bpt_dev->bd;
+	enum dma_data_direction dma_dir =
+		(opf == REQ_OP_WRITE) ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
+	gfp_t alloc_flags = GFP_KERNEL;
+
+	binfo = devm_kzalloc(dev, sizeof(*binfo), alloc_flags);
+	if (unlikely(!binfo)) {
+		dev_err(dev, "Could not allocate bio info\n");
+		return NULL;
+	}
+
+	INIT_LIST_HEAD(&binfo->node);
+	bio = bio_alloc(bdev, 1, opf, alloc_flags);
+	if (unlikely(!bio)) {
+		dev_err(dev, "Could not allocate bio\n");
+		goto free_binfo;
+	}
+
+	binfo->size = size;
+	binfo->page_order = get_order(size);
+	page = alloc_pages(alloc_flags | GFP_DMA, binfo->page_order);
+	if (unlikely(!page)) {
+		dev_err(dev, "Could not allocate %i page(s) for bio\n",
+			1 << binfo->page_order);
+		goto put_bio;
+	}
+
+	binfo->addr = pci_epc_mem_alloc_addr(dcommon->epf->epc,
+					     &binfo->phys_addr, size);
+	if (!binfo->addr) {
+		dev_err(dev,
+			"Failed to allocate PCI address slot for transfer\n");
+		goto release_page;
+	}
+
+	dma_addr = dma_map_single(dma_dev, page_address(page), size, dma_dir);
+	if (dma_mapping_error(dma_dev, dma_addr)) {
+		dev_err(dev, "Failed to map buffer addr\n");
+		goto free_epc_mem;
+	}
+
+	init_completion(&binfo->dma_transfer_complete);
+	binfo->bio = bio;
+	binfo->dma_addr = dma_addr;
+	binfo->queue = queue;
+	binfo->page = page;
+	binfo->descr = descr;
+	binfo->descr_idx = descr_idx;
+	binfo->dma_dir = dma_dir;
+	return binfo;
+free_epc_mem:
+	pci_epc_mem_free_addr(dcommon->epf->epc, binfo->phys_addr, binfo->addr,
+			      size);
+release_page:
+	__free_pages(page, binfo->page_order);
+put_bio:
+	bio_put(bio);
+free_binfo:
+	devm_kfree(dev, binfo);
+	return NULL;
+}
+
+static void pci_epf_blockpt_transfer_complete(struct bio *bio)
+{
+	struct pci_epf_blockpt_info *binfo = bio->bi_private;
+	struct device *dev = &binfo->queue->bpt_dev->dcommon->epf->dev;
+	struct list_head *qlist = &binfo->queue->proc_list;
+	spinlock_t *lock = &binfo->queue->proc_lock;
+	struct semaphore *sem = &binfo->queue->proc_sem;
+
+	if (bio->bi_status != BLK_STS_OK)
+		dev_err_ratelimited(dev, "bio submit error %i\n",
+				    bio->bi_status);
+
+	spin_lock(lock);
+	list_add_tail(&binfo->node, qlist);
+	spin_unlock(lock);
+	up(sem);
+}
+
+static void destroy_all_worker_threads(struct pci_epf_blockpt_device *bpt_dev)
+{
+	int cpu;
+
+	for_each_present_cpu(cpu) {
+		struct pci_epf_blockpt_queue *queue =
+			per_cpu_ptr(bpt_dev->q, cpu);
+		if (queue->submit_thr) {
+			up(&queue->proc_sem);
+			queue->submit_thr = NULL;
+		}
+
+		if (queue->complete_thr) {
+			kthread_stop(queue->complete_thr);
+			queue->complete_thr = NULL;
+		}
+	}
+}
+
+static int alloc_dma_channels(struct pci_epf_blockpt_device *bpt_dev)
+{
+	dma_cap_mask_t mask;
+	int cpu, ret = 0;
+	struct device *dev = &bpt_dev->dcommon->epf->dev;
+
+	dma_cap_zero(mask);
+	dma_cap_set(DMA_MEMCPY, mask);
+
+	for_each_present_cpu(cpu) {
+		struct pci_epf_blockpt_queue *queue =
+			per_cpu_ptr(bpt_dev->q, cpu);
+		queue->dma_chan = dma_request_chan_by_mask(&mask);
+		if (IS_ERR(queue->dma_chan)) {
+			ret = PTR_ERR(queue->dma_chan);
+			dev_warn(dev,
+				 "Failed to get DMA channel %s for queue %i: %i\n",
+				 bpt_dev->dev_name, cpu, ret);
+			queue->dma_chan = NULL;
+			continue;
+		}
+		dev_info(dev, "Allocated DMA channel for %s.%d\n",
+			 bpt_dev->dev_name, cpu);
+	}
+	return ret;
+}
+
+static int start_bpt_worker_threads(struct pci_epf_blockpt_device *bpt_dev)
+{
+	int cpu, ret = 0;
+
+	char tname[64];
+	struct device *dev = &bpt_dev->dcommon->epf->dev;
+
+	for_each_present_cpu(cpu) {
+		struct pci_epf_blockpt_queue *queue =
+			per_cpu_ptr(bpt_dev->q, cpu);
+		if (cpu >= bpt_dev->max_queue)
+			break;
+
+		snprintf(tname, sizeof(tname), "%s-q%d:complete-rq",
+			 bpt_dev->dev_name, cpu);
+		dev_dbg(dev, "creating thread %s\n", tname);
+		queue->complete_thr = kthread_create_on_cpu(
+			pci_blockpt_rq_completer, queue, cpu, tname);
+		if (IS_ERR(queue->complete_thr)) {
+			ret = PTR_ERR(queue->complete_thr);
+			dev_err(dev,
+				"%s Could not create digest kernel thread: %i\n",
+				bpt_dev->device_path, ret);
+			goto check_start_errors;
+		}
+		/* we can wake up the kthread here, because it will wait for its percpu semaphore */
+		wake_up_process(queue->complete_thr);
+	}
+
+	for_each_present_cpu(cpu) {
+		struct pci_epf_blockpt_queue *queue =
+			per_cpu_ptr(bpt_dev->q, cpu);
+		if (cpu >= bpt_dev->max_queue)
+			break;
+		snprintf(tname, sizeof(tname), "%s-q%d:submit-rq",
+			 bpt_dev->dev_name, cpu);
+		dev_dbg(dev, "creating thread %s\n", tname);
+		queue->submit_thr = kthread_create_on_cpu(
+			pci_blockpt_rq_submitter, queue, cpu, tname);
+		if (IS_ERR(queue->submit_thr)) {
+			ret = PTR_ERR(queue->submit_thr);
+			dev_err(dev,
+				"%s Could not create bio submit kernel thread: %i\n",
+				bpt_dev->device_path, ret);
+			goto check_start_errors;
+		}
+		wake_up_process(queue->submit_thr);
+	}
+
+check_start_errors:
+	if (ret)
+		destroy_all_worker_threads(bpt_dev);
+	else
+		dev_info(dev, "%s started\n", bpt_dev->device_path);
+
+	return ret;
+}
+
+static void set_device_descriptor_queue(struct pci_epf_blockpt_queue *queue)
+{
+	struct device *dev = &queue->bpt_dev->dcommon->epf->dev;
+	struct pci_epf_blockpt_reg __iomem *bpt_regs =
+		queue->bpt_dev->dcommon->bpt_regs;
+
+	queue->num_desc = blockpt_readl(&bpt_regs->num_desc);
+	WARN_ON(queue->num_desc <= 16);
+
+	queue->descr_addr = (dma_addr_t)queue->bpt_dev->dcommon->queue_base +
+			    (dma_addr_t)blockpt_readl(&bpt_regs->queue_offset);
+	queue->descr_size = blockpt_readl(&bpt_regs->qsize);
+	queue->descr =
+		(struct pci_epf_blockpt_descr __iomem *)queue->descr_addr;
+	queue->driver_ring = (struct pci_blockpt_driver_ring
+				      *)((u64)queue->descr_addr +
+					 blockpt_readl(&bpt_regs->drv_offset));
+	queue->device_ring = (struct pci_blockpt_device_ring
+				      *)((u64)queue->descr_addr +
+					 blockpt_readl(&bpt_regs->dev_offset));
+	/* if the queue was (re)set, we need to reset the device and driver indices */
+	queue->dev_idx = queue->drv_idx = 0;
+
+	dev_dbg(dev,
+		"%s: mapping Queue to bus address: 0x%llX. Size = 0x%x. Driver Ring Addr: 0x%llX, Device Ring Addr: 0x%llX\n",
+		queue->bpt_dev->device_path, queue->descr_addr,
+		queue->descr_size, (u64)queue->driver_ring,
+		(u64)queue->device_ring);
+}
+
+static void pci_epf_blockpt_cmd_handler(struct work_struct *work)
+{
+	struct pci_blockpt_device_common *dcommon = container_of(
+		work, struct pci_blockpt_device_common, cmd_handler.work);
+	u32 command;
+	int ret;
+	struct pci_epf *epf = dcommon->epf;
+	struct pci_epf_blockpt_reg *reg = dcommon->bpt_regs;
+	struct pci_epf_blockpt_device *bpt_dev;
+	struct device *dev = &epf->dev;
+	struct list_head *lh;
+	struct pci_epf_blockpt_queue *queue;
+
+	command = blockpt_readl(&reg->command);
+
+	if (!command)
+		goto reset_handler;
+
+	blockpt_writel(0, &reg->command);
+	blockpt_writel(0, &reg->status);
+
+	if (command != 0 && list_empty(&exportable_bds) &&
+	    list_empty(&dcommon->devices)) {
+		WARN_ONCE(1,
+			  "Available devices must be configured first through configfs before the remote partner can send any command\n");
+		goto reset_handler;
+	}
+
+	bpt_dev = pci_epf_blockpt_get_device_by_id(
+		dcommon, blockpt_readb(&reg->dev_idx));
+	if (!bpt_dev) {
+		pci_epf_blockpt_set_invalid_id_error(dcommon, reg);
+		goto reset_handler;
+	}
+
+	if (command & BPT_COMMAND_GET_DEVICES) {
+		int nidx = 0;
+		dev_dbg(dev, "Request for available devices received\n");
+		list_for_each(lh, &exportable_bds) {
+			struct pci_epf_blockpt_device *bpt_dev = list_entry(
+				lh, struct pci_epf_blockpt_device, node);
+			nidx += snprintf(&reg->dev_name[nidx], 64, "%s%s",
+					 (nidx == 0) ? "" : ";",
+					 bpt_dev->device_path);
+		}
+
+		sprintf(&reg->dev_name[nidx], "%s", ";");
+	}
+
+	if (command & BPT_COMMAND_SET_IRQ) {
+		dev_dbg(dev, "%s setting IRQ%d for Queue %i\n",
+			bpt_dev->device_path, blockpt_readl(&reg->irq),
+			blockpt_readb(&reg->qidx));
+		WARN_ON(blockpt_readb(&reg->qidx) >= num_present_cpus());
+		queue = per_cpu_ptr(bpt_dev->q, blockpt_readb(&reg->qidx));
+		queue->irq = blockpt_readl(&reg->irq);
+	}
+
+	if (command & BPT_COMMAND_GET_NUM_SECTORS) {
+		dev_dbg(dev, "%s: Request for number of sectors received\n",
+			bpt_dev->device_path);
+		blockpt_writeq(bdev_nr_sectors(bpt_dev->bd), &reg->num_sectors);
+	}
+
+	if (command & BPT_COMMAND_SET_QUEUE) {
+		dev_dbg(dev, "%s setting Queue %i\n", bpt_dev->device_path,
+			blockpt_readb(&reg->qidx));
+		if (WARN_ON_ONCE(blockpt_readb(&reg->qidx) >=
+				 num_present_cpus())) {
+			blockpt_writel(BPT_STATUS_ERROR, &reg->status);
+			goto reset_handler;
+		}
+
+		queue = per_cpu_ptr(bpt_dev->q, blockpt_readb(&reg->qidx));
+		set_device_descriptor_queue(queue);
+	}
+
+	if (command & BPT_COMMAND_GET_PERMISSION) {
+		blockpt_writeb(bpt_dev->read_only ? BPT_PERMISSION_RO : 0,
+			       &reg->perm);
+	}
+
+	if (command & BPT_COMMAND_START) {
+		if (!no_dma) {
+			ret = alloc_dma_channels(bpt_dev);
+			if (ret)
+				dev_warn(
+					dev,
+					"could not allocate dma channels. Using PIO\n");
+		}
+		ret = start_bpt_worker_threads(bpt_dev);
+		if (ret) {
+			blockpt_writel(BPT_STATUS_ERROR, &reg->status);
+			goto reset_handler;
+		}
+		/* move the device from the exportable_devices to the active ones */
+		move_bpt_device_to_active_list(bpt_dev);
+		bpt_dev->attached = true;
+	}
+
+	if (command & BPT_COMMAND_STOP) {
+		if (bpt_dev->attached) {
+			destroy_all_worker_threads(bpt_dev);
+			move_bpt_device_to_exportable_list(bpt_dev);
+			dev_info(dev, "%s stopped\n", bpt_dev->dev_name);
+			bpt_dev->attached = false;
+		} else {
+			dev_err(dev,
+				"%s try to stop a device which was not started.\n",
+				bpt_dev->dev_name);
+			blockpt_writel(BPT_STATUS_ERROR, &reg->status);
+			goto reset_handler;
+		}
+	}
+	blockpt_writel(BPT_STATUS_SUCCESS, &reg->status);
+
+reset_handler:
+	queue_delayed_work(kpciblockpt_wq, &dcommon->cmd_handler,
+			   msecs_to_jiffies(5));
+}
+
+static void pci_epf_blockpt_unbind(struct pci_epf *epf)
+{
+	struct pci_blockpt_device_common *bpt = epf_get_drvdata(epf);
+	struct pci_epc *epc = epf->epc;
+
+	cancel_delayed_work(&bpt->cmd_handler);
+	pci_epc_clear_bar(epc, epf->func_no, epf->vfunc_no,
+			  &epf->bar[bpt->blockpt_reg_bar]);
+	pci_epf_free_space(epf, bpt->bpt_regs, bpt->blockpt_reg_bar,
+			   PRIMARY_INTERFACE);
+}
+
+static int pci_epf_blockpt_set_bars(struct pci_epf *epf)
+{
+	int ret;
+	struct pci_epf_bar *epf_reg_bar;
+	struct pci_epc *epc = epf->epc;
+	struct device *dev = &epf->dev;
+	struct pci_blockpt_device_common *dcommon = epf_get_drvdata(epf);
+	const struct pci_epc_features *epc_features;
+
+	epc_features = dcommon->epc_features;
+
+	epf_reg_bar = &epf->bar[dcommon->blockpt_reg_bar];
+	ret = pci_epc_set_bar(epc, epf->func_no, epf->vfunc_no, epf_reg_bar);
+	if (ret) {
+		pci_epf_free_space(epf, dcommon->bpt_regs,
+				   dcommon->blockpt_reg_bar, PRIMARY_INTERFACE);
+		dev_err(dev, "Failed to set Register BAR%d\n",
+			dcommon->blockpt_reg_bar);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int pci_epf_blockpt_core_init(struct pci_epf *epf)
+{
+	struct pci_blockpt_device_common *bpt = epf_get_drvdata(epf);
+	struct pci_epf_header *header = epf->header;
+	const struct pci_epc_features *epc_features;
+	struct pci_epc *epc = epf->epc;
+	struct device *dev = &epf->dev;
+	bool msix_capable = false;
+	bool msi_capable = true;
+	int ret;
+
+	epc_features = pci_epc_get_features(epc, epf->func_no, epf->vfunc_no);
+	if (epc_features) {
+		msix_capable = epc_features->msix_capable;
+		msi_capable = epc_features->msi_capable;
+	}
+
+	if (epf->vfunc_no <= 1) {
+		ret = pci_epc_write_header(epc, epf->func_no, epf->vfunc_no,
+					   header);
+		if (ret) {
+			dev_err(dev, "Configuration header write failed\n");
+			return ret;
+		}
+	}
+
+	ret = pci_epf_blockpt_set_bars(epf);
+	if (ret)
+		return ret;
+
+	/* MSIs and MSI-Xs are mutually exclusive; MSI-Xs will not work if the
+	 * configuration is done for both, simultaneously.
+	 */
+	if (msi_capable && !msix_capable) {
+		dev_info(dev, "Configuring MSIs\n");
+		ret = pci_epc_set_msi(epc, epf->func_no, epf->vfunc_no,
+				      epf->msi_interrupts);
+		if (ret) {
+			dev_err(dev, "MSI configuration failed\n");
+			return ret;
+		}
+	}
+
+	if (msix_capable) {
+		dev_info(dev, "Configuring MSI-Xs\n");
+		ret = pci_epc_set_msix(epc, epf->func_no, epf->vfunc_no,
+				       epf->msix_interrupts,
+				       bpt->blockpt_reg_bar,
+				       bpt->msix_table_offset);
+		if (ret) {
+			dev_err(dev, "MSI-X configuration failed\n");
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static int pci_epf_blockpt_alloc_space(struct pci_epf *epf)
+{
+	struct pci_blockpt_device_common *dcommon = epf_get_drvdata(epf);
+	struct device *dev = &epf->dev;
+	size_t msix_table_size = 0;
+	size_t bpt_bar_size;
+	size_t pba_size = 0;
+	bool msix_capable;
+	void *base;
+	enum pci_barno reg_bar = dcommon->blockpt_reg_bar;
+	const struct pci_epc_features *epc_features;
+	size_t bar_reg_size, desc_space;
+
+	epc_features = dcommon->epc_features;
+	bar_reg_size = ALIGN(sizeof(struct pci_epf_blockpt_reg), 128);
+	msix_capable = epc_features->msix_capable;
+	if (msix_capable) {
+		msix_table_size = PCI_MSIX_ENTRY_SIZE * epf->msix_interrupts;
+		pba_size = ALIGN(DIV_ROUND_UP(epf->msix_interrupts, 8), 8);
+	}
+
+	/* Some PCI(e) EP controllers have a very limited number of translation
+	   windows. To avoid wasting a full translation window for mapping the
+	   descriptors, the descriptors are made part of the register BAR. For
+	   this, 128KiB must be available, which is for now the bare minimum
+	   required to be supported by the EPC, though this is an arbitrary
+	   size and can be reduced. */
+	bpt_bar_size = SZ_128K;
+	if (epc_features->bar[reg_bar].type == BAR_FIXED && epc_features->bar[reg_bar].fixed_size) {
+		if (bpt_bar_size > epc_features->bar[reg_bar].fixed_size)
+			return -ENOMEM;
+
+		bpt_bar_size = epc_features->bar[reg_bar].fixed_size;
+	}
+	desc_space = bpt_bar_size - bar_reg_size - msix_table_size - pba_size;
+	dcommon->msix_table_offset = bar_reg_size + desc_space;
+
+	base = pci_epf_alloc_space(epf, bpt_bar_size, reg_bar,
+				   epc_features, PRIMARY_INTERFACE);
+	if (!base) {
+		dev_err(dev, "Failed to allocate register space\n");
+		return -ENOMEM;
+	}
+
+	dcommon->queue_offset = bar_reg_size;
+	dcommon->queue_size = desc_space;
+	dcommon->bpt_regs = base;
+	dcommon->queue_base = (void *)((u64)base + bar_reg_size);
+	return 0;
+}
+
+static int pci_epf_blockpt_link_init_notifier(struct pci_epf *epf)
+{
+	struct pci_blockpt_device_common *dcommon = epf_get_drvdata(epf);
+	queue_delayed_work(kpciblockpt_wq, &dcommon->cmd_handler,
+			   msecs_to_jiffies(1));
+	return 0;
+}
+
+static void
+pci_epf_blockpt_configure_bar(struct pci_epf *epf,
+			      const struct pci_epc_features *epc_features,
+			      enum pci_barno bar_no)
+{
+	struct pci_epf_bar *epf_bar = &epf->bar[bar_no];
+
+	if (epc_features->bar[bar_no].only_64bit)
+		epf_bar->flags |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+}
+
+static const struct pci_epc_event_ops pci_epf_blockpt_event_ops = {
+	.core_init = pci_epf_blockpt_core_init,
+	.link_up = pci_epf_blockpt_link_init_notifier,
+};
+
+static int pci_epf_blockpt_bind(struct pci_epf *epf)
+{
+	int ret;
+	struct pci_blockpt_device_common *dcommon = epf_get_drvdata(epf);
+	const struct pci_epc_features *epc_features;
+	enum pci_barno reg_bar = BAR_0;
+	struct pci_epc *epc = epf->epc;
+	bool linkup_notifier = false;
+	bool core_init_notifier = false;
+	struct pci_epf_blockpt_reg *breg;
+	struct device *dev = &epf->dev;
+
+	if (WARN_ON_ONCE(!epc))
+		return -EINVAL;
+
+	epc_features = pci_epc_get_features(epc, epf->func_no, epf->vfunc_no);
+	if (!epc_features) {
+		dev_err(&epf->dev, "epc_features not implemented\n");
+		return -EOPNOTSUPP;
+	}
+
+	linkup_notifier = epc_features->linkup_notifier;
+	core_init_notifier = epc_features->core_init_notifier;
+	reg_bar = pci_epc_get_first_free_bar(epc_features);
+	if (reg_bar < 0)
+		return -EINVAL;
+
+	dev_info(dev, "allocated BAR%d\n", reg_bar);
+	pci_epf_blockpt_configure_bar(epf, epc_features, reg_bar);
+	dcommon->blockpt_reg_bar = reg_bar;
+
+	dcommon->epc_features = epc_features;
+	ret = pci_epf_blockpt_alloc_space(epf);
+	if (ret)
+		return ret;
+
+	breg = (struct pci_epf_blockpt_reg *)dcommon->bpt_regs;
+	blockpt_writel(BLOCKPT_MAGIC, &breg->magic);
+	blockpt_writel(dcommon->queue_offset, &breg->queue_bar_offset);
+	blockpt_writel(dcommon->queue_size, &breg->available_qsize);
+	blockpt_writel(num_present_cpus(), &breg->num_queues);
+	blockpt_writel(MAX_BLOCK_DEVS, &breg->max_devs);
+	if (!core_init_notifier) {
+		ret = pci_epf_blockpt_core_init(epf);
+		if (ret)
+			return ret;
+	}
+
+	if (!linkup_notifier && !core_init_notifier)
+		queue_work(kpciblockpt_wq, &dcommon->cmd_handler.work);
+
+	return 0;
+}
+
+static const struct pci_epf_device_id pci_epf_blockpt_ids[] = {
+	{
+		.name = "pci_epf_blockpt",
+	},
+	{},
+};
+
+static void pci_epf_blockpt_dma_callback(void *param)
+{
+	struct pci_epf_blockpt_info *bio_info = param;
+	complete(&bio_info->dma_transfer_complete);
+}
+
+static int pci_blockpt_rq_submitter(void *__bpt_queue)
+{
+	struct pci_epf_blockpt_queue *queue = __bpt_queue;
+	struct device *dev = &queue->bpt_dev->dcommon->epf->dev;
+	struct pci_epf *epf = queue->bpt_dev->dcommon->epf;
+	struct pci_epc *epc = epf->epc;
+	struct pci_epf_blockpt_info *bio_info;
+	struct pci_epf_blockpt_descr loc_descr;
+	struct pci_epf_blockpt_descr __iomem *descr;
+	struct dma_async_tx_descriptor *dma_txd;
+	dma_cookie_t dma_cookie;
+	u16 de;
+	int ret = 0;
+	int err;
+
+	while (!kthread_should_stop()) {
+		while (queue->drv_idx !=
+		       blockpt_readw(&queue->driver_ring->idx)) {
+			de = blockpt_readw(
+				&queue->driver_ring->ring[queue->drv_idx]);
+			descr = &queue->descr[de];
+
+			memcpy_fromio(&loc_descr, descr, sizeof(loc_descr));
+
+			BUG_ON(!(loc_descr.si.flags & PBI_EPF_BLOCKPT_F_USED));
+
+			bio_info = alloc_pci_epf_blockpt_info(
+				queue, loc_descr.len, descr, de,
+				(loc_descr.si.opf == WRITE) ? REQ_OP_WRITE :
+							      REQ_OP_READ);
+			if (unlikely(!bio_info)) {
+				dev_err(dev, "Unable to allocate bio_info\n");
+				blockpt_retry_delay();
+				continue;
+			}
+
+			bio_set_dev(bio_info->bio, queue->bpt_dev->bd);
+			bio_info->bio->bi_iter.bi_sector = loc_descr.s_sector;
+			bio_info->bio->bi_opf = loc_descr.si.opf == WRITE ?
+							REQ_OP_WRITE :
+							REQ_OP_READ;
+			if (loc_descr.si.opf == WRITE) {
+				ret = pci_epc_map_addr(epc, epf->func_no,
+						       epf->vfunc_no,
+						       bio_info->phys_addr,
+						       loc_descr.addr,
+						       loc_descr.len);
+				if (ret) {
+					/*
+					 * This is not an error. Some PCI
+					 * controllers have very few
+					 * translation windows, and since we
+					 * run this on all available cores it
+					 * is not unusual for all of them to
+					 * be in use for a short period of
+					 * time. Instead of giving up and
+					 * panicking, just wait and retry; a
+					 * window will usually become
+					 * available within the next few
+					 * retries.
+					 */
+					dev_info_ratelimited(
+						dev,
+						"Mapping descriptor failed with %i. Retry\n",
+						ret);
+					goto err_retry;
+				}
+
+				if (queue->dma_chan) {
+					dma_txd = dmaengine_prep_dma_memcpy(
+						queue->dma_chan,
+						bio_info->dma_addr,
+						bio_info->phys_addr,
+						loc_descr.len,
+						DMA_CTRL_ACK |
+							DMA_PREP_INTERRUPT);
+					if (!dma_txd) {
+						ret = -ENODEV;
+						dev_err(dev,
+							"Failed to prepare DMA memcpy\n");
+						goto err_retry;
+					}
+
+					dma_txd->callback =
+						pci_epf_blockpt_dma_callback;
+					dma_txd->callback_param = bio_info;
+					dma_cookie =
+						dma_txd->tx_submit(dma_txd);
+					ret = dma_submit_error(dma_cookie);
+					if (ret) {
+						dev_err_ratelimited(
+							dev,
+							"Failed to do DMA tx_submit %d\n",
+							dma_cookie);
+						goto err_retry;
+					}
+
+					dma_async_issue_pending(
+						queue->dma_chan);
+					ret = wait_for_completion_interruptible_timeout(
+						&bio_info->dma_transfer_complete,
+						msecs_to_jiffies(100));
+					if (ret <= 0) {
+						ret = -ETIMEDOUT;
+						dev_err_ratelimited(
+							dev,
+							"DMA wait_for_completion timeout\n");
+						dmaengine_terminate_sync(
+							queue->dma_chan);
+						goto err_retry;
+					}
+				} else {
+					memcpy_fromio(
+						page_address(bio_info->page),
+						bio_info->addr, loc_descr.len);
+				}
+			}
+
+			bio_info->bio->bi_end_io =
+				pci_epf_blockpt_transfer_complete;
+			bio_info->bio->bi_private = bio_info;
+			err = bio_add_page(bio_info->bio, bio_info->page,
+					   loc_descr.len, 0);
+			if (err != loc_descr.len) {
+				ret = -ENOMEM;
+				dev_err_ratelimited(
+					dev, "failed to add page to bio\n");
+				goto err_retry;
+			}
+
+			queue->drv_idx = (queue->drv_idx + 1) % queue->num_desc;
+			submit_bio(bio_info->bio);
+			continue;
+
+err_retry:
+			if (loc_descr.si.opf == WRITE) {
+				pci_epc_unmap_addr(epf->epc, epf->func_no,
+						   epf->vfunc_no,
+						   bio_info->phys_addr);
+				pci_epc_mem_free_addr(epf->epc,
+						      bio_info->phys_addr,
+						      bio_info->addr,
+						      bio_info->size);
+			}
+			free_pci_blockpt_info(bio_info);
+			blockpt_retry_delay();
+		}
+		blockpt_poll_delay();
+	}
+
+	return 0;
+}
+
+static int pci_blockpt_rq_completer(void *__queue)
+{
+	struct pci_epf_blockpt_queue *queue = __queue;
+	struct device *dev = &queue->bpt_dev->dcommon->epf->dev;
+	struct pci_epf *epf = queue->bpt_dev->dcommon->epf;
+	struct pci_epf_blockpt_info *bi;
+	struct pci_epf_blockpt_descr __iomem *descr;
+	int ret;
+	struct dma_async_tx_descriptor *dma_rxd;
+	dma_cookie_t dma_cookie;
+	char *buf;
+
+	while (!kthread_should_stop()) {
+		/* wait for a new bio to finish */
+		down(&queue->proc_sem);
+		bi = list_first_entry_or_null(
+			&queue->proc_list, struct pci_epf_blockpt_info, node);
+		if (!bi) {
+			dev_info(dev, "%s: stopping digest task for queue %d\n",
+				 queue->bpt_dev->dev_name, smp_processor_id());
+			return 0;
+		}
+
+		descr = bi->descr;
+		BUG_ON(!(descr->si.flags & PBI_EPF_BLOCKPT_F_USED));
+
+		if (descr->si.opf == READ) {
+			ret = pci_epc_map_addr(epf->epc, epf->func_no,
+					       epf->vfunc_no, bi->phys_addr,
+					       descr->addr, descr->len);
+			if (ret) {
+				/*
+				 * Don't panic, simply retry. A window will
+				 * become available sooner or later.
+				 */
+				dev_info(
+					dev,
+					"Could not map read descriptor. Retry\n");
+				blockpt_retry_delay();
+				up(&queue->proc_sem);
+				continue;
+			}
+
+			if (queue->dma_chan) {
+				dma_rxd = dmaengine_prep_dma_memcpy(
+					queue->dma_chan, bi->phys_addr,
+					bi->dma_addr, descr->len,
+					DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
+				if (!dma_rxd) {
+					dev_err(dev,
+						"Failed to prepare DMA memcpy\n");
+					goto err_retry;
+				}
+
+				dma_rxd->callback =
+					pci_epf_blockpt_dma_callback;
+				dma_rxd->callback_param = bi;
+				dma_cookie = dma_rxd->tx_submit(dma_rxd);
+				ret = dma_submit_error(dma_cookie);
+				if (ret) {
+					dev_err(dev,
+						"Failed to do DMA rx_submit %d\n",
+						dma_cookie);
+					goto err_retry;
+				}
+
+				dma_async_issue_pending(queue->dma_chan);
+				ret = wait_for_completion_interruptible_timeout(
+					&bi->dma_transfer_complete,
+					msecs_to_jiffies(100));
+				if (ret <= 0) {
+					dev_err_ratelimited(
+						dev,
+						"DMA completion timed out\n");
+					dmaengine_terminate_sync(
+						queue->dma_chan);
+					goto err_retry;
+				}
+			} else {
+				buf = kmap_local_page(bi->page);
+				memcpy_toio(bi->addr, buf, bi->descr->len);
+				kunmap_local(buf);
+			}
+		}
+
+		blockpt_writew(bi->descr_idx,
+			       &queue->device_ring->ring[queue->dev_idx]);
+		queue->dev_idx = (queue->dev_idx + 1) % queue->num_desc;
+		blockpt_writew(queue->dev_idx, &queue->device_ring->idx);
+		do {
+			ret = pci_epc_raise_irq(epf->epc, epf->func_no,
+						epf->vfunc_no, PCI_IRQ_MSIX,
+						queue->irq);
+			if (ret < 0) {
+				dev_err_ratelimited(
+					dev, "could not send msix irq%d\n",
+					queue->irq);
+				blockpt_retry_delay();
+			}
+		} while (ret != 0);
+
+		atomic_inc(&queue->raised_irqs);
+		free_pci_blockpt_info(bi);
+		continue;
+err_retry:
+		pci_epc_unmap_addr(epf->epc, epf->func_no, epf->vfunc_no,
+				   bi->phys_addr);
+		blockpt_retry_delay();
+		up(&queue->proc_sem);
+	}
+
+	return 0;
+}
+
+static int pci_epf_blockpt_probe(struct pci_epf *epf,
+				 const struct pci_epf_device_id *id)
+{
+	struct pci_blockpt_device_common *dcommon;
+	struct device *dev = &epf->dev;
+
+	dcommon = devm_kzalloc(dev, sizeof(*dcommon), GFP_KERNEL);
+	if (!dcommon)
+		return -ENOMEM;
+
+	epf->header = &pci_blockpt_header;
+	dcommon->epf = epf;
+	INIT_LIST_HEAD(&dcommon->devices);
+	INIT_LIST_HEAD(&exportable_bds);
+	INIT_DELAYED_WORK(&dcommon->cmd_handler, pci_epf_blockpt_cmd_handler);
+	epf->event_ops = &pci_epf_blockpt_event_ops;
+	epf_set_drvdata(epf, dcommon);
+	return 0;
+}
+
+static void blockpt_free_per_cpu_data(struct pci_epf_blockpt_device *bpt_dev)
+{
+	if (bpt_dev->q) {
+		free_percpu(bpt_dev->q);
+		bpt_dev->q = NULL;
+	}
+}
+
+static void pci_epf_blockpt_remove(struct pci_epf *epf)
+{
+	struct pci_blockpt_device_common *dcommon = epf_get_drvdata(epf);
+	struct pci_epf_blockpt_device *bpt_dev, *dntmp;
+	unsigned long flags;
+	struct pci_epf_blockpt_info *bio_info, *bntmp;
+	int cpu;
+	struct device *dev = &dcommon->epf->dev;
+
+	list_for_each_entry_safe(bpt_dev, dntmp, &dcommon->devices, node) {
+		destroy_all_worker_threads(bpt_dev);
+		fput(bpt_dev->bdev_file);
+		spin_lock_irqsave(&bpt_dev->nm_lock, flags);
+		list_del(&bpt_dev->node);
+		spin_unlock_irqrestore(&bpt_dev->nm_lock, flags);
+
+		for_each_present_cpu(cpu) {
+			list_for_each_entry_safe(
+				bio_info, bntmp,
+				&(per_cpu_ptr(bpt_dev->q, cpu)->proc_list),
+				node) {
+				free_pci_blockpt_info(bio_info);
+			}
+		}
+
+		blockpt_free_per_cpu_data(bpt_dev);
+		kfree(bpt_dev->cfs_disk_name);
+		kfree(bpt_dev->device_path);
+		devm_kfree(dev, bpt_dev);
+	}
+}
+
+static inline struct pci_epf_blockpt_device *
+to_blockpt_dev(struct config_item *item)
+{
+	return container_of(to_config_group(item),
+			    struct pci_epf_blockpt_device, cfg_grp);
+}
+
+static ssize_t pci_blockpt_disc_name_show(struct config_item *item, char *page)
+{
+	struct pci_epf_blockpt_device *bpt_dev = to_blockpt_dev(item);
+
+	return sprintf(page, "%s\n",
+		       bpt_dev->device_path ? bpt_dev->device_path : "");
+}
+
+static ssize_t pci_blockpt_disc_name_store(struct config_item *item,
+					   const char *page, size_t len)
+{
+	int ret;
+	struct pci_epf_blockpt_device *bpt_dev = to_blockpt_dev(item);
+	struct device *dev = &bpt_dev->dcommon->epf->dev;
+	unsigned long flags;
+
+	bpt_dev->bdev_file = bdev_file_open_by_path(
+		page,
+		bpt_dev->read_only ? BLK_OPEN_READ :
+				     (BLK_OPEN_READ | BLK_OPEN_WRITE),
+		NULL, NULL);
+
+	if (IS_ERR(bpt_dev->bdev_file)) {
+		ret = PTR_ERR(bpt_dev->bdev_file);
+		if (ret != -ENOTBLK) {
+			dev_err(dev, "Failed to get block device %s: (%d)\n",
+				page, ret);
+		}
+		return ret;
+	}
+
+	kfree(bpt_dev->device_path);
+	bpt_dev->bd = file_bdev(bpt_dev->bdev_file);
+	bpt_dev->device_path = kasprintf(GFP_KERNEL, "%s", page);
+	if (unlikely(!bpt_dev->device_path)) {
+		dev_err(dev, "Unable to allocate memory for device path\n");
+		return -ENOMEM;
+	}
+
+	bpt_dev->dev_name = strrchr(bpt_dev->device_path, '/');
+	if (unlikely(!bpt_dev->dev_name))
+		bpt_dev->dev_name = bpt_dev->device_path;
+	else
+		bpt_dev->dev_name++;
+
+	spin_lock_irqsave(&bpt_dev->nm_lock, flags);
+	list_add_tail(&bpt_dev->node, &exportable_bds);
+	spin_unlock_irqrestore(&bpt_dev->nm_lock, flags);
+	return len;
+}
+
+CONFIGFS_ATTR(pci_blockpt_, disc_name);
+
+static ssize_t pci_blockpt_attached_show(struct config_item *item, char *page)
+{
+	struct pci_epf_blockpt_device *bpt_dev = to_blockpt_dev(item);
+
+	return sprintf(page, "%i\n", bpt_dev->attached);
+}
+
+CONFIGFS_ATTR_RO(pci_blockpt_, attached);
+
+static ssize_t pci_blockpt_irq_stats_show(struct config_item *item, char *page)
+{
+	struct pci_epf_blockpt_device *bpt_dev = to_blockpt_dev(item);
+	int cpu, next_idx = 0;
+
+	for_each_present_cpu(cpu) {
+		struct pci_epf_blockpt_queue *q = per_cpu_ptr(bpt_dev->q, cpu);
+
+		next_idx += sprintf(&page[next_idx], "cpu%d: %d\n", cpu,
+				    atomic_read(&q->raised_irqs));
+	}
+
+	return next_idx;
+}
+
+CONFIGFS_ATTR_RO(pci_blockpt_, irq_stats);
+
+static ssize_t pci_blockpt_max_number_of_queues_show(struct config_item *item,
+						     char *page)
+{
+	struct pci_epf_blockpt_device *bpt_dev = to_blockpt_dev(item);
+
+	return sprintf(page, "%i\n", bpt_dev->max_queue);
+}
+
+static ssize_t pci_blockpt_max_number_of_queues_store(struct config_item *item,
+						      const char *page,
+						      size_t len)
+{
+	struct pci_epf_blockpt_device *bpt_dev = to_blockpt_dev(item);
+	u32 mq;
+	int err;
+
+	err = kstrtou32(page, 10, &mq);
+	if (err || mq > num_present_cpus() || mq == 0)
+		return -EINVAL;
+
+	bpt_dev->max_queue = mq;
+	return len;
+}
+
+CONFIGFS_ATTR(pci_blockpt_, max_number_of_queues);
+
+static ssize_t pci_blockpt_read_only_show(struct config_item *item, char *page)
+{
+	struct pci_epf_blockpt_device *bpt_dev = to_blockpt_dev(item);
+
+	return sprintf(page, "%i\n", bpt_dev->read_only);
+}
+
+static ssize_t pci_blockpt_read_only_store(struct config_item *item,
+					   const char *page, size_t len)
+{
+	bool ro;
+	struct pci_epf_blockpt_device *bpt_dev = to_blockpt_dev(item);
+	int ret = kstrtobool(page, &ro);
+
+	if (ret)
+		return ret;
+
+	bpt_dev->read_only = ro;
+	return len;
+}
+
+CONFIGFS_ATTR(pci_blockpt_, read_only);
+
+static struct configfs_attribute *blockpt_attrs[] = {
+	&pci_blockpt_attr_disc_name,
+	&pci_blockpt_attr_read_only,
+	&pci_blockpt_attr_max_number_of_queues,
+	&pci_blockpt_attr_attached,
+	&pci_blockpt_attr_irq_stats,
+	NULL,
+};
+
+static const struct config_item_type blockpt_disk_type = {
+	.ct_attrs = blockpt_attrs,
+	.ct_owner = THIS_MODULE,
+};
+
+static int blockpt_alloc_per_cpu_data(struct pci_epf_blockpt_device *bpt_dev)
+{
+	int cpu;
+
+	bpt_dev->q = alloc_percpu_gfp(struct pci_epf_blockpt_queue,
+				      GFP_KERNEL | __GFP_ZERO);
+	if (!bpt_dev->q)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		struct pci_epf_blockpt_queue *q =
+			per_cpu_ptr(bpt_dev->q, cpu);
+
+		spin_lock_init(&q->proc_lock);
+		sema_init(&q->proc_sem, 0);
+		INIT_LIST_HEAD(&q->proc_list);
+		q->irq = -EINVAL;
+		q->bpt_dev = bpt_dev;
+	}
+
+	return 0;
+
+static struct config_group *pci_epf_blockpt_add_cfs(struct pci_epf *epf,
+						    struct config_group *group)
+{
+	struct pci_epf_blockpt_device *bpt_dev;
+	struct pci_blockpt_device_common *dcommon = epf_get_drvdata(epf);
+	struct device *dev = &epf->dev;
+	int ret;
+
+	bpt_dev = devm_kzalloc(dev, sizeof(*bpt_dev), GFP_KERNEL);
+	if (!bpt_dev) {
+		dev_err(dev, "Could not alloc bpt device\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	bpt_dev->max_queue = num_present_cpus();
+	bpt_dev->cfs_disk_name =
+		kasprintf(GFP_KERNEL, "disc%i", dcommon->next_disc_idx);
+	if (bpt_dev->cfs_disk_name == NULL) {
+		dev_err(dev, "Could not alloc cfs disk name\n");
+		goto free_bpt_dev;
+	}
+
+	bpt_dev->dcommon = dcommon;
+	ret = blockpt_alloc_per_cpu_data(bpt_dev);
+	if (ret)
+		goto free_bpt_dev;
+
+	spin_lock_init(&bpt_dev->nm_lock);
+	INIT_LIST_HEAD(&bpt_dev->node);
+	config_group_init_type_name(&bpt_dev->cfg_grp, bpt_dev->cfs_disk_name,
+				    &blockpt_disk_type);
+	bpt_dev->dev_tag = dcommon->next_disc_idx++;
+	return &bpt_dev->cfg_grp;
+
+free_bpt_dev:
+	devm_kfree(dev, bpt_dev);
+	return ERR_PTR(-ENOMEM);
+}
+
+static struct pci_epf_ops blockpt_ops = {
+	.unbind = pci_epf_blockpt_unbind,
+	.bind = pci_epf_blockpt_bind,
+	.add_cfs = pci_epf_blockpt_add_cfs,
+};
+
+static struct pci_epf_driver blockpt_driver = {
+	.driver.name = "pci_epf_blockpt",
+	.probe = pci_epf_blockpt_probe,
+	.remove = pci_epf_blockpt_remove,
+	.id_table = pci_epf_blockpt_ids,
+	.ops = &blockpt_ops,
+	.owner = THIS_MODULE,
+};
+
+static int __init pci_epf_blockpt_init(void)
+{
+	int ret;
+
+	kpciblockpt_wq = alloc_workqueue("kpciblockpt_wq",
+					 WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
+	if (!kpciblockpt_wq) {
+		pr_err("Failed to allocate the kpciblockpt work queue\n");
+		return -ENOMEM;
+	}
+
+	ret = pci_epf_register_driver(&blockpt_driver);
+	if (ret) {
+		destroy_workqueue(kpciblockpt_wq);
+		pr_err("Failed to register pci epf blockpt driver\n");
+		return ret;
+	}
+
+	return 0;
+}
+module_init(pci_epf_blockpt_init);
+
+static void __exit pci_epf_blockpt_exit(void)
+{
+	pci_epf_unregister_driver(&blockpt_driver);
+	destroy_workqueue(kpciblockpt_wq);
+}
+module_exit(pci_epf_blockpt_exit);
+
+module_param(no_dma, bool, 0444);
+MODULE_DESCRIPTION("PCI Endpoint Function Driver for Block Device Passthrough");
+MODULE_AUTHOR("Wadim Mueller <wafgo01@gmail.com>");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/pci-epf-block-passthru.h b/include/linux/pci-epf-block-passthru.h
new file mode 100644
index 000000000000..751f9c863901
--- /dev/null
+++ b/include/linux/pci-epf-block-passthru.h
@@ -0,0 +1,77 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Endpoint Function for Blockdevice passthrough header
+ *
+ * Author: Wadim Mueller <wafgo01@gmail.com>
+ */
+
+#ifndef __LINUX_PCI_EPF_BLOCKPT_H
+#define __LINUX_PCI_EPF_BLOCKPT_H
+
+#include <linux/types.h>
+
+#define MAX_BLOCK_DEVS (16UL)
+
+#define BLOCKPT_MAGIC 0x636f6e74
+
+#define PBI_EPF_BLOCKPT_F_USED BIT(1)
+
+#define BPT_COMMAND_SET_QUEUE BIT(6)
+#define BPT_COMMAND_GET_DEVICES BIT(7)
+#define BPT_COMMAND_START BIT(8)
+#define BPT_COMMAND_GET_NUM_SECTORS BIT(9)
+#define BPT_COMMAND_STOP BIT(10)
+#define BPT_COMMAND_SET_IRQ BIT(11)
+#define BPT_COMMAND_GET_PERMISSION BIT(12)
+
+#define BPT_STATUS_SUCCESS BIT(0)
+#define BPT_STATUS_ERROR BIT(8)
+#define BPT_STATUS_QUEUE_ADDR_INVALID BIT(9)
+
+#define BPT_PERMISSION_RO BIT(0)
+
+struct pci_epf_blockpt_reg {
+	u32 magic;
+	u32 command;
+	u32 status;
+	u32 queue_bar_offset;
+	u32 drv_offset;
+	u32 dev_offset;
+	u32 num_desc;
+	u32 max_devs;
+	u32 irq;
+	u32 qsize;
+	u32 num_queues;
+	u32 queue_offset;
+	u32 available_qsize;
+	u8 dev_idx;
+	u8 perm;
+	u8 qidx;
+	u8 bres0;
+	u64 num_sectors;
+	char dev_name[64 * MAX_BLOCK_DEVS + 1];
+} __packed;
+
+struct pci_epf_blockpt_descr {
+	u64 s_sector; /* start sector of the request */
+	u64 addr; /* where the data is located */
+	u32 len; /* number of bytes to put at addr */
+	struct blockpt_si {
+		u8 opf;
+		u8 status;
+		u8 flags;
+		u8 res0;
+	} si;
+};
+
+struct pci_blockpt_driver_ring {
+	u16 idx;
+	u16 ring[]; /* queue size */
+};
+
+struct pci_blockpt_device_ring {
+	u16 idx;
+	u16 ring[]; /* queue size */
+};
+
+#endif /* __LINUX_PCI_EPF_BLOCKPT_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/3] PCI: Add PCI driver for a PCI EP remote Blockdevice
  2024-02-24 21:03 [PATCH 0/3] Add support for Block Passthrough Endpoint function driver Wadim Mueller
  2024-02-24 21:04 ` [PATCH 1/3] PCI: Add PCI Endpoint function driver for Block-device passthrough Wadim Mueller
@ 2024-02-24 21:04 ` Wadim Mueller
  2024-02-24 21:04 ` [PATCH 3/3] Documentation: PCI: Add documentation for the PCI Block Passthrough Wadim Mueller
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Wadim Mueller @ 2024-02-24 21:04 UTC (permalink / raw)
  Cc: Wadim Mueller, Bjorn Helgaas, Jonathan Corbet,
	Manivannan Sadhasivam, Krzysztof Wilczyński,
	Kishon Vijay Abraham I, Jens Axboe, Lorenzo Pieralisi,
	Shunsuke Mie, Damien Le Moal, linux-pci, linux-doc, linux-kernel,
	linux-block

Add the PCI Remote Disk driver for the PCI Endpoint Block Passthrough
function driver. This driver allows you to access block devices exported
by a remote PCI Endpoint device (pci-epf-block-passthru) as local block
devices.

This driver is the complement of the Endpoint function driver (enabled
through the CONFIG_PCI_EPF_BLOCK_PASSTHROUGH option on the EP device,
which exposes its block devices).

After the Endpoint driver has configured which block devices it wants to
export, this driver is used to configure (again through configfs) on the
Host (RC) side which of the exported devices the Host wants to attach
to.

After the devices are attached, the Host can access them as local block
devices.
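
The attach flow can be sketched as a configfs session on the host. This
is an illustrative sketch only: the configfs mount point is standard,
but the group name (pci_remote_disk), the disc0 directory, and the exact
attribute semantics are assumptions based on this series, not a verified
interface:

```sh
# Mount configfs if it is not already mounted
mount -t configfs none /sys/kernel/config

# Inspect a disk exported by the endpoint (directory name assumed)
cat /sys/kernel/config/pci_remote_disk/disc0/remote_name

# Choose a local name and attach; the disk should then appear
# as a regular local block device
echo mydisk > /sys/kernel/config/pci_remote_disk/disc0/local_name
echo 1 > /sys/kernel/config/pci_remote_disk/disc0/attach
fdisk -l /dev/mydisk
```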

Signed-off-by: Wadim Mueller <wafgo01@gmail.com>
---
 drivers/block/Kconfig           |   14 +
 drivers/block/Makefile          |    1 +
 drivers/block/pci-remote-disk.c | 1047 +++++++++++++++++++++++++++++++
 3 files changed, 1062 insertions(+)
 create mode 100644 drivers/block/pci-remote-disk.c

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 5b9d4aaebb81..f01ae15f4a5e 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -402,6 +402,20 @@ config BLKDEV_UBLK_LEGACY_OPCODES
 	  suggested to enable N if your application(ublk server) switches to
 	  ioctl command encoding.
 
+config PCI_REMOTE_DISK
+	tristate "PCI Remote Disk"
+	depends on BLOCK && PCI
+	select CONFIGFS_FS
+	help
+	  Say Y here if you want to include the PCI remote disk driver, which allows
+	  you to map block devices exported by a remote PCI Endpoint driver as local
+	  block devices. This can be useful if you have multiple SoCs where the block
+	  devices are connected to one SoC and you want to access them from the other.
+	  The disk to attach to is selected through configfs. This option complements
+	  CONFIG_PCI_EPF_BLOCK_PASSTHROUGH, which must be set on the Endpoint device.
+
+	  If unsure, say N.
+
 source "drivers/block/rnbd/Kconfig"
 
 endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 101612cba303..94a10c87b97e 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_SUNVDC)		+= sunvdc.o
 
 obj-$(CONFIG_BLK_DEV_NBD)	+= nbd.o
 obj-$(CONFIG_VIRTIO_BLK)	+= virtio_blk.o
+obj-$(CONFIG_PCI_REMOTE_DISK)   += pci-remote-disk.o
 
 obj-$(CONFIG_XEN_BLKDEV_FRONTEND)	+= xen-blkfront.o
 obj-$(CONFIG_XEN_BLKDEV_BACKEND)	+= xen-blkback/
diff --git a/drivers/block/pci-remote-disk.c b/drivers/block/pci-remote-disk.c
new file mode 100644
index 000000000000..ed258e41997a
--- /dev/null
+++ b/drivers/block/pci-remote-disk.c
@@ -0,0 +1,1047 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * PCI Remote Disk Device Driver
+ *
+ * Wadim Mueller <wafgo01@gmail.com>
+ */
+
+#include <linux/major.h>
+#include <linux/kernel.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/blk-mq.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/delay.h>
+#include <linux/io.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/mutex.h>
+#include <linux/random.h>
+#include <linux/uaccess.h>
+#include <linux/pci.h>
+#include <linux/pci_ids.h>
+#include <linux/pci_regs.h>
+#include <linux/hdreg.h>
+#include <linux/kthread.h>
+#include <linux/configfs.h>
+#include <linux/pci-epf.h>
+#include <linux/pci-epf-block-passthru.h>
+
+#define NUM_DESRIPTORS 256
+
+/*
+* Queue Size calculation is based on the following layout
+*
+* +------------------------+
+* |      1. Descriptor     |
+* +------------------------+
+* |      2. Descriptor     |
+* +------------------------+
+* |            :           |
+* +------------------------+
+* |            :           |
+* +------------------------+
+* |     Last Descriptor    |
+* +------------------------+
+* +------------------------+
+* |     Driver Ring        |
+* |           :            |
+* |           :            |
+* +------------------------+
+* +------------------------+
+* |     Device Ring        |
+* |           :            |
+* |           :            |
+* +------------------------+
+*/
+
+#define QSIZE                                                         \
+	(ALIGN(NUM_DESRIPTORS * sizeof(struct pci_epf_blockpt_descr), \
+	       sizeof(u64)) +                                         \
+	 ALIGN(sizeof(struct pci_blockpt_driver_ring) +               \
+		       (NUM_DESRIPTORS * sizeof(u16)),                \
+	       sizeof(u64)) +                                         \
+	 ALIGN(sizeof(struct pci_blockpt_device_ring) +               \
+		       (NUM_DESRIPTORS * sizeof(u16)),                \
+	       sizeof(u64)))
+
+#define RD_STATUS_TIMEOUT_COUNT (100)
+
+#define DRV_MODULE_NAME "pci-remote-disk"
+
+#define rd_readb(_x) readb(_x)
+#define rd_readw(_x) readw(_x)
+#define rd_readl(_x) readl(_x)
+#define rd_readq(_x) readq(_x)
+
+#define rd_writeb(v, _x) writeb(v, _x)
+#define rd_writew(v, _x) writew(v, _x)
+#define rd_writel(v, _x) writel(v, _x)
+#define rd_writeq(v, _x) writeq(v, _x)
+
+struct pci_remote_disk_common;
+struct pci_remote_disk_device;
+
+struct pci_remote_disk_queue {
+	struct pci_epf_blockpt_descr __iomem *descr_ring;
+	struct pci_blockpt_driver_ring __iomem *drv_ring;
+	struct pci_blockpt_device_ring __iomem *dev_ring;
+	u64 *descr_tags;
+	u32 descr_size;
+	u32 qbar_offset;
+	u32 drv_offset;
+	u32 dev_offset;
+	u16 drv_idx;
+	u16 dev_idx;
+	int irq;
+	u16 ns_idx;
+	struct task_struct *dp_thr;
+	char irq_name[32];
+	struct semaphore dig_sem;
+	spinlock_t lock;
+	struct task_struct *digest_task;
+	struct pci_remote_disk_device *rdd;
+	u8 idx;
+};
+
+struct pci_remote_disk_device {
+	struct list_head node;
+	struct pci_remote_disk_common *rcom;
+	struct blk_mq_tag_set tag_set;
+	struct config_group cfs_group;
+	struct gendisk *gd;
+	struct pci_remote_disk_queue *queue;
+	u32 num_queues;
+	sector_t capacity;
+	char *r_name;
+	char *npr_name;
+	char *l_name;
+	u8 id;
+	bool attached;
+	bool read_only;
+	size_t queue_space_residue;
+	const struct blk_mq_queue_data *bd;
+};
+
+struct pci_remote_disk_common {
+	struct list_head bd_list;
+	struct pci_dev *pdev;
+	struct pci_epf_blockpt_reg __iomem *base;
+	void __iomem *qbase;
+	void __iomem *qbase_next;
+	void __iomem *bar[PCI_STD_NUM_BARS];
+	int num_irqs;
+	u32 num_queues;
+	size_t qsize;
+};
+
+struct pci_remote_disk_request {
+	struct pci_remote_disk_queue *queue;
+	struct bio *bio;
+	blk_status_t status;
+	struct page *pg;
+	int order;
+	int num_bios;
+	int descr_idx;
+	struct pci_epf_blockpt_descr *descr;
+};
+
+static LIST_HEAD(available_remote_disks);
+
+static irqreturn_t pci_rd_irqhandler(int irq, void *dev_id);
+static blk_status_t pci_rd_queue_rq(struct blk_mq_hw_ctx *hctx,
+				    const struct blk_mq_queue_data *bd);
+static void pci_rd_end_rq(struct request *rq);
+static enum blk_eh_timer_return pci_rd_timeout_rq(struct request *rq);
+
+static const struct blk_mq_ops pci_rd_mq_ops = { .queue_rq = pci_rd_queue_rq,
+						 .complete = pci_rd_end_rq,
+						 .timeout = pci_rd_timeout_rq };
+
+static int pci_rd_open(struct gendisk *bd_disk, fmode_t mode);
+static void pci_rd_release(struct gendisk *disk);
+static int pci_rd_getgeo(struct block_device *bdev, struct hd_geometry *geo);
+static int pci_rd_ioctl(struct block_device *bdev, fmode_t mode,
+			unsigned int cmd, unsigned long arg);
+static int pci_rd_compat_ioctl(struct block_device *bdev, fmode_t mode,
+			       unsigned int cmd, unsigned long arg);
+
+static int pci_remote_disk_dispatch(void *cookie);
+
+static const struct block_device_operations pci_rd_ops = {
+	.open = pci_rd_open,
+	.release = pci_rd_release,
+	.getgeo = pci_rd_getgeo,
+	.owner = THIS_MODULE,
+	.ioctl = pci_rd_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl = pci_rd_compat_ioctl,
+#endif
+};
+
+static int pci_remote_disk_send_command(struct pci_remote_disk_common *rcom,
+					u32 cmd)
+{
+	int timeout = 0;
+
+	smp_wmb();
+	rd_writel(cmd, &rcom->base->command);
+	while (++timeout < RD_STATUS_TIMEOUT_COUNT &&
+	       rd_readl(&rcom->base->status) != BPT_STATUS_SUCCESS) {
+		usleep_range(100, 200);
+	}
+
+	if (rd_readl(&rcom->base->status) != BPT_STATUS_SUCCESS) {
+		return -ENODEV;
+	}
+
+	rd_writel(0, &rcom->base->status);
+	return 0;
+}
+
+static inline struct pci_remote_disk_device *
+to_remote_disk_dev(struct config_item *item)
+{
+	return container_of(to_config_group(item),
+			    struct pci_remote_disk_device, cfs_group);
+}
+
+static ssize_t pci_remote_disk_group_remote_name_show(struct config_item *item,
+						      char *page)
+{
+	struct pci_remote_disk_device *rdd = to_remote_disk_dev(item);
+	return sprintf(page, "%s", rdd->r_name);
+}
+
+CONFIGFS_ATTR_RO(pci_remote_disk_group_, remote_name);
+
+static ssize_t pci_remote_disk_group_local_name_show(struct config_item *item,
+						     char *page)
+{
+	struct pci_remote_disk_device *rdd = to_remote_disk_dev(item);
+	return sprintf(page, "%s", rdd->l_name);
+}
+
+static ssize_t pci_remote_disk_group_local_name_store(struct config_item *item,
+						      const char *page,
+						      size_t len)
+{
+	struct pci_remote_disk_device *rdd = to_remote_disk_dev(item);
+
+	kfree(rdd->l_name);
+	rdd->l_name = kasprintf(GFP_KERNEL, "%s", page);
+	return len;
+}
+
+CONFIGFS_ATTR(pci_remote_disk_group_, local_name);
+
+static ssize_t pci_remote_disk_group_attach_show(struct config_item *item,
+						 char *page)
+{
+	struct pci_remote_disk_device *rdd = to_remote_disk_dev(item);
+	return sprintf(page, "%d\n", rdd->attached);
+}
+
+static int pci_remote_disk_attach(struct pci_remote_disk_device *rdd)
+{
+	int ret, i;
+	struct device *dev = &rdd->rcom->pdev->dev;
+	struct pci_epf_blockpt_reg __iomem *base = rdd->rcom->base;
+
+	rd_writeb(rdd->id, &base->dev_idx);
+
+	ret = pci_remote_disk_send_command(rdd->rcom,
+					   BPT_COMMAND_GET_NUM_SECTORS);
+	if (ret) {
+		dev_err(dev, "%s: cannot get number of sectors\n",
+			rdd->npr_name);
+		return -ENODEV;
+	}
+
+	rdd->capacity = rd_readq(&base->num_sectors);
+	dev_dbg(dev, "%s capacity 0x%llx\n", rdd->r_name, rdd->capacity);
+	ret = pci_remote_disk_send_command(rdd->rcom,
+					   BPT_COMMAND_GET_PERMISSION);
+	if (ret) {
+		dev_err(dev, "%s: cannot get permission, assume RO\n",
+			rdd->npr_name);
+		rdd->read_only = true;
+	} else {
+		rdd->read_only = rd_readb(&base->perm) & BPT_PERMISSION_RO;
+		dev_dbg(dev, "%s: read-only: %d\n", rdd->npr_name, rdd->read_only);
+	}
+
+	for (i = 0; i < rdd->num_queues; ++i) {
+		struct pci_remote_disk_queue *queue = &rdd->queue[i];
+		int irq = (rdd->id * rdd->num_queues) + i;
+
+		if (rdd->rcom->qsize < QSIZE) {
+			dev_err(dev,
+				"%s: cannot allocate queue %d, no space left\n",
+				rdd->l_name, i);
+			ret = -ENOMEM;
+			goto err_free_irq;
+		}
+
+		queue->descr_size = QSIZE;
+		queue->descr_ring = (struct pci_epf_blockpt_descr
+					     *)((u64)rdd->rcom->qbase_next +
+						(u64)(queue->descr_size));
+		rdd->rcom->qbase_next = (void __iomem *)queue->descr_ring;
+		queue->qbar_offset =
+			((u64)rdd->rcom->qbase_next - (u64)rdd->rcom->qbase);
+		memset_io(queue->descr_ring, 0, queue->descr_size);
+		queue->drv_offset =
+			ALIGN(NUM_DESRIPTORS * sizeof(*queue->descr_ring),
+			      sizeof(u64));
+		queue->dev_offset =
+			queue->drv_offset +
+			ALIGN(sizeof(struct pci_blockpt_driver_ring) +
+				      (NUM_DESRIPTORS * sizeof(u16)),
+			      sizeof(u64));
+		queue->drv_ring =
+			(struct pci_blockpt_driver_ring
+				 *)((u64)queue->descr_ring + queue->drv_offset);
+		queue->dev_ring =
+			(struct pci_blockpt_device_ring
+				 *)((u64)queue->descr_ring + queue->dev_offset);
+		sema_init(&queue->dig_sem, 0);
+		queue->dev_idx = queue->drv_idx = queue->ns_idx = 0;
+		dev_dbg(dev,
+			"%s: Setting queue %d addr. #Descriptors %i (%i Bytes). Queue Offset %d\n",
+			rdd->npr_name, i, NUM_DESRIPTORS, queue->descr_size,
+			queue->qbar_offset);
+		snprintf(queue->irq_name, sizeof(queue->irq_name), "rdd-%s-q%d",
+			 rdd->npr_name, i);
+		queue->irq = pci_irq_vector(rdd->rcom->pdev, irq);
+		ret = devm_request_irq(dev, queue->irq, pci_rd_irqhandler,
+				       IRQF_SHARED, queue->irq_name, queue);
+		if (ret) {
+			dev_err(dev, "Can't register %s IRQ. Id %i.\n",
+				queue->irq_name, queue->irq);
+			goto err_free_irq;
+		}
+
+		rd_writeb((u8)i, &base->qidx);
+		rd_writel(queue->drv_offset, &base->drv_offset);
+		rd_writel(queue->dev_offset, &base->dev_offset);
+		rd_writel(NUM_DESRIPTORS, &base->num_desc);
+		rd_writel(queue->descr_size, &base->qsize);
+
+		rd_writel(queue->qbar_offset, &base->queue_offset);
+		ret = pci_remote_disk_send_command(rdd->rcom,
+						   BPT_COMMAND_SET_QUEUE);
+		if (ret) {
+			dev_err(dev, "%s: cannot set queue %d\n", rdd->npr_name,
+				i);
+			goto err_free_irq;
+		}
+
+		rd_writel(irq + 1, &base->irq);
+		ret = pci_remote_disk_send_command(rdd->rcom,
+						   BPT_COMMAND_SET_IRQ);
+		if (ret) {
+			dev_err(dev, "%s: cannot set irq for queue %d\n",
+				rdd->npr_name, i);
+			goto err_free_irq;
+		}
+		queue->digest_task = kthread_create(pci_remote_disk_dispatch,
+						    queue, "rdt-%s.q%d",
+						    rdd->npr_name, i);
+		if (IS_ERR(queue->digest_task)) {
+			dev_err(dev,
+				"%s: Cannot create kernel digest thread for queue %d\n",
+				rdd->npr_name, i);
+			ret = PTR_ERR(queue->digest_task);
+			goto err_free_irq;
+		}
+		rdd->rcom->qsize -= QSIZE;
+		wake_up_process(queue->digest_task);
+	}
+
+	ret = pci_remote_disk_send_command(rdd->rcom, BPT_COMMAND_START);
+	if (ret) {
+		dev_err(dev, "%s: cannot start device\n", rdd->npr_name);
+		goto err_free_irq;
+	}
+
+	rdd->tag_set.ops = &pci_rd_mq_ops;
+	rdd->tag_set.queue_depth = 32;
+	rdd->tag_set.numa_node = NUMA_NO_NODE;
+	rdd->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	rdd->tag_set.nr_hw_queues = num_present_cpus();
+	rdd->tag_set.timeout = 5 * HZ;
+	rdd->tag_set.cmd_size = sizeof(struct pci_remote_disk_request);
+	rdd->tag_set.driver_data = rdd;
+	ret = blk_mq_alloc_tag_set(&rdd->tag_set);
+	if (ret) {
+		dev_err(dev, "%s: Could not allocate tag set\n", rdd->npr_name);
+		goto err_free_irq;
+	}
+
+	rdd->gd = blk_mq_alloc_disk(&rdd->tag_set, NULL, rdd);
+	if (IS_ERR(rdd->gd)) {
+		ret = PTR_ERR(rdd->gd);
+		goto err_blk_mq_free;
+	}
+
+	rdd->gd->fops = &pci_rd_ops;
+	rdd->gd->private_data = rdd->gd->queue->queuedata = rdd;
+	snprintf(rdd->gd->disk_name, sizeof(rdd->gd->disk_name), "%s",
+		 rdd->l_name);
+	set_capacity(rdd->gd, rdd->capacity);
+
+	if (rdd->read_only)
+		dev_dbg(dev, "%s attached in RO mode\n", rdd->npr_name);
+
+	rdd->attached = true;
+	set_disk_ro(rdd->gd, rdd->read_only);
+	return device_add_disk(dev, rdd->gd, NULL);
+
+err_blk_mq_free:
+	blk_mq_free_tag_set(&rdd->tag_set);
+err_free_irq:
+	for (i = 0; i < rdd->num_queues; ++i) {
+		struct pci_remote_disk_queue *queue = &rdd->queue[i];
+
+		if (queue->irq != -EINVAL)
+			devm_free_irq(dev, queue->irq, queue);
+	}
+
+	return ret;
+}
+
+static int pci_remote_disk_detach(struct pci_remote_disk_device *rdd)
+{
+	struct device *dev = &rdd->rcom->pdev->dev;
+	struct pci_epf_blockpt_reg __iomem *base = rdd->rcom->base;
+	int ret, i;
+
+	rd_writeb(rdd->id, &base->dev_idx);
+	ret = pci_remote_disk_send_command(rdd->rcom, BPT_COMMAND_STOP);
+	if (ret) {
+		dev_err(dev, "%s: cannot stop device\n", rdd->npr_name);
+		return ret;
+	}
+
+	for (i = 0; i < rdd->num_queues; ++i) {
+		struct pci_remote_disk_queue *queue = &rdd->queue[i];
+		kthread_stop(queue->digest_task);
+	}
+
+	del_gendisk(rdd->gd);
+	blk_mq_free_tag_set(&rdd->tag_set);
+	for (i = 0; i < rdd->num_queues; ++i) {
+		struct pci_remote_disk_queue *queue = &rdd->queue[i];
+		if (queue->irq != -EINVAL) {
+			devm_free_irq(dev, queue->irq, queue);
+			queue->irq = -EINVAL;
+		}
+	}
+
+	put_disk(rdd->gd);
+	rdd->attached = false;
+	return 0;
+}
+
+static ssize_t pci_remote_disk_group_attach_store(struct config_item *item,
+						  const char *page, size_t len)
+{
+	bool attach;
+	struct pci_remote_disk_device *rdd = to_remote_disk_dev(item);
+
+	int ret = kstrtobool(page, &attach);
+
+	if (ret)
+		return ret;
+
+	if (!rdd->attached && attach)
+		ret = pci_remote_disk_attach(rdd);
+	else if (rdd->attached && !attach)
+		ret = pci_remote_disk_detach(rdd);
+	else
+		ret = -EINVAL;
+
+	if (ret < 0)
+		return ret;
+
+	return len;
+}
+
+CONFIGFS_ATTR(pci_remote_disk_group_, attach);
+
+static struct configfs_attribute *pci_remote_disk_group_attrs[] = {
+	&pci_remote_disk_group_attr_remote_name,
+	&pci_remote_disk_group_attr_local_name,
+	&pci_remote_disk_group_attr_attach,
+	NULL,
+};
+
+static const struct config_item_type pci_remote_disk_group_type = {
+	.ct_owner = THIS_MODULE,
+	.ct_attrs = pci_remote_disk_group_attrs,
+};
+
+static const struct config_item_type pci_remote_disk_type = {
+	.ct_owner = THIS_MODULE,
+};
+
+static struct configfs_subsystem pci_remote_disk_subsys = {
+	.su_group = {
+		.cg_item = {
+			.ci_namebuf = "pci_remote_disk",
+			.ci_type = &pci_remote_disk_type,
+		},
+	},
+	.su_mutex = __MUTEX_INITIALIZER(pci_remote_disk_subsys.su_mutex),
+};
+
+static const struct pci_device_id pci_remote_disk_tbl[] = {
+	{
+		PCI_DEVICE(0x0, 0xc402),
+	},
+	{ 0 }
+};
+
+static int pci_rd_alloc_descriptor(struct pci_remote_disk_queue *queue)
+{
+	int i;
+	int ret = -ENOSPC;
+	struct device *dev = &queue->rdd->rcom->pdev->dev;
+	spin_lock(&queue->lock);
+	for (i = 0; i < NUM_DESRIPTORS; ++i) {
+		struct pci_epf_blockpt_descr __iomem *de =
+			&queue->descr_ring[queue->ns_idx];
+		u32 flags = READ_ONCE(de->si.flags);
+		if (!(flags & PBI_EPF_BLOCKPT_F_USED)) {
+			dev_dbg(dev, "Found free descriptor at idx %i\n",
+				queue->ns_idx);
+			WRITE_ONCE(de->si.flags,
+				   flags | PBI_EPF_BLOCKPT_F_USED);
+			ret = queue->ns_idx;
+			queue->ns_idx = (queue->ns_idx + 1) % NUM_DESRIPTORS;
+			goto unlock_return;
+		}
+		queue->ns_idx = (queue->ns_idx + 1) % NUM_DESRIPTORS;
+	}
+unlock_return:
+	spin_unlock(&queue->lock);
+	if (ret == -ENOSPC)
+		dev_err_ratelimited(dev, "No free descriptor for Queue %d\n",
+				    queue->idx);
+	return ret;
+}
+
+static bool is_valid_request(unsigned int op)
+{
+	return (op == REQ_OP_READ) || (op == REQ_OP_WRITE);
+}
+
+static blk_status_t pci_rd_queue_rq(struct blk_mq_hw_ctx *hctx,
+				    const struct blk_mq_queue_data *bd)
+{
+	struct req_iterator iter;
+	struct bio_vec bv;
+	int descr_idx;
+	struct pci_remote_disk_device *rdd = hctx->queue->queuedata;
+	struct pci_remote_disk_request *rb_req = blk_mq_rq_to_pdu(bd->rq);
+	struct device *dev = &rdd->rcom->pdev->dev;
+	struct pci_epf_blockpt_descr __iomem *dtu;
+	struct pci_blockpt_driver_ring __iomem *drv_ring;
+	dma_addr_t dma_addr;
+	char *buf;
+	blk_status_t err;
+	/*
+	 * Pick a queue based on the current CPU to distribute the load
+	 * across the available queues.
+	 */
+	struct pci_remote_disk_queue *queue =
+		&rdd->queue[smp_processor_id() % rdd->num_queues];
+
+	drv_ring = queue->drv_ring;
+	rb_req->queue = queue;
+	if (!is_valid_request(req_op(bd->rq))) {
+		dev_err(dev, "Unsupported Request: %i\n", req_op(bd->rq));
+		return BLK_STS_NOTSUPP;
+	}
+
+	descr_idx = pci_rd_alloc_descriptor(queue);
+	if (unlikely(descr_idx < 0))
+		return BLK_STS_AGAIN;
+
+	dtu = &queue->descr_ring[descr_idx];
+	rb_req->order = get_order(blk_rq_bytes(bd->rq));
+	rb_req->pg = alloc_pages(GFP_ATOMIC | GFP_DMA, rb_req->order);
+	if (unlikely(!rb_req->pg)) {
+		dev_err(dev, "cannot alloc %i page(s)\n", (1 << rb_req->order));
+		err = BLK_STS_AGAIN;
+		goto free_descr;
+	}
+
+	rb_req->descr = dtu;
+	rb_req->descr_idx = descr_idx;
+	buf = page_address(rb_req->pg);
+	dma_addr = dma_map_single(dev, buf, blk_rq_bytes(bd->rq),
+				  rq_dma_dir(bd->rq));
+	if (dma_mapping_error(dev, dma_addr)) {
+		dev_err(dev, "failed to map page for descriptor\n");
+		err = BLK_STS_AGAIN;
+		goto free_pages;
+	}
+
+	dtu->addr = dma_addr;
+	dtu->len = blk_rq_bytes(bd->rq);
+	dtu->si.opf = rq_data_dir(bd->rq);
+	if (dtu->si.opf == WRITE) {
+		rq_for_each_segment(bv, bd->rq, iter) {
+			memcpy_from_bvec(buf, &bv);
+			buf += bv.bv_len;
+		}
+	}
+
+	dtu->s_sector = blk_rq_pos(bd->rq);
+	queue->descr_tags[descr_idx] = (u64)rb_req;
+	spin_lock(&queue->lock);
+	rd_writew(descr_idx, &drv_ring->ring[queue->drv_idx]);
+	queue->drv_idx = (queue->drv_idx + 1) % NUM_DESRIPTORS;
+	rd_writew(queue->drv_idx, &drv_ring->idx);
+	spin_unlock(&queue->lock);
+	dev_dbg(dev,
+		"(DIR: %s): Adding desc %i (%i). sector: 0x%llX, len: 0x%x\n",
+		(rq_data_dir(bd->rq) == WRITE) ? "WRITE" : "READ", descr_idx,
+		queue->drv_idx, dtu->s_sector, dtu->len);
+	blk_mq_start_request(bd->rq);
+	return BLK_STS_OK;
+free_pages:
+	__free_pages(rb_req->pg, rb_req->order);
+free_descr:
+	memset(dtu, 0, sizeof(*dtu));
+	return err;
+}
+
+static void pci_rd_end_rq(struct request *rq)
+{
+	struct pci_remote_disk_request *rb_req = blk_mq_rq_to_pdu(rq);
+	blk_mq_end_request(rq, rb_req->status);
+}
+
+static enum blk_eh_timer_return pci_rd_timeout_rq(struct request *rq)
+{
+	struct pci_remote_disk_request *rb_req = blk_mq_rq_to_pdu(rq);
+	struct device *dev = &rb_req->queue->rdd->rcom->pdev->dev;
+	dev_err(dev, "%s : Timeout on queue%d: Descriptor %d\n",
+		rb_req->queue->rdd->l_name, rb_req->queue->idx,
+		rb_req->descr_idx);
+	return BLK_EH_DONE;
+}
+
+static int pci_rd_open(struct gendisk *bd_disk, fmode_t mode)
+{
+	struct pci_remote_disk_common *rcom = bd_disk->private_data;
+	dev_dbg(&rcom->pdev->dev, "%s called\n", __func__);
+	return 0;
+}
+
+static void pci_rd_release(struct gendisk *disk)
+{
+	struct pci_remote_disk_common *rcom = disk->private_data;
+	dev_dbg(&rcom->pdev->dev, "%s called\n", __func__);
+}
+
+static int pci_rd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
+{
+	struct pci_remote_disk_common *rcom = bdev->bd_disk->private_data;
+	dev_dbg(&rcom->pdev->dev, "%s called\n", __func__);
+	geo->heads = 4;
+	geo->sectors = 16;
+	geo->cylinders =
+		get_capacity(bdev->bd_disk) / (geo->heads * geo->sectors);
+	return 0;
+}
+
+static int pci_rd_ioctl(struct block_device *bdev, fmode_t mode,
+			unsigned int cmd, unsigned long arg)
+{
+	return -EINVAL;
+}
+
+#ifdef CONFIG_COMPAT
+static int pci_rd_compat_ioctl(struct block_device *bdev, fmode_t mode,
+			       unsigned int cmd, unsigned long arg)
+{
+	return pci_rd_ioctl(bdev, mode, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static irqreturn_t pci_rd_irqhandler(int irq, void *dev_id)
+{
+	struct pci_remote_disk_queue *queue = dev_id;
+	struct device *dev = &queue->rdd->rcom->pdev->dev;
+
+	BUG_ON(!queue->rdd->attached);
+	dev_dbg(dev, "IRQ%d from %s.%d\n", irq, queue->rdd->l_name, queue->idx);
+	/* wake up the process to digest the processed request */
+	up(&queue->dig_sem);
+	return IRQ_HANDLED;
+}
+
+static void pci_rd_clear_descriptor(struct pci_remote_disk_queue *queue,
+				    struct pci_epf_blockpt_descr *descr,
+				    u16 descr_idx)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&queue->lock, flags);
+	queue->descr_tags[descr_idx] = 0;
+	memset(descr, 0, sizeof(*descr));
+	spin_unlock_irqrestore(&queue->lock, flags);
+}
+
+static int pci_remote_disk_dispatch(void *cookie)
+{
+	struct pci_remote_disk_queue *queue = cookie;
+	struct device *dev = &queue->rdd->rcom->pdev->dev;
+	struct pci_blockpt_device_ring __iomem *dev_ring = queue->dev_ring;
+	struct req_iterator iter;
+	struct bio_vec bv;
+	int ret;
+	u16 descr_idx;
+	struct pci_epf_blockpt_descr *desc;
+	struct pci_remote_disk_request *rb_req;
+	struct request *rq;
+	void *buf;
+	unsigned long tmo = msecs_to_jiffies(250);
+
+	while (!kthread_should_stop()) {
+		ret = down_timeout(&queue->dig_sem, tmo);
+
+		if (rd_readw(&dev_ring->idx) == queue->dev_idx)
+			continue;
+
+		while (rd_readw(&dev_ring->idx) != queue->dev_idx) {
+			descr_idx = rd_readw(&dev_ring->ring[queue->dev_idx]);
+			desc = &queue->descr_ring[descr_idx];
+
+			BUG_ON(!(READ_ONCE(desc->si.flags) &
+				 PBI_EPF_BLOCKPT_F_USED));
+
+			rb_req = (struct pci_remote_disk_request *)
+					 queue->descr_tags[descr_idx];
+			BUG_ON(rb_req == NULL);
+
+			rq = blk_mq_rq_from_pdu(rb_req);
+
+			if (rq_data_dir(rq) == READ) {
+				/*
+				 * The buffer is a lowmem (possibly multi-page)
+				 * allocation, so use page_address() as the
+				 * write path does; kmap_local_page() would
+				 * only map the first page.
+				 */
+				buf = page_address(rb_req->pg);
+				rq_for_each_segment(bv, rq, iter) {
+					memcpy_to_bvec(&bv, buf);
+					buf += bv.bv_len;
+				}
+			}
+
+			dma_unmap_single(dev, desc->addr, desc->len,
+					 rq_dma_dir(rq));
+			rb_req->status =
+				(blk_status_t)rd_readb(&desc->si.status);
+
+			pci_rd_clear_descriptor(queue, desc, descr_idx);
+			__free_pages(rb_req->pg, rb_req->order);
+			WRITE_ONCE(queue->dev_idx,
+				   (queue->dev_idx + 1) % NUM_DESRIPTORS);
+			blk_mq_complete_request(rq);
+		}
+	}
+
+	return 0;
+}
+
+static int pci_remote_disk_parse(struct pci_remote_disk_common *rcom)
+{
+	struct pci_remote_disk_device *rdd;
+	struct list_head *lh, *lhtmp;
+	char *sbd, *ebd;
+	int count = 0;
+	int err, i;
+	char *loc_st;
+	struct device *dev = &rcom->pdev->dev;
+
+	loc_st = kasprintf(GFP_KERNEL, "%s", rcom->base->dev_name);
+	if (!loc_st)
+		return -ENOMEM;
+
+	sbd = ebd = loc_st;
+
+	while ((ebd = strchr(sbd, ';')) != NULL) {
+		rdd = kzalloc(sizeof(*rdd), GFP_KERNEL);
+		if (!rdd) {
+			dev_err(dev, "Could not allocate rd struct\n");
+			err = -ENOMEM;
+			goto err_free;
+		}
+
+		rdd->num_queues = rcom->num_queues;
+		rdd->queue = kcalloc(rdd->num_queues, sizeof(*rdd->queue),
+				     GFP_KERNEL);
+		if (rdd->queue == NULL) {
+			dev_err(dev, "unable to alloc queues for device %d\n",
+				count);
+			err = -ENOMEM;
+			goto err_free;
+		}
+
+		for (i = 0; i < rdd->num_queues; ++i) {
+			struct pci_remote_disk_queue *queue = &rdd->queue[i];
+			queue->irq = -EINVAL;
+			queue->rdd = rdd;
+			queue->idx = i;
+			spin_lock_init(&queue->lock);
+			queue->descr_tags = kcalloc(NUM_DESRIPTORS,
+						    sizeof(u64), GFP_KERNEL);
+			if (!queue->descr_tags) {
+				dev_err(dev,
+					"Could not allocate queue descriptor tags\n");
+				err = -ENOMEM;
+				goto err_free;
+			}
+		}
+
+		INIT_LIST_HEAD(&rdd->node);
+		list_add_tail(&rdd->node, &available_remote_disks);
+		rdd->r_name = kmemdup_nul(sbd, ebd - sbd, GFP_KERNEL);
+		if (!rdd->r_name) {
+			dev_err(dev,
+				"Could not allocate memory for remote device name\n");
+			err = -ENOMEM;
+			goto err_free;
+		}
+
+		rdd->rcom = rcom;
+		rdd->id = count;
+		/* strip any path separators */
+		rdd->npr_name = strrchr(rdd->r_name, '/');
+		rdd->npr_name = (rdd->npr_name == NULL) ? rdd->r_name :
+							  (rdd->npr_name + 1);
+		rdd->l_name = kasprintf(GFP_KERNEL, "pci-rd-%s", rdd->npr_name);
+		if (!rdd->l_name) {
+			dev_err(dev,
+				"Could not allocate memory for local device name\n");
+			err = -ENOMEM;
+			goto err_free;
+		}
+
+		config_group_init_type_name(&rdd->cfs_group, rdd->npr_name,
+					    &pci_remote_disk_group_type);
+		err = configfs_register_group(&pci_remote_disk_subsys.su_group,
+					      &rdd->cfs_group);
+		if (err) {
+			dev_err(dev, "Cannot register configfs group for %s\n",
+				rdd->npr_name);
+			err = -ENODEV;
+			goto err_free;
+		}
+
+		dev_info(dev, "Found %s\n", rdd->r_name);
+		sbd = ebd + 1;
+		count++;
+	}
+
+	kfree(loc_st);
+	return count;
+
+err_free:
+	kfree(loc_st);
+	list_for_each_safe(lh, lhtmp, &available_remote_disks) {
+		rdd = list_entry(lh, struct pci_remote_disk_device, node);
+		if (rdd->r_name) {
+			kfree(rdd->r_name);
+			configfs_unregister_group(&rdd->cfs_group);
+		}
+		kfree(rdd->l_name);
+		list_del(lh);
+		if (rdd->queue) {
+			for (i = 0; i < rdd->num_queues; ++i)
+				kfree(rdd->queue[i].descr_tags);
+		}
+		kfree(rdd->queue);
+		kfree(rdd);
+	}
+	return err;
+}
+
+static int pci_remote_disk_probe(struct pci_dev *pdev,
+				 const struct pci_device_id *ent)
+{
+	struct device *dev = &pdev->dev;
+	int err, num, num_irqs;
+	enum pci_barno bar;
+	enum pci_barno def_reg_bar = NO_BAR;
+	void __iomem *base;
+	struct pci_remote_disk_common *rcom =
+		devm_kzalloc(dev, sizeof(*rcom), GFP_KERNEL);
+	if (!rcom)
+		return -ENOMEM;
+
+	rcom->pdev = pdev;
+	if ((dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48)) != 0) &&
+	    dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)) != 0) {
+		err = -ENODEV;
+		dev_err(dev, "Cannot set DMA mask\n");
+		goto out_free_dev;
+	}
+
+	err = pci_enable_device(pdev);
+	if (err) {
+		dev_err(dev, "Cannot enable PCI device\n");
+		goto out_free_dev;
+	}
+
+	err = pci_request_regions(pdev, DRV_MODULE_NAME);
+	if (err) {
+		dev_err(dev, "Cannot obtain PCI resources\n");
+		goto err_disable_pdev;
+	}
+
+	pci_set_master(pdev);
+	for (bar = 0; bar < PCI_STD_NUM_BARS; bar++) {
+		if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
+			base = pci_ioremap_bar(pdev, bar);
+			if (!base) {
+				dev_err(dev, "Failed to map BAR%d\n", bar);
+				continue;
+			}
+			rcom->bar[bar] = base;
+			if (rd_readl(base) == BLOCKPT_MAGIC) {
+				def_reg_bar = bar;
+				dev_dbg(dev, "valid magic found at BAR%d\n",
+					bar);
+				break;
+			}
+		}
+	}
+
+	if (def_reg_bar == NO_BAR) {
+		err = -ENODEV;
+		dev_err(dev, "Unable to find valid BAR\n");
+		goto err_iounmap;
+	}
+
+	rcom->base = rcom->bar[def_reg_bar];
+	if (!rcom->base) {
+		err = -ENOMEM;
+		dev_err(dev, "Cannot perform PCI communication without BAR%d\n",
+			def_reg_bar);
+		goto err_iounmap;
+	}
+
+	rcom->qbase = rcom->qbase_next =
+		(void __iomem *)rcom->base +
+		rd_readl(&rcom->base->queue_bar_offset);
+	rcom->qsize = rd_readl(&rcom->base->available_qsize);
+	rcom->num_queues = rd_readb(&rcom->base->num_queues);
+	dev_dbg(dev, "%d queues per device available\n", rcom->num_queues);
+
+	err = pci_remote_disk_send_command(rcom, BPT_COMMAND_GET_DEVICES);
+	if (err) {
+		dev_err(dev, "Cannot get devices\n");
+		goto err_iounmap;
+	}
+
+	dev_dbg(dev, "%s available\n", rcom->base->dev_name);
+	config_group_init(&pci_remote_disk_subsys.su_group);
+	err = configfs_register_subsystem(&pci_remote_disk_subsys);
+	if (err) {
+		dev_err(dev, "Error %d while registering subsystem %s\n", err,
+			pci_remote_disk_subsys.su_group.cg_item.ci_namebuf);
+		goto err_iounmap;
+	}
+
+	INIT_LIST_HEAD(&available_remote_disks);
+	num = pci_remote_disk_parse(rcom);
+	if (num <= 0) {
+		dev_err(dev, "Unable to parse any valid disk\n");
+		err = -ENODEV;
+		goto err_iounmap;
+	}
+
+	num_irqs = num * rcom->num_queues;
+	/* alloc one vector per queue */
+	rcom->num_irqs = pci_alloc_irq_vectors(pdev, 1, num_irqs,
+					       PCI_IRQ_MSIX | PCI_IRQ_MSI);
+	if (rcom->num_irqs < num_irqs)
+		dev_err(dev, "Failed to get %i MSI-X interrupts: Returned %i\n",
+			num_irqs, rcom->num_irqs);
+
+	dev_dbg(dev, "Allocated %i IRQ Vectors\n", rcom->num_irqs);
+	pci_set_drvdata(pdev, rcom);
+	return 0;
+
+err_iounmap:
+	for (bar = 0; bar < PCI_STD_NUM_BARS; bar++) {
+		if (rcom->bar[bar]) {
+			pci_iounmap(pdev, rcom->bar[bar]);
+			rcom->bar[bar] = NULL;
+		}
+	}
+
+	pci_free_irq_vectors(pdev);
+	pci_release_regions(pdev);
+err_disable_pdev:
+	pci_disable_device(pdev);
+out_free_dev:
+	devm_kfree(dev, rcom);
+	return err;
+}
+
+static void pci_remote_disk_remove(struct pci_dev *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct pci_remote_disk_common *rcom = pci_get_drvdata(pdev);
+	struct pci_remote_disk_device *rdd, *tmp_rdd;
+	int i;
+
+	list_for_each_entry_safe(rdd, tmp_rdd, &available_remote_disks, node) {
+		if (rdd->attached)
+			pci_remote_disk_detach(rdd);
+
+		kfree(rdd->r_name);
+		kfree(rdd->l_name);
+		configfs_unregister_group(&rdd->cfs_group);
+		for (i = 0; i < rdd->num_queues; ++i) {
+			struct pci_remote_disk_queue *queue = &rdd->queue[i];
+			kfree(queue->descr_tags);
+		}
+		kfree(rdd->queue);
+		list_del(&rdd->node);
+		kfree(rdd);
+	}
+
+	configfs_unregister_subsystem(&pci_remote_disk_subsys);
+	rcom->num_irqs = 0;
+
+	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
+		if (rcom->bar[i]) {
+			pci_iounmap(pdev, rcom->bar[i]);
+			rcom->bar[i] = NULL;
+		}
+	}
+
+	pci_free_irq_vectors(pdev);
+	pci_release_regions(pdev);
+	pci_disable_device(pdev);
+	devm_kfree(dev, rcom);
+}
+
+MODULE_DEVICE_TABLE(pci, pci_remote_disk_tbl);
+
+static struct pci_driver pci_remote_disk_driver = {
+	.name = DRV_MODULE_NAME,
+	.id_table = pci_remote_disk_tbl,
+	.probe = pci_remote_disk_probe,
+	.remove = pci_remote_disk_remove,
+	.sriov_configure = pci_sriov_configure_simple,
+};
+
+module_pci_driver(pci_remote_disk_driver);
+
+MODULE_AUTHOR("Wadim Mueller <wafgo01@gmail.com>");
+MODULE_DESCRIPTION("Remote PCI Endpoint Disk driver");
+MODULE_LICENSE("GPL");
-- 
2.25.1



* [PATCH 3/3] Documentation: PCI: Add documentation for the PCI Block Passthrough
  2024-02-24 21:03 [PATCH 0/3] Add support for Block Passthrough Endpoint function driver Wadim Mueller
  2024-02-24 21:04 ` [PATCH 1/3] PCI: Add PCI Endpoint function driver for Block-device passthrough Wadim Mueller
  2024-02-24 21:04 ` [PATCH 2/3] PCI: Add PCI driver for a PCI EP remote Blockdevice Wadim Mueller
@ 2024-02-24 21:04 ` Wadim Mueller
  2024-02-25 16:09 ` [PATCH 0/3] Add support for Block Passthrough Endpoint function driver Manivannan Sadhasivam
  2024-02-26 11:08 ` Christoph Hellwig
  4 siblings, 0 replies; 11+ messages in thread
From: Wadim Mueller @ 2024-02-24 21:04 UTC (permalink / raw)
  Cc: Wadim Mueller, Bjorn Helgaas, Jonathan Corbet,
	Manivannan Sadhasivam, Krzysztof Wilczyński,
	Kishon Vijay Abraham I, Jens Axboe, Lorenzo Pieralisi,
	Damien Le Moal, Shunsuke Mie, linux-pci, linux-doc, linux-kernel,
	linux-block

Add documentation for the PCI Block Passthrough function device. The endpoint function
driver and the host PCI driver should be configured based on this documentation.

Signed-off-by: Wadim Mueller <wafgo01@gmail.com>
---
 .../function/binding/pci-block-passthru.rst   |  24 ++
 Documentation/PCI/endpoint/index.rst          |   3 +
 .../pci-endpoint-block-passthru-function.rst  | 331 ++++++++++++++++++
 .../pci-endpoint-block-passthru-howto.rst     | 158 +++++++++
 MAINTAINERS                                   |   8 +
 5 files changed, 524 insertions(+)
 create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
 create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
 create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst

diff --git a/Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst b/Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
new file mode 100644
index 000000000000..60820edce594
--- /dev/null
+++ b/Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+PCI Block Passthrough Endpoint Function
+=======================================
+
+name: Should be "pci_epf_blockpt" to bind to the pci_epf_blockpt driver.
+
+Configurable Fields:
+
+================   ===========================================================
+vendorid	   should be 0x0000
+deviceid	   should be 0xc402 for S32CC
+revid		   don't care
+progif_code	   don't care
+subclass_code	   don't care
+baseclass_code	   should be 0xff
+cache_line_size	   don't care
+subsys_vendor_id   don't care
+subsys_id	   don't care
+interrupt_pin	   don't care
+msi_interrupts	   don't care
+msix_interrupts	   don't care
+================   ===========================================================
diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst
index 4d2333e7ae06..2e4e5ac114df 100644
--- a/Documentation/PCI/endpoint/index.rst
+++ b/Documentation/PCI/endpoint/index.rst
@@ -15,6 +15,9 @@ PCI Endpoint Framework
    pci-ntb-howto
    pci-vntb-function
    pci-vntb-howto
+   pci-endpoint-block-passthru-function
+   pci-endpoint-block-passthru-howto
 
    function/binding/pci-test
    function/binding/pci-ntb
+   function/binding/pci-block-passthru
diff --git a/Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst b/Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
new file mode 100644
index 000000000000..dc78d32d8cc2
--- /dev/null
+++ b/Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
@@ -0,0 +1,331 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+PCI Block Device Passthrough Function
+=====================================
+
+:Author: Wadim Mueller <wafgo01@gmail.com>
+
+PCI Block Device Passthrough allows a Linux device running in endpoint (EP) mode to expose its block devices to the PCI(e) host. The device can export either full disks or individual partitions, and devices can also be exported read-only.
+
+This feature is useful if you have a direct connection between two PCI-capable SoCs, one running as Root Complex and the other in Endpoint mode, and you want to give the RC device access to some (or all) block devices attached to the SoC running in EP mode. The functionality is to a certain extent similar to what NBD provides over a network, but over the PCI(e) bus, utilizing the EPC/EPF kernel framework.
+
+The diagram below shows a possible setup with two SoCs, SoC1 working in RC mode and SoC2 in EP mode.
+SoC2 can then export the NVMe, the eMMC and the SD Card attached to it (full disks or individual partitions). For this,
+the *pci-epf-block-passthru* driver (located at **drivers/pci/endpoint/functions/pci-epf-block-passthru.c**)
+must be loaded on SoC2, while SoC1 requires the PCI driver *pci-remote-disk* (located at **drivers/block/pci-remote-disk.c**).
+
+After both drivers are loaded, SoC2 can configure through ConfigFS which devices it wants to expose.
+SoC1 can afterwards configure, also through ConfigFS, which of the exported devices it wants to attach to.
+After attaching, a disk is registered on SoC1 which can be accessed like any local disk.
+
+
+.. code-block:: text
+
+
+                                                           +-------------+
+                                                           |             |
+                                                           |   SD Card   |
+                                                           |             |
+                                                           +------^------+
+                                                                  |
+                                                                  |
+    +--------------------------+                +-----------------v----------------+
+    |                          |      PCI(e)    |                                  |
+    |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
+    | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
+    |                          |                |                                  |
+    +--------------------------+                +-----------------^----------------+
+                                                                  |
+                                                                  |
+                                                           +------v------+
+                                                           |             |
+                                                           |    NVMe     |
+                                                           |             |
+                                                           +-------------+
+
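The host-side ConfigFS flow described above, as implemented by the *pci-remote-disk* driver, can be sketched as a shell session (the group name ``nvme0n1p1`` is only an example; the actual group names depend on which devices the EP exports):

```shell
# Each disk exported by the EP appears as a group under the
# pci_remote_disk configfs subsystem once the host driver has probed.
cd /sys/kernel/config/pci_remote_disk/nvme0n1p1

cat remote_name           # device name as exported by the EP (read-only)
echo mydisk > local_name  # optional: override the default pci-rd-* name
echo 1 > attach           # attach: registers the block device locally
echo 0 > attach           # detach it again
```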
+.. _register description:
+
+Registers
+---------
+
+The PCI Block Device Passthrough has the following registers:
+
+1) PCI_BLOCKPT_MAGIC               (offset 0x00)
+2) PCI_BLOCKPT_COMMAND             (offset 0x04)
+3) PCI_BLOCKPT_STATUS              (offset 0x08)
+4) PCI_BLOCKPT_QUEUE_BAR_OFFSET    (offset 0x0C)
+5) PCI_BLOCKPT_DRV_OFFSET          (offset 0x10)
+6) PCI_BLOCKPT_DEV_OFFSET          (offset 0x14)
+7) PCI_BLOCKPT_NUM_DESC            (offset 0x18)
+8) PCI_BLOCKPT_MAX_DEVS            (offset 0x1C)
+9) PCI_BLOCKPT_IRQ                 (offset 0x20)
+10) PCI_BLOCKPT_QSIZE              (offset 0x24)
+11) PCI_BLOCKPT_NUM_QUEUES         (offset 0x28)
+12) PCI_BLOCKPT_QUEUE_OFFSET       (offset 0x2C)
+13) PCI_BLOCKPT_AVAIL_QUEUE_SIZE   (offset 0x30)
+14) PCI_BLOCKPT_DEV_IDX            (offset 0x34)
+15) PCI_BLOCKPT_PERM               (offset 0x35)
+16) PCI_BLOCKPT_QUEUE_IDX          (offset 0x36)
+17) PCI_BLOCKPT_NUM_SECTORS        (offset 0x38)
+18) PCI_BLOCKPT_DEV_NAME           (offset 0x40)
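Assuming natural alignment, the register map above can be mirrored by a C struct. This is only an illustrative sketch: the field names are assumptions, and the in-kernel ``struct pci_epf_blockpt_reg`` may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the BlockPT register block described above. */
struct blockpt_regs {
	uint32_t magic;            /* 0x00: PCI_BLOCKPT_MAGIC */
	uint32_t command;          /* 0x04: PCI_BLOCKPT_COMMAND */
	uint32_t status;           /* 0x08: PCI_BLOCKPT_STATUS */
	uint32_t queue_bar_offset; /* 0x0C: PCI_BLOCKPT_QUEUE_BAR_OFFSET */
	uint32_t drv_offset;       /* 0x10: PCI_BLOCKPT_DRV_OFFSET */
	uint32_t dev_offset;       /* 0x14: PCI_BLOCKPT_DEV_OFFSET */
	uint32_t num_desc;         /* 0x18: PCI_BLOCKPT_NUM_DESC */
	uint32_t max_devs;         /* 0x1C: PCI_BLOCKPT_MAX_DEVS */
	uint32_t irq;              /* 0x20: PCI_BLOCKPT_IRQ */
	uint32_t qsize;            /* 0x24: PCI_BLOCKPT_QSIZE */
	uint32_t num_queues;       /* 0x28: PCI_BLOCKPT_NUM_QUEUES */
	uint32_t queue_offset;     /* 0x2C: PCI_BLOCKPT_QUEUE_OFFSET */
	uint32_t available_qsize;  /* 0x30: PCI_BLOCKPT_AVAIL_QUEUE_SIZE */
	uint8_t  dev_idx;          /* 0x34: PCI_BLOCKPT_DEV_IDX */
	uint8_t  perm;             /* 0x35: PCI_BLOCKPT_PERM */
	uint16_t queue_idx;        /* 0x36: PCI_BLOCKPT_QUEUE_IDX */
	uint64_t num_sectors;      /* 0x38: PCI_BLOCKPT_NUM_SECTORS */
	char     dev_name[];       /* 0x40: PCI_BLOCKPT_DEV_NAME */
};

/* Natural alignment reproduces the documented offsets. */
_Static_assert(offsetof(struct blockpt_regs, num_sectors) == 0x38,
	       "layout must match the register map");
_Static_assert(offsetof(struct blockpt_regs, dev_name) == 0x40,
	       "layout must match the register map");
```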
+
+Registers Description
+---------------------
+
+* **PCI_BLOCKPT_MAGIC**
+
+This 32-bit register is used by the endpoint to identify itself to the host driver as a BlockPT device. It must contain the value 0x636f6e74; any other value is rejected by the host driver. The host driver also
+examines this magic register to autodetect at which BAR the register block is mapped.
+
+* **PCI_BLOCKPT_COMMAND**
+
+This register is used by the host driver to set up the EP device so that it exports the desired block device. Every operation the host performs through ConfigFS is translated into a corresponding command value in this register.
+
+.. _command bitfield description:
+
+========        ================================================================
+Bitfield        Description
+========        ================================================================
+Bit 0           unused
+Bit 1           unused
+Bit 2           unused
+Bit 3           unused
+Bit 4           unused
+Bit 5           unused
+Bit 6           **SET_QUEUE**: Tells the Endpoint at which offset within the BAR the queue
+                is located. This information is used by the EP to find the corresponding
+                descriptor queue for the device. The PCI_BLOCKPT_QUEUE_IDX register from `register description`_ identifies the queue ID this command refers to, PCI_BLOCKPT_QSIZE the BAR size to reserve for this queue, and PCI_BLOCKPT_DEV_IDX the device ID of this queue.
+Bit 7           **GET_DEVICES**: Through this command bit the host requests from the
+                EP the list of devices the EP wants to export. The
+                answer is placed into the PCI_BLOCKPT_DEV_NAME register,
+                where all exported devices appear in a ';'-separated list
+                of device names
+Bit 8           **START**: After configuring the corresponding device, this command
+                is used by the driver to attach to the device. On EP side worker
+                threads are generated to process the descriptors from the host
+                side
+Bit 9           **NUM_SECTORS**: Get the number of sectors. The host issues this command to get the
+                size of the block device in number of 512-byte sectors
+Bit 10          **STOP**: Sent to detach from the block device. On reception, all
+                worker threads are terminated.
+Bit 11          **SET_IRQ**: Sets the IRQ id for the device and queue (identified by PCI_BLOCKPT_QUEUE_IDX and PCI_BLOCKPT_DEV_IDX from `register description`_)
+Bit 12          **GET_PERMISSION**: Gets the permission for the device, whether Readonly or Read-Write		
+========        ================================================================
+
+
+* **PCI_BLOCKPT_STATUS**
+
+This register reflects the status of the PCI Block Passthrough device.
+
+========       ==============================
+Bitfield       Description
+========       ==============================
+Bit 0          success
+Bit 1          unused
+Bit 2          unused
+Bit 3          unused
+Bit 4          unused
+Bit 5          unused
+Bit 6          unused
+Bit 7          unused
+Bit 8          error
+========       ==============================
+
+* **PCI_BLOCKPT_QUEUE_BAR_OFFSET**
+
+The EP sets this value to the offset within the BAR of the device (identified by PCI_BLOCKPT_DEV_IDX from `register description`_) at which the Descriptor Queues are located.
+This Register is WO by EP and RO by RC.
+
+* **PCI_BLOCKPT_DRV_OFFSET**
+
+The descriptor queue which is located in the EP BAR memory region has
+the layout described in `descriptor queue layout`_. This register contains the **Driver Offset**
+value from that diagram.
+This Register is RO by EP and WO by RC.
+
+* **PCI_BLOCKPT_DEV_OFFSET**
+
+The descriptor queue which is located in the EP BAR memory region has
+the layout described in `descriptor queue layout`_. This register contains the **Device Offset**
+value from that diagram.
+This Register is RO by EP and WO by RC.
+
+* **PCI_BLOCKPT_NUM_DESC**
+
+This register contains the number of Descriptors in the Descriptor Queue. The minimum number which must be provided
+by the host is 16; anything below is rejected by the device.
+This Register is RO by EP and WO by RC.
+
+* **PCI_BLOCKPT_MAX_DEVS**
+
+This Register contains the maximum number of devices which can be exported by the EP. This Register is WO by EP and RO by RC.
+
+* **PCI_BLOCKPT_IRQ**
+
+This is the Device and Queue specific MSI-X IRQ which is raised when a descriptor has been processed.
+This Register is RO by EP and WO by RC.
+
+* **PCI_BLOCKPT_QSIZE**
+
+This Register contains the Queue Size in Bytes for the Device and Queue.
+This Register is RO by EP and WO by RC.
+
+* **PCI_BLOCKPT_NUM_QUEUES**
+
+This Register contains the maximum number of Queues the Device supports.
+This Register is WO by EP and RO by RC.
+
+* **PCI_BLOCKPT_QUEUE_OFFSET**
+
+When the BPT_COMMAND_SET_QUEUE command is sent, this register contains the Queue Offset of the corresponding queue (identified by PCI_BLOCKPT_QUEUE_IDX from `register description`_).
+This Register is RO by EP and WO by RC.
+
+* **PCI_BLOCKPT_AVAIL_QUEUE_SIZE**
+
+With this Register the EP tells the RC how much free space is available in the BAR for the descriptors.
+This Register is WO by EP and RO by RC.
+
+.. _blockpt_selector_idx:
+
+* **PCI_BLOCKPT_DEV_IDX**
+
+This register selects the device, from the list provided by the EP, that a command from `command bitfield description`_
+refers to. E.g. if you want to set the queue of the device /dev/mmcblk0 and the list delivered by
+the GET_DEVICES command from `command bitfield description`_ is "/dev/nvme0n1p1;/dev/mmcblk0", then you
+set this register to 1 when issuing the SET_QUEUE command. If you configure /dev/nvme0n1p1, this register should be 0.
+This Register is RO by EP and WO by RC.
+
+.. _blockpt_queue_selector_idx:
+
+* **PCI_BLOCKPT_QUEUE_IDX**
+
+This register selects which queue of the device specified with PCI_BLOCKPT_DEV_IDX a command from `command bitfield description`_ refers to. This value is limited by PCI_BLOCKPT_NUM_QUEUES.
+This Register is RO by EP and WO by RC.
+
+* **PCI_BLOCKPT_NUM_SECTORS**
+
+The device puts the number of 512 byte sectors of the device selected with blockpt_selector_idx_ into this register when the NUM_SECTORS command from
+`command bitfield description`_ is sent by the host.
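The sector count reported here translates directly into a device capacity. The sketch below illustrates the arithmetic; only the 512 byte sector size comes from the description above, the helper name is purely illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Fixed sector size used by the NUM_SECTORS command */
#define BLOCKPT_SECTOR_SIZE 512

/* Convert the value read from PCI_BLOCKPT_NUM_SECTORS into bytes */
static inline uint64_t blockpt_capacity_bytes(uint64_t num_sectors)
{
	return num_sectors * BLOCKPT_SECTOR_SIZE;
}
```

For example, a reported sector count of 0x3a386030 corresponds to roughly 465.8 GiB.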
+
+* **PCI_BLOCKPT_PERMISSION**
+
+This Register contains the permission of the device. If the device can only be used in Read-Only mode, the first bit is set; otherwise Read-Write mode is possible.
+
+* **PCI_BLOCKPT_DEV_NAME**
+
+The device puts the names of all devices it wants to export into this register when it receives the GET_DEVICES command from `command bitfield description`_.
+This field is currently limited to (64 * 16 + 1) bytes.
+
+
+Data Transfer
+-------------
+
+The Data Transfer between the EP and the Host uses fixed-size Descriptor Queues. This approach is inspired by the VirtIO Specification.
+
+A Descriptor Queue is part of the EP BAR memory region and has a layout as depicted in `descriptor queue layout`_.
+When the host wants to access data on the EP disk, it first looks for a free descriptor in the Descriptor Ring. When one is found, it
+sets up the fields of this descriptor as shown in `descriptor layout`_, with the following description:
+
+ * **s_sector** containing the start sector from which the host wants to read from or write to
+ * **len** containing the number of bytes it wants to transfer
+ * **addr** field containing the bus address the data is transferred to or from (if you have an IOMMU on SoC1 this will be an IOVA, without an IOMMU it will usually be a PA).
+ * **opf** field tells about the operation (READ or WRITE),
+ * **status** field is written to by the EP to tell whether the transfer was successful or not.
+
+After those fields are filled in by the host driver, it puts the descriptor index into the driver ring with the layout shown in `driver entry layout`_, and increments
+the **idx** field (modulo NUM_DESCRIPTORS to implement the ring buffer functionality). When the EP detects that the **idx** field in the driver entry has changed,
+it picks up this descriptor, sets up a Block-IO request and submits it to the Block-IO layer. After the Block-IO layer has processed the request, the descriptor index is transferred into
+the **Device Ring** as depicted in `device entry layout`_ and the **idx** field there is incremented; additionally an MSI-X IRQ is raised to the host. From this the host driver knows that the request has finished and
+delivers the result to whoever issued the request on the host side, before freeing the descriptor for new transfers.
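The descriptor fields and the ring-index arithmetic described in this section can be sketched in C as follows. Field widths and names are assumptions derived from this description, not the actual driver layout:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_DESCRIPTORS 16	/* minimum queue depth accepted by the EP */

/* Hypothetical descriptor matching the fields described above */
struct blockpt_desc {
	uint64_t s_sector;	/* start sector of the transfer */
	uint64_t addr;		/* bus address (IOVA or PA) of the buffer */
	uint32_t len;		/* number of bytes to transfer */
	uint8_t  opf;		/* operation: READ or WRITE */
	uint8_t  status;	/* written by the EP after processing */
	uint8_t  flags;
	uint8_t  res;		/* reserved */
};

/* A driver/device ring: descriptor indices plus a producer index */
struct blockpt_ring {
	uint16_t idx;			/* producer index */
	uint16_t desc[NUM_DESCRIPTORS];	/* descriptor indices */
};

/* Publish a descriptor: store its index, then advance the producer
 * index modulo NUM_DESCRIPTORS to implement the ring wrap-around. */
static void blockpt_ring_push(struct blockpt_ring *ring, uint16_t desc_idx)
{
	ring->desc[ring->idx] = desc_idx;
	ring->idx = (ring->idx + 1) % NUM_DESCRIPTORS;
}
```

The peer detects new work by observing a change of **idx** and consuming entries up to the new value.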
+
+
+
+
+.. _descriptor layout:
+
+Descriptor Layout
+-----------------------
+.. code-block:: text
+
+                                +--------------------------+
+                                |         s_sector         |
+                                |                          |
+                                +--------------------------+
+                                |           addr           |
+                                |                          |
+                                +--------------------------+
+                                |           len            |
+                                +--------------------------+
+                                | opf | stat | flags | res |
+                                +--------------------------+
+
+
+.. _driver entry layout:
+
+Driver Entry Layout
+-----------------------
+.. code-block:: text
+
+                                +------------------------+
+                                |          idx           |----+
+                                +------------------------+    |
+                                |     descriptor idx 0   |    |
+                                +------------------------+    |
+                                |     descriptor idx 1   |    |         +----------------+
+                                +------------------------+    |         |  Descriptor x  | 
+                                |            :           |    |         +----------------+ 
+                                +------------------------+<---+         | Descriptor x+1 | 
+                                |            :           |------------->+----------------+ 
+                                +------------------------+              | Descriptor x+2 | 
+                                |descriptor idx NUM_DESC |              +----------------+
+                                +------------------------+
+
+
+.. _device entry layout:
+
+Device Entry Layout
+-----------------------
+.. code-block:: text
+
+                                +------------------------+
+                                |          idx           |----+
+                                +------------------------+    |
+                                |     descriptor idx 0   |    |
+                                +------------------------+    |
+                                |     descriptor idx 1   |    |         +----------------+
+                                +------------------------+    |         |  Descriptor x  | 
+                                |            :           |    |         +----------------+ 
+                                +------------------------+<---+         | Descriptor x+1 | 
+                                |            :           |------------->+----------------+ 
+                                +------------------------+              | Descriptor x+2 | 
+                                |descriptor idx NUM_DESC |              +----------------+
+                                +------------------------+
+
+.. _descriptor queue layout:
+
+Descriptor Queue Layout
+-----------------------
+
+.. code-block:: text
+
+     Queue BAR offset ----->    +------------------------+
+                                |      1. Descriptor     |
+                                +------------------------+
+                                |      2. Descriptor     |
+                                +------------------------+
+                                |            :           |
+                                +------------------------+
+                                |            :           |
+                                +------------------------+
+                                |     Last Descriptor    |
+                                +------------------------+
+     Driver Offset ----->       +------------------------+
+                                |     Driver Ring        |
+                                |           :            |
+                                |           :            |
+                                +------------------------+
+     Device Offset ----->       +------------------------+
+                                |     Device Ring        |
+                                |           :            |
+                                |           :            |
+                                +------------------------+
+
diff --git a/Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst b/Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
new file mode 100644
index 000000000000..8e2b954b1199
--- /dev/null
+++ b/Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
@@ -0,0 +1,158 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+PCI Block Passthrough User Guide
+================================
+
+:Author: Wadim Mueller <wafgo01@gmail.com>
+
+This document is a guide to help users use the pci-epf-block-passthru function driver
+and the pci-remote-disk host driver to access remote block devices which the Endpoint exports to the Host. The list of steps to be followed on the host side and the EP side is given below.
+
+Endpoint Device
+===============
+
+Endpoint Controller Devices
+---------------------------
+
+To find the list of endpoint controller devices in the system::
+
+	# ls /sys/class/pci_epc/
+	  44100000.pcie
+
+If PCI_ENDPOINT_CONFIGFS is enabled::
+
+	# ls /sys/kernel/config/pci_ep/controllers
+	  44100000.pcie
+
+
+Endpoint Function Drivers
+-------------------------
+
+To find the list of endpoint function drivers in the system::
+
+	# ls /sys/bus/pci-epf/drivers
+	  pci_epf_blockpt
+
+If PCI_ENDPOINT_CONFIGFS is enabled::
+
+	# ls /sys/kernel/config/pci_ep/functions
+	  pci_epf_blockpt
+
+
+Creating pci-epf-blockpt Device
+-------------------------------
+
+A PCI endpoint function device can be created using configfs. To create a
+pci-epf-blockpt device, the following commands can be used::
+
+	# mount -t configfs none /sys/kernel/config
+	# cd /sys/kernel/config/pci_ep/
+	# mkdir functions/pci_epf_blockpt/func1
+
+The "mkdir func1" above creates the pci-epf-blockpt function device that will
+be probed by pci_epf_blockpt driver.
+
+The PCI endpoint framework populates the directory with the following
+configurable fields::
+
+	# ls functions/pci_epf_blockpt/func1
+	  baseclass_code	interrupt_pin	progif_code	subsys_id
+	  cache_line_size	msi_interrupts	revid		subsys_vendorid
+	  deviceid          	msix_interrupts	subclass_code	vendorid
+
+
+Configuring pci-epf-blockpt Device
+----------------------------------
+
+The user can configure the pci-epf-blockpt device using its configfs entries. For
+example, the vendor/device IDs and the number of interrupts can be set as follows::
+
+	# echo 0x0000 > functions/pci_epf_blockpt/func1/vendorid
+	# echo 0xc402 > functions/pci_epf_blockpt/func1/deviceid
+	# echo 16 > functions/pci_epf_blockpt/func1/msi_interrupts
+	# echo 512 > functions/pci_epf_blockpt/func1/msix_interrupts
+
+
+Binding pci-epf-blockpt Device to EP Controller
+-----------------------------------------------
+
+In order for the endpoint function device to be useful, it has to be bound to
+a PCI endpoint controller driver. Use configfs to bind the function
+device to one of the controller drivers present in the system::
+
+	# ln -s functions/pci_epf_blockpt/func1 controllers/44100000.pcie/
+
+Once the above step is completed, the PCI endpoint is ready to establish a link
+with the host.
+
+
+Export the Block Devices
+------------------------
+
+In order for the Block Passthrough function driver to be useful, you first need to export
+some of the block devices to the host. For this, a new folder has to be created inside
+the blockpt function folder for each exported block device. The following example shows how the full mmc device can be exported::
+
+	# cd /sys/kernel/config/pci_ep/functions/pci_epf_blockpt/func1
+	# mkdir mmc0
+	# echo -n /dev/mmcblk0 > mmc0/disc_name
+
+If you also have e.g. an NVMe device which you want to export, you can continue as follows::
+
+	# mkdir nvme
+	# echo -n /dev/nvme0n1 > nvme/disc_name
+
+Start the Link
+--------------
+
+In order for the endpoint device to establish a link with the host, the *start*
+field should be populated with '1'::
+
+	# echo 1 > controllers/44100000.pcie/start
+
+
+
+That's it from the EP side. If you now load the pci-remote-disk driver on the RC side, you should already see that /dev/mmcblk0 and /dev/nvme0n1 can be attached.
+
+
+Root Complex Device
+===================
+
+lspci Output
+------------
+
+Note that the device listed here corresponds to the vendorid/deviceid
+values configured in `Configuring pci-epf-blockpt Device`_ above::
+
+	0001:00:00.0 PCI bridge: Qualcomm Device 0115
+	0001:01:00.0 Unassigned class [ff00]: Device 0000:c402
+
+PCI driver
+----------
+
+If the driver was not loaded automatically after `Start the Link`_, you can load it manually, e.g.::
+
+         # insmod pci-remote-disk.ko
+           pci-remote-disk 0001:01:00.0: Found /dev/mmcblk0
+           pci-remote-disk 0001:01:00.0: Found /dev/nvme0n1
+           pci-remote-disk 0001:01:00.0: Found 2 devices
+
+This just shows which block devices are exported by the EP; you are not attached to any of them yet. If you want to attach to, e.g., the NVMe device, run the following::
+
+         # echo 1 > /sys/kernel/config/pci_remote_disk/nvme0n1/attach 
+           pci-remote-disk 0001:01:00.0: nvme0n1: Setting queue addr. #Descriptors 1024 (28688 Bytes)
+           pci-remote-disk 0001:01:00.0: /dev/nvme0n1 capacity 0x3a386030
+
+After this the device is attached and can be used. By default the devices are exported under their original names with a **pci-rd-** prefix prepended (this can be changed using the */sys/kernel/config/pci_remote_disk/<DEVICE>/local_name* node). So in this case the output of 'lsblk' would look like the following::
+
+        # lsblk
+          ...
+          ...
+          pci-rd-nvme0n1 259:30   0 465.8G  0 disk 
+
+That's it, the device should now be usable. You can try to mount it with::
+
+        # mount /dev/pci-rd-nvme0n1 <SOME_DIR> 
+
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 7c51a22cee93..f0ed873470f0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17015,6 +17015,14 @@ F:	drivers/misc/pci_endpoint_test.c
 F:	drivers/pci/endpoint/
 F:	tools/pci/
 
+PCI ENDPOINT BLOCK PASSTHROUGH
+M:	Wadim Mueller <wafgo01@gmail.com>
+L:	linux-pci@vger.kernel.org
+S:	Supported
+F:	drivers/pci/endpoint/functions/pci-epf-block-passthru.c
+F:	drivers/block/pci-remote-disk.c
+F:	include/linux/pci-epf-block-passthru.h
+
 PCI ENHANCED ERROR HANDLING (EEH) FOR POWERPC
 M:	Mahesh J Salgaonkar <mahesh@linux.ibm.com>
 R:	Oliver O'Halloran <oohall@gmail.com>
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] Add support for Block Passthrough Endpoint function driver
  2024-02-24 21:03 [PATCH 0/3] Add support for Block Passthrough Endpoint function driver Wadim Mueller
                   ` (2 preceding siblings ...)
  2024-02-24 21:04 ` [PATCH 3/3] Documentation: PCI: Add documentation for the PCI Block Passthrough Wadim Mueller
@ 2024-02-25 16:09 ` Manivannan Sadhasivam
  2024-02-25 20:39   ` Wadim Mueller
  2024-02-26 11:08 ` Christoph Hellwig
  4 siblings, 1 reply; 11+ messages in thread
From: Manivannan Sadhasivam @ 2024-02-25 16:09 UTC (permalink / raw)
  To: Wadim Mueller
  Cc: Bjorn Helgaas, Jonathan Corbet, Krzysztof Wilczyński,
	Kishon Vijay Abraham I, Jens Axboe, Lorenzo Pieralisi,
	Damien Le Moal, Shunsuke Mie, linux-pci, linux-doc, linux-kernel,
	linux-block

On Sat, Feb 24, 2024 at 10:03:59PM +0100, Wadim Mueller wrote:
> Hello,
> 
> This series adds support for the Block Passthrough PCI(e) Endpoint functionality.
> PCI Block Device Passthrough allows one Linux Device running in EP mode to expose its Block devices to the PCI(e) host (RC). The device can export either the full disk or just certain partitions.
> Also an export in readonly mode is possible. This is useful if you want to share the same blockdevice between different SoCs, providing each SoC its own partition(s).
> 
> 
> Block Passthrough
> ==================
> The PCI Block Passthrough can be a useful feature if you have multiple SoCs in your system connected
> through a PCI(e) link, one running in RC mode, the other in EP mode.
> If the block devices are connected to one SoC (SoC2 in EP Mode from the diagramm below) and you want to access
> those from the other SoC (SoC1 in RC mode below), without having any direct connection to
> those block devices (e.g. if you want to share an NVMe between two SoCs). An simple example of such a configurationis is shown below:
> 
> 
>                                                            +-------------+
>                                                            |             |
>                                                            |   SD Card   |
>                                                            |             |
>                                                            +------^------+
>                                                                   |
>                                                                   |
>     +--------------------------+                +-----------------v----------------+
>     |                          |      PCI(e)    |                                  |
>     |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
>     | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
>     |                          |                |                                  |
>     +--------------------------+                +-----------------^----------------+
>                                                                   |
>                                                                   |
>                                                            +------v------+
>                                                            |             |
>                                                            |    NVMe     |
>                                                            |             |
>                                                            +-------------+
> 
> 
> This is to a certain extent a similar functionality which NBD exposes over Network, but on the PCI(e) bus utilizing the EPC/EPF Kernel Framework.
> 
> The Endpoint Function driver creates parallel Queues which run on seperate CPU Cores using percpu structures. The number of parallel queues is limited
> by the number of CPUs on the EP device. The actual number of queues is configurable (as all other features of the driver) through CONFIGFS.
> 
> A documentation about the functional description as well as a user guide showing how both drivers can be configured is part of this series.
> 
> Test setup
> ==========
> 
> This series has been tested on an NXP S32G2 SoC running in Endpoint mode with a direct connection to an ARM64 host machine.
> 
> A performance measurement on the described setup shows good performance metrics. The S32G2 SoC has a 2xGen3 link which has a maximum Bandwidth of ~2GiB/s.
> With the explained setup a Read Datarate of 1.3GiB/s (with DMA ... without DMA the speed saturated at ~200MiB/s) was achieved using an 512GiB Kingston NVMe
> when accessing the NVMe from the ARM64 (SoC1) Host. The local Read Datarate accessing the NVMe dirctly from the S32G2 (SoC2) was around 1.5GiB.
> 
> The measurement was done through the FIO tool [1] with 4kiB Blocks.
> 
> [1] https://linux.die.net/man/1/fio
> 

Thanks for the proposal! We are planning to add virtio function support to
endpoint subsystem to cover usecases like this. I think your usecase can be
satisfied using virtio-blk. Maybe you can add the virtio-blk endpoint function
support once we have the infra in place. Thoughts?

- Mani

> Wadim Mueller (3):
>   PCI: Add PCI Endpoint function driver for Block-device passthrough
>   PCI: Add PCI driver for a PCI EP remote Blockdevice
>   Documentation: PCI: Add documentation for the PCI Block Passthrough
> 
>  .../function/binding/pci-block-passthru.rst   |   24 +
>  Documentation/PCI/endpoint/index.rst          |    3 +
>  .../pci-endpoint-block-passthru-function.rst  |  331 ++++
>  .../pci-endpoint-block-passthru-howto.rst     |  158 ++
>  MAINTAINERS                                   |    8 +
>  drivers/block/Kconfig                         |   14 +
>  drivers/block/Makefile                        |    1 +
>  drivers/block/pci-remote-disk.c               | 1047 +++++++++++++
>  drivers/pci/endpoint/functions/Kconfig        |   12 +
>  drivers/pci/endpoint/functions/Makefile       |    1 +
>  .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
>  include/linux/pci-epf-block-passthru.h        |   77 +
>  12 files changed, 3069 insertions(+)
>  create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
>  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
>  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
>  create mode 100644 drivers/block/pci-remote-disk.c
>  create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
>  create mode 100644 include/linux/pci-epf-block-passthru.h
> 
> -- 
> 2.25.1
> 

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] Add support for Block Passthrough Endpoint function driver
  2024-02-25 16:09 ` [PATCH 0/3] Add support for Block Passthrough Endpoint function driver Manivannan Sadhasivam
@ 2024-02-25 20:39   ` Wadim Mueller
  2024-02-26  9:45     ` Manivannan Sadhasivam
  2024-02-26 16:47     ` Frank Li
  0 siblings, 2 replies; 11+ messages in thread
From: Wadim Mueller @ 2024-02-25 20:39 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: Wadim Mueller, Bjorn Helgaas, Jonathan Corbet,
	Krzysztof Wilczyński, Kishon Vijay Abraham I, Jens Axboe,
	Lorenzo Pieralisi, Damien Le Moal, Shunsuke Mie, linux-pci,
	linux-doc, linux-kernel, linux-block

On Sun, Feb 25, 2024 at 09:39:26PM +0530, Manivannan Sadhasivam wrote:
> On Sat, Feb 24, 2024 at 10:03:59PM +0100, Wadim Mueller wrote:
> > Hello,
> > 
> > This series adds support for the Block Passthrough PCI(e) Endpoint functionality.
> > PCI Block Device Passthrough allows one Linux Device running in EP mode to expose its Block devices to the PCI(e) host (RC). The device can export either the full disk or just certain partitions.
> > Also an export in readonly mode is possible. This is useful if you want to share the same blockdevice between different SoCs, providing each SoC its own partition(s).
> > 
> > 
> > Block Passthrough
> > ==================
> > The PCI Block Passthrough can be a useful feature if you have multiple SoCs in your system connected
> > through a PCI(e) link, one running in RC mode, the other in EP mode.
> > If the block devices are connected to one SoC (SoC2 in EP Mode from the diagramm below) and you want to access
> > those from the other SoC (SoC1 in RC mode below), without having any direct connection to
> > those block devices (e.g. if you want to share an NVMe between two SoCs). An simple example of such a configurationis is shown below:
> > 
> > 
> >                                                            +-------------+
> >                                                            |             |
> >                                                            |   SD Card   |
> >                                                            |             |
> >                                                            +------^------+
> >                                                                   |
> >                                                                   |
> >     +--------------------------+                +-----------------v----------------+
> >     |                          |      PCI(e)    |                                  |
> >     |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
> >     | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
> >     |                          |                |                                  |
> >     +--------------------------+                +-----------------^----------------+
> >                                                                   |
> >                                                                   |
> >                                                            +------v------+
> >                                                            |             |
> >                                                            |    NVMe     |
> >                                                            |             |
> >                                                            +-------------+
> > 
> > 
> > This is to a certain extent a similar functionality which NBD exposes over Network, but on the PCI(e) bus utilizing the EPC/EPF Kernel Framework.
> > 
> > The Endpoint Function driver creates parallel Queues which run on seperate CPU Cores using percpu structures. The number of parallel queues is limited
> > by the number of CPUs on the EP device. The actual number of queues is configurable (as all other features of the driver) through CONFIGFS.
> > 
> > A documentation about the functional description as well as a user guide showing how both drivers can be configured is part of this series.
> > 
> > Test setup
> > ==========
> > 
> > This series has been tested on an NXP S32G2 SoC running in Endpoint mode with a direct connection to an ARM64 host machine.
> > 
> > A performance measurement on the described setup shows good performance metrics. The S32G2 SoC has a 2xGen3 link which has a maximum Bandwidth of ~2GiB/s.
> > With the explained setup a Read Datarate of 1.3GiB/s (with DMA ... without DMA the speed saturated at ~200MiB/s) was achieved using an 512GiB Kingston NVMe
> > when accessing the NVMe from the ARM64 (SoC1) Host. The local Read Datarate accessing the NVMe dirctly from the S32G2 (SoC2) was around 1.5GiB.
> > 
> > The measurement was done through the FIO tool [1] with 4kiB Blocks.
> > 
> > [1] https://linux.die.net/man/1/fio
> > 
> 
> Thanks for the proposal! We are planning to add virtio function support to
> endpoint subsystem to cover usecases like this. I think your usecase can be
> satisfied using vitio-blk. Maybe you can add the virtio-blk endpoint function
> support once we have the infra in place. Thoughts?
> 
> - Mani
>

Hi Mani,
I initially had the plan to implement virtio-blk as an endpoint
function driver instead of a self-baked driver.

This for sure is more elegant, as we could reuse the
virtio-blk pci driver instead of implementing a new one (as I did).

But I initially had some concerns about the feasibility, especially
that the virtio-blk pci driver expects immediate responses to some
register writes which I would not be able to satisfy, simply because we
do not have any kind of interrupt/event which would be triggered on the
EP side when the RC is accessing some BAR registers (at least there is
no mechanism I know of). As virtio is made mainly for Hypervisor <->
Guest communication, I was afraid that a Hypervisor is able to trap every
register access from the Guest and act accordingly, which I would not be
able to do. I hope this makes sense to you.

But to make a long story short, yes, I agree with you that virtio-blk
would satisfy my usecase, and I generally think it would be a better
solution; I just did not know that you are working on some
infrastructure for that. And yes, I would like to implement the endpoint
function driver for virtio-blk. Is there already a development tree you
use to work on the infrastructure that I could have a look at?

- Wadim



> > Wadim Mueller (3):
> >   PCI: Add PCI Endpoint function driver for Block-device passthrough
> >   PCI: Add PCI driver for a PCI EP remote Blockdevice
> >   Documentation: PCI: Add documentation for the PCI Block Passthrough
> > 
> >  .../function/binding/pci-block-passthru.rst   |   24 +
> >  Documentation/PCI/endpoint/index.rst          |    3 +
> >  .../pci-endpoint-block-passthru-function.rst  |  331 ++++
> >  .../pci-endpoint-block-passthru-howto.rst     |  158 ++
> >  MAINTAINERS                                   |    8 +
> >  drivers/block/Kconfig                         |   14 +
> >  drivers/block/Makefile                        |    1 +
> >  drivers/block/pci-remote-disk.c               | 1047 +++++++++++++
> >  drivers/pci/endpoint/functions/Kconfig        |   12 +
> >  drivers/pci/endpoint/functions/Makefile       |    1 +
> >  .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
> >  include/linux/pci-epf-block-passthru.h        |   77 +
> >  12 files changed, 3069 insertions(+)
> >  create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
> >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
> >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
> >  create mode 100644 drivers/block/pci-remote-disk.c
> >  create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
> >  create mode 100644 include/linux/pci-epf-block-passthru.h
> > 
> > -- 
> > 2.25.1
> > 
> 
> -- 
> மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] Add support for Block Passthrough Endpoint function driver
  2024-02-25 20:39   ` Wadim Mueller
@ 2024-02-26  9:45     ` Manivannan Sadhasivam
  2024-02-26 12:58       ` Damien Le Moal
  2024-02-26 18:47       ` Wadim Mueller
  2024-02-26 16:47     ` Frank Li
  1 sibling, 2 replies; 11+ messages in thread
From: Manivannan Sadhasivam @ 2024-02-26  9:45 UTC (permalink / raw)
  To: Wadim Mueller
  Cc: Bjorn Helgaas, Jonathan Corbet, Krzysztof Wilczyński,
	Kishon Vijay Abraham I, Jens Axboe, Lorenzo Pieralisi,
	Damien Le Moal, Shunsuke Mie, linux-pci, linux-doc, linux-kernel,
	linux-block

On Sun, Feb 25, 2024 at 09:39:17PM +0100, Wadim Mueller wrote:
> On Sun, Feb 25, 2024 at 09:39:26PM +0530, Manivannan Sadhasivam wrote:
> > On Sat, Feb 24, 2024 at 10:03:59PM +0100, Wadim Mueller wrote:
> > > Hello,
> > > 
> > > This series adds support for the Block Passthrough PCI(e) Endpoint functionality.
> > > PCI Block Device Passthrough allows one Linux Device running in EP mode to expose its Block devices to the PCI(e) host (RC). The device can export either the full disk or just certain partitions.
> > > Also an export in readonly mode is possible. This is useful if you want to share the same blockdevice between different SoCs, providing each SoC its own partition(s).
> > > 
> > > 
> > > Block Passthrough
> > > ==================
> > > The PCI Block Passthrough can be a useful feature if you have multiple SoCs in your system connected
> > > through a PCI(e) link, one running in RC mode, the other in EP mode.
> > > If the block devices are connected to one SoC (SoC2 in EP Mode from the diagramm below) and you want to access
> > > those from the other SoC (SoC1 in RC mode below), without having any direct connection to
> > > those block devices (e.g. if you want to share an NVMe between two SoCs). An simple example of such a configurationis is shown below:
> > > 
> > > 
> > >                                                            +-------------+
> > >                                                            |             |
> > >                                                            |   SD Card   |
> > >                                                            |             |
> > >                                                            +------^------+
> > >                                                                   |
> > >                                                                   |
> > >     +--------------------------+                +-----------------v----------------+
> > >     |                          |      PCI(e)    |                                  |
> > >     |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
> > >     | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
> > >     |                          |                |                                  |
> > >     +--------------------------+                +-----------------^----------------+
> > >                                                                   |
> > >                                                                   |
> > >                                                            +------v------+
> > >                                                            |             |
> > >                                                            |    NVMe     |
> > >                                                            |             |
> > >                                                            +-------------+
> > > 
> > > 
> > > This is to a certain extent a similar functionality which NBD exposes over Network, but on the PCI(e) bus utilizing the EPC/EPF Kernel Framework.
> > > 
> > > The Endpoint Function driver creates parallel Queues which run on seperate CPU Cores using percpu structures. The number of parallel queues is limited
> > > by the number of CPUs on the EP device. The actual number of queues is configurable (as all other features of the driver) through CONFIGFS.
> > > 
> > > A documentation about the functional description as well as a user guide showing how both drivers can be configured is part of this series.
> > > 
> > > Test setup
> > > ==========
> > > 
> > > This series has been tested on an NXP S32G2 SoC running in Endpoint mode with a direct connection to an ARM64 host machine.
> > > 
> > > A performance measurement on the described setup shows good performance metrics. The S32G2 SoC has a 2xGen3 link which has a maximum Bandwidth of ~2GiB/s.
> > > With the explained setup a Read Datarate of 1.3GiB/s (with DMA ... without DMA the speed saturated at ~200MiB/s) was achieved using an 512GiB Kingston NVMe
> > > when accessing the NVMe from the ARM64 (SoC1) Host. The local Read Datarate accessing the NVMe dirctly from the S32G2 (SoC2) was around 1.5GiB.
> > > 
> > > The measurement was done through the FIO tool [1] with 4kiB Blocks.
> > > 
> > > [1] https://linux.die.net/man/1/fio
> > > 
> > 
> > Thanks for the proposal! We are planning to add virtio function support to
> > endpoint subsystem to cover usecases like this. I think your usecase can be
> > satisfied using vitio-blk. Maybe you can add the virtio-blk endpoint function
> > support once we have the infra in place. Thoughts?
> > 
> > - Mani
> >
> 
> Hi Mani,
> I initially had the plan to implement the virtio-blk as an endpoint
> function driver instead of a self baked driver. 
> 
> This for sure is more elegant as we could reuse the
> virtio-blk pci driver instead of implementing a new one (as I did) 
> 
> But I initially had some concerns about the feasibility, especially
> that the virtio-blk pci driver is expecting immediate responses to some
> register writes which I would not be able to satisfy, simply because we
> do not have any kind of interrupt/event which would be triggered on the
> EP side when the RC is accessing some BAR Registers (at least there is
> no machanism I know of). As virtio is made mainly for Hypervisor <->

Right. There is currently a limitation w.r.t. triggering a doorbell from the host
to the endpoint. But I believe that could be addressed later by repurposing the
endpoint MSI controller [1].

> As virtio is made mainly for Hypervisor <->
> Guest communication I was afraid that a Hypersisor is able to Trap every
> Register access from the Guest and act accordingly, which I would not be
> able to do. I hope this make sense to you.
> 

I'm not worrying about the hypervisor right now. Here the endpoint is exposing
the virtio devices and the host is consuming them. There is no virtualization at
play here. I talked about this at the last Plumbers [2].

> But to make a long story short, yes I agree with you that virtio-blk
> would satisfy my usecase, and I generally think it would be a better
> solution, I just did not know that you are working on some
> infrastructure for that. And yes I would like to implement the endpoint
> function driver for virtio-blk. Is there already an development tree you
> use to work on the infrastructre I could have a look at?
> 

Shunsuke has a WIP branch [3] that I plan to co-work on in the coming days.
You can use it as a reference in the meantime.

- Mani

[1] https://lore.kernel.org/all/20230911220920.1817033-1-Frank.Li@nxp.com/
[2] https://www.youtube.com/watch?v=1tqOTge0eq0
[3] https://github.com/ShunsukeMie/linux-virtio-rdma/tree/v6.6-rc1-epf-vcon

> - Wadim
> 
> 
> 
> > > Wadim Mueller (3):
> > >   PCI: Add PCI Endpoint function driver for Block-device passthrough
> > >   PCI: Add PCI driver for a PCI EP remote Blockdevice
> > >   Documentation: PCI: Add documentation for the PCI Block Passthrough
> > > 
> > >  .../function/binding/pci-block-passthru.rst   |   24 +
> > >  Documentation/PCI/endpoint/index.rst          |    3 +
> > >  .../pci-endpoint-block-passthru-function.rst  |  331 ++++
> > >  .../pci-endpoint-block-passthru-howto.rst     |  158 ++
> > >  MAINTAINERS                                   |    8 +
> > >  drivers/block/Kconfig                         |   14 +
> > >  drivers/block/Makefile                        |    1 +
> > >  drivers/block/pci-remote-disk.c               | 1047 +++++++++++++
> > >  drivers/pci/endpoint/functions/Kconfig        |   12 +
> > >  drivers/pci/endpoint/functions/Makefile       |    1 +
> > >  .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
> > >  include/linux/pci-epf-block-passthru.h        |   77 +
> > >  12 files changed, 3069 insertions(+)
> > >  create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
> > >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
> > >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
> > >  create mode 100644 drivers/block/pci-remote-disk.c
> > >  create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
> > >  create mode 100644 include/linux/pci-epf-block-passthru.h
> > > 
> > > -- 
> > > 2.25.1
> > > 
> > 
> > -- 
> > மணிவண்ணன் சதாசிவம்

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] Add support for Block Passthrough Endpoint function driver
  2024-02-24 21:03 [PATCH 0/3] Add support for Block Passthrough Endpoint function driver Wadim Mueller
                   ` (3 preceding siblings ...)
  2024-02-25 16:09 ` [PATCH 0/3] Add support for Block Passthrough Endpoint function driver Manivannan Sadhasivam
@ 2024-02-26 11:08 ` Christoph Hellwig
  4 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2024-02-26 11:08 UTC (permalink / raw)
  To: Wadim Mueller
  Cc: Bjorn Helgaas, Jonathan Corbet, Manivannan Sadhasivam,
	Krzysztof Wilczyński, Kishon Vijay Abraham I, Jens Axboe,
	Lorenzo Pieralisi, Damien Le Moal, Shunsuke Mie, linux-pci,
	linux-doc, linux-kernel, linux-block

Please don't just create a new (and, as far as I can tell, underspecified)
"hardware" interface for this.  If the NVMe endpoint work is too much for
your use case, maybe just implement a minimal virtio_blk interface.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] Add support for Block Passthrough Endpoint function driver
  2024-02-26  9:45     ` Manivannan Sadhasivam
@ 2024-02-26 12:58       ` Damien Le Moal
  2024-02-26 18:47       ` Wadim Mueller
  1 sibling, 0 replies; 11+ messages in thread
From: Damien Le Moal @ 2024-02-26 12:58 UTC (permalink / raw)
  To: Manivannan Sadhasivam, Wadim Mueller
  Cc: Bjorn Helgaas, Jonathan Corbet, Krzysztof Wilczyński,
	Kishon Vijay Abraham I, Jens Axboe, Lorenzo Pieralisi,
	Shunsuke Mie, linux-pci, linux-doc, linux-kernel, linux-block

On 2024/02/26 1:45, Manivannan Sadhasivam wrote:

[...]

>> As virtio is made mainly for Hypervisor <->
>> Guest communication I was afraid that a Hypersisor is able to Trap every
>> Register access from the Guest and act accordingly, which I would not be
>> able to do. I hope this make sense to you.
>>
> 
> I'm not worrying about the hypervisor right now. Here the endpoint is exposing
> the virtio devices and host is consuming it. There is no virtualization play
> here. I talked about this in the last plumbers [2].

FYI, we are still working on our NVMe PCI EPF function driver. It is working OK
using either a rockpro64 (PCI Gen2) board or a Radxa Rock 5B board (PCI Gen3,
rk3588 SoC/DWC EPF driver). I have just been super busy recently with the block
layer & ATA stuff, so I have not been able to rebase/clean up and send things
out. This driver also depends on many cleanup/improvement patches (see below).

> 
>> But to make a long story short, yes I agree with you that virtio-blk
>> would satisfy my usecase, and I generally think it would be a better
>> solution, I just did not know that you are working on some
>> infrastructure for that. And yes I would like to implement the endpoint
>> function driver for virtio-blk. Is there already an development tree you
>> use to work on the infrastructre I could have a look at?
>>
> 
> Shunsuke has a WIP branch [3], that I plan to co-work in the coming days.
> You can use it as a reference in the meantime.

This one is very similar to what I did in my series:

https://github.com/torvalds/linux/commit/05e21d458b1eaa8c22697f12a1ae42dcb04ff377

My series is here:

https://github.com/damien-lemoal/linux/tree/rock5b_ep_v8

It is a bit of a mess, but what's there is:
1) Add the "map_info" EPF method to get mappings that are not dependent on the
host address alignment. That is similar to the align_mem method Shunsuke
introduced, but with more info to make it generic and allow an EPF to deal with
any host DMA address.
2) Fixes for the rockpro64 DMA mapping, as it is broken
3) Adds the rk3588 EPF driver
4) Adds the NVMe EPF function driver. That is implemented as a PCI EPF frontend
to an NVMe-oF controller so that any NVMe-oF supported device can be exposed
over PCI (block device, file, real NVMe controller).

There are also a bunch of API changes and cleanups to make the EPF code (core
and driver) more compact/easier to read.

Once I am done with my current work on the block layer side, I intend to come
back to this for the next cycle. I still need to complete the IRQ legacy -> intx
renaming as well...

Cheers.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] Add support for Block Passthrough Endpoint function driver
  2024-02-25 20:39   ` Wadim Mueller
  2024-02-26  9:45     ` Manivannan Sadhasivam
@ 2024-02-26 16:47     ` Frank Li
  1 sibling, 0 replies; 11+ messages in thread
From: Frank Li @ 2024-02-26 16:47 UTC (permalink / raw)
  To: Wadim Mueller
  Cc: Manivannan Sadhasivam, Bjorn Helgaas, Jonathan Corbet,
	Krzysztof Wilczyński, Kishon Vijay Abraham I, Jens Axboe,
	Lorenzo Pieralisi, Damien Le Moal, Shunsuke Mie, linux-pci,
	linux-doc, linux-kernel, linux-block

On Sun, Feb 25, 2024 at 09:39:17PM +0100, Wadim Mueller wrote:
> On Sun, Feb 25, 2024 at 09:39:26PM +0530, Manivannan Sadhasivam wrote:
> > On Sat, Feb 24, 2024 at 10:03:59PM +0100, Wadim Mueller wrote:
> > > Hello,
> > > 
> > > This series adds support for the Block Passthrough PCI(e) Endpoint functionality.
> > > PCI Block Device Passthrough allows one Linux Device running in EP mode to expose its Block devices to the PCI(e) host (RC). The device can export either the full disk or just certain partitions.
> > > Also an export in readonly mode is possible. This is useful if you want to share the same blockdevice between different SoCs, providing each SoC its own partition(s).
> > > 
> > > 
> > > Block Passthrough
> > > ==================
> > > The PCI Block Passthrough can be a useful feature if you have multiple SoCs in your system connected
> > > through a PCI(e) link, one running in RC mode, the other in EP mode.
> > > If the block devices are connected to one SoC (SoC2 in EP Mode from the diagramm below) and you want to access
> > > those from the other SoC (SoC1 in RC mode below), without having any direct connection to
> > > those block devices (e.g. if you want to share an NVMe between two SoCs). An simple example of such a configurationis is shown below:
> > > 
> > > 
> > >                                                            +-------------+
> > >                                                            |             |
> > >                                                            |   SD Card   |
> > >                                                            |             |
> > >                                                            +------^------+
> > >                                                                   |
> > >                                                                   |
> > >     +--------------------------+                +-----------------v----------------+
> > >     |                          |      PCI(e)    |                                  |
> > >     |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
> > >     | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
> > >     |                          |                |                                  |
> > >     +--------------------------+                +-----------------^----------------+
> > >                                                                   |
> > >                                                                   |
> > >                                                            +------v------+
> > >                                                            |             |
> > >                                                            |    NVMe     |
> > >                                                            |             |
> > >                                                            +-------------+
> > > 
> > > 
> > > This is to a certain extent a similar functionality which NBD exposes over Network, but on the PCI(e) bus utilizing the EPC/EPF Kernel Framework.
> > > 
> > > The Endpoint Function driver creates parallel Queues which run on seperate CPU Cores using percpu structures. The number of parallel queues is limited
> > > by the number of CPUs on the EP device. The actual number of queues is configurable (as all other features of the driver) through CONFIGFS.
> > > 
> > > A documentation about the functional description as well as a user guide showing how both drivers can be configured is part of this series.
> > > 
> > > Test setup
> > > ==========
> > > 
> > > This series has been tested on an NXP S32G2 SoC running in Endpoint mode with a direct connection to an ARM64 host machine.
> > > 
> > > A performance measurement on the described setup shows good performance metrics. The S32G2 SoC has a 2xGen3 link which has a maximum Bandwidth of ~2GiB/s.
> > > With the explained setup a Read Datarate of 1.3GiB/s (with DMA ... without DMA the speed saturated at ~200MiB/s) was achieved using an 512GiB Kingston NVMe
> > > when accessing the NVMe from the ARM64 (SoC1) Host. The local Read Datarate accessing the NVMe dirctly from the S32G2 (SoC2) was around 1.5GiB.
> > > 
> > > The measurement was done through the FIO tool [1] with 4kiB Blocks.
> > > 
> > > [1] https://linux.die.net/man/1/fio
> > > 
> > 
> > Thanks for the proposal! We are planning to add virtio function support to
> > endpoint subsystem to cover usecases like this. I think your usecase can be
> > satisfied using vitio-blk. Maybe you can add the virtio-blk endpoint function
> > support once we have the infra in place. Thoughts?
> > 
> > - Mani
> >
> 
> Hi Mani,
> I initially had the plan to implement the virtio-blk as an endpoint
> function driver instead of a self baked driver. 
> 
> This for sure is more elegant as we could reuse the
> virtio-blk pci driver instead of implementing a new one (as I did) 
> 
> But I initially had some concerns about the feasibility, especially
> that the virtio-blk pci driver is expecting immediate responses to some
> register writes which I would not be able to satisfy, simply because we
> do not have any kind of interrupt/event which would be triggered on the
> EP side when the RC is accessing some BAR Registers (at least there is
> no machanism I know of). As virtio is made mainly for Hypervisor <->

A possible solution is to use an ITS MSI to trigger an IRQ on the EP side:
https://lore.kernel.org/linux-pci/20230911220920.1817033-1-Frank.Li@nxp.com/
In any case, the virtio layer needs some modification.

> Guest communication I was afraid that a Hypersisor is able to Trap every
> Register access from the Guest and act accordingly, which I would not be
> able to do. I hope this make sense to you.
> 
> But to make a long story short, yes I agree with you that virtio-blk
> would satisfy my usecase, and I generally think it would be a better
> solution, I just did not know that you are working on some
> infrastructure for that. And yes I would like to implement the endpoint
> function driver for virtio-blk. Is there already an development tree you
> use to work on the infrastructre I could have a look at?

There have been several attempts at this:
https://patchew.org/linux/20230427104428.862643-1-mie@igel.co.jp/
https://lore.kernel.org/linux-pci/796eb893-f7e2-846c-e75f-9a5774089b8e@igel.co.jp/
https://lore.kernel.org/imx/d098a631-9930-26d3-48f3-8f95386c8e50@ti.com/T/#t
https://lore.kernel.org/linux-pci/20200702082143.25259-1-kishon@ti.com/

With EDMA support and ITS MSI, it should be possible now.

Frank

> 
> - Wadim
> 
> 
> 
> > > Wadim Mueller (3):
> > >   PCI: Add PCI Endpoint function driver for Block-device passthrough
> > >   PCI: Add PCI driver for a PCI EP remote Blockdevice
> > >   Documentation: PCI: Add documentation for the PCI Block Passthrough
> > > 
> > >  .../function/binding/pci-block-passthru.rst   |   24 +
> > >  Documentation/PCI/endpoint/index.rst          |    3 +
> > >  .../pci-endpoint-block-passthru-function.rst  |  331 ++++
> > >  .../pci-endpoint-block-passthru-howto.rst     |  158 ++
> > >  MAINTAINERS                                   |    8 +
> > >  drivers/block/Kconfig                         |   14 +
> > >  drivers/block/Makefile                        |    1 +
> > >  drivers/block/pci-remote-disk.c               | 1047 +++++++++++++
> > >  drivers/pci/endpoint/functions/Kconfig        |   12 +
> > >  drivers/pci/endpoint/functions/Makefile       |    1 +
> > >  .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
> > >  include/linux/pci-epf-block-passthru.h        |   77 +
> > >  12 files changed, 3069 insertions(+)
> > >  create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
> > >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
> > >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
> > >  create mode 100644 drivers/block/pci-remote-disk.c
> > >  create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
> > >  create mode 100644 include/linux/pci-epf-block-passthru.h
> > > 
> > > -- 
> > > 2.25.1
> > > 
> > 
> > -- 
> > மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] Add support for Block Passthrough Endpoint function driver
  2024-02-26  9:45     ` Manivannan Sadhasivam
  2024-02-26 12:58       ` Damien Le Moal
@ 2024-02-26 18:47       ` Wadim Mueller
  1 sibling, 0 replies; 11+ messages in thread
From: Wadim Mueller @ 2024-02-26 18:47 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: Bjorn Helgaas, Jonathan Corbet, Krzysztof Wilczyński,
	Kishon Vijay Abraham I, Jens Axboe, Lorenzo Pieralisi,
	Damien Le Moal, Shunsuke Mie, linux-pci, linux-doc, linux-kernel,
	linux-block


Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> writes:

> On Sun, Feb 25, 2024 at 09:39:17PM +0100, Wadim Mueller wrote:
>> On Sun, Feb 25, 2024 at 09:39:26PM +0530, Manivannan Sadhasivam wrote:
>> > On Sat, Feb 24, 2024 at 10:03:59PM +0100, Wadim Mueller wrote:
>> > > Hello,
>> > > 
>> > > This series adds support for the Block Passthrough PCI(e) Endpoint functionality.
>> > > PCI Block Device Passthrough allows one Linux Device running in EP mode to expose its Block devices to the PCI(e) host (RC). The device can export either the full disk or just certain partitions.
>> > > Also an export in readonly mode is possible. This is useful if you want to share the same blockdevice between different SoCs, providing each SoC its own partition(s).
>> > > 
>> > > 
>> > > Block Passthrough
>> > > ==================
>> > > The PCI Block Passthrough can be a useful feature if you have multiple SoCs in your system connected
>> > > through a PCI(e) link, one running in RC mode, the other in EP mode.
>> > > If the block devices are connected to one SoC (SoC2 in EP Mode from the diagramm below) and you want to access
>> > > those from the other SoC (SoC1 in RC mode below), without having any direct connection to
>> > > those block devices (e.g. if you want to share an NVMe between two SoCs). An simple example of such a configurationis is shown below:
>> > > 
>> > > 
>> > >                                                            +-------------+
>> > >                                                            |             |
>> > >                                                            |   SD Card   |
>> > >                                                            |             |
>> > >                                                            +------^------+
>> > >                                                                   |
>> > >                                                                   |
>> > >     +--------------------------+                +-----------------v----------------+
>> > >     |                          |      PCI(e)    |                                  |
>> > >     |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
>> > >     | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
>> > >     |                          |                |                                  |
>> > >     +--------------------------+                +-----------------^----------------+
>> > >                                                                   |
>> > >                                                                   |
>> > >                                                            +------v------+
>> > >                                                            |             |
>> > >                                                            |    NVMe     |
>> > >                                                            |             |
>> > >                                                            +-------------+
>> > > 
>> > > 
>> > > This is to a certain extent a similar functionality which NBD exposes over Network, but on the PCI(e) bus utilizing the EPC/EPF Kernel Framework.
>> > > 
>> > > The Endpoint Function driver creates parallel Queues which run on seperate CPU Cores using percpu structures. The number of parallel queues is limited
>> > > by the number of CPUs on the EP device. The actual number of queues is configurable (as all other features of the driver) through CONFIGFS.
>> > > 
>> > > A documentation about the functional description as well as a user guide showing how both drivers can be configured is part of this series.
>> > > 
>> > > Test setup
>> > > ==========
>> > > 
>> > > This series has been tested on an NXP S32G2 SoC running in Endpoint mode with a direct connection to an ARM64 host machine.
>> > > 
>> > > A performance measurement on the described setup shows good performance metrics. The S32G2 SoC has a 2xGen3 link which has a maximum Bandwidth of ~2GiB/s.
>> > > With the explained setup a Read Datarate of 1.3GiB/s (with DMA ... without DMA the speed saturated at ~200MiB/s) was achieved using an 512GiB Kingston NVMe
>> > > when accessing the NVMe from the ARM64 (SoC1) Host. The local Read Datarate accessing the NVMe dirctly from the S32G2 (SoC2) was around 1.5GiB.
>> > > 
>> > > The measurement was done through the FIO tool [1] with 4kiB Blocks.
>> > > 
>> > > [1] https://linux.die.net/man/1/fio
>> > > 
>> > 
>> > Thanks for the proposal! We are planning to add virtio function support to
>> > endpoint subsystem to cover usecases like this. I think your usecase can be
>> > satisfied using vitio-blk. Maybe you can add the virtio-blk endpoint function
>> > support once we have the infra in place. Thoughts?
>> > 
>> > - Mani
>> >
>> 
>> Hi Mani,
>> I initially had the plan to implement the virtio-blk as an endpoint
>> function driver instead of a self baked driver. 
>> 
>> This for sure is more elegant as we could reuse the
>> virtio-blk pci driver instead of implementing a new one (as I did) 
>> 
>> But I initially had some concerns about the feasibility, especially
>> that the virtio-blk pci driver is expecting immediate responses to some
>> register writes which I would not be able to satisfy, simply because we
>> do not have any kind of interrupt/event which would be triggered on the
>> EP side when the RC is accessing some BAR Registers (at least there is
>> no machanism I know of). As virtio is made mainly for Hypervisor <->
>
> Right. There is a limitation currently w.r.t triggering doorbell from the host
> to endpoint. But I believe that could be addressed later by repurposing the
> endpoint MSI controller [1].
>
>> As virtio is made mainly for Hypervisor <->
>> Guest communication I was afraid that a Hypersisor is able to Trap every
>> Register access from the Guest and act accordingly, which I would not be
>> able to do. I hope this make sense to you.
>> 
>
> I'm not worrying about the hypervisor right now. Here the endpoint is exposing
> the virtio devices and host is consuming it. There is no virtualization play
> here. I talked about this in the last plumbers [2].
>

Okay, I understand this. The hypervisor was more of an example; let me
try to explain.

I am currently reading through the virtio spec [1].
Chapter 4.1.4.5.1 contains the following statement:

"The device MUST reset ISR status to 0 on driver read."

So I was wondering: how are we, as a PCI EP device, supposed to clear a
register when the driver reads that same register? In other words, how
do we detect a register read?
If you are a hypervisor it is easy to do so, because you can intercept
every memory access made by the guest (the same applies if you build
custom HW for this purpose). But for us as an EP device it is
difficult to detect this, even with MSIs and doorbell registers in
place.

Modifying the virtio layer to write to some doorbell register after
reading the ISR status register would be possible, but kind of ugly.


[1] https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.pdf

>> But to make a long story short, yes I agree with you that virtio-blk
>> would satisfy my usecase, and I generally think it would be a better
>> solution, I just did not know that you are working on some
>> infrastructure for that. And yes I would like to implement the endpoint
>> function driver for virtio-blk. Is there already an development tree you
>> use to work on the infrastructre I could have a look at?
>> 
>
> Shunsuke has a WIP branch [3], that I plan to co-work in the coming days.
> You can use it as a reference in the meantime.
>
> - Mani
>
> [1] https://lore.kernel.org/all/20230911220920.1817033-1-Frank.Li@nxp.com/
> [2] https://www.youtube.com/watch?v=1tqOTge0eq0
> [3] https://github.com/ShunsukeMie/linux-virtio-rdma/tree/v6.6-rc1-epf-vcon
>
>> - Wadim
>> 
>> 
>> 
>> > > Wadim Mueller (3):
>> > >   PCI: Add PCI Endpoint function driver for Block-device passthrough
>> > >   PCI: Add PCI driver for a PCI EP remote Blockdevice
>> > >   Documentation: PCI: Add documentation for the PCI Block Passthrough
>> > > 
>> > >  .../function/binding/pci-block-passthru.rst   |   24 +
>> > >  Documentation/PCI/endpoint/index.rst          |    3 +
>> > >  .../pci-endpoint-block-passthru-function.rst  |  331 ++++
>> > >  .../pci-endpoint-block-passthru-howto.rst     |  158 ++
>> > >  MAINTAINERS                                   |    8 +
>> > >  drivers/block/Kconfig                         |   14 +
>> > >  drivers/block/Makefile                        |    1 +
>> > >  drivers/block/pci-remote-disk.c               | 1047 +++++++++++++
>> > >  drivers/pci/endpoint/functions/Kconfig        |   12 +
>> > >  drivers/pci/endpoint/functions/Makefile       |    1 +
>> > >  .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
>> > >  include/linux/pci-epf-block-passthru.h        |   77 +
>> > >  12 files changed, 3069 insertions(+)
>> > >  create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
>> > >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
>> > >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
>> > >  create mode 100644 drivers/block/pci-remote-disk.c
>> > >  create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
>> > >  create mode 100644 include/linux/pci-epf-block-passthru.h
>> > > 
>> > > -- 
>> > > 2.25.1
>> > > 
>> > 
>> > -- 
>> > மணிவண்ணன் சதாசிவம்


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-02-26 19:01 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-24 21:03 [PATCH 0/3] Add support for Block Passthrough Endpoint function driver Wadim Mueller
2024-02-24 21:04 ` [PATCH 1/3] PCI: Add PCI Endpoint function driver for Block-device passthrough Wadim Mueller
2024-02-24 21:04 ` [PATCH 2/3] PCI: Add PCI driver for a PCI EP remote Blockdevice Wadim Mueller
2024-02-24 21:04 ` [PATCH 3/3] Documentation: PCI: Add documentation for the PCI Block Passthrough Wadim Mueller
2024-02-25 16:09 ` [PATCH 0/3] Add support for Block Passthrough Endpoint function driver Manivannan Sadhasivam
2024-02-25 20:39   ` Wadim Mueller
2024-02-26  9:45     ` Manivannan Sadhasivam
2024-02-26 12:58       ` Damien Le Moal
2024-02-26 18:47       ` Wadim Mueller
2024-02-26 16:47     ` Frank Li
2024-02-26 11:08 ` Christoph Hellwig
