* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
@ 2016-04-14 18:04 Helen Mae Koike Fornazier
  2016-04-21 13:33 ` Helen Koike
                   ` (3 more replies)
  0 siblings, 4 replies; 49+ messages in thread
From: Helen Mae Koike Fornazier @ 2016-04-14 18:04 UTC (permalink / raw)


From: Rob Nelson <rlnelson@google.com>

This change provides a mechanism to reduce the number of MMIO doorbell
writes for the NVMe driver. When running in a virtualized environment
like QEMU, the cost of an MMIO is quite hefty. The main idea of the
patch is to provide the device two memory locations:
 1) to store the doorbell values so they can be looked up without the
    doorbell MMIO write
 2) to store an event index.
I believe the doorbell value is obvious, the event index not so much.
Similar to the virtio specification, the virtual device can tell the
driver (guest OS) not to write MMIO unless it is writing past this
value.

FYI: doorbell values are written by the nvme driver (guest OS) and the
event index is written by the virtual device (host OS).
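
(Illustrative sketch only, not part of the patch: the test the event
index implies is the same wraparound-safe comparison as virtio's
vring_need_event(), which also appears below as nvme_ext_need_event().)

	/* Ring the MMIO doorbell only if writing the new index moved it
	 * past event_idx. */
	static inline int need_event(u16 event_idx, u16 new_idx, u16 old)
	{
		return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
	}

	/* Example: old = 5, new_idx = 7. With event_idx = 6 the test is
	 * 0 < 2 (true), so the driver still does the MMIO write; with
	 * event_idx = 9 it is 65533 < 2 (false) and the MMIO is skipped. */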

The patch implements a new admin command that will communicate where
these two memory locations reside. If the command fails, the nvme
driver will work as before without any optimizations.

Contributions:
  Eric Northup <digitaleric at google.com>
  Frank Swiderski <fes at google.com>
  Ted Tso <tytso at mit.edu>
  Keith Busch <keith.busch at intel.com>

Just to give an idea of the performance boost with the vendor
extension: running fio [1], I get about 200K read IOPS with a stock
NVMe driver and about 1000K read IOPS with my vendor patch. This was
running with a null device, i.e. the backing device simply returned
success on every read IO request.

[1] Running on a 4 core machine:
  fio --time_based --name=benchmark --runtime=30
  --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32
  --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
  --rw=randread --blocksize=4k --randrepeat=false

Signed-off-by: Rob Nelson <rlnelson at google.com>
[mlin: port for upstream]
Signed-off-by: Ming Lin <mlin at kernel.org>
[koike: updated for current APIs]
Signed-off-by: Helen Mae Koike Fornazier <helen.koike at collabora.co.uk>

Conflicts:
	drivers/nvme/host/Kconfig
	drivers/nvme/host/pci.c
---

As stated above, this patch provides a HUGE improvement in performance. I would like to keep working on it to get it upstream.

I would also like to know if you think this code can be made more generic, maybe s/CONFIG_NVME_VENDOR_EXT_GOOGLE/CONFIG_NVME_VM_OPTIMIZATION, and remove the if (vendor == PCI_VENDOR_ID_GOOGLE) checks?

I am not sure if all the code blocks inside if (vendor == PCI_VENDOR_ID_GOOGLE) are specific to Google or if we can remove this check. What do you think?

Thanks in advance for your comments.

 drivers/nvme/host/Kconfig |   7 +++
 drivers/nvme/host/pci.c   | 150 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/nvme.h      |  21 +++++++
 3 files changed, 178 insertions(+)

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index c894841..dc4fddc 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -24,3 +24,10 @@ config BLK_DEV_NVME_SCSI
 	  to say N here, unless you run a distro that abuses the SCSI
 	  emulation to provide stable device names for mount by id, like
 	  some OpenSuSE and SLES versions.
+
+config NVME_VENDOR_EXT_GOOGLE
+	bool "NVMe Vendor Extension for Improved Virtualization"
+	depends on BLK_DEV_NVME
+	---help---
+	  Google extension to reduce the number of MMIO doorbell
+	  writes for the NVMe driver
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 24ccda3..b6d8fd8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -50,6 +50,9 @@
 #define SQ_SIZE(depth)		(depth * sizeof(struct nvme_command))
 #define CQ_SIZE(depth)		(depth * sizeof(struct nvme_completion))
 		
+/* Google Vendor ID is not in include/linux/pci_ids.h */
+#define PCI_VENDOR_ID_GOOGLE 0x1AE0
+
 /*
  * We handle AEN commands ourselves and don't even let the
  * block layer know about them.
@@ -107,6 +110,13 @@ struct nvme_dev {
 #define NVME_CTRL_RESETTING    0
 #define NVME_CTRL_REMOVING     1
 
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+	u32 *db_mem;
+	dma_addr_t doorbell;
+	u32 *ei_mem;
+	dma_addr_t eventidx;
+#endif
+
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
 };
@@ -139,6 +149,12 @@ struct nvme_queue {
 	u16 qid;
 	u8 cq_phase;
 	u8 cqe_seen;
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+	u32 *sq_doorbell_addr;
+	u32 *sq_eventidx_addr;
+	u32 *cq_doorbell_addr;
+	u32 *cq_eventidx_addr;
+#endif
 };
 
 /*
@@ -176,6 +192,9 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_id_ns) != 4096);
 	BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+	BUILD_BUG_ON(sizeof(struct nvme_doorbell_memory) != 64);
+#endif
 }
 
 /*
@@ -305,6 +324,51 @@ static void nvme_complete_async_event(struct nvme_dev *dev,
 	}
 }
 
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+static int nvme_vendor_memory_size(struct nvme_dev *dev)
+{
+	return ((num_possible_cpus() + 1) * 8 * dev->db_stride);
+}
+
+static int nvme_set_doorbell_memory(struct nvme_dev *dev)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+	c.doorbell_memory.opcode = nvme_admin_doorbell_memory;
+	c.doorbell_memory.prp1 = cpu_to_le64(dev->doorbell);
+	c.doorbell_memory.prp2 = cpu_to_le64(dev->eventidx);
+
+	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
+}
+
+static inline int nvme_ext_need_event(u16 event_idx, u16 new_idx, u16 old)
+{
+	/* Borrowed from vring_need_event */
+	return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
+}
+
+static void nvme_ext_write_doorbell(u16 value, u32 __iomem *q_db,
+			   u32 *db_addr, volatile u32 *event_idx)
+{
+	u16 old_value;
+	if (!db_addr)
+		goto ring_doorbell;
+
+	old_value = *db_addr;
+	*db_addr = value;
+
+	rmb();
+	if (!nvme_ext_need_event(*event_idx, value, old_value))
+		goto no_doorbell;
+
+ring_doorbell:
+	writel(value, q_db);
+no_doorbell:
+	return;
+}
+#endif
+
 /**
  * __nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
  * @nvmeq: The queue to use
@@ -322,9 +386,19 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 	else
 		memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
 
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+	if (nvmeq->sq_doorbell_addr)
+		wmb();
+#endif
+
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+	nvme_ext_write_doorbell(tail, nvmeq->q_db,
+		nvmeq->sq_doorbell_addr, nvmeq->sq_eventidx_addr);
+#else
 	writel(tail, nvmeq->q_db);
+#endif
 	nvmeq->sq_tail = tail;
 }
 
@@ -741,6 +815,10 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 		struct nvme_completion cqe = nvmeq->cqes[head];
 		struct request *req;
 
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+		if (to_pci_dev(nvmeq->dev->dev)->vendor == PCI_VENDOR_ID_GOOGLE)
+			rmb();
+#endif
 		if (++head == nvmeq->q_depth) {
 			head = 0;
 			phase = !phase;
@@ -785,7 +863,14 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 		return;
 
 	if (likely(nvmeq->cq_vector >= 0))
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+		nvme_ext_write_doorbell(head,
+			nvmeq->q_db + nvmeq->dev->db_stride,
+			nvmeq->cq_doorbell_addr,
+			nvmeq->cq_eventidx_addr);
+#else
 		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
+#endif
 	nvmeq->cq_head = head;
 	nvmeq->cq_phase = phase;
 
@@ -1168,6 +1253,17 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	dev->queues[qid] = nvmeq;
 	dev->queue_count++;
 
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+	if (dev->db_mem && dev->ei_mem && qid != 0) {
+		nvmeq->sq_doorbell_addr = &dev->db_mem[qid * 2 * dev->db_stride];
+		nvmeq->cq_doorbell_addr =
+			&dev->db_mem[(qid * 2 + 1) * dev->db_stride];
+		nvmeq->sq_eventidx_addr = &dev->ei_mem[qid * 2 * dev->db_stride];
+		nvmeq->cq_eventidx_addr =
+			&dev->ei_mem[(qid * 2 + 1) * dev->db_stride];
+	}
+#endif
+
 	return nvmeq;
 
  free_cqdma:
@@ -1198,6 +1294,16 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+	if (to_pci_dev(dev->dev)->vendor == PCI_VENDOR_ID_GOOGLE && qid != 0) {
+		nvmeq->sq_doorbell_addr = &dev->db_mem[qid * 2 * dev->db_stride];
+		nvmeq->cq_doorbell_addr =
+			&dev->db_mem[(qid * 2 + 1) * dev->db_stride];
+		nvmeq->sq_eventidx_addr = &dev->ei_mem[qid * 2 * dev->db_stride];
+		nvmeq->cq_eventidx_addr =
+			&dev->ei_mem[(qid * 2 + 1) * dev->db_stride];
+	}
+#endif
 	memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
 	dev->online_queues++;
 	spin_unlock_irq(&nvmeq->q_lock);
@@ -1676,6 +1782,19 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		if (blk_mq_alloc_tag_set(&dev->tagset))
 			return 0;
 		dev->ctrl.tagset = &dev->tagset;
+
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+		if (to_pci_dev(dev->dev)->vendor == PCI_VENDOR_ID_GOOGLE) {
+			int res = nvme_set_doorbell_memory(dev);
+			if (res) {
+				/* Free memory and continue on. */
+				dma_free_coherent(dev->dev, nvme_vendor_memory_size(dev), dev->db_mem, dev->doorbell);
+				dma_free_coherent(dev->dev, nvme_vendor_memory_size(dev), dev->ei_mem, dev->eventidx);
+				dev->db_mem = NULL;
+				dev->ei_mem = NULL;
+			}
+		}
+#endif
 	} else {
 		blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
 
@@ -1740,8 +1859,31 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 
 	pci_enable_pcie_error_reporting(pdev);
 	pci_save_state(pdev);
+
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+	if (pdev->vendor == PCI_VENDOR_ID_GOOGLE) {
+		int mem_size = nvme_vendor_memory_size(dev);
+		dev->db_mem = dma_alloc_coherent(&pdev->dev, mem_size, &dev->doorbell, GFP_KERNEL);
+		if (!dev->db_mem) {
+			result = -ENOMEM;
+			goto disable;
+		}
+		dev->ei_mem = dma_alloc_coherent(&pdev->dev, mem_size, &dev->eventidx, GFP_KERNEL);
+		if (!dev->ei_mem) {
+			result = -ENOMEM;
+			goto dma_free;
+		}
+	}
+#endif
+
 	return 0;
 
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+  dma_free:
+	dma_free_coherent(&pdev->dev, nvme_vendor_memory_size(dev), dev->db_mem, dev->doorbell);
+	dev->db_mem = NULL;
+#endif
+
  disable:
 	pci_disable_device(pdev);
 	return result;
@@ -1757,6 +1899,14 @@ static void nvme_dev_unmap(struct nvme_dev *dev)
 static void nvme_pci_disable(struct nvme_dev *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+	int mem_size = nvme_vendor_memory_size(dev);
+
+	if (dev->db_mem)
+		dma_free_coherent(&pdev->dev, mem_size, dev->db_mem, dev->doorbell);
+	if (dev->ei_mem)
+		dma_free_coherent(&pdev->dev, mem_size, dev->ei_mem, dev->eventidx);
+#endif
 
 	if (pdev->msi_enabled)
 		pci_disable_msi(pdev);
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index a55986f..d3a8289 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -387,6 +387,9 @@ enum nvme_admin_opcode {
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
 	nvme_admin_security_recv	= 0x82,
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+	nvme_admin_doorbell_memory	= 0xC0,
+#endif
 };
 
 enum {
@@ -516,6 +519,18 @@ struct nvme_format_cmd {
 	__u32			rsvd11[5];
 };
 
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+struct nvme_doorbell_memory {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[5];
+	__le64			prp1;
+	__le64			prp2;
+	__u32			rsvd12[6];
+};
+#endif
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -529,6 +544,9 @@ struct nvme_command {
 		struct nvme_format_cmd format;
 		struct nvme_dsm_cmd dsm;
 		struct nvme_abort_cmd abort;
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+		struct nvme_doorbell_memory doorbell_memory;
+#endif
 	};
 };
 
@@ -575,6 +593,9 @@ enum {
 	NVME_SC_BAD_ATTRIBUTES		= 0x180,
 	NVME_SC_INVALID_PI		= 0x181,
 	NVME_SC_READ_ONLY		= 0x182,
+#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
+	NVME_SC_DOORBELL_MEMORY_INVALID	= 0x1C0,
+#endif
 	NVME_SC_WRITE_FAULT		= 0x280,
 	NVME_SC_READ_ERROR		= 0x281,
 	NVME_SC_GUARD_CHECK		= 0x282,
-- 
1.9.1


* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-04-14 18:04 [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices Helen Mae Koike Fornazier
@ 2016-04-21 13:33 ` Helen Koike
  2016-04-21 13:38   ` Christoph Hellwig
  2016-05-05 15:15 ` Helen Koike
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 49+ messages in thread
From: Helen Koike @ 2016-04-21 13:33 UTC (permalink / raw)


Hi

On 14-04-2016 15:04, Helen Mae Koike Fornazier wrote:
> From: Rob Nelson <rlnelson at google.com>
>
> This change provides a mechanism to reduce the number of MMIO doorbell
> writes for the NVMe driver. When running in a virtualized environment
> like QEMU, the cost of an MMIO is quite hefty. The main idea of the
> patch is to provide the device two memory locations:
>   1) to store the doorbell values so they can be looked up without the
>      doorbell MMIO write
>   2) to store an event index.
> I believe the doorbell value is obvious, the event index not so much.
> Similar to the virtio specification, the virtual device can tell the
> driver (guest OS) not to write MMIO unless it is writing past this
> value.
>
> FYI: doorbell values are written by the nvme driver (guest OS) and the
> event index is written by the virtual device (host OS).
>
> The patch implements a new admin command that will communicate where
> these two memory locations reside. If the command fails, the nvme
> driver will work as before without any optimizations.
>
> Contributions:
>    Eric Northup <digitaleric at google.com>
>    Frank Swiderski <fes at google.com>
>    Ted Tso <tytso at mit.edu>
>    Keith Busch <keith.busch at intel.com>
>
> Just to give an idea of the performance boost with the vendor
> extension: running fio [1], I get about 200K read IOPS with a stock
> NVMe driver and about 1000K read IOPS with my vendor patch. This was
> running with a null device, i.e. the backing device simply returned
> success on every read IO request.
>
> [1] Running on a 4 core machine:
>    fio --time_based --name=benchmark --runtime=30
>    --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32
>    --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
>    --rw=randread --blocksize=4k --randrepeat=false
>
> Signed-off-by: Rob Nelson <rlnelson at google.com>
> [mlin: port for upstream]
> Signed-off-by: Ming Lin <mlin at kernel.org>
> [koike: updated for current APIs]
> Signed-off-by: Helen Mae Koike Fornazier <helen.koike at collabora.co.uk>
>
> Conflicts:
> 	drivers/nvme/host/Kconfig
> 	drivers/nvme/host/pci.c
> ---
>
> As stated above, this patch provides a HUGE improvement in performance. I would like to keep working on it to get it upstream.
>
> I would also like to know if you think this code can be made more generic, maybe s/CONFIG_NVME_VENDOR_EXT_GOOGLE/CONFIG_NVME_VM_OPTIMIZATION, and remove the if (vendor == PCI_VENDOR_ID_GOOGLE) checks?
>
> I am not sure if all the code blocks inside if (vendor == PCI_VENDOR_ID_GOOGLE) are specific to Google or if we can remove this check. What do you think?
>
> Thanks in advance for your comments.

I understand now that it can't be generic, as it is based on a vendor 
specific command, but I still would like to know your opinion about this 
patch and whether it can make it upstream.

Thank you

> [...]


* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-04-21 13:33 ` Helen Koike
@ 2016-04-21 13:38   ` Christoph Hellwig
  2016-04-21 18:11     ` Ming Lin
  2016-05-11 16:50     ` Helen Koike
  0 siblings, 2 replies; 49+ messages in thread
From: Christoph Hellwig @ 2016-04-21 13:38 UTC (permalink / raw)


I've not spent much time at it due to a busy conference week, but there
are two main comments on why I don't think this is suitable as is
despite really liking the general idea.

 - I'd really like this to be proposed as an official extension to the
   NVMe technical workgroup.  I'm also happy to help with making that
   happen.  I really don't like to merge it until we have some basic
   agreement on it, although once the basic agreement is there it
   shouldn't be too hard to also support the older google specific
   version.  And this is no new feedback, a couple of people including
   me said that a long time ago, and we've seen zero action on it.
 - the code is a mess in this version.  I really don't see the need for
   all the ifdefs, but if you really want to keep them they should move
   out of the main code path and just stub out helpers that would
   otherwise do work.


* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-04-21 13:38   ` Christoph Hellwig
@ 2016-04-21 18:11     ` Ming Lin
  2016-04-27 15:26       ` Helen Koike
  2016-05-03  8:05       ` Christoph Hellwig
  2016-05-11 16:50     ` Helen Koike
  1 sibling, 2 replies; 49+ messages in thread
From: Ming Lin @ 2016-04-21 18:11 UTC (permalink / raw)


On Thu, 2016-04-21 at 06:38 -0700, Christoph Hellwig wrote:
> I've not spent much time at it due to a busy conference week, but there
> are two main comments on why I don't think this is suitable as is
> despite really liking the general idea.
> 
>  - I'd really like this to be proposed as an official extension to the
>    NVMe technical workgroup.  I'm also happy to help with making that
>    happen.  I really don't like to merge it until we have some basic

That'd be great help. Thanks.

>    agreement on it, although once the basic agreement is there it
>    shouldn't be too hard to also support the older google specific
>    version.  And this is no new feedback, a couple of people including
>    me said that a long time ago, and we've seen zero action on it.

For "action", did you mean the "agreement" or the spec extension
proposal?

I think you'll make the proposal, right?

>  - the code is a mess in this version.  I really don't see the need for
>    all the ifdefs, but if you really want to keep them they should move
>    out of the main code path and just stub out helpers that would
>    otherwise do work.

I made the mess :)

I thought about this before.
Probably add a new file ext.c and move this code into it.

Then add a nvme_ext_ops ...


* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-04-21 18:11     ` Ming Lin
@ 2016-04-27 15:26       ` Helen Koike
  2016-05-03  8:05       ` Christoph Hellwig
  1 sibling, 0 replies; 49+ messages in thread
From: Helen Koike @ 2016-04-27 15:26 UTC (permalink / raw)


Hi Christoph,

On 21/04/2016 15:11, Ming Lin wrote:
> On Thu, 2016-04-21 at 06:38 -0700, Christoph Hellwig wrote:
>> I've not spent much time at it due to a busy conference week, but there
>> are two main comments on why I don't think this is suitable as is
>> despite really liking the general idea.
>>
>>   - I'd really like this to be proposed as an official extension to the
>>     NVMe technical workgroup.  I'm also happy to help with making that
>>     happen.  I really don't like to merge it until we have some basic
> That'd be great help. Thanks.
>
>>     agreement on it, although once the basic agreement is there it
>>     shouldn't be too hard to also support the older google specific
>>     version.  And this is no new feedback, a couple of people including
>>     me said that a long time ago, and we've seen zero action on it.

I would appreciate it if we could work together to make this proposal happen.
I prepared a small draft proposing this feature; could you take a look 
and comment, please? 
https://people.collabora.co.uk/~koike/nvme-set-doorbel-mem.odt

> For "action", did you mean the "agreement" or the spec extension
> proposal?
>
> I think you'll make the proposal, right?
>
>>   - the code is a mess in this version.  I really don't see the need for
>>     all the ifdefs, but if you really want to keep them they should move
>>     out of the main code path and just stub out helpers that would
>>     otherwise do work.

I'll rework that as soon as we have an agreement about the format of 
the proposed command.

> I made the mess :)
>
> I thought about this before.
> Probably add a new file ext.c and move this code into it.
>
> Then add a nvme_ext_ops ...
>

Helen


* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-04-21 18:11     ` Ming Lin
  2016-04-27 15:26       ` Helen Koike
@ 2016-05-03  8:05       ` Christoph Hellwig
  2016-05-04 16:48         ` Helen Koike
  1 sibling, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2016-05-03  8:05 UTC (permalink / raw)


On Thu, Apr 21, 2016 at 11:11:35AM -0700, Ming Lin wrote:
> >    agreement on it, although once the basic agreement is there it
> >    shouldn't be too hard to also support the older google specific
> >    version.  And this is no new feedback, a couple of people including
> >    me said that a long time ago, and we've seen zero action on it.
> 
> For "action", did you mean the "agreement" or the spec extension
> proposal?
> 
> I think you'll make the proposal, right?

No, we really need Google to bring it in.

> I made the mess :)
> 
> I thought about this before.
> Probably add a new file ext.c and move this code into it.
> 
> Then add a nvme_ext_ops ...

Uh, no.  Just have some inline helpers that do the right thing
for the right "hardware".
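
Concretely, something like the sketch below (placement hypothetical;
the helper name is taken from the patch). With the option disabled, the
ext helper collapses to the plain doorbell write, so callers need no
#ifdef in the I/O path:

	#ifdef CONFIG_NVME_VENDOR_EXT_GOOGLE
	/* Real implementation does the db_mem/eventidx bookkeeping. */
	void nvme_ext_write_doorbell(u16 value, u32 __iomem *q_db,
				     u32 *db_addr, volatile u32 *event_idx);
	#else
	static inline void nvme_ext_write_doorbell(u16 value, u32 __iomem *q_db,
						   u32 *db_addr,
						   volatile u32 *event_idx)
	{
		/* No vendor extension: always ring the MMIO doorbell. */
		writel(value, q_db);
	}
	#endif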


* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-05-03  8:05       ` Christoph Hellwig
@ 2016-05-04 16:48         ` Helen Koike
  2016-05-05 14:17           ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Helen Koike @ 2016-05-04 16:48 UTC (permalink / raw)


Hi,

On 03-05-2016 05:05, Christoph Hellwig wrote:
> On Thu, Apr 21, 2016 at 11:11:35AM -0700, Ming Lin wrote:
>>>     agreement on it, although once the basic agreement is there it
>>>     shouldn't be too hard to also support the older google specific
>>>     version.  And this is no new feedback, a couple of people including
>>>     me said that a long time ago, and we've seen zero action on it.
>>
>> For "action", did you mean the "agreement" or the spec extension
>> proposal?
>>
>> I think you'll make the proposal, right?
>
> No, we really need Google to bring it in.

We (Collabora) have Google's support and are completing this work for them.

But before we submit this to the NVMe consortium, I would appreciate 
your comments on this proposal draft:

	https://people.collabora.co.uk/~koike/nvme-set-doorbel-mem.odt

I had formatted the document using ASCII art for the tables and sent it 
using git send-email, but it seems that my message was held due to a 
suspicious header, so I am just re-sending without that document 
content. Sorry if you are receiving this email twice.

Anyway, it would be good to discuss the proposal in the document above.

Questions:

1) I suspect the Set Doorbell/EventIdx Memory command in the proposal 
should have a corresponding Unset command?

2) What should be filled in the function field in figure 40?

3) In the original patch, the Set Doorbell/EventIdx Memory command is 
assumed to be performed after all the queues are created, so what 
should happen if a queue is created after this command? Should we abort 
the Create Queue command? Should we first unset the doorbell/eventIdx 
memory, then create the queues, and then set it again? Or should we 
change the original proposal so that a controller with this 
functionality must support queues being created after the Set 
Doorbell/EventIdx Memory command?
Personally, I think the controller should support creating queues after 
the Set Doorbell/EventIdx Memory command has been performed. What do you 
think?


Helen


* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-05-04 16:48         ` Helen Koike
@ 2016-05-05 14:17           ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2016-05-05 14:17 UTC (permalink / raw)


On Wed, May 04, 2016 at 01:48:43PM -0300, Helen Koike wrote:
> 1) I suspect the Set Doorbell/EventIdx Memory command in the proposal
> should have a corresponding Unset command?

Why would you unset it?  What's probably more important is to define
that you do / do not need to set it again after a reset.  Probably the
former.

> 2) What should be filled in the function field in figure 40?

Probably nothing for now, just let the editor update it.

> 3) In the original patch, the Set Doorbell/EventIdx Memory command is
> assumed to be performed after all the queues are created, so what should
> happen if a queue is created after this command? Should we abort the
> Create Queue command? Should we first unset the doorbell/eventIdx memory,
> then create the queues, and then set it again? Or should we change the
> original proposal so that a controller with this functionality must
> support queues being created after the Set Doorbell/EventIdx Memory command?
> Personally, I think the controller should support creating queues after the
> Set Doorbell/EventIdx Memory command has been performed. What do you think?

I would not treat it any different than the existing doorbell registers,
which have the same issue.  NVMe allows you to create I/O queues
any time, but you better get the sizing of the bar / memory right
for the maximum possible number of queues allocated.


* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-04-14 18:04 [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices Helen Mae Koike Fornazier
  2016-04-21 13:33 ` Helen Koike
@ 2016-05-05 15:15 ` Helen Koike
  2016-05-05 15:24 ` Helen Koike
  2016-08-16  1:41   ` Helen Koike
  3 siblings, 0 replies; 49+ messages in thread
From: Helen Koike @ 2016-05-05 15:15 UTC (permalink / raw)


Hi,

I am re-sending the proposal for the Set Doorbell/EventIdx memory command in plain text to make it easier to review.

It can also be viewed here:
	https://people.collabora.com/~koike/nvme-set-doorbel-mem-v2.odt

Changes since v1: 
	- TODO about the unset command removed
	- Add text 5.18 specifying that the command is not retained through resets

Helen

-----
All referenced figure and section indexes refer to the NVMe specification Revision 1.2a, October 23, 2015.

Proposal modifications:

=============================================================
* Add admin command in Figure 40 as shown below
=============================================================

Figure 40: Opcodes for Admin Commands
+---------+----------+-----------+----------+------+------------------+------------------------------+
| Generic | Function |    Data   | Combined | O/M1 | Namespace        |           Command            |
| Command |          | Transfer4 | opcode2  |      | Identifier Used3 |                              |
+---------+----------+-----------+----------+------+------------------+------------------------------+
|   ...   |   ...    |    ...    |    ...   |  ... |       ...        |             ...              |
+---------+----------+-----------+----------+------+------------------+------------------------------+
|   0b    |    ?     |    11b    |    16h   |   O  |        No        | Set Doorbell/EventIdx Memory |
+---------+----------+-----------+----------+------+------------------+------------------------------+

=============================================================
* Add a sub-section in 5 Admin Command Set as shown below:
=============================================================

5.18 Set Doorbell/EventIdx Memory

The Set Doorbell/EventIdx Memory command is used to reduce the number of MMIO
writes to the doorbell registers by "remapping" the registers into a buffer in
host memory, which can improve performance, mainly in virtualized environments
where MMIOs are costly.
The host shall provide two buffers (Doorbell memory, updated by the host, and
EventIdx memory, updated by the controller) which mimic the doorbell structure
defined in section 3.1 and indicated in Figure (5.18.Figure1).
The host should update the doorbell memory instead of updating the doorbell
registers as usual; if the value written in memory passes the EventIdx that
refers to the same Submission/Completion Queue, then the doorbell register
should be updated.

The host shall provide memory as indicated in (5.18.Figure1), sized according
to the given parameter y; all queues with a QID greater than y are not
affected and shall use the classic doorbell registers.

The controller may read from the Doorbell memory buffer and update the EventIdx
buffer before the host writes to the Doorbell register; when the controller
performs those actions is implementation specific.
The Set Doorbell/EventIdx Memory command is not retained if a controller reset
occurs.

Note: The consumer and producer shall take queue wrap conditions into account.

Figure (5.18.Figure1) Doorbell/EventIdx Memory buffer structure
+------------------------------+----------------------------+-----------------------------------------+
| Start (Offset in the buffer) | End (Offset in the buffer) |              Description                |
+------------------------------+----------------------------+-----------------------------------------+
|            00h               |            03h             | Submission Queue 0 Tail Mem Doorbell or |
|                              |                            | EventIdx (Admin)                        |
+------------------------------+----------------------------+-----------------------------------------+
|         00h + (1 *           |         03h + (1 *         | Completion Queue 0 Head Mem Doorbell or |
|      (4 << CAP.DSTRD))       |      (4 << CAP.DSTRD))     | EventIdx (Admin)                        |
+------------------------------+----------------------------+-----------------------------------------+
|         00h + (2 *           |         03h + (2 *         | Submission Queue 1 Tail Mem Doorbell or |
|      (4 << CAP.DSTRD))       |      (4 << CAP.DSTRD))     | EventIdx                                |
+------------------------------+----------------------------+-----------------------------------------+
|         00h + (3 *           |         03h + (3 *         | Completion Queue 1 Head Mem Doorbell or |
|      (4 << CAP.DSTRD))       |      (4 << CAP.DSTRD))     | EventIdx                                |
+------------------------------+----------------------------+-----------------------------------------+
|            ...               |            ...             |                 ...                     |
+------------------------------+----------------------------+-----------------------------------------+
|        00h + (2y *           |        03h + (2y *         | Submission Queue y Tail Mem Doorbell or |
|      (4 << CAP.DSTRD))       |      (4 << CAP.DSTRD))     | EventIdx                                |
+------------------------------+----------------------------+-----------------------------------------+
|     00h + ((2y + 1) *        |     03h + ((2y + 1) *      | Completion Queue y Head Mem Doorbell or |
|      (4 << CAP.DSTRD))       |      (4 << CAP.DSTRD))     | EventIdx                                |
+------------------------------+----------------------------+-----------------------------------------+
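
As an illustration of the layout above (a sketch, not part of the
proposal text), the byte offset of a queue's slot in either buffer is:

	/* Offset of queue qid's Doorbell/EventIdx slot; dstrd = CAP.DSTRD. */
	static inline u32 db_mem_offset(u16 qid, int is_cq, u32 dstrd)
	{
		return (2u * qid + (is_cq ? 1 : 0)) * (4u << dstrd);
	}

For example, with CAP.DSTRD = 0, Submission Queue 1's slot starts at
byte 8 and Completion Queue 1's slot at byte 12, matching the table.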


Figure (5.18.Figure 2): Set Doorbell/EventIdx Memory - PRP Entry 1
+-------+-------------------------------------------------------------------+
| Bit   | Description                                                       |
+-------+-------------------------------------------------------------------+
| 63:00 | PRP Entry 1 (PRP1): This field contains the first PRP entry,      |
|       | specifying the start of the Doorbell memory data buffer.          |
+-------+-------------------------------------------------------------------+

Figure (5.18.Figure 3): Set Doorbell/EventIdx Memory - PRP Entry 2
+-------+-------------------------------------------------------------------+
| Bit   | Description                                                       |
+-------+-------------------------------------------------------------------+
| 63:00 | PRP Entry 2 (PRP2): This field contains the second PRP entry.     |
|       |  Refer to Figure 11 for the definition of this field.             |
+-------+-------------------------------------------------------------------+

Figure (5.18.Figure 4): Set Doorbell/EventIdx Memory - Command Dword 10 and Command Dword 11
+-------+-------------------------------------------------------------------+
| Bit   | Description                                                       |
+-------+-------------------------------------------------------------------+
| 63:00 | EventIdx Data Pointer (EDPTR): This field contains the equivalent |
|       | PRP1 and PRP2 for the EventIdx data buffer. Command Dword 10      |
|       | contains the PRP1 and Command Dword 11 the PRP2.                  |
+-------+-------------------------------------------------------------------+

Figure (5.18.Figure 5): Set Doorbell/EventIdx Memory - Command Dword 12
+-------+-------------------------------------------------------------------+
| Bit   | Description                                                       |
+-------+-------------------------------------------------------------------+
| 31:00 | Number of queues (NQS): The y value as indicated in (5.18.Figure1)|
|       | which defines the minimum size of the data buffers and the number |
|       | of queues to cover. This is a 0's based value.                    |
+-------+-------------------------------------------------------------------+

5.18.1 Command Completion

If the command is completed, then the controller shall post a completion queue
entry to the Admin Completion Queue indicating the status for the command. 
Set Doorbell/EventIdx memory command specific status values are defined in
Figure (5.18.Figure 6).

Figure (5.18.Figure 6): Set Doorbell/EventIdx Memory - Command Specific Status Values
+-------+-------------------------------------------------------------------+
| Value | Description                                                       |
+-------+-------------------------------------------------------------------+
| 0Ch   | Invalid memory address                                            |
+-------+-------------------------------------------------------------------+

=============================================================
* Add the following option in the Identify commands Figure 90
=============================================================

+---------+-----+-------------------------------------------------------------+
| Bytes   | O/M |                  Description                                |
+---------+-----+-------------------------------------------------------------+
|   ...   | ... |                      ...                                    |
+---------+-----+-------------------------------------------------------------+
|         Admin Command Set Attributes & Optional Controller Capabilities     |
+---------+-----+-------------------------------------------------------------+
| 257:256 |  M  |                      ...                                    |
|         |     | Bit 4 if set to '1' then the controller supports the Set    |
|         |     | Doorbell/EventIdx Memory command. If cleared to '0' then the|
|         |     | controller does not support the Set Doorbell/EventIdx       |
|         |     | Memory command.                                             |
|         |     |                      ...                                    |
+---------+-----+-------------------------------------------------------------+
|   ...   | ... |                      ...                                    |
+---------+-----+-------------------------------------------------------------+
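
(Host-side sketch, not part of the proposal: in Linux, Identify
Controller bytes 257:256 are exposed as the oacs field of struct
nvme_id_ctrl, so support detection would be a single bit test. The bit
position 4 is this proposal's; the flag name is hypothetical.)

	/* Check the proposed OACS bit from Identify Controller data. */
	if (le16_to_cpu(id->oacs) & (1 << 4))
		ctrl->db_mem_supported = true;	/* hypothetical flag */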
-- 
1.9.1


* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-04-14 18:04 [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices Helen Mae Koike Fornazier
  2016-04-21 13:33 ` Helen Koike
  2016-05-05 15:15 ` Helen Koike
@ 2016-05-05 15:24 ` Helen Koike
  2016-08-16  1:41   ` Helen Koike
  3 siblings, 0 replies; 49+ messages in thread
From: Helen Koike @ 2016-05-05 15:24 UTC (permalink / raw)


Hi,

I am re-sending the proposition of the Set Doorbell/EventIdx memory command in plain text to be easier to review.

It can also be viewed here:
	https://people.collabora.com/~koike/nvme-set-doorbel-mem-v2.odt

Changes since v1: 
	- TODO about the unset command removed
	- Add text 5.18 specifying that the command is not retained through resets

Helen

-----
All referenced figures and sections indexes refers to the NVMe specification Revision 1.2a, October 23, 2015.

Proposal modifications:

=============================================================
* Add admin command in Figure 40 as shown below
=============================================================

Figure 40: Opcodes for Admin Commands
+---------+----------+-----------+----------+------+------------------+-----------------------------+
| Generic | Function |    Data   | Combined | O/M1 | Namespace        |          Command            |
| Command |          | Transfer4 | opcode2  |      | Identifier Used3 |                             |
+---------+----------+-----------+----------+------+------------------+-----------------------------+
|   ...   |   ...    |    ...    |    ...   |  ... |       ...        |            ...              |
+---------+----------+-----------+----------+------+------------------+-----------------------------+
|   0b    |    ?     |    11b    |    16h   |   O  |        No        | Set Doorbel/EventIdx Memory |
+---------+----------+-----------+----------+------+------------------+-----------------------------+

=============================================================
* Add a sub-section in 5 Admin Command Set as shown below:
=============================================================

5.18 Set Doorbell/EventIdx memory

The Set Doorbell/Eventidx memory command is used to reduce the number of MMIO
writes to the doorbell registers by ?remapping? the registers in a buffer in
the host, which can improve performances mainly in virtualized environments
where MMIOs are costly.
The host shall provide two buffers (Doorbell memory, updated by the host, and
EventIdx memory, updated by the controller) which mimic the doorbell structure
defined in section 3.1 and indicated in Figure (5.18.Figure1).
The host should update the doorbell memory instead of updating the doorbell
registers as usual, if the value written in memory pass the eventIdx which
refers to the same Submission/Completion Queue, then the doorbell register
should be updated.

The host shall provide a memory as indicated in (5.18.Figure1) with enough
size according to the parameter y given, all queue with QID greater then y will
not be affected and shall use the classic doorbell registers.

The controller might read from the Doorbell memory buffer and update the EventIdx
buffer before the host writes to the Doorbell register, it is implementation
specific when the controller perform those actions.
The Set Doorbell/EventIdx memory command is not retained if a controller reset
occurs.

Note: The consumer and producer shall take queue wrap conditions into account.

Figure (5.18.Figure1) Doorbell/EventIdx Memory buffer structure
+------------------------------+----------------------------+-----------------------------------------+
| Start (Offset in the buffer) | End (Offset in the buffer) |              Description                |
+------------------------------+----------------------------+-----------------------------------------+
|            00h               |            03h             | Submission Queue 0 Tail Mem Doorbell or |
|                              |                            | EventIdx (Admin)                        |
+------------------------------+----------------------------+-----------------------------------------+
|         00h + (1 *           |         03h + (1 *         | Completion Queue 0 Head Mem Doorbell or |
|      (4 << CAP.DSTRD))       |      (4 << CAP.DSTRD))     | EventIdx (Admin)                        |
+------------------------------+----------------------------+-----------------------------------------+
|         00h + (2 *           |         03h + (2 *         | Submission Queue 1 Tail Mem Doorbell or |
|      (4 << CAP.DSTRD))       |      (4 << CAP.DSTRD))     | EventIdx                                |
+------------------------------+----------------------------+-----------------------------------------+
|         00h + (3 *           |         03h + (3 *         | Completion Queue 1 Head Mem Doorbell or |
|      (4 << CAP.DSTRD))       |      (4 << CAP.DSTRD))     | EventIdx                                |
+------------------------------+----------------------------+-----------------------------------------+
|            ...               |            ...             |                 ...                     |
+------------------------------+----------------------------+-----------------------------------------+
|        00h + (2y *           |        03h + (2y *         | Submission Queue y Tail Mem Doorbell or |
|      (4 << CAP.DSTRD))       |      (4 << CAP.DSTRD))     | EventIdx                                |
+------------------------------+----------------------------+-----------------------------------------+
|     00h + ((2y + 1) *        |     03h + ((2y + 1) *      | Completion Queue y Head Mem Doorbell or |
|      (4 << CAP.DSTRD))       |      (4 << CAP.DSTRD))     | EventIdx                                |
+------------------------------+----------------------------+-----------------------------------------+


Figure (5.18.Figure 2): Set Doorbell/EventIdx Memory - PRP Entry 1
+-------+-------------------------------------------------------------------+
| Bit   | Description                                                       |
+-------+-------------------------------------------------------------------+
| 63:00 | PRP Entry 1 (PRP1): This field contains the first PRP entry,      |
|       | specifying the start of the Doorbell memory data buffer.          |
+-------+-------------------------------------------------------------------+

Figure (5.18.Figure 3): Set Doorbell/EventIdx Memory - PRP Entry 2
+-------+-------------------------------------------------------------------+
| Bit   | Description                                                       |
+-------+-------------------------------------------------------------------+
| 63:00 | PRP Entry 2 (PRP2): This field contains the second PRP entry.     |
|       |  Refer to Figure 11 for the definition of this field.             |
+-------+-------------------------------------------------------------------+

Figure (5.18.Figure 4): Set Doorbell/EventIdx Memory - Command Dword 10 and Command Dword 11
+-------+-------------------------------------------------------------------+
| Bit   | Description                                                       |
+-------+-------------------------------------------------------------------+
| 63:00 | EventIdx Data Pointer (EDPTR): This field contain the equivalent  |
|       | PRP1 and PRP2 for the EventIdx data buffer. Command Dword 10      |
|       | contain the PRP1 and Command Dword 11 the PRP2.                   |
+-------+-------------------------------------------------------------------+

Figure (5.18.Figure 5): Set Doorbell/EventIdx Memory - Command Dword 12
+-------+-------------------------------------------------------------------+
| Bit   | Description                                                       |
+-------+-------------------------------------------------------------------+
| 31:00 | Number of queues (NQS): The y value as indicated in (5.18.Figure1)|
|       | which defines the minimum size of the data buffers and the number |
|       | of queues to cover. This is a 0?s based value.                    |
+-------+-------------------------------------------------------------------+

5.18.1 Command Completion

If the command is completed, then the controller shall post a completion queue
entry to the Admin Completion Queue indicating the status for the command. 
Set Doorbell/EventIdx memory command specific status values are defined in
Figure (5.18.Figure 6).

Figure (5.18.Figure 6): Set Doorbell/EventIdx Memory ? Command Specific Status Values
+-------+-------------------------------------------------------------------+
| Value | Description                                                       |
+-------+-------------------------------------------------------------------+
| 0Ch   | Invalid memory address                                            |
+-------+-------------------------------------------------------------------+

=============================================================
* Add the following option in the Identify commands Figure 90
=============================================================

+---------+-----+-------------------------------------------------------------+
| Bytes   | O/M |                  Description                                |
+---------+-----+-------------------------------------------------------------+
|   ...   | ... |                      ...                                    |
+---------+-----+-------------------------------------------------------------+
|         Admin Command Set Attributes & Optional Controller Capabilities     |
+---------+-----+-------------------------------------------------------------+
| 257:256 |  M  |                      ...                                    |
|         |     | Bit 4 if set to '1' then the controller supports the Set    |
|         |     | Doorbell/EventIdx Memory command. If cleared to '0' then the|
|         |     | controller does not support the  Set Doorbell/EventIdx      |
|         |     | Memory command.                                             |
|         |     |                      ...                                    |
+---------+-----+-------------------------------------------------------------+
|   ...   | ... |                      ...                                    |
+---------+-----+-------------------------------------------------------------+
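
A driver could gate the optimization on this Identify bit. The sketch
below shows how such a check might look; the macro name is an
assumption, and the bit position follows this draft rather than a
ratified spec:

	#define NVME_OACS_DB_EVENTIDX_MEM	(1 << 4)	/* bit 4 per this draft */

	/* Hypothetical support check: oacs holds bytes 257:256 of the
	 * Identify Controller data, already converted to native endianness.
	 */
	static inline bool nvme_supports_db_eventidx_mem(u16 oacs)
	{
		return oacs & NVME_OACS_DB_EVENTIDX_MEM;
	}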
-- 
1.9.1

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-04-21 13:38   ` Christoph Hellwig
  2016-04-21 18:11     ` Ming Lin
@ 2016-05-11 16:50     ` Helen Koike
  2016-05-11 17:43       ` Keith Busch
  1 sibling, 1 reply; 49+ messages in thread
From: Helen Koike @ 2016-05-11 16:50 UTC (permalink / raw)


Hi Christoph

On 21-04-2016 10:38, Christoph Hellwig wrote:
> I've not spent much time at it due to a busy conference week, but there
> are two main comments on why I don't think this is suitable as is
> despite really liking the general idea.
>
>   - I'd really like this to be proposed as an official extension to the
>     NVMe technical workgroup.  I'm also happy to help with making that
>     happen.  I really don't like to merge it until we have some basic
>     agreement on it, although once the basic agreement is there it
>     shouldn't be too hard to also support the older google specific
>     version.  And this is no new feedback, a couple of people including
>     me said that a long time ago, and we've seen zero action on it.


How does the process of submitting a proposal to the NVMe technical
workgroup work? If it is just a matter of sending a document as the entry
point and as you offered help, I was wondering if you could just forward
this proposal to them and I can continue the work from that point
onwards (as we wouldn't like to burden you), would that be possible?

Here is the last draft: 
https://people.collabora.co.uk/~koike/nvme-set-doorbel-mem-v2.odt

Thank you
Helen



>   - the code is a mess in this version.  I really don't see the need for
>     all the ifdefs, but if you really want to keep them they should move
>     out of the main code path and just stub out helpers that would
>     otherwise do work.
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-05-11 16:50     ` Helen Koike
@ 2016-05-11 17:43       ` Keith Busch
  2016-05-12  4:05         ` Helen Koike
  0 siblings, 1 reply; 49+ messages in thread
From: Keith Busch @ 2016-05-11 17:43 UTC (permalink / raw)


On Wed, May 11, 2016 at 01:50:25PM -0300, Helen Koike wrote:
> 
> On 21-04-2016 10:38, Christoph Hellwig wrote:
> >I've not spent much time at it due to a busy conference week, but there
> >are two main comments on why I don't think this is suitable as is
> >despite really liking the general idea.
> >
> >  - I'd really like this to be proposed as an official extension to the
> >    NVMe technical workgroup.  I'm also happy to help with making that
> >    happen.  I really don't like to merge it until we have some basic
> >    agreement on it, although once the basic agreement is there it
> >    shouldn't be too hard to also support the older google specific
> >    version.  And this is no new feedback, a couple of people including
> >    me said that a long time ago, and we've seen zero action on it.
> 
> 
> How does the process of submitting a proposal to the NVMe technical
> workgroup work? If it is just a matter of sending a document as the entry
> point and as you offered help, I was wondering if you could just forward
> this proposal to them and I can continue the work from that point
> onwards (as we wouldn't like to burden you), would that be possible?

Are you a contributing member of the NVM Express organization? If so,
you can directly send your proposal to the technical working group mailing
list to start the process without a middle-man.

For more information, see www.nvmexpress.org/join-nvme/.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-05-11 17:43       ` Keith Busch
@ 2016-05-12  4:05         ` Helen Koike
  2016-05-12  7:05           ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Helen Koike @ 2016-05-12  4:05 UTC (permalink / raw)




On 11-05-2016 14:43, Keith Busch wrote:
> On Wed, May 11, 2016 at 01:50:25PM -0300, Helen Koike wrote:
>>
>> On 21-04-2016 10:38, Christoph Hellwig wrote:
>>> I've not spent much time at it due to a busy conference week, but there
>>> are two main comments on why I don't think this is suitable as is
>>> despite really liking the general idea.
>>>
>>>   - I'd really like this to be proposed as an official extension to the
>>>     NVMe technical workgroup.  I'm also happy to help with making that
>>>     happen.  I really don't like to merge it until we have some basic
>>>     agreement on it, although once the basic agreement is there it
>>>     shouldn't be too hard to also support the older google specific
>>>     version.  And this is no new feedback, a couple of people including
>>>     me said that a long time ago, and we've seen zero action on it.
>>
>>
>> How does the process of submitting a proposal to the NVMe technical
>> workgroup work? If it is just a matter of sending a document as the entry
>> point and as you offered help, I was wondering if you could just forward
>> this proposal to them and I can continue the work from that point
>> onwards (as we wouldn't like to burden you), would that be possible?
>
> Are you a contributing member of the NVM Express organization? If so,
> you can directly send your proposal to the technical working group mailing
> list to start the process without a middle-man.

No, I am not, that is why I am asking. If it's not a burden to you I 
would appreciate some help to start the proposal with the NVMe technical 
workgroup, otherwise I'll need to go through a more bureaucratic path to 
do it in Google's name.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
  2016-05-12  4:05         ` Helen Koike
@ 2016-05-12  7:05           ` Christoph Hellwig
       [not found]             ` <CALu-fzt8=kmzZDqKsRYE42Q+F+Zu3U_s6XErpd9izXdxG1cUMA@mail.gmail.com>
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2016-05-12  7:05 UTC (permalink / raw)


On Thu, May 12, 2016 at 01:05:32AM -0300, Helen Koike wrote:
> No, I am not, that is why I am asking. If it's not a burden to you I would
> appreciate some help to start the proposal with the NVMe technical
> workgroup, otherwise I'll need to go through a more bureaucratic path to do
> it in Google's name.

As far as I understand, the disclosure rules prohibit members from
bringing up proposals from others.  It would be really helpful
if you/Google were part of this.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices
       [not found]               ` <CAPst7KAAymUqbNwzxNAczduF6iLXQAMm5-auEdXZ_zHwhHtbWQ@mail.gmail.com>
@ 2016-05-12 17:29                 ` Mike Waychison
  0 siblings, 0 replies; 49+ messages in thread
From: Mike Waychison @ 2016-05-12 17:29 UTC (permalink / raw)


On Thu, May 12, 2016 at 7:40 AM, Monish Shah <monish@google.com> wrote:
>
> [+mikew]
>
> Hello all,
>
> I don't have all the context, but this sounds like the same thing that Mike Waychison discussed with me recently.  Mike has been added to the Technical Working Group e-mail alias and can communicate on Google's behalf.  I'm happy to help, where I can.
>
> Monish
>
>
> Monish Shah | Hardware Engineer | monish at google.com | 650-253-2822
>
>
> On Thu, May 12, 2016 at 7:21 AM, Rob Nelson <rlnelson@google.com> wrote:
>>
>> Apologies for my absence here, but I've since left my position working on virtualization at Google.
>>
>> Monish is the lead contact for Google with the NVMe working group, so he might help here. However, Frank/Eric
>> are still on the virtualization team and can comment if they are interested in moving forward here.
>>
>> Right after I left (end of last year), I thought there was already a proposal that was virtualization friendly
>> (not my extension) but something similar to reduce the number of guest exits. I'll let Monish comment on
>> that since I have no clue where it went.
>>
>> On Thu, May 12, 2016 at 12:05 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>>
>>> On Thu, May 12, 2016 at 01:05:32AM -0300, Helen Koike wrote:
>>> > No, I am not, that is why I am asking. If it's not a burden to you I would
>>> > appreciate some help to start the proposal with the NVMe technical
>>> > workgroup, otherwise I'll need to go through a more bureaucratic path to do
>>> > it in Google's name.
>>>
>>> As far as I understand, the disclosure rules prohibit members from
>>> bringing up proposals from others.  It would be really helpful
>>> if you/Google were part of this.
>>
>>

Hi Christoph,

Helen and team have been looking at this on my team's behalf.  In
summary, we'd like to work together to get the feature in question
accepted upstream, and understand that this needs to go through the
working group for it to be adopted.

I'm relatively unfamiliar with effecting change through the working
group so far, but would like to work together to do so nevertheless if
this is what is required.  Please help direct what is needed from us
to move this forward and we will do our best to oblige.

Thanks,

Mike Waychison

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v3 RFC 0/2] Virtual NVMe device optimization
  2016-04-14 18:04 [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices Helen Mae Koike Fornazier
@ 2016-08-16  1:41   ` Helen Koike
  2016-05-05 15:15 ` Helen Koike
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 49+ messages in thread
From: Helen Koike @ 2016-08-16  1:41 UTC (permalink / raw)
  To: hch, mlin, fes, keith.busch, rlnelson, axboe, digitaleric, tytso,
	mikew, monish
  Cc: Helen Koike, open list, open list:NVM EXPRESS DRIVER,
	open list:PCI SUBSYSTEM

Please check the commit "nvme: improve performance for virtual NVMe devices" for more details.
The commit "PCI: Add Google device ID" only adds a new ID in pci_ids.h.

Patches are based on the linux-block/for-next branch and available here:
https://github.com/helen-fornazier/opw-staging/commits/nvme/dev

Helen Koike (1):
  PCI: Add Google device ID

Rob Nelson (1):
  nvme: improve performance for virtual NVMe devices

 drivers/nvme/host/Kconfig  |  11 ++++
 drivers/nvme/host/Makefile |   1 +
 drivers/nvme/host/pci.c    |  29 ++++++++++-
 drivers/nvme/host/vdb.c    | 125 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/vdb.h    | 118 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/nvme.h       |  17 ++++++
 include/linux/pci_ids.h    |   2 +
 7 files changed, 301 insertions(+), 2 deletions(-)
 create mode 100644 drivers/nvme/host/vdb.c
 create mode 100644 drivers/nvme/host/vdb.h

-- 
1.9.1

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v3 RFC 1/2] PCI: Add Google device ID
  2016-08-16  1:41   ` Helen Koike
  (?)
@ 2016-08-16  1:41   ` Helen Koike
  2016-08-18 21:59     ` Bjorn Helgaas
  -1 siblings, 1 reply; 49+ messages in thread
From: Helen Koike @ 2016-08-16  1:41 UTC (permalink / raw)
  To: hch, mlin, fes, keith.busch, rlnelson, axboe, digitaleric, tytso,
	mikew, monish
  Cc: Helen Koike, Helen Koike, Bjorn Helgaas, open list:PCI SUBSYSTEM,
	open list

Add device ID for the local SSDs (NVMe) in Google Clound Engine

Signed-off-by: Helen Koike <helen.koike@collabora.com>
---

Changes since v2:
- This is a new patch in the series

 include/linux/pci_ids.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index c58752f..d422afc 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -948,6 +948,8 @@
 #define PCI_DEVICE_ID_APPLE_IPID2_GMAC	0x006b
 #define PCI_DEVICE_ID_APPLE_TIGON3	0x1645
 
+#define PCI_VENDOR_ID_GOOGLE		0x1AE0
+
 #define PCI_VENDOR_ID_YAMAHA		0x1073
 #define PCI_DEVICE_ID_YAMAHA_724	0x0004
 #define PCI_DEVICE_ID_YAMAHA_724F	0x000d
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 RFC 2/2] nvme: improve performance for virtual NVMe devices
  2016-08-16  1:41   ` Helen Koike
@ 2016-08-16  1:41     ` Helen Koike
  -1 siblings, 0 replies; 49+ messages in thread
From: Helen Koike @ 2016-08-16  1:41 UTC (permalink / raw)
  To: hch, mlin, fes, keith.busch, rlnelson, axboe, digitaleric, tytso,
	mikew, monish
  Cc: Helen Koike, open list, open list:NVM EXPRESS DRIVER

From: Rob Nelson <rlnelson@google.com>

This change provides a mechanism to reduce the number of MMIO doorbell
writes for the NVMe driver. When running in a virtualized environment
like QEMU, the cost of an MMIO is quite hefty. The main idea of the
patch is to provide the device two memory locations:
 1) to store the doorbell values so they can be looked up without the
    doorbell MMIO write
 2) to store an event index.
I believe the doorbell value is obvious, the event index not so much.
Similar to the virtio specification, the virtual device can tell the
driver (guest OS) not to write MMIO unless it is writing past this
value.
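
To illustrate the event index check with concrete numbers (this sketch
is not part of the patch; it mirrors the nvme_ext_need_event() helper
in the diff below):

	/* Illustrative only: the virtio-style wraparound test.
	 * event_idx = 5, old = 3, new = 6: (u16)(6-5-1) = 0 < (u16)(6-3) = 3,
	 * so the doorbell moved past the event index and MMIO is needed.
	 * event_idx = 5, old = 3, new = 5: (u16)(5-5-1) = 0xffff, not < 2,
	 * so the MMIO write can be skipped.
	 */
	static int need_mmio(u16 event_idx, u16 new_idx, u16 old)
	{
		return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
	}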

FYI: doorbell values are written by the nvme driver (guest OS) and the
event index is written by the virtual device (host OS).

The patch implements a new admin command that will communicate where
these two memory locations reside. If the command fails, the nvme
driver will work as before without any optimizations.

Contributions:
  Eric Northup <digitaleric@google.com>
  Frank Swiderski <fes@google.com>
  Ted Tso <tytso@mit.edu>
  Keith Busch <keith.busch@intel.com>

Just to give an idea of the performance boost with the vendor
extension: running fio [1], with a stock NVMe driver I get about 200K
read IOPs; with my vendor patch I get about 1000K read IOPs. This was
run with a null device, i.e. the backing device simply returned
success on every read IO request.

[1] Running on a 4 core machine:
  fio --time_based --name=benchmark --runtime=30
  --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32
  --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
  --rw=randread --blocksize=4k --randrepeat=false

Signed-off-by: Rob Nelson <rlnelson@google.com>
[mlin: port for upstream]
Signed-off-by: Ming Lin <mlin@kernel.org>
[koike: updated for upstream]
Signed-off-by: Helen Koike <helen.koike@collabora.co.uk>
---

Changes since v2:
- Add vdb.c and vdb.h; the idea is to keep the code in pci.c clean and to
make it easier to integrate with the official nvme extension when the nvme
consortium publishes it
- Remove rmb (I couldn't see why they were necessary here, please let me
know if I am wrong)
- Reposition wmb
- Transform specific code into helper functions
- Coding style (checkpatch, remove unnecessary goto, change if statement
logic to decrease indentation)
- Rename feature to CONFIG_NVME_VDB
- Remove some PCI_VENDOR_ID_GOOGLE checks

 drivers/nvme/host/Kconfig  |  11 ++++
 drivers/nvme/host/Makefile |   1 +
 drivers/nvme/host/pci.c    |  29 ++++++++++-
 drivers/nvme/host/vdb.c    | 125 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/vdb.h    | 118 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/nvme.h       |  17 ++++++
 6 files changed, 299 insertions(+), 2 deletions(-)
 create mode 100644 drivers/nvme/host/vdb.c
 create mode 100644 drivers/nvme/host/vdb.h

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index db39d53..d3f4da9 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -43,3 +43,14 @@ config NVME_RDMA
 	  from https://github.com/linux-nvme/nvme-cli.
 
 	  If unsure, say N.
+
+config NVME_VDB
+	bool "NVMe Virtual Doorbell Extension for Improved Virtualization"
+	depends on NVME_CORE
+	---help---
+	  This provides support for the Virtual Doorbell Extension which
+	  reduces the number of required MMIOs to ring doorbells, improving
+	  performance in virtualized environments where MMIO causes a high
+	  overhead.
+
+	  If unsure, say N.
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index 47abcec..d4d0e3d 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -8,6 +8,7 @@ nvme-core-$(CONFIG_BLK_DEV_NVME_SCSI)	+= scsi.o
 nvme-core-$(CONFIG_NVM)			+= lightnvm.o
 
 nvme-y					+= pci.o
+nvme-$(CONFIG_NVME_VDB)			+= vdb.o
 
 nvme-fabrics-y				+= fabrics.o
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cf8b3d7..20bbc33 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -44,6 +44,7 @@
 #include <asm/unaligned.h>
 
 #include "nvme.h"
+#include "vdb.h"
 
 #define NVME_Q_DEPTH		1024
 #define NVME_AQ_DEPTH		256
@@ -99,6 +100,7 @@ struct nvme_dev {
 	dma_addr_t cmb_dma_addr;
 	u64 cmb_size;
 	u32 cmbsz;
+	struct nvme_vdb_dev vdb_d;
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
 };
@@ -131,6 +133,7 @@ struct nvme_queue {
 	u16 qid;
 	u8 cq_phase;
 	u8 cqe_seen;
+	struct nvme_vdb_queue vdb_q;
 };
 
 /*
@@ -171,6 +174,7 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_id_ns) != 4096);
 	BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
+	BUILD_BUG_ON(sizeof(struct nvme_doorbell_memory) != 64);
 }
 
 /*
@@ -285,7 +289,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
-	writel(tail, nvmeq->q_db);
+	nvme_write_doorbell_sq(&nvmeq->vdb_q, tail, nvmeq->q_db);
 	nvmeq->sq_tail = tail;
 }
 
@@ -713,7 +717,8 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 		return;
 
 	if (likely(nvmeq->cq_vector >= 0))
-		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
+		nvme_write_doorbell_cq(&nvmeq->vdb_q, head,
+				       nvmeq->q_db + nvmeq->dev->db_stride);
 	nvmeq->cq_head = head;
 	nvmeq->cq_phase = phase;
 
@@ -1068,6 +1073,8 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	dev->queues[qid] = nvmeq;
 	dev->queue_count++;
 
+	nvme_init_doorbell_mem(&dev->vdb_d, &nvmeq->vdb_q, qid, dev->db_stride);
+
 	return nvmeq;
 
  free_cqdma:
@@ -1098,6 +1105,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
+	nvme_init_doorbell_mem(&dev->vdb_d, &nvmeq->vdb_q, qid, dev->db_stride);
 	memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
 	dev->online_queues++;
 	spin_unlock_irq(&nvmeq->q_lock);
@@ -1588,6 +1596,9 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		if (blk_mq_alloc_tag_set(&dev->tagset))
 			return 0;
 		dev->ctrl.tagset = &dev->tagset;
+
+		nvme_set_doorbell_memory(dev->dev, &dev->vdb_d,
+					 &dev->ctrl, dev->db_stride);
 	} else {
 		blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
 
@@ -1655,6 +1666,18 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 
 	pci_enable_pcie_error_reporting(pdev);
 	pci_save_state(pdev);
+
+	/*
+	 * Google Cloud supports the in-memory doorbells extension, reducing
+	 * the number of MMIOs to optimize performance in virtualized environments
+	 */
+	if (pdev->vendor == PCI_VENDOR_ID_GOOGLE) {
+		result = nvme_dma_alloc_doorbell_mem(dev->dev, &dev->vdb_d,
+						     dev->db_stride);
+		if (result)
+			goto disable;
+	}
+
 	return 0;
 
  disable:
@@ -1673,6 +1696,8 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
+	nvme_dma_free_doorbell_mem(dev->dev, &dev->vdb_d, dev->db_stride);
+
 	if (pdev->msi_enabled)
 		pci_disable_msi(pdev);
 	else if (pdev->msix_enabled)
diff --git a/drivers/nvme/host/vdb.c b/drivers/nvme/host/vdb.c
new file mode 100644
index 0000000..e0c3fef
--- /dev/null
+++ b/drivers/nvme/host/vdb.c
@@ -0,0 +1,125 @@
+/*
+ * NVM Express device driver
+ * Copyright (C) 2015-2016, Google, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include "nvme.h"
+#include "vdb.h"
+
+static inline unsigned int nvme_doorbell_memory_size(u32 stride)
+{
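+	/* admin queue plus one IO queue per possible CPU, with one SQ
+	 * and one CQ doorbell slot of (4 * stride) bytes each
+	 */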
+	return ((num_possible_cpus() + 1) * 8 * stride);
+}
+
+int nvme_dma_alloc_doorbell_mem(struct device *dev,
+				struct nvme_vdb_dev *vdb_d,
+				u32 stride)
+{
+	unsigned int mem_size = nvme_doorbell_memory_size(stride);
+
+	vdb_d->db_mem = dma_alloc_coherent(dev, mem_size, &vdb_d->doorbell,
+					   GFP_KERNEL);
+	if (!vdb_d->db_mem)
+		return -ENOMEM;
+	vdb_d->ei_mem = dma_alloc_coherent(dev, mem_size, &vdb_d->eventidx,
+					   GFP_KERNEL);
+	if (!vdb_d->ei_mem) {
+		dma_free_coherent(dev, mem_size,
+				  vdb_d->db_mem, vdb_d->doorbell);
+		vdb_d->db_mem = NULL;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+void nvme_dma_free_doorbell_mem(struct device *dev,
+				struct nvme_vdb_dev *vdb_d,
+				u32 stride)
+{
+	unsigned int mem_size = nvme_doorbell_memory_size(stride);
+
+	if (vdb_d->db_mem) {
+		dma_free_coherent(dev, mem_size,
+				  vdb_d->db_mem, vdb_d->doorbell);
+		vdb_d->db_mem = NULL;
+	}
+	if (vdb_d->ei_mem) {
+		dma_free_coherent(dev, mem_size,
+				  vdb_d->ei_mem, vdb_d->eventidx);
+		vdb_d->ei_mem = NULL;
+	}
+}
+
+void nvme_init_doorbell_mem(struct nvme_vdb_dev *vdb_d,
+			    struct nvme_vdb_queue *vdb_q,
+			    int qid, u32 stride)
+{
+	if (!vdb_d->db_mem || !qid)
+		return;
+
+	vdb_q->sq_doorbell_addr = &vdb_d->db_mem[SQ_IDX(qid, stride)];
+	vdb_q->cq_doorbell_addr = &vdb_d->db_mem[CQ_IDX(qid, stride)];
+	vdb_q->sq_eventidx_addr = &vdb_d->ei_mem[SQ_IDX(qid, stride)];
+	vdb_q->cq_eventidx_addr = &vdb_d->ei_mem[CQ_IDX(qid, stride)];
+}
+
+void nvme_set_doorbell_memory(struct device *dev,
+			      struct nvme_vdb_dev *vdb_d,
+			      struct nvme_ctrl *ctrl,
+			      u32 stride)
+{
+	struct nvme_command c;
+
+	if (!vdb_d->db_mem)
+		return;
+
+	memset(&c, 0, sizeof(c));
+	c.doorbell_memory.opcode = nvme_admin_doorbell_memory;
+	c.doorbell_memory.prp1 = cpu_to_le64(vdb_d->doorbell);
+	c.doorbell_memory.prp2 = cpu_to_le64(vdb_d->eventidx);
+
+	if (nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0))
+		/* Free memory and continue on */
+		nvme_dma_free_doorbell_mem(dev, vdb_d, stride);
+}
+
+static inline int nvme_ext_need_event(u16 event_idx, u16 new_idx, u16 old)
+{
+	/* Borrowed from vring_need_event */
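+	/*
+	 * The unsigned arithmetic makes this wrap-safe: it is true iff
+	 * event_idx lies in [old, new_idx), i.e. the doorbell has just
+	 * moved past the device's event index.
+	 */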
+	return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
+}
+
+void nvme_write_doorbell(u16 value,
+			 u32 __iomem *q_db,
+			 u32 *db_addr,
+			 volatile u32 *event_idx)
+{
+	u16 old_value;
+
+	if (!db_addr) {
+		writel(value, q_db);
+		return;
+	}
+
+	/*
+	 * Ensure that the queue is written before updating
+	 * the doorbell in memory
+	 */
+	wmb();
+
+	old_value = *db_addr;
+	*db_addr = value;
+
+	if (nvme_ext_need_event(*event_idx, value, old_value))
+		writel(value, q_db);
+}
diff --git a/drivers/nvme/host/vdb.h b/drivers/nvme/host/vdb.h
new file mode 100644
index 0000000..37edd75
--- /dev/null
+++ b/drivers/nvme/host/vdb.h
@@ -0,0 +1,118 @@
+/*
+ * NVM Express device driver
+ * Copyright (C) 2015-2016, Google, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _NVME_VDB_H
+#define _NVME_VDB_H
+
+#ifdef CONFIG_NVME_VDB
+
+#define SQ_IDX(qid, stride) ((qid) * 2 * (stride))
+#define CQ_IDX(qid, stride) (((qid) * 2 + 1) * (stride))
+
+struct nvme_vdb_dev {
+	u32 *db_mem;
+	dma_addr_t doorbell;
+	u32 *ei_mem;
+	dma_addr_t eventidx;
+};
+
+struct nvme_vdb_queue {
+	u32 *sq_doorbell_addr;
+	u32 *sq_eventidx_addr;
+	u32 *cq_doorbell_addr;
+	u32 *cq_eventidx_addr;
+};
+
+int nvme_dma_alloc_doorbell_mem(struct device *dev,
+				struct nvme_vdb_dev *vdb_d,
+				u32 stride);
+
+void nvme_dma_free_doorbell_mem(struct device *dev,
+				struct nvme_vdb_dev *vdb_d,
+				u32 stride);
+
+void nvme_init_doorbell_mem(struct nvme_vdb_dev *vdb_d,
+			    struct nvme_vdb_queue *vdb_q,
+			    int qid, u32 stride);
+
+void nvme_set_doorbell_memory(struct device *dev,
+			      struct nvme_vdb_dev *vdb_d,
+			      struct nvme_ctrl *ctrl,
+			      u32 stride);
+
+void nvme_write_doorbell(u16 value,
+			 u32 __iomem *q_db,
+			 u32 *db_addr,
+			 volatile u32 *event_idx);
+
+static inline void nvme_write_doorbell_cq(struct nvme_vdb_queue *vdb_q,
+					  u16 value, u32 __iomem *q_db)
+{
+	nvme_write_doorbell(value, q_db,
+			    vdb_q->cq_doorbell_addr,
+			    vdb_q->cq_eventidx_addr);
+}
+
+static inline void nvme_write_doorbell_sq(struct nvme_vdb_queue *vdb_q,
+					  u16 value, u32 __iomem *q_db)
+{
+	nvme_write_doorbell(value, q_db,
+			    vdb_q->sq_doorbell_addr,
+			    vdb_q->sq_eventidx_addr);
+}
+
+#else /* CONFIG_NVME_VDB */
+
+struct nvme_vdb_dev {};
+
+struct nvme_vdb_queue {};
+
+static inline int nvme_dma_alloc_doorbell_mem(struct device *dev,
+					      struct nvme_vdb_dev *vdb_d,
+					      u32 stride)
+{
+	return 0;
+}
+
+static inline void nvme_dma_free_doorbell_mem(struct device *dev,
+					      struct nvme_vdb_dev *vdb_d,
+					      u32 stride)
+{}
+
+static inline void nvme_set_doorbell_memory(struct device *dev,
+					    struct nvme_vdb_dev *vdb_d,
+					    struct nvme_ctrl *ctrl,
+					    u32 stride)
+{}
+
+static inline void nvme_init_doorbell_mem(struct nvme_vdb_dev *vdb_d,
+					  struct nvme_vdb_queue *vdb_q,
+					  int qid, u32 stride)
+{}
+
+static inline void nvme_write_doorbell_cq(struct nvme_vdb_queue *vdb_q,
+					  u16 value, u32 __iomem *q_db)
+{
+	writel(value, q_db);
+}
+
+static inline void nvme_write_doorbell_sq(struct nvme_vdb_queue *vdb_q,
+					  u16 value, u32 __iomem *q_db)
+{
+	writel(value, q_db);
+}
+
+#endif /* CONFIG_NVME_VDB */
+
+#endif /* _NVME_VDB_H */
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index d8b37ba..46d0412 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -562,6 +562,9 @@ enum nvme_admin_opcode {
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
 	nvme_admin_security_recv	= 0x82,
+#ifdef CONFIG_NVME_VDB
+	nvme_admin_doorbell_memory	= 0xC0,
+#endif
 };
 
 enum {
@@ -827,6 +830,16 @@ struct nvmf_property_get_command {
 	__u8		resv4[16];
 };
 
+struct nvme_doorbell_memory {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[5];
+	__le64			prp1;
+	__le64			prp2;
+	__u32			rsvd12[6];
+};
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -845,6 +858,7 @@ struct nvme_command {
 		struct nvmf_connect_command connect;
 		struct nvmf_property_set_command prop_set;
 		struct nvmf_property_get_command prop_get;
+		struct nvme_doorbell_memory doorbell_memory;
 	};
 };
 
@@ -934,6 +948,9 @@ enum {
 	/*
 	 * Media and Data Integrity Errors:
 	 */
+#ifdef CONFIG_NVME_VDB
+	NVME_SC_DOORBELL_MEMORY_INVALID	= 0x1C0,
+#endif
 	NVME_SC_WRITE_FAULT		= 0x280,
 	NVME_SC_READ_ERROR		= 0x281,
 	NVME_SC_GUARD_CHECK		= 0x282,
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 RFC 2/2] nvme: improve performance for virtual NVMe devices
  2016-08-16  1:41     ` Helen Koike
@ 2016-08-16 20:45       ` J Freyensee
  -1 siblings, 0 replies; 49+ messages in thread
From: J Freyensee @ 2016-08-16 20:45 UTC (permalink / raw)
  To: Helen Koike, hch, mlin, fes, keith.busch, rlnelson, axboe,
	digitaleric, tytso, mikew, monish
  Cc: open list, open list:NVM EXPRESS DRIVER, Huffman, Amber, Minturn, Dave B

On Mon, 2016-08-15 at 22:41 -0300, Helen Koike wrote:


>  
> +struct nvme_doorbell_memory {
> +	__u8			opcode;
> +	__u8			flags;
> +	__u16			command_id;
> +	__u32			rsvd1[5];
> +	__le64			prp1;
> +	__le64			prp2;
> +	__u32			rsvd12[6];
> +};
> +
>  struct nvme_command {
>  	union {
>  		struct nvme_common_command common;
> @@ -845,6 +858,7 @@ struct nvme_command {
>  		struct nvmf_connect_command connect;
>  		struct nvmf_property_set_command prop_set;
>  		struct nvmf_property_get_command prop_get;
> +		struct nvme_doorbell_memory doorbell_memory;
>  	};
>  };

This looks like a new NVMe command being introduced, not found in the
latest NVMe specs (NVMe 1.2.1 spec or NVMe-over-Fabrics 1.0 spec)?

This is a big NACK: the command needs to be part of the NVMe standard
before it is added to the NVMe code base (this is exactly how the
NVMe-over-Fabrics standard got implemented).  I would bring your
proposal to nvmexpress.org.

Jay


>  
> @@ -934,6 +948,9 @@ enum {
>  	/*
>  	 * Media and Data Integrity Errors:
>  	 */
> +#ifdef CONFIG_NVME_VDB
> +	NVME_SC_DOORBELL_MEMORY_INVALID	= 0x1C0,
> +#endif
>  	NVME_SC_WRITE_FAULT		= 0x280,
>  	NVME_SC_READ_ERROR		= 0x281,
>  	NVME_SC_GUARD_CHECK		= 0x282,

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 RFC 2/2] nvme: improve performance for virtual NVMe devices
  2016-08-16 20:45       ` J Freyensee
@ 2016-08-16 23:45         ` Keith Busch
  -1 siblings, 0 replies; 49+ messages in thread
From: Keith Busch @ 2016-08-16 23:45 UTC (permalink / raw)
  To: J Freyensee
  Cc: Helen Koike, hch, mlin, fes, rlnelson, axboe, digitaleric, tytso,
	mikew, monish, open list, open list:NVM EXPRESS DRIVER, Huffman,
	Amber, Minturn, Dave B

On Tue, Aug 16, 2016 at 01:45:03PM -0700, J Freyensee wrote:
> On Mon, 2016-08-15 at 22:41 -0300, Helen Koike wrote:
> >  		struct nvmf_connect_command connect;
> >  		struct nvmf_property_set_command prop_set;
> >  		struct nvmf_property_get_command prop_get;
> > +		struct nvme_doorbell_memory doorbell_memory;
> >  	};
> >  };
> 
> This looks like a new NVMe command being introduced, not found in the
> latest NVMe specs (NVMe 1.2.1 spec or NVMe-over-Fabrics 1.0 spec)?
> 
> This is a big NACK, the command needs to be part of the NVMe standard
> before adding it to the NVMe code base (this is exactly how NVMe-over-
> Fabrics standard got implemented).  I would bring your proposal to
> nvmexpress.org.

While this is an approved TPAR up for consideration soon, I don't think we
want to include this in mainline before it is ratified, lest the current
form conflict with the future spec.

We can still review and comment on the code, though it's not normally
done publicly until the spec is also public. The proposal originated
from a public RFC, though, so it's not a secret.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 RFC 1/2] PCI: Add Google device ID
  2016-08-16  1:41   ` [PATCH v3 RFC 1/2] PCI: Add Google device ID Helen Koike
@ 2016-08-18 21:59     ` Bjorn Helgaas
  0 siblings, 0 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2016-08-18 21:59 UTC (permalink / raw)
  To: Helen Koike
  Cc: hch, mlin, fes, keith.busch, rlnelson, axboe, digitaleric, tytso,
	mikew, monish, Helen Koike, Bjorn Helgaas,
	open list:PCI SUBSYSTEM, open list

On Mon, Aug 15, 2016 at 10:41:19PM -0300, Helen Koike wrote:
> Add device ID for the local SSDs (NVMe) in Google Clound Engine

s/Clound/Cloud/

> Signed-off-by: Helen Koike <helen.koike@collabora.com>
> ---
> 
> Changes since v2:
> - This is a new patch in the series
> 
>  include/linux/pci_ids.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index c58752f..d422afc 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -948,6 +948,8 @@
>  #define PCI_DEVICE_ID_APPLE_IPID2_GMAC	0x006b
>  #define PCI_DEVICE_ID_APPLE_TIGON3	0x1645
>  
> +#define PCI_VENDOR_ID_GOOGLE		0x1AE0

See the comment at the top of pci_ids.h.

It looks like you only use this in drivers/nvme/host/pci.c.

>  #define PCI_VENDOR_ID_YAMAHA		0x1073
>  #define PCI_DEVICE_ID_YAMAHA_724	0x0004
>  #define PCI_DEVICE_ID_YAMAHA_724F	0x000d
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v4 RFC] nvme: improve performance for virtual NVMe devices
  2016-08-16  1:41     ` Helen Koike
  (?)
  (?)
@ 2017-03-17 21:44     ` Helen Koike
  2017-03-17 22:28       ` Keith Busch
  -1 siblings, 1 reply; 49+ messages in thread
From: Helen Koike @ 2017-03-17 21:44 UTC (permalink / raw)


From: Helen Koike <helen.koike@collabora.co.uk>

This change provides a mechanism to reduce the number of MMIO doorbell
writes for the NVMe driver. When running in a virtualized environment
like QEMU, the cost of an MMIO is quite hefty. The main idea of the
patch is to provide the device two memory locations:
 1) to store the doorbell values so they can be looked up without the
    doorbell MMIO write
 2) to store an event index.
I believe the doorbell value is obvious, the event index not so much.
Similar to the virtio specification, the virtual device can tell the
driver (guest OS) not to write MMIO unless it is writing past this
value.

FYI: doorbell values are written by the nvme driver (guest OS) and the
event index is written by the virtual device (host OS).

The patch implements a new admin command that will communicate where
these two memory locations reside. If the command fails, the nvme
driver will work as before without any optimizations.

Contributions:
  Eric Northup <digitaleric at google.com>
  Frank Swiderski <fes at google.com>
  Ted Tso <tytso at mit.edu>
  Keith Busch <keith.busch at intel.com>

Just to give an idea of the performance boost with the vendor
extension: running fio [1], with a stock NVMe driver I get about 200K
read IOPs; with my vendor patch I get about 1000K read IOPs. This was
run with a null device, i.e. the backing device simply returned
success on every read IO request.

[1] Running on a 4 core machine:
  fio --time_based --name=benchmark --runtime=30
  --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32
  --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
  --rw=randread --blocksize=4k --randrepeat=false

Signed-off-by: Rob Nelson <rlnelson at google.com>
[mlin: port for upstream]
Signed-off-by: Ming Lin <mlin at kernel.org>
[koike: updated for upstream]
Signed-off-by: Helen Koike <helen.koike at collabora.co.uk>

---

This patch is based on git://git.infradead.org/nvme.git master
Tested through Google Cloud Engine
The TPAR is ratified by the NVMe working group

Changes since the last version:
	- Rename to dbbuf (to be closer to the latest TPAR)
	- Modify the opcodes according to the latest TPAR
	- Check if the device supports this feature through the OACS bit
	- Small cleanups
---
 drivers/nvme/host/Kconfig  |   9 ++++
 drivers/nvme/host/Makefile |   1 +
 drivers/nvme/host/dbbuf.c  | 125 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/dbbuf.h  | 118 ++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/pci.c    |  22 +++++++-
 include/linux/nvme.h       |  13 +++++
 6 files changed, 286 insertions(+), 2 deletions(-)
 create mode 100644 drivers/nvme/host/dbbuf.c
 create mode 100644 drivers/nvme/host/dbbuf.h

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 90745a6..7ada6f3 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -59,4 +59,13 @@ config NVME_FC
 	  To configure a NVMe over Fabrics controller use the nvme-cli tool
 	  from https://github.com/linux-nvme/nvme-cli.
 
+config NVME_DBBUF
+	bool "NVM Express Doorbell Buffer Config Command"
+	depends on NVME_CORE
+	---help---
+	  This provides support for the Doorbell Buffer Config Command, which
+	  reduces the number of required MMIOs to ring doorbells, improving
+	  performance for virtualised environments where MMIO causes a high
+	  overhead.
+
 	  If unsure, say N.
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index f1a7d94..ed9efe5 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -9,6 +9,7 @@ nvme-core-$(CONFIG_BLK_DEV_NVME_SCSI)	+= scsi.o
 nvme-core-$(CONFIG_NVM)			+= lightnvm.o
 
 nvme-y					+= pci.o
+nvme-$(CONFIG_NVME_DBBUF)		+= dbbuf.o
 
 nvme-fabrics-y				+= fabrics.o
 
diff --git a/drivers/nvme/host/dbbuf.c b/drivers/nvme/host/dbbuf.c
new file mode 100644
index 0000000..3900bb5
--- /dev/null
+++ b/drivers/nvme/host/dbbuf.c
@@ -0,0 +1,125 @@
+/*
+ * NVM Express device driver
+ * Copyright (C) 2015-2017, Google, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include "nvme.h"
+#include "dbbuf.h"
+
+static inline unsigned int nvme_dbbuf_size(u32 stride)
+{
+	return ((num_possible_cpus() + 1) * 8 * stride);
+}
+
+int nvme_dma_alloc_dbbuf(struct device *dev,
+			 struct nvme_dbbuf_dev *dbbuf_d,
+			 u32 stride)
+{
+	unsigned int mem_size = nvme_dbbuf_size(stride);
+
+	dbbuf_d->db_mem = dma_alloc_coherent(dev, mem_size, &dbbuf_d->doorbell,
+					     GFP_KERNEL);
+	if (!dbbuf_d->db_mem)
+		return -ENOMEM;
+	dbbuf_d->ei_mem = dma_alloc_coherent(dev, mem_size, &dbbuf_d->eventidx,
+					     GFP_KERNEL);
+	if (!dbbuf_d->ei_mem) {
+		dma_free_coherent(dev, mem_size,
+				  dbbuf_d->db_mem, dbbuf_d->doorbell);
+		dbbuf_d->db_mem = NULL;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+void nvme_dma_free_dbbuf(struct device *dev,
+			 struct nvme_dbbuf_dev *dbbuf_d,
+			 u32 stride)
+{
+	unsigned int mem_size = nvme_dbbuf_size(stride);
+
+	if (dbbuf_d->db_mem) {
+		dma_free_coherent(dev, mem_size,
+				  dbbuf_d->db_mem, dbbuf_d->doorbell);
+		dbbuf_d->db_mem = NULL;
+	}
+	if (dbbuf_d->ei_mem) {
+		dma_free_coherent(dev, mem_size,
+				  dbbuf_d->ei_mem, dbbuf_d->eventidx);
+		dbbuf_d->ei_mem = NULL;
+	}
+}
+
+void nvme_init_dbbuf(struct nvme_dbbuf_dev *dbbuf_d,
+		     struct nvme_dbbuf_queue *dbbuf_q,
+		     int qid, u32 stride)
+{
+	if (!dbbuf_d->db_mem || !qid)
+		return;
+
+	dbbuf_q->sq_doorbell_addr = &dbbuf_d->db_mem[SQ_IDX(qid, stride)];
+	dbbuf_q->cq_doorbell_addr = &dbbuf_d->db_mem[CQ_IDX(qid, stride)];
+	dbbuf_q->sq_eventidx_addr = &dbbuf_d->ei_mem[SQ_IDX(qid, stride)];
+	dbbuf_q->cq_eventidx_addr = &dbbuf_d->ei_mem[CQ_IDX(qid, stride)];
+}
+
+void nvme_set_dbbuf(struct device *dev,
+		    struct nvme_dbbuf_dev *dbbuf_d,
+		    struct nvme_ctrl *ctrl,
+		    u32 stride)
+{
+	struct nvme_command c;
+
+	if (!dbbuf_d->db_mem)
+		return;
+
+	memset(&c, 0, sizeof(c));
+	c.dbbuf.opcode = nvme_admin_dbbuf;
+	c.dbbuf.prp1 = cpu_to_le64(dbbuf_d->doorbell);
+	c.dbbuf.prp2 = cpu_to_le64(dbbuf_d->eventidx);
+
+	if (nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0))
+		/* Free memory and continue on */
+		nvme_dma_free_dbbuf(dev, dbbuf_d, stride);
+}
+
+static inline int nvme_ext_need_event(u16 event_idx, u16 new_idx, u16 old)
+{
+	/* Borrowed from vring_need_event */
+	return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
+}
+
+void nvme_write_doorbell(u16 value,
+			 u32 __iomem *q_db,
+			 u32 *db_addr,
+			 volatile u32 *event_idx)
+{
+	u16 old_value;
+
+	if (!db_addr) {
+		writel(value, q_db);
+		return;
+	}
+
+	/*
+	 * Ensure that the queue is written before updating
+	 * the doorbell in memory
+	 */
+	wmb();
+
+	old_value = *db_addr;
+	*db_addr = value;
+
+	if (nvme_ext_need_event(*event_idx, value, old_value))
+		writel(value, q_db);
+}
diff --git a/drivers/nvme/host/dbbuf.h b/drivers/nvme/host/dbbuf.h
new file mode 100644
index 0000000..0c0a83f
--- /dev/null
+++ b/drivers/nvme/host/dbbuf.h
@@ -0,0 +1,118 @@
+/*
+ * NVM Express device driver
+ * Copyright (C) 2015-2017, Google, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _NVME_DBBUF_H
+#define _NVME_DBBUF_H
+
+#define SQ_IDX(qid, stride) ((qid) * 2 * (stride))
+#define CQ_IDX(qid, stride) (((qid) * 2 + 1) * (stride))
+
+#ifdef CONFIG_NVME_DBBUF
+
+struct nvme_dbbuf_dev {
+	u32 *db_mem;
+	dma_addr_t doorbell;
+	u32 *ei_mem;
+	dma_addr_t eventidx;
+};
+
+struct nvme_dbbuf_queue {
+	u32 *sq_doorbell_addr;
+	u32 *sq_eventidx_addr;
+	u32 *cq_doorbell_addr;
+	u32 *cq_eventidx_addr;
+};
+
+int nvme_dma_alloc_dbbuf(struct device *dev,
+			 struct nvme_dbbuf_dev *dbbuf_d,
+			 u32 stride);
+
+void nvme_dma_free_dbbuf(struct device *dev,
+			 struct nvme_dbbuf_dev *dbbuf_d,
+			 u32 stride);
+
+void nvme_init_dbbuf(struct nvme_dbbuf_dev *dbbuf_d,
+		     struct nvme_dbbuf_queue *dbbuf_q,
+		     int qid, u32 stride);
+
+void nvme_set_dbbuf(struct device *dev,
+		    struct nvme_dbbuf_dev *dbbuf_d,
+		    struct nvme_ctrl *ctrl,
+		    u32 stride);
+
+void nvme_write_doorbell(u16 value,
+			 u32 __iomem *q_db,
+			 u32 *db_addr,
+			 volatile u32 *event_idx);
+
+static inline void nvme_write_doorbell_cq(struct nvme_dbbuf_queue *dbbuf_q,
+					  u16 value, u32 __iomem *q_db)
+{
+	nvme_write_doorbell(value, q_db,
+			    dbbuf_q->cq_doorbell_addr,
+			    dbbuf_q->cq_eventidx_addr);
+}
+
+static inline void nvme_write_doorbell_sq(struct nvme_dbbuf_queue *dbbuf_q,
+					  u16 value, u32 __iomem *q_db)
+{
+	nvme_write_doorbell(value, q_db,
+			    dbbuf_q->sq_doorbell_addr,
+			    dbbuf_q->sq_eventidx_addr);
+}
+
+#else /* CONFIG_NVME_DBBUF */
+
+struct nvme_dbbuf_dev {};
+
+struct nvme_dbbuf_queue {};
+
+static inline int nvme_dma_alloc_dbbuf(struct device *dev,
+				       struct nvme_dbbuf_dev *dbbuf_d,
+				       u32 stride)
+{
+	return 0;
+}
+
+static inline void nvme_dma_free_dbbuf(struct device *dev,
+				       struct nvme_dbbuf_dev *dbbuf_d,
+				       u32 stride)
+{}
+
+static inline void nvme_set_dbbuf(struct device *dev,
+				  struct nvme_dbbuf_dev *dbbuf_d,
+				  struct nvme_ctrl *ctrl,
+				  u32 stride)
+{}
+
+static inline void nvme_init_dbbuf(struct nvme_dbbuf_dev *dbbuf_d,
+				   struct nvme_dbbuf_queue *dbbuf_q,
+				   int qid, u32 stride)
+{}
+
+static inline void nvme_write_doorbell_cq(struct nvme_dbbuf_queue *dbbuf_q,
+					  u16 value, u32 __iomem *q_db)
+{
+	writel(value, q_db);
+}
+
+static inline void nvme_write_doorbell_sq(struct nvme_dbbuf_queue *dbbuf_q,
+					  u16 value, u32 __iomem *q_db)
+{
+	writel(value, q_db);
+}
+
+#endif /* CONFIG_NVME_DBBUF */
+
+#endif /* _NVME_DBBUF_H */
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5b7bd9c..3723697 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -46,6 +46,7 @@
 #include <linux/sed-opal.h>
 
 #include "nvme.h"
+#include "dbbuf.h"
 
 #define NVME_Q_DEPTH		1024
 #define NVME_AQ_DEPTH		256
@@ -103,6 +104,7 @@ struct nvme_dev {
 	u32 cmbloc;
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
+	struct nvme_dbbuf_dev dbbuf_d;
 };
 
 static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
@@ -133,6 +135,7 @@ struct nvme_queue {
 	u16 qid;
 	u8 cq_phase;
 	u8 cqe_seen;
+	struct nvme_dbbuf_queue dbbuf_q;
 };
 
 /*
@@ -174,6 +177,7 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_id_ns) != 4096);
 	BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
+	BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
 }
 
 /*
@@ -300,7 +304,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
-	writel(tail, nvmeq->q_db);
+	nvme_write_doorbell_sq(&nvmeq->dbbuf_q, tail, nvmeq->q_db);
 	nvmeq->sq_tail = tail;
 }
 
@@ -716,7 +720,8 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 		return;
 
 	if (likely(nvmeq->cq_vector >= 0))
-		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
+		nvme_write_doorbell_cq(&nvmeq->dbbuf_q, head,
+				       nvmeq->q_db + nvmeq->dev->db_stride);
 	nvmeq->cq_head = head;
 	nvmeq->cq_phase = phase;
 
@@ -1066,6 +1071,7 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	nvmeq->q_depth = depth;
 	nvmeq->qid = qid;
 	nvmeq->cq_vector = -1;
+	nvme_init_dbbuf(&dev->dbbuf_d, &nvmeq->dbbuf_q, qid, dev->db_stride);
 	dev->queues[qid] = nvmeq;
 	dev->queue_count++;
 
@@ -1099,6 +1105,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
 	memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
+	nvme_init_dbbuf(&dev->dbbuf_d, &nvmeq->dbbuf_q, qid, dev->db_stride);
 	dev->online_queues++;
 	spin_unlock_irq(&nvmeq->q_lock);
 }
@@ -1568,6 +1575,9 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		if (blk_mq_alloc_tag_set(&dev->tagset))
 			return 0;
 		dev->ctrl.tagset = &dev->tagset;
+
+		nvme_set_dbbuf(dev->dev, &dev->dbbuf_d,
+			       &dev->ctrl, dev->db_stride);
 	} else {
 		blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
 
@@ -1700,6 +1710,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		nvme_disable_admin_queue(dev, shutdown);
 	}
 	nvme_pci_disable(dev);
+	nvme_dma_free_dbbuf(dev->dev, &dev->dbbuf_d, dev->db_stride);
 
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
@@ -1800,6 +1811,13 @@ static void nvme_reset_work(struct work_struct *work)
 		dev->ctrl.opal_dev = NULL;
 	}
 
+	if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP) {
+		result = nvme_dma_alloc_dbbuf(dev->dev, &dev->dbbuf_d,
+					      dev->db_stride);
+		if (result)
+			goto out;
+	}
+
 	result = nvme_setup_io_queues(dev);
 	if (result)
 		goto out;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index c43d435..43a6289 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
+	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 7,
 };
 
 struct nvme_lbaf {
@@ -603,6 +604,7 @@ enum nvme_admin_opcode {
 	nvme_admin_download_fw		= 0x11,
 	nvme_admin_ns_attach		= 0x15,
 	nvme_admin_keep_alive		= 0x18,
+	nvme_admin_dbbuf		= 0x7C,
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
 	nvme_admin_security_recv	= 0x82,
@@ -874,6 +876,16 @@ struct nvmf_property_get_command {
 	__u8		resv4[16];
 };
 
+struct nvme_dbbuf {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[5];
+	__le64			prp1;
+	__le64			prp2;
+	__u32			rsvd12[6];
+};
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -893,6 +905,7 @@ struct nvme_command {
 		struct nvmf_connect_command connect;
 		struct nvmf_property_set_command prop_set;
 		struct nvmf_property_get_command prop_get;
+		struct nvme_dbbuf dbbuf;
 	};
 };
 
-- 
2.7.4


* [PATCH v4 RFC] nvme: improve performance for virtual NVMe devices
  2017-03-17 22:28       ` Keith Busch
@ 2017-03-17 22:26         ` Helen Koike
  2017-03-24  4:23         ` [PATCH v5 " Helen Koike
  1 sibling, 0 replies; 49+ messages in thread
From: Helen Koike @ 2017-03-17 22:26 UTC (permalink / raw)



On 2017-03-17 07:28 PM, Keith Busch wrote:
On Fri, Mar 17, 2017 at 06:44:59PM -0300, Helen Koike wrote:
> > diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
> > index 90745a6..7ada6f3 100644
> > --- a/drivers/nvme/host/Kconfig
> > +++ b/drivers/nvme/host/Kconfig
> > @@ -59,4 +59,13 @@ config NVME_FC
> >        To configure a NVMe over Fabrics controller use the nvme-cli tool
> >        from https://github.com/linux-nvme/nvme-cli.
> >
> > +config NVME_DBBUF
> > +    bool "NVM Express Doorbell Buffer Config Command"
> > +    depends on NVME_CORE
> > +    ---help---
> > +      This provides support for the Doorbell Buffer Config Command, which
> > +      reduces the number of required MMIOs to ring doorbells, improving
> > +      performance for virtualised environments where MMIO causes a high
> > +      overhead.
> > +
> >        If unsure, say N.
>
> It looks like you took the last line from NVME_FC's help. But I don't
> think this capability should be enabled through kernel config. Just
> have the driver react based on the device's advertised capabilities, ya?
>

The advantage is that you can choose not to use the feature (not sure if it has any use case) and reduce the code size if unused. But the code is really small. I'll remove CONFIG_NVME_DBBUF then and send another version; the code will be much cleaner imho.

Helen


* [PATCH v4 RFC] nvme: improve performance for virtual NVMe devices
  2017-03-17 21:44     ` [PATCH v4 RFC] " Helen Koike
@ 2017-03-17 22:28       ` Keith Busch
  2017-03-17 22:26         ` Helen Koike
  2017-03-24  4:23         ` [PATCH v5 " Helen Koike
  0 siblings, 2 replies; 49+ messages in thread
From: Keith Busch @ 2017-03-17 22:28 UTC (permalink / raw)


On Fri, Mar 17, 2017 at 06:44:59PM -0300, Helen Koike wrote:
> diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
> index 90745a6..7ada6f3 100644
> --- a/drivers/nvme/host/Kconfig
> +++ b/drivers/nvme/host/Kconfig
> @@ -59,4 +59,13 @@ config NVME_FC
>  	  To configure a NVMe over Fabrics controller use the nvme-cli tool
>  	  from https://github.com/linux-nvme/nvme-cli.
>  
> +config NVME_DBBUF
> +	bool "NVM Express Doorbell Buffer Config Command"
> +	depends on NVME_CORE
> +	---help---
> +	  This provides support for the Doorbell Buffer Config Command, which
> +	  reduces the number of required MMIOs to ring doorbells, improving
> +	  performance for virtualised environments where MMIO causes a high
> +	  overhead.
> +
>  	  If unsure, say N.

It looks like you took the last line from NVME_FC's help. But I don't
think this capability should be enabled through kernel config. Just
have the driver react based on the device's advertised capabilities, ya?
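
(The revisions that follow in this thread do exactly that: the feature is
gated at reset time on the controller's advertised OACS capability bit
instead of a compile-time option, roughly:

	if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP) {
		result = nvme_dbbuf_dma_alloc(dev);
		if (result)
			dev_warn(dev->dev,
				 "unable to allocate dma for dbbuf\n");
	}
)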


* [PATCH v5 RFC] nvme: improve performance for virtual NVMe devices
  2017-03-17 22:28       ` Keith Busch
  2017-03-17 22:26         ` Helen Koike
@ 2017-03-24  4:23         ` Helen Koike
  2017-03-27  9:49           ` Christoph Hellwig
                             ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Helen Koike @ 2017-03-24  4:23 UTC (permalink / raw)


This change provides a mechanism to reduce the number of MMIO doorbell
writes for the NVMe driver. When running in a virtualized environment
like QEMU, the cost of an MMIO is quite hefty here. The main idea of
the patch is to provide the device two memory locations:
 1) to store the doorbell values so they can be looked up without the
    doorbell MMIO write
 2) to store an event index.
I believe the doorbell value is obvious, the event index not so much.
Similar to the virtio specification, the virtual device can tell the
driver (guest OS) not to write MMIO unless it is writing past this
value.

FYI: doorbell values are written by the nvme driver (guest OS) and the
event index is written by the virtual device (host OS).

The patch implements a new admin command that will communicate where
these two memory locations reside. If the command fails, the nvme
driver will work as before, without any optimizations.
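
To make the event-index rule concrete, here is a self-contained C sketch
of the vring-style check implemented as nvme_dbbuf_need_event() in the
patch below; the numeric values are illustrative only:

#include <stdint.h>
#include <stdio.h>

/* True iff event_idx lies in the half-open window (old, new_idx],
 * computed with u16 wraparound: the MMIO doorbell is rung only when
 * this write moves past the index the device asked to be told about. */
static int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
	return (uint16_t)(new_idx - event_idx - 1) <
	       (uint16_t)(new_idx - old);
}

int main(void)
{
	printf("%d\n", need_event(10, 11, 9));       /* 1: 9 -> 11 crosses 10 */
	printf("%d\n", need_event(10, 12, 11));      /* 0: device asked up to 10 */
	printf("%d\n", need_event(65535, 1, 65534)); /* 1: correct across wrap */
	return 0;
}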

Contributions:
  Eric Northup <digitaleric at google.com>
  Frank Swiderski <fes at google.com>
  Ted Tso <tytso at mit.edu>
  Keith Busch <keith.busch at intel.com>

Just to give an idea of the performance boost with the vendor
extension: running fio [1], with a stock NVMe driver I get about 200K read
IOPs; with my vendor patch I get about 1000K read IOPs. This was
run with a null device, i.e. the backing device simply returned
success on every read IO request.

[1] Running on a 4-core machine:
  fio --time_based --name=benchmark --runtime=30
  --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32
  --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
  --rw=randread --blocksize=4k --randrepeat=false

Signed-off-by: Rob Nelson <rlnelson at google.com>
[mlin: port for upstream]
Signed-off-by: Ming Lin <mlin at kernel.org>
[koike: updated for upstream]
Signed-off-by: Helen Koike <helen.koike at collabora.com>

---

This patch is based on git://git.infradead.org/nvme.git master
Tested through Google Cloud Engine
The TPAR is ratified by the NVMe working group

Changes since v4
	- Remove CONFIG_NVME_DBBUF
	- Remove dbbuf.{c,h}, move the code to pci.c
	- Coding style changes
	- Functions and variables renamed

Changes since v3
	- Rename to dbbuf (to be closer to the latest TPAR)
	- Modify the opcodes according to the latest TPAR
	- Check if the device supports this feature through the OACS bit
	- Little cleanups
---
 drivers/nvme/host/pci.c | 144 +++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/nvme.h    |  13 +++++
 2 files changed, 155 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5b7bd9c..4289eb1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -51,6 +51,8 @@
 #define NVME_AQ_DEPTH		256
 #define SQ_SIZE(depth)		(depth * sizeof(struct nvme_command))
 #define CQ_SIZE(depth)		(depth * sizeof(struct nvme_completion))
+#define SQ_IDX(qid, stride)	((qid) * 2 * (stride))
+#define CQ_IDX(qid, stride)	(((qid) * 2 + 1) * (stride))
 
 /*
  * We handle AEN commands ourselves and don't even let the
@@ -103,6 +105,12 @@ struct nvme_dev {
 	u32 cmbloc;
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
+	struct {
+		u32 *dbs;
+		u32 *eis;
+		dma_addr_t dbs_dma_addr;
+		dma_addr_t eis_dma_addr;
+	} dbbuf;
 };
 
 static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
@@ -133,6 +141,12 @@ struct nvme_queue {
 	u16 qid;
 	u8 cq_phase;
 	u8 cqe_seen;
+	struct {
+		u32 *sq_db;
+		u32 *cq_db;
+		u32 *sq_ei;
+		u32 *cq_ei;
+	} dbbuf;
 };
 
 /*
@@ -174,6 +188,121 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_id_ns) != 4096);
 	BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
+	BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
+}
+
+static inline unsigned int nvme_dbbuf_size(u32 stride)
+{
+	return ((num_possible_cpus() + 1) * 8 * stride);
+}
+
+static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
+{
+	unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);
+
+	dev->dbbuf.dbs = dma_alloc_coherent(dev->dev, mem_size,
+					    &dev->dbbuf.dbs_dma_addr,
+					    GFP_KERNEL);
+	if (!dev->dbbuf.dbs)
+		return -ENOMEM;
+	dev->dbbuf.eis = dma_alloc_coherent(dev->dev, mem_size,
+					    &dev->dbbuf.eis_dma_addr,
+					    GFP_KERNEL);
+	if (!dev->dbbuf.eis) {
+		dma_free_coherent(dev->dev, mem_size,
+				  dev->dbbuf.dbs, dev->dbbuf.dbs_dma_addr);
+		dev->dbbuf.dbs = NULL;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void nvme_dbbuf_dma_free(struct nvme_dev *dev)
+{
+	unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);
+
+	if (dev->dbbuf.dbs) {
+		dma_free_coherent(dev->dev, mem_size,
+				  dev->dbbuf.dbs, dev->dbbuf.dbs_dma_addr);
+		dev->dbbuf.dbs = NULL;
+	}
+	if (dev->dbbuf.eis) {
+		dma_free_coherent(dev->dev, mem_size,
+				  dev->dbbuf.eis, dev->dbbuf.eis_dma_addr);
+		dev->dbbuf.eis = NULL;
+	}
+}
+
+static void nvme_dbbuf_init(struct nvme_dev *dev,
+			    struct nvme_queue *nvmeq, int qid)
+{
+	if (!dev->dbbuf.dbs || !qid)
+		return;
+
+	nvmeq->dbbuf.sq_db = &dev->dbbuf.dbs[SQ_IDX(qid, dev->db_stride)];
+	nvmeq->dbbuf.cq_db = &dev->dbbuf.dbs[CQ_IDX(qid, dev->db_stride)];
+	nvmeq->dbbuf.sq_ei = &dev->dbbuf.eis[SQ_IDX(qid, dev->db_stride)];
+	nvmeq->dbbuf.cq_ei = &dev->dbbuf.eis[CQ_IDX(qid, dev->db_stride)];
+}
+
+static void nvme_dbbuf_set(struct nvme_dev *dev)
+{
+	struct nvme_command c;
+
+	if (!dev->dbbuf.dbs)
+		return;
+
+	memset(&c, 0, sizeof(c));
+	c.dbbuf.opcode = nvme_admin_dbbuf;
+	c.dbbuf.prp1 = cpu_to_le64(dev->dbbuf.dbs_dma_addr);
+	c.dbbuf.prp2 = cpu_to_le64(dev->dbbuf.eis_dma_addr);
+
+	if (nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0))
+		/* Free memory and continue on */
+		nvme_dbbuf_dma_free(dev);
+}
+
+static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
+{
+	/* Borrowed from vring_need_event */
+	return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
+}
+
+static void nvme_write_doorbell(u16 value,
+				u32 __iomem *db,
+				u32 *dbbuf_db,
+				volatile u32 *dbbuf_ei)
+{
+	u16 old_value;
+
+	if (!dbbuf_db) {
+		writel(value, db);
+		return;
+	}
+
+	/*
+	 * Ensure that the queue is written before updating
+	 * the doorbell in memory
+	 */
+	wmb();
+
+	old_value = *dbbuf_db;
+	*dbbuf_db = value;
+	if (nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
+		writel(value, db);
+}
+
+static inline void nvme_write_doorbell_cq(struct nvme_queue *nvmeq, u16 value)
+{
+	nvme_write_doorbell(value, nvmeq->q_db + nvmeq->dev->db_stride,
+			    nvmeq->dbbuf.cq_db, nvmeq->dbbuf.cq_ei);
+}
+
+static inline void nvme_write_doorbell_sq(struct nvme_queue *nvmeq, u16 value)
+{
+	nvme_write_doorbell(value, nvmeq->q_db,
+			    nvmeq->dbbuf.sq_db, nvmeq->dbbuf.sq_ei);
 }
 
 /*
@@ -300,7 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
-	writel(tail, nvmeq->q_db);
+	nvme_write_doorbell_sq(nvmeq, tail);
 	nvmeq->sq_tail = tail;
 }
 
@@ -716,7 +845,7 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 		return;
 
 	if (likely(nvmeq->cq_vector >= 0))
-		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
+		nvme_write_doorbell_cq(nvmeq, head);
 	nvmeq->cq_head = head;
 	nvmeq->cq_phase = phase;
 
@@ -1066,6 +1195,7 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	nvmeq->q_depth = depth;
 	nvmeq->qid = qid;
 	nvmeq->cq_vector = -1;
+	nvme_dbbuf_init(dev, nvmeq, qid);
 	dev->queues[qid] = nvmeq;
 	dev->queue_count++;
 
@@ -1099,6 +1229,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
 	memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
+	nvme_dbbuf_init(dev, nvmeq, qid);
 	dev->online_queues++;
 	spin_unlock_irq(&nvmeq->q_lock);
 }
@@ -1568,6 +1699,8 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		if (blk_mq_alloc_tag_set(&dev->tagset))
 			return 0;
 		dev->ctrl.tagset = &dev->tagset;
+
+		nvme_dbbuf_set(dev);
 	} else {
 		blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
 
@@ -1700,6 +1833,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		nvme_disable_admin_queue(dev, shutdown);
 	}
 	nvme_pci_disable(dev);
+	nvme_dbbuf_dma_free(dev);
 
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
@@ -1800,6 +1934,12 @@ static void nvme_reset_work(struct work_struct *work)
 		dev->ctrl.opal_dev = NULL;
 	}
 
+	if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP) {
+		result = nvme_dbbuf_dma_alloc(dev);
+		if (result)
+			goto out;
+	}
+
 	result = nvme_setup_io_queues(dev);
 	if (result)
 		goto out;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index c43d435..43a6289 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
+	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 7,
 };
 
 struct nvme_lbaf {
@@ -603,6 +604,7 @@ enum nvme_admin_opcode {
 	nvme_admin_download_fw		= 0x11,
 	nvme_admin_ns_attach		= 0x15,
 	nvme_admin_keep_alive		= 0x18,
+	nvme_admin_dbbuf		= 0x7C,
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
 	nvme_admin_security_recv	= 0x82,
@@ -874,6 +876,16 @@ struct nvmf_property_get_command {
 	__u8		resv4[16];
 };
 
+struct nvme_dbbuf {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[5];
+	__le64			prp1;
+	__le64			prp2;
+	__u32			rsvd12[6];
+};
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -893,6 +905,7 @@ struct nvme_command {
 		struct nvmf_connect_command connect;
 		struct nvmf_property_set_command prop_set;
 		struct nvmf_property_get_command prop_get;
+		struct nvme_dbbuf dbbuf;
 	};
 };
 
-- 
2.7.4


* [PATCH v5 RFC] nvme: improve performance for virtual NVMe devices
  2017-03-24  4:23         ` [PATCH v5 " Helen Koike
@ 2017-03-27  9:49           ` Christoph Hellwig
  2017-03-27 16:04             ` Helen Koike
  2017-03-27 14:43           ` Keith Busch
  2017-03-27 16:50           ` [PATCH v6 " Helen Koike
  2 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2017-03-27  9:49 UTC (permalink / raw)


> +#define SQ_IDX(qid, stride)	((qid) * 2 * (stride))
> +#define CQ_IDX(qid, stride)	(((qid) * 2 + 1) * (stride))

Please use inline functions for these.

> +	struct {
> +		u32 *dbs;
> +		u32 *eis;
> +		dma_addr_t dbs_dma_addr;
> +		dma_addr_t eis_dma_addr;
> +	} dbbuf;

No need for a struct here, also please keep the field and its dma_addr_t
together.

> +	struct {
> +		u32 *sq_db;
> +		u32 *cq_db;
> +		u32 *sq_ei;
> +		u32 *cq_ei;
> +	} dbbuf;

No need for the struct here either.

> +
> +static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
> +{
> +	unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);
> +
> +	dev->dbbuf.dbs = dma_alloc_coherent(dev->dev, mem_size,
> +					    &dev->dbbuf.dbs_dma_addr,
> +					    GFP_KERNEL);
> +	if (!dev->dbbuf.dbs)
> +		return -ENOMEM;
> +	dev->dbbuf.eis = dma_alloc_coherent(dev->dev, mem_size,
> +					    &dev->dbbuf.eis_dma_addr,
> +					    GFP_KERNEL);
> +	if (!dev->dbbuf.eis) {
> +		dma_free_coherent(dev->dev, mem_size,
> +				  dev->dbbuf.dbs, dev->dbbuf.dbs_dma_addr);
> +		dev->dbbuf.dbs = NULL;
> +		return -ENOMEM;
> +	}
> +
> +	return 0;

Please use normal kernel-style goto unwinding.
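
(A sketch of that convention applied to the allocation above, for
illustration only; the later revisions in this thread end up keeping the
original structure:

static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
{
	unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);

	dev->dbbuf.dbs = dma_alloc_coherent(dev->dev, mem_size,
					    &dev->dbbuf.dbs_dma_addr,
					    GFP_KERNEL);
	if (!dev->dbbuf.dbs)
		return -ENOMEM;

	dev->dbbuf.eis = dma_alloc_coherent(dev->dev, mem_size,
					    &dev->dbbuf.eis_dma_addr,
					    GFP_KERNEL);
	if (!dev->dbbuf.eis)
		goto free_dbs;

	return 0;

free_dbs:
	/* unwind the first allocation on the error path */
	dma_free_coherent(dev->dev, mem_size,
			  dev->dbbuf.dbs, dev->dbbuf.dbs_dma_addr);
	dev->dbbuf.dbs = NULL;
	return -ENOMEM;
}
)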

> +static void nvme_dbbuf_set(struct nvme_dev *dev)
> +{
> +	struct nvme_command c;
> +
> +	if (!dev->dbbuf.dbs)
> +		return;
> +
> +	memset(&c, 0, sizeof(c));
> +	c.dbbuf.opcode = nvme_admin_dbbuf;
> +	c.dbbuf.prp1 = cpu_to_le64(dev->dbbuf.dbs_dma_addr);
> +	c.dbbuf.prp2 = cpu_to_le64(dev->dbbuf.eis_dma_addr);
> +
> +	if (nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0))
> +		/* Free memory and continue on */
> +		nvme_dbbuf_dma_free(dev);

At least log a warning.

> +static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
> +{
> +	/* Borrowed from vring_need_event */

I don't think this comment matters.

> +static void nvme_write_doorbell(u16 value,
> +				u32 __iomem *db,
> +				u32 *dbbuf_db,
> +				volatile u32 *dbbuf_ei)
> +{

Very odd formatting.  Why not:

static void nvme_write_doorbell(u16 value, u32 __iomem *db, u32 *dbbuf_db,
		volatile u32 *dbbuf_ei)

?

> +	u16 old_value;
> +
> +	if (!dbbuf_db) {
> +		writel(value, db);
> +		return;
> +	}

I'd prefer to keep this in the ultimate callers to make the flow
easier to read.

> +static inline void nvme_write_doorbell_cq(struct nvme_queue *nvmeq, u16 value)
> +{
> +	nvme_write_doorbell(value, nvmeq->q_db + nvmeq->dev->db_stride,
> +			    nvmeq->dbbuf.cq_db, nvmeq->dbbuf.cq_ei);
> +}
> +
> +static inline void nvme_write_doorbell_sq(struct nvme_queue *nvmeq, u16 value)
> +{
> +	nvme_write_doorbell(value, nvmeq->q_db,
> +			    nvmeq->dbbuf.sq_db, nvmeq->dbbuf.sq_ei);
>  }

I'd skip these wrappers entirely.

> +	if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP) {
> +		result = nvme_dbbuf_dma_alloc(dev);
> +		if (result)
> +			goto out;
> +	}

Should we really fail the init here or just print a warning?


* [PATCH v5 RFC] nvme: improve performance for virtual NVMe devices
  2017-03-24  4:23         ` [PATCH v5 " Helen Koike
  2017-03-27  9:49           ` Christoph Hellwig
@ 2017-03-27 14:43           ` Keith Busch
  2017-03-27 16:50           ` [PATCH v6 " Helen Koike
  2 siblings, 0 replies; 49+ messages in thread
From: Keith Busch @ 2017-03-27 14:43 UTC (permalink / raw)


On Fri, Mar 24, 2017 at 01:23:43AM -0300, Helen Koike wrote:
> +	if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP) {
> +		result = nvme_dbbuf_dma_alloc(dev);
> +		if (result)
> +			goto out;
> +	}

We are still functional without enabling this feature, so I don't think
we should error out. Just log a warning that it's less than optimal.


* [PATCH v5 RFC] nvme: improve performance for virtual NVMe devices
  2017-03-27  9:49           ` Christoph Hellwig
@ 2017-03-27 16:04             ` Helen Koike
  2017-03-27 16:25               ` Helen Koike
  0 siblings, 1 reply; 49+ messages in thread
From: Helen Koike @ 2017-03-27 16:04 UTC (permalink / raw)


Hi Keith,

Thanks for your review. Please see my comments below.

On 2017-03-27 06:49 AM, Christoph Hellwig wrote:
>> +#define SQ_IDX(qid, stride)	((qid) * 2 * (stride))
>> +#define CQ_IDX(qid, stride)	(((qid) * 2 + 1) * (stride))
>
> Please use inline functions for these.
>
>> +	struct {
>> +		u32 *dbs;
>> +		u32 *eis;
>> +		dma_addr_t dbs_dma_addr;
>> +		dma_addr_t eis_dma_addr;
>> +	} dbbuf;
>
> No need for a struct here, also please keep the field and its dma_addr_t
> together.
>
>> +	struct {
>> +		u32 *sq_db;
>> +		u32 *cq_db;
>> +		u32 *sq_ei;
>> +		u32 *cq_ei;
>> +	} dbbuf;
>
> No need for the struct here either.
>
>> +
>> +static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
>> +{
>> +	unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);
>> +
>> +	dev->dbbuf.dbs = dma_alloc_coherent(dev->dev, mem_size,
>> +					    &dev->dbbuf.dbs_dma_addr,
>> +					    GFP_KERNEL);
>> +	if (!dev->dbbuf.dbs)
>> +		return -ENOMEM;
>> +	dev->dbbuf.eis = dma_alloc_coherent(dev->dev, mem_size,
>> +					    &dev->dbbuf.eis_dma_addr,
>> +					    GFP_KERNEL);
>> +	if (!dev->dbbuf.eis) {
>> +		dma_free_coherent(dev->dev, mem_size,
>> +				  dev->dbbuf.dbs, dev->dbbuf.dbs_dma_addr);
>> +		dev->dbbuf.dbs = NULL;
>> +		return -ENOMEM;
>> +	}
>> +
>> +	return 0;
>
> Please use normal kernel-style goto unwinding.

I don't think it is necessary, as there is only a single item to unwind.
From my experience in other parts of the kernel code, we usually use
goto when there are several steps to unwind. But I don't mind
changing it if you still prefer a goto here.

>
>> +static void nvme_dbbuf_set(struct nvme_dev *dev)
>> +{
>> +	struct nvme_command c;
>> +
>> +	if (!dev->dbbuf.dbs)
>> +		return;
>> +
>> +	memset(&c, 0, sizeof(c));
>> +	c.dbbuf.opcode = nvme_admin_dbbuf;
>> +	c.dbbuf.prp1 = cpu_to_le64(dev->dbbuf.dbs_dma_addr);
>> +	c.dbbuf.prp2 = cpu_to_le64(dev->dbbuf.eis_dma_addr);
>> +
>> +	if (nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0))
>> +		/* Free memory and continue on */
>> +		nvme_dbbuf_dma_free(dev);
>
> At least log a warning.
>
>> +static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
>> +{
>> +	/* Borrowed from vring_need_event */
>
> I don't think this comment matters.
>
>> +static void nvme_write_doorbell(u16 value,
>> +				u32 __iomem *db,
>> +				u32 *dbbuf_db,
>> +				volatile u32 *dbbuf_ei)
>> +{
>
> Very odd formatting.  Why not:
>
> static void nvme_write_doorbell(u16 value, u32 __iomem *db, u32 *dbbuf_db,
> 		volatile u32 *dbbuf_ei)
>
> ?
>
>> +	u16 old_value;
>> +
>> +	if (!dbbuf_db) {
>> +		writel(value, db);
>> +		return;
>> +	}
>
> I'd prefer to keep this in the ultimate callers to make the flow
> easier to read.
>
>> +static inline void nvme_write_doorbell_cq(struct nvme_queue *nvmeq, u16 value)
>> +{
>> +	nvme_write_doorbell(value, nvmeq->q_db + nvmeq->dev->db_stride,
>> +			    nvmeq->dbbuf.cq_db, nvmeq->dbbuf.cq_ei);
>> +}
>> +
>> +static inline void nvme_write_doorbell_sq(struct nvme_queue *nvmeq, u16 value)
>> +{
>> +	nvme_write_doorbell(value, nvmeq->q_db,
>> +			    nvmeq->dbbuf.sq_db, nvmeq->dbbuf.sq_ei);
>>  }
>
> I'd skip these wrappers entirely.

I added them to avoid future mistakes, such as mixing nvmeq->dbbuf.sq_db with
nvmeq->dbbuf.sq_ei. But I don't mind removing them either.

>
>> +	if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP) {
>> +		result = nvme_dbbuf_dma_alloc(dev);
>> +		if (result)
>> +			goto out;
>> +	}
>
> Should we really fail the init here or just print a warning?
>


* [PATCH v5 RFC] nvme: improve performance for virtual NVMe devices
  2017-03-27 16:04             ` Helen Koike
@ 2017-03-27 16:25               ` Helen Koike
  0 siblings, 0 replies; 49+ messages in thread
From: Helen Koike @ 2017-03-27 16:25 UTC (permalink / raw)


Sorry, the last email was a reply to Christoph (thank you all for reviewing)

On 2017-03-27 01:04 PM, Helen Koike wrote:
> Hi Keith,
>
> Thanks for your review. Please, see my comments below
>
> On 2017-03-27 06:49 AM, Christoph Hellwig wrote:
> >> +#define SQ_IDX(qid, stride)    ((qid) * 2 * (stride))
> >> +#define CQ_IDX(qid, stride)    (((qid) * 2 + 1) * (stride))
> >
> > Please use inline functions for these.
> >
> >> +    struct {
> >> +        u32 *dbs;
> >> +        u32 *eis;
> >> +        dma_addr_t dbs_dma_addr;
> >> +        dma_addr_t eis_dma_addr;
> >> +    } dbbuf;
> >
> > No need for a struct here, also please keep the field and its dma_addr_t
> > together.
> >
> >> +    struct {
> >> +        u32 *sq_db;
> >> +        u32 *cq_db;
> >> +        u32 *sq_ei;
> >> +        u32 *cq_ei;
> >> +    } dbbuf;
> >
> > No need for the struct here either.
> >
> >> +
> >> +static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
> >> +{
> >> +    unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);
> >> +
> >> +    dev->dbbuf.dbs = dma_alloc_coherent(dev->dev, mem_size,
> >> +                        &dev->dbbuf.dbs_dma_addr,
> >> +                        GFP_KERNEL);
> >> +    if (!dev->dbbuf.dbs)
> >> +        return -ENOMEM;
> >> +    dev->dbbuf.eis = dma_alloc_coherent(dev->dev, mem_size,
> >> +                        &dev->dbbuf.eis_dma_addr,
> >> +                        GFP_KERNEL);
> >> +    if (!dev->dbbuf.eis) {
> >> +        dma_free_coherent(dev->dev, mem_size,
> >> +                  dev->dbbuf.dbs, dev->dbbuf.dbs_dma_addr);
> >> +        dev->dbbuf.dbs = NULL;
> >> +        return -ENOMEM;
> >> +    }
> >> +
> >> +    return 0;
> >
> > Please use normal kernel-style goto unwinding.
>
> I don't think it is necessary, as there is only a single item to unwind.
> From my experience in other parts of the kernel code, we usually use goto
> when there are several steps to unwind. But I don't mind changing
> it if you still prefer a goto here.
>
> >
> >> +static void nvme_dbbuf_set(struct nvme_dev *dev)
> >> +{
> >> +    struct nvme_command c;
> >> +
> >> +    if (!dev->dbbuf.dbs)
> >> +        return;
> >> +
> >> +    memset(&c, 0, sizeof(c));
> >> +    c.dbbuf.opcode = nvme_admin_dbbuf;
> >> +    c.dbbuf.prp1 = cpu_to_le64(dev->dbbuf.dbs_dma_addr);
> >> +    c.dbbuf.prp2 = cpu_to_le64(dev->dbbuf.eis_dma_addr);
> >> +
> >> +    if (nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0))
> >> +        /* Free memory and continue on */
> >> +        nvme_dbbuf_dma_free(dev);
> >
> > At least log a warning.
> >
> >> +static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx,
> >> u16 old)
> >> +{
> >> +    /* Borrowed from vring_need_event */
> >
> > I don't think this comment matters.
> >
> >> +static void nvme_write_doorbell(u16 value,
> >> +                u32 __iomem *db,
> >> +                u32 *dbbuf_db,
> >> +                volatile u32 *dbbuf_ei)
> >> +{
> >
> > Very odd formatting.  Why not:
> >
> > static void nvme_write_doorbell(u16 value, u32 __iomem *db, u32
> > *dbbuf_db,
> >         volatile u32 *dbbuf_ei)
> >
> > ?
> >
> >> +    u16 old_value;
> >> +
> >> +    if (!dbbuf_db) {
> >> +        writel(value, db);
> >> +        return;
> >> +    }
> >
> > I'd prefer to keep this in the ultimate callers to make the flow
> > easier to read.
> >
> >> +static inline void nvme_write_doorbell_cq(struct nvme_queue *nvmeq,
> >> u16 value)
> >> +{
> >> +    nvme_write_doorbell(value, nvmeq->q_db + nvmeq->dev->db_stride,
> >> +                nvmeq->dbbuf.cq_db, nvmeq->dbbuf.cq_ei);
> >> +}
> >> +
> >> +static inline void nvme_write_doorbell_sq(struct nvme_queue *nvmeq,
> >> u16 value)
> >> +{
> >> +    nvme_write_doorbell(value, nvmeq->q_db,
> >> +                nvmeq->dbbuf.sq_db, nvmeq->dbbuf.sq_ei);
> >>  }
> >
> > I'd skip these wrappers entirely.
>
> I added them to avoid future mistakes, such as mixing nvmeq->dbbuf.sq_db with
> nvmeq->dbbuf.sq_ei. But I don't mind removing them either.
>
> >
> >> +    if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP) {
> >> +        result = nvme_dbbuf_dma_alloc(dev);
> >> +        if (result)
> >> +            goto out;
> >> +    }
> >
> > Should we really fail the init here or just print a warning?
> >


* [PATCH v6 RFC] nvme: improve performance for virtual NVMe devices
  2017-03-24  4:23         ` [PATCH v5 " Helen Koike
  2017-03-27  9:49           ` Christoph Hellwig
  2017-03-27 14:43           ` Keith Busch
@ 2017-03-27 16:50           ` Helen Koike
  2017-03-30 17:33             ` Keith Busch
  2 siblings, 1 reply; 49+ messages in thread
From: Helen Koike @ 2017-03-27 16:50 UTC (permalink / raw)


From: Helen Koike <helen.koike@collabora.co.uk>

This change provides a mechanism to reduce the number of MMIO doorbell
writes for the NVMe driver. When running in a virtualized environment
like QEMU, the cost of an MMIO is quite hefty here. The main idea of
the patch is to provide the device two memory locations:
 1) to store the doorbell values so they can be looked up without the
    doorbell MMIO write
 2) to store an event index.
I believe the doorbell value is obvious, the event index not so much.
Similar to the virtio specification, the virtual device can tell the
driver (guest OS) not to write MMIO unless it is writing past this
value.

FYI: doorbell values are written by the nvme driver (guest OS) and the
event index is written by the virtual device (host OS).

The patch implements a new admin command that will communicate where
these two memory locations reside. If the command fails, the nvme
driver will work as before, without any optimizations.

Contributions:
  Eric Northup <digitaleric at google.com>
  Frank Swiderski <fes at google.com>
  Ted Tso <tytso at mit.edu>
  Keith Busch <keith.busch at intel.com>

Just to give an idea of the performance boost with the vendor
extension: running fio [1], with a stock NVMe driver I get about 200K read
IOPs; with my vendor patch I get about 1000K read IOPs. This was
run with a null device, i.e. the backing device simply returned
success on every read IO request.

[1] Running on a 4-core machine:
  fio --time_based --name=benchmark --runtime=30
  --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32
  --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
  --rw=randread --blocksize=4k --randrepeat=false

Signed-off-by: Rob Nelson <rlnelson at google.com>
[mlin: port for upstream]
Signed-off-by: Ming Lin <mlin at kernel.org>
[koike: updated for upstream]
Signed-off-by: Helen Koike <helen.koike at collabora.co.uk>

---

This patch is based on git://git.infradead.org/nvme.git master
Tested through Google Cloud Engine
The TPAR is ratified by the NVMe working group

Changes since v5:
	- remove anonymous dbbuf struct in struct nvme_dev and struct nvme_queue
	- use inline functions for sq_idx and cq_idx instead of macros
	- add a warning when nvme_dbbuf_set fails
	- remove comment "Borrowed from vring_need_event"
	- change formatting of the parameters in nvme_write_doorbell
	- don't fail nvme_reset_work if nvme_dbbuf_dma_alloc fails
	- remove nvme_write_doorbell_{sq,cq} wrappers
	- in nvme_write_doorbell(), call writel last

Changes since v4
	- Remove CONFIG_NVME_DBBUF
	- Remove dbbuf.{c,h}, move the code to pci.c
	- Coding style changes
	- Functions and variables renamed

Changes since v3
	- Rename to dbbuf (to be closer to the latest TPAR)
	- Modify the opcodes according to the latest TPAR
	- Check if the device supports this feature through the OACS bit
	- Little cleanups
---
 drivers/nvme/host/pci.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/nvme.h    |  13 +++++
 2 files changed, 148 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5b7bd9c..00edeca 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -103,8 +103,22 @@ struct nvme_dev {
 	u32 cmbloc;
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
+	u32 *dbbuf_dbs;
+	dma_addr_t dbbuf_dbs_dma_addr;
+	u32 *dbbuf_eis;
+	dma_addr_t dbbuf_eis_dma_addr;
 };
 
+static inline unsigned int sq_idx(unsigned int qid, u32 stride)
+{
+	return qid * 2 * stride;
+}
+
+static inline unsigned int cq_idx(unsigned int qid, u32 stride)
+{
+	return (qid * 2 + 1) * stride;
+}
+
 static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
 {
 	return container_of(ctrl, struct nvme_dev, ctrl);
@@ -133,6 +147,10 @@ struct nvme_queue {
 	u16 qid;
 	u8 cq_phase;
 	u8 cqe_seen;
+	u32 *dbbuf_sq_db;
+	u32 *dbbuf_cq_db;
+	u32 *dbbuf_sq_ei;
+	u32 *dbbuf_cq_ei;
 };
 
 /*
@@ -174,6 +192,107 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_id_ns) != 4096);
 	BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
+	BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
+}
+
+static inline unsigned int nvme_dbbuf_size(u32 stride)
+{
+	return ((num_possible_cpus() + 1) * 8 * stride);
+}
+
+static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
+{
+	unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);
+
+	dev->dbbuf_dbs = dma_alloc_coherent(dev->dev, mem_size,
+					    &dev->dbbuf_dbs_dma_addr,
+					    GFP_KERNEL);
+	if (!dev->dbbuf_dbs)
+		return -ENOMEM;
+	dev->dbbuf_eis = dma_alloc_coherent(dev->dev, mem_size,
+					    &dev->dbbuf_eis_dma_addr,
+					    GFP_KERNEL);
+	if (!dev->dbbuf_eis) {
+		dma_free_coherent(dev->dev, mem_size,
+				  dev->dbbuf_dbs, dev->dbbuf_dbs_dma_addr);
+		dev->dbbuf_dbs = NULL;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void nvme_dbbuf_dma_free(struct nvme_dev *dev)
+{
+	unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);
+
+	if (dev->dbbuf_dbs) {
+		dma_free_coherent(dev->dev, mem_size,
+				  dev->dbbuf_dbs, dev->dbbuf_dbs_dma_addr);
+		dev->dbbuf_dbs = NULL;
+	}
+	if (dev->dbbuf_eis) {
+		dma_free_coherent(dev->dev, mem_size,
+				  dev->dbbuf_eis, dev->dbbuf_eis_dma_addr);
+		dev->dbbuf_eis = NULL;
+	}
+}
+
+static void nvme_dbbuf_init(struct nvme_dev *dev,
+			    struct nvme_queue *nvmeq, int qid)
+{
+	if (!dev->dbbuf_dbs || !qid)
+		return;
+
+	nvmeq->dbbuf_sq_db = &dev->dbbuf_dbs[sq_idx(qid, dev->db_stride)];
+	nvmeq->dbbuf_cq_db = &dev->dbbuf_dbs[cq_idx(qid, dev->db_stride)];
+	nvmeq->dbbuf_sq_ei = &dev->dbbuf_eis[sq_idx(qid, dev->db_stride)];
+	nvmeq->dbbuf_cq_ei = &dev->dbbuf_eis[cq_idx(qid, dev->db_stride)];
+}
+
+static void nvme_dbbuf_set(struct nvme_dev *dev)
+{
+	struct nvme_command c;
+
+	if (!dev->dbbuf_dbs)
+		return;
+
+	memset(&c, 0, sizeof(c));
+	c.dbbuf.opcode = nvme_admin_dbbuf;
+	c.dbbuf.prp1 = cpu_to_le64(dev->dbbuf_dbs_dma_addr);
+	c.dbbuf.prp2 = cpu_to_le64(dev->dbbuf_eis_dma_addr);
+
+	if (nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0)) {
+		dev_warn(dev->dev, "unable to set dbbuf\n");
+		/* Free memory and continue on */
+		nvme_dbbuf_dma_free(dev);
+	}
+}
+
+static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
+{
+	return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
+}
+
+static void nvme_write_doorbell(u16 value, u32 __iomem *db, u32 *dbbuf_db,
+				volatile u32 *dbbuf_ei)
+{
+	if (dbbuf_db) {
+		u16 old_value;
+
+		/*
+		 * Ensure that the queue is written before updating
+		 * the doorbell in memory
+		 */
+		wmb();
+
+		old_value = *dbbuf_db;
+		*dbbuf_db = value;
+		if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
+			return;
+	}
+
+	writel(value, db);
 }
 
 /*
@@ -300,7 +419,8 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
-	writel(tail, nvmeq->q_db);
+	nvme_write_doorbell(tail, nvmeq->q_db, nvmeq->dbbuf_sq_db,
+			    nvmeq->dbbuf_sq_ei);
 	nvmeq->sq_tail = tail;
 }
 
@@ -716,7 +836,8 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 		return;
 
 	if (likely(nvmeq->cq_vector >= 0))
-		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
+		nvme_write_doorbell(head, nvmeq->q_db + nvmeq->dev->db_stride,
+				    nvmeq->dbbuf_cq_db, nvmeq->dbbuf_cq_ei);
 	nvmeq->cq_head = head;
 	nvmeq->cq_phase = phase;
 
@@ -1066,6 +1187,7 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	nvmeq->q_depth = depth;
 	nvmeq->qid = qid;
 	nvmeq->cq_vector = -1;
+	nvme_dbbuf_init(dev, nvmeq, qid);
 	dev->queues[qid] = nvmeq;
 	dev->queue_count++;
 
@@ -1099,6 +1221,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
 	memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
+	nvme_dbbuf_init(dev, nvmeq, qid);
 	dev->online_queues++;
 	spin_unlock_irq(&nvmeq->q_lock);
 }
@@ -1568,6 +1691,8 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		if (blk_mq_alloc_tag_set(&dev->tagset))
 			return 0;
 		dev->ctrl.tagset = &dev->tagset;
+
+		nvme_dbbuf_set(dev);
 	} else {
 		blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
 
@@ -1700,6 +1825,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		nvme_disable_admin_queue(dev, shutdown);
 	}
 	nvme_pci_disable(dev);
+	nvme_dbbuf_dma_free(dev);
 
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
@@ -1800,6 +1926,13 @@ static void nvme_reset_work(struct work_struct *work)
 		dev->ctrl.opal_dev = NULL;
 	}
 
+	if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP) {
+		result = nvme_dbbuf_dma_alloc(dev);
+		if (result)
+			dev_warn(dev->dev,
+				 "unable to allocate dma for dbbuf\n");
+	}
+
 	result = nvme_setup_io_queues(dev);
 	if (result)
 		goto out;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index c43d435..43a6289 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
+	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 7,
 };
 
 struct nvme_lbaf {
@@ -603,6 +604,7 @@ enum nvme_admin_opcode {
 	nvme_admin_download_fw		= 0x11,
 	nvme_admin_ns_attach		= 0x15,
 	nvme_admin_keep_alive		= 0x18,
+	nvme_admin_dbbuf		= 0x7C,
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
 	nvme_admin_security_recv	= 0x82,
@@ -874,6 +876,16 @@ struct nvmf_property_get_command {
 	__u8		resv4[16];
 };
 
+struct nvme_dbbuf {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[5];
+	__le64			prp1;
+	__le64			prp2;
+	__u32			rsvd12[6];
+};
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -893,6 +905,7 @@ struct nvme_command {
 		struct nvmf_connect_command connect;
 		struct nvmf_property_set_command prop_set;
 		struct nvmf_property_get_command prop_get;
+		struct nvme_dbbuf dbbuf;
 	};
 };
 
-- 
2.7.4


* [PATCH v6 RFC] nvme: improve performance for virtual NVMe devices
  2017-03-27 16:50           ` [PATCH v6 " Helen Koike
@ 2017-03-30 17:33             ` Keith Busch
  2017-03-30 17:46               ` [PATCH v7 " Helen Koike
  0 siblings, 1 reply; 49+ messages in thread
From: Keith Busch @ 2017-03-30 17:33 UTC (permalink / raw)


On Mon, Mar 27, 2017 at 01:50:08PM -0300, Helen Koike wrote:
> changes since v5:
> 	- remove anonymous dbbuf struct in struct nvme_dev and struct nvme_queue
> 	- use inline functions for sq_idx and cq_idx instead of macros
> 	- add a warning when nvme_dbbuf_set fails
> 	- remove comment "Borrowed from vring_need_event"
> 	- change formatting of the parameters in nvme_write_doorbell
> 	- don't fail nvme_reset_work if nvme_dbbuf_dma_alloc fails
> 	- remove nvme_write_doorbell_{sq,cq} wrappers
> 	- in nvme_write_doorbell(), call writel last

I'm pretty much okay with this. The only thing I'd change is to call
nvme_dbbuf_dma_free only when we're unbinding so that we don't have to
reallocate it on every controller reset:
 
> +static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
> +{
> +	unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);

	if (dev->dbbuf_dbs)
		return 0;

> +
> +	dev->dbbuf_dbs = dma_alloc_coherent(dev->dev, mem_size,
> +					    &dev->dbbuf_dbs_dma_addr,
> +					    GFP_KERNEL);
> +	if (!dev->dbbuf_dbs)
> +		return -ENOMEM;
> +	dev->dbbuf_eis = dma_alloc_coherent(dev->dev, mem_size,
> +					    &dev->dbbuf_eis_dma_addr,
> +					    GFP_KERNEL);
> +	if (!dev->dbbuf_eis) {
> +		dma_free_coherent(dev->dev, mem_size,
> +				  dev->dbbuf_dbs, dev->dbbuf_dbs_dma_addr);
> +		dev->dbbuf_dbs = NULL;
> +		return -ENOMEM;
> +	}
> +
> +	return 0;
> +}

> @@ -1066,6 +1187,7 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
>  	nvmeq->q_depth = depth;
>  	nvmeq->qid = qid;
>  	nvmeq->cq_vector = -1;
> +	nvme_dbbuf_init(dev, nvmeq, qid);
>  	dev->queues[qid] = nvmeq;
>  	dev->queue_count++;

Remove the above since it's duplicated in nvme_init_queue.

> @@ -1700,6 +1825,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  		nvme_disable_admin_queue(dev, shutdown);
>  	}
>  	nvme_pci_disable(dev);
> +	nvme_dbbuf_dma_free(dev);

And move this to nvme_pci_free_ctrl().


* [PATCH v7 RFC] nvme: improve performance for virtual NVMe devices
  2017-03-30 17:33             ` Keith Busch
@ 2017-03-30 17:46               ` Helen Koike
  2017-03-31  7:01                 ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Helen Koike @ 2017-03-30 17:46 UTC (permalink / raw)


This change provides a mechanism to reduce the number of MMIO doorbell
writes for the NVMe driver. When running in a virtualized environment
like QEMU, the cost of an MMIO is quite hefty here. The main idea of
the patch is to provide the device two memory locations:
 1) to store the doorbell values so they can be looked up without the
    doorbell MMIO write
 2) to store an event index.
I believe the doorbell value is obvious, the event index not so much.
Similar to the virtio specification, the virtual device can tell the
driver (guest OS) not to write MMIO unless it is writing past this
value.

FYI: doorbell values are written by the nvme driver (guest OS) and the
event index is written by the virtual device (host OS).

The patch implements a new admin command that will communicate where
these two memory locations reside. If the command fails, the nvme
driver will work as before, without any optimizations.

Contributions:
  Eric Northup <digitaleric at google.com>
  Frank Swiderski <fes at google.com>
  Ted Tso <tytso at mit.edu>
  Keith Busch <keith.busch at intel.com>

Just to give an idea of the performance boost with the vendor
extension: running fio [1], with a stock NVMe driver I get about 200K read
IOPs; with my vendor patch I get about 1000K read IOPs. This was
run with a null device, i.e. the backing device simply returned
success on every read IO request.

[1] Running on a 4-core machine:
  fio --time_based --name=benchmark --runtime=30
  --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32
  --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
  --rw=randread --blocksize=4k --randrepeat=false

Signed-off-by: Rob Nelson <rlnelson at google.com>
[mlin: port for upstream]
Signed-off-by: Ming Lin <mlin at kernel.org>
[koike: updated for upstream]
Signed-off-by: Helen Koike <helen.koike at collabora.com>

---

This patch is based on git://git.infradead.org/nvme.git master
Tested through Google Cloud Engine
The TPAR is ratified by the NVMe working group

Changes since v6:
	- remove nvme_dbbuf_init() from nvme_alloc_queue() as it is
	already called in nvme_init_queue()
	- free dbbuf_dbs only in nvme_pci_free_ctrl(), change
	nvme_dbbuf_dma_alloc() to do nothing if the dbbuf_dbs is already
	allocated

Changes since v5:
	- remove anonymous dbbuf struct in struct nvme_dev and struct nvme_queue
	- use inline functions for sq_idx and cq_idx instead of macros
	- add a warning when nvme_dbbuf_set fails
	- remove comment "Borrowed from vring_need_event"
	- change formatting of the parameters in nvme_write_doorbell
	- don't fail nvme_reset_work if nvme_dbbuf_dma_alloc fails
	- remove nvme_write_doorbell_{sq,cq} wrappers
	- in nvme_write_doorbell(), call writel last

Changes since v4
	- Remove CONFIG_NVME_DBBUF
	- Remove dbbuf.{c,h}, move the code to pci.c
	- Coding style changes
	- Functions and variables renamed

Changes since v3
	- Rename to dbbuf (to be closer to the latest TPAR)
	- Modify the opcodes according to the latest TPAR
	- Check if the device supports this feature through the OACS bit
	- Little cleanups
---
 drivers/nvme/host/pci.c | 139 +++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/nvme.h    |  13 +++++
 2 files changed, 150 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5b7bd9c..3ce705a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -103,8 +103,22 @@ struct nvme_dev {
 	u32 cmbloc;
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
+	u32 *dbbuf_dbs;
+	dma_addr_t dbbuf_dbs_dma_addr;
+	u32 *dbbuf_eis;
+	dma_addr_t dbbuf_eis_dma_addr;
 };
 
+static inline unsigned int sq_idx(unsigned int qid, u32 stride)
+{
+	return qid * 2 * stride;
+}
+
+static inline unsigned int cq_idx(unsigned int qid, u32 stride)
+{
+	return (qid * 2 + 1) * stride;
+}
+
 static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
 {
 	return container_of(ctrl, struct nvme_dev, ctrl);
@@ -133,6 +147,10 @@ struct nvme_queue {
 	u16 qid;
 	u8 cq_phase;
 	u8 cqe_seen;
+	u32 *dbbuf_sq_db;
+	u32 *dbbuf_cq_db;
+	u32 *dbbuf_sq_ei;
+	u32 *dbbuf_cq_ei;
 };
 
 /*
@@ -174,6 +192,110 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_id_ns) != 4096);
 	BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
+	BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
+}
+
+static inline unsigned int nvme_dbbuf_size(u32 stride)
+{
+	return ((num_possible_cpus() + 1) * 8 * stride);
+}
+
+static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
+{
+	unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);
+
+	if (dev->dbbuf_dbs)
+		return 0;
+
+	dev->dbbuf_dbs = dma_alloc_coherent(dev->dev, mem_size,
+					    &dev->dbbuf_dbs_dma_addr,
+					    GFP_KERNEL);
+	if (!dev->dbbuf_dbs)
+		return -ENOMEM;
+	dev->dbbuf_eis = dma_alloc_coherent(dev->dev, mem_size,
+					    &dev->dbbuf_eis_dma_addr,
+					    GFP_KERNEL);
+	if (!dev->dbbuf_eis) {
+		dma_free_coherent(dev->dev, mem_size,
+				  dev->dbbuf_dbs, dev->dbbuf_dbs_dma_addr);
+		dev->dbbuf_dbs = NULL;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void nvme_dbbuf_dma_free(struct nvme_dev *dev)
+{
+	unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);
+
+	if (dev->dbbuf_dbs) {
+		dma_free_coherent(dev->dev, mem_size,
+				  dev->dbbuf_dbs, dev->dbbuf_dbs_dma_addr);
+		dev->dbbuf_dbs = NULL;
+	}
+	if (dev->dbbuf_eis) {
+		dma_free_coherent(dev->dev, mem_size,
+				  dev->dbbuf_eis, dev->dbbuf_eis_dma_addr);
+		dev->dbbuf_eis = NULL;
+	}
+}
+
+static void nvme_dbbuf_init(struct nvme_dev *dev,
+			    struct nvme_queue *nvmeq, int qid)
+{
+	if (!dev->dbbuf_dbs || !qid)
+		return;
+
+	nvmeq->dbbuf_sq_db = &dev->dbbuf_dbs[sq_idx(qid, dev->db_stride)];
+	nvmeq->dbbuf_cq_db = &dev->dbbuf_dbs[cq_idx(qid, dev->db_stride)];
+	nvmeq->dbbuf_sq_ei = &dev->dbbuf_eis[sq_idx(qid, dev->db_stride)];
+	nvmeq->dbbuf_cq_ei = &dev->dbbuf_eis[cq_idx(qid, dev->db_stride)];
+}
+
+static void nvme_dbbuf_set(struct nvme_dev *dev)
+{
+	struct nvme_command c;
+
+	if (!dev->dbbuf_dbs)
+		return;
+
+	memset(&c, 0, sizeof(c));
+	c.dbbuf.opcode = nvme_admin_dbbuf;
+	c.dbbuf.prp1 = cpu_to_le64(dev->dbbuf_dbs_dma_addr);
+	c.dbbuf.prp2 = cpu_to_le64(dev->dbbuf_eis_dma_addr);
+
+	if (nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0)) {
+		dev_warn(dev->dev, "unable to set dbbuf\n");
+		/* Free memory and continue on */
+		nvme_dbbuf_dma_free(dev);
+	}
+}
+
+static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
+{
+	return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
+}
+
+static void nvme_write_doorbell(u16 value, u32 __iomem *db, u32 *dbbuf_db,
+				volatile u32 *dbbuf_ei)
+{
+	if (dbbuf_db) {
+		u16 old_value;
+
+		/*
+		 * Ensure that the queue is written before updating
+		 * the doorbell in memory
+		 */
+		wmb();
+
+		old_value = *dbbuf_db;
+		*dbbuf_db = value;
+		if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
+			return;
+	}
+
+	writel(value, db);
 }
 
 /*
@@ -300,7 +422,8 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
-	writel(tail, nvmeq->q_db);
+	nvme_write_doorbell(tail, nvmeq->q_db, nvmeq->dbbuf_sq_db,
+			    nvmeq->dbbuf_sq_ei);
 	nvmeq->sq_tail = tail;
 }
 
@@ -716,7 +839,8 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 		return;
 
 	if (likely(nvmeq->cq_vector >= 0))
-		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
+		nvme_write_doorbell(head, nvmeq->q_db + nvmeq->dev->db_stride,
+				    nvmeq->dbbuf_cq_db, nvmeq->dbbuf_cq_ei);
 	nvmeq->cq_head = head;
 	nvmeq->cq_phase = phase;
 
@@ -1099,6 +1223,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
 	memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
+	nvme_dbbuf_init(dev, nvmeq, qid);
 	dev->online_queues++;
 	spin_unlock_irq(&nvmeq->q_lock);
 }
@@ -1568,6 +1693,8 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		if (blk_mq_alloc_tag_set(&dev->tagset))
 			return 0;
 		dev->ctrl.tagset = &dev->tagset;
+
+		nvme_dbbuf_set(dev);
 	} else {
 		blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
 
@@ -1733,6 +1860,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 {
 	struct nvme_dev *dev = to_nvme_dev(ctrl);
 
+	nvme_dbbuf_dma_free(dev);
 	put_device(dev->dev);
 	if (dev->tagset.tags)
 		blk_mq_free_tag_set(&dev->tagset);
@@ -1800,6 +1928,13 @@ static void nvme_reset_work(struct work_struct *work)
 		dev->ctrl.opal_dev = NULL;
 	}
 
+	if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP) {
+		result = nvme_dbbuf_dma_alloc(dev);
+		if (result)
+			dev_warn(dev->dev,
+				 "unable to allocate dma for dbbuf\n");
+	}
+
 	result = nvme_setup_io_queues(dev);
 	if (result)
 		goto out;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index c43d435..43a6289 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
+	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 7,
 };
 
 struct nvme_lbaf {
@@ -603,6 +604,7 @@ enum nvme_admin_opcode {
 	nvme_admin_download_fw		= 0x11,
 	nvme_admin_ns_attach		= 0x15,
 	nvme_admin_keep_alive		= 0x18,
+	nvme_admin_dbbuf		= 0x7C,
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
 	nvme_admin_security_recv	= 0x82,
@@ -874,6 +876,16 @@ struct nvmf_property_get_command {
 	__u8		resv4[16];
 };
 
+struct nvme_dbbuf {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[5];
+	__le64			prp1;
+	__le64			prp2;
+	__u32			rsvd12[6];
+};
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -893,6 +905,7 @@ struct nvme_command {
 		struct nvmf_connect_command connect;
 		struct nvmf_property_set_command prop_set;
 		struct nvmf_property_get_command prop_get;
+		struct nvme_dbbuf dbbuf;
 	};
 };
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v7 RFC] nvme: improve performance for virtual NVMe devices
  2017-03-30 17:46               ` [PATCH v7 " Helen Koike
@ 2017-03-31  7:01                 ` Christoph Hellwig
  2017-04-01 21:50                   ` Helen Koike
  2017-04-10 15:51                   ` [PATCH v8] " Helen Koike
  0 siblings, 2 replies; 49+ messages in thread
From: Christoph Hellwig @ 2017-03-31  7:01 UTC (permalink / raw)


I'd still prefer to not hide the MMIO path behind nvme_write_doorbell,
but except for that this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v7 RFC] nvme: improve performance for virtual NVMe devices
  2017-03-31  7:01                 ` Christoph Hellwig
@ 2017-04-01 21:50                   ` Helen Koike
  2017-04-10 15:31                     ` Helen Koike
  2017-04-10 15:37                     ` Christoph Hellwig
  2017-04-10 15:51                   ` [PATCH v8] " Helen Koike
  1 sibling, 2 replies; 49+ messages in thread
From: Helen Koike @ 2017-04-01 21:50 UTC (permalink / raw)



Hi Christoph,

Thanks for reviewing this patch

On 2017-03-31 04:01 AM, Christoph Hellwig wrote:
> I'd still prefer to not hide the MMIO path behind nvme_write_doorbell,
> but except for that this looks fine to me:
>

Do you mean wrapping the call to writel()? What do you think of a new
version with the changes below? It would no longer hide the call to
writel() inside nvme_write_doorbell.

Helen

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3ce705a..0e5f2a1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -277,8 +277,9 @@ static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
         return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
  }

-static void nvme_write_doorbell(u16 value, u32 __iomem *db, u32 *dbbuf_db,
-                               volatile u32 *dbbuf_ei)
+/* Update dbbuf and return true if a MMIO is required */
+static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
+                                             volatile u32 *dbbuf_ei)
  {
         if (dbbuf_db) {
                 u16 old_value;
@@ -291,11 +292,12 @@ static void nvme_write_doorbell(u16 value, u32 __iomem *db, u32 *dbbuf_db,

                 old_value = *dbbuf_db;
                 *dbbuf_db = value;
+
                 if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
-                       return;
+                       return false;
         }

-       writel(value, db);
+       return true;
  }

  /*
@@ -422,8 +424,9 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,

         if (++tail == nvmeq->q_depth)
                 tail = 0;
-       nvme_write_doorbell(tail, nvmeq->q_db, nvmeq->dbbuf_sq_db,
-                           nvmeq->dbbuf_sq_ei);
+       if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
+                                             nvmeq->dbbuf_sq_ei))
+               writel(tail, nvmeq->q_db);
         nvmeq->sq_tail = tail;
  }

@@ -839,8 +842,9 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
                 return;

         if (likely(nvmeq->cq_vector >= 0))
-               nvme_write_doorbell(head, nvmeq->q_db + nvmeq->dev->db_stride,
-                                   nvmeq->dbbuf_cq_db, nvmeq->dbbuf_cq_ei);
+               if (nvme_dbbuf_update_and_check_event(head, nvmeq->dbbuf_cq_db,
+                                                     nvmeq->dbbuf_cq_ei))
+                       writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
         nvmeq->cq_head = head;
         nvmeq->cq_phase = phase;

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v7 RFC] nvme: improve performance for virtual NVMe devices
  2017-04-01 21:50                   ` Helen Koike
@ 2017-04-10 15:31                     ` Helen Koike
  2017-04-10 15:37                     ` Christoph Hellwig
  1 sibling, 0 replies; 49+ messages in thread
From: Helen Koike @ 2017-04-10 15:31 UTC (permalink / raw)


ping

On 2017-04-01 06:50 PM, Helen Koike wrote:
> [snip]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v7 RFC] nvme: improve performance for virtual NVMe devices
  2017-04-01 21:50                   ` Helen Koike
  2017-04-10 15:31                     ` Helen Koike
@ 2017-04-10 15:37                     ` Christoph Hellwig
  1 sibling, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2017-04-10 15:37 UTC (permalink / raw)


On Sat, Apr 01, 2017 at 06:50:03PM -0300, Helen Koike wrote:
> 
> Hi Christoph,
> 
> Thanks for reviewing this patch
> 
> On 2017-03-31 04:01 AM, Christoph Hellwig wrote:
> > I'd still prefer to not hide the MMIO path behind nvme_write_doorbell,
> > but except for that this looks fine to me:
> > 
> 
> Do you mean wrapping the call to writel()? What do you think if I send a new
> version with the changes below? It wouldn't hide the call to writel()
> inside nvme_write_doorbell anymore.

Yes, that's the version I'd prefer.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v8] nvme: improve performance for virtual NVMe devices
  2017-03-31  7:01                 ` Christoph Hellwig
  2017-04-01 21:50                   ` Helen Koike
@ 2017-04-10 15:51                   ` Helen Koike
  2017-04-14 18:10                       ` Helen Koike
  1 sibling, 1 reply; 49+ messages in thread
From: Helen Koike @ 2017-04-10 15:51 UTC (permalink / raw)


From: Helen Koike <helen.koike@collabora.co.uk>

This change provides a mechanism to reduce the number of MMIO doorbell
writes for the NVMe driver. When running in a virtualized environment
like QEMU, the cost of an MMIO write is quite hefty. The main idea of
the patch is to provide the device with two memory locations:
 1) to store the doorbell values so they can be looked up without the
    doorbell MMIO write
 2) to store an event index.
I believe the doorbell value is obvious, the event index not so much.
Similar to the virtio specification, the virtual device can tell the
driver (guest OS) not to issue the MMIO write unless the driver is
writing past this value.

FYI: doorbell values are written by the nvme driver (guest OS) and the
event index is written by the virtual device (host OS).

The patch implements a new admin command that will communicate where
these two memory locations reside. If the command fails, the nvme
driver will work as before without any optimizations.
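
To make the event-index rule concrete, here is a small standalone
sketch (illustration only, not part of the patch): need_event() mirrors
nvme_dbbuf_need_event() from the diff below, and the sample doorbell
values are made up.

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* True iff event_idx lies in the u16 window (old, new_idx], i.e. the
   * device asked to be notified for a doorbell value just written. */
  static bool need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
  {
          return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
  }

  int main(void)
  {
          /* The driver moved the SQ tail from 10 to 15. */
          printf("%d\n", need_event(12, 15, 10));  /* 1: 12 in (10,15], MMIO needed */
          printf("%d\n", need_event(20, 15, 10));  /* 0: device already past 15, skip */
          printf("%d\n", need_event(3, 5, 65530)); /* 1: window wraps past 0xffff */
          return 0;
  }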

Contributions:
  Eric Northup <digitaleric@google.com>
  Frank Swiderski <fes@google.com>
  Ted Tso <tytso@mit.edu>
  Keith Busch <keith.busch@intel.com>

Just to give an idea of the performance boost with the vendor
extension: running fio [1], with a stock NVMe driver I get about 200K
read IOPs; with my vendor patch I get about 1000K read IOPs. This was
run against a null device, i.e. the backing device simply returned
success on every read IO request.

[1] Running on a 4 core machine:
  fio --time_based --name=benchmark --runtime=30
  --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32
  --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
  --rw=randread --blocksize=4k --randrepeat=false

Signed-off-by: Rob Nelson <rlnelson@google.com>
[mlin: port for upstream]
Signed-off-by: Ming Lin <mlin@kernel.org>
[koike: updated for upstream]
Signed-off-by: Helen Koike <helen.koike@collabora.co.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>

---

This patch is based on git://git.infradead.org/nvme.git master
Tested through Google Cloud Engine
The TPAR has been ratified by the NVMe working group

changes since v7:
	- remove nvme_write_doorbell(); add
	nvme_dbbuf_update_and_check_event() instead, which doesn't write
	to the doorbell and only returns true if the doorbell needs to
	be written
	- add reviewed-by tag from Christoph

changes since v6:
	- remove nvme_dbbuf_init() from nvme_alloc_queue() as it is
	already called in nvme_init_queue()
	- free dbbuf_dbs only in nvme_pci_free_ctrl(), change
	nvme_dbbuf_dma_alloc() to do nothing if dbbuf_dbs is already
	allocated

changes since v5:
	- remove anonymous dbbuf struct in struct nvme_dev and struct nvme_queue
	- use inline functions for sq_idx and cq_idx instead of macros
	- add a warning when nvme_dbbuf_set fails
	- remove comment "Borrowed from vring_need_event"
	- change formatting of the parameters in nvme_write_doorbell
	- don't fail nvme_reset_work if nvme_dbbuf_dma_alloc fails
	- remove nvme_write_doorbell_{sq,cq} wrappers
	- in nvme_write_doorbell(), call writel last

Changes since v4
	- Remove CONFIG_NVME_DBBUF
	- Remove dbbuf.{c,h}, move the code to pci.c
	- Coding style changes
	- Functions and variables renamed

Changes since v3
	- Rename to dbbuf (be closer to the latest TPAR)
	- Modify the opcodes according to the latest TPAR
	- Check if the device supports this feature through the OACS bit
	- little cleanups

---
 drivers/nvme/host/pci.c | 143 +++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/nvme.h    |  13 +++++
 2 files changed, 154 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5b7bd9c..0e5f2a1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -103,8 +103,22 @@ struct nvme_dev {
 	u32 cmbloc;
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
+	u32 *dbbuf_dbs;
+	dma_addr_t dbbuf_dbs_dma_addr;
+	u32 *dbbuf_eis;
+	dma_addr_t dbbuf_eis_dma_addr;
 };
 
+static inline unsigned int sq_idx(unsigned int qid, u32 stride)
+{
+	return qid * 2 * stride;
+}
+
+static inline unsigned int cq_idx(unsigned int qid, u32 stride)
+{
+	return (qid * 2 + 1) * stride;
+}
+
 static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
 {
 	return container_of(ctrl, struct nvme_dev, ctrl);
@@ -133,6 +147,10 @@ struct nvme_queue {
 	u16 qid;
 	u8 cq_phase;
 	u8 cqe_seen;
+	u32 *dbbuf_sq_db;
+	u32 *dbbuf_cq_db;
+	u32 *dbbuf_sq_ei;
+	u32 *dbbuf_cq_ei;
 };
 
 /*
@@ -174,6 +192,112 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_id_ns) != 4096);
 	BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
+	BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
+}
+
+static inline unsigned int nvme_dbbuf_size(u32 stride)
+{
+	return ((num_possible_cpus() + 1) * 8 * stride);
+}
+
+static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
+{
+	unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);
+
+	if (dev->dbbuf_dbs)
+		return 0;
+
+	dev->dbbuf_dbs = dma_alloc_coherent(dev->dev, mem_size,
+					    &dev->dbbuf_dbs_dma_addr,
+					    GFP_KERNEL);
+	if (!dev->dbbuf_dbs)
+		return -ENOMEM;
+	dev->dbbuf_eis = dma_alloc_coherent(dev->dev, mem_size,
+					    &dev->dbbuf_eis_dma_addr,
+					    GFP_KERNEL);
+	if (!dev->dbbuf_eis) {
+		dma_free_coherent(dev->dev, mem_size,
+				  dev->dbbuf_dbs, dev->dbbuf_dbs_dma_addr);
+		dev->dbbuf_dbs = NULL;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void nvme_dbbuf_dma_free(struct nvme_dev *dev)
+{
+	unsigned int mem_size = nvme_dbbuf_size(dev->db_stride);
+
+	if (dev->dbbuf_dbs) {
+		dma_free_coherent(dev->dev, mem_size,
+				  dev->dbbuf_dbs, dev->dbbuf_dbs_dma_addr);
+		dev->dbbuf_dbs = NULL;
+	}
+	if (dev->dbbuf_eis) {
+		dma_free_coherent(dev->dev, mem_size,
+				  dev->dbbuf_eis, dev->dbbuf_eis_dma_addr);
+		dev->dbbuf_eis = NULL;
+	}
+}
+
+static void nvme_dbbuf_init(struct nvme_dev *dev,
+			    struct nvme_queue *nvmeq, int qid)
+{
+	if (!dev->dbbuf_dbs || !qid)
+		return;
+
+	nvmeq->dbbuf_sq_db = &dev->dbbuf_dbs[sq_idx(qid, dev->db_stride)];
+	nvmeq->dbbuf_cq_db = &dev->dbbuf_dbs[cq_idx(qid, dev->db_stride)];
+	nvmeq->dbbuf_sq_ei = &dev->dbbuf_eis[sq_idx(qid, dev->db_stride)];
+	nvmeq->dbbuf_cq_ei = &dev->dbbuf_eis[cq_idx(qid, dev->db_stride)];
+}
+
+static void nvme_dbbuf_set(struct nvme_dev *dev)
+{
+	struct nvme_command c;
+
+	if (!dev->dbbuf_dbs)
+		return;
+
+	memset(&c, 0, sizeof(c));
+	c.dbbuf.opcode = nvme_admin_dbbuf;
+	c.dbbuf.prp1 = cpu_to_le64(dev->dbbuf_dbs_dma_addr);
+	c.dbbuf.prp2 = cpu_to_le64(dev->dbbuf_eis_dma_addr);
+
+	if (nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0)) {
+		dev_warn(dev->dev, "unable to set dbbuf\n");
+		/* Free memory and continue on */
+		nvme_dbbuf_dma_free(dev);
+	}
+}
+
+static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
+{
+	return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
+}
+
+/* Update dbbuf and return true if an MMIO is required */
+static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
+					      volatile u32 *dbbuf_ei)
+{
+	if (dbbuf_db) {
+		u16 old_value;
+
+		/*
+		 * Ensure that the queue is written before updating
+		 * the doorbell in memory
+		 */
+		wmb();
+
+		old_value = *dbbuf_db;
+		*dbbuf_db = value;
+
+		if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
+			return false;
+	}
+
+	return true;
 }
 
 /*
@@ -300,7 +424,9 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
-	writel(tail, nvmeq->q_db);
+	if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
+					      nvmeq->dbbuf_sq_ei))
+		writel(tail, nvmeq->q_db);
 	nvmeq->sq_tail = tail;
 }
 
@@ -716,7 +842,9 @@ static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 		return;
 
 	if (likely(nvmeq->cq_vector >= 0))
-		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
+		if (nvme_dbbuf_update_and_check_event(head, nvmeq->dbbuf_cq_db,
+						      nvmeq->dbbuf_cq_ei))
+			writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
 	nvmeq->cq_head = head;
 	nvmeq->cq_phase = phase;
 
@@ -1099,6 +1227,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
 	memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
+	nvme_dbbuf_init(dev, nvmeq, qid);
 	dev->online_queues++;
 	spin_unlock_irq(&nvmeq->q_lock);
 }
@@ -1568,6 +1697,8 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		if (blk_mq_alloc_tag_set(&dev->tagset))
 			return 0;
 		dev->ctrl.tagset = &dev->tagset;
+
+		nvme_dbbuf_set(dev);
 	} else {
 		blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
 
@@ -1733,6 +1864,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 {
 	struct nvme_dev *dev = to_nvme_dev(ctrl);
 
+	nvme_dbbuf_dma_free(dev);
 	put_device(dev->dev);
 	if (dev->tagset.tags)
 		blk_mq_free_tag_set(&dev->tagset);
@@ -1800,6 +1932,13 @@ static void nvme_reset_work(struct work_struct *work)
 		dev->ctrl.opal_dev = NULL;
 	}
 
+	if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP) {
+		result = nvme_dbbuf_dma_alloc(dev);
+		if (result)
+			dev_warn(dev->dev,
+				 "unable to allocate dma for dbbuf\n");
+	}
+
 	result = nvme_setup_io_queues(dev);
 	if (result)
 		goto out;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index c43d435..43a6289 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
+	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 7,
 };
 
 struct nvme_lbaf {
@@ -603,6 +604,7 @@ enum nvme_admin_opcode {
 	nvme_admin_download_fw		= 0x11,
 	nvme_admin_ns_attach		= 0x15,
 	nvme_admin_keep_alive		= 0x18,
+	nvme_admin_dbbuf		= 0x7C,
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
 	nvme_admin_security_recv	= 0x82,
@@ -874,6 +876,16 @@ struct nvmf_property_get_command {
 	__u8		resv4[16];
 };
 
+struct nvme_dbbuf {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[5];
+	__le64			prp1;
+	__le64			prp2;
+	__u32			rsvd12[6];
+};
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -893,6 +905,7 @@ struct nvme_command {
 		struct nvmf_connect_command connect;
 		struct nvmf_property_set_command prop_set;
 		struct nvmf_property_get_command prop_get;
+		struct nvme_dbbuf dbbuf;
 	};
 };
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v8] nvme: improve performance for virtual NVMe devices
  2017-04-10 15:51                   ` [PATCH v8] " Helen Koike
@ 2017-04-14 18:10                       ` Helen Koike
  0 siblings, 0 replies; 49+ messages in thread
From: Helen Koike @ 2017-04-14 18:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, fes, axboe, rlnelson,
	open list : NVM EXPRESS DRIVER, mikew, digitaleric, monish,
	james_p_freyensee@linux.intel.com, tytso, mlin, sagi,
	linux-kernel



On 2017-04-10 12:51 PM, Helen Koike wrote:
> [snip]

+ Add missing maintainers from scripts/get_maintainer.pl in the email thread

Hi,

I would like to know if it would be possible to get this patch for 
kernel 4.12.
Should I send a pull request? Or do you usually get the patch from the 
email thread?

Thanks,
Helen

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v8] nvme: improve performance for virtual NVMe devices
  2017-04-14 18:10                       ` Helen Koike
@ 2017-04-17 23:01                         ` Keith Busch
  -1 siblings, 0 replies; 49+ messages in thread
From: Keith Busch @ 2017-04-17 23:01 UTC (permalink / raw)
  To: Helen Koike
  Cc: Christoph Hellwig, fes, axboe, rlnelson,
	open list : NVM EXPRESS DRIVER, mikew, digitaleric, monish,
	james_p_freyensee@linux.intel.com, tytso, mlin, sagi,
	linux-kernel

On Fri, Apr 14, 2017 at 03:10:30PM -0300, Helen Koike wrote:
> + Add missing maintainers from scripts/get_maintainer.pl in the email thread
> 
> Hi,
> 
> I would like to know if it would be possible to get this patch for kernel
> 4.12.
> Should I send a pull request? Or do you usually get the patch from the email
> thread?

Hi Helen,

This looks good to me. I pulled this into a branch for the next merge
window here:

http://git.infradead.org/nvme.git/shortlog/refs/heads/for-next

There's only one other commit at the moment in addition to yours, so
I'll just need to make sure we've got everything we want for 4.12 before
submitting our next pull request.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v8] nvme: improve performance for virtual NVMe devices
  2017-04-17 23:01                         ` Keith Busch
@ 2017-04-17 23:20                           ` Helen Koike
  -1 siblings, 0 replies; 49+ messages in thread
From: Helen Koike @ 2017-04-17 23:20 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, fes, axboe, rlnelson,
	open list : NVM EXPRESS DRIVER, mikew, digitaleric, monish,
	james_p_freyensee@linux.intel.com, tytso, mlin, sagi,
	linux-kernel



On 2017-04-17 08:01 PM, Keith Busch wrote:
> On Fri, Apr 14, 2017 at 03:10:30PM -0300, Helen Koike wrote:
>> + Add missing maintainers from scripts/get_maintainer.pl in the email thread
>>
>> Hi,
>>
>> I would like to know if it would be possible to get this patch for kernel
>> 4.12.
>> Should I send a pull request? Or do you usually get the patch from the email
>> thread?
>
> Hi Helen,
>
> This looks good to me. I pulled this into a branch for the next merge
> window here:

Thanks :)

>
> http://git.infradead.org/nvme.git/shortlog/refs/heads/for-next
>
> There's only one other commit at the moment in addition to yours, so
> I'll just need to make sure we've got everything we want for 4.12 before
> submitting our next pull request.
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v8] nvme: improve performance for virtual NVMe devices
  2017-04-17 23:01                         ` Keith Busch
@ 2017-04-20 10:22                           ` Sagi Grimberg
  -1 siblings, 0 replies; 49+ messages in thread
From: Sagi Grimberg @ 2017-04-20 10:22 UTC (permalink / raw)
  To: Keith Busch, Helen Koike
  Cc: Christoph Hellwig, fes, axboe, rlnelson,
	open list : NVM EXPRESS DRIVER, mikew, digitaleric, monish,
	james_p_freyensee@linux.intel.com, tytso, mlin,
	linux-kernel

>> + Add missing maintainers from scripts/get_maintainer.pl in the email thread
>>
>> Hi,
>>
>> I would like to know if it would be possible to get this patch for kernel
>> 4.12.
>> Should I send a pull request? Or do you usually get the patch from the email
>> thread?
>
> Hi Helen,
>
> This looks good to me. I pulled this into a branch for the next merge
> window here:
>
> http://git.infradead.org/nvme.git/shortlog/refs/heads/for-next
>
> There's only one other commit at the moment in addition to yours, so
> I'll just need to make sure we've got everything we want for 4.12 before
> submitting our next pull request.

I'm fine with this too.

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2017-04-20 10:22 UTC | newest]

Thread overview: 49+ messages
2016-04-14 18:04 [PATCH v2 RFC] nvme: improve performance for virtual NVMe devices Helen Mae Koike Fornazier
2016-04-21 13:33 ` Helen Koike
2016-04-21 13:38   ` Christoph Hellwig
2016-04-21 18:11     ` Ming Lin
2016-04-27 15:26       ` Helen Koike
2016-05-03  8:05       ` Christoph Hellwig
2016-05-04 16:48         ` Helen Koike
2016-05-05 14:17           ` Christoph Hellwig
2016-05-11 16:50     ` Helen Koike
2016-05-11 17:43       ` Keith Busch
2016-05-12  4:05         ` Helen Koike
2016-05-12  7:05           ` Christoph Hellwig
     [not found]             ` <CALu-fzt8=kmzZDqKsRYE42Q+F+Zu3U_s6XErpd9izXdxG1cUMA@mail.gmail.com>
     [not found]               ` <CAPst7KAAymUqbNwzxNAczduF6iLXQAMm5-auEdXZ_zHwhHtbWQ@mail.gmail.com>
2016-05-12 17:29                 ` Mike Waychison
2016-05-05 15:15 ` Helen Koike
2016-05-05 15:24 ` Helen Koike
2016-08-16  1:41 ` [PATCH v3 RFC 0/2] Virtual NVMe device optimization Helen Koike
2016-08-16  1:41   ` Helen Koike
2016-08-16  1:41   ` [PATCH v3 RFC 1/2] PCI: Add Google device ID Helen Koike
2016-08-18 21:59     ` Bjorn Helgaas
2016-08-16  1:41   ` [PATCH v3 RFC 2/2] nvme: improve performance for virtual NVMe devices Helen Koike
2016-08-16  1:41     ` Helen Koike
2016-08-16 20:45     ` J Freyensee
2016-08-16 20:45       ` J Freyensee
2016-08-16 23:45       ` Keith Busch
2016-08-16 23:45         ` Keith Busch
2017-03-17 21:44     ` [PATCH v4 RFC] " Helen Koike
2017-03-17 22:28       ` Keith Busch
2017-03-17 22:26         ` Helen Koike
2017-03-24  4:23         ` [PATCH v5 " Helen Koike
2017-03-27  9:49           ` Christoph Hellwig
2017-03-27 16:04             ` Helen Koike
2017-03-27 16:25               ` Helen Koike
2017-03-27 14:43           ` Keith Busch
2017-03-27 16:50           ` [PATCH v6 " Helen Koike
2017-03-30 17:33             ` Keith Busch
2017-03-30 17:46               ` [PATCH v7 " Helen Koike
2017-03-31  7:01                 ` Christoph Hellwig
2017-04-01 21:50                   ` Helen Koike
2017-04-10 15:31                     ` Helen Koike
2017-04-10 15:37                     ` Christoph Hellwig
2017-04-10 15:51                   ` [PATCH v8] " Helen Koike
2017-04-14 18:10                     ` Helen Koike
2017-04-17 23:01                       ` Keith Busch
2017-04-17 23:20                         ` Helen Koike
2017-04-20 10:22                         ` Sagi Grimberg
