[PATCH 0/5] Adding blk-mq and DMA support to pmem block driver

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/5] Adding blk-mq and DMA support to pmem block driver
@ 2017-07-31 22:24 Dave Jiang
  2017-07-31 22:24 ` [PATCH 1/5] dmaengine: ioatdma: revert 7618d035 to allow sharing of DMA channels Dave Jiang
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Dave Jiang @ 2017-07-31 22:24 UTC (permalink / raw)
  To: vinod.koul, dan.j.williams; +Cc: dmaengine, linux-nvdimm

The following series implements adds blk-mq support to the pmem block driver
and also adds infrastructure code to ioatdma and dmaengine in order to
support copying to and from scatterlist in order to process block
requests provided by blk-mq. The usage of DMA engines available on certain
platforms allow us to drastically reduce CPU utilization and at the same time
maintain performance that is good enough. Experimentations have been done on
DRAM backed pmem block device that showed the utilization of DMA engine is
beneficial. User can revert back to original behavior by providing
queue_mode=0 to the nd_pmem kernel module if desired.

---

Dave Jiang (5):
      dmaengine: ioatdma: revert 7618d035 to allow sharing of DMA channels
      dmaengine: ioatdma: dma_prep_memcpy_to/from_sg support
      dmaengine: add SG support to dmaengine_unmap
      libnvdimm: Adding blk-mq support to the pmem driver
      libnvdimm: add DMA support for pmem blk-mq

 drivers/dma/dmaengine.c   |   45 +++++-
 drivers/dma/ioat/dma.h    |    8 +
 drivers/dma/ioat/init.c   |    6 -
 drivers/dma/ioat/prep.c   |  105 ++++++++++++++
 drivers/nvdimm/pmem.c     |  340 ++++++++++++++++++++++++++++++++++++++++++---
 drivers/nvdimm/pmem.h     |    3 
 include/linux/dmaengine.h |   14 ++
 7 files changed, 488 insertions(+), 33 deletions(-)

--
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/5] dmaengine: ioatdma: revert 7618d035 to allow sharing of DMA channels
  2017-07-31 22:24 [PATCH 0/5] Adding blk-mq and DMA support to pmem block driver Dave Jiang
@ 2017-07-31 22:24 ` Dave Jiang
  2017-07-31 22:24 ` [PATCH 2/5] dmaengine: ioatdma: dma_prep_memcpy_to/from_sg support Dave Jiang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Dave Jiang @ 2017-07-31 22:24 UTC (permalink / raw)
  To: vinod.koul, dan.j.williams; +Cc: dmaengine, linux-nvdimm

Commit 7618d0359c16 ("dmaengine: ioatdma: Set non RAID channels to be
private capable") makes all non-RAID ioatdma channels as private to be
requestable by dma_request_channel(). With PQ CAP support going away for
ioatdma, this would make all channels private. To support the usage of
ioatdma for blk-mq implementation of pmem we need as many channels we can
share in order to be high performing. Thus reverting the patch.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/ioat/init.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/dma/ioat/init.c b/drivers/dma/ioat/init.c
index 6ad4384..e437112 100644
--- a/drivers/dma/ioat/init.c
+++ b/drivers/dma/ioat/init.c
@@ -1163,9 +1163,6 @@ static int ioat3_dma_probe(struct ioatdma_device *ioat_dma, int dca)
 		}
 	}
 
-	if (!(ioat_dma->cap & (IOAT_CAP_XOR | IOAT_CAP_PQ)))
-		dma_cap_set(DMA_PRIVATE, dma->cap_mask);
-
 	err = ioat_probe(ioat_dma);
 	if (err)
 		return err;

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/5] dmaengine: ioatdma: dma_prep_memcpy_to/from_sg support
  2017-07-31 22:24 [PATCH 0/5] Adding blk-mq and DMA support to pmem block driver Dave Jiang
  2017-07-31 22:24 ` [PATCH 1/5] dmaengine: ioatdma: revert 7618d035 to allow sharing of DMA channels Dave Jiang
@ 2017-07-31 22:24 ` Dave Jiang
  2017-08-01  2:14   ` Dan Williams
  2017-07-31 22:24 ` [PATCH 3/5] dmaengine: add SG support to dmaengine_unmap Dave Jiang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Dave Jiang @ 2017-07-31 22:24 UTC (permalink / raw)
  To: vinod.koul, dan.j.williams; +Cc: dmaengine, linux-nvdimm

Adding ioatdma support to copy from a physically contiguos buffer to a
provided scatterlist and vice versa. This is used to support
reading/writing persistent memory in the pmem driver.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/ioat/dma.h    |    8 +++
 drivers/dma/ioat/init.c   |    3 +
 drivers/dma/ioat/prep.c   |  105 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dmaengine.h |   10 ++++
 4 files changed, 126 insertions(+)

diff --git a/drivers/dma/ioat/dma.h b/drivers/dma/ioat/dma.h
index 56200ee..b9a8d82 100644
--- a/drivers/dma/ioat/dma.h
+++ b/drivers/dma/ioat/dma.h
@@ -370,6 +370,14 @@ struct dma_async_tx_descriptor *
 ioat_dma_prep_memcpy_lock(struct dma_chan *c, dma_addr_t dma_dest,
 			   dma_addr_t dma_src, size_t len, unsigned long flags);
 struct dma_async_tx_descriptor *
+ioat_dma_prep_memcpy_to_sg_lock(struct dma_chan *c,
+		struct scatterlist *dst_sg, unsigned int dst_nents,
+		dma_addr_t src, unsigned long flags);
+struct dma_async_tx_descriptor *
+ioat_dma_prep_memcpy_from_sg_lock(struct dma_chan *c, dma_addr_t dst,
+		struct scatterlist *dst_sg, unsigned int dst_nents,
+		unsigned long flags);
+struct dma_async_tx_descriptor *
 ioat_prep_interrupt_lock(struct dma_chan *c, unsigned long flags);
 struct dma_async_tx_descriptor *
 ioat_prep_xor(struct dma_chan *chan, dma_addr_t dest, dma_addr_t *src,
diff --git a/drivers/dma/ioat/init.c b/drivers/dma/ioat/init.c
index e437112..1e03f3e 100644
--- a/drivers/dma/ioat/init.c
+++ b/drivers/dma/ioat/init.c
@@ -1091,6 +1091,9 @@ static int ioat3_dma_probe(struct ioatdma_device *ioat_dma, int dca)
 
 	dma = &ioat_dma->dma_dev;
 	dma->device_prep_dma_memcpy = ioat_dma_prep_memcpy_lock;
+	dma->device_prep_dma_memcpy_to_sg = ioat_dma_prep_memcpy_to_sg_lock;
+	dma->device_prep_dma_memcpy_from_sg =
+		ioat_dma_prep_memcpy_from_sg_lock;
 	dma->device_issue_pending = ioat_issue_pending;
 	dma->device_alloc_chan_resources = ioat_alloc_chan_resources;
 	dma->device_free_chan_resources = ioat_free_chan_resources;
diff --git a/drivers/dma/ioat/prep.c b/drivers/dma/ioat/prep.c
index 243421a..49e220c 100644
--- a/drivers/dma/ioat/prep.c
+++ b/drivers/dma/ioat/prep.c
@@ -159,6 +159,111 @@ ioat_dma_prep_memcpy_lock(struct dma_chan *c, dma_addr_t dma_dest,
 	return &desc->txd;
 }
 
+struct dma_async_tx_descriptor *
+ioat_dma_prep_memcpy_to_sg_lock(struct dma_chan *c,
+		struct scatterlist *dst_sg, unsigned int dst_nents,
+		dma_addr_t dma_src, unsigned long flags)
+{
+	struct ioatdma_chan *ioat_chan = to_ioat_chan(c);
+	struct ioat_dma_descriptor *hw = NULL;
+	struct ioat_ring_ent *desc = NULL;
+	dma_addr_t src = dma_src;
+	int num_descs, idx, i;
+	struct scatterlist *s;
+	size_t total_len = 0, len;
+
+
+	if (test_bit(IOAT_CHAN_DOWN, &ioat_chan->state))
+		return NULL;
+
+	/*
+	 * The upper layer will garantee that each entry does not exceed
+	 * xfercap.
+	 */
+	num_descs = dst_nents;
+
+	if (likely(num_descs) &&
+	    ioat_check_space_lock(ioat_chan, num_descs) == 0)
+		idx = ioat_chan->head;
+	else
+		return NULL;
+
+	for_each_sg(dst_sg, s, dst_nents, i) {
+		desc = ioat_get_ring_ent(ioat_chan, idx + i);
+		hw = desc->hw;
+		len = sg_dma_len(s);
+		hw->size = len;
+		hw->ctl = 0;
+		hw->src_addr = src;
+		hw->dst_addr = sg_dma_address(s);
+		src += len;
+		total_len += len;
+		dump_desc_dbg(ioat_chan, desc);
+	}
+
+	desc->txd.flags = flags;
+	desc->len = total_len;
+	hw->ctl_f.int_en = !!(flags & DMA_PREP_INTERRUPT);
+	hw->ctl_f.fence = !!(flags & DMA_PREP_FENCE);
+	hw->ctl_f.compl_write = 1;
+	dump_desc_dbg(ioat_chan, desc);
+	/* we leave the channel locked to ensure in order submission */
+
+	return &desc->txd;
+}
+
+struct dma_async_tx_descriptor *
+ioat_dma_prep_memcpy_from_sg_lock(struct dma_chan *c, dma_addr_t dma_dst,
+		struct scatterlist *src_sg, unsigned int src_nents,
+		unsigned long flags)
+{
+	struct ioatdma_chan *ioat_chan = to_ioat_chan(c);
+	struct ioat_dma_descriptor *hw = NULL;
+	struct ioat_ring_ent *desc = NULL;
+	dma_addr_t dst = dma_dst;
+	int num_descs, idx, i;
+	struct scatterlist *s;
+	size_t total_len = 0, len;
+
+
+	if (test_bit(IOAT_CHAN_DOWN, &ioat_chan->state))
+		return NULL;
+
+	/*
+	 * The upper layer will garantee that each entry does not exceed
+	 * xfercap.
+	 */
+	num_descs = src_nents;
+
+	if (likely(num_descs) &&
+	    ioat_check_space_lock(ioat_chan, num_descs) == 0)
+		idx = ioat_chan->head;
+	else
+		return NULL;
+
+	for_each_sg(src_sg, s, src_nents, i) {
+		desc = ioat_get_ring_ent(ioat_chan, idx + i);
+		hw = desc->hw;
+		len = sg_dma_len(s);
+		hw->size = len;
+		hw->ctl = 0;
+		hw->src_addr = sg_dma_address(s);
+		hw->dst_addr = dst;
+		dst += len;
+		total_len += len;
+		dump_desc_dbg(ioat_chan, desc);
+	}
+
+	desc->txd.flags = flags;
+	desc->len = total_len;
+	hw->ctl_f.int_en = !!(flags & DMA_PREP_INTERRUPT);
+	hw->ctl_f.fence = !!(flags & DMA_PREP_FENCE);
+	hw->ctl_f.compl_write = 1;
+	dump_desc_dbg(ioat_chan, desc);
+	/* we leave the channel locked to ensure in order submission */
+
+	return &desc->txd;
+}
 
 static struct dma_async_tx_descriptor *
 __ioat_prep_xor_lock(struct dma_chan *c, enum sum_check_flags *result,
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 5336808..6202e7c 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -694,6 +694,8 @@ struct dma_filter {
  * @device_prep_dma_memset: prepares a memset operation
  * @device_prep_dma_memset_sg: prepares a memset operation over a scatter list
  * @device_prep_dma_interrupt: prepares an end of chain interrupt operation
+ * @device_prep_dma_memcpy_to_sg: prepares memcpy from buffer to scatterlist
+ * @device_prep_dma_memcpy_from_sg: prepares memcpy from scatterlist to buffer
  * @device_prep_slave_sg: prepares a slave dma operation
  * @device_prep_dma_cyclic: prepare a cyclic dma operation suitable for audio.
  *	The function takes a buffer of size buf_len. The callback function will
@@ -776,6 +778,14 @@ struct dma_device {
 		struct scatterlist *dst_sg, unsigned int dst_nents,
 		struct scatterlist *src_sg, unsigned int src_nents,
 		unsigned long flags);
+	struct dma_async_tx_descriptor *(*device_prep_dma_memcpy_to_sg)(
+		struct dma_chan *chan,
+		struct scatterlist *dst_sg, unsigned int dst_nents,
+		dma_addr_t src, unsigned long flags);
+	struct dma_async_tx_descriptor *(*device_prep_dma_memcpy_from_sg)(
+		struct dma_chan *chan, dma_addr_t dst,
+		struct scatterlist *dst_sg, unsigned int dst_nents,
+		unsigned long flags);
 
 	struct dma_async_tx_descriptor *(*device_prep_slave_sg)(
 		struct dma_chan *chan, struct scatterlist *sgl,

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/5] dmaengine: add SG support to dmaengine_unmap
  2017-07-31 22:24 [PATCH 0/5] Adding blk-mq and DMA support to pmem block driver Dave Jiang
  2017-07-31 22:24 ` [PATCH 1/5] dmaengine: ioatdma: revert 7618d035 to allow sharing of DMA channels Dave Jiang
  2017-07-31 22:24 ` [PATCH 2/5] dmaengine: ioatdma: dma_prep_memcpy_to/from_sg support Dave Jiang
@ 2017-07-31 22:24 ` Dave Jiang
  2017-07-31 22:24 ` [PATCH 4/5] libnvdimm: Adding blk-mq support to the pmem driver Dave Jiang
  2017-07-31 22:24 ` [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq Dave Jiang
  4 siblings, 0 replies; 20+ messages in thread
From: Dave Jiang @ 2017-07-31 22:24 UTC (permalink / raw)
  To: vinod.koul, dan.j.williams; +Cc: dmaengine, linux-nvdimm

This should provide support to unmap scatterlist with the
dmaengine_unmap_data. We will support only 1 scatterlist per
direction. The DMA addresses array has been overloaded for the
2 or less entries DMA unmap data structure in order to store the
SG pointer(s).

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/dmaengine.c   |   45 ++++++++++++++++++++++++++++++++++++---------
 include/linux/dmaengine.h |    4 ++++
 2 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index d9118ec..81d10c97 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -1126,16 +1126,35 @@ static void dmaengine_unmap(struct kref *kref)
 {
 	struct dmaengine_unmap_data *unmap = container_of(kref, typeof(*unmap), kref);
 	struct device *dev = unmap->dev;
-	int cnt, i;
+	int cnt, i, sg_nents;
+	struct scatterlist *sg;
+
+	sg_nents = dma_unmap_data_sg_to_nents(unmap, unmap->map_cnt);
+	if (sg_nents) {
+		i = 0;
+		cnt = 1;
+		sg = (struct scatterlist *)unmap->addr[i];
+		dma_unmap_sg(dev, sg, sg_nents, DMA_TO_DEVICE);
+	} else {
+		cnt = unmap->to_cnt;
+		for (i = 0; i < cnt; i++)
+			dma_unmap_page(dev, unmap->addr[i], unmap->len,
+					DMA_TO_DEVICE);
+	}
+
+	sg_nents = dma_unmap_data_sg_from_nents(unmap, unmap->map_cnt);
+	if (sg_nents) {
+		sg = (struct scatterlist *)unmap->addr[i];
+		dma_unmap_sg(dev, sg, sg_nents, DMA_FROM_DEVICE);
+		cnt++;
+		i++;
+	} else {
+		cnt += unmap->from_cnt;
+		for (; i < cnt; i++)
+			dma_unmap_page(dev, unmap->addr[i], unmap->len,
+					DMA_FROM_DEVICE);
+	}
 
-	cnt = unmap->to_cnt;
-	for (i = 0; i < cnt; i++)
-		dma_unmap_page(dev, unmap->addr[i], unmap->len,
-			       DMA_TO_DEVICE);
-	cnt += unmap->from_cnt;
-	for (; i < cnt; i++)
-		dma_unmap_page(dev, unmap->addr[i], unmap->len,
-			       DMA_FROM_DEVICE);
 	cnt += unmap->bidi_cnt;
 	for (; i < cnt; i++) {
 		if (unmap->addr[i] == 0)
@@ -1179,6 +1198,10 @@ static int __init dmaengine_init_unmap_pool(void)
 		size = sizeof(struct dmaengine_unmap_data) +
 		       sizeof(dma_addr_t) * p->size;
 
+		/* add 2 more entries for SG nents overload */
+		if (i == 0)
+			size += sizeof(dma_addr_t) * 2;
+
 		p->cache = kmem_cache_create(p->name, size, 0,
 					     SLAB_HWCACHE_ALIGN, NULL);
 		if (!p->cache)
@@ -1205,6 +1228,10 @@ dmaengine_get_unmap_data(struct device *dev, int nr, gfp_t flags)
 		return NULL;
 
 	memset(unmap, 0, sizeof(*unmap));
+	/* clear the overloaded sg nents entries */
+	if (nr < 3)
+		memset(&unmap->addr[nr], 0,
+				DMA_UNMAP_SG_ENTS * sizeof(dma_addr_t));
 	kref_init(&unmap->kref);
 	unmap->dev = dev;
 	unmap->map_cnt = nr;
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 6202e7c..ec65b9b 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -475,6 +475,10 @@ struct dmaengine_unmap_data {
 	dma_addr_t addr[0];
 };
 
+#define DMA_UNMAP_SG_ENTS	2
+#define dma_unmap_data_sg_to_nents(x, n) x->addr[n]
+#define dma_unmap_data_sg_from_nents(x, n) x->addr[n+1]
+
 /**
  * struct dma_async_tx_descriptor - async transaction descriptor
  * ---dma generic offload fields---

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/5] libnvdimm: Adding blk-mq support to the pmem driver
  2017-07-31 22:24 [PATCH 0/5] Adding blk-mq and DMA support to pmem block driver Dave Jiang
                   ` (2 preceding siblings ...)
  2017-07-31 22:24 ` [PATCH 3/5] dmaengine: add SG support to dmaengine_unmap Dave Jiang
@ 2017-07-31 22:24 ` Dave Jiang
  2017-08-01 19:02   ` Ross Zwisler
  2017-07-31 22:24 ` [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq Dave Jiang
  4 siblings, 1 reply; 20+ messages in thread
From: Dave Jiang @ 2017-07-31 22:24 UTC (permalink / raw)
  To: vinod.koul, dan.j.williams; +Cc: dmaengine, linux-nvdimm

Adding blk-mq support to the pmem driver in addition to the direct bio
support. This allows for hardware offloading via DMA engines. By default
the bio method will be enabled. The blk-mq support can be turned on via
module parameter queue_mode=1.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/nvdimm/pmem.c |  131 +++++++++++++++++++++++++++++++++++++++++++++----
 drivers/nvdimm/pmem.h |    3 +
 2 files changed, 122 insertions(+), 12 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c544d46..347835b 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -31,10 +31,24 @@
 #include <linux/pmem.h>
 #include <linux/dax.h>
 #include <linux/nd.h>
+#include <linux/blk-mq.h>
 #include "pmem.h"
 #include "pfn.h"
 #include "nd.h"
 
+enum {
+	PMEM_Q_BIO = 0,
+	PMEM_Q_MQ = 1,
+};
+
+static int queue_mode = PMEM_Q_BIO;
+module_param(queue_mode, int, 0444);
+MODULE_PARM_DESC(queue_mode, "Pmem Queue Mode");
+
+struct pmem_cmd {
+	struct request *rq;
+};
+
 static struct device *to_dev(struct pmem_device *pmem)
 {
 	/*
@@ -239,9 +253,12 @@ static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
 };
 
-static void pmem_release_queue(void *q)
+static void pmem_release_queue(void *pmem)
 {
-	blk_cleanup_queue(q);
+
+	blk_cleanup_queue(((struct pmem_device *)pmem)->q);
+	if (queue_mode == PMEM_Q_MQ)
+		blk_mq_free_tag_set(&((struct pmem_device *)pmem)->tag_set);
 }
 
 static void pmem_freeze_queue(void *q)
@@ -259,6 +276,65 @@ static void pmem_release_disk(void *__pmem)
 	put_disk(pmem->disk);
 }
 
+static int pmem_handle_cmd(struct pmem_cmd *cmd)
+{
+	struct request *req = cmd->rq;
+	struct request_queue *q = req->q;
+	struct pmem_device *pmem = q->queuedata;
+	struct nd_region *nd_region = to_region(pmem);
+	struct bio_vec bvec;
+	struct req_iterator iter;
+	int rc = 0;
+
+	if (req->cmd_flags & REQ_FLUSH)
+		nvdimm_flush(nd_region);
+
+	rq_for_each_segment(bvec, req, iter) {
+		rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
+				bvec.bv_offset, op_is_write(req_op(req)),
+				iter.iter.bi_sector);
+		if (rc < 0)
+			break;
+	}
+
+	if (req->cmd_flags & REQ_FUA)
+		nvdimm_flush(nd_region);
+
+	blk_mq_end_request(cmd->rq, rc);
+
+	return 0;
+}
+
+static int pmem_queue_rq(struct blk_mq_hw_ctx *hctx,
+		const struct blk_mq_queue_data *bd)
+{
+	struct pmem_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+	int rc;
+
+	cmd->rq = bd->rq;
+
+	blk_mq_start_request(bd->rq);
+
+	rc = pmem_handle_cmd(cmd);
+	if (rc < 0)
+		rc = BLK_MQ_RQ_QUEUE_ERROR;
+	else
+		rc = BLK_MQ_RQ_QUEUE_OK;
+
+	return rc;
+}
+
+static int pmem_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+		unsigned int index)
+{
+	return 0;
+}
+
+static const struct blk_mq_ops pmem_mq_ops = {
+	.queue_rq	= pmem_queue_rq,
+	.init_hctx	= pmem_init_hctx,
+};
+
 static int pmem_attach_disk(struct device *dev,
 		struct nd_namespace_common *ndns)
 {
@@ -275,6 +351,7 @@ static int pmem_attach_disk(struct device *dev,
 	struct request_queue *q;
 	struct gendisk *disk;
 	void *addr;
+	int rc;
 
 	/* while nsio_rw_bytes is active, parse a pfn info block if present */
 	if (is_nd_pfn(dev)) {
@@ -303,17 +380,47 @@ static int pmem_attach_disk(struct device *dev,
 		return -EBUSY;
 	}
 
-	q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev));
-	if (!q)
-		return -ENOMEM;
+	if (queue_mode == PMEM_Q_MQ) {
+		pmem->tag_set.ops = &pmem_mq_ops;
+		pmem->tag_set.nr_hw_queues = nr_online_nodes;
+		pmem->tag_set.queue_depth = 64;
+		pmem->tag_set.numa_node = dev_to_node(dev);
+		pmem->tag_set.cmd_size = sizeof(struct pmem_cmd);
+		pmem->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+		pmem->tag_set.driver_data = pmem;
+
+		rc = blk_mq_alloc_tag_set(&pmem->tag_set);
+		if (rc < 0)
+			return rc;
+
+		pmem->q = blk_mq_init_queue(&pmem->tag_set);
+		if (IS_ERR(pmem->q)) {
+			blk_mq_free_tag_set(&pmem->tag_set);
+			return -ENOMEM;
+		}
 
-	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
-		return -ENOMEM;
+		if (devm_add_action_or_reset(dev, pmem_release_queue, pmem)) {
+			blk_mq_free_tag_set(&pmem->tag_set);
+			return -ENOMEM;
+		}
+	} else if (queue_mode == PMEM_Q_BIO) {
+		pmem->q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev));
+		if (!pmem->q)
+			return -ENOMEM;
+
+		if (devm_add_action_or_reset(dev, pmem_release_queue, pmem))
+			return -ENOMEM;
+
+		blk_queue_make_request(pmem->q, pmem_make_request);
+	} else {
+		dev_warn(dev, "Invalid queue mode: %d\n", queue_mode);
+		return -EINVAL;
+	}
 
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
-		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
-				altmap);
+		addr = devm_memremap_pages(dev, &pfn_res,
+				&pmem->q->q_usage_counter, altmap);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
 		pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
@@ -322,7 +429,7 @@ static int pmem_attach_disk(struct device *dev,
 		res->start += pmem->data_offset;
 	} else if (pmem_should_map_pages(dev)) {
 		addr = devm_memremap_pages(dev, &nsio->res,
-				&q->q_usage_counter, NULL);
+				&pmem->q->q_usage_counter, NULL);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		addr = devm_memremap(dev, pmem->phys_addr,
@@ -332,7 +439,7 @@ static int pmem_attach_disk(struct device *dev,
 	 * At release time the queue must be frozen before
 	 * devm_memremap_pages is unwound
 	 */
-	if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
+	if (devm_add_action_or_reset(dev, pmem_freeze_queue, pmem->q))
 		return -ENOMEM;
 
 	if (IS_ERR(addr))
@@ -354,7 +461,7 @@ static int pmem_attach_disk(struct device *dev,
 	pmem->disk = disk;
 
 	disk->fops		= &pmem_fops;
-	disk->queue		= q;
+	disk->queue		= pmem->q;
 	disk->flags		= GENHD_FL_EXT_DEVT;
 	nvdimm_namespace_disk_name(ndns, disk->disk_name);
 	set_capacity(disk, (pmem->size - pmem->pfn_pad - pmem->data_offset)
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 7f4dbd7..1836c7b 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -4,6 +4,7 @@
 #include <linux/types.h>
 #include <linux/pfn_t.h>
 #include <linux/fs.h>
+#include <linux/blk-mq.h>
 
 /* this definition is in it's own header for tools/testing/nvdimm to consume */
 struct pmem_device {
@@ -20,6 +21,8 @@ struct pmem_device {
 	struct badblocks	bb;
 	struct dax_device	*dax_dev;
 	struct gendisk		*disk;
+	struct blk_mq_tag_set	tag_set;
+	struct request_queue	*q;
 };
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq
  2017-07-31 22:24 [PATCH 0/5] Adding blk-mq and DMA support to pmem block driver Dave Jiang
                   ` (3 preceding siblings ...)
  2017-07-31 22:24 ` [PATCH 4/5] libnvdimm: Adding blk-mq support to the pmem driver Dave Jiang
@ 2017-07-31 22:24 ` Dave Jiang
  2017-08-01  7:34   ` Johannes Thumshirn
  2017-08-01 20:42   ` Ross Zwisler
  4 siblings, 2 replies; 20+ messages in thread
From: Dave Jiang @ 2017-07-31 22:24 UTC (permalink / raw)
  To: vinod.koul, dan.j.williams; +Cc: dmaengine, linux-nvdimm

Adding DMA support for pmem blk reads. This provides signficant CPU
reduction with large memory reads with good performance. DMAs are triggered
with test against bio_multiple_segment(), so the small I/Os (4k or less?)
are still performed by the CPU in order to reduce latency. By default
the pmem driver will be using blk-mq with DMA.

Numbers below are measured against pmem simulated via DRAM using
memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
platform.  Keep in mind the performance for actual persistent memory
will differ.
Fio 2.21 was used.

64k: 1 task queuedepth=1
CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU

64k: 16 tasks queuedepth=16
CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU

2M: 1 task queuedepth=1
CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU

2M: 16 tasks queuedepth=16
CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/nvdimm/pmem.c |  225 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 208 insertions(+), 17 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 347835b..02ca9f0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -32,6 +32,8 @@
 #include <linux/dax.h>
 #include <linux/nd.h>
 #include <linux/blk-mq.h>
+#include <linux/dmaengine.h>
+#include <linux/dma-mapping.h>
 #include "pmem.h"
 #include "pfn.h"
 #include "nd.h"
@@ -41,12 +43,24 @@ enum {
 	PMEM_Q_MQ = 1,
 };
 
-static int queue_mode = PMEM_Q_BIO;
+static int queue_mode = PMEM_Q_MQ;
 module_param(queue_mode, int, 0444);
 MODULE_PARM_DESC(queue_mode, "Pmem Queue Mode");
 
+static int queue_depth = 128;
+module_param(queue_depth, int, 0444);
+MODULE_PARM_DESC(queue_depth, "I/O Queue Depth for multi queue mode");
+
+/* typically maps to number of DMA channels/devices per socket */
+static int q_per_node = 8;
+module_param(q_per_node, int, 0444);
+MODULE_PARM_DESC(q_per_node, "Hardware queues per node\n");
+
 struct pmem_cmd {
 	struct request *rq;
+	struct dma_chan *chan;
+	int sg_nents;
+	struct scatterlist sg[];
 };
 
 static struct device *to_dev(struct pmem_device *pmem)
@@ -276,6 +290,160 @@ static void pmem_release_disk(void *__pmem)
 	put_disk(pmem->disk);
 }
 
+static void nd_pmem_dma_callback(void *data,
+		const struct dmaengine_result *res)
+{
+	struct pmem_cmd *cmd = data;
+	struct request *req = cmd->rq;
+	struct request_queue *q = req->q;
+	struct pmem_device *pmem = q->queuedata;
+	struct nd_region *nd_region = to_region(pmem);
+	struct device *dev = to_dev(pmem);
+	int rc = 0;
+
+	dev_dbg(dev, "%s()\n", __func__);
+
+	if (res) {
+		enum dmaengine_tx_result dma_err = res->result;
+
+		switch (dma_err) {
+		case DMA_TRANS_READ_FAILED:
+		case DMA_TRANS_WRITE_FAILED:
+		case DMA_TRANS_ABORTED:
+			dev_dbg(dev, "bio failed\n");
+			rc = -ENXIO;
+			break;
+		case DMA_TRANS_NOERROR:
+		default:
+			break;
+		}
+	}
+
+	if (req->cmd_flags & REQ_FUA)
+		nvdimm_flush(nd_region);
+
+	dev_dbg(dev, "ending request\n");
+	blk_mq_end_request(cmd->rq, rc);
+}
+
+static int pmem_handle_cmd_dma(struct pmem_cmd *cmd, bool is_write)
+{
+	struct request *req = cmd->rq;
+	struct request_queue *q = req->q;
+	struct pmem_device *pmem = q->queuedata;
+	struct device *dev = to_dev(pmem);
+	phys_addr_t pmem_off = blk_rq_pos(req) * 512 + pmem->data_offset;
+	void *pmem_addr = pmem->virt_addr + pmem_off;
+	struct nd_region *nd_region = to_region(pmem);
+	size_t len;
+	struct dma_device *dma = cmd->chan->device;
+	struct dmaengine_unmap_data *unmap;
+	dma_cookie_t cookie;
+	struct dma_async_tx_descriptor *txd;
+	struct page *page;
+	unsigned int off;
+	int rc;
+	enum dma_data_direction dir;
+	dma_addr_t dma_addr;
+
+	if (req->cmd_flags & REQ_FLUSH)
+		nvdimm_flush(nd_region);
+
+	unmap = dmaengine_get_unmap_data(dma->dev, 2, GFP_NOWAIT);
+	if (!unmap) {
+		dev_dbg(dev, "failed to get dma unmap data\n");
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	/*
+	 * If reading from pmem, writing to scatterlist,
+	 * and if writing to pmem, reading from scatterlist.
+	 */
+	dir = is_write ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
+	cmd->sg_nents = blk_rq_map_sg(req->q, req, cmd->sg);
+	if (cmd->sg_nents < 1) {
+		rc = -EINVAL;
+		goto err;
+	}
+
+	if (cmd->sg_nents > 128) {
+		rc = -ENOMEM;
+		dev_warn(dev, "Number of sg greater than allocated\n");
+		goto err;
+	}
+
+	rc = dma_map_sg(dma->dev, cmd->sg, cmd->sg_nents, dir);
+	if (rc < 1) {
+		rc = -ENXIO;
+		goto err;
+	}
+
+	len = blk_rq_payload_bytes(req);
+	page = virt_to_page(pmem_addr);
+	off = (u64)pmem_addr & ~PAGE_MASK;
+	dir = is_write ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+	dma_addr = dma_map_page(dma->dev, page, off, len, dir);
+	if (dma_mapping_error(dma->dev, unmap->addr[0])) {
+		dev_dbg(dma->dev, "src DMA mapping error\n");
+		goto err_unmap_sg;
+	}
+
+	unmap->len = len;
+
+	if (is_write) {
+		unmap->addr[0] = dma_addr;
+		unmap->addr[1] = (dma_addr_t)cmd->sg;
+		unmap->to_cnt = 1;
+		unmap->from_cnt = 0;
+		dma_unmap_data_sg_from_nents(unmap, 2) = cmd->sg_nents;
+		txd = dma->device_prep_dma_memcpy_from_sg(cmd->chan, dma_addr,
+				cmd->sg, cmd->sg_nents, DMA_PREP_INTERRUPT);
+	} else {
+		unmap->addr[0] = (dma_addr_t)cmd->sg;
+		unmap->addr[1] = dma_addr;
+		unmap->from_cnt = 1;
+		unmap->to_cnt = 0;
+		dma_unmap_data_sg_to_nents(unmap, 2) = cmd->sg_nents;
+		txd = dma->device_prep_dma_memcpy_to_sg(cmd->chan, cmd->sg,
+			cmd->sg_nents, dma_addr, DMA_PREP_INTERRUPT);
+	}
+
+	if (!txd) {
+		dev_dbg(dma->dev, "dma prep failed\n");
+		goto err_unmap_buffer;
+	}
+
+	txd->callback_result = nd_pmem_dma_callback;
+	txd->callback_param = cmd;
+	dma_set_unmap(txd, unmap);
+	cookie = dmaengine_submit(txd);
+	if (dma_submit_error(cookie)) {
+		dev_dbg(dma->dev, "dma submit error\n");
+		goto err_set_unmap;
+	}
+
+	dmaengine_unmap_put(unmap);
+	dma_async_issue_pending(cmd->chan);
+
+	return 0;
+
+err_set_unmap:
+	dmaengine_unmap_put(unmap);
+err_unmap_buffer:
+	dma_unmap_page(dev, dma_addr, len, dir);
+err_unmap_sg:
+	if (dir == DMA_TO_DEVICE)
+		dir = DMA_FROM_DEVICE;
+	else
+		dir = DMA_TO_DEVICE;
+	dma_unmap_sg(dev, cmd->sg, cmd->sg_nents, dir);
+	dmaengine_unmap_put(unmap);
+err:
+	blk_mq_end_request(cmd->rq, -ENXIO);
+	return -ENXIO;
+}
+
 static int pmem_handle_cmd(struct pmem_cmd *cmd)
 {
 	struct request *req = cmd->rq;
@@ -309,13 +477,17 @@ static int pmem_queue_rq(struct blk_mq_hw_ctx *hctx,
 		const struct blk_mq_queue_data *bd)
 {
 	struct pmem_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+	struct request *req;
 	int rc;
 
-	cmd->rq = bd->rq;
-
-	blk_mq_start_request(bd->rq);
+	req = cmd->rq = bd->rq;
+	cmd->chan = dma_find_channel(DMA_MEMCPY);
+	blk_mq_start_request(req);
 
-	rc = pmem_handle_cmd(cmd);
+	if (cmd->chan && bio_multiple_segments(req->bio))
+		rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
+	else
+		rc = pmem_handle_cmd(cmd);
 	if (rc < 0)
 		rc = BLK_MQ_RQ_QUEUE_ERROR;
 	else
@@ -348,10 +520,10 @@ static int pmem_attach_disk(struct device *dev,
 	struct nd_pfn_sb *pfn_sb;
 	struct pmem_device *pmem;
 	struct resource pfn_res;
-	struct request_queue *q;
 	struct gendisk *disk;
 	void *addr;
 	int rc;
+	struct dma_chan *chan;
 
 	/* while nsio_rw_bytes is active, parse a pfn info block if present */
 	if (is_nd_pfn(dev)) {
@@ -381,11 +553,20 @@ static int pmem_attach_disk(struct device *dev,
 	}
 
 	if (queue_mode == PMEM_Q_MQ) {
+		chan = dma_find_channel(DMA_MEMCPY);
+		if (!chan) {
+			queue_mode = PMEM_Q_BIO;
+			dev_warn(dev, "Forced back to PMEM_Q_BIO, no DMA\n");
+		}
+	}
+
+	if (queue_mode == PMEM_Q_MQ) {
 		pmem->tag_set.ops = &pmem_mq_ops;
-		pmem->tag_set.nr_hw_queues = nr_online_nodes;
-		pmem->tag_set.queue_depth = 64;
+		pmem->tag_set.nr_hw_queues = nr_online_nodes * q_per_node;
+		pmem->tag_set.queue_depth = queue_depth;
 		pmem->tag_set.numa_node = dev_to_node(dev);
-		pmem->tag_set.cmd_size = sizeof(struct pmem_cmd);
+		pmem->tag_set.cmd_size = sizeof(struct pmem_cmd) +
+			sizeof(struct scatterlist) * 128;
 		pmem->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 		pmem->tag_set.driver_data = pmem;
 
@@ -446,14 +627,17 @@ static int pmem_attach_disk(struct device *dev,
 		return PTR_ERR(addr);
 	pmem->virt_addr = addr;
 
-	blk_queue_write_cache(q, true, true);
-	blk_queue_make_request(q, pmem_make_request);
-	blk_queue_physical_block_size(q, PAGE_SIZE);
-	blk_queue_max_hw_sectors(q, UINT_MAX);
-	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
-	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
-	queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
-	q->queuedata = pmem;
+	blk_queue_write_cache(pmem->q, true, true);
+	blk_queue_physical_block_size(pmem->q, PAGE_SIZE);
+	if (queue_mode == PMEM_Q_MQ) {
+		blk_queue_max_hw_sectors(pmem->q, 0x200000);
+		blk_queue_max_segments(pmem->q, 128);
+	} else
+		blk_queue_max_hw_sectors(pmem->q, UINT_MAX);
+	blk_queue_bounce_limit(pmem->q, BLK_BOUNCE_ANY);
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->q);
+	queue_flag_set_unlocked(QUEUE_FLAG_DAX, pmem->q);
+	pmem->q->queuedata = pmem;
 
 	disk = alloc_disk_node(0, nid);
 	if (!disk)
@@ -588,15 +772,22 @@ static struct nd_device_driver nd_pmem_driver = {
 
 static int __init pmem_init(void)
 {
+	if (queue_mode == PMEM_Q_MQ)
+		dmaengine_get();
+
 	return nd_driver_register(&nd_pmem_driver);
 }
 module_init(pmem_init);
 
 static void pmem_exit(void)
 {
+	if (queue_mode == PMEM_Q_MQ)
+		dmaengine_put();
+
 	driver_unregister(&nd_pmem_driver.drv);
 }
 module_exit(pmem_exit);
 
+MODULE_SOFTDEP("pre: dmaengine");
 MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
 MODULE_LICENSE("GPL v2");

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] dmaengine: ioatdma: dma_prep_memcpy_to/from_sg support
  2017-07-31 22:24 ` [PATCH 2/5] dmaengine: ioatdma: dma_prep_memcpy_to/from_sg support Dave Jiang
@ 2017-08-01  2:14   ` Dan Williams
  2017-08-01 16:39     ` Dave Jiang
  2017-08-02  4:57     ` Vinod Koul
  0 siblings, 2 replies; 20+ messages in thread
From: Dan Williams @ 2017-08-01  2:14 UTC (permalink / raw)
  To: Dave Jiang; +Cc: Vinod Koul, dmaengine, linux-nvdimm

On Mon, Jul 31, 2017 at 3:24 PM, Dave Jiang <dave.jiang@intel.com> wrote:
> Adding ioatdma support to copy from a physically contiguos buffer to a
> provided scatterlist and vice versa. This is used to support
> reading/writing persistent memory in the pmem driver.
>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/dma/ioat/dma.h    |    8 +++
>  drivers/dma/ioat/init.c   |    3 +
>  drivers/dma/ioat/prep.c   |  105 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dmaengine.h |   10 ++++
>  4 files changed, 126 insertions(+)
>
> diff --git a/drivers/dma/ioat/dma.h b/drivers/dma/ioat/dma.h
> index 56200ee..b9a8d82 100644
> --- a/drivers/dma/ioat/dma.h
> +++ b/drivers/dma/ioat/dma.h
> @@ -370,6 +370,14 @@ struct dma_async_tx_descriptor *
>  ioat_dma_prep_memcpy_lock(struct dma_chan *c, dma_addr_t dma_dest,
>                            dma_addr_t dma_src, size_t len, unsigned long flags);
>  struct dma_async_tx_descriptor *
> +ioat_dma_prep_memcpy_to_sg_lock(struct dma_chan *c,
> +               struct scatterlist *dst_sg, unsigned int dst_nents,
> +               dma_addr_t src, unsigned long flags);
> +struct dma_async_tx_descriptor *
> +ioat_dma_prep_memcpy_from_sg_lock(struct dma_chan *c, dma_addr_t dst,
> +               struct scatterlist *dst_sg, unsigned int dst_nents,
> +               unsigned long flags);
> +struct dma_async_tx_descriptor *
>  ioat_prep_interrupt_lock(struct dma_chan *c, unsigned long flags);
>  struct dma_async_tx_descriptor *
>  ioat_prep_xor(struct dma_chan *chan, dma_addr_t dest, dma_addr_t *src,
> diff --git a/drivers/dma/ioat/init.c b/drivers/dma/ioat/init.c
> index e437112..1e03f3e 100644
> --- a/drivers/dma/ioat/init.c
> +++ b/drivers/dma/ioat/init.c
> @@ -1091,6 +1091,9 @@ static int ioat3_dma_probe(struct ioatdma_device *ioat_dma, int dca)
>
>         dma = &ioat_dma->dma_dev;
>         dma->device_prep_dma_memcpy = ioat_dma_prep_memcpy_lock;
> +       dma->device_prep_dma_memcpy_to_sg = ioat_dma_prep_memcpy_to_sg_lock;
> +       dma->device_prep_dma_memcpy_from_sg =
> +               ioat_dma_prep_memcpy_from_sg_lock;
>         dma->device_issue_pending = ioat_issue_pending;
>         dma->device_alloc_chan_resources = ioat_alloc_chan_resources;
>         dma->device_free_chan_resources = ioat_free_chan_resources;
> diff --git a/drivers/dma/ioat/prep.c b/drivers/dma/ioat/prep.c
> index 243421a..49e220c 100644
> --- a/drivers/dma/ioat/prep.c
> +++ b/drivers/dma/ioat/prep.c
> @@ -159,6 +159,111 @@ ioat_dma_prep_memcpy_lock(struct dma_chan *c, dma_addr_t dma_dest,
>         return &desc->txd;
>  }
>
> +struct dma_async_tx_descriptor *
> +ioat_dma_prep_memcpy_to_sg_lock(struct dma_chan *c,
> +               struct scatterlist *dst_sg, unsigned int dst_nents,
> +               dma_addr_t dma_src, unsigned long flags)
> +{
> +       struct ioatdma_chan *ioat_chan = to_ioat_chan(c);
> +       struct ioat_dma_descriptor *hw = NULL;
> +       struct ioat_ring_ent *desc = NULL;
> +       dma_addr_t src = dma_src;
> +       int num_descs, idx, i;
> +       struct scatterlist *s;
> +       size_t total_len = 0, len;
> +
> +
> +       if (test_bit(IOAT_CHAN_DOWN, &ioat_chan->state))
> +               return NULL;
> +
> +       /*
> +        * The upper layer will garantee that each entry does not exceed
> +        * xfercap.
> +        */
> +       num_descs = dst_nents;
> +
> +       if (likely(num_descs) &&
> +           ioat_check_space_lock(ioat_chan, num_descs) == 0)
> +               idx = ioat_chan->head;
> +       else
> +               return NULL;
> +
> +       for_each_sg(dst_sg, s, dst_nents, i) {
> +               desc = ioat_get_ring_ent(ioat_chan, idx + i);
> +               hw = desc->hw;
> +               len = sg_dma_len(s);
> +               hw->size = len;
> +               hw->ctl = 0;
> +               hw->src_addr = src;
> +               hw->dst_addr = sg_dma_address(s);
> +               src += len;
> +               total_len += len;
> +               dump_desc_dbg(ioat_chan, desc);
> +       }
> +
> +       desc->txd.flags = flags;
> +       desc->len = total_len;
> +       hw->ctl_f.int_en = !!(flags & DMA_PREP_INTERRUPT);
> +       hw->ctl_f.fence = !!(flags & DMA_PREP_FENCE);
> +       hw->ctl_f.compl_write = 1;
> +       dump_desc_dbg(ioat_chan, desc);
> +       /* we leave the channel locked to ensure in order submission */
> +
> +       return &desc->txd;
> +}
> +
> +struct dma_async_tx_descriptor *
> +ioat_dma_prep_memcpy_from_sg_lock(struct dma_chan *c, dma_addr_t dma_dst,
> +               struct scatterlist *src_sg, unsigned int src_nents,
> +               unsigned long flags)
> +{
> +       struct ioatdma_chan *ioat_chan = to_ioat_chan(c);
> +       struct ioat_dma_descriptor *hw = NULL;
> +       struct ioat_ring_ent *desc = NULL;
> +       dma_addr_t dst = dma_dst;
> +       int num_descs, idx, i;
> +       struct scatterlist *s;
> +       size_t total_len = 0, len;
> +
> +
> +       if (test_bit(IOAT_CHAN_DOWN, &ioat_chan->state))
> +               return NULL;
> +
> +       /*
> +        * The upper layer will garantee that each entry does not exceed
> +        * xfercap.
> +        */
> +       num_descs = src_nents;
> +
> +       if (likely(num_descs) &&
> +           ioat_check_space_lock(ioat_chan, num_descs) == 0)
> +               idx = ioat_chan->head;
> +       else
> +               return NULL;
> +
> +       for_each_sg(src_sg, s, src_nents, i) {
> +               desc = ioat_get_ring_ent(ioat_chan, idx + i);
> +               hw = desc->hw;
> +               len = sg_dma_len(s);
> +               hw->size = len;
> +               hw->ctl = 0;
> +               hw->src_addr = sg_dma_address(s);
> +               hw->dst_addr = dst;
> +               dst += len;
> +               total_len += len;
> +               dump_desc_dbg(ioat_chan, desc);
> +       }
> +
> +       desc->txd.flags = flags;
> +       desc->len = total_len;
> +       hw->ctl_f.int_en = !!(flags & DMA_PREP_INTERRUPT);
> +       hw->ctl_f.fence = !!(flags & DMA_PREP_FENCE);
> +       hw->ctl_f.compl_write = 1;
> +       dump_desc_dbg(ioat_chan, desc);
> +       /* we leave the channel locked to ensure in order submission */
> +
> +       return &desc->txd;
> +}
>
>  static struct dma_async_tx_descriptor *
>  __ioat_prep_xor_lock(struct dma_chan *c, enum sum_check_flags *result,
> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
> index 5336808..6202e7c 100644
> --- a/include/linux/dmaengine.h
> +++ b/include/linux/dmaengine.h
> @@ -694,6 +694,8 @@ struct dma_filter {
>   * @device_prep_dma_memset: prepares a memset operation
>   * @device_prep_dma_memset_sg: prepares a memset operation over a scatter list
>   * @device_prep_dma_interrupt: prepares an end of chain interrupt operation
> + * @device_prep_dma_memcpy_to_sg: prepares memcpy from buffer to scatterlist
> + * @device_prep_dma_memcpy_from_sg: prepares memcpy from scatterlist to buffer
>   * @device_prep_slave_sg: prepares a slave dma operation
>   * @device_prep_dma_cyclic: prepare a cyclic dma operation suitable for audio.
>   *     The function takes a buffer of size buf_len. The callback function will
> @@ -776,6 +778,14 @@ struct dma_device {
>                 struct scatterlist *dst_sg, unsigned int dst_nents,
>                 struct scatterlist *src_sg, unsigned int src_nents,
>                 unsigned long flags);
> +       struct dma_async_tx_descriptor *(*device_prep_dma_memcpy_to_sg)(
> +               struct dma_chan *chan,
> +               struct scatterlist *dst_sg, unsigned int dst_nents,
> +               dma_addr_t src, unsigned long flags);
> +       struct dma_async_tx_descriptor *(*device_prep_dma_memcpy_from_sg)(
> +               struct dma_chan *chan, dma_addr_t dst,
> +               struct scatterlist *dst_sg, unsigned int dst_nents,
> +               unsigned long flags);
>
>         struct dma_async_tx_descriptor *(*device_prep_slave_sg)(
>                 struct dma_chan *chan, struct scatterlist *sgl,
>

Can we get away with just adding one new operation with a flag to
indicate whether it is the 'to' or' 'from' sg case?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq
  2017-07-31 22:24 ` [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq Dave Jiang
@ 2017-08-01  7:34   ` Johannes Thumshirn
  2017-08-01 16:40     ` Dave Jiang
  2017-08-01 17:43     ` Dan Williams
  2017-08-01 20:42   ` Ross Zwisler
  1 sibling, 2 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2017-08-01  7:34 UTC (permalink / raw)
  To: Dave Jiang; +Cc: vinod.koul, dmaengine, linux-nvdimm

Dave Jiang <dave.jiang@intel.com> writes:

> Adding DMA support for pmem blk reads. This provides signficant CPU
> reduction with large memory reads with good performance. DMAs are triggered
> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
> are still performed by the CPU in order to reduce latency. By default
> the pmem driver will be using blk-mq with DMA.
>
> Numbers below are measured against pmem simulated via DRAM using
> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
> platform.  Keep in mind the performance for actual persistent memory
> will differ.
> Fio 2.21 was used.
>
> 64k: 1 task queuedepth=1
> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>
> 64k: 16 tasks queuedepth=16
> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>
> 2M: 1 task queuedepth=1
> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>
> 2M: 16 tasks queuedepth=16
> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> ---

Hi Dave,

The above table shows that there's a performance benefit for 2M
transfers but a regression for 64k transfers, if we forget about the CPU
utilization for a second. Would it be beneficial to have heuristics on
the transfer size that decide when to use dma and when not? You
introduced this hunk:

-   rc = pmem_handle_cmd(cmd);
+   if (cmd->chan && bio_multiple_segments(req->bio))
+      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
+   else
+       rc = pmem_handle_cmd(cmd);

Which utilizes dma for bios with multiple segments and for single
segment bios you use the old path, maybe the single/multi segment logic
can be amended to have something like:

    if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
       rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
    else
       rc = pmem_handle_cmd(cmd);

Just something woth considering IMHO.

> +	len = blk_rq_payload_bytes(req);
> +	page = virt_to_page(pmem_addr);
> +	off = (u64)pmem_addr & ~PAGE_MASK;

off = offset_in_page(pmem_addr); ?

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] dmaengine: ioatdma: dma_prep_memcpy_to/from_sg support
  2017-08-01  2:14   ` Dan Williams
@ 2017-08-01 16:39     ` Dave Jiang
  2017-08-02  4:57     ` Vinod Koul
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Jiang @ 2017-08-01 16:39 UTC (permalink / raw)
  To: Dan Williams; +Cc: Koul, Vinod, dmaengine, linux-nvdimm



On 07/31/2017 07:14 PM, Dan Williams wrote:
> On Mon, Jul 31, 2017 at 3:24 PM, Dave Jiang <dave.jiang@intel.com> wrote:
>> Adding ioatdma support to copy from a physically contiguos buffer to a
>> provided scatterlist and vice versa. This is used to support
>> reading/writing persistent memory in the pmem driver.
>>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>> ---
>>  drivers/dma/ioat/dma.h    |    8 +++
>>  drivers/dma/ioat/init.c   |    3 +
>>  drivers/dma/ioat/prep.c   |  105 +++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/dmaengine.h |   10 ++++
>>  4 files changed, 126 insertions(+)
>>
>> diff --git a/drivers/dma/ioat/dma.h b/drivers/dma/ioat/dma.h
>> index 56200ee..b9a8d82 100644
>> --- a/drivers/dma/ioat/dma.h
>> +++ b/drivers/dma/ioat/dma.h
>> @@ -370,6 +370,14 @@ struct dma_async_tx_descriptor *
>>  ioat_dma_prep_memcpy_lock(struct dma_chan *c, dma_addr_t dma_dest,
>>                            dma_addr_t dma_src, size_t len, unsigned long flags);
>>  struct dma_async_tx_descriptor *
>> +ioat_dma_prep_memcpy_to_sg_lock(struct dma_chan *c,
>> +               struct scatterlist *dst_sg, unsigned int dst_nents,
>> +               dma_addr_t src, unsigned long flags);
>> +struct dma_async_tx_descriptor *
>> +ioat_dma_prep_memcpy_from_sg_lock(struct dma_chan *c, dma_addr_t dst,
>> +               struct scatterlist *dst_sg, unsigned int dst_nents,
>> +               unsigned long flags);
>> +struct dma_async_tx_descriptor *
>>  ioat_prep_interrupt_lock(struct dma_chan *c, unsigned long flags);
>>  struct dma_async_tx_descriptor *
>>  ioat_prep_xor(struct dma_chan *chan, dma_addr_t dest, dma_addr_t *src,
>> diff --git a/drivers/dma/ioat/init.c b/drivers/dma/ioat/init.c
>> index e437112..1e03f3e 100644
>> --- a/drivers/dma/ioat/init.c
>> +++ b/drivers/dma/ioat/init.c
>> @@ -1091,6 +1091,9 @@ static int ioat3_dma_probe(struct ioatdma_device *ioat_dma, int dca)
>>
>>         dma = &ioat_dma->dma_dev;
>>         dma->device_prep_dma_memcpy = ioat_dma_prep_memcpy_lock;
>> +       dma->device_prep_dma_memcpy_to_sg = ioat_dma_prep_memcpy_to_sg_lock;
>> +       dma->device_prep_dma_memcpy_from_sg =
>> +               ioat_dma_prep_memcpy_from_sg_lock;
>>         dma->device_issue_pending = ioat_issue_pending;
>>         dma->device_alloc_chan_resources = ioat_alloc_chan_resources;
>>         dma->device_free_chan_resources = ioat_free_chan_resources;
>> diff --git a/drivers/dma/ioat/prep.c b/drivers/dma/ioat/prep.c
>> index 243421a..49e220c 100644
>> --- a/drivers/dma/ioat/prep.c
>> +++ b/drivers/dma/ioat/prep.c
>> @@ -159,6 +159,111 @@ ioat_dma_prep_memcpy_lock(struct dma_chan *c, dma_addr_t dma_dest,
>>         return &desc->txd;
>>  }
>>
>> +struct dma_async_tx_descriptor *
>> +ioat_dma_prep_memcpy_to_sg_lock(struct dma_chan *c,
>> +               struct scatterlist *dst_sg, unsigned int dst_nents,
>> +               dma_addr_t dma_src, unsigned long flags)
>> +{
>> +       struct ioatdma_chan *ioat_chan = to_ioat_chan(c);
>> +       struct ioat_dma_descriptor *hw = NULL;
>> +       struct ioat_ring_ent *desc = NULL;
>> +       dma_addr_t src = dma_src;
>> +       int num_descs, idx, i;
>> +       struct scatterlist *s;
>> +       size_t total_len = 0, len;
>> +
>> +
>> +       if (test_bit(IOAT_CHAN_DOWN, &ioat_chan->state))
>> +               return NULL;
>> +
>> +       /*
>> +        * The upper layer will garantee that each entry does not exceed
>> +        * xfercap.
>> +        */
>> +       num_descs = dst_nents;
>> +
>> +       if (likely(num_descs) &&
>> +           ioat_check_space_lock(ioat_chan, num_descs) == 0)
>> +               idx = ioat_chan->head;
>> +       else
>> +               return NULL;
>> +
>> +       for_each_sg(dst_sg, s, dst_nents, i) {
>> +               desc = ioat_get_ring_ent(ioat_chan, idx + i);
>> +               hw = desc->hw;
>> +               len = sg_dma_len(s);
>> +               hw->size = len;
>> +               hw->ctl = 0;
>> +               hw->src_addr = src;
>> +               hw->dst_addr = sg_dma_address(s);
>> +               src += len;
>> +               total_len += len;
>> +               dump_desc_dbg(ioat_chan, desc);
>> +       }
>> +
>> +       desc->txd.flags = flags;
>> +       desc->len = total_len;
>> +       hw->ctl_f.int_en = !!(flags & DMA_PREP_INTERRUPT);
>> +       hw->ctl_f.fence = !!(flags & DMA_PREP_FENCE);
>> +       hw->ctl_f.compl_write = 1;
>> +       dump_desc_dbg(ioat_chan, desc);
>> +       /* we leave the channel locked to ensure in order submission */
>> +
>> +       return &desc->txd;
>> +}
>> +
>> +struct dma_async_tx_descriptor *
>> +ioat_dma_prep_memcpy_from_sg_lock(struct dma_chan *c, dma_addr_t dma_dst,
>> +               struct scatterlist *src_sg, unsigned int src_nents,
>> +               unsigned long flags)
>> +{
>> +       struct ioatdma_chan *ioat_chan = to_ioat_chan(c);
>> +       struct ioat_dma_descriptor *hw = NULL;
>> +       struct ioat_ring_ent *desc = NULL;
>> +       dma_addr_t dst = dma_dst;
>> +       int num_descs, idx, i;
>> +       struct scatterlist *s;
>> +       size_t total_len = 0, len;
>> +
>> +
>> +       if (test_bit(IOAT_CHAN_DOWN, &ioat_chan->state))
>> +               return NULL;
>> +
>> +       /*
>> +        * The upper layer will garantee that each entry does not exceed
>> +        * xfercap.
>> +        */
>> +       num_descs = src_nents;
>> +
>> +       if (likely(num_descs) &&
>> +           ioat_check_space_lock(ioat_chan, num_descs) == 0)
>> +               idx = ioat_chan->head;
>> +       else
>> +               return NULL;
>> +
>> +       for_each_sg(src_sg, s, src_nents, i) {
>> +               desc = ioat_get_ring_ent(ioat_chan, idx + i);
>> +               hw = desc->hw;
>> +               len = sg_dma_len(s);
>> +               hw->size = len;
>> +               hw->ctl = 0;
>> +               hw->src_addr = sg_dma_address(s);
>> +               hw->dst_addr = dst;
>> +               dst += len;
>> +               total_len += len;
>> +               dump_desc_dbg(ioat_chan, desc);
>> +       }
>> +
>> +       desc->txd.flags = flags;
>> +       desc->len = total_len;
>> +       hw->ctl_f.int_en = !!(flags & DMA_PREP_INTERRUPT);
>> +       hw->ctl_f.fence = !!(flags & DMA_PREP_FENCE);
>> +       hw->ctl_f.compl_write = 1;
>> +       dump_desc_dbg(ioat_chan, desc);
>> +       /* we leave the channel locked to ensure in order submission */
>> +
>> +       return &desc->txd;
>> +}
>>
>>  static struct dma_async_tx_descriptor *
>>  __ioat_prep_xor_lock(struct dma_chan *c, enum sum_check_flags *result,
>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>> index 5336808..6202e7c 100644
>> --- a/include/linux/dmaengine.h
>> +++ b/include/linux/dmaengine.h
>> @@ -694,6 +694,8 @@ struct dma_filter {
>>   * @device_prep_dma_memset: prepares a memset operation
>>   * @device_prep_dma_memset_sg: prepares a memset operation over a scatter list
>>   * @device_prep_dma_interrupt: prepares an end of chain interrupt operation
>> + * @device_prep_dma_memcpy_to_sg: prepares memcpy from buffer to scatterlist
>> + * @device_prep_dma_memcpy_from_sg: prepares memcpy from scatterlist to buffer
>>   * @device_prep_slave_sg: prepares a slave dma operation
>>   * @device_prep_dma_cyclic: prepare a cyclic dma operation suitable for audio.
>>   *     The function takes a buffer of size buf_len. The callback function will
>> @@ -776,6 +778,14 @@ struct dma_device {
>>                 struct scatterlist *dst_sg, unsigned int dst_nents,
>>                 struct scatterlist *src_sg, unsigned int src_nents,
>>                 unsigned long flags);
>> +       struct dma_async_tx_descriptor *(*device_prep_dma_memcpy_to_sg)(
>> +               struct dma_chan *chan,
>> +               struct scatterlist *dst_sg, unsigned int dst_nents,
>> +               dma_addr_t src, unsigned long flags);
>> +       struct dma_async_tx_descriptor *(*device_prep_dma_memcpy_from_sg)(
>> +               struct dma_chan *chan, dma_addr_t dst,
>> +               struct scatterlist *dst_sg, unsigned int dst_nents,
>> +               unsigned long flags);
>>
>>         struct dma_async_tx_descriptor *(*device_prep_slave_sg)(
>>                 struct dma_chan *chan, struct scatterlist *sgl,
>>
> 
> Can we get away with just adding one new operation with a flag to
> indicate whether it is the 'to' or' 'from' sg case?
> 

Yes I can try to make that happen.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq
  2017-08-01  7:34   ` Johannes Thumshirn
@ 2017-08-01 16:40     ` Dave Jiang
  2017-08-01 17:43     ` Dan Williams
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Jiang @ 2017-08-01 16:40 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: Koul, Vinod, dmaengine, linux-nvdimm



On 08/01/2017 12:34 AM, Johannes Thumshirn wrote:
> Dave Jiang <dave.jiang@intel.com> writes:
> 
>> Adding DMA support for pmem blk reads. This provides signficant CPU
>> reduction with large memory reads with good performance. DMAs are triggered
>> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
>> are still performed by the CPU in order to reduce latency. By default
>> the pmem driver will be using blk-mq with DMA.
>>
>> Numbers below are measured against pmem simulated via DRAM using
>> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
>> platform.  Keep in mind the performance for actual persistent memory
>> will differ.
>> Fio 2.21 was used.
>>
>> 64k: 1 task queuedepth=1
>> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
>> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>>
>> 64k: 16 tasks queuedepth=16
>> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
>> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>>
>> 2M: 1 task queuedepth=1
>> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
>> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>>
>> 2M: 16 tasks queuedepth=16
>> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
>> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>> ---
> 
> Hi Dave,
> 
> The above table shows that there's a performance benefit for 2M
> transfers but a regression for 64k transfers, if we forget about the CPU
> utilization for a second. Would it be beneficial to have heuristics on
> the transfer size that decide when to use dma and when not? You
> introduced this hunk:
> 
> -   rc = pmem_handle_cmd(cmd);
> +   if (cmd->chan && bio_multiple_segments(req->bio))
> +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
> +   else
> +       rc = pmem_handle_cmd(cmd);
> 
> Which utilizes dma for bios with multiple segments and for single
> segment bios you use the old path, maybe the single/multi segment logic
> can be amended to have something like:
> 
>     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
>        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
>     else
>        rc = pmem_handle_cmd(cmd);

Yes good idea. I will introduce a thresh parameter that's tune-able.

> 
> Just something woth considering IMHO.
> 
>> +	len = blk_rq_payload_bytes(req);
>> +	page = virt_to_page(pmem_addr);
>> +	off = (u64)pmem_addr & ~PAGE_MASK;
> 
> off = offset_in_page(pmem_addr); ?

Thanks. Will fix.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq
  2017-08-01  7:34   ` Johannes Thumshirn
  2017-08-01 16:40     ` Dave Jiang
@ 2017-08-01 17:43     ` Dan Williams
  2017-08-03  8:06       ` Johannes Thumshirn
  1 sibling, 1 reply; 20+ messages in thread
From: Dan Williams @ 2017-08-01 17:43 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: Vinod Koul, dmaengine, linux-nvdimm

On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> Dave Jiang <dave.jiang@intel.com> writes:
>
>> Adding DMA support for pmem blk reads. This provides signficant CPU
>> reduction with large memory reads with good performance. DMAs are triggered
>> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
>> are still performed by the CPU in order to reduce latency. By default
>> the pmem driver will be using blk-mq with DMA.
>>
>> Numbers below are measured against pmem simulated via DRAM using
>> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
>> platform.  Keep in mind the performance for actual persistent memory
>> will differ.
>> Fio 2.21 was used.
>>
>> 64k: 1 task queuedepth=1
>> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
>> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>>
>> 64k: 16 tasks queuedepth=16
>> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
>> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>>
>> 2M: 1 task queuedepth=1
>> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
>> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>>
>> 2M: 16 tasks queuedepth=16
>> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
>> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>> ---
>
> Hi Dave,
>
> The above table shows that there's a performance benefit for 2M
> transfers but a regression for 64k transfers, if we forget about the CPU
> utilization for a second.

I don't think we can forget about cpu utilization, and I would expect
most users would value cpu over small bandwidth increases. Especially
when those numbers show efficiency like this

+160% cpu +26% read bandwidth
+171% cpu -10% write bandwidth

> Would it be beneficial to have heuristics on
> the transfer size that decide when to use dma and when not? You
> introduced this hunk:
>
> -   rc = pmem_handle_cmd(cmd);
> +   if (cmd->chan && bio_multiple_segments(req->bio))
> +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
> +   else
> +       rc = pmem_handle_cmd(cmd);
>
> Which utilizes dma for bios with multiple segments and for single
> segment bios you use the old path, maybe the single/multi segment logic
> can be amended to have something like:
>
>     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
>        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
>     else
>        rc = pmem_handle_cmd(cmd);
>
> Just something woth considering IMHO.

The thing we want to avoid most is having the cpu stall doing nothing
waiting for memory access, so I think once the efficiency of adding
more cpu goes non-linear it's better to let the dma engine handle
that. The current heuristic seems to achieve that, but the worry is
does it over-select the cpu for cases where requests could be merged
into a bulkier i/o that is more suitable for dma.

I'm not sure we want another tunable vs predictable / efficient
behavior by default.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] libnvdimm: Adding blk-mq support to the pmem driver
  2017-07-31 22:24 ` [PATCH 4/5] libnvdimm: Adding blk-mq support to the pmem driver Dave Jiang
@ 2017-08-01 19:02   ` Ross Zwisler
  0 siblings, 0 replies; 20+ messages in thread
From: Ross Zwisler @ 2017-08-01 19:02 UTC (permalink / raw)
  To: Dave Jiang; +Cc: vinod.koul, dmaengine, linux-nvdimm

On Mon, Jul 31, 2017 at 03:24:40PM -0700, Dave Jiang wrote:
> Adding blk-mq support to the pmem driver in addition to the direct bio
> support. This allows for hardware offloading via DMA engines. By default
> the bio method will be enabled. The blk-mq support can be turned on via
> module parameter queue_mode=1.
> 
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/nvdimm/pmem.c |  131 +++++++++++++++++++++++++++++++++++++++++++++----
>  drivers/nvdimm/pmem.h |    3 +
>  2 files changed, 122 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index c544d46..347835b 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -31,10 +31,24 @@
>  #include <linux/pmem.h>
>  #include <linux/dax.h>
>  #include <linux/nd.h>
> +#include <linux/blk-mq.h>
>  #include "pmem.h"
>  #include "pfn.h"
>  #include "nd.h"
>  
> +enum {
> +	PMEM_Q_BIO = 0,
> +	PMEM_Q_MQ = 1,
> +};
> +
> +static int queue_mode = PMEM_Q_BIO;
> +module_param(queue_mode, int, 0444);
> +MODULE_PARM_DESC(queue_mode, "Pmem Queue Mode");

Maybe in the parameter description we can give the user a hint as to what the
settings are?  i.e. "PMEM queue mode (0=BIO, 1=MQ+DMA)" or something?

> +
> +struct pmem_cmd {
> +	struct request *rq;
> +};
> +
>  static struct device *to_dev(struct pmem_device *pmem)
>  {
>  	/*
> @@ -239,9 +253,12 @@ static const struct dax_operations pmem_dax_ops = {
>  	.direct_access = pmem_dax_direct_access,
>  };
>  
> -static void pmem_release_queue(void *q)
> +static void pmem_release_queue(void *pmem)
>  {
> -	blk_cleanup_queue(q);
> +

Unneeded whitespace.  Also, a nit, can we maybe add a local pmem with type
"struct pmem_device *" so we don't have to cast twice?

> +	blk_cleanup_queue(((struct pmem_device *)pmem)->q);
> +	if (queue_mode == PMEM_Q_MQ)
> +		blk_mq_free_tag_set(&((struct pmem_device *)pmem)->tag_set);
>  }
>  
>  static void pmem_freeze_queue(void *q)
> @@ -259,6 +276,65 @@ static void pmem_release_disk(void *__pmem)
>  	put_disk(pmem->disk);
>  }
>  
> +static int pmem_handle_cmd(struct pmem_cmd *cmd)
> +{
> +	struct request *req = cmd->rq;
> +	struct request_queue *q = req->q;
> +	struct pmem_device *pmem = q->queuedata;
> +	struct nd_region *nd_region = to_region(pmem);
> +	struct bio_vec bvec;
> +	struct req_iterator iter;
> +	int rc = 0;
> +
> +	if (req->cmd_flags & REQ_FLUSH)
> +		nvdimm_flush(nd_region);
> +
> +	rq_for_each_segment(bvec, req, iter) {
> +		rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
> +				bvec.bv_offset, op_is_write(req_op(req)),
> +				iter.iter.bi_sector);
> +		if (rc < 0)
> +			break;
> +	}
> +
> +	if (req->cmd_flags & REQ_FUA)
> +		nvdimm_flush(nd_region);
> +
> +	blk_mq_end_request(cmd->rq, rc);
> +
> +	return 0;

	return rc;
?

> +}
> +
> +static int pmem_queue_rq(struct blk_mq_hw_ctx *hctx,
> +		const struct blk_mq_queue_data *bd)
> +{
> +	struct pmem_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
> +	int rc;
> +
> +	cmd->rq = bd->rq;
> +
> +	blk_mq_start_request(bd->rq);
> +
> +	rc = pmem_handle_cmd(cmd);
> +	if (rc < 0)
> +		rc = BLK_MQ_RQ_QUEUE_ERROR;
> +	else
> +		rc = BLK_MQ_RQ_QUEUE_OK;
> +
> +	return rc;
> +}

You can get rid of 'rc' altogether and simplify this to:

static int pmem_queue_rq(struct blk_mq_hw_ctx *hctx,
                const struct blk_mq_queue_data *bd)
{
        struct pmem_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);

        cmd->rq = bd->rq;
        blk_mq_start_request(bd->rq);

        if (pmem_handle_cmd(cmd) < 0)
                return BLK_MQ_RQ_QUEUE_ERROR;
        else
                return BLK_MQ_RQ_QUEUE_OK;
}

> +static int pmem_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
> +		unsigned int index)
> +{
> +	return 0;
> +}

I'm pretty sure you don't need a dummy implementation of .init_hctx.  All
callers check to make sure it's defined before calling, so if you just leave
the OP NULL it'll do the same thing.  scsi_mq_ops for example does this.

> @@ -303,17 +380,47 @@ static int pmem_attach_disk(struct device *dev,
>  		return -EBUSY;
>  	}
>  
> -	q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev));
> -	if (!q)
> -		return -ENOMEM;
> +	if (queue_mode == PMEM_Q_MQ) {
> +		pmem->tag_set.ops = &pmem_mq_ops;
> +		pmem->tag_set.nr_hw_queues = nr_online_nodes;
> +		pmem->tag_set.queue_depth = 64;
> +		pmem->tag_set.numa_node = dev_to_node(dev);
> +		pmem->tag_set.cmd_size = sizeof(struct pmem_cmd);
> +		pmem->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> +		pmem->tag_set.driver_data = pmem;
> +
> +		rc = blk_mq_alloc_tag_set(&pmem->tag_set);
> +		if (rc < 0)
> +			return rc;
> +
> +		pmem->q = blk_mq_init_queue(&pmem->tag_set);
> +		if (IS_ERR(pmem->q)) {
> +			blk_mq_free_tag_set(&pmem->tag_set);
> +			return -ENOMEM;
> +		}
>  
> -	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
> -		return -ENOMEM;
> +		if (devm_add_action_or_reset(dev, pmem_release_queue, pmem)) {
> +			blk_mq_free_tag_set(&pmem->tag_set);
> +			return -ENOMEM;
> +		}
> +	} else if (queue_mode == PMEM_Q_BIO) {
> +		pmem->q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev));
> +		if (!pmem->q)
> +			return -ENOMEM;
> +
> +		if (devm_add_action_or_reset(dev, pmem_release_queue, pmem))
> +			return -ENOMEM;
> +
> +		blk_queue_make_request(pmem->q, pmem_make_request);

Later on in this function there is a big block that is still manipulating 'q'
instead of pmem->q:

	blk_queue_write_cache(q, true, true);
	blk_queue_make_request(q, pmem_make_request);
	blk_queue_physical_block_size(q, PAGE_SIZE);
	blk_queue_max_hw_sectors(q, UINT_MAX);
	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
	queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
	q->queuedata = pmem;

I'm pretty sure this all needs to be looking at pmem->q instead, and we may
just end up killing the local 'q'.

Also, this block duplicates the blk_queue_make_request() call, which we
manually make happen in the PMEM_Q_BIO case above.  Not sure if this needs to
be common for both that and the PMEM_Q_MQ case, but we probably don't want it
in both places.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq
  2017-07-31 22:24 ` [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq Dave Jiang
  2017-08-01  7:34   ` Johannes Thumshirn
@ 2017-08-01 20:42   ` Ross Zwisler
  1 sibling, 0 replies; 20+ messages in thread
From: Ross Zwisler @ 2017-08-01 20:42 UTC (permalink / raw)
  To: Dave Jiang; +Cc: vinod.koul, dmaengine, linux-nvdimm

On Mon, Jul 31, 2017 at 03:24:46PM -0700, Dave Jiang wrote:
> Adding DMA support for pmem blk reads. This provides signficant CPU
> reduction with large memory reads with good performance. DMAs are triggered
> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
> are still performed by the CPU in order to reduce latency. By default
> the pmem driver will be using blk-mq with DMA.
> 
> Numbers below are measured against pmem simulated via DRAM using
> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
> platform.  Keep in mind the performance for actual persistent memory
> will differ.
> Fio 2.21 was used.
> 
> 64k: 1 task queuedepth=1
> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
> 
> 64k: 16 tasks queuedepth=16
> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
> 
> 2M: 1 task queuedepth=1
> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
> 
> 2M: 16 tasks queuedepth=16
> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
> 
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> ---
<>
> +static void nd_pmem_dma_callback(void *data,
> +		const struct dmaengine_result *res)
> +{
> +	struct pmem_cmd *cmd = data;
> +	struct request *req = cmd->rq;
> +	struct request_queue *q = req->q;
> +	struct pmem_device *pmem = q->queuedata;
> +	struct nd_region *nd_region = to_region(pmem);
> +	struct device *dev = to_dev(pmem);
> +	int rc = 0;
> +
> +	dev_dbg(dev, "%s()\n", __func__);

Is this left-over debug, or did you mean to leave it in?

> +
> +	if (res) {
> +		enum dmaengine_tx_result dma_err = res->result;
> +
> +		switch (dma_err) {
> +		case DMA_TRANS_READ_FAILED:
> +		case DMA_TRANS_WRITE_FAILED:
> +		case DMA_TRANS_ABORTED:
> +			dev_dbg(dev, "bio failed\n");
> +			rc = -ENXIO;
> +			break;
> +		case DMA_TRANS_NOERROR:
> +		default:
> +			break;
> +		}
> +	}
> +
> +	if (req->cmd_flags & REQ_FUA)
> +		nvdimm_flush(nd_region);
> +
> +	dev_dbg(dev, "ending request\n");
> +	blk_mq_end_request(cmd->rq, rc);
> +}
> +
> +static int pmem_handle_cmd_dma(struct pmem_cmd *cmd, bool is_write)
> +{
> +	struct request *req = cmd->rq;
> +	struct request_queue *q = req->q;
> +	struct pmem_device *pmem = q->queuedata;
> +	struct device *dev = to_dev(pmem);
> +	phys_addr_t pmem_off = blk_rq_pos(req) * 512 + pmem->data_offset;
> +	void *pmem_addr = pmem->virt_addr + pmem_off;
> +	struct nd_region *nd_region = to_region(pmem);
> +	size_t len;
> +	struct dma_device *dma = cmd->chan->device;
> +	struct dmaengine_unmap_data *unmap;
> +	dma_cookie_t cookie;
> +	struct dma_async_tx_descriptor *txd;
> +	struct page *page;
> +	unsigned int off;
> +	int rc;
> +	enum dma_data_direction dir;
> +	dma_addr_t dma_addr;
> +
> +	if (req->cmd_flags & REQ_FLUSH)
> +		nvdimm_flush(nd_region);
> +
> +	unmap = dmaengine_get_unmap_data(dma->dev, 2, GFP_NOWAIT);
> +	if (!unmap) {
> +		dev_dbg(dev, "failed to get dma unmap data\n");
> +		rc = -ENOMEM;

The value of 'rc' isn't used at all in the error paths at the end of this
function.  Instead it ends the mq request with -ENXIO and returns -ENXIO
unconditionally.  That code should probably use 'rc' instead ... (continued in
next block)


> +		goto err;
> +	}
> +
> +	/*
> +	 * If reading from pmem, writing to scatterlist,
> +	 * and if writing to pmem, reading from scatterlist.
> +	 */
> +	dir = is_write ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
> +	cmd->sg_nents = blk_rq_map_sg(req->q, req, cmd->sg);
> +	if (cmd->sg_nents < 1) {
> +		rc = -EINVAL;
> +		goto err;
> +	}
> +
> +	if (cmd->sg_nents > 128) {
> +		rc = -ENOMEM;
> +		dev_warn(dev, "Number of sg greater than allocated\n");
> +		goto err;
> +	}
> +
> +	rc = dma_map_sg(dma->dev, cmd->sg, cmd->sg_nents, dir);
> +	if (rc < 1) {
> +		rc = -ENXIO;
> +		goto err;
> +	}
> +
> +	len = blk_rq_payload_bytes(req);
> +	page = virt_to_page(pmem_addr);
> +	off = (u64)pmem_addr & ~PAGE_MASK;
> +	dir = is_write ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
> +	dma_addr = dma_map_page(dma->dev, page, off, len, dir);
> +	if (dma_mapping_error(dma->dev, unmap->addr[0])) {
> +		dev_dbg(dma->dev, "src DMA mapping error\n");
> +		goto err_unmap_sg;

... which means that these later gotos need to set 'rc'.  This applies to the
rest of the gotos in this function.

> +	}
> +
> +	unmap->len = len;
> +
> +	if (is_write) {
> +		unmap->addr[0] = dma_addr;
> +		unmap->addr[1] = (dma_addr_t)cmd->sg;
> +		unmap->to_cnt = 1;
> +		unmap->from_cnt = 0;
> +		dma_unmap_data_sg_from_nents(unmap, 2) = cmd->sg_nents;
> +		txd = dma->device_prep_dma_memcpy_from_sg(cmd->chan, dma_addr,
> +				cmd->sg, cmd->sg_nents, DMA_PREP_INTERRUPT);
> +	} else {
> +		unmap->addr[0] = (dma_addr_t)cmd->sg;
> +		unmap->addr[1] = dma_addr;
> +		unmap->from_cnt = 1;
> +		unmap->to_cnt = 0;
> +		dma_unmap_data_sg_to_nents(unmap, 2) = cmd->sg_nents;
> +		txd = dma->device_prep_dma_memcpy_to_sg(cmd->chan, cmd->sg,
> +			cmd->sg_nents, dma_addr, DMA_PREP_INTERRUPT);
> +	}
> +
> +	if (!txd) {
> +		dev_dbg(dma->dev, "dma prep failed\n");
> +		goto err_unmap_buffer;
> +	}
> +
> +	txd->callback_result = nd_pmem_dma_callback;
> +	txd->callback_param = cmd;
> +	dma_set_unmap(txd, unmap);
> +	cookie = dmaengine_submit(txd);
> +	if (dma_submit_error(cookie)) {
> +		dev_dbg(dma->dev, "dma submit error\n");
> +		goto err_set_unmap;
> +	}
> +
> +	dmaengine_unmap_put(unmap);
> +	dma_async_issue_pending(cmd->chan);
> +
> +	return 0;
> +
> +err_set_unmap:
> +	dmaengine_unmap_put(unmap);
> +err_unmap_buffer:
> +	dma_unmap_page(dev, dma_addr, len, dir);
> +err_unmap_sg:
> +	if (dir == DMA_TO_DEVICE)
> +		dir = DMA_FROM_DEVICE;
> +	else
> +		dir = DMA_TO_DEVICE;
> +	dma_unmap_sg(dev, cmd->sg, cmd->sg_nents, dir);
> +	dmaengine_unmap_put(unmap);
> +err:
> +	blk_mq_end_request(cmd->rq, -ENXIO);
> +	return -ENXIO;
> +}
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] dmaengine: ioatdma: dma_prep_memcpy_to/from_sg support
  2017-08-01  2:14   ` Dan Williams
  2017-08-01 16:39     ` Dave Jiang
@ 2017-08-02  4:57     ` Vinod Koul
  1 sibling, 0 replies; 20+ messages in thread
From: Vinod Koul @ 2017-08-02  4:57 UTC (permalink / raw)
  To: Dan Williams; +Cc: dmaengine, linux-nvdimm

On Mon, Jul 31, 2017 at 07:14:10PM -0700, Dan Williams wrote:
> >  static struct dma_async_tx_descriptor *
> >  __ioat_prep_xor_lock(struct dma_chan *c, enum sum_check_flags *result,
> > diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
> > index 5336808..6202e7c 100644
> > --- a/include/linux/dmaengine.h
> > +++ b/include/linux/dmaengine.h
> > @@ -694,6 +694,8 @@ struct dma_filter {
> >   * @device_prep_dma_memset: prepares a memset operation
> >   * @device_prep_dma_memset_sg: prepares a memset operation over a scatter list
> >   * @device_prep_dma_interrupt: prepares an end of chain interrupt operation
> > + * @device_prep_dma_memcpy_to_sg: prepares memcpy from buffer to scatterlist
> > + * @device_prep_dma_memcpy_from_sg: prepares memcpy from scatterlist to buffer
> >   * @device_prep_slave_sg: prepares a slave dma operation
> >   * @device_prep_dma_cyclic: prepare a cyclic dma operation suitable for audio.
> >   *     The function takes a buffer of size buf_len. The callback function will
> > @@ -776,6 +778,14 @@ struct dma_device {
> >                 struct scatterlist *dst_sg, unsigned int dst_nents,
> >                 struct scatterlist *src_sg, unsigned int src_nents,
> >                 unsigned long flags);
> > +       struct dma_async_tx_descriptor *(*device_prep_dma_memcpy_to_sg)(
> > +               struct dma_chan *chan,
> > +               struct scatterlist *dst_sg, unsigned int dst_nents,
> > +               dma_addr_t src, unsigned long flags);
> > +       struct dma_async_tx_descriptor *(*device_prep_dma_memcpy_from_sg)(
> > +               struct dma_chan *chan, dma_addr_t dst,
> > +               struct scatterlist *dst_sg, unsigned int dst_nents,
> > +               unsigned long flags);
> >
> >         struct dma_async_tx_descriptor *(*device_prep_slave_sg)(
> >                 struct dma_chan *chan, struct scatterlist *sgl,
> >
> 
> Can we get away with just adding one new operation with a flag to
> indicate whether it is the 'to' or' 'from' sg case?

Yeah that would be better..

-- 
~Vinod
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq
  2017-08-01 17:43     ` Dan Williams
@ 2017-08-03  8:06       ` Johannes Thumshirn
  2017-08-03 15:41         ` Dan Williams
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Thumshirn @ 2017-08-03  8:06 UTC (permalink / raw)
  To: Dan Williams; +Cc: Vinod Koul, dmaengine, linux-nvdimm

On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote:
> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> > Dave Jiang <dave.jiang@intel.com> writes:
> >
> >> Adding DMA support for pmem blk reads. This provides signficant CPU
> >> reduction with large memory reads with good performance. DMAs are triggered
> >> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
> >> are still performed by the CPU in order to reduce latency. By default
> >> the pmem driver will be using blk-mq with DMA.
> >>
> >> Numbers below are measured against pmem simulated via DRAM using
> >> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
> >> platform.  Keep in mind the performance for actual persistent memory
> >> will differ.
> >> Fio 2.21 was used.
> >>
> >> 64k: 1 task queuedepth=1
> >> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
> >> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
> >>
> >> 64k: 16 tasks queuedepth=16
> >> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
> >> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
> >>
> >> 2M: 1 task queuedepth=1
> >> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
> >> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
> >>
> >> 2M: 16 tasks queuedepth=16
> >> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
> >> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
> >>
> >> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> >> ---
> >
> > Hi Dave,
> >
> > The above table shows that there's a performance benefit for 2M
> > transfers but a regression for 64k transfers, if we forget about the CPU
> > utilization for a second.
> 
> I don't think we can forget about cpu utilization, and I would expect
> most users would value cpu over small bandwidth increases. Especially
> when those numbers show efficiency like this
> 
> +160% cpu +26% read bandwidth
> +171% cpu -10% write bandwidth
> 
> > Would it be beneficial to have heuristics on
> > the transfer size that decide when to use dma and when not? You
> > introduced this hunk:
> >
> > -   rc = pmem_handle_cmd(cmd);
> > +   if (cmd->chan && bio_multiple_segments(req->bio))
> > +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
> > +   else
> > +       rc = pmem_handle_cmd(cmd);
> >
> > Which utilizes dma for bios with multiple segments and for single
> > segment bios you use the old path, maybe the single/multi segment logic
> > can be amended to have something like:
> >
> >     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
> >        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
> >     else
> >        rc = pmem_handle_cmd(cmd);
> >
> > Just something woth considering IMHO.
> 
> The thing we want to avoid most is having the cpu stall doing nothing
> waiting for memory access, so I think once the efficiency of adding
> more cpu goes non-linear it's better to let the dma engine handle
> that. The current heuristic seems to achieve that, but the worry is
> does it over-select the cpu for cases where requests could be merged
> into a bulkier i/o that is more suitable for dma.
> 
> I'm not sure we want another tunable vs predictable / efficient
> behavior by default.

Sorry for my late reply, but lots of automated performance regression CI tools
_will_ report errors because of the performance drop. Yes these may be false
positives, but it'll cost hours to resolve them. If we have a tunable it's way
easier to set it correctly for the use cases. And please keep in mind some
people don't really case about CPU utilization but maxiumum throughput.

Thanks,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq
  2017-08-03  8:06       ` Johannes Thumshirn
@ 2017-08-03 15:41         ` Dan Williams
  2017-08-03 16:12           ` Dave Jiang
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Williams @ 2017-08-03 15:41 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: Vinod Koul, dmaengine, linux-nvdimm

On Thu, Aug 3, 2017 at 1:06 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote:
>> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>> > Dave Jiang <dave.jiang@intel.com> writes:
>> >
>> >> Adding DMA support for pmem blk reads. This provides signficant CPU
>> >> reduction with large memory reads with good performance. DMAs are triggered
>> >> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
>> >> are still performed by the CPU in order to reduce latency. By default
>> >> the pmem driver will be using blk-mq with DMA.
>> >>
>> >> Numbers below are measured against pmem simulated via DRAM using
>> >> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
>> >> platform.  Keep in mind the performance for actual persistent memory
>> >> will differ.
>> >> Fio 2.21 was used.
>> >>
>> >> 64k: 1 task queuedepth=1
>> >> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
>> >> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>> >>
>> >> 64k: 16 tasks queuedepth=16
>> >> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
>> >> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>> >>
>> >> 2M: 1 task queuedepth=1
>> >> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
>> >> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>> >>
>> >> 2M: 16 tasks queuedepth=16
>> >> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
>> >> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>> >>
>> >> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>> >> ---
>> >
>> > Hi Dave,
>> >
>> > The above table shows that there's a performance benefit for 2M
>> > transfers but a regression for 64k transfers, if we forget about the CPU
>> > utilization for a second.
>>
>> I don't think we can forget about cpu utilization, and I would expect
>> most users would value cpu over small bandwidth increases. Especially
>> when those numbers show efficiency like this
>>
>> +160% cpu +26% read bandwidth
>> +171% cpu -10% write bandwidth
>>
>> > Would it be beneficial to have heuristics on
>> > the transfer size that decide when to use dma and when not? You
>> > introduced this hunk:
>> >
>> > -   rc = pmem_handle_cmd(cmd);
>> > +   if (cmd->chan && bio_multiple_segments(req->bio))
>> > +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
>> > +   else
>> > +       rc = pmem_handle_cmd(cmd);
>> >
>> > Which utilizes dma for bios with multiple segments and for single
>> > segment bios you use the old path, maybe the single/multi segment logic
>> > can be amended to have something like:
>> >
>> >     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
>> >        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
>> >     else
>> >        rc = pmem_handle_cmd(cmd);
>> >
>> > Just something woth considering IMHO.
>>
>> The thing we want to avoid most is having the cpu stall doing nothing
>> waiting for memory access, so I think once the efficiency of adding
>> more cpu goes non-linear it's better to let the dma engine handle
>> that. The current heuristic seems to achieve that, but the worry is
>> does it over-select the cpu for cases where requests could be merged
>> into a bulkier i/o that is more suitable for dma.
>>
>> I'm not sure we want another tunable vs predictable / efficient
>> behavior by default.
>
> Sorry for my late reply, but lots of automated performance regression CI tools
> _will_ report errors because of the performance drop. Yes these may be false
> positives, but it'll cost hours to resolve them. If we have a tunable it's way
> easier to set it correctly for the use cases. And please keep in mind some
> people don't really case about CPU utilization but maxiumum throughput.

Returning more cpu to the application may also increase bandwidth
because we stop spinning on completion and spend more time issuing
I/O. I'm also worried that this threshold cross-over point is workload
and pmem media specific. The tunable should be there as an override,
but I think we need simple/sane defaults out of the gate. A "no
regressions whatsoever on any metric" policy effectively kills this
effort as far as I can see. We otherwise need quite a bit more
qualification of different workloads, these 3 fio runs is not enough.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq
  2017-08-03 15:41         ` Dan Williams
@ 2017-08-03 16:12           ` Dave Jiang
  2017-08-03 16:15             ` Dan Williams
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Jiang @ 2017-08-03 16:12 UTC (permalink / raw)
  To: Dan Williams, Johannes Thumshirn; +Cc: Koul, Vinod, dmaengine, linux-nvdimm



On 08/03/2017 08:41 AM, Dan Williams wrote:
> On Thu, Aug 3, 2017 at 1:06 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>> On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote:
>>> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>>>> Dave Jiang <dave.jiang@intel.com> writes:
>>>>
>>>>> Adding DMA support for pmem blk reads. This provides signficant CPU
>>>>> reduction with large memory reads with good performance. DMAs are triggered
>>>>> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
>>>>> are still performed by the CPU in order to reduce latency. By default
>>>>> the pmem driver will be using blk-mq with DMA.
>>>>>
>>>>> Numbers below are measured against pmem simulated via DRAM using
>>>>> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
>>>>> platform.  Keep in mind the performance for actual persistent memory
>>>>> will differ.
>>>>> Fio 2.21 was used.
>>>>>
>>>>> 64k: 1 task queuedepth=1
>>>>> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
>>>>> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>>>>>
>>>>> 64k: 16 tasks queuedepth=16
>>>>> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
>>>>> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>>>>>
>>>>> 2M: 1 task queuedepth=1
>>>>> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
>>>>> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>>>>>
>>>>> 2M: 16 tasks queuedepth=16
>>>>> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
>>>>> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>>>>>
>>>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>>>>> ---
>>>>
>>>> Hi Dave,
>>>>
>>>> The above table shows that there's a performance benefit for 2M
>>>> transfers but a regression for 64k transfers, if we forget about the CPU
>>>> utilization for a second.
>>>
>>> I don't think we can forget about cpu utilization, and I would expect
>>> most users would value cpu over small bandwidth increases. Especially
>>> when those numbers show efficiency like this
>>>
>>> +160% cpu +26% read bandwidth
>>> +171% cpu -10% write bandwidth
>>>
>>>> Would it be beneficial to have heuristics on
>>>> the transfer size that decide when to use dma and when not? You
>>>> introduced this hunk:
>>>>
>>>> -   rc = pmem_handle_cmd(cmd);
>>>> +   if (cmd->chan && bio_multiple_segments(req->bio))
>>>> +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
>>>> +   else
>>>> +       rc = pmem_handle_cmd(cmd);
>>>>
>>>> Which utilizes dma for bios with multiple segments and for single
>>>> segment bios you use the old path, maybe the single/multi segment logic
>>>> can be amended to have something like:
>>>>
>>>>     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
>>>>        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
>>>>     else
>>>>        rc = pmem_handle_cmd(cmd);
>>>>
>>>> Just something woth considering IMHO.
>>>
>>> The thing we want to avoid most is having the cpu stall doing nothing
>>> waiting for memory access, so I think once the efficiency of adding
>>> more cpu goes non-linear it's better to let the dma engine handle
>>> that. The current heuristic seems to achieve that, but the worry is
>>> does it over-select the cpu for cases where requests could be merged
>>> into a bulkier i/o that is more suitable for dma.
>>>
>>> I'm not sure we want another tunable vs predictable / efficient
>>> behavior by default.
>>
>> Sorry for my late reply, but lots of automated performance regression CI tools
>> _will_ report errors because of the performance drop. Yes these may be false
>> positives, but it'll cost hours to resolve them. If we have a tunable it's way
>> easier to set it correctly for the use cases. And please keep in mind some
>> people don't really case about CPU utilization but maxiumum throughput.
> 
> Returning more cpu to the application may also increase bandwidth
> because we stop spinning on completion and spend more time issuing
> I/O. I'm also worried that this threshold cross-over point is workload
> and pmem media specific. The tunable should be there as an override,
> but I think we need simple/sane defaults out of the gate. A "no
> regressions whatsoever on any metric" policy effectively kills this
> effort as far as I can see. We otherwise need quite a bit more
> qualification of different workloads, these 3 fio runs is not enough.
> 

I can put in a tunable knob like Johannes suggested and leave the
default for that to what it is currently. And we can adjust as necessary
as we do more testing.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq
  2017-08-03 16:12           ` Dave Jiang
@ 2017-08-03 16:15             ` Dan Williams
  2017-08-04  6:07               ` Johannes Thumshirn
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Williams @ 2017-08-03 16:15 UTC (permalink / raw)
  To: Dave Jiang; +Cc: Koul, Vinod, dmaengine, linux-nvdimm

On Thu, Aug 3, 2017 at 9:12 AM, Dave Jiang <dave.jiang@intel.com> wrote:
>
>
> On 08/03/2017 08:41 AM, Dan Williams wrote:
>> On Thu, Aug 3, 2017 at 1:06 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>>> On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote:
>>>> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>>>>> Dave Jiang <dave.jiang@intel.com> writes:
>>>>>
>>>>>> Adding DMA support for pmem blk reads. This provides signficant CPU
>>>>>> reduction with large memory reads with good performance. DMAs are triggered
>>>>>> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
>>>>>> are still performed by the CPU in order to reduce latency. By default
>>>>>> the pmem driver will be using blk-mq with DMA.
>>>>>>
>>>>>> Numbers below are measured against pmem simulated via DRAM using
>>>>>> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
>>>>>> platform.  Keep in mind the performance for actual persistent memory
>>>>>> will differ.
>>>>>> Fio 2.21 was used.
>>>>>>
>>>>>> 64k: 1 task queuedepth=1
>>>>>> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
>>>>>> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>>>>>>
>>>>>> 64k: 16 tasks queuedepth=16
>>>>>> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
>>>>>> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>>>>>>
>>>>>> 2M: 1 task queuedepth=1
>>>>>> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
>>>>>> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>>>>>>
>>>>>> 2M: 16 tasks queuedepth=16
>>>>>> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
>>>>>> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>>>>>>
>>>>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>>>>>> ---
>>>>>
>>>>> Hi Dave,
>>>>>
>>>>> The above table shows that there's a performance benefit for 2M
>>>>> transfers but a regression for 64k transfers, if we forget about the CPU
>>>>> utilization for a second.
>>>>
>>>> I don't think we can forget about cpu utilization, and I would expect
>>>> most users would value cpu over small bandwidth increases. Especially
>>>> when those numbers show efficiency like this
>>>>
>>>> +160% cpu +26% read bandwidth
>>>> +171% cpu -10% write bandwidth
>>>>
>>>>> Would it be beneficial to have heuristics on
>>>>> the transfer size that decide when to use dma and when not? You
>>>>> introduced this hunk:
>>>>>
>>>>> -   rc = pmem_handle_cmd(cmd);
>>>>> +   if (cmd->chan && bio_multiple_segments(req->bio))
>>>>> +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
>>>>> +   else
>>>>> +       rc = pmem_handle_cmd(cmd);
>>>>>
>>>>> Which utilizes dma for bios with multiple segments and for single
>>>>> segment bios you use the old path, maybe the single/multi segment logic
>>>>> can be amended to have something like:
>>>>>
>>>>>     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
>>>>>        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
>>>>>     else
>>>>>        rc = pmem_handle_cmd(cmd);
>>>>>
>>>>> Just something woth considering IMHO.
>>>>
>>>> The thing we want to avoid most is having the cpu stall doing nothing
>>>> waiting for memory access, so I think once the efficiency of adding
>>>> more cpu goes non-linear it's better to let the dma engine handle
>>>> that. The current heuristic seems to achieve that, but the worry is
>>>> does it over-select the cpu for cases where requests could be merged
>>>> into a bulkier i/o that is more suitable for dma.
>>>>
>>>> I'm not sure we want another tunable vs predictable / efficient
>>>> behavior by default.
>>>
>>> Sorry for my late reply, but lots of automated performance regression CI tools
>>> _will_ report errors because of the performance drop. Yes these may be false
>>> positives, but it'll cost hours to resolve them. If we have a tunable it's way
>>> easier to set it correctly for the use cases. And please keep in mind some
>>> people don't really case about CPU utilization but maxiumum throughput.
>>
>> Returning more cpu to the application may also increase bandwidth
>> because we stop spinning on completion and spend more time issuing
>> I/O. I'm also worried that this threshold cross-over point is workload
>> and pmem media specific. The tunable should be there as an override,
>> but I think we need simple/sane defaults out of the gate. A "no
>> regressions whatsoever on any metric" policy effectively kills this
>> effort as far as I can see. We otherwise need quite a bit more
>> qualification of different workloads, these 3 fio runs is not enough.
>>
>
> I can put in a tunable knob like Johannes suggested and leave the
> default for that to what it is currently. And we can adjust as necessary
> as we do more testing.

That's the problem, we need a multi-dimensional knob. It's not just
transfer-size that effects total delivered bandwidth.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq
  2017-08-03 16:15             ` Dan Williams
@ 2017-08-04  6:07               ` Johannes Thumshirn
  2017-08-04 15:47                 ` Dan Williams
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Thumshirn @ 2017-08-04  6:07 UTC (permalink / raw)
  To: Dan Williams; +Cc: Koul, Vinod, dmaengine, linux-nvdimm

On Thu, Aug 03, 2017 at 09:15:12AM -0700, Dan Williams wrote:
> > I can put in a tunable knob like Johannes suggested and leave the
> > default for that to what it is currently. And we can adjust as necessary
> > as we do more testing.
> 
> That's the problem, we need a multi-dimensional knob. It's not just
> transfer-size that effects total delivered bandwidth.

If Dan opposes it so much, I'm OK with no tunable. I'm just expressing my
concern that we get reports like this [1] one here as a result.

Thanks,
	Johannes

[1] http://marc.info/?l=linux-kernel&m=150175029403913&w=2

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq
  2017-08-04  6:07               ` Johannes Thumshirn
@ 2017-08-04 15:47                 ` Dan Williams
  0 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2017-08-04 15:47 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: Koul, Vinod, dmaengine, linux-nvdimm

On Thu, Aug 3, 2017 at 11:07 PM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On Thu, Aug 03, 2017 at 09:15:12AM -0700, Dan Williams wrote:
>> > I can put in a tunable knob like Johannes suggested and leave the
>> > default for that to what it is currently. And we can adjust as necessary
>> > as we do more testing.
>>
>> That's the problem, we need a multi-dimensional knob. It's not just
>> transfer-size that effects total delivered bandwidth.
>
> If Dan opposes it so much, I'm OK with no tunable. I'm just expressing my
> concern that we get reports like this [1] one here as a result.

Sorry if I came off harsh. In that report from Mel he's showing a slow
down in a representative workload not just a canned bandwidth
benchmark. I'm worried that tuning for a single dimension on an fio
run doesn't actually improve throughput delivered to an application.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-08-04 15:45 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-31 22:24 [PATCH 0/5] Adding blk-mq and DMA support to pmem block driver Dave Jiang
2017-07-31 22:24 ` [PATCH 1/5] dmaengine: ioatdma: revert 7618d035 to allow sharing of DMA channels Dave Jiang
2017-07-31 22:24 ` [PATCH 2/5] dmaengine: ioatdma: dma_prep_memcpy_to/from_sg support Dave Jiang
2017-08-01  2:14   ` Dan Williams
2017-08-01 16:39     ` Dave Jiang
2017-08-02  4:57     ` Vinod Koul
2017-07-31 22:24 ` [PATCH 3/5] dmaengine: add SG support to dmaengine_unmap Dave Jiang
2017-07-31 22:24 ` [PATCH 4/5] libnvdimm: Adding blk-mq support to the pmem driver Dave Jiang
2017-08-01 19:02   ` Ross Zwisler
2017-07-31 22:24 ` [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq Dave Jiang
2017-08-01  7:34   ` Johannes Thumshirn
2017-08-01 16:40     ` Dave Jiang
2017-08-01 17:43     ` Dan Williams
2017-08-03  8:06       ` Johannes Thumshirn
2017-08-03 15:41         ` Dan Williams
2017-08-03 16:12           ` Dave Jiang
2017-08-03 16:15             ` Dan Williams
2017-08-04  6:07               ` Johannes Thumshirn
2017-08-04 15:47                 ` Dan Williams
2017-08-01 20:42   ` Ross Zwisler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.