* [PATCH 1/3] nvme-rdma: Avoid preallocating big SGL for data
2019-11-24 16:38 [PATCH 0/3] nvme: Avoid preallocating big SGL for data Israel Rukshin
@ 2019-11-24 16:38 ` Israel Rukshin
2019-11-26 16:53 ` Christoph Hellwig
2019-11-24 16:38 ` [PATCH 2/3] nvme-fc: " Israel Rukshin
2019-11-24 16:38 ` [PATCH 3/3] nvmet-loop: " Israel Rukshin
2 siblings, 1 reply; 9+ messages in thread
From: Israel Rukshin @ 2019-11-24 16:38 UTC
To: Linux-nvme, Sagi Grimberg, Christoph Hellwig, James Smart, Keith Busch
Cc: Israel Rukshin, Max Gurtovoy
nvme_rdma_alloc_tagset() preallocates a big buffer for the IO SGL based
on SG_CHUNK_SIZE.
Modern DMA engines are often capable of dealing with very big segments so
the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
SGL allocation per command.
If a controller has lots of deep queues, preallocation for the sg list can
consume substantial amounts of memory. For nvme-rdma, nr_hw_queues can be
128 and each queue's depth 128. This means the resulting preallocation
for the data SGL is 128*128*4K = 64MB per controller.
Switch to runtime allocation for SGL for lists longer than 2 entries. This
is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
well. Runtime SGL allocation has always been the case for the legacy I/O
path so this is nothing new.
The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't
support SG_CHAIN, use only runtime allocation for the SGL.
We didn't notice any performance degradation: small IOs use the inline
SG, and for bigger IOs the allocation of a larger SGL from slab is fast
enough.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
drivers/nvme/host/nvme.h | 6 ++++++
drivers/nvme/host/rdma.c | 10 +++++-----
2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 34ac79c..3615145 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -27,6 +27,12 @@
#define NVME_DEFAULT_KATO 5
#define NVME_KATO_GRACE 10
+#ifdef CONFIG_ARCH_NO_SG_CHAIN
+#define NVME_INLINE_SG_CNT 0
+#else
+#define NVME_INLINE_SG_CNT 2
+#endif
+
extern struct workqueue_struct *nvme_wq;
extern struct workqueue_struct *nvme_reset_wq;
extern struct workqueue_struct *nvme_delete_wq;
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 05f2dfa..9a02fde 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -731,7 +731,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
set->reserved_tags = 2; /* connect + keep-alive */
set->numa_node = nctrl->numa_node;
set->cmd_size = sizeof(struct nvme_rdma_request) +
- SG_CHUNK_SIZE * sizeof(struct scatterlist);
+ NVME_INLINE_SG_CNT * sizeof(struct scatterlist);
set->driver_data = ctrl;
set->nr_hw_queues = 1;
set->timeout = ADMIN_TIMEOUT;
@@ -745,7 +745,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
set->numa_node = nctrl->numa_node;
set->flags = BLK_MQ_F_SHOULD_MERGE;
set->cmd_size = sizeof(struct nvme_rdma_request) +
- SG_CHUNK_SIZE * sizeof(struct scatterlist);
+ NVME_INLINE_SG_CNT * sizeof(struct scatterlist);
set->driver_data = ctrl;
set->nr_hw_queues = nctrl->queue_count - 1;
set->timeout = NVME_IO_TIMEOUT;
@@ -1160,7 +1160,7 @@ static void nvme_rdma_unmap_data(struct nvme_rdma_queue *queue,
}
ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, rq_dma_dir(rq));
- sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
+ sg_free_table_chained(&req->sg_table, NVME_INLINE_SG_CNT);
}
static int nvme_rdma_set_sg_null(struct nvme_command *c)
@@ -1276,7 +1276,7 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
req->sg_table.sgl = req->first_sgl;
ret = sg_alloc_table_chained(&req->sg_table,
blk_rq_nr_phys_segments(rq), req->sg_table.sgl,
- SG_CHUNK_SIZE);
+ NVME_INLINE_SG_CNT);
if (ret)
return -ENOMEM;
@@ -1314,7 +1314,7 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
out_unmap_sg:
ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, rq_dma_dir(rq));
out_free_table:
- sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
+ sg_free_table_chained(&req->sg_table, NVME_INLINE_SG_CNT);
return ret;
}
--
1.8.3.1
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
* [PATCH 2/3] nvme-fc: Avoid preallocating big SGL for data
2019-11-24 16:38 [PATCH 0/3] nvme: Avoid preallocating big SGL for data Israel Rukshin
2019-11-24 16:38 ` [PATCH 1/3] nvme-rdma: " Israel Rukshin
@ 2019-11-24 16:38 ` Israel Rukshin
2019-11-25 17:04 ` James Smart
2019-11-24 16:38 ` [PATCH 3/3] nvmet-loop: " Israel Rukshin
2 siblings, 1 reply; 9+ messages in thread
From: Israel Rukshin @ 2019-11-24 16:38 UTC
To: Linux-nvme, Sagi Grimberg, Christoph Hellwig, James Smart, Keith Busch
Cc: Israel Rukshin, Max Gurtovoy
nvme_fc_create_io_queues() preallocates a big buffer for the IO SGL based
on SG_CHUNK_SIZE.
Modern DMA engines are often capable of dealing with very big segments so
the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
SGL allocation per command.
If a controller has lots of deep queues, preallocation for the sg list can
consume substantial amounts of memory. For nvme-fc, nr_hw_queues can be
128 and each queue's depth 128. This means the resulting preallocation
for the data SGL is 128*128*4K = 64MB per controller.
Switch to runtime allocation for SGL for lists longer than 2 entries. This
is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
well. Runtime SGL allocation has always been the case for the legacy I/O
path so this is nothing new.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
drivers/nvme/host/fc.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 679a721..13cb00e 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -95,7 +95,7 @@ struct nvme_fc_fcp_op {
struct nvme_fcp_op_w_sgl {
struct nvme_fc_fcp_op op;
- struct scatterlist sgl[SG_CHUNK_SIZE];
+ struct scatterlist sgl[NVME_INLINE_SG_CNT];
uint8_t priv[0];
};
@@ -2141,7 +2141,7 @@ enum {
freq->sg_table.sgl = freq->first_sgl;
ret = sg_alloc_table_chained(&freq->sg_table,
blk_rq_nr_phys_segments(rq), freq->sg_table.sgl,
- SG_CHUNK_SIZE);
+ NVME_INLINE_SG_CNT);
if (ret)
return -ENOMEM;
@@ -2150,7 +2150,7 @@ enum {
freq->sg_cnt = fc_dma_map_sg(ctrl->lport->dev, freq->sg_table.sgl,
op->nents, rq_dma_dir(rq));
if (unlikely(freq->sg_cnt <= 0)) {
- sg_free_table_chained(&freq->sg_table, SG_CHUNK_SIZE);
+ sg_free_table_chained(&freq->sg_table, NVME_INLINE_SG_CNT);
freq->sg_cnt = 0;
return -EFAULT;
}
@@ -2173,7 +2173,7 @@ enum {
fc_dma_unmap_sg(ctrl->lport->dev, freq->sg_table.sgl, op->nents,
rq_dma_dir(rq));
- sg_free_table_chained(&freq->sg_table, SG_CHUNK_SIZE);
+ sg_free_table_chained(&freq->sg_table, NVME_INLINE_SG_CNT);
freq->sg_cnt = 0;
}
--
1.8.3.1
* Re: [PATCH 2/3] nvme-fc: Avoid preallocating big SGL for data
2019-11-24 16:38 ` [PATCH 2/3] nvme-fc: " Israel Rukshin
@ 2019-11-25 17:04 ` James Smart
0 siblings, 0 replies; 9+ messages in thread
From: James Smart @ 2019-11-25 17:04 UTC
To: Israel Rukshin, Linux-nvme, Sagi Grimberg, Christoph Hellwig,
James Smart, Keith Busch
Cc: Max Gurtovoy
On 11/24/2019 8:38 AM, Israel Rukshin wrote:
> nvme_fc_create_io_queues() preallocates a big buffer for the IO SGL based
> on SG_CHUNK_SIZE.
>
> Modern DMA engines are often capable of dealing with very big segments so
> the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
> SGL allocation per command.
>
> If a controller has lots of deep queues, preallocation for the sg list can
> consume substantial amounts of memory. For nvme-fc, nr_hw_queues can be
> 128 and each queue's depth 128. This means the resulting preallocation
> for the data SGL is 128*128*4K = 64MB per controller.
>
> Switch to runtime allocation for SGL for lists longer than 2 entries. This
> is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
> well. Runtime SGL allocation has always been the case for the legacy I/O
> path so this is nothing new.
>
> Signed-off-by: Israel Rukshin <israelr@mellanox.com>
> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
> ---
> drivers/nvme/host/fc.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
>
Looks ok to me.
Reviewed-by: James Smart <james.smart@broadcom.com>
Note: I would have preferred to see this as 4 patches, with patch 1 being
the header file addition only, but that's a minor nit.
-- james
* [PATCH 3/3] nvmet-loop: Avoid preallocating big SGL for data
2019-11-24 16:38 [PATCH 0/3] nvme: Avoid preallocating big SGL for data Israel Rukshin
2019-11-24 16:38 ` [PATCH 1/3] nvme-rdma: " Israel Rukshin
2019-11-24 16:38 ` [PATCH 2/3] nvme-fc: " Israel Rukshin
@ 2019-11-24 16:38 ` Israel Rukshin
2019-11-25 2:24 ` Chaitanya Kulkarni
` (2 more replies)
2 siblings, 3 replies; 9+ messages in thread
From: Israel Rukshin @ 2019-11-24 16:38 UTC
To: Linux-nvme, Sagi Grimberg, Christoph Hellwig, James Smart, Keith Busch
Cc: Israel Rukshin, Max Gurtovoy
nvme_loop_create_io_queues() preallocates a big buffer for the IO SGL based
on SG_CHUNK_SIZE.
Modern DMA engines are often capable of dealing with very big segments so
the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
SGL allocation per command.
If a controller has lots of deep queues, preallocation for the sg list can
consume substantial amounts of memory. For nvmet-loop, nr_hw_queues can be
128 and each queue's depth 128. This means the resulting preallocation
for the data SGL is 128*128*4K = 64MB per controller.
Switch to runtime allocation for SGL for lists longer than 2 entries. This
is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
well. Runtime SGL allocation has always been the case for the legacy I/O
path so this is nothing new.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
drivers/nvme/target/loop.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 856eb06..dae31bf 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -76,7 +76,7 @@ static void nvme_loop_complete_rq(struct request *req)
{
struct nvme_loop_iod *iod = blk_mq_rq_to_pdu(req);
- sg_free_table_chained(&iod->sg_table, SG_CHUNK_SIZE);
+ sg_free_table_chained(&iod->sg_table, NVME_INLINE_SG_CNT);
nvme_complete_rq(req);
}
@@ -156,7 +156,7 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
iod->sg_table.sgl = iod->first_sgl;
if (sg_alloc_table_chained(&iod->sg_table,
blk_rq_nr_phys_segments(req),
- iod->sg_table.sgl, SG_CHUNK_SIZE))
+ iod->sg_table.sgl, NVME_INLINE_SG_CNT))
return BLK_STS_RESOURCE;
iod->req.sg = iod->sg_table.sgl;
@@ -340,7 +340,7 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
ctrl->admin_tag_set.reserved_tags = 2; /* connect + keep-alive */
ctrl->admin_tag_set.numa_node = NUMA_NO_NODE;
ctrl->admin_tag_set.cmd_size = sizeof(struct nvme_loop_iod) +
- SG_CHUNK_SIZE * sizeof(struct scatterlist);
+ NVME_INLINE_SG_CNT * sizeof(struct scatterlist);
ctrl->admin_tag_set.driver_data = ctrl;
ctrl->admin_tag_set.nr_hw_queues = 1;
ctrl->admin_tag_set.timeout = ADMIN_TIMEOUT;
@@ -514,7 +514,7 @@ static int nvme_loop_create_io_queues(struct nvme_loop_ctrl *ctrl)
ctrl->tag_set.numa_node = NUMA_NO_NODE;
ctrl->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
ctrl->tag_set.cmd_size = sizeof(struct nvme_loop_iod) +
- SG_CHUNK_SIZE * sizeof(struct scatterlist);
+ NVME_INLINE_SG_CNT * sizeof(struct scatterlist);
ctrl->tag_set.driver_data = ctrl;
ctrl->tag_set.nr_hw_queues = ctrl->ctrl.queue_count - 1;
ctrl->tag_set.timeout = NVME_IO_TIMEOUT;
--
1.8.3.1
* Re: [PATCH 3/3] nvmet-loop: Avoid preallocating big SGL for data
2019-11-24 16:38 ` [PATCH 3/3] nvmet-loop: " Israel Rukshin
@ 2019-11-25 2:24 ` Chaitanya Kulkarni
2019-11-26 16:53 ` Christoph Hellwig
2019-11-26 17:40 ` Keith Busch
2 siblings, 0 replies; 9+ messages in thread
From: Chaitanya Kulkarni @ 2019-11-25 2:24 UTC
To: Israel Rukshin, Linux-nvme, Sagi Grimberg, Christoph Hellwig,
James Smart, Keith Busch
Cc: Max Gurtovoy
Looks good, tested with a simple loop setup running randread
4K I/Os with fio.
Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
On 11/24/2019 08:39 AM, Israel Rukshin wrote:
> nvme_loop_create_io_queues() preallocates a big buffer for the IO SGL based
> on SG_CHUNK_SIZE.
>
> Modern DMA engines are often capable of dealing with very big segments so
> the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
> SGL allocation per command.
>
> If a controller has lots of deep queues, preallocation for the sg list can
> consume substantial amounts of memory. For nvmet-loop, nr_hw_queues can be
> 128 and each queue's depth 128. This means the resulting preallocation
> for the data SGL is 128*128*4K = 64MB per controller.
>
> Switch to runtime allocation for SGL for lists longer than 2 entries. This
> is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
> well. Runtime SGL allocation has always been the case for the legacy I/O
> path so this is nothing new.
>
> Signed-off-by: Israel Rukshin<israelr@mellanox.com>
> Reviewed-by: Max Gurtovoy<maxg@mellanox.com>
* Re: [PATCH 3/3] nvmet-loop: Avoid preallocating big SGL for data
2019-11-24 16:38 ` [PATCH 3/3] nvmet-loop: " Israel Rukshin
2019-11-25 2:24 ` Chaitanya Kulkarni
@ 2019-11-26 16:53 ` Christoph Hellwig
2019-11-26 17:40 ` Keith Busch
2 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2019-11-26 16:53 UTC
To: Israel Rukshin
Cc: James Smart, Sagi Grimberg, Linux-nvme, Keith Busch,
Max Gurtovoy, Christoph Hellwig
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 3/3] nvmet-loop: Avoid preallocating big SGL for data
2019-11-24 16:38 ` [PATCH 3/3] nvmet-loop: " Israel Rukshin
2019-11-25 2:24 ` Chaitanya Kulkarni
2019-11-26 16:53 ` Christoph Hellwig
@ 2019-11-26 17:40 ` Keith Busch
2 siblings, 0 replies; 9+ messages in thread
From: Keith Busch @ 2019-11-26 17:40 UTC
To: Israel Rukshin
Cc: Max Gurtovoy, James Smart, Sagi Grimberg, Linux-nvme, Christoph Hellwig
On Sun, Nov 24, 2019 at 06:38:32PM +0200, Israel Rukshin wrote:
> @@ -156,7 +156,7 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> iod->sg_table.sgl = iod->first_sgl;
> if (sg_alloc_table_chained(&iod->sg_table,
> blk_rq_nr_phys_segments(req),
> - iod->sg_table.sgl, SG_CHUNK_SIZE))
> + iod->sg_table.sgl, NVME_INLINE_SG_CNT))
> return BLK_STS_RESOURCE;
Minor merge conflict here from a resource leak fix from Max, but I fixed
it up.
Series applied to nvme/for-5.5
_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 9+ messages in thread