* [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver
From: Sagi Grimberg @ 2020-06-19 0:30 UTC
To: linux-nvme, Christoph Hellwig, Keith Busch
Cc: Anil Vasudevan, Mark Wunderlich
Hey, here are some datapath micro-optimizations for the nvme-tcp host
driver. The optimizations are:
1. Make the request list for send operations a lockless list, and have
   the I/O thread pull from it into a queue-local send_list in batches.
2. Leverage blk-mq request plugging so the I/O worker is scheduled only
   once per batch of requests.
3. Optimize msg flags to signal the network stack that more data is
   coming (MSG_MORE) when more requests are expected to be sent soon.
A simple benchmark shows a nice IOPs win at high queue depth (32) with
batching applied (8 requests per batch), while QD=1 latency is not
impacted.
Tested on ConnectX-4 device (LRO disabled).
CPU: Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
Baseline:
========
QDepth/Batch | IOPs [k]
--------------------------
1/1 | 43.2
32/8 | 106
with patchset:
=============
QDepth/Batch | IOPs [k]
--------------------------
1/1 | 43.3
32/8 | 165
Do note that in an alternative setup (different NIC, faster CPU) the
tests did not show a noticeable difference, but overall the
optimizations do make the datapath somewhat more efficient.
Sagi Grimberg (3):
nvme-tcp: have queue prod/cons send list become a llist
nvme-tcp: leverage request plugging
nvme-tcp: optimize network stack with setting msg flags according to
batch size
drivers/nvme/host/tcp.c | 78 +++++++++++++++++++++++++++++++----------
1 file changed, 60 insertions(+), 18 deletions(-)
--
2.25.1
* [PATCH 1/3] nvme-tcp: have queue prod/cons send list become a llist
From: Sagi Grimberg @ 2020-06-19 0:30 UTC
To: linux-nvme, Christoph Hellwig, Keith Busch
Cc: Anil Vasudevan, Mark Wunderlich
Have producers add requests to a lockless list, and have the queue
processing splice it onto a queue-local send_list. This should
alleviate some contention on the send_list lock, and it also prepares
us for the next patches, where we look at these lists for the network
stack msg flag optimization.
Remove the queue spinlock as it is no longer used.
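For reference, here is a minimal standalone sketch of the lock-free
producer/consumer pattern this moves to; the names are hypothetical
stand-ins, not the driver code itself:

#include <linux/llist.h>
#include <linux/list.h>

struct item {
	struct llist_node lentry;	/* lock-free producer side */
	struct list_head entry;		/* consumer-local list */
};

/* Producers: add from any context, no lock taken. */
static bool produce(struct llist_head *req_list, struct item *item)
{
	/* Returns true if the list was empty before this add. */
	return llist_add(&item->lentry, req_list);
}

/* Single consumer: detach everything queued so far in one shot. */
static void consume(struct llist_head *req_list, struct list_head *send_list)
{
	struct llist_node *node = llist_del_all(req_list);
	struct item *item;

	while (node) {
		item = llist_entry(node, struct item, lentry);
		node = node->next;
		/*
		 * llist_del_all() returns entries newest-first, so
		 * prepending with list_add() restores FIFO order.
		 */
		list_add(&item->entry, send_list);
	}
}

Note that llist_add() returning true only when the list was previously
empty is what lets the submission path decide whether it may attempt a
direct send.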
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/tcp.c | 39 +++++++++++++++++++++++++++++----------
1 file changed, 29 insertions(+), 10 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 79ef2b8e2b3c..93fc6d9f97df 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -46,6 +46,7 @@ struct nvme_tcp_request {
u32 pdu_sent;
u16 ttag;
struct list_head entry;
+ struct llist_node lentry;
__le32 ddgst;
struct bio *curr_bio;
@@ -75,8 +76,8 @@ struct nvme_tcp_queue {
struct work_struct io_work;
int io_cpu;
- spinlock_t lock;
struct mutex send_mutex;
+ struct llist_head req_list;
struct list_head send_list;
/* recv state */
@@ -266,10 +267,8 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
struct nvme_tcp_queue *queue = req->queue;
bool empty;
- spin_lock(&queue->lock);
- empty = list_empty(&queue->send_list) && !queue->request;
- list_add_tail(&req->entry, &queue->send_list);
- spin_unlock(&queue->lock);
+ empty = llist_add(&req->lentry, &queue->req_list) &&
+ list_empty(&queue->send_list) && !queue->request;
/*
* if we're the first on the send_list and we can try to send
@@ -285,18 +284,38 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
}
}
+static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
+{
+ struct nvme_tcp_request *req;
+ struct llist_node *node;
+
+ node = llist_del_all(&queue->req_list);
+ if (!node)
+ return;
+
+ while (node) {
+ req = llist_entry(node, struct nvme_tcp_request, lentry);
+ list_add(&req->entry, &queue->send_list);
+ node = node->next;
+ }
+}
+
static inline struct nvme_tcp_request *
nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
{
struct nvme_tcp_request *req;
- spin_lock(&queue->lock);
req = list_first_entry_or_null(&queue->send_list,
struct nvme_tcp_request, entry);
- if (req)
- list_del(&req->entry);
- spin_unlock(&queue->lock);
+ if (!req) {
+ nvme_tcp_process_req_list(queue);
+ req = list_first_entry_or_null(&queue->send_list,
+ struct nvme_tcp_request, entry);
+ if (unlikely(!req))
+ return NULL;
+ }
+ list_del(&req->entry);
return req;
}
@@ -1342,8 +1361,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
int ret, rcv_pdu_size;
queue->ctrl = ctrl;
+ init_llist_head(&queue->req_list);
INIT_LIST_HEAD(&queue->send_list);
- spin_lock_init(&queue->lock);
mutex_init(&queue->send_mutex);
INIT_WORK(&queue->io_work, nvme_tcp_io_work);
queue->queue_size = queue_size;
--
2.25.1
* [PATCH 2/3] nvme-tcp: leverage request plugging
From: Sagi Grimberg @ 2020-06-19 0:30 UTC
To: linux-nvme, Christoph Hellwig, Keith Busch
Cc: Anil Vasudevan, Mark Wunderlich
blk-mq request plugging can improve the execution of our pipeline.
When we queue a request we trigger our I/O worker thread, which by
definition costs a context switch. However, if we know that more
requests are coming down the pipe, we are better off not triggering
the I/O worker and only doing so for the last request in the batch
(bd->last). This improves efficiency by amortizing context switches
over a batch of requests.
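For reference, a minimal sketch of the plugging pattern; my_queue and
its fields are hypothetical stand-ins, while the blk-mq hooks are the
stock API:

#include <linux/blk-mq.h>
#include <linux/llist.h>
#include <linux/workqueue.h>

/* Hypothetical per-hctx queue with a deferred-work send model. */
struct my_queue {
	struct llist_head req_list;
	struct work_struct io_work;
	int io_cpu;
};

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	struct my_queue *q = hctx->driver_data;

	blk_mq_start_request(bd->rq);
	/* ... stage the request on q->req_list ... */

	/* Only kick the worker for the last request of a plugged batch. */
	if (bd->last)
		queue_work_on(q->io_cpu, system_wq, &q->io_work);
	return BLK_STS_OK;
}

/*
 * Safety net: blk-mq calls .commit_rqs when a dispatch batch ends
 * without .queue_rq seeing bd->last set, so staged requests are never
 * left behind without a worker to send them.
 */
static void my_commit_rqs(struct blk_mq_hw_ctx *hctx)
{
	struct my_queue *q = hctx->driver_data;

	if (!llist_empty(&q->req_list))
		queue_work_on(q->io_cpu, system_wq, &q->io_work);
}

static const struct blk_mq_ops my_mq_ops = {
	.queue_rq	= my_queue_rq,
	.commit_rqs	= my_commit_rqs,
};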
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/tcp.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 93fc6d9f97df..76fd587acc2e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -262,7 +262,7 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
}
static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
- bool sync)
+ bool sync, bool last)
{
struct nvme_tcp_queue *queue = req->queue;
bool empty;
@@ -279,7 +279,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
sync && empty && mutex_trylock(&queue->send_mutex)) {
nvme_tcp_try_send(queue);
mutex_unlock(&queue->send_mutex);
- } else {
+ } else if (last) {
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
}
@@ -614,7 +614,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
req->state = NVME_TCP_SEND_H2C_PDU;
req->offset = 0;
- nvme_tcp_queue_request(req, false);
+ nvme_tcp_queue_request(req, false, true);
return 0;
}
@@ -2123,7 +2123,7 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
ctrl->async_req.curr_bio = NULL;
ctrl->async_req.data_len = 0;
- nvme_tcp_queue_request(&ctrl->async_req, true);
+ nvme_tcp_queue_request(&ctrl->async_req, true, true);
}
static enum blk_eh_timer_return
@@ -2235,6 +2235,14 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
return 0;
}
+static void nvme_tcp_commit_rqs(struct blk_mq_hw_ctx *hctx)
+{
+ struct nvme_tcp_queue *queue = hctx->driver_data;
+
+ if (!llist_empty(&queue->req_list))
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
@@ -2254,7 +2262,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
blk_mq_start_request(rq);
- nvme_tcp_queue_request(req, true);
+ nvme_tcp_queue_request(req, true, bd->last);
return BLK_STS_OK;
}
@@ -2322,6 +2330,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx)
static const struct blk_mq_ops nvme_tcp_mq_ops = {
.queue_rq = nvme_tcp_queue_rq,
+ .commit_rqs = nvme_tcp_commit_rqs,
.complete = nvme_complete_rq,
.init_request = nvme_tcp_init_request,
.exit_request = nvme_tcp_exit_request,
--
2.25.1
* [PATCH 3/3] nvme-tcp: optimize network stack with setting msg flags according to batch size
From: Sagi Grimberg @ 2020-06-19 0:30 UTC
To: linux-nvme, Christoph Hellwig, Keith Busch
Cc: Anil Vasudevan, Mark Wunderlich
If we have a long list of requests to send, signal the network stack
that more is coming (MSG_MORE); if we have nothing else to send,
signal MSG_EOR.
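For reference, a minimal sketch of the flag selection; my_queue and
queue_has_more() are hypothetical stand-ins mirroring the driver's
nvme_tcp_queue_more() check:

#include <linux/llist.h>
#include <linux/list.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>

struct my_queue {
	struct llist_head req_list;
	struct list_head send_list;
	bool more_requests;
};

/* Anything else already lined up behind the current send? */
static bool queue_has_more(struct my_queue *q)
{
	return !list_empty(&q->send_list) ||
	       !llist_empty(&q->req_list) || q->more_requests;
}

static int send_buf(struct socket *sock, struct my_queue *q,
		void *buf, size_t len)
{
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
	struct kvec iov = { .iov_base = buf, .iov_len = len };

	if (queue_has_more(q))
		msg.msg_flags |= MSG_MORE;	/* let TCP coalesce */
	else
		msg.msg_flags |= MSG_EOR;	/* flush this record */

	return kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);
}

The more_requests hint covers the direct-send case, where the
submitter already knows from bd->last that the batch continues even
though nothing is queued behind the request yet.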
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/tcp.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 76fd587acc2e..0a13a0812dbc 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -79,6 +79,7 @@ struct nvme_tcp_queue {
struct mutex send_mutex;
struct llist_head req_list;
struct list_head send_list;
+ bool more_requests;
/* recv state */
void *pdu;
@@ -277,7 +278,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
*/
if (queue->io_cpu == smp_processor_id() &&
sync && empty && mutex_trylock(&queue->send_mutex)) {
+ queue->more_requests = !last;
nvme_tcp_try_send(queue);
+ queue->more_requests = false;
mutex_unlock(&queue->send_mutex);
} else if (last) {
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
@@ -880,6 +883,12 @@ static void nvme_tcp_state_change(struct sock *sk)
read_unlock(&sk->sk_callback_lock);
}
+static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
+{
+ return !list_empty(&queue->send_list) ||
+ !llist_empty(&queue->req_list) || queue->more_requests;
+}
+
static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue)
{
queue->request = NULL;
@@ -901,7 +910,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
bool last = nvme_tcp_pdu_last_send(req, len);
int ret, flags = MSG_DONTWAIT;
- if (last && !queue->data_digest)
+ if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
flags |= MSG_EOR;
else
flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
@@ -948,7 +957,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
int flags = MSG_DONTWAIT;
int ret;
- if (inline_data)
+ if (inline_data || nvme_tcp_queue_more(queue))
flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
else
flags |= MSG_EOR;
@@ -1013,12 +1022,17 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
{
struct nvme_tcp_queue *queue = req->queue;
int ret;
- struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
+ struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
struct kvec iov = {
.iov_base = &req->ddgst + req->offset,
.iov_len = NVME_TCP_DIGEST_LENGTH - req->offset
};
+ if (nvme_tcp_queue_more(queue))
+ msg.msg_flags |= MSG_MORE;
+ else
+ msg.msg_flags |= MSG_EOR;
+
ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
if (unlikely(ret <= 0))
return ret;
--
2.25.1
* Re: [PATCH 1/3] nvme-tcp: have queue prod/cons send list become a llist
From: Christoph Hellwig @ 2020-06-24 16:48 UTC
To: Sagi Grimberg
Cc: Keith Busch, Anil Vasudevan, Mark Wunderlich, Christoph Hellwig,
linux-nvme
On Thu, Jun 18, 2020 at 05:30:22PM -0700, Sagi Grimberg wrote:
> +static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
> +{
> + struct nvme_tcp_request *req;
> + struct llist_node *node;
> +
> + node = llist_del_all(&queue->req_list);
> + if (!node)
> + return;
> +
> + while (node) {
> + req = llist_entry(node, struct nvme_tcp_request, lentry);
> + list_add(&req->entry, &queue->send_list);
> + node = node->next;
> + }
The if is not needed as the while covers this.
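i.e. presumably the whole thing can collapse to something like:

static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
{
	struct nvme_tcp_request *req;
	struct llist_node *node;

	for (node = llist_del_all(&queue->req_list); node; node = node->next) {
		req = llist_entry(node, struct nvme_tcp_request, lentry);
		list_add(&req->entry, &queue->send_list);
	}
}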
Otherwise this looks good, I'll fix it up when applying.
* Re: [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver
From: Christoph Hellwig @ 2020-06-24 16:51 UTC
To: Sagi Grimberg
Cc: Keith Busch, Anil Vasudevan, Mark Wunderlich, Christoph Hellwig,
linux-nvme
Applied to nvme-5.9.