* [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver
From: Sagi Grimberg @ 2020-06-19 0:30 UTC
To: linux-nvme, Christoph Hellwig, Keith Busch
Cc: Anil Vasudevan, Mark Wunderlich
Hey, here are some datapath micro-optimizations for the nvme-tcp host
driver. The optimizations are:
1. Make the request list for send operations a lockless list, and have
   the I/O thread pull from it into a queue-local send_list in batches.
2. Leverage blk-mq request plugging so the I/O worker is scheduled only
   once per batch of requests.
3. Optimize msg flags to signal the network stack that more data is
   coming (MSG_MORE) when more requests are expected to be sent soon.
A simple benchmark shows a nice IOPs win at high queue depth (32) with
batching applied (8 requests per batch), while QD=1 latency is not
impacted.
Tested on ConnectX-4 device (LRO disabled).
CPU: Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
Baseline:
========
QDepth/Batch | IOPs [k]
--------------------------
1/1 | 43.2
32/8 | 106
with patchset:
=============
QDepth/Batch | IOPs [k]
--------------------------
1/1 | 43.3
32/8 | 165
Do note that in an alternative setup (different NIC, faster CPU) the
tests did not show a noticeable difference, but overall the
optimizations do make the datapath somewhat more efficient.
Sagi Grimberg (3):
nvme-tcp: have queue prod/cons send list become a llist
nvme-tcp: leverage request plugging
nvme-tcp: optimize network stack with setting msg flags according to
batch size
drivers/nvme/host/tcp.c | 78 +++++++++++++++++++++++++++++++----------
1 file changed, 60 insertions(+), 18 deletions(-)
--
2.25.1
* [PATCH 1/3] nvme-tcp: have queue prod/cons send list become a llist
From: Sagi Grimberg @ 2020-06-19 0:30 UTC
To: linux-nvme, Christoph Hellwig, Keith Busch
Cc: Anil Vasudevan, Mark Wunderlich
Have producers add requests to a lockless list, and have the queue
processing splice it onto a queue-local send_list. This should
alleviate some contention on the send_list lock, and it also prepares
us for the next patches, where we look at these lists for the network
stack msg flag optimization.
Remove the queue spinlock as it is no longer used.
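For reference, here is a minimal standalone sketch of the lock-free
producer/consumer pattern this moves to; the names are hypothetical
stand-ins, not the driver code itself:

#include <linux/llist.h>
#include <linux/list.h>

struct item {
	struct llist_node lentry;	/* lock-free producer side */
	struct list_head entry;		/* consumer-local list */
};

/* Producers: add from any context, no lock taken. */
static bool produce(struct llist_head *req_list, struct item *item)
{
	/* Returns true if the list was empty before this add. */
	return llist_add(&item->lentry, req_list);
}

/* Single consumer: detach everything queued so far in one shot. */
static void consume(struct llist_head *req_list, struct list_head *send_list)
{
	struct llist_node *node = llist_del_all(req_list);
	struct item *item;

	while (node) {
		item = llist_entry(node, struct item, lentry);
		node = node->next;
		/*
		 * llist_del_all() returns entries newest-first, so
		 * prepending with list_add() restores FIFO order.
		 */
		list_add(&item->entry, send_list);
	}
}

Note that llist_add() returning true only when the list was previously
empty is what lets the submission path decide whether it may attempt a
direct send.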
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/tcp.c | 39 +++++++++++++++++++++++++++++----------
1 file changed, 29 insertions(+), 10 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 79ef2b8e2b3c..93fc6d9f97df 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -46,6 +46,7 @@ struct nvme_tcp_request {
u32 pdu_sent;
u16 ttag;
struct list_head entry;
+ struct llist_node lentry;
__le32 ddgst;
struct bio *curr_bio;
@@ -75,8 +76,8 @@ struct nvme_tcp_queue {
struct work_struct io_work;
int io_cpu;
- spinlock_t lock;
struct mutex send_mutex;
+ struct llist_head req_list;
struct list_head send_list;
/* recv state */
@@ -266,10 +267,8 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
struct nvme_tcp_queue *queue = req->queue;
bool empty;
- spin_lock(&queue->lock);
- empty = list_empty(&queue->send_list) && !queue->request;
- list_add_tail(&req->entry, &queue->send_list);
- spin_unlock(&queue->lock);
+ empty = llist_add(&req->lentry, &queue->req_list) &&
+ list_empty(&queue->send_list) && !queue->request;
/*
* if we're the first on the send_list and we can try to send
@@ -285,18 +284,38 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
}
}
+static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
+{
+ struct nvme_tcp_request *req;
+ struct llist_node *node;
+
+ node = llist_del_all(&queue->req_list);
+ if (!node)
+ return;
+
+ while (node) {
+ req = llist_entry(node, struct nvme_tcp_request, lentry);
+ list_add(&req->entry, &queue->send_list);
+ node = node->next;
+ }
+}
+
static inline struct nvme_tcp_request *
nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
{
struct nvme_tcp_request *req;
- spin_lock(&queue->lock);
req = list_first_entry_or_null(&queue->send_list,
struct nvme_tcp_request, entry);
- if (req)
- list_del(&req->entry);
- spin_unlock(&queue->lock);
+ if (!req) {
+ nvme_tcp_process_req_list(queue);
+ req = list_first_entry_or_null(&queue->send_list,
+ struct nvme_tcp_request, entry);
+ if (unlikely(!req))
+ return NULL;
+ }
+ list_del(&req->entry);
return req;
}
@@ -1342,8 +1361,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
int ret, rcv_pdu_size;
queue->ctrl = ctrl;
+ init_llist_head(&queue->req_list);
INIT_LIST_HEAD(&queue->send_list);
- spin_lock_init(&queue->lock);
mutex_init(&queue->send_mutex);
INIT_WORK(&queue->io_work, nvme_tcp_io_work);
queue->queue_size = queue_size;
--
2.25.1
* [PATCH 2/3] nvme-tcp: leverage request plugging
From: Sagi Grimberg @ 2020-06-19 0:30 UTC
To: linux-nvme, Christoph Hellwig, Keith Busch
Cc: Anil Vasudevan, Mark Wunderlich
blk-mq request plugging can improve the execution of our pipeline.
When we queue a request we trigger our I/O worker thread, which by
definition costs a context switch. However, if we know that more
requests are coming down the pipe, we are better off not triggering
the I/O worker and only doing so for the last request in the batch
(bd->last). This improves efficiency by amortizing context switches
over a batch of requests.
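For reference, a minimal sketch of the plugging pattern; my_queue and
its fields are hypothetical stand-ins, while the blk-mq hooks are the
stock API:

#include <linux/blk-mq.h>
#include <linux/llist.h>
#include <linux/workqueue.h>

/* Hypothetical per-hctx queue with a deferred-work send model. */
struct my_queue {
	struct llist_head req_list;
	struct work_struct io_work;
	int io_cpu;
};

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	struct my_queue *q = hctx->driver_data;

	blk_mq_start_request(bd->rq);
	/* ... stage the request on q->req_list ... */

	/* Only kick the worker for the last request of a plugged batch. */
	if (bd->last)
		queue_work_on(q->io_cpu, system_wq, &q->io_work);
	return BLK_STS_OK;
}

/*
 * Safety net: blk-mq calls .commit_rqs when a dispatch batch ends
 * without .queue_rq seeing bd->last set, so staged requests are never
 * left behind without a worker to send them.
 */
static void my_commit_rqs(struct blk_mq_hw_ctx *hctx)
{
	struct my_queue *q = hctx->driver_data;

	if (!llist_empty(&q->req_list))
		queue_work_on(q->io_cpu, system_wq, &q->io_work);
}

static const struct blk_mq_ops my_mq_ops = {
	.queue_rq	= my_queue_rq,
	.commit_rqs	= my_commit_rqs,
};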
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/tcp.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 93fc6d9f97df..76fd587acc2e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -262,7 +262,7 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
}
static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
- bool sync)
+ bool sync, bool last)
{
struct nvme_tcp_queue *queue = req->queue;
bool empty;
@@ -279,7 +279,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
sync && empty && mutex_trylock(&queue->send_mutex)) {
nvme_tcp_try_send(queue);
mutex_unlock(&queue->send_mutex);
- } else {
+ } else if (last) {
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
}
@@ -614,7 +614,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
req->state = NVME_TCP_SEND_H2C_PDU;
req->offset = 0;
- nvme_tcp_queue_request(req, false);
+ nvme_tcp_queue_request(req, false, true);
return 0;
}
@@ -2123,7 +2123,7 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
ctrl->async_req.curr_bio = NULL;
ctrl->async_req.data_len = 0;
- nvme_tcp_queue_request(&ctrl->async_req, true);
+ nvme_tcp_queue_request(&ctrl->async_req, true, true);
}
static enum blk_eh_timer_return
@@ -2235,6 +2235,14 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
return 0;
}
+static void nvme_tcp_commit_rqs(struct blk_mq_hw_ctx *hctx)
+{
+ struct nvme_tcp_queue *queue = hctx->driver_data;
+
+ if (!llist_empty(&queue->req_list))
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
@@ -2254,7 +2262,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
blk_mq_start_request(rq);
- nvme_tcp_queue_request(req, true);
+ nvme_tcp_queue_request(req, true, bd->last);
return BLK_STS_OK;
}
@@ -2322,6 +2330,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx)
static const struct blk_mq_ops nvme_tcp_mq_ops = {
.queue_rq = nvme_tcp_queue_rq,
+ .commit_rqs = nvme_tcp_commit_rqs,
.complete = nvme_complete_rq,
.init_request = nvme_tcp_init_request,
.exit_request = nvme_tcp_exit_request,
--
2.25.1
* [PATCH 3/3] nvme-tcp: optimize network stack with setting msg flags according to batch size
From: Sagi Grimberg @ 2020-06-19 0:30 UTC
To: linux-nvme, Christoph Hellwig, Keith Busch
Cc: Anil Vasudevan, Mark Wunderlich
If we have a long list of requests to send, signal the network stack
that more is coming (MSG_MORE); if we have nothing else to send,
signal MSG_EOR.
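For reference, a minimal sketch of the flag selection; my_queue and
queue_has_more() are hypothetical stand-ins mirroring the driver's
nvme_tcp_queue_more() check:

#include <linux/llist.h>
#include <linux/list.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>

struct my_queue {
	struct llist_head req_list;
	struct list_head send_list;
	bool more_requests;
};

/* Anything else already lined up behind the current send? */
static bool queue_has_more(struct my_queue *q)
{
	return !list_empty(&q->send_list) ||
	       !llist_empty(&q->req_list) || q->more_requests;
}

static int send_buf(struct socket *sock, struct my_queue *q,
		void *buf, size_t len)
{
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
	struct kvec iov = { .iov_base = buf, .iov_len = len };

	if (queue_has_more(q))
		msg.msg_flags |= MSG_MORE;	/* let TCP coalesce */
	else
		msg.msg_flags |= MSG_EOR;	/* flush this record */

	return kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);
}

The more_requests hint covers the direct-send case, where the
submitter already knows from bd->last that the batch continues even
though nothing is queued behind the request yet.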
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/tcp.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 76fd587acc2e..0a13a0812dbc 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -79,6 +79,7 @@ struct nvme_tcp_queue {
struct mutex send_mutex;
struct llist_head req_list;
struct list_head send_list;
+ bool more_requests;
/* recv state */
void *pdu;
@@ -277,7 +278,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
*/
if (queue->io_cpu == smp_processor_id() &&
sync && empty && mutex_trylock(&queue->send_mutex)) {
+ queue->more_requests = !last;
nvme_tcp_try_send(queue);
+ queue->more_requests = false;
mutex_unlock(&queue->send_mutex);
} else if (last) {
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
@@ -880,6 +883,12 @@ static void nvme_tcp_state_change(struct sock *sk)
read_unlock(&sk->sk_callback_lock);
}
+static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
+{
+ return !list_empty(&queue->send_list) ||
+ !llist_empty(&queue->req_list) || queue->more_requests;
+}
+
static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue)
{
queue->request = NULL;
@@ -901,7 +910,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
bool last = nvme_tcp_pdu_last_send(req, len);
int ret, flags = MSG_DONTWAIT;
- if (last && !queue->data_digest)
+ if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
flags |= MSG_EOR;
else
flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
@@ -948,7 +957,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
int flags = MSG_DONTWAIT;
int ret;
- if (inline_data)
+ if (inline_data || nvme_tcp_queue_more(queue))
flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
else
flags |= MSG_EOR;
@@ -1013,12 +1022,17 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
{
struct nvme_tcp_queue *queue = req->queue;
int ret;
- struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
+ struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
struct kvec iov = {
.iov_base = &req->ddgst + req->offset,
.iov_len = NVME_TCP_DIGEST_LENGTH - req->offset
};
+ if (nvme_tcp_queue_more(queue))
+ msg.msg_flags |= MSG_MORE;
+ else
+ msg.msg_flags |= MSG_EOR;
+
ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
if (unlikely(ret <= 0))
return ret;
--
2.25.1
* Re: [PATCH 1/3] nvme-tcp: have queue prod/cons send list become a llist
From: Christoph Hellwig @ 2020-06-24 16:48 UTC
To: Sagi Grimberg
Cc: Keith Busch, Anil Vasudevan, Mark Wunderlich, Christoph Hellwig,
linux-nvme
On Thu, Jun 18, 2020 at 05:30:22PM -0700, Sagi Grimberg wrote:
> +static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
> +{
> + struct nvme_tcp_request *req;
> + struct llist_node *node;
> +
> + node = llist_del_all(&queue->req_list);
> + if (!node)
> + return;
> +
> + while (node) {
> + req = llist_entry(node, struct nvme_tcp_request, lentry);
> + list_add(&req->entry, &queue->send_list);
> + node = node->next;
> + }
The if is not needed as the while covers this.
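i.e. presumably the whole thing can collapse to something like:

static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
{
	struct nvme_tcp_request *req;
	struct llist_node *node;

	for (node = llist_del_all(&queue->req_list); node; node = node->next) {
		req = llist_entry(node, struct nvme_tcp_request, lentry);
		list_add(&req->entry, &queue->send_list);
	}
}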
Otherwise this looks good, I'll fix it up when applying.
* Re: [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver
From: Christoph Hellwig @ 2020-06-24 16:51 UTC
To: Sagi Grimberg
Cc: Keith Busch, Anil Vasudevan, Mark Wunderlich, Christoph Hellwig,
linux-nvme
Applied to nvme-5.9.