linux-nvme.lists.infradead.org archive mirror
* [PATCH 0/2] nvme-tcp I/O path optimizations
From: Sagi Grimberg @ 2020-05-01 21:25 UTC
  To: Christoph Hellwig, Keith Busch, linux-nvme
  Cc: Anil Vasudevan, Mark Wunderlich

Hey All,

Here are two data-path optimizations that result in a measurable reduction
in latency and context switches.

The first optimization is a heuristic for polling-oriented workloads: avoid
scheduling io_work when the application is already polling. The second
optimization is an opportunistic attempt to send the request directly from
queue_rq if it is first in line (statistic sampling shows that this is often
the case for read-intensive workloads); otherwise we assume io_work will
handle it and we don't want to contend with it. The benefit is that we avoid
the extra context switch when we don't have to take it. Do note that network
send operations may sleep despite MSG_DONTWAIT being set, so we need to mark
the tag sets for blocking dispatch (BLK_MQ_F_BLOCKING).

More data-path optimizations are being evaluated, both for the host and the
target. The ideas came from Mark and Anil from Intel, who also benchmarked
and instrumented the system (thanks!). Testing was done with an Intel NIC,
but nothing here should be specific to any particular device.

Representative fio micro-benchmark testing (a sample invocation is sketched
below the parameter list):
- ram device (nvmet-tcp)
- single CPU core (pinned)
- 100% 4k reads
- engine io_uring
- hipri flag set (polling)
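
A fio invocation along these lines should reproduce the 32/8 data point; the
device path, runtime, CPU number and random read pattern below are
assumptions rather than details taken from this posting, and the controller
is assumed to have been connected with polling queues (nvme connect
--nr-poll-queues):

  fio --name=nvme-tcp-poll --filename=/dev/nvme0n1 --direct=1 \
      --rw=randread --bs=4k --ioengine=io_uring --hipri \
      --iodepth=32 --iodepth_batch_submit=8 --iodepth_batch_complete=8 \
      --cpus_allowed=0 --time_based --runtime=60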

Baseline:
========
QDepth/Batch    | IOPs [k]  | Avg. Lat [us]  | 99.99% Lat [us]  | ctx-switches
------------------------------------------------------------------------------
1/1             | 35.1      | 27.42          | 47.87            | ~119914
32/8            | 234       | 122.98         | 239              | ~143450

With patches applied:
====================
QDepth/Batch    | IOPs [k]  | Avg. Lat [us]  | 99.99% Lat [us]  | ctx-switches
------------------------------------------------------------------------------
1/1             | 39.6      | 24.25          | 36.6             | ~357
32/8            | 247       | 113.95         | 249              | ~37298
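
How the context switches were counted is not stated here; one way to observe
the same trend (an assumption, not necessarily how the numbers above were
gathered) is to count switches on the pinned CPU with perf for the duration
of the run:

  perf stat -e context-switches -C 0 -- sleep 60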

Sagi Grimberg (2):
  nvme-tcp: avoid scheduling io_work if we are already polling
  nvme-tcp: try to send request in queue_rq context

 drivers/nvme/host/tcp.c | 49 +++++++++++++++++++++++++++++++----------
 1 file changed, 37 insertions(+), 12 deletions(-)

-- 

2.20.1



* [PATCH 1/2] nvme-tcp: avoid scheduling io_work if we are already polling
From: Sagi Grimberg @ 2020-05-01 21:25 UTC
  To: Christoph Hellwig, Keith Busch, linux-nvme
  Cc: Anil Vasudevan, Mark Wunderlich

When the user runs polled I/O, we shouldn't have to schedule io_work from
the .data_ready upcall to process receives; the polling context already
takes care of that. This avoids a redundant context switch when the
application is already polling for completions.

Proposed-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4862fa962011..b28f91d0f083 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -60,6 +60,7 @@ struct nvme_tcp_request {
 enum nvme_tcp_queue_flags {
 	NVME_TCP_Q_ALLOCATED	= 0,
 	NVME_TCP_Q_LIVE		= 1,
+	NVME_TCP_Q_POLLING	= 2,
 };
 
 enum nvme_tcp_recv_state {
@@ -796,7 +797,8 @@ static void nvme_tcp_data_ready(struct sock *sk)
 
 	read_lock_bh(&sk->sk_callback_lock);
 	queue = sk->sk_user_data;
-	if (likely(queue && queue->rd_enabled))
+	if (likely(queue && queue->rd_enabled) &&
+	    !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
 		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 	read_unlock_bh(&sk->sk_callback_lock);
 }
@@ -2302,9 +2304,11 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx)
 	if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
 		return 0;
 
+	set_bit(NVME_TCP_Q_POLLING, &queue->flags);
 	if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue))
 		sk_busy_loop(sk, true);
 	nvme_tcp_try_recv(queue);
+	clear_bit(NVME_TCP_Q_POLLING, &queue->flags);
 	return queue->nr_cqe;
 }
 
-- 
2.20.1



* [PATCH 2/2] nvme-tcp: try to send request in queue_rq context
From: Sagi Grimberg @ 2020-05-01 21:25 UTC
  To: Christoph Hellwig, Keith Busch, linux-nvme
  Cc: Anil Vasudevan, Mark Wunderlich

Today, nvme-tcp unconditionally schedules a queued send to a workqueue
context, which is one context switch more than we need when the socket
buffer is wide open.

However, because we have async send activity (as a result of r2t or
write_space callbacks), we need to synchronize sends from possibly multiple
contexts (ideally all running on the same cpu though).

Thus, we only try to send directly from queue_rq when:
1. the send_list is empty
2. we can send synchronously (i.e. we are not in the RX path)
3. we are running on the same cpu as queue->io_cpu, to avoid
   contention on the send operation

Proposed-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 43 ++++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index b28f91d0f083..c79e248b9f43 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -76,6 +76,7 @@ struct nvme_tcp_queue {
 	int			io_cpu;
 
 	spinlock_t		lock;
+	struct mutex		send_mutex;
 	struct list_head	send_list;
 
 	/* recv state */
@@ -132,6 +133,7 @@ static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
 static struct workqueue_struct *nvme_tcp_wq;
 static struct blk_mq_ops nvme_tcp_mq_ops;
 static struct blk_mq_ops nvme_tcp_admin_mq_ops;
+static int nvme_tcp_try_send(struct nvme_tcp_queue *queue);
 
 static inline struct nvme_tcp_ctrl *to_tcp_ctrl(struct nvme_ctrl *ctrl)
 {
@@ -258,15 +260,29 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
 	}
 }
 
-static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req)
+static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
+		bool sync)
 {
 	struct nvme_tcp_queue *queue = req->queue;
+	bool empty;
 
 	spin_lock(&queue->lock);
+	empty = list_empty(&queue->send_list) && !queue->request;
 	list_add_tail(&req->entry, &queue->send_list);
 	spin_unlock(&queue->lock);
 
-	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+	/*
+	 * if we're the first on the send_list and we can try to send
+	 * directly, otherwise queue io_work. Also, only do that if we
+	 * are on the same cpu, so we don't introduce contention.
+	 */
+	if (queue->io_cpu == smp_processor_id() &&
+	    sync && empty && mutex_trylock(&queue->send_mutex)) {
+		nvme_tcp_try_send(queue);
+		mutex_unlock(&queue->send_mutex);
+	} else {
+		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+	}
 }
 
 static inline struct nvme_tcp_request *
@@ -579,7 +595,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 	req->state = NVME_TCP_SEND_H2C_PDU;
 	req->offset = 0;
 
-	nvme_tcp_queue_request(req);
+	nvme_tcp_queue_request(req, false);
 
 	return 0;
 }
@@ -1065,11 +1081,14 @@ static void nvme_tcp_io_work(struct work_struct *w)
 		bool pending = false;
 		int result;
 
-		result = nvme_tcp_try_send(queue);
-		if (result > 0)
-			pending = true;
-		else if (unlikely(result < 0))
-			break;
+		if (mutex_trylock(&queue->send_mutex)) {
+			result = nvme_tcp_try_send(queue);
+			mutex_unlock(&queue->send_mutex);
+			if (result > 0)
+				pending = true;
+			else if (unlikely(result < 0))
+				break;
+		}
 
 		result = nvme_tcp_try_recv(queue);
 		if (result > 0)
@@ -1321,6 +1340,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 	queue->ctrl = ctrl;
 	INIT_LIST_HEAD(&queue->send_list);
 	spin_lock_init(&queue->lock);
+	mutex_init(&queue->send_mutex);
 	INIT_WORK(&queue->io_work, nvme_tcp_io_work);
 	queue->queue_size = queue_size;
 
@@ -1545,6 +1565,7 @@ static struct blk_mq_tag_set *nvme_tcp_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
 		set->reserved_tags = 2; /* connect + keep-alive */
 		set->numa_node = NUMA_NO_NODE;
+		set->flags = BLK_MQ_F_BLOCKING;
 		set->cmd_size = sizeof(struct nvme_tcp_request);
 		set->driver_data = ctrl;
 		set->nr_hw_queues = 1;
@@ -1556,7 +1577,7 @@ static struct blk_mq_tag_set *nvme_tcp_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->queue_depth = nctrl->sqsize + 1;
 		set->reserved_tags = 1; /* fabric connect */
 		set->numa_node = NUMA_NO_NODE;
-		set->flags = BLK_MQ_F_SHOULD_MERGE;
+		set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING;
 		set->cmd_size = sizeof(struct nvme_tcp_request);
 		set->driver_data = ctrl;
 		set->nr_hw_queues = nctrl->queue_count - 1;
@@ -2115,7 +2136,7 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
 	ctrl->async_req.curr_bio = NULL;
 	ctrl->async_req.data_len = 0;
 
-	nvme_tcp_queue_request(&ctrl->async_req);
+	nvme_tcp_queue_request(&ctrl->async_req, true);
 }
 
 static enum blk_eh_timer_return
@@ -2246,7 +2267,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(rq);
 
-	nvme_tcp_queue_request(req);
+	nvme_tcp_queue_request(req, true);
 
 	return BLK_STS_OK;
 }
-- 
2.20.1



* Re: [PATCH 0/2] nvme-tcp I/O path optimizations
From: Christoph Hellwig @ 2020-05-06  6:57 UTC
  To: Sagi Grimberg
  Cc: Keith Busch, Anil Vasudevan, Mark Wunderlich, Christoph Hellwig,
	linux-nvme

Thanks, applied to nvme-5.8.
