Linux-NVME Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH 0/2] nvmet: add polling support
@ 2019-12-10  6:25 Chaitanya Kulkarni
  2019-12-10  6:25 ` [RFC PATCH 1/2] nvmet: add bdev-ns " Chaitanya Kulkarni
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Chaitanya Kulkarni @ 2019-12-10  6:25 UTC (permalink / raw)
  To: sagi, hch; +Cc: Chaitanya Kulkarni, linux-nvme

Hi Christoph/Sagi/Keith,

This patch series implements the polling support for the NVMeOF target
with maintaining separate kthread for polling.

By implementing the polling for the file and bdev I was able to create
stubs for the target polling which will be useful for passthru code.

1. General approach:-

After submitting the request in (nvmet_file|bdev_execute_rw()) code path
we issue request on the available polling thread's list (poll_list)
which is shared between submission thread and the polling thread. We
create per CPU two polling threads to improve the scalability for I/O
submission.

When polling thread is woken up by the submission thread it moves the
request to its private list (done list) for processing if it finds
completed request (by the backend bdev/file). This allows outstanding
submission requests to make forward progress under pressure.

When there are no completed requests on the poll_list then polling
thread removes the request from the poll_list and calls backend
specific poll function on this request until it gets the completion.
On completion, it finishes the request in the same context over
trandport.

2. Testing and verification:-

I did fio verification tests for a couple of days and didn't find
any issues with data verification. Also, for bdev and file I could see
the io_poll stats under debugfs with considered/invoked/success numbers
being increased.

3. Performance improvements:-

There are still some performance improvements which can be done.

4. User interface:-

Users can optionally enable/disable polling with newly added configfs
attribute use_poll. This only enables polling in case backend
(bdev/file) supports it otherwise we just ignore this value.

It will be great if you can provide some feedback.

I'm aware of the io-wq code for io_using posted on the mailing list.
I was just wondering should we explore that to implement the polling ?

Patchset info :-

Repo:-          git://git.infradead.org/nvme.git
Bramch :-       nvme/for-5.5
HEAD:-

commit 7e4c6b9a5d22485acf009b3c3510a370f096dd54 (origin/nvme/for-5.5)
Author: Keith Busch <kbusch@kernel.org>
Date:   Fri Dec 6 08:11:17 2019 +0900

    nvme/pci: Fix read queue count

Regards,
Chaitanya

Chaitanya Kulkarni (2):
  nvmet: add bdev-ns polling support
  nvmet: add file-ns polling support

 drivers/nvme/target/Makefile      |   3 +-
 drivers/nvme/target/configfs.c    |  29 ++++++
 drivers/nvme/target/core.c        |  17 +++
 drivers/nvme/target/io-cmd-bdev.c |  61 +++++++++--
 drivers/nvme/target/io-cmd-file.c |  60 +++++++++--
 drivers/nvme/target/io-poll.c     | 165 ++++++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h       |  40 +++++++-
 7 files changed, 354 insertions(+), 21 deletions(-)
 create mode 100644 drivers/nvme/target/io-poll.c

-- 
2.22.1


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH 1/2] nvmet: add bdev-ns polling support
  2019-12-10  6:25 [RFC PATCH 0/2] nvmet: add polling support Chaitanya Kulkarni
@ 2019-12-10  6:25 ` " Chaitanya Kulkarni
  2020-01-20 12:52   ` Max Gurtovoy
  2019-12-10  6:25 ` [RFC PATCH 2/2] nvmet: add file-ns " Chaitanya Kulkarni
  2019-12-12  1:01 ` [RFC PATCH 0/2] nvmet: add " Sagi Grimberg
  2 siblings, 1 reply; 12+ messages in thread
From: Chaitanya Kulkarni @ 2019-12-10  6:25 UTC (permalink / raw)
  To: sagi, hch; +Cc: Chaitanya Kulkarni, linux-nvme

This patch adds support for the bdev-ns polling. We also add a new
file to keep polling code separate (io-poll.c). The newly added
configfs attribute allows user to enable/disable polling.

We only enable polling for the READ/WRITE operation based on support
from bdev and and use_poll configfs attr.

We configure polling per namespace. For each namespace we create
polling threads. For general approach please have a look at the
cover-letter of this patch-series.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 drivers/nvme/target/Makefile      |   3 +-
 drivers/nvme/target/configfs.c    |  29 ++++++
 drivers/nvme/target/core.c        |  13 +++
 drivers/nvme/target/io-cmd-bdev.c |  61 +++++++++--
 drivers/nvme/target/io-poll.c     | 165 ++++++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h       |  31 +++++-
 6 files changed, 291 insertions(+), 11 deletions(-)
 create mode 100644 drivers/nvme/target/io-poll.c

diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
index 2b33836f3d3e..8877ba16305c 100644
--- a/drivers/nvme/target/Makefile
+++ b/drivers/nvme/target/Makefile
@@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP)	+= nvme-fcloop.o
 obj-$(CONFIG_NVME_TARGET_TCP)		+= nvmet-tcp.o
 
 nvmet-y		+= core.o configfs.o admin-cmd.o fabrics-cmd.o \
-			discovery.o io-cmd-file.o io-cmd-bdev.o
+			discovery.o io-cmd-file.o io-cmd-bdev.o \
+			io-poll.o
 nvme-loop-y	+= loop.o
 nvmet-rdma-y	+= rdma.o
 nvmet-fc-y	+= fc.o
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 98613a45bd3b..0e6e8b0dbf79 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item,
 
 CONFIGFS_ATTR(nvmet_ns_, buffered_io);
 
+static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page)
+{
+	return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll);
+}
+
+static ssize_t nvmet_ns_use_poll_store(struct config_item *item,
+		const char *page, size_t count)
+{
+	struct nvmet_ns *ns = to_nvmet_ns(item);
+	bool val;
+
+	if (strtobool(page, &val))
+		return -EINVAL;
+
+	mutex_lock(&ns->subsys->lock);
+	if (ns->enabled) {
+		pr_err("disable ns before setting use_poll value.\n");
+		mutex_unlock(&ns->subsys->lock);
+		return -EINVAL;
+	}
+
+	ns->use_poll = val;
+	mutex_unlock(&ns->subsys->lock);
+	return count;
+}
+
+CONFIGFS_ATTR(nvmet_ns_, use_poll);
+
 static struct configfs_attribute *nvmet_ns_attrs[] = {
 	&nvmet_ns_attr_device_path,
 	&nvmet_ns_attr_device_nguid,
@@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
 	&nvmet_ns_attr_ana_grpid,
 	&nvmet_ns_attr_enable,
 	&nvmet_ns_attr_buffered_io,
+	&nvmet_ns_attr_use_poll,
 #ifdef CONFIG_PCI_P2PDMA
 	&nvmet_ns_attr_p2pmem,
 #endif
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 28438b833c1b..d8f9130d1cd1 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl,
 		ns->nsid);
 }
 
+inline void nvmet_req_done(struct nvmet_req *req)
+{
+	if (req->ns->bdev)
+		nvmet_bdev_req_complete(req);
+}
+
+inline void nvmet_req_poll_complete(struct nvmet_req *req)
+{
+	if (req->ns->bdev)
+		nvmet_bdev_poll_complete(req);
+}
+
 int nvmet_ns_enable(struct nvmet_ns *ns)
 {
 	struct nvmet_subsys *subsys = ns->subsys;
@@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid)
 
 	uuid_gen(&ns->uuid);
 	ns->buffered_io = false;
+	ns->use_poll = false;
 
 	return ns;
 }
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index b6fca0e421ef..13507e0cbc1d 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
 
 int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
 {
+	int fmode = FMODE_READ | FMODE_WRITE;
+	struct request_queue *q;
 	int ret;
 
-	ns->bdev = blkdev_get_by_path(ns->device_path,
-			FMODE_READ | FMODE_WRITE, NULL);
+	ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL);
 	if (IS_ERR(ns->bdev)) {
 		ret = PTR_ERR(ns->bdev);
 		if (ret != -ENOTBLK) {
@@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
 					ns->device_path, PTR_ERR(ns->bdev));
 		}
 		ns->bdev = NULL;
-		return ret;
+		goto out;
 	}
 	ns->size = i_size_read(ns->bdev->bd_inode);
 	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
-	return 0;
+	q = bdev_get_queue(ns->bdev);
+	ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
+	ret = ns->poll ? nvmet_ns_start_poll(ns) : 0;
+out:
+	return ret;
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
 {
 	if (ns->bdev) {
+		ns->poll ? nvmet_ns_stop_poll(ns) : 0;
 		blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ);
 		ns->bdev = NULL;
 	}
@@ -133,15 +139,39 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts)
 	return status;
 }
 
-static void nvmet_bio_done(struct bio *bio)
+void nvmet_bdev_req_complete(struct nvmet_req *req)
 {
-	struct nvmet_req *req = bio->bi_private;
+	struct bio *bio = req->b.last_bio;
 
 	nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status));
 	if (bio != &req->b.inline_bio)
 		bio_put(bio);
 }
 
+static void nvmet_bio_done(struct bio *bio)
+{
+	struct nvmet_req *req = bio->bi_private;
+
+	req->b.last_bio = bio;
+
+	req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req);
+}
+
+void nvmet_bdev_poll_complete(struct nvmet_req *req)
+{
+	struct request_queue *q = bdev_get_queue(req->ns->bdev);
+
+	while (!completion_done(&req->wait)) {
+		int ret = blk_poll(q, req->b.cookie, true);
+
+		if (ret < 0) {
+			pr_err("tid %d poll error %d", req->t->id, ret);
+			break;
+		}
+	}
+	nvmet_bdev_req_complete(req);
+}
+
 static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 {
 	int sg_cnt = req->sg_cnt;
@@ -185,7 +215,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 	bio->bi_end_io = nvmet_bio_done;
 	bio->bi_opf = op;
 
-	blk_start_plug(&plug);
+	if (!req->poll)
+		blk_start_plug(&plug);
 	for_each_sg(req->sg, sg, req->sg_cnt, i) {
 		while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
 				!= sg->length) {
@@ -204,8 +235,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
 		sg_cnt--;
 	}
 
-	submit_bio(bio);
-	blk_finish_plug(&plug);
+	req->b.last_bio = bio;
+	if (req->poll)
+		req->b.last_bio->bi_opf |= REQ_HIPRI;
+
+	req->b.cookie = submit_bio(bio);
+
+	nvmet_req_prep_poll(req);
+	nvmet_req_issue_poll(req);
+	if (!req->poll)
+		blk_finish_plug(&plug);
 }
 
 static void nvmet_bdev_execute_flush(struct nvmet_req *req)
@@ -330,15 +369,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 	switch (cmd->common.opcode) {
 	case nvme_cmd_read:
 	case nvme_cmd_write:
+		req->poll = req->ns->poll;
 		req->execute = nvmet_bdev_execute_rw;
 		return 0;
 	case nvme_cmd_flush:
+		req->poll = false;
 		req->execute = nvmet_bdev_execute_flush;
 		return 0;
 	case nvme_cmd_dsm:
+		req->poll = false;
 		req->execute = nvmet_bdev_execute_dsm;
 		return 0;
 	case nvme_cmd_write_zeroes:
+		req->poll = false;
 		req->execute = nvmet_bdev_execute_write_zeroes;
 		return 0;
 	default:
diff --git a/drivers/nvme/target/io-poll.c b/drivers/nvme/target/io-poll.c
new file mode 100644
index 000000000000..175c939c22ff
--- /dev/null
+++ b/drivers/nvme/target/io-poll.c
@@ -0,0 +1,165 @@
+#include <linux/blkdev.h>
+#include <linux/module.h>
+#include <linux/sched/signal.h>
+#include <linux/kthread.h>
+#include <uapi/linux/sched/types.h>
+
+#include "nvmet.h"
+#define THREAD_PER_CPU	(num_online_cpus() * 2)
+
+static int nvmet_poll_thread(void *data);
+
+int nvmet_ns_start_poll(struct nvmet_ns *ns)
+{
+	struct nvmet_poll_data *t;
+	int ret = 0;
+	int i;
+
+	t = kzalloc(sizeof(*t) * THREAD_PER_CPU, GFP_KERNEL);
+	if (!t) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < THREAD_PER_CPU; i++) {
+		init_completion(&t[i].thread_complete);
+		init_waitqueue_head(&t[i].poll_waitq);
+		INIT_LIST_HEAD(&t[i].list);
+		INIT_LIST_HEAD(&t[i].done);
+		mutex_init(&t[i].list_lock);
+		t[i].id = i;
+
+		t[i].thread = kthread_create(nvmet_poll_thread, &t[i],
+					     "nvmet_poll_thread%s/%d",
+					     ns->device_path, i);
+
+		if (IS_ERR(t[i].thread)) {
+			ret = PTR_ERR(t[i].thread);
+			goto err;
+		}
+
+		kthread_bind(t[i].thread, i % num_online_cpus());
+		wake_up_process(t[i].thread);
+		wait_for_completion(&t[i].thread_complete);
+	}
+
+	ns->t = t;
+out:
+	return ret;
+err:
+	nvmet_ns_stop_poll(ns);
+	goto out;
+}
+
+void nvmet_ns_stop_poll(struct nvmet_ns *ns)
+{
+	struct nvmet_poll_data *t = ns->t;
+	int i;
+
+	for (i = 0; i < THREAD_PER_CPU; i++) {
+		if (!t[i].thread)
+			continue;
+
+		if (wq_has_sleeper(&t[i].poll_waitq))
+			wake_up(&t[i].poll_waitq);
+		kthread_park(t[i].thread);
+		kthread_stop(t[i].thread);
+	}
+}
+
+static void nvmet_poll(struct nvmet_poll_data *t)
+{
+	struct nvmet_req *req, *tmp;
+
+	lockdep_assert_held(&t->list_lock);
+
+	list_for_each_entry_safe(req, tmp, &t->list, poll_entry) {
+
+		if (completion_done(&req->wait)) {
+			list_move_tail(&req->poll_entry, &t->done);
+			continue;
+		}
+
+		if (!list_empty(&t->done))
+			break;
+
+		list_del(&req->poll_entry);
+		mutex_unlock(&t->list_lock);
+		nvmet_req_poll_complete(req);
+		mutex_lock(&t->list_lock);
+	}
+	mutex_unlock(&t->list_lock);
+	/*
+	 * In future we can also add batch timeout or nr request to complete.
+	 */
+	while (!list_empty(&t->done)) {
+		/*
+		 * We lock and unlock for t->list which gurantee progress of
+		 * nvmet_xxx_rw_execute() when under pressure while we complete
+		 * the request.
+		 */
+		req = list_first_entry(&t->done, struct nvmet_req, poll_entry);
+		list_del(&req->poll_entry);
+		nvmet_req_done(req);
+	}
+
+	mutex_lock(&t->list_lock);
+}
+
+static int nvmet_poll_thread(void *data)
+{
+	struct nvmet_poll_data *t = (struct nvmet_poll_data *) data;
+	DEFINE_WAIT(wait);
+
+	complete(&t->thread_complete);
+
+	while (!kthread_should_park()) {
+
+		mutex_lock(&t->list_lock);
+		while (!list_empty(&t->list) && !need_resched())
+			nvmet_poll(t);
+		mutex_unlock(&t->list_lock);
+
+		prepare_to_wait(&t->poll_waitq, &wait, TASK_INTERRUPTIBLE);
+		if (signal_pending(current))
+			flush_signals(current);
+		smp_mb();
+		schedule();
+
+		finish_wait(&t->poll_waitq, &wait);
+	}
+
+	kthread_parkme();
+	return 0;
+}
+
+inline void nvmet_req_prep_poll(struct nvmet_req *req)
+{
+	if (!req->poll)
+		return;
+
+	init_completion(&req->wait);
+	req->t = &req->ns->t[smp_processor_id()];
+}
+
+inline void nvmet_req_issue_poll(struct nvmet_req *req)
+{
+	if (!req->poll)
+		return;
+
+	while (!mutex_trylock(&req->t->list_lock)) {
+		if (req->t->id < num_online_cpus())
+			req->t = &req->ns->t[req->t->id + num_online_cpus()];
+		else
+			req->t = &req->ns->t[req->t->id - num_online_cpus()];
+	}
+
+	if (completion_done(&req->wait))
+		list_add(&req->poll_entry, &req->t->list);
+	else
+		list_add_tail(&req->poll_entry, &req->t->list);
+	mutex_unlock(&req->t->list_lock);
+
+	if (wq_has_sleeper(&req->t->poll_waitq))
+		wake_up(&req->t->poll_waitq);
+}
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 46df45e837c9..ef2919e23e0b 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -49,11 +49,22 @@
 #define IPO_IATTR_CONNECT_SQE(x)	\
 	(cpu_to_le32(offsetof(struct nvmf_connect_command, x)))
 
+struct nvmet_poll_data {
+	struct completion		thread_complete;
+	wait_queue_head_t		poll_waitq;
+	struct mutex			list_lock;
+	struct task_struct		*thread;
+	struct list_head		list;
+	struct list_head		done;
+	unsigned int			id;
+};
+
 struct nvmet_ns {
 	struct list_head	dev_link;
 	struct percpu_ref	ref;
 	struct block_device	*bdev;
 	struct file		*file;
+	struct nvmet_poll_data	*t;
 	bool			readonly;
 	u32			nsid;
 	u32			blksize_shift;
@@ -63,6 +74,8 @@ struct nvmet_ns {
 	u32			anagrpid;
 
 	bool			buffered_io;
+	bool			use_poll;
+	bool			poll;
 	bool			enabled;
 	struct nvmet_subsys	*subsys;
 	const char		*device_path;
@@ -292,9 +305,15 @@ struct nvmet_req {
 	struct nvmet_ns		*ns;
 	struct scatterlist	*sg;
 	struct bio_vec		inline_bvec[NVMET_MAX_INLINE_BIOVEC];
+	struct completion	wait;
+	bool			poll;
+	struct nvmet_poll_data  *t;
+	struct list_head	poll_entry;
 	union {
 		struct {
-			struct bio      inline_bio;
+			struct bio		inline_bio;
+			blk_qc_t		cookie;
+			struct bio		*last_bio;
 		} b;
 		struct {
 			bool			mpool_alloc;
@@ -442,6 +461,16 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys,
 void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
 		u8 event_info, u8 log_page);
 
+int nvmet_ns_start_poll(struct nvmet_ns *ns);
+void nvmet_ns_stop_poll(struct nvmet_ns *ns);
+void nvmet_req_prep_poll(struct nvmet_req *req);
+void nvmet_req_issue_poll(struct nvmet_req *req);
+
+void nvmet_req_poll_complete(struct nvmet_req *req);
+void nvmet_bdev_poll_complete(struct nvmet_req *req);
+void nvmet_bdev_req_complete(struct nvmet_req *req);
+void nvmet_req_done(struct nvmet_req *req);
+
 #define NVMET_QUEUE_SIZE	1024
 #define NVMET_NR_QUEUES		128
 #define NVMET_MAX_CMD		NVMET_QUEUE_SIZE
-- 
2.22.1


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH 2/2] nvmet: add file-ns polling support
  2019-12-10  6:25 [RFC PATCH 0/2] nvmet: add polling support Chaitanya Kulkarni
  2019-12-10  6:25 ` [RFC PATCH 1/2] nvmet: add bdev-ns " Chaitanya Kulkarni
@ 2019-12-10  6:25 ` " Chaitanya Kulkarni
  2019-12-12  1:01 ` [RFC PATCH 0/2] nvmet: add " Sagi Grimberg
  2 siblings, 0 replies; 12+ messages in thread
From: Chaitanya Kulkarni @ 2019-12-10  6:25 UTC (permalink / raw)
  To: sagi, hch; +Cc: Chaitanya Kulkarni, linux-nvme

This patch adds polling support for file-ns. We only enable polling
based on underlying fs support and use_poll attr. By default we don't
poll on any operation we poll on the request with Reqd/write when we
receive -EIOCBQUEUED i.e. for direct I/O.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 drivers/nvme/target/core.c        |  4 +++
 drivers/nvme/target/io-cmd-file.c | 60 +++++++++++++++++++++++++++----
 drivers/nvme/target/nvmet.h       |  9 +++--
 3 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index d8f9130d1cd1..cd2f5c6f896e 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -514,12 +514,16 @@ inline void nvmet_req_done(struct nvmet_req *req)
 {
 	if (req->ns->bdev)
 		nvmet_bdev_req_complete(req);
+	if (req->ns->file)
+		nvmet_file_req_complete(req);
 }
 
 inline void nvmet_req_poll_complete(struct nvmet_req *req)
 {
 	if (req->ns->bdev)
 		nvmet_bdev_poll_complete(req);
+	if (req->ns->file)
+		nvmet_file_poll_complete(req);
 }
 
 int nvmet_ns_enable(struct nvmet_ns *ns)
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index caebfce06605..1f49a02fd437 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -16,6 +16,7 @@
 void nvmet_file_ns_disable(struct nvmet_ns *ns)
 {
 	if (ns->file) {
+		ns->poll ? nvmet_ns_stop_poll(ns) : 0;
 		if (ns->buffered_io)
 			flush_workqueue(buffered_io_wq);
 		mempool_destroy(ns->bvec_pool);
@@ -72,6 +73,11 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns)
 		goto err;
 	}
 
+	ns->poll = ns->use_poll && ns->file->f_op->iopoll;
+	ret = ns->poll ? nvmet_ns_start_poll(ns) : 0;
+	if (ret)
+		goto err;
+
 	return ret;
 err:
 	ns->size = 0;
@@ -114,9 +120,8 @@ static ssize_t nvmet_file_submit_bvec(struct nvmet_req *req, loff_t pos,
 	return call_iter(iocb, &iter);
 }
 
-static void nvmet_file_io_done(struct kiocb *iocb, long ret, long ret2)
+void nvmet_file_req_complete(struct nvmet_req *req)
 {
-	struct nvmet_req *req = container_of(iocb, struct nvmet_req, f.iocb);
 	u16 status = NVME_SC_SUCCESS;
 
 	if (req->f.bvec != req->inline_bvec) {
@@ -126,13 +131,39 @@ static void nvmet_file_io_done(struct kiocb *iocb, long ret, long ret2)
 			mempool_free(req->f.bvec, req->ns->bvec_pool);
 	}
 
-	if (unlikely(ret != req->transfer_len))
-		status = errno_to_nvme_status(req, ret);
+	if (unlikely(req->f.iosize != req->transfer_len))
+		status = errno_to_nvme_status(req, req->f.iosize);
 	nvmet_req_complete(req, status);
 }
 
+static void nvmet_file_io_done(struct kiocb *iocb, long ret, long ret2)
+{
+	struct nvmet_req *req = container_of(iocb, struct nvmet_req, f.iocb);
+
+	req->f.iosize = ret;
+	req->poll ? complete(&req->wait) : nvmet_file_req_complete(req);
+}
+
+void nvmet_file_poll_complete(struct nvmet_req *req)
+{
+	while (!completion_done(&req->wait)) {
+		int ret = req->f.iocb.ki_filp->f_op->iopoll(&req->f.iocb, true);
+
+		if (ret < 0)
+			pr_err("tid %d poll error %d", req->t->id, ret);
+	}
+	/*
+	 * We are out of the lock anyway, by completing the polled request here
+	 * we reduce lock contention and decrease the size of done list which
+	 * reduces the size of list_lock. This allows nvmet_file_execute_rw()
+	 * to make progress as and when we scheduled out.
+	 */
+	nvmet_file_req_complete(req);
+}
+
 static bool nvmet_file_execute_io(struct nvmet_req *req, int ki_flags)
 {
+	struct kiocb *iocb = &req->f.iocb;
 	ssize_t nr_bvec = req->sg_cnt;
 	unsigned long bv_cnt = 0;
 	bool is_sync = false;
@@ -151,7 +182,7 @@ static bool nvmet_file_execute_io(struct nvmet_req *req, int ki_flags)
 		return true;
 	}
 
-	memset(&req->f.iocb, 0, sizeof(struct kiocb));
+	memset(iocb, 0, sizeof(struct kiocb));
 	for_each_sg(req->sg, sg, req->sg_cnt, i) {
 		nvmet_file_init_bvec(&req->f.bvec[bv_cnt], sg);
 		len += req->f.bvec[bv_cnt].bv_len;
@@ -187,13 +218,20 @@ static bool nvmet_file_execute_io(struct nvmet_req *req, int ki_flags)
 	 * A NULL ki_complete ask for synchronous execution, which we want
 	 * for the IOCB_NOWAIT case.
 	 */
-	if (!(ki_flags & IOCB_NOWAIT))
-		req->f.iocb.ki_complete = nvmet_file_io_done;
+	if (!(ki_flags & IOCB_NOWAIT)) {
+		iocb->ki_complete = nvmet_file_io_done;
+		ki_flags |= req->ns->poll ? IOCB_HIPRI : 0;
+	}
 
 	ret = nvmet_file_submit_bvec(req, pos, bv_cnt, total_len, ki_flags);
 
 	switch (ret) {
 	case -EIOCBQUEUED:
+		if (req->ns->poll) {
+			req->poll = true;
+			nvmet_req_prep_poll(req);
+			nvmet_req_issue_poll(req);
+		}
 		return true;
 	case -EAGAIN:
 		if (WARN_ON_ONCE(!(ki_flags & IOCB_NOWAIT)))
@@ -211,6 +249,10 @@ static bool nvmet_file_execute_io(struct nvmet_req *req, int ki_flags)
 	}
 
 complete:
+	/*
+	 * If we are here, then I/O is synchronously completed and ret holds
+	 * number of bytes transferred.
+	 */
 	nvmet_file_io_done(&req->f.iocb, ret, 0);
 	return true;
 }
@@ -379,15 +421,19 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
 	switch (cmd->common.opcode) {
 	case nvme_cmd_read:
 	case nvme_cmd_write:
+		req->poll = false;
 		req->execute = nvmet_file_execute_rw;
 		return 0;
 	case nvme_cmd_flush:
+		req->poll = false;
 		req->execute = nvmet_file_execute_flush;
 		return 0;
 	case nvme_cmd_dsm:
+		req->poll = false;
 		req->execute = nvmet_file_execute_dsm;
 		return 0;
 	case nvme_cmd_write_zeroes:
+		req->poll = false;
 		req->execute = nvmet_file_execute_write_zeroes;
 		return 0;
 	default:
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index ef2919e23e0b..e7e0e0de705e 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -317,9 +317,10 @@ struct nvmet_req {
 		} b;
 		struct {
 			bool			mpool_alloc;
-			struct kiocb            iocb;
-			struct bio_vec          *bvec;
-			struct work_struct      work;
+			struct kiocb		iocb;
+			struct bio_vec		*bvec;
+			struct work_struct	work;
+			long			iosize;
 		} f;
 	};
 	int			sg_cnt;
@@ -469,6 +470,8 @@ void nvmet_req_issue_poll(struct nvmet_req *req);
 void nvmet_req_poll_complete(struct nvmet_req *req);
 void nvmet_bdev_poll_complete(struct nvmet_req *req);
 void nvmet_bdev_req_complete(struct nvmet_req *req);
+void nvmet_file_poll_complete(struct nvmet_req *req);
+void nvmet_file_req_complete(struct nvmet_req *req);
 void nvmet_req_done(struct nvmet_req *req);
 
 #define NVMET_QUEUE_SIZE	1024
-- 
2.22.1


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/2] nvmet: add polling support
  2019-12-10  6:25 [RFC PATCH 0/2] nvmet: add polling support Chaitanya Kulkarni
  2019-12-10  6:25 ` [RFC PATCH 1/2] nvmet: add bdev-ns " Chaitanya Kulkarni
  2019-12-10  6:25 ` [RFC PATCH 2/2] nvmet: add file-ns " Chaitanya Kulkarni
@ 2019-12-12  1:01 ` " Sagi Grimberg
  2019-12-12  5:44   ` Chaitanya Kulkarni
  2020-01-20  4:48   ` Chaitanya Kulkarni
  2 siblings, 2 replies; 12+ messages in thread
From: Sagi Grimberg @ 2019-12-12  1:01 UTC (permalink / raw)
  To: Chaitanya Kulkarni, hch; +Cc: linux-nvme

percpu threads per namespace? Sounds like the wrong approach. These
threads will compete for cpu time with the main nvmet contexts.

Have you considered having the main nvmet contexts incorporate polling
activity between I/Os? Don't have a great dea on how to do it from first
thought...

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/2] nvmet: add polling support
  2019-12-12  1:01 ` [RFC PATCH 0/2] nvmet: add " Sagi Grimberg
@ 2019-12-12  5:44   ` Chaitanya Kulkarni
  2019-12-12 20:32     ` Sagi Grimberg
  2020-01-20  4:48   ` Chaitanya Kulkarni
  1 sibling, 1 reply; 12+ messages in thread
From: Chaitanya Kulkarni @ 2019-12-12  5:44 UTC (permalink / raw)
  To: Sagi Grimberg, hch; +Cc: linux-nvme

On 12/11/2019 05:01 PM, Sagi Grimberg wrote:
> percpu threads per namespace? Sounds like the wrong approach. These
> threads will compete for cpu time with the main nvmet contexts.
>
That make sense, how about a global threadpool for target which can be 
shared between all the subsystem and their name-spaces just like we
have buffered_io_wq?
> Have you considered having the main nvmet contexts incorporate polling
> activity between I/Os? Don't have a great dea on how to do it from first
> thought...
>

I am not able to understand nvmet context, can you please elaborate ?
Are you referring to the pattern we one have in the
nvme_execute_rq_polled() ?


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/2] nvmet: add polling support
  2019-12-12  5:44   ` Chaitanya Kulkarni
@ 2019-12-12 20:32     ` Sagi Grimberg
  2020-01-20  5:13       ` Chaitanya Kulkarni
  0 siblings, 1 reply; 12+ messages in thread
From: Sagi Grimberg @ 2019-12-12 20:32 UTC (permalink / raw)
  To: Chaitanya Kulkarni, hch; +Cc: linux-nvme


>> percpu threads per namespace? Sounds like the wrong approach. These
>> threads will compete for cpu time with the main nvmet contexts.
>>
> That make sense, how about a global threadpool for target which can be
> shared between all the subsystem and their name-spaces just like we
> have buffered_io_wq?
>> Have you considered having the main nvmet contexts incorporate polling
>> activity between I/Os? Don't have a great dea on how to do it from first
>> thought...
>>
> 
> I am not able to understand nvmet context, can you please elaborate ?
> Are you referring to the pattern we one have in the
> nvme_execute_rq_polled() ?

No, we would want non-selective polling. Right now we have nvmet context
starting from the transport going to submit I/O to the backend, or
starting from the backend going to submit to the transport.

Ideally, we'd have these contexts to do the polling instead of a 
different thread that will poll for as much as it can taking away
cpu time?

One way to do it is to place a intermediate thread that will sit between
the transport and the backend but that would yield an additional context
switch in the I/O path (not ideal).

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/2] nvmet: add polling support
  2019-12-12  1:01 ` [RFC PATCH 0/2] nvmet: add " Sagi Grimberg
  2019-12-12  5:44   ` Chaitanya Kulkarni
@ 2020-01-20  4:48   ` Chaitanya Kulkarni
  1 sibling, 0 replies; 12+ messages in thread
From: Chaitanya Kulkarni @ 2020-01-20  4:48 UTC (permalink / raw)
  To: Sagi Grimberg, hch; +Cc: linux-nvme

On 12/11/2019 05:01 PM, Sagi Grimberg wrote:
> percpu threads per namespace? Sounds like the wrong approach. These
> threads will compete for cpu time with the main nvmet contexts.

Isn't having need_resched() placed in the polling thread should
help with the one thread hogging the CPU ?

Please have a look my reply and performance numbers for new patch.


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/2] nvmet: add polling support
  2019-12-12 20:32     ` Sagi Grimberg
@ 2020-01-20  5:13       ` Chaitanya Kulkarni
  0 siblings, 0 replies; 12+ messages in thread
From: Chaitanya Kulkarni @ 2020-01-20  5:13 UTC (permalink / raw)
  To: Sagi Grimberg, hch; +Cc: linux-nvme

On 12/12/2019 12:32 PM, Sagi Grimberg wrote:
>>> >>percpu threads per namespace? Sounds like the wrong approach. These
>>> >>threads will compete for cpu time with the main nvmet contexts.
>>> >>
>> >That make sense, how about a global threadpool for target which can be
>> >shared between all the subsystem and their name-spaces just like we
>> >have buffered_io_wq?
>>> >>Have you considered having the main nvmet contexts incorporate polling
>>> >>activity between I/Os? Don't have a great dea on how to do it from first
>>> >>thought...
>>> >>
>> >
>> >I am not able to understand nvmet context, can you please elaborate ?
>> >Are you referring to the pattern we one have in the
>> >nvme_execute_rq_polled() ?
> No, we would want non-selective polling. Right now we have nvmet context
> starting from the transport going to submit I/O to the backend, or
> starting from the backend going to submit to the transport.
>
> Ideally, we'd have these contexts to do the polling instead of a
> different thread that will poll for as much as it can taking away
> cpu time?
>
> One way to do it is to place a intermediate thread that will sit between
> the transport and the backend but that would yield an additional context
> switch in the I/O path (not ideal).
>
Yes not ideal.

I took numbers on the following patch and I can see
performance improved when enabled polling in QEMU with
nvme-loop.

#nvme list | tr -s ' ' ' '
/dev/nvme0n1 foo QEMU NVMe Ctrl 1 2.10 MB / 2.10 MB 1 B + 9 B 1.0
/dev/nvme1n1 6e79351962f851ad Linux 1 2.10 MB / 2.10 MB 1 B + 9 B 5.5.0-rc

We can see that CPU percentage has gone up to the saturation point
and increase in BW and IOPS.

IOPS/BW:-

Default:-
read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec)
read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec)
read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec)
Poll:-
read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec)
read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec)
read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec)

mpstat:-
Default:-
CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
---------------------------------------------------------
all 1.18 0.00 |60.14| 0.00 0.00 0.00 0.08 0.00 0.00 38.60
---------------------------------------------------------
0 0.00 0.00 | 32.00 |0.00 0.00 0.00 0.00 0.00 0.00 68.00
1 1.01 0.00 | 42.42 |0.00 0.00 0.00 0.00 0.00 0.00 56.57
2 1.01 0.00 | 57.58 |0.00 0.00 0.00 0.00 0.00 0.00 41.41
3 2.02 0.00 | 79.80 |0.00 0.00 0.00 0.00 0.00 0.00 18.18
4 1.01 0.00 | 63.64 |0.00 0.00 0.00 0.00 0.00 0.00 35.35
5 3.09 0.00 | 63.92 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
6 2.02 0.00 | 75.76 |0.00 0.00 0.00 0.00 0.00 0.00 22.22
7 1.02 0.00 | 57.14 |0.00 0.00 0.00 0.00 0.00 0.00 41.84
8 0.00 0.00 | 67.01 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
9 1.01 0.00 | 59.60 |0.00 0.00 0.00 0.00 0.00 0.00 39.39
10 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73
11 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73

Poll:-
CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
---------------------------------------------------------
all 1.08 0.00 98.08 0.00 0.00 0.00 0.08 0.00 0.00 0.75
---------------------------------------------------------
0 2.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 1.00
1 0.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 3.00
2 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
3 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 1.01 0.00 | 98.99 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
6 0.99 0.00 | 99.01 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
7 1.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 2.00
8 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
9 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
10 1.98 0.00| 94.06 |0.00 0.00 0.00 0.00 0.00 0.00 3.96
11 1.00 0.00| 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00

Any thoughts ?


Patch :-
diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
index 2b33836f3d3e..8877ba16305c 100644
--- a/drivers/nvme/target/Makefile
+++ b/drivers/nvme/target/Makefile
@@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP)	+= nvme-fcloop.o
  obj-$(CONFIG_NVME_TARGET_TCP)		+= nvmet-tcp.o

  nvmet-y		+= core.o configfs.o admin-cmd.o fabrics-cmd.o \
-			discovery.o io-cmd-file.o io-cmd-bdev.o
+			discovery.o io-cmd-file.o io-cmd-bdev.o \
+			io-poll.o
  nvme-loop-y	+= loop.o
  nvmet-rdma-y	+= rdma.o
  nvmet-fc-y	+= fc.o
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 98613a45bd3b..0e6e8b0dbf79 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct 
config_item *item,

  CONFIGFS_ATTR(nvmet_ns_, buffered_io);

+static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page)
+{
+	return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll);
+}
+
+static ssize_t nvmet_ns_use_poll_store(struct config_item *item,
+		const char *page, size_t count)
+{
+	struct nvmet_ns *ns = to_nvmet_ns(item);
+	bool val;
+
+	if (strtobool(page, &val))
+		return -EINVAL;
+
+	mutex_lock(&ns->subsys->lock);
+	if (ns->enabled) {
+		pr_err("disable ns before setting use_poll value.\n");
+		mutex_unlock(&ns->subsys->lock);
+		return -EINVAL;
+	}
+
+	ns->use_poll = val;
+	mutex_unlock(&ns->subsys->lock);
+	return count;
+}
+
+CONFIGFS_ATTR(nvmet_ns_, use_poll);
+
  static struct configfs_attribute *nvmet_ns_attrs[] = {
  	&nvmet_ns_attr_device_path,
  	&nvmet_ns_attr_device_nguid,
@@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
  	&nvmet_ns_attr_ana_grpid,
  	&nvmet_ns_attr_enable,
  	&nvmet_ns_attr_buffered_io,
+	&nvmet_ns_attr_use_poll,
  #ifdef CONFIG_PCI_P2PDMA
  	&nvmet_ns_attr_p2pmem,
  #endif
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 28438b833c1b..af91240b1a1f 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct 
nvmet_ctrl *ctrl,
  		ns->nsid);
  }

+inline void nvmet_req_poll(struct nvmet_req *req)
+{
+	if (req->ns->bdev)
+		nvmet_req_bdev_poll(req);
+}
+
+inline void nvmet_req_poll_complete(struct nvmet_req *req)
+{
+	if (req->ns->bdev)
+		nvmet_bdev_req_complete(req);
+}
+
  int nvmet_ns_enable(struct nvmet_ns *ns)
  {
  	struct nvmet_subsys *subsys = ns->subsys;
@@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys 
*subsys, u32 nsid)

  	uuid_gen(&ns->uuid);
  	ns->buffered_io = false;
+	ns->use_poll = false;

  	return ns;diff --git a/drivers/nvme/target/Makefile 
b/drivers/nvme/target/Makefile
index 2b33836f3d3e..8877ba16305c 100644
--- a/drivers/nvme/target/Makefile
+++ b/drivers/nvme/target/Makefile
@@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP)	+= nvme-fcloop.o
  obj-$(CONFIG_NVME_TARGET_TCP)		+= nvmet-tcp.o

  nvmet-y		+= core.o configfs.o admin-cmd.o fabrics-cmd.o \
-			discovery.o io-cmd-file.o io-cmd-bdev.o
+			discovery.o io-cmd-file.o io-cmd-bdev.o \
+			io-poll.o
  nvme-loop-y	+= loop.o
  nvmet-rdma-y	+= rdma.o
  nvmet-fc-y	+= fc.o
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 98613a45bd3b..0e6e8b0dbf79 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct 
config_item *item,

  CONFIGFS_ATTR(nvmet_ns_, buffered_io);

+static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page)
+{
+	return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll);
+}
+
+static ssize_t nvmet_ns_use_poll_store(struct config_item *item,
+		const char *page, size_t count)
+{
+	struct nvmet_ns *ns = to_nvmet_ns(item);
+	bool val;
+
+	if (strtobool(page, &val))
+		return -EINVAL;
+
+	mutex_lock(&ns->subsys->lock);
+	if (ns->enabled) {
+		pr_err("disable ns before setting use_poll value.\n");
+		mutex_unlock(&ns->subsys->lock);
+		return -EINVAL;
+	}
+
+	ns->use_poll = val;
+	mutex_unlock(&ns->subsys->lock);
+	return count;
+}
+
+CONFIGFS_ATTR(nvmet_ns_, use_poll);
+
  static struct configfs_attribute *nvmet_ns_attrs[] = {
  	&nvmet_ns_attr_device_path,
  	&nvmet_ns_attr_device_nguid,
@@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
  	&nvmet_ns_attr_ana_grpid,
  	&nvmet_ns_attr_enable,
  	&nvmet_ns_attr_buffered_io,
+	&nvmet_ns_attr_use_poll,
  #ifdef CONFIG_PCI_P2PDMA
  	&nvmet_ns_attr_p2pmem,
  #endif
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 28438b833c1b..af91240b1a1f 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct 
nvmet_ctrl *ctrl,
  		ns->nsid);
  }

+inline void nvmet_req_poll(struct nvmet_req *req)
+{
+	if (req->ns->bdev)
+		nvmet_req_bdev_poll(req);
+}
+
+inline void nvmet_req_poll_complete(struct nvmet_req *req)
+{
+	if (req->ns->bdev)
+		nvmet_bdev_req_complete(req);
+}
+
  int nvmet_ns_enable(struct nvmet_ns *ns)
  {
  	struct nvmet_subsys *subsys = ns->subsys;
@@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys 
*subsys, u32 nsid)

  	uuid_gen(&ns->uuid);
  	ns->buffered_io = false;
+	ns->use_poll = false;

  	return ns;
  }
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index b6fca0e421ef..317c9b427b71 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device 
*bdev, struct nvme_id_ns *id)

  int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
  {
+	int fmode = FMODE_READ | FMODE_WRITE;
+	struct request_queue *q;
  	int ret;

-	ns->bdev = blkdev_get_by_path(ns->device_path,
-			FMODE_READ | FMODE_WRITE, NULL);
+	ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL);
  	if (IS_ERR(ns->bdev)) {
  		ret = PTR_ERR(ns->bdev);
  		if (ret != -ENOTBLK) {
@@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
  					ns->device_path, PTR_ERR(ns->bdev));
  		}
  		ns->bdev = NULL;
-		return ret;
+		goto out;
  	}
  	ns->size = i_size_read(ns->bdev->bd_inode);
  	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
-	return 0;
+	q = bdev_get_queue(ns->bdev);
+	ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
+	ret = ns->poll ? nvmet_ns_start_poll(ns) : 0;
+out:
+	return ret;
  }

  void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
  {
  	if (ns->bdev) {
+		ns->poll ? nvmet_ns_stop_poll(ns) : 0;
  		blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ);
  		ns->bdev = NULL;
  	}
@@ -133,15 +139,34 @@ static u16 blk_to_nvme_status(struct nvmet_req 
*req, blk_status_t blk_sts)
  	return status;
  }

-static void nvmet_bio_done(struct bio *bio)
+void nvmet_bdev_req_complete(struct nvmet_req *req)
  {
-	struct nvmet_req *req = bio->bi_private;
+	struct bio *bio = req->b.last_bio;

  	nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status));
  	if (bio != &req->b.inline_bio)
  		bio_put(bio);
  }

+static void nvmet_bio_done(struct bio *bio)
+{
+	struct nvmet_req *req = bio->bi_private;
+
+	req->b.last_bio = bio;
+
+	req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req);
+}
+
+void nvmet_req_bdev_poll(struct nvmet_req *req)
+{
+	struct request_queue *q = bdev_get_queue(req->ns->bdev);
+	int ret = blk_poll(q, req->b.cookie, true);
+
+	if (ret < 0) {
+		pr_err("tid %d poll error %d", req->t->id, ret);
+	}
+}
+
  static void nvmet_bdev_execute_rw(struct nvmet_req *req)
  {
  	int sg_cnt = req->sg_cnt;
@@ -185,7 +210,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
  	bio->bi_end_io = nvmet_bio_done;
  	bio->bi_opf = op;

-	blk_start_plug(&plug);
+	if (!req->poll)
+		blk_start_plug(&plug);
  	for_each_sg(req->sg, sg, req->sg_cnt, i) {
  		while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
  				!= sg->length) {
@@ -204,8 +230,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req 
*req)
  		sg_cnt--;
  	}

-	submit_bio(bio);
-	blk_finish_plug(&plug);
+	req->b.last_bio = bio;
+	if (req->poll)
+		req->b.last_bio->bi_opf |= REQ_HIPRI;
+
+	req->b.cookie = submit_bio(bio);
+
+	nvmet_req_prep_poll(req);
+	nvmet_req_issue_poll(req);
+	if (!req->poll)
+		blk_finish_plug(&plug);
  }

  static void nvmet_bdev_execute_flush(struct nvmet_req *req)
@@ -330,15 +364,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
  	switch (cmd->common.opcode) {
  	case nvme_cmd_read:
  	case nvme_cmd_write:
+		req->poll = req->ns->poll;
  		req->execute = nvmet_bdev_execute_rw;
  		return 0;
  	case nvme_cmd_flush:
+		req->poll = false;
  		req->execute = nvmet_bdev_execute_flush;
  		return 0;
  	case nvme_cmd_dsm:
+		req->poll = false;
  		req->execute = nvmet_bdev_execute_dsm;
  		return 0;
  	case nvme_cmd_write_zeroes:
+		req->poll = false;
  		req->execute = nvmet_bdev_execute_write_zeroes;
  		return 0;
  	default:
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 46df45e837c9..e5991e36ed6f 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -49,11 +49,22 @@
  #define IPO_IATTR_CONNECT_SQE(x)	\
  	(cpu_to_le32(offsetof(struct nvmf_connect_command, x)))

+struct nvmet_poll_data {
+	struct completion		thread_complete;
+	wait_queue_head_t		poll_waitq;
+	struct mutex			list_lock;
+	struct task_struct		*thread;
+	struct list_head		list;
+	struct list_head		done;
+	unsigned int			id;
+};
+diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
index 2b33836f3d3e..8877ba16305c 100644
--- a/drivers/nvme/target/Makefile
+++ b/drivers/nvme/target/Makefile
@@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP)	+= nvme-fcloop.o
  obj-$(CONFIG_NVME_TARGET_TCP)		+= nvmet-tcp.o

  nvmet-y		+= core.o configfs.o admin-cmd.o fabrics-cmd.o \
-			discovery.o io-cmd-file.o io-cmd-bdev.o
+			discovery.o io-cmd-file.o io-cmd-bdev.o \
+			io-poll.o
  nvme-loop-y	+= loop.o
  nvmet-rdma-y	+= rdma.o
  nvmet-fc-y	+= fc.o
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 98613a45bd3b..0e6e8b0dbf79 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct 
config_item *item,

  CONFIGFS_ATTR(nvmet_ns_, buffered_io);

+static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page)
+{
+	return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll);
+}
+
+static ssize_t nvmet_ns_use_poll_store(struct config_item *item,
+		const char *page, size_t count)
+{
+	struct nvmet_ns *ns = to_nvmet_ns(item);
+	bool val;
+
+	if (strtobool(page, &val))
+		return -EINVAL;
+
+	mutex_lock(&ns->subsys->lock);
+	if (ns->enabled) {
+		pr_err("disable ns before setting use_poll value.\n");
+		mutex_unlock(&ns->subsys->lock);
+		return -EINVAL;
+	}
+
+	ns->use_poll = val;
+	mutex_unlock(&ns->subsys->lock);
+	return count;
+}
+
+CONFIGFS_ATTR(nvmet_ns_, use_poll);
+
  static struct configfs_attribute *nvmet_ns_attrs[] = {
  	&nvmet_ns_attr_device_path,
  	&nvmet_ns_attr_device_nguid,
@@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
  	&nvmet_ns_attr_ana_grpid,
  	&nvmet_ns_attr_enable,
  	&nvmet_ns_attr_buffered_io,
+	&nvmet_ns_attr_use_poll,
  #ifdef CONFIG_PCI_P2PDMA
  	&nvmet_ns_attr_p2pmem,
  #endif
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 28438b833c1b..af91240b1a1f 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct 
nvmet_ctrl *ctrl,
  		ns->nsid);
  }

+inline void nvmet_req_poll(struct nvmet_req *req)
+{
+	if (req->ns->bdev)
+		nvmet_req_bdev_poll(req);
+}
+
+inline void nvmet_req_poll_complete(struct nvmet_req *req)
+{
+	if (req->ns->bdev)
+		nvmet_bdev_req_complete(req);
+}
+
  int nvmet_ns_enable(struct nvmet_ns *ns)
  {
  	struct nvmet_subsys *subsys = ns->subsys;
@@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys 
*subsys, u32 nsid)

  	uuid_gen(&ns->uuid);
  	ns->buffered_io = false;
+	ns->use_poll = false;

  	return ns;
  }
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index b6fca0e421ef..317c9b427b71 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device 
*bdev, struct nvme_id_ns *id)

  int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
  {
+	int fmode = FMODE_READ | FMODE_WRITE;
+	struct request_queue *q;
  	int ret;

-	ns->bdev = blkdev_get_by_path(ns->device_path,
-			FMODE_READ | FMODE_WRITE, NULL);
+	ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL);
  	if (IS_ERR(ns->bdev)) {
  		ret = PTR_ERR(ns->bdev);
  		if (ret != -ENOTBLK) {
@@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
  					ns->device_path, PTR_ERR(ns->bdev));
  		}
  		ns->bdev = NULL;
-		return ret;
+		goto out;
  	}
  	ns->size = i_size_read(ns->bdev->bd_inode);
  	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
-	return 0;
+	q = bdev_get_queue(ns->bdev);
+	ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
+	ret = ns->poll ? nvmet_ns_start_poll(ns) : 0;
+out:
+	return ret;
  }

  void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
  {
  	if (ns->bdev) {
+		ns->poll ? nvmet_ns_stop_poll(ns) : 0;
  		blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ);
  		ns->bdev = NULL;
  	}
@@ -133,15 +139,34 @@ static u16 blk_to_nvme_status(struct nvmet_req 
*req, blk_status_t blk_sts)
  	return status;
  }

-static void nvmet_bio_done(struct bio *bio)
+void nvmet_bdev_req_complete(struct nvmet_req *req)
  {
-	struct nvmet_req *req = bio->bi_private;
+	struct bio *bio = req->b.last_bio;

  	nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status));
  	if (bio != &req->b.inline_bio)
  		bio_put(bio);
  }

+static void nvmet_bio_done(struct bio *bio)
+{
+	struct nvmet_req *req = bio->bi_private;
+
+	req->b.last_bio = bio;
+
+	req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req);
+}
+
+void nvmet_req_bdev_poll(struct nvmet_req *req)
+{
+	struct request_queue *q = bdev_get_queue(req->ns->bdev);
+	int ret = blk_poll(q, req->b.cookie, true);
+
+	if (ret < 0) {
+		pr_err("tid %d poll error %d", req->t->id, ret);
+	}
+}
+
  static void nvmet_bdev_execute_rw(struct nvmet_req *req)
  {
  	int sg_cnt = req->sg_cnt;
@@ -185,7 +210,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
  	bio->bi_end_io = nvmet_bio_done;
  	bio->bi_opf = op;

-	blk_start_plug(&plug);
+	if (!req->poll)
+		blk_start_plug(&plug);
  	for_each_sg(req->sg, sg, req->sg_cnt, i) {
  		while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
  				!= sg->length) {
@@ -204,8 +230,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req 
*req)
  		sg_cnt--;
  	}

-	submit_bio(bio);
-	blk_finish_plug(&plug);
+	req->b.last_bio = bio;
+	if (req->poll)
+		req->b.last_bio->bi_opf |= REQ_HIPRI;
+
+	req->b.cookie = submit_bio(bio);
+
+	nvmet_req_prep_poll(req);
+	nvmet_req_issue_poll(req);
+	if (!req->poll)
+		blk_finish_plug(&plug);
  }

  static void nvmet_bdev_execute_flush(struct nvmet_req *req)
@@ -330,15 +364,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
  	switch (cmd->common.opcode) {
  	case nvme_cmd_read:
  	case nvme_cmd_write:
+		req->poll = req->ns->poll;
  		req->execute = nvmet_bdev_execute_rw;
  		return 0;
  	case nvme_cmd_flush:
+		req->poll = false;
  		req->execute = nvmet_bdev_execute_flush;
  		return 0;
  	case nvme_cmd_dsm:
+		req->poll = false;
  		req->execute = nvmet_bdev_execute_dsm;
  		return 0;
  	case nvme_cmd_write_zeroes:
+		req->poll = false;
  		req->execute = nvmet_bdev_execute_write_zeroes;
  		return 0;
  	default:
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 46df45e837c9..e5991e36ed6f 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -49,11 +49,22 @@
  #define IPO_IATTR_CONNECT_SQE(x)	\
  	(cpu_to_le32(offsetof(struct nvmf_connect_command, x)))

+struct nvmet_poll_data {
+	struct completion		thread_complete;
+	wait_queue_head_t		poll_waitq;
+	struct mutex			list_lock;
+	struct task_struct		*thread;
+	struct list_head		list;
+	struct list_head		done;
+	unsigned int			id;
+};
+
  struct nvmet_ns {
  	struct list_head	dev_link;
  	struct percpu_ref	ref;
  	struct block_device	*bdev;
  	struct file		*file;
+	struct nvmet_poll_data	*t;
  	bool			readonly;
  	u32			nsid;
  	u32			blksize_shift;
@@ -63,6 +74,8 @@ struct nvmet_ns {
  	u32			anagrpid;

  	bool			buffered_io;
+	bool			use_poll;
+	bool			poll;
  	bool			enabled;
  	struct nvmet_subsys	*subsys;
  	const char		*device_path;
@@ -292,9 +305,15 @@ struct nvmet_req {
  	struct nvmet_ns		*ns;
  	struct scatterlist	*sg;
  	struct bio_vec		inline_bvec[NVMET_MAX_INLINE_BIOVEC];
+	struct completion	wait;
+	bool			poll;
+	struct nvmet_poll_data  *t;
+	struct list_head	poll_entry;
  	union {
  		struct {
-			struct bio      inline_bio;
+			struct bio		inline_bio;
+			blk_qc_t		cookie;
+			struct bio		*last_bio;
  		} b;
  		struct {
  			bool			mpool_alloc;
@@ -442,6 +461,17 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys 
*subsys,
  void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
  		u8 event_info, u8 log_page);

+int nvmet_ns_start_poll(struct nvmet_ns *ns);
+void nvmet_ns_stop_poll(struct nvmet_ns *ns);
+void nvmet_req_prep_poll(struct nvmet_req *req);
+void nvmet_req_issue_poll(struct nvmet_req *req);
+
+void nvmet_req_bdev_poll(struct nvmet_req *req);
+void nvmet_req_poll(struct nvmet_req *req);
+void nvmet_req_poll_complete(struct nvmet_req *req);
+void nvmet_bdev_req_complete(struct nvmet_req *req);
+void nvmet_req_done(struct nvmet_req *req);
+
  #define NVMET_QUEUE_SIZE	1024
  #define NVMET_NR_QUEUES		128
  #define NVMET_MAX_CMD		NVMET_QUEUE_SIZE
-- 
2.22.1


  struct nvmet_ns {
  	struct list_head	dev_link;
  	struct percpu_ref	ref;
  	struct block_device	*bdev;
  	struct file		*file;
+	struct nvmet_poll_data	*t;
  	bool			readonly;
  	u32			nsid;
  	u32			blksize_shift;
@@ -63,6 +74,8 @@ struct nvmet_ns {
  	u32			anagrpid;

  	bool			buffered_io;
+	bool			use_poll;
+	bool			poll;
  	bool			enabled;
  	struct nvmet_subsys	*subsys;
  	const char		*device_path;
@@ -292,9 +305,15 @@ struct nvmet_req {
  	struct nvmet_ns		*ns;
  	struct scatterlist	*sg;
  	struct bio_vec		inline_bvec[NVMET_MAX_INLINE_BIOVEC];
+	struct completion	wait;
+	bool			poll;
+	struct nvmet_poll_data  *t;
+	struct list_head	poll_entry;
  	union {
  		struct {
-			struct bio      inline_bio;
+			struct bio		inline_bio;
+			blk_qc_t		cookie;
+			struct bio		*last_bio;
  		} b;
  		struct {
  			bool			mpool_alloc;
@@ -442,6 +461,17 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys 
*subsys,
  void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
  		u8 event_info, u8 log_page);

+int nvmet_ns_start_poll(struct nvmet_ns *ns);
+void nvmet_ns_stop_poll(struct nvmet_ns *ns);
+void nvmet_req_prep_poll(struct nvmet_req *req);
+void nvmet_req_issue_poll(struct nvmet_req *req);
+
+void nvmet_req_bdev_poll(struct nvmet_req *req);
+void nvmet_req_poll(struct nvmet_req *req);
+void nvmet_req_poll_complete(struct nvmet_req *req);
+void nvmet_bdev_req_complete(struct nvmet_req *req);
+void nvmet_req_done(struct nvmet_req *req);
+
  #define NVMET_QUEUE_SIZE	1024
  #define NVMET_NR_QUEUES		128
  #define NVMET_MAX_CMD		NVMET_QUEUE_SIZE
-- 
2.22.1


  }
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index b6fca0e421ef..317c9b427b71 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device 
*bdev, struct nvme_id_ns *id)

  int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
  {
+	int fmode = FMODE_READ | FMODE_WRITE;
+	struct request_queue *q;
  	int ret;

-	ns->bdev = blkdev_get_by_path(ns->device_path,
-			FMODE_READ | FMODE_WRITE, NULL);
+	ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL);
  	if (IS_ERR(ns->bdev)) {
  		ret = PTR_ERR(ns->bdev);
  		if (ret != -ENOTBLK) {
@@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
  					ns->device_path, PTR_ERR(ns->bdev));
  		}
  		ns->bdev = NULL;
-		return ret;
+		goto out;
  	}
  	ns->size = i_size_read(ns->bdev->bd_inode);
  	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
-	return 0;
+	q = bdev_get_queue(ns->bdev);
+	ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
+	ret = ns->poll ? nvmet_ns_start_poll(ns) : 0;
+out:
+	return ret;
  }

  void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
  {
  	if (ns->bdev) {
+		ns->poll ? nvmet_ns_stop_poll(ns) : 0;
  		blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ);
  		ns->bdev = NULL;
  	}
@@ -133,15 +139,34 @@ static u16 blk_to_nvme_status(struct nvmet_req 
*req, blk_status_t blk_sts)
  	return status;
  }

-static void nvmet_bio_done(struct bio *bio)
+void nvmet_bdev_req_complete(struct nvmet_req *req)
  {
-	struct nvmet_req *req = bio->bi_private;
+	struct bio *bio = req->b.last_bio;

  	nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status));
  	if (bio != &req->b.inline_bio)
  		bio_put(bio);
  }

+static void nvmet_bio_done(struct bio *bio)
+{
+	struct nvmet_req *req = bio->bi_private;
+
+	req->b.last_bio = bio;
+
+	req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req);
+}
+
+void nvmet_req_bdev_poll(struct nvmet_req *req)
+{
+	struct request_queue *q = bdev_get_queue(req->ns->bdev);
+	int ret = blk_poll(q, req->b.cookie, true);
+
+	if (ret < 0) {
+		pr_err("tid %d poll error %d", req->t->id, ret);
+	}
+}
+
  static void nvmet_bdev_execute_rw(struct nvmet_req *req)
  {
  	int sg_cnt = req->sg_cnt;
@@ -185,7 +210,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
  	bio->bi_end_io = nvmet_bio_done;
  	bio->bi_opf = op;

-	blk_start_plug(&plug);
+	if (!req->poll)
+		blk_start_plug(&plug);
  	for_each_sg(req->sg, sg, req->sg_cnt, i) {
  		while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
  				!= sg->length) {
@@ -204,8 +230,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req 
*req)
  		sg_cnt--;
  	}

-	submit_bio(bio);
-	blk_finish_plug(&plug);
+	req->b.last_bio = bio;
+	if (req->poll)
+		req->b.last_bio->bi_opf |= REQ_HIPRI;
+
+	req->b.cookie = submit_bio(bio);
+
+	nvmet_req_prep_poll(req);
+	nvmet_req_issue_poll(req);
+	if (!req->poll)
+		blk_finish_plug(&plug);
  }

  static void nvmet_bdev_execute_flush(struct nvmet_req *req)
@@ -330,15 +364,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
  	switch (cmd->common.opcode) {
  	case nvme_cmd_read:
  	case nvme_cmd_write:
+		req->poll = req->ns->poll;
  		req->execute = nvmet_bdev_execute_rw;
  		return 0;
  	case nvme_cmd_flush:
+		req->poll = false;
  		req->execute = nvmet_bdev_execute_flush;
  		return 0;
  	case nvme_cmd_dsm:
+		req->poll = false;
  		req->execute = nvmet_bdev_execute_dsm;
  		return 0;
  	case nvme_cmd_write_zeroes:
+		req->poll = false;
  		req->execute = nvmet_bdev_execute_write_zeroes;
  		return 0;
  	default:
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 46df45e837c9..e5991e36ed6f 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -49,11 +49,22 @@
  #define IPO_IATTR_CONNECT_SQE(x)	\
  	(cpu_to_le32(offsetof(struct nvmf_connect_command, x)))

+struct nvmet_poll_data {
+	struct completion		thread_complete;
+	wait_queue_head_t		poll_waitq;
+	struct mutex			list_lock;
+	struct task_struct		*thread;
+	struct list_head		list;
+	struct list_head		done;
+	unsigned int			id;
+};
+
  struct nvmet_ns {
  	struct list_head	dev_link;
  	struct percpu_ref	ref;
  	struct block_device	*bdev;
  	struct file		*file;
+	struct nvmet_poll_data	*t;
  	bool			readonly;
  	u32			nsid;
  	u32			blksize_shift;
@@ -63,6 +74,8 @@ struct nvmet_ns {
  	u32			anagrpid;

  	bool			buffered_io;
+	bool			use_poll;
+	bool			poll;
  	bool			enabled;
  	struct nvmet_subsys	*subsys;
  	const char		*device_path;
@@ -292,9 +305,15 @@ struct nvmet_req {
  	struct nvmet_ns		*ns;
  	struct scatterlist	*sg;
  	struct bio_vec		inline_bvec[NVMET_MAX_INLINE_BIOVEC];
+	struct completion	wait;
+	bool			poll;
+	struct nvmet_poll_data  *t;
+	struct list_head	poll_entry;
  	union {
  		struct {
-			struct bio      inline_bio;
+			struct bio		inline_bio;
+			blk_qc_t		cookie;
+			struct bio		*last_bio;
  		} b;
  		struct {
  			bool			mpool_alloc;
@@ -442,6 +461,17 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys 
*subsys,
  void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
  		u8 event_info, u8 log_page);

+int nvmet_ns_start_poll(struct nvmet_ns *ns);
+void nvmet_ns_stop_poll(struct nvmet_ns *ns);
+void nvmet_req_prep_poll(struct nvmet_req *req);
+void nvmet_req_issue_poll(struct nvmet_req *req);
+
+void nvmet_req_bdev_poll(struct nvmet_req *req);
+void nvmet_req_poll(struct nvmet_req *req);
+void nvmet_req_poll_complete(struct nvmet_req *req);
+void nvmet_bdev_req_complete(struct nvmet_req *req);
+void nvmet_req_done(struct nvmet_req *req);
+
  #define NVMET_QUEUE_SIZE	1024
  #define NVMET_NR_QUEUES		128
  #define NVMET_MAX_CMD		NVMET_QUEUE_SIZE
-- 
2.22.1










_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 1/2] nvmet: add bdev-ns polling support
  2019-12-10  6:25 ` [RFC PATCH 1/2] nvmet: add bdev-ns " Chaitanya Kulkarni
@ 2020-01-20 12:52   ` Max Gurtovoy
  2020-01-21 19:22     ` Chaitanya Kulkarni
  0 siblings, 1 reply; 12+ messages in thread
From: Max Gurtovoy @ 2020-01-20 12:52 UTC (permalink / raw)
  To: Chaitanya Kulkarni, sagi, hch; +Cc: linux-nvme


On 12/10/2019 8:25 AM, Chaitanya Kulkarni wrote:
> This patch adds support for the bdev-ns polling. We also add a new
> file to keep polling code separate (io-poll.c). The newly added
> configfs attribute allows user to enable/disable polling.
>
> We only enable polling for the READ/WRITE operation based on support
> from bdev and and use_poll configfs attr.
>
> We configure polling per namespace. For each namespace we create
> polling threads. For general approach please have a look at the
> cover-letter of this patch-series.

It would be great to get some numbers here to learn about the trade-off 
for this approach.


>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> ---
>   drivers/nvme/target/Makefile      |   3 +-
>   drivers/nvme/target/configfs.c    |  29 ++++++
>   drivers/nvme/target/core.c        |  13 +++
>   drivers/nvme/target/io-cmd-bdev.c |  61 +++++++++--
>   drivers/nvme/target/io-poll.c     | 165 ++++++++++++++++++++++++++++++
>   drivers/nvme/target/nvmet.h       |  31 +++++-
>   6 files changed, 291 insertions(+), 11 deletions(-)
>   create mode 100644 drivers/nvme/target/io-poll.c
>
> diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
> index 2b33836f3d3e..8877ba16305c 100644
> --- a/drivers/nvme/target/Makefile
> +++ b/drivers/nvme/target/Makefile
> @@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP)	+= nvme-fcloop.o
>   obj-$(CONFIG_NVME_TARGET_TCP)		+= nvmet-tcp.o
>   
>   nvmet-y		+= core.o configfs.o admin-cmd.o fabrics-cmd.o \
> -			discovery.o io-cmd-file.o io-cmd-bdev.o
> +			discovery.o io-cmd-file.o io-cmd-bdev.o \
> +			io-poll.o
>   nvme-loop-y	+= loop.o
>   nvmet-rdma-y	+= rdma.o
>   nvmet-fc-y	+= fc.o
> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
> index 98613a45bd3b..0e6e8b0dbf79 100644
> --- a/drivers/nvme/target/configfs.c
> +++ b/drivers/nvme/target/configfs.c
> @@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item,
>   
>   CONFIGFS_ATTR(nvmet_ns_, buffered_io);
>   
> +static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page)
> +{
> +	return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll);
> +}
> +
> +static ssize_t nvmet_ns_use_poll_store(struct config_item *item,
> +		const char *page, size_t count)
> +{
> +	struct nvmet_ns *ns = to_nvmet_ns(item);
> +	bool val;
> +
> +	if (strtobool(page, &val))
> +		return -EINVAL;
> +
> +	mutex_lock(&ns->subsys->lock);
> +	if (ns->enabled) {
> +		pr_err("disable ns before setting use_poll value.\n");
> +		mutex_unlock(&ns->subsys->lock);
> +		return -EINVAL;
> +	}
> +
> +	ns->use_poll = val;
> +	mutex_unlock(&ns->subsys->lock);
> +	return count;
> +}
> +
> +CONFIGFS_ATTR(nvmet_ns_, use_poll);
> +
>   static struct configfs_attribute *nvmet_ns_attrs[] = {
>   	&nvmet_ns_attr_device_path,
>   	&nvmet_ns_attr_device_nguid,
> @@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
>   	&nvmet_ns_attr_ana_grpid,
>   	&nvmet_ns_attr_enable,
>   	&nvmet_ns_attr_buffered_io,
> +	&nvmet_ns_attr_use_poll,
>   #ifdef CONFIG_PCI_P2PDMA
>   	&nvmet_ns_attr_p2pmem,
>   #endif
> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> index 28438b833c1b..d8f9130d1cd1 100644
> --- a/drivers/nvme/target/core.c
> +++ b/drivers/nvme/target/core.c
> @@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl,
>   		ns->nsid);
>   }
>   
> +inline void nvmet_req_done(struct nvmet_req *req)
> +{
> +	if (req->ns->bdev)
> +		nvmet_bdev_req_complete(req);
> +}
> +
> +inline void nvmet_req_poll_complete(struct nvmet_req *req)
> +{
> +	if (req->ns->bdev)
> +		nvmet_bdev_poll_complete(req);
> +}
> +
>   int nvmet_ns_enable(struct nvmet_ns *ns)
>   {
>   	struct nvmet_subsys *subsys = ns->subsys;
> @@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid)
>   
>   	uuid_gen(&ns->uuid);
>   	ns->buffered_io = false;
> +	ns->use_poll = false;
>   
>   	return ns;
>   }
> diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
> index b6fca0e421ef..13507e0cbc1d 100644
> --- a/drivers/nvme/target/io-cmd-bdev.c
> +++ b/drivers/nvme/target/io-cmd-bdev.c
> @@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
>   
>   int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>   {
> +	int fmode = FMODE_READ | FMODE_WRITE;
> +	struct request_queue *q;
>   	int ret;
>   
> -	ns->bdev = blkdev_get_by_path(ns->device_path,
> -			FMODE_READ | FMODE_WRITE, NULL);
> +	ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL);
>   	if (IS_ERR(ns->bdev)) {
>   		ret = PTR_ERR(ns->bdev);
>   		if (ret != -ENOTBLK) {
> @@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>   					ns->device_path, PTR_ERR(ns->bdev));
>   		}
>   		ns->bdev = NULL;
> -		return ret;
> +		goto out;
>   	}
>   	ns->size = i_size_read(ns->bdev->bd_inode);
>   	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> -	return 0;
> +	q = bdev_get_queue(ns->bdev);
> +	ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
> +	ret = ns->poll ? nvmet_ns_start_poll(ns) : 0;
> +out:
> +	return ret;
>   }
>   
>   void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
>   {
>   	if (ns->bdev) {
> +		ns->poll ? nvmet_ns_stop_poll(ns) : 0;

can you please use an "if" statement instead of the above convention ?


>   		blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ);
>   		ns->bdev = NULL;
>   	}
> @@ -133,15 +139,39 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts)
>   	return status;
>   }
>   
> -static void nvmet_bio_done(struct bio *bio)
> +void nvmet_bdev_req_complete(struct nvmet_req *req)
>   {
> -	struct nvmet_req *req = bio->bi_private;
> +	struct bio *bio = req->b.last_bio;
>   
>   	nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status));
>   	if (bio != &req->b.inline_bio)
>   		bio_put(bio);
>   }
>   
> +static void nvmet_bio_done(struct bio *bio)
> +{
> +	struct nvmet_req *req = bio->bi_private;
> +
> +	req->b.last_bio = bio;
> +
> +	req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req);

Same here for the "if". Lets keep the code as readable as we can


> +}
> +
> +void nvmet_bdev_poll_complete(struct nvmet_req *req)
> +{
> +	struct request_queue *q = bdev_get_queue(req->ns->bdev);
> +
> +	while (!completion_done(&req->wait)) {
> +		int ret = blk_poll(q, req->b.cookie, true);
> +
> +		if (ret < 0) {
> +			pr_err("tid %d poll error %d", req->t->id, ret);
> +			break;
> +		}
> +	}
> +	nvmet_bdev_req_complete(req);
> +}
> +
>   static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>   {
>   	int sg_cnt = req->sg_cnt;
> @@ -185,7 +215,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>   	bio->bi_end_io = nvmet_bio_done;
>   	bio->bi_opf = op;
>   
> -	blk_start_plug(&plug);
> +	if (!req->poll)
> +		blk_start_plug(&plug);
>   	for_each_sg(req->sg, sg, req->sg_cnt, i) {
>   		while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
>   				!= sg->length) {
> @@ -204,8 +235,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>   		sg_cnt--;
>   	}
>   
> -	submit_bio(bio);
> -	blk_finish_plug(&plug);
> +	req->b.last_bio = bio;
> +	if (req->poll)
> +		req->b.last_bio->bi_opf |= REQ_HIPRI;
> +
> +	req->b.cookie = submit_bio(bio);
> +
> +	nvmet_req_prep_poll(req);
> +	nvmet_req_issue_poll(req);
> +	if (!req->poll)
> +		blk_finish_plug(&plug);
>   }
>   
>   static void nvmet_bdev_execute_flush(struct nvmet_req *req)
> @@ -330,15 +369,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
>   	switch (cmd->common.opcode) {
>   	case nvme_cmd_read:
>   	case nvme_cmd_write:
> +		req->poll = req->ns->poll;
>   		req->execute = nvmet_bdev_execute_rw;
>   		return 0;
>   	case nvme_cmd_flush:
> +		req->poll = false;

should be done in some general place for req initialization.


>   		req->execute = nvmet_bdev_execute_flush;
>   		return 0;
>   	case nvme_cmd_dsm:
> +		req->poll = false;
>   		req->execute = nvmet_bdev_execute_dsm;
>   		return 0;
>   	case nvme_cmd_write_zeroes:
> +		req->poll = false;
>   		req->execute = nvmet_bdev_execute_write_zeroes;
>   		return 0;
>   	default:
> diff --git a/drivers/nvme/target/io-poll.c b/drivers/nvme/target/io-poll.c
> new file mode 100644
> index 000000000000..175c939c22ff
> --- /dev/null
> +++ b/drivers/nvme/target/io-poll.c
> @@ -0,0 +1,165 @@
> +#include <linux/blkdev.h>
> +#include <linux/module.h>
> +#include <linux/sched/signal.h>
> +#include <linux/kthread.h>
> +#include <uapi/linux/sched/types.h>
> +
> +#include "nvmet.h"
> +#define THREAD_PER_CPU	(num_online_cpus() * 2)
> +
> +static int nvmet_poll_thread(void *data);
> +
> +int nvmet_ns_start_poll(struct nvmet_ns *ns)
> +{
> +	struct nvmet_poll_data *t;
> +	int ret = 0;
> +	int i;
> +
> +	t = kzalloc(sizeof(*t) * THREAD_PER_CPU, GFP_KERNEL);
> +	if (!t) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	for (i = 0; i < THREAD_PER_CPU; i++) {
> +		init_completion(&t[i].thread_complete);
> +		init_waitqueue_head(&t[i].poll_waitq);
> +		INIT_LIST_HEAD(&t[i].list);
> +		INIT_LIST_HEAD(&t[i].done);
> +		mutex_init(&t[i].list_lock);
> +		t[i].id = i;
> +
> +		t[i].thread = kthread_create(nvmet_poll_thread, &t[i],
> +					     "nvmet_poll_thread%s/%d",
> +					     ns->device_path, i);
> +
> +		if (IS_ERR(t[i].thread)) {
> +			ret = PTR_ERR(t[i].thread);
> +			goto err;
> +		}
> +
> +		kthread_bind(t[i].thread, i % num_online_cpus());
> +		wake_up_process(t[i].thread);
> +		wait_for_completion(&t[i].thread_complete);
> +	}
> +
> +	ns->t = t;
> +out:
> +	return ret;
> +err:
> +	nvmet_ns_stop_poll(ns);
> +	goto out;
> +}
> +
> +void nvmet_ns_stop_poll(struct nvmet_ns *ns)
> +{
> +	struct nvmet_poll_data *t = ns->t;
> +	int i;
> +
> +	for (i = 0; i < THREAD_PER_CPU; i++) {
> +		if (!t[i].thread)
> +			continue;
> +
> +		if (wq_has_sleeper(&t[i].poll_waitq))
> +			wake_up(&t[i].poll_waitq);
> +		kthread_park(t[i].thread);
> +		kthread_stop(t[i].thread);
> +	}
> +}
> +
> +static void nvmet_poll(struct nvmet_poll_data *t)
> +{
> +	struct nvmet_req *req, *tmp;
> +
> +	lockdep_assert_held(&t->list_lock);
> +
> +	list_for_each_entry_safe(req, tmp, &t->list, poll_entry) {
> +
> +		if (completion_done(&req->wait)) {
> +			list_move_tail(&req->poll_entry, &t->done);
> +			continue;
> +		}
> +
> +		if (!list_empty(&t->done))
> +			break;
> +
> +		list_del(&req->poll_entry);
> +		mutex_unlock(&t->list_lock);
> +		nvmet_req_poll_complete(req);
> +		mutex_lock(&t->list_lock);
> +	}
> +	mutex_unlock(&t->list_lock);
> +	/*
> +	 * In future we can also add batch timeout or nr request to complete.
> +	 */
> +	while (!list_empty(&t->done)) {
> +		/*
> +		 * We lock and unlock for t->list which gurantee progress of
> +		 * nvmet_xxx_rw_execute() when under pressure while we complete
> +		 * the request.
> +		 */
> +		req = list_first_entry(&t->done, struct nvmet_req, poll_entry);
> +		list_del(&req->poll_entry);
> +		nvmet_req_done(req);
> +	}
> +
> +	mutex_lock(&t->list_lock);
> +}
> +
> +static int nvmet_poll_thread(void *data)
> +{
> +	struct nvmet_poll_data *t = (struct nvmet_poll_data *) data;
> +	DEFINE_WAIT(wait);
> +
> +	complete(&t->thread_complete);
> +
> +	while (!kthread_should_park()) {
> +
> +		mutex_lock(&t->list_lock);
> +		while (!list_empty(&t->list) && !need_resched())
> +			nvmet_poll(t);
> +		mutex_unlock(&t->list_lock);
> +
> +		prepare_to_wait(&t->poll_waitq, &wait, TASK_INTERRUPTIBLE);
> +		if (signal_pending(current))
> +			flush_signals(current);
> +		smp_mb();
> +		schedule();
> +
> +		finish_wait(&t->poll_waitq, &wait);
> +	}
> +
> +	kthread_parkme();
> +	return 0;
> +}
> +
> +inline void nvmet_req_prep_poll(struct nvmet_req *req)
> +{
> +	if (!req->poll)
> +		return;
> +
> +	init_completion(&req->wait);
> +	req->t = &req->ns->t[smp_processor_id()];
> +}
> +
> +inline void nvmet_req_issue_poll(struct nvmet_req *req)
> +{
> +	if (!req->poll)
> +		return;
> +
> +	while (!mutex_trylock(&req->t->list_lock)) {
> +		if (req->t->id < num_online_cpus())
> +			req->t = &req->ns->t[req->t->id + num_online_cpus()];
> +		else
> +			req->t = &req->ns->t[req->t->id - num_online_cpus()];
> +	}
> +
> +	if (completion_done(&req->wait))
> +		list_add(&req->poll_entry, &req->t->list);
> +	else
> +		list_add_tail(&req->poll_entry, &req->t->list);
> +	mutex_unlock(&req->t->list_lock);
> +
> +	if (wq_has_sleeper(&req->t->poll_waitq))
> +		wake_up(&req->t->poll_waitq);
> +}
> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
> index 46df45e837c9..ef2919e23e0b 100644
> --- a/drivers/nvme/target/nvmet.h
> +++ b/drivers/nvme/target/nvmet.h
> @@ -49,11 +49,22 @@
>   #define IPO_IATTR_CONNECT_SQE(x)	\
>   	(cpu_to_le32(offsetof(struct nvmf_connect_command, x)))
>   
> +struct nvmet_poll_data {
> +	struct completion		thread_complete;
> +	wait_queue_head_t		poll_waitq;
> +	struct mutex			list_lock;
> +	struct task_struct		*thread;
> +	struct list_head		list;
> +	struct list_head		done;
> +	unsigned int			id;
> +};
> +
>   struct nvmet_ns {
>   	struct list_head	dev_link;
>   	struct percpu_ref	ref;
>   	struct block_device	*bdev;
>   	struct file		*file;
> +	struct nvmet_poll_data	*t;
>   	bool			readonly;
>   	u32			nsid;
>   	u32			blksize_shift;
> @@ -63,6 +74,8 @@ struct nvmet_ns {
>   	u32			anagrpid;
>   
>   	bool			buffered_io;
> +	bool			use_poll;
> +	bool			poll;
>   	bool			enabled;
>   	struct nvmet_subsys	*subsys;
>   	const char		*device_path;
> @@ -292,9 +305,15 @@ struct nvmet_req {
>   	struct nvmet_ns		*ns;
>   	struct scatterlist	*sg;
>   	struct bio_vec		inline_bvec[NVMET_MAX_INLINE_BIOVEC];
> +	struct completion	wait;
> +	bool			poll;
> +	struct nvmet_poll_data  *t;
> +	struct list_head	poll_entry;
>   	union {
>   		struct {
> -			struct bio      inline_bio;
> +			struct bio		inline_bio;
> +			blk_qc_t		cookie;
> +			struct bio		*last_bio;
>   		} b;
>   		struct {
>   			bool			mpool_alloc;
> @@ -442,6 +461,16 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys,
>   void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>   		u8 event_info, u8 log_page);
>   
> +int nvmet_ns_start_poll(struct nvmet_ns *ns);
> +void nvmet_ns_stop_poll(struct nvmet_ns *ns);
> +void nvmet_req_prep_poll(struct nvmet_req *req);
> +void nvmet_req_issue_poll(struct nvmet_req *req);
> +
> +void nvmet_req_poll_complete(struct nvmet_req *req);
> +void nvmet_bdev_poll_complete(struct nvmet_req *req);
> +void nvmet_bdev_req_complete(struct nvmet_req *req);
> +void nvmet_req_done(struct nvmet_req *req);
> +
>   #define NVMET_QUEUE_SIZE	1024
>   #define NVMET_NR_QUEUES		128
>   #define NVMET_MAX_CMD		NVMET_QUEUE_SIZE

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 1/2] nvmet: add bdev-ns polling support
  2020-01-20 12:52   ` Max Gurtovoy
@ 2020-01-21 19:22     ` Chaitanya Kulkarni
  2020-01-23 14:23       ` Max Gurtovoy
  0 siblings, 1 reply; 12+ messages in thread
From: Chaitanya Kulkarni @ 2020-01-21 19:22 UTC (permalink / raw)
  To: Max Gurtovoy, sagi, hch; +Cc: linux-nvme

Please see my in-line comments.
On 01/20/2020 04:52 AM, Max Gurtovoy wrote:
>
> On 12/10/2019 8:25 AM, Chaitanya Kulkarni wrote:
>> This patch adds support for the bdev-ns polling. We also add a new
>> file to keep polling code separate (io-poll.c). The newly added
>> configfs attribute allows user to enable/disable polling.
>>
>> We only enable polling for the READ/WRITE operation based on support
>> from bdev and and use_poll configfs attr.
>>
>> We configure polling per namespace. For each namespace we create
>> polling threads. For general approach please have a look at the
>> cover-letter of this patch-series.
>
> It would be great to get some numbers here to learn about the trade-off
> for this approach.
>
>

Already posted these numbers with QEMU nvme-loop here :-
http://lists.infradead.org/pipermail/linux-nvme/2020-January/028692.html

IOPS/BW:-

Default:-
read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec)
read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec)
read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec)
Poll:-
read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec)
read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec)
read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec)

mpstat:-
Default:-
CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
---------------------------------------------------------
all 1.18 0.00 |60.14| 0.00 0.00 0.00 0.08 0.00 0.00 38.60
---------------------------------------------------------
0 0.00 0.00 | 32.00 |0.00 0.00 0.00 0.00 0.00 0.00 68.00
1 1.01 0.00 | 42.42 |0.00 0.00 0.00 0.00 0.00 0.00 56.57
2 1.01 0.00 | 57.58 |0.00 0.00 0.00 0.00 0.00 0.00 41.41
3 2.02 0.00 | 79.80 |0.00 0.00 0.00 0.00 0.00 0.00 18.18
4 1.01 0.00 | 63.64 |0.00 0.00 0.00 0.00 0.00 0.00 35.35
5 3.09 0.00 | 63.92 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
6 2.02 0.00 | 75.76 |0.00 0.00 0.00 0.00 0.00 0.00 22.22
7 1.02 0.00 | 57.14 |0.00 0.00 0.00 0.00 0.00 0.00 41.84
8 0.00 0.00 | 67.01 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
9 1.01 0.00 | 59.60 |0.00 0.00 0.00 0.00 0.00 0.00 39.39
10 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73
11 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73

Poll:-
CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
---------------------------------------------------------
all 1.08 0.00 98.08 0.00 0.00 0.00 0.08 0.00 0.00 0.75
---------------------------------------------------------
0 2.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 1.00
1 0.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 3.00
2 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
3 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 1.01 0.00 | 98.99 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
6 0.99 0.00 | 99.01 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
7 1.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 2.00
8 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
9 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
10 1.98 0.00| 94.06 |0.00 0.00 0.00 0.00 0.00 0.00 3.96
11 1.00 0.00| 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00

>>
>> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
>> ---
>>    drivers/nvme/target/Makefile      |   3 +-
>>    drivers/nvme/target/configfs.c    |  29 ++++++
>>    drivers/nvme/target/core.c        |  13 +++
>>    drivers/nvme/target/io-cmd-bdev.c |  61 +++++++++--
>>    drivers/nvme/target/io-poll.c     | 165 ++++++++++++++++++++++++++++++
>>    drivers/nvme/target/nvmet.h       |  31 +++++-
>>    6 files changed, 291 insertions(+), 11 deletions(-)
>>    create mode 100644 drivers/nvme/target/io-poll.c
>>
>> diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
>> index 2b33836f3d3e..8877ba16305c 100644
>> --- a/drivers/nvme/target/Makefile
>> +++ b/drivers/nvme/target/Makefile
>> @@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP)	+= nvme-fcloop.o
>>    obj-$(CONFIG_NVME_TARGET_TCP)		+= nvmet-tcp.o
>>
>>    nvmet-y		+= core.o configfs.o admin-cmd.o fabrics-cmd.o \
>> -			discovery.o io-cmd-file.o io-cmd-bdev.o
>> +			discovery.o io-cmd-file.o io-cmd-bdev.o \
>> +			io-poll.o
>>    nvme-loop-y	+= loop.o
>>    nvmet-rdma-y	+= rdma.o
>>    nvmet-fc-y	+= fc.o
>> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
>> index 98613a45bd3b..0e6e8b0dbf79 100644
>> --- a/drivers/nvme/target/configfs.c
>> +++ b/drivers/nvme/target/configfs.c
>> @@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item,
>>
>>    CONFIGFS_ATTR(nvmet_ns_, buffered_io);
>>
>> +static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page)
>> +{
>> +	return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll);
>> +}
>> +
>> +static ssize_t nvmet_ns_use_poll_store(struct config_item *item,
>> +		const char *page, size_t count)
>> +{
>> +	struct nvmet_ns *ns = to_nvmet_ns(item);
>> +	bool val;
>> +
>> +	if (strtobool(page, &val))
>> +		return -EINVAL;
>> +
>> +	mutex_lock(&ns->subsys->lock);
>> +	if (ns->enabled) {
>> +		pr_err("disable ns before setting use_poll value.\n");
>> +		mutex_unlock(&ns->subsys->lock);
>> +		return -EINVAL;
>> +	}
>> +
>> +	ns->use_poll = val;
>> +	mutex_unlock(&ns->subsys->lock);
>> +	return count;
>> +}
>> +
>> +CONFIGFS_ATTR(nvmet_ns_, use_poll);
>> +
>>    static struct configfs_attribute *nvmet_ns_attrs[] = {
>>    	&nvmet_ns_attr_device_path,
>>    	&nvmet_ns_attr_device_nguid,
>> @@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
>>    	&nvmet_ns_attr_ana_grpid,
>>    	&nvmet_ns_attr_enable,
>>    	&nvmet_ns_attr_buffered_io,
>> +	&nvmet_ns_attr_use_poll,
>>    #ifdef CONFIG_PCI_P2PDMA
>>    	&nvmet_ns_attr_p2pmem,
>>    #endif
>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
>> index 28438b833c1b..d8f9130d1cd1 100644
>> --- a/drivers/nvme/target/core.c
>> +++ b/drivers/nvme/target/core.c
>> @@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl,
>>    		ns->nsid);
>>    }
>>
>> +inline void nvmet_req_done(struct nvmet_req *req)
>> +{
>> +	if (req->ns->bdev)
>> +		nvmet_bdev_req_complete(req);
>> +}
>> +
>> +inline void nvmet_req_poll_complete(struct nvmet_req *req)
>> +{
>> +	if (req->ns->bdev)
>> +		nvmet_bdev_poll_complete(req);
>> +}
>> +
>>    int nvmet_ns_enable(struct nvmet_ns *ns)
>>    {
>>    	struct nvmet_subsys *subsys = ns->subsys;
>> @@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid)
>>
>>    	uuid_gen(&ns->uuid);
>>    	ns->buffered_io = false;
>> +	ns->use_poll = false;
>>
>>    	return ns;
>>    }
>> diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
>> index b6fca0e421ef..13507e0cbc1d 100644
>> --- a/drivers/nvme/target/io-cmd-bdev.c
>> +++ b/drivers/nvme/target/io-cmd-bdev.c
>> @@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
>>
>>    int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>>    {
>> +	int fmode = FMODE_READ | FMODE_WRITE;
>> +	struct request_queue *q;
>>    	int ret;
>>
>> -	ns->bdev = blkdev_get_by_path(ns->device_path,
>> -			FMODE_READ | FMODE_WRITE, NULL);
>> +	ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL);
>>    	if (IS_ERR(ns->bdev)) {
>>    		ret = PTR_ERR(ns->bdev);
>>    		if (ret != -ENOTBLK) {
>> @@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>>    					ns->device_path, PTR_ERR(ns->bdev));
>>    		}
>>    		ns->bdev = NULL;
>> -		return ret;
>> +		goto out;
>>    	}
>>    	ns->size = i_size_read(ns->bdev->bd_inode);
>>    	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
>> -	return 0;
>> +	q = bdev_get_queue(ns->bdev);
>> +	ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
>> +	ret = ns->poll ? nvmet_ns_start_poll(ns) : 0;
>> +out:
>> +	return ret;
>>    }
>>
>>    void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
>>    {
>>    	if (ns->bdev) {
>> +		ns->poll ? nvmet_ns_stop_poll(ns) : 0;
>
> can you please use an "if" statement instead of the above convention ?
>

This is just as RFC, will send out formal patch series with all the
coding style fixes.

>
>>    		blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ);
>>    		ns->bdev = NULL;
>>    	}
>> @@ -133,15 +139,39 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts)
>>    	return status;
>>    }
>>
>> -static void nvmet_bio_done(struct bio *bio)
>> +void nvmet_bdev_req_complete(struct nvmet_req *req)
>>    {
>> -	struct nvmet_req *req = bio->bi_private;
>> +	struct bio *bio = req->b.last_bio;
>>
>>    	nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status));
>>    	if (bio != &req->b.inline_bio)
>>    		bio_put(bio);
>>    }
>>
>> +static void nvmet_bio_done(struct bio *bio)
>> +{
>> +	struct nvmet_req *req = bio->bi_private;
>> +
>> +	req->b.last_bio = bio;
>> +
>> +	req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req);
>
> Same here for the "if". Lets keep the code as readable as we can
>
Same as above.
>
>> +}
>> +
>> +void nvmet_bdev_poll_complete(struct nvmet_req *req)
>> +{
>> +	struct request_queue *q = bdev_get_queue(req->ns->bdev);
>> +
>> +	while (!completion_done(&req->wait)) {
>> +		int ret = blk_poll(q, req->b.cookie, true);
>> +
>> +		if (ret < 0) {
>> +			pr_err("tid %d poll error %d", req->t->id, ret);
>> +			break;
>> +		}
>> +	}
>> +	nvmet_bdev_req_complete(req);
>> +}
>> +IOPS/BW:-

Default:-
read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec)
read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec)
read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec)
Poll:-
read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec)
read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec)
read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec)

mpstat:-
Default:-
CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
---------------------------------------------------------
all 1.18 0.00 |60.14| 0.00 0.00 0.00 0.08 0.00 0.00 38.60
---------------------------------------------------------
0 0.00 0.00 | 32.00 |0.00 0.00 0.00 0.00 0.00 0.00 68.00
1 1.01 0.00 | 42.42 |0.00 0.00 0.00 0.00 0.00 0.00 56.57
2 1.01 0.00 | 57.58 |0.00 0.00 0.00 0.00 0.00 0.00 41.41
3 2.02 0.00 | 79.80 |0.00 0.00 0.00 0.00 0.00 0.00 18.18
4 1.01 0.00 | 63.64 |0.00 0.00 0.00 0.00 0.00 0.00 35.35
5 3.09 0.00 | 63.92 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
6 2.02 0.00 | 75.76 |0.00 0.00 0.00 0.00 0.00 0.00 22.22
7 1.02 0.00 | 57.14 |0.00 0.00 0.00 0.00 0.00 0.00 41.84
8 0.00 0.00 | 67.01 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
9 1.01 0.00 | 59.60 |0.00 0.00 0.00 0.00 0.00 0.00 39.39
10 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73
11 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73

Poll:-
CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
---------------------------------------------------------
all 1.08 0.00 98.08 0.00 0.00 0.00 0.08 0.00 0.00 0.75
---------------------------------------------------------
0 2.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 1.00
1 0.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 3.00
2 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
3 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 1.01 0.00 | 98.99 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
6 0.99 0.00 | 99.01 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
7 1.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 2.00
8 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
9 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
10 1.98 0.00| 94.06 |0.00 0.00 0.00 0.00 0.00 0.00 3.96
11 1.00 0.00| 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00

>>    static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>    {
>>    	int sg_cnt = req->sg_cnt;
>> @@ -185,7 +215,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>    	bio->bi_end_io = nvmet_bio_done;
>>    	bio->bi_opf = op;
>>
>> -	blk_start_plug(&plug);
>> +	if (!req->poll)
>> +		blk_start_plug(&plug);
>>    	for_each_sg(req->sg, sg, req->sg_cnt, i) {
>>    		while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
>>    				!= sg->length) {
>> @@ -204,8 +235,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>    		sg_cnt--;
>>    	}
>>
>> -	submit_bio(bio);
>> -	blk_finish_plug(&plug);
>> +	req->b.last_bio = bio;
>> +	if (req->poll)
>> +		req->b.last_bio->bi_opf |= REQ_HIPRI;
>> +
>> +	req->b.cookie = submit_bio(bio);
>> +
>> +	nvmet_req_prep_poll(req);
>> +	nvmet_req_issue_poll(req);
>> +	if (!req->poll)
>> +		blk_finish_plug(&plug);
>>    }
>>
>>    static void nvmet_bdev_execute_flush(struct nvmet_req *req)
>> @@ -330,15 +369,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
>>    	switch (cmd->common.opcode) {
>>    	case nvme_cmd_read:
>>    	case nvme_cmd_write:
>> +		req->poll = req->ns->poll;
>>    		req->execute = nvmet_bdev_execute_rw;
>>    		return 0;
>>    	case nvme_cmd_flush:
>> +		req->poll = false;
>
> should be done in some general place for req initialization.
>
>

This needs to be done here for better redability, since it
depends on the 2 factors with backend support and user setting.

>>    		req->execute = nvmet_bdev_execute_flush;
>>    		return 0;
>>    	case nvme_cmd_dsm:
>> +		req->poll = false;
>>    		req->execute = nvmet_bdev_execute_dsm;
>>    		return 0;
>>    	case nvme_cmd_write_zeroes:
>> +		req->poll = false;
>>    		req->execute = nvmet_bdev_execute_write_zeroes;
>>    		return 0;
>>    	default:
>> diff --git a/drivers/nvme/target/io-poll.c b/drivers/nvme/target/io-poll.c
>> new file mode 100644
>> index 000000000000..175c939c22ff
>> --- /dev/null
>> +++ b/drivers/nvme/target/io-poll.c
>> @@ -0,0 +1,165 @@
>> +#include <linux/blkdev.h>
>> +#include <linux/module.h>
>> +#include <linux/sched/signal.h>
>> +#include <linux/kthread.h>
>> +#include <uapi/linux/sched/types.h>
>> +
>> +#include "nvmet.h"
>> +#define THREAD_PER_CPU	(num_online_cpus() * 2)
>> +
>> +static int nvmet_poll_thread(void *data);
>> +
>> +int nvmet_ns_start_poll(struct nvmet_ns *ns)
>> +{
>> +	struct nvmet_poll_data *t;
>> +	int ret = 0;
>> +	int i;
>> +
>> +	t = kzalloc(sizeof(*t) * THREAD_PER_CPU, GFP_KERNEL);
>> +	if (!t) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	for (i = 0; i < THREAD_PER_CPU; i++) {
>> +		init_completion(&t[i].thread_complete);
>> +		init_waitqueue_head(&t[i].poll_waitq);
>> +		INIT_LIST_HEAD(&t[i].list);
>> +		INIT_LIST_HEAD(&t[i].done);
>> +		mutex_init(&t[i].list_lock);
>> +		t[i].id = i;
>> +
>> +		t[i].thread = kthread_create(nvmet_poll_thread, &t[i],
>> +					     "nvmet_poll_thread%s/%d",
>> +					     ns->device_path, i);
>> +
>> +		if (IS_ERR(t[i].thread)) {
>> +			ret = PTR_ERR(t[i].thread);
>> +			goto err;
>> +		}
>> +
>> +		kthread_bind(t[i].thread, i % num_online_cpus());
>> +		wake_up_process(t[i].thread);
>> +		wait_for_completion(&t[i].thread_complete);
>> +	}
>> +
>> +	ns->t = t;
>> +out:
>> +	return ret;
>> +err:
>> +	nvmet_ns_stop_poll(ns);
>> +	goto out;
>> +}
>> +
>> +void nvmet_ns_stop_poll(struct nvmet_ns *ns)
>> +{
>> +	struct nvmet_poll_data *t = ns->t;
>> +	int i;
>> +
>> +	for (i = 0; i < THREAD_PER_CPU; i++) {
>> +		if (!t[i].thread)
>> +			continue;
>> +
>> +		if (wq_has_sleeper(&t[i].poll_waitq))
>> +			wake_up(&t[i].poll_waitq);
>> +		kthread_park(t[i].thread);
>> +		kthread_stop(t[i].thread);
>> +	}
>> +}
>> +
>> +static void nvmet_poll(struct nvmet_poll_data *t)
>> +{
>> +	struct nvmet_req *req, *tmp;
>> +
>> +	lockdep_assert_held(&t->list_lock);
>> +
>> +	list_for_each_entry_safe(req, tmp, &t->list, poll_entry) {
>> +
>> +		if (completion_done(&req->wait)) {
>> +			list_move_tail(&req->poll_entry, &t->done);
>> +			continue;
>> +		}
>> +
>> +		if (!list_empty(&t->done))
>> +			break;
>> +
>> +		list_del(&req->poll_entry);
>> +		mutex_unlock(&t->list_lock);
>> +		nvmet_req_poll_complete(req);
>> +		mutex_lock(&t->list_lock);
>> +	}
>> +	mutex_unlock(&t->list_lock);
>> +	/*
>> +	 * In future we can also add batch timeout or nr request to complete.
>> +	 */
>> +	while (!list_empty(&t->done)) {
>> +		/*
>> +		 * We lock and unlock for t->list which gurantee progress of
>> +		 * nvmet_xxx_rw_execute() when under pressure while we complete
>> +		 * the request.
>> +		 */
>> +		req = list_first_entry(&t->done, struct nvmet_req, poll_entry);
>> +		list_del(&req->poll_entry);
>> +		nvmet_req_done(req);
>> +	}
>> +
>> +	mutex_lock(&t->list_lock);
>> +}
>> +
>> +static int nvmet_poll_thread(void *data)
>> +{
>> +	struct nvmet_poll_data *t = (struct nvmet_poll_data *) data;
>> +	DEFINE_WAIT(wait);
>> +
>> +	complete(&t->thread_complete);
>> +
>> +	while (!kthread_should_park()) {
>> +
>> +		mutex_lock(&t->list_lock);
>> +		while (!list_empty(&t->list) && !need_resched())
>> +			nvmet_poll(t);
>> +		mutex_unlock(&t->list_lock);
>> +
>> +		prepare_to_wait(&t->poll_waitq, &wait, TASK_INTERRUPTIBLE);
>> +		if (signal_pending(current))
>> +			flush_signals(current);
>> +		smp_mb();
>> +		schedule();
>> +
>> +		finish_wait(&t->poll_waitq, &wait);
>> +	}
>> +
>> +	kthread_parkme();
>> +	return 0;
>> +}
>> +
>> +inline void nvmet_req_prep_poll(struct nvmet_req *req)
>> +{
>> +	if (!req->poll)
>> +		return;
>> +
>> +	init_completion(&req->wait);
>> +	req->t = &req->ns->t[smp_processor_id()];
>> +}
>> +
>> +inline void nvmet_req_issue_poll(struct nvmet_req *req)
>> +{
>> +	if (!req->poll)
>> +		return;
>> +
>> +	while (!mutex_trylock(&req->t->list_lock)) {
>> +		if (req->t->id < num_online_cpus())
>> +			req->t = &req->ns->t[req->t->id + num_online_cpus()];
>> +		else
>> +			req->t = &req->ns->t[req->t->id - num_online_cpus()];
>> +	}
>> +
>> +	if (completion_done(&req->wait))
>> +		list_add(&req->poll_entry, &req->t->list);
>> +	else
>> +		list_add_tail(&req->poll_entry, &req->t->list);
>> +	mutex_unlock(&req->t->list_lock);
>> +
>> +	if (wq_has_sleeper(&req->t->poll_waitq))
>> +		wake_up(&req->t->poll_waitq);
>> +}
>> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
>> index 46df45e837c9..ef2919e23e0b 100644
>> --- a/drivers/nvme/target/nvmet.h
>> +++ b/drivers/nvme/target/nvmet.h
>> @@ -49,11 +49,22 @@
>>    #define IPO_IATTR_CONNECT_SQE(x)	\
>>    	(cpu_to_le32(offsetof(struct nvmf_connect_command, x)))
>>
>> +struct nvmet_poll_data {
>> +	struct completion		thread_complete;
>> +	wait_queue_head_t		poll_waitq;
>> +	struct mutex			list_lock;
>> +	struct task_struct		*thread;
>> +	struct list_head		list;
>> +	struct list_head		done;
>> +	unsigned int			id;
>> +};
>> +
>>    struct nvmet_ns {
>>    	struct list_head	dev_link;
>>    	struct percpu_ref	ref;
>>    	struct block_device	*bdev;
>>    	struct file		*file;
>> +	struct nvmet_poll_data	*t;
>>    	bool			readonly;
>>    	u32			nsid;
>>    	u32			blksize_shift;
>> @@ -63,6 +74,8 @@ struct nvmet_ns {
>>    	u32			anagrpid;
>>
>>    	bool			buffered_io;
>> +	bool			use_poll;
>> +	bool			poll;
>>    	bool			enabled;
>>    	struct nvmet_subsys	*subsys;
>>    	const char		*device_path;
>> @@ -292,9 +305,15 @@ struct nvmet_req {
>>    	struct nvmet_ns		*ns;
>>    	struct scatterlist	*sg;
>>    	struct bio_vec		inline_bvec[NVMET_MAX_INLINE_BIOVEC];
>> +	struct completion	wait;
>> +	bool			poll;
>> +	struct nvmet_poll_data  *t;
>> +	struct list_head	poll_entry;
>>    	union {
>>    		struct {
>> -			struct bio      inline_bio;
>> +			struct bio		inline_bio;
>> +			blk_qc_t		cookie;
>> +			struct bio		*last_bio;
>>    		} b;
>>    		struct {
>>    			bool			mpool_alloc;
>> @@ -442,6 +461,16 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys,
>>    void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>    		u8 event_info, u8 log_page);
>>
>> +int nvmet_ns_start_poll(struct nvmet_ns *ns);
>> +void nvmet_ns_stop_poll(struct nvmet_ns *ns);
>> +void nvmet_req_prep_poll(struct nvmet_req *req);
>> +void nvmet_req_issue_poll(struct nvmet_req *req);
>> +
>> +void nvmet_req_poll_complete(struct nvmet_req *req);
>> +void nvmet_bdev_poll_complete(struct nvmet_req *req);
>> +void nvmet_bdev_req_complete(struct nvmet_req *req);
>> +void nvmet_req_done(struct nvmet_req *req);
>> +
>>    #define NVMET_QUEUE_SIZE	1024
>>    #define NVMET_NR_QUEUES		128
>>    #define NVMET_MAX_CMD		NVMET_QUEUE_SIZE
>


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 1/2] nvmet: add bdev-ns polling support
  2020-01-21 19:22     ` Chaitanya Kulkarni
@ 2020-01-23 14:23       ` Max Gurtovoy
  2020-01-30 18:19         ` Chaitanya Kulkarni
  0 siblings, 1 reply; 12+ messages in thread
From: Max Gurtovoy @ 2020-01-23 14:23 UTC (permalink / raw)
  To: Chaitanya Kulkarni, sagi, hch; +Cc: linux-nvme


On 1/21/2020 9:22 PM, Chaitanya Kulkarni wrote:
> Please see my in-line comments.
> On 01/20/2020 04:52 AM, Max Gurtovoy wrote:
>> On 12/10/2019 8:25 AM, Chaitanya Kulkarni wrote:
>>> This patch adds support for the bdev-ns polling. We also add a new
>>> file to keep polling code separate (io-poll.c). The newly added
>>> configfs attribute allows user to enable/disable polling.
>>>
>>> We only enable polling for the READ/WRITE operation based on support
>>> from bdev and and use_poll configfs attr.
>>>
>>> We configure polling per namespace. For each namespace we create
>>> polling threads. For general approach please have a look at the
>>> cover-letter of this patch-series.
>> It would be great to get some numbers here to learn about the trade-off
>> for this approach.
>>
>>
> Already posted these numbers with QEMU nvme-loop here :-
> https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.infradead.org%2Fpipermail%2Flinux-nvme%2F2020-January%2F028692.html&amp;data=02%7C01%7Cmaxg%40mellanox.com%7C9308e8badf9f41ec4faf08d79ea733ae%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C637152313279111655&amp;sdata=RM3%2FcVeRcp8hgyAUFIM%2B7wbZ0cw2xvhdsfcVp9FuPm4%3D&amp;reserved=0
>
> IOPS/BW:-
>
> Default:-
> read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec)
> read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec)
> read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec)
> Poll:-
> read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec)
> read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec)
> read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec)
>
> mpstat:-
> Default:-
> CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
> ---------------------------------------------------------
> all 1.18 0.00 |60.14| 0.00 0.00 0.00 0.08 0.00 0.00 38.60
> ---------------------------------------------------------
> 0 0.00 0.00 | 32.00 |0.00 0.00 0.00 0.00 0.00 0.00 68.00
> 1 1.01 0.00 | 42.42 |0.00 0.00 0.00 0.00 0.00 0.00 56.57
> 2 1.01 0.00 | 57.58 |0.00 0.00 0.00 0.00 0.00 0.00 41.41
> 3 2.02 0.00 | 79.80 |0.00 0.00 0.00 0.00 0.00 0.00 18.18
> 4 1.01 0.00 | 63.64 |0.00 0.00 0.00 0.00 0.00 0.00 35.35
> 5 3.09 0.00 | 63.92 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
> 6 2.02 0.00 | 75.76 |0.00 0.00 0.00 0.00 0.00 0.00 22.22
> 7 1.02 0.00 | 57.14 |0.00 0.00 0.00 0.00 0.00 0.00 41.84
> 8 0.00 0.00 | 67.01 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
> 9 1.01 0.00 | 59.60 |0.00 0.00 0.00 0.00 0.00 0.00 39.39
> 10 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73
> 11 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73
>
> Poll:-
> CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
> ---------------------------------------------------------
> all 1.08 0.00 98.08 0.00 0.00 0.00 0.08 0.00 0.00 0.75
> ---------------------------------------------------------
> 0 2.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 1.00
> 1 0.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 3.00
> 2 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 3 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 4 1.01 0.00 | 98.99 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 5 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 6 0.99 0.00 | 99.01 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 7 1.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 2.00
> 8 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 9 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 10 1.98 0.00| 94.06 |0.00 0.00 0.00 0.00 0.00 0.00 3.96
> 11 1.00 0.00| 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00

with all the respect to nvmet-loop, it isn't a real use case, right ?

Can you re-run it using RDMA and TCP ?

what is the backing device for these tests ? NVMe/NULL_BLK ?


>>> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
>>> ---
>>>     drivers/nvme/target/Makefile      |   3 +-
>>>     drivers/nvme/target/configfs.c    |  29 ++++++
>>>     drivers/nvme/target/core.c        |  13 +++
>>>     drivers/nvme/target/io-cmd-bdev.c |  61 +++++++++--
>>>     drivers/nvme/target/io-poll.c     | 165 ++++++++++++++++++++++++++++++
>>>     drivers/nvme/target/nvmet.h       |  31 +++++-
>>>     6 files changed, 291 insertions(+), 11 deletions(-)
>>>     create mode 100644 drivers/nvme/target/io-poll.c
>>>
>>> diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
>>> index 2b33836f3d3e..8877ba16305c 100644
>>> --- a/drivers/nvme/target/Makefile
>>> +++ b/drivers/nvme/target/Makefile
>>> @@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP)	+= nvme-fcloop.o
>>>     obj-$(CONFIG_NVME_TARGET_TCP)		+= nvmet-tcp.o
>>>
>>>     nvmet-y		+= core.o configfs.o admin-cmd.o fabrics-cmd.o \
>>> -			discovery.o io-cmd-file.o io-cmd-bdev.o
>>> +			discovery.o io-cmd-file.o io-cmd-bdev.o \
>>> +			io-poll.o
>>>     nvme-loop-y	+= loop.o
>>>     nvmet-rdma-y	+= rdma.o
>>>     nvmet-fc-y	+= fc.o
>>> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
>>> index 98613a45bd3b..0e6e8b0dbf79 100644
>>> --- a/drivers/nvme/target/configfs.c
>>> +++ b/drivers/nvme/target/configfs.c
>>> @@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item,
>>>
>>>     CONFIGFS_ATTR(nvmet_ns_, buffered_io);
>>>
>>> +static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page)
>>> +{
>>> +	return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll);
>>> +}
>>> +
>>> +static ssize_t nvmet_ns_use_poll_store(struct config_item *item,
>>> +		const char *page, size_t count)
>>> +{
>>> +	struct nvmet_ns *ns = to_nvmet_ns(item);
>>> +	bool val;
>>> +
>>> +	if (strtobool(page, &val))
>>> +		return -EINVAL;
>>> +
>>> +	mutex_lock(&ns->subsys->lock);
>>> +	if (ns->enabled) {
>>> +		pr_err("disable ns before setting use_poll value.\n");
>>> +		mutex_unlock(&ns->subsys->lock);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	ns->use_poll = val;
>>> +	mutex_unlock(&ns->subsys->lock);
>>> +	return count;
>>> +}
>>> +
>>> +CONFIGFS_ATTR(nvmet_ns_, use_poll);
>>> +
>>>     static struct configfs_attribute *nvmet_ns_attrs[] = {
>>>     	&nvmet_ns_attr_device_path,
>>>     	&nvmet_ns_attr_device_nguid,
>>> @@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
>>>     	&nvmet_ns_attr_ana_grpid,
>>>     	&nvmet_ns_attr_enable,
>>>     	&nvmet_ns_attr_buffered_io,
>>> +	&nvmet_ns_attr_use_poll,
>>>     #ifdef CONFIG_PCI_P2PDMA
>>>     	&nvmet_ns_attr_p2pmem,
>>>     #endif
>>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
>>> index 28438b833c1b..d8f9130d1cd1 100644
>>> --- a/drivers/nvme/target/core.c
>>> +++ b/drivers/nvme/target/core.c
>>> @@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl,
>>>     		ns->nsid);
>>>     }
>>>
>>> +inline void nvmet_req_done(struct nvmet_req *req)
>>> +{
>>> +	if (req->ns->bdev)
>>> +		nvmet_bdev_req_complete(req);
>>> +}
>>> +
>>> +inline void nvmet_req_poll_complete(struct nvmet_req *req)
>>> +{
>>> +	if (req->ns->bdev)
>>> +		nvmet_bdev_poll_complete(req);
>>> +}
>>> +
>>>     int nvmet_ns_enable(struct nvmet_ns *ns)
>>>     {
>>>     	struct nvmet_subsys *subsys = ns->subsys;
>>> @@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid)
>>>
>>>     	uuid_gen(&ns->uuid);
>>>     	ns->buffered_io = false;
>>> +	ns->use_poll = false;
>>>
>>>     	return ns;
>>>     }
>>> diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
>>> index b6fca0e421ef..13507e0cbc1d 100644
>>> --- a/drivers/nvme/target/io-cmd-bdev.c
>>> +++ b/drivers/nvme/target/io-cmd-bdev.c
>>> @@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
>>>
>>>     int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>>>     {
>>> +	int fmode = FMODE_READ | FMODE_WRITE;
>>> +	struct request_queue *q;
>>>     	int ret;
>>>
>>> -	ns->bdev = blkdev_get_by_path(ns->device_path,
>>> -			FMODE_READ | FMODE_WRITE, NULL);
>>> +	ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL);
>>>     	if (IS_ERR(ns->bdev)) {
>>>     		ret = PTR_ERR(ns->bdev);
>>>     		if (ret != -ENOTBLK) {
>>> @@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>>>     					ns->device_path, PTR_ERR(ns->bdev));
>>>     		}
>>>     		ns->bdev = NULL;
>>> -		return ret;
>>> +		goto out;
>>>     	}
>>>     	ns->size = i_size_read(ns->bdev->bd_inode);
>>>     	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
>>> -	return 0;
>>> +	q = bdev_get_queue(ns->bdev);
>>> +	ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
>>> +	ret = ns->poll ? nvmet_ns_start_poll(ns) : 0;
>>> +out:
>>> +	return ret;
>>>     }
>>>
>>>     void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
>>>     {
>>>     	if (ns->bdev) {
>>> +		ns->poll ? nvmet_ns_stop_poll(ns) : 0;
>> can you please use an "if" statement instead of the above convention ?
>>
> This is just as RFC, will send out formal patch series with all the
> coding style fixes.
>
>>>     		blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ);
>>>     		ns->bdev = NULL;
>>>     	}
>>> @@ -133,15 +139,39 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts)
>>>     	return status;
>>>     }
>>>
>>> -static void nvmet_bio_done(struct bio *bio)
>>> +void nvmet_bdev_req_complete(struct nvmet_req *req)
>>>     {
>>> -	struct nvmet_req *req = bio->bi_private;
>>> +	struct bio *bio = req->b.last_bio;
>>>
>>>     	nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status));
>>>     	if (bio != &req->b.inline_bio)
>>>     		bio_put(bio);
>>>     }
>>>
>>> +static void nvmet_bio_done(struct bio *bio)
>>> +{
>>> +	struct nvmet_req *req = bio->bi_private;
>>> +
>>> +	req->b.last_bio = bio;
>>> +
>>> +	req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req);
>> Same here for the "if". Lets keep the code as readable as we can
>>
> Same as above.
>>> +}
>>> +
>>> +void nvmet_bdev_poll_complete(struct nvmet_req *req)
>>> +{
>>> +	struct request_queue *q = bdev_get_queue(req->ns->bdev);
>>> +
>>> +	while (!completion_done(&req->wait)) {
>>> +		int ret = blk_poll(q, req->b.cookie, true);
>>> +
>>> +		if (ret < 0) {
>>> +			pr_err("tid %d poll error %d", req->t->id, ret);
>>> +			break;
>>> +		}
>>> +	}
>>> +	nvmet_bdev_req_complete(req);
>>> +}
>>> +IOPS/BW:-
> Default:-
> read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec)
> read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec)
> read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec)
> Poll:-
> read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec)
> read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec)
> read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec)
>
> mpstat:-
> Default:-
> CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
> ---------------------------------------------------------
> all 1.18 0.00 |60.14| 0.00 0.00 0.00 0.08 0.00 0.00 38.60
> ---------------------------------------------------------
> 0 0.00 0.00 | 32.00 |0.00 0.00 0.00 0.00 0.00 0.00 68.00
> 1 1.01 0.00 | 42.42 |0.00 0.00 0.00 0.00 0.00 0.00 56.57
> 2 1.01 0.00 | 57.58 |0.00 0.00 0.00 0.00 0.00 0.00 41.41
> 3 2.02 0.00 | 79.80 |0.00 0.00 0.00 0.00 0.00 0.00 18.18
> 4 1.01 0.00 | 63.64 |0.00 0.00 0.00 0.00 0.00 0.00 35.35
> 5 3.09 0.00 | 63.92 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
> 6 2.02 0.00 | 75.76 |0.00 0.00 0.00 0.00 0.00 0.00 22.22
> 7 1.02 0.00 | 57.14 |0.00 0.00 0.00 0.00 0.00 0.00 41.84
> 8 0.00 0.00 | 67.01 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
> 9 1.01 0.00 | 59.60 |0.00 0.00 0.00 0.00 0.00 0.00 39.39
> 10 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73
> 11 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73
>
> Poll:-
> CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
> ---------------------------------------------------------
> all 1.08 0.00 98.08 0.00 0.00 0.00 0.08 0.00 0.00 0.75
> ---------------------------------------------------------
> 0 2.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 1.00
> 1 0.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 3.00
> 2 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 3 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 4 1.01 0.00 | 98.99 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 5 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 6 0.99 0.00 | 99.01 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 7 1.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 2.00
> 8 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 9 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 10 1.98 0.00| 94.06 |0.00 0.00 0.00 0.00 0.00 0.00 3.96
> 11 1.00 0.00| 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
>
>>>     static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>>     {
>>>     	int sg_cnt = req->sg_cnt;
>>> @@ -185,7 +215,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>>     	bio->bi_end_io = nvmet_bio_done;
>>>     	bio->bi_opf = op;
>>>
>>> -	blk_start_plug(&plug);
>>> +	if (!req->poll)
>>> +		blk_start_plug(&plug);
>>>     	for_each_sg(req->sg, sg, req->sg_cnt, i) {
>>>     		while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
>>>     				!= sg->length) {
>>> @@ -204,8 +235,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>>     		sg_cnt--;
>>>     	}
>>>
>>> -	submit_bio(bio);
>>> -	blk_finish_plug(&plug);
>>> +	req->b.last_bio = bio;
>>> +	if (req->poll)
>>> +		req->b.last_bio->bi_opf |= REQ_HIPRI;
>>> +
>>> +	req->b.cookie = submit_bio(bio);
>>> +
>>> +	nvmet_req_prep_poll(req);
>>> +	nvmet_req_issue_poll(req);
>>> +	if (!req->poll)
>>> +		blk_finish_plug(&plug);
>>>     }
>>>
>>>     static void nvmet_bdev_execute_flush(struct nvmet_req *req)
>>> @@ -330,15 +369,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
>>>     	switch (cmd->common.opcode) {
>>>     	case nvme_cmd_read:
>>>     	case nvme_cmd_write:
>>> +		req->poll = req->ns->poll;
>>>     		req->execute = nvmet_bdev_execute_rw;
>>>     		return 0;
>>>     	case nvme_cmd_flush:
>>> +		req->poll = false;
>> should be done in some general place for req initialization.
>>
>>
> This needs to be done here for better redability, since it
> depends on the 2 factors with backend support and user setting.
>
>>>     		req->execute = nvmet_bdev_execute_flush;
>>>     		return 0;
>>>     	case nvme_cmd_dsm:
>>> +		req->poll = false;
>>>     		req->execute = nvmet_bdev_execute_dsm;
>>>     		return 0;
>>>     	case nvme_cmd_write_zeroes:
>>> +		req->poll = false;
>>>     		req->execute = nvmet_bdev_execute_write_zeroes;
>>>     		return 0;
>>>     	default:
>>> diff --git a/drivers/nvme/target/io-poll.c b/drivers/nvme/target/io-poll.c
>>> new file mode 100644
>>> index 000000000000..175c939c22ff
>>> --- /dev/null
>>> +++ b/drivers/nvme/target/io-poll.c
>>> @@ -0,0 +1,165 @@
>>> +#include <linux/blkdev.h>
>>> +#include <linux/module.h>
>>> +#include <linux/sched/signal.h>
>>> +#include <linux/kthread.h>
>>> +#include <uapi/linux/sched/types.h>
>>> +
>>> +#include "nvmet.h"
>>> +#define THREAD_PER_CPU	(num_online_cpus() * 2)
>>> +
>>> +static int nvmet_poll_thread(void *data);
>>> +
>>> +int nvmet_ns_start_poll(struct nvmet_ns *ns)
>>> +{
>>> +	struct nvmet_poll_data *t;
>>> +	int ret = 0;
>>> +	int i;
>>> +
>>> +	t = kzalloc(sizeof(*t) * THREAD_PER_CPU, GFP_KERNEL);
>>> +	if (!t) {
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>> +	}
>>> +
>>> +	for (i = 0; i < THREAD_PER_CPU; i++) {
>>> +		init_completion(&t[i].thread_complete);
>>> +		init_waitqueue_head(&t[i].poll_waitq);
>>> +		INIT_LIST_HEAD(&t[i].list);
>>> +		INIT_LIST_HEAD(&t[i].done);
>>> +		mutex_init(&t[i].list_lock);
>>> +		t[i].id = i;
>>> +
>>> +		t[i].thread = kthread_create(nvmet_poll_thread, &t[i],
>>> +					     "nvmet_poll_thread%s/%d",
>>> +					     ns->device_path, i);
>>> +
>>> +		if (IS_ERR(t[i].thread)) {
>>> +			ret = PTR_ERR(t[i].thread);
>>> +			goto err;
>>> +		}
>>> +
>>> +		kthread_bind(t[i].thread, i % num_online_cpus());
>>> +		wake_up_process(t[i].thread);
>>> +		wait_for_completion(&t[i].thread_complete);
>>> +	}
>>> +
>>> +	ns->t = t;
>>> +out:
>>> +	return ret;
>>> +err:
>>> +	nvmet_ns_stop_poll(ns);
>>> +	goto out;
>>> +}
>>> +
>>> +void nvmet_ns_stop_poll(struct nvmet_ns *ns)
>>> +{
>>> +	struct nvmet_poll_data *t = ns->t;
>>> +	int i;
>>> +
>>> +	for (i = 0; i < THREAD_PER_CPU; i++) {
>>> +		if (!t[i].thread)
>>> +			continue;
>>> +
>>> +		if (wq_has_sleeper(&t[i].poll_waitq))
>>> +			wake_up(&t[i].poll_waitq);
>>> +		kthread_park(t[i].thread);
>>> +		kthread_stop(t[i].thread);
>>> +	}
>>> +}
>>> +
>>> +static void nvmet_poll(struct nvmet_poll_data *t)
>>> +{
>>> +	struct nvmet_req *req, *tmp;
>>> +
>>> +	lockdep_assert_held(&t->list_lock);
>>> +
>>> +	list_for_each_entry_safe(req, tmp, &t->list, poll_entry) {
>>> +
>>> +		if (completion_done(&req->wait)) {
>>> +			list_move_tail(&req->poll_entry, &t->done);
>>> +			continue;
>>> +		}
>>> +
>>> +		if (!list_empty(&t->done))
>>> +			break;
>>> +
>>> +		list_del(&req->poll_entry);
>>> +		mutex_unlock(&t->list_lock);
>>> +		nvmet_req_poll_complete(req);
>>> +		mutex_lock(&t->list_lock);
>>> +	}
>>> +	mutex_unlock(&t->list_lock);
>>> +	/*
>>> +	 * In future we can also add batch timeout or nr request to complete.
>>> +	 */
>>> +	while (!list_empty(&t->done)) {
>>> +		/*
>>> +		 * We lock and unlock for t->list which gurantee progress of
>>> +		 * nvmet_xxx_rw_execute() when under pressure while we complete
>>> +		 * the request.
>>> +		 */
>>> +		req = list_first_entry(&t->done, struct nvmet_req, poll_entry);
>>> +		list_del(&req->poll_entry);
>>> +		nvmet_req_done(req);
>>> +	}
>>> +
>>> +	mutex_lock(&t->list_lock);
>>> +}
>>> +
>>> +static int nvmet_poll_thread(void *data)
>>> +{
>>> +	struct nvmet_poll_data *t = (struct nvmet_poll_data *) data;
>>> +	DEFINE_WAIT(wait);
>>> +
>>> +	complete(&t->thread_complete);
>>> +
>>> +	while (!kthread_should_park()) {
>>> +
>>> +		mutex_lock(&t->list_lock);
>>> +		while (!list_empty(&t->list) && !need_resched())
>>> +			nvmet_poll(t);
>>> +		mutex_unlock(&t->list_lock);
>>> +
>>> +		prepare_to_wait(&t->poll_waitq, &wait, TASK_INTERRUPTIBLE);
>>> +		if (signal_pending(current))
>>> +			flush_signals(current);
>>> +		smp_mb();
>>> +		schedule();
>>> +
>>> +		finish_wait(&t->poll_waitq, &wait);
>>> +	}
>>> +
>>> +	kthread_parkme();
>>> +	return 0;
>>> +}
>>> +
>>> +inline void nvmet_req_prep_poll(struct nvmet_req *req)
>>> +{
>>> +	if (!req->poll)
>>> +		return;
>>> +
>>> +	init_completion(&req->wait);
>>> +	req->t = &req->ns->t[smp_processor_id()];
>>> +}
>>> +
>>> +inline void nvmet_req_issue_poll(struct nvmet_req *req)
>>> +{
>>> +	if (!req->poll)
>>> +		return;
>>> +
>>> +	while (!mutex_trylock(&req->t->list_lock)) {
>>> +		if (req->t->id < num_online_cpus())
>>> +			req->t = &req->ns->t[req->t->id + num_online_cpus()];
>>> +		else
>>> +			req->t = &req->ns->t[req->t->id - num_online_cpus()];
>>> +	}
>>> +
>>> +	if (completion_done(&req->wait))
>>> +		list_add(&req->poll_entry, &req->t->list);
>>> +	else
>>> +		list_add_tail(&req->poll_entry, &req->t->list);
>>> +	mutex_unlock(&req->t->list_lock);
>>> +
>>> +	if (wq_has_sleeper(&req->t->poll_waitq))
>>> +		wake_up(&req->t->poll_waitq);
>>> +}
>>> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
>>> index 46df45e837c9..ef2919e23e0b 100644
>>> --- a/drivers/nvme/target/nvmet.h
>>> +++ b/drivers/nvme/target/nvmet.h
>>> @@ -49,11 +49,22 @@
>>>     #define IPO_IATTR_CONNECT_SQE(x)	\
>>>     	(cpu_to_le32(offsetof(struct nvmf_connect_command, x)))
>>>
>>> +struct nvmet_poll_data {
>>> +	struct completion		thread_complete;
>>> +	wait_queue_head_t		poll_waitq;
>>> +	struct mutex			list_lock;
>>> +	struct task_struct		*thread;
>>> +	struct list_head		list;
>>> +	struct list_head		done;
>>> +	unsigned int			id;
>>> +};
>>> +
>>>     struct nvmet_ns {
>>>     	struct list_head	dev_link;
>>>     	struct percpu_ref	ref;
>>>     	struct block_device	*bdev;
>>>     	struct file		*file;
>>> +	struct nvmet_poll_data	*t;
>>>     	bool			readonly;
>>>     	u32			nsid;
>>>     	u32			blksize_shift;
>>> @@ -63,6 +74,8 @@ struct nvmet_ns {
>>>     	u32			anagrpid;
>>>
>>>     	bool			buffered_io;
>>> +	bool			use_poll;
>>> +	bool			poll;
>>>     	bool			enabled;
>>>     	struct nvmet_subsys	*subsys;
>>>     	const char		*device_path;
>>> @@ -292,9 +305,15 @@ struct nvmet_req {
>>>     	struct nvmet_ns		*ns;
>>>     	struct scatterlist	*sg;
>>>     	struct bio_vec		inline_bvec[NVMET_MAX_INLINE_BIOVEC];
>>> +	struct completion	wait;
>>> +	bool			poll;
>>> +	struct nvmet_poll_data  *t;
>>> +	struct list_head	poll_entry;
>>>     	union {
>>>     		struct {
>>> -			struct bio      inline_bio;
>>> +			struct bio		inline_bio;
>>> +			blk_qc_t		cookie;
>>> +			struct bio		*last_bio;
>>>     		} b;
>>>     		struct {
>>>     			bool			mpool_alloc;
>>> @@ -442,6 +461,16 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys,
>>>     void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>>     		u8 event_info, u8 log_page);
>>>
>>> +int nvmet_ns_start_poll(struct nvmet_ns *ns);
>>> +void nvmet_ns_stop_poll(struct nvmet_ns *ns);
>>> +void nvmet_req_prep_poll(struct nvmet_req *req);
>>> +void nvmet_req_issue_poll(struct nvmet_req *req);
>>> +
>>> +void nvmet_req_poll_complete(struct nvmet_req *req);
>>> +void nvmet_bdev_poll_complete(struct nvmet_req *req);
>>> +void nvmet_bdev_req_complete(struct nvmet_req *req);
>>> +void nvmet_req_done(struct nvmet_req *req);
>>> +
>>>     #define NVMET_QUEUE_SIZE	1024
>>>     #define NVMET_NR_QUEUES		128
>>>     #define NVMET_MAX_CMD		NVMET_QUEUE_SIZE

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 1/2] nvmet: add bdev-ns polling support
  2020-01-23 14:23       ` Max Gurtovoy
@ 2020-01-30 18:19         ` Chaitanya Kulkarni
  0 siblings, 0 replies; 12+ messages in thread
From: Chaitanya Kulkarni @ 2020-01-30 18:19 UTC (permalink / raw)
  To: Max Gurtovoy, sagi, hch; +Cc: linux-nvme

On 01/23/2020 06:23 AM, Max Gurtovoy wrote:
> On 1/21/2020 9:22 PM, Chaitanya Kulkarni wrote:
>> >Please see my in-line comments.
>> >On 01/20/2020 04:52 AM, Max Gurtovoy wrote:
>>> >>On 12/10/2019 8:25 AM, Chaitanya Kulkarni wrote:
>>>> >>>This patch adds support for the bdev-ns polling. We also add a new
>>>> >>>file to keep polling code separate (io-poll.c). The newly added
>>>> >>>configfs attribute allows user to enable/disable polling.
>>>> >>>
>>>> >>>We only enable polling for the READ/WRITE operation based on support
>>>> >>>from bdev and and use_poll configfs attr.
>>>> >>>
>>>> >>>We configure polling per namespace. For each namespace we create
>>>> >>>polling threads. For general approach please have a look at the
>>>> >>>cover-letter of this patch-series.
>>> >>It would be great to get some numbers here to learn about the trade-off
>>> >>for this approach.
>>> >>
>>> >>
>> >Already posted these numbers with QEMU nvme-loop here :-
>> >https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.infradead.org%2Fpipermail%2Flinux-nvme%2F2020-January%2F028692.html&amp;data=02%7C01%7Cmaxg%40mellanox.com%7C9308e8badf9f41ec4faf08d79ea733ae%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C637152313279111655&amp;sdata=RM3%2FcVeRcp8hgyAUFIM%2B7wbZ0cw2xvhdsfcVp9FuPm4%3D&amp;reserved=0
>> >
>> >IOPS/BW:-
>> >
>> >Default:-
>> >read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec)
>> >read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec)
>> >read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec)
>> >Poll:-
>> >read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec)
>> >read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec)
>> >read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec)
>> >
>> >mpstat:-
>> >Default:-
>> >CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
>> >---------------------------------------------------------
>> >all 1.18 0.00 |60.14| 0.00 0.00 0.00 0.08 0.00 0.00 38.60
>> >---------------------------------------------------------
>> >0 0.00 0.00 | 32.00 |0.00 0.00 0.00 0.00 0.00 0.00 68.00
>> >1 1.01 0.00 | 42.42 |0.00 0.00 0.00 0.00 0.00 0.00 56.57
>> >2 1.01 0.00 | 57.58 |0.00 0.00 0.00 0.00 0.00 0.00 41.41
>> >3 2.02 0.00 | 79.80 |0.00 0.00 0.00 0.00 0.00 0.00 18.18
>> >4 1.01 0.00 | 63.64 |0.00 0.00 0.00 0.00 0.00 0.00 35.35
>> >5 3.09 0.00 | 63.92 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
>> >6 2.02 0.00 | 75.76 |0.00 0.00 0.00 0.00 0.00 0.00 22.22
>> >7 1.02 0.00 | 57.14 |0.00 0.00 0.00 0.00 0.00 0.00 41.84
>> >8 0.00 0.00 | 67.01 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
>> >9 1.01 0.00 | 59.60 |0.00 0.00 0.00 0.00 0.00 0.00 39.39
>> >10 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73
>> >11 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73
>> >
>> >Poll:-
>> >CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
>> >---------------------------------------------------------
>> >all 1.08 0.00 98.08 0.00 0.00 0.00 0.08 0.00 0.00 0.75
>> >---------------------------------------------------------
>> >0 2.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 1.00
>> >1 0.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 3.00
>> >2 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
>> >3 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
>> >4 1.01 0.00 | 98.99 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
>> >5 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
>> >6 0.99 0.00 | 99.01 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
>> >7 1.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 2.00
>> >8 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
>> >9 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
>> >10 1.98 0.00| 94.06 |0.00 0.00 0.00 0.00 0.00 0.00 3.96
>> >11 1.00 0.00| 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> with all the respect to nvmet-loop, it isn't a real use case, right ?
>
> Can you re-run it using RDMA and TCP ?
>
> what is the backing device for these tests ? NVMe/NULL_BLK ?
>
>
I don't think null_blk supports polling. This is NVMe QEMU Device.

Don't have the RDMA setup, I'll create one and report numbers.






_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, back to index

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-10  6:25 [RFC PATCH 0/2] nvmet: add polling support Chaitanya Kulkarni
2019-12-10  6:25 ` [RFC PATCH 1/2] nvmet: add bdev-ns " Chaitanya Kulkarni
2020-01-20 12:52   ` Max Gurtovoy
2020-01-21 19:22     ` Chaitanya Kulkarni
2020-01-23 14:23       ` Max Gurtovoy
2020-01-30 18:19         ` Chaitanya Kulkarni
2019-12-10  6:25 ` [RFC PATCH 2/2] nvmet: add file-ns " Chaitanya Kulkarni
2019-12-12  1:01 ` [RFC PATCH 0/2] nvmet: add " Sagi Grimberg
2019-12-12  5:44   ` Chaitanya Kulkarni
2019-12-12 20:32     ` Sagi Grimberg
2020-01-20  5:13       ` Chaitanya Kulkarni
2020-01-20  4:48   ` Chaitanya Kulkarni

Linux-NVME Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-nvme/0 linux-nvme/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-nvme linux-nvme/ https://lore.kernel.org/linux-nvme \
		linux-nvme@lists.infradead.org
	public-inbox-index linux-nvme

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.infradead.lists.linux-nvme


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git