From: Max Gurtovoy <maxg@mellanox.com>
To: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>,
	"sagi@grimberg.me" <sagi@grimberg.me>, "hch@lst.de" <hch@lst.de>
Cc: "linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Subject: Re: [RFC PATCH 1/2] nvmet: add bdev-ns polling support
Date: Thu, 23 Jan 2020 16:23:30 +0200
Message-ID: <03494a32-b31a-432b-870d-7347f5cc9596@mellanox.com>
In-Reply-To: <DM6PR04MB575431D7764F7D4BB98AFA5B860D0@DM6PR04MB5754.namprd04.prod.outlook.com>


On 1/21/2020 9:22 PM, Chaitanya Kulkarni wrote:
> Please see my in-line comments.
> On 01/20/2020 04:52 AM, Max Gurtovoy wrote:
>> On 12/10/2019 8:25 AM, Chaitanya Kulkarni wrote:
>>> This patch adds support for bdev-ns polling. We also add a new
>>> file to keep the polling code separate (io-poll.c). The newly added
>>> configfs attribute allows the user to enable/disable polling.
>>>
>>> We only enable polling for READ/WRITE operations, based on support
>>> from the bdev and the use_poll configfs attribute.
>>>
>>> We configure polling per namespace. For each namespace we create
>>> polling threads. For the general approach, please have a look at the
>>> cover-letter of this patch-series.
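
For reference, the new attribute would slot into the standard nvmet
configfs flow roughly as follows (untested sketch; the subsystem name,
namespace id and backing device below are made up):

    cd /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
    echo -n /dev/nvme0n1 > device_path
    echo 1 > use_poll     # must be set while the ns is disabled
    echo 1 > enable       # poll threads are created at ns enable time
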
>> It would be great to get some numbers here to learn about the trade-off
>> for this approach.
>>
>>
> I already posted these numbers with QEMU nvme-loop here:
> http://lists.infradead.org/pipermail/linux-nvme/2020-January/028692.html
>
> IOPS/BW:-
>
> Default:-
> read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec)
> read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec)
> read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec)
> Poll:-
> read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec)
> read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec)
> read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec)
>
> mpstat:-
> Default:-
> CPU  %usr  %nice  |%sys | %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
> ------------------------------------------------------------------------------
> all  1.18   0.00  |60.14|    0.00  0.00   0.00    0.08    0.00    0.00  38.60
> ------------------------------------------------------------------------------
>   0  0.00   0.00  |32.00|    0.00  0.00   0.00    0.00    0.00    0.00  68.00
>   1  1.01   0.00  |42.42|    0.00  0.00   0.00    0.00    0.00    0.00  56.57
>   2  1.01   0.00  |57.58|    0.00  0.00   0.00    0.00    0.00    0.00  41.41
>   3  2.02   0.00  |79.80|    0.00  0.00   0.00    0.00    0.00    0.00  18.18
>   4  1.01   0.00  |63.64|    0.00  0.00   0.00    0.00    0.00    0.00  35.35
>   5  3.09   0.00  |63.92|    0.00  0.00   0.00    0.00    0.00    0.00  32.99
>   6  2.02   0.00  |75.76|    0.00  0.00   0.00    0.00    0.00    0.00  22.22
>   7  1.02   0.00  |57.14|    0.00  0.00   0.00    0.00    0.00    0.00  41.84
>   8  0.00   0.00  |67.01|    0.00  0.00   0.00    0.00    0.00    0.00  32.99
>   9  1.01   0.00  |59.60|    0.00  0.00   0.00    0.00    0.00    0.00  39.39
>  10  1.02   0.00  |62.24|    0.00  0.00   0.00    0.00    0.00    0.00  36.73
>  11  1.02   0.00  |62.24|    0.00  0.00   0.00    0.00    0.00    0.00  36.73
>
> Poll:-
> CPU  %usr  %nice  |%sys | %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
> ------------------------------------------------------------------------------
> all  1.08   0.00  |98.08|    0.00  0.00   0.00    0.08    0.00    0.00   0.75
> ------------------------------------------------------------------------------
>   0  2.00   0.00  |97.00|    0.00  0.00   0.00    0.00    0.00    0.00   1.00
>   1  0.00   0.00  |97.00|    0.00  0.00   0.00    0.00    0.00    0.00   3.00
>   2  1.00   0.00  |99.00|    0.00  0.00   0.00    0.00    0.00    0.00   0.00
>   3  1.00   0.00  |99.00|    0.00  0.00   0.00    0.00    0.00    0.00   0.00
>   4  1.01   0.00  |98.99|    0.00  0.00   0.00    0.00    0.00    0.00   0.00
>   5  1.00   0.00  |99.00|    0.00  0.00   0.00    0.00    0.00    0.00   0.00
>   6  0.99   0.00  |99.01|    0.00  0.00   0.00    0.00    0.00    0.00   0.00
>   7  1.00   0.00  |97.00|    0.00  0.00   0.00    0.00    0.00    0.00   2.00
>   8  1.00   0.00  |99.00|    0.00  0.00   0.00    0.00    0.00    0.00   0.00
>   9  1.00   0.00  |99.00|    0.00  0.00   0.00    0.00    0.00    0.00   0.00
>  10  1.98   0.00  |94.06|    0.00  0.00   0.00    0.00    0.00    0.00   3.96
>  11  1.00   0.00  |99.00|    0.00  0.00   0.00    0.00    0.00    0.00   0.00

With all due respect to nvmet-loop, it isn't a real use case, right?

Can you re-run it using RDMA and TCP?

What is the backing device for these tests? NVMe or null_blk?


>>> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
>>> ---
>>>     drivers/nvme/target/Makefile      |   3 +-
>>>     drivers/nvme/target/configfs.c    |  29 ++++++
>>>     drivers/nvme/target/core.c        |  13 +++
>>>     drivers/nvme/target/io-cmd-bdev.c |  61 +++++++++--
>>>     drivers/nvme/target/io-poll.c     | 165 ++++++++++++++++++++++++++++++
>>>     drivers/nvme/target/nvmet.h       |  31 +++++-
>>>     6 files changed, 291 insertions(+), 11 deletions(-)
>>>     create mode 100644 drivers/nvme/target/io-poll.c
>>>
>>> diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
>>> index 2b33836f3d3e..8877ba16305c 100644
>>> --- a/drivers/nvme/target/Makefile
>>> +++ b/drivers/nvme/target/Makefile
>>> @@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP)	+= nvme-fcloop.o
>>>     obj-$(CONFIG_NVME_TARGET_TCP)		+= nvmet-tcp.o
>>>
>>>     nvmet-y		+= core.o configfs.o admin-cmd.o fabrics-cmd.o \
>>> -			discovery.o io-cmd-file.o io-cmd-bdev.o
>>> +			discovery.o io-cmd-file.o io-cmd-bdev.o \
>>> +			io-poll.o
>>>     nvme-loop-y	+= loop.o
>>>     nvmet-rdma-y	+= rdma.o
>>>     nvmet-fc-y	+= fc.o
>>> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
>>> index 98613a45bd3b..0e6e8b0dbf79 100644
>>> --- a/drivers/nvme/target/configfs.c
>>> +++ b/drivers/nvme/target/configfs.c
>>> @@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item,
>>>
>>>     CONFIGFS_ATTR(nvmet_ns_, buffered_io);
>>>
>>> +static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page)
>>> +{
>>> +	return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll);
>>> +}
>>> +
>>> +static ssize_t nvmet_ns_use_poll_store(struct config_item *item,
>>> +		const char *page, size_t count)
>>> +{
>>> +	struct nvmet_ns *ns = to_nvmet_ns(item);
>>> +	bool val;
>>> +
>>> +	if (strtobool(page, &val))
>>> +		return -EINVAL;
>>> +
>>> +	mutex_lock(&ns->subsys->lock);
>>> +	if (ns->enabled) {
>>> +		pr_err("disable ns before setting use_poll value.\n");
>>> +		mutex_unlock(&ns->subsys->lock);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	ns->use_poll = val;
>>> +	mutex_unlock(&ns->subsys->lock);
>>> +	return count;
>>> +}
>>> +
>>> +CONFIGFS_ATTR(nvmet_ns_, use_poll);
>>> +
>>>     static struct configfs_attribute *nvmet_ns_attrs[] = {
>>>     	&nvmet_ns_attr_device_path,
>>>     	&nvmet_ns_attr_device_nguid,
>>> @@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
>>>     	&nvmet_ns_attr_ana_grpid,
>>>     	&nvmet_ns_attr_enable,
>>>     	&nvmet_ns_attr_buffered_io,
>>> +	&nvmet_ns_attr_use_poll,
>>>     #ifdef CONFIG_PCI_P2PDMA
>>>     	&nvmet_ns_attr_p2pmem,
>>>     #endif
>>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
>>> index 28438b833c1b..d8f9130d1cd1 100644
>>> --- a/drivers/nvme/target/core.c
>>> +++ b/drivers/nvme/target/core.c
>>> @@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl,
>>>     		ns->nsid);
>>>     }
>>>
>>> +inline void nvmet_req_done(struct nvmet_req *req)
>>> +{
>>> +	if (req->ns->bdev)
>>> +		nvmet_bdev_req_complete(req);
>>> +}
>>> +
>>> +inline void nvmet_req_poll_complete(struct nvmet_req *req)
>>> +{
>>> +	if (req->ns->bdev)
>>> +		nvmet_bdev_poll_complete(req);
>>> +}
>>> +
>>>     int nvmet_ns_enable(struct nvmet_ns *ns)
>>>     {
>>>     	struct nvmet_subsys *subsys = ns->subsys;
>>> @@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid)
>>>
>>>     	uuid_gen(&ns->uuid);
>>>     	ns->buffered_io = false;
>>> +	ns->use_poll = false;
>>>
>>>     	return ns;
>>>     }
>>> diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
>>> index b6fca0e421ef..13507e0cbc1d 100644
>>> --- a/drivers/nvme/target/io-cmd-bdev.c
>>> +++ b/drivers/nvme/target/io-cmd-bdev.c
>>> @@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
>>>
>>>     int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>>>     {
>>> +	int fmode = FMODE_READ | FMODE_WRITE;
>>> +	struct request_queue *q;
>>>     	int ret;
>>>
>>> -	ns->bdev = blkdev_get_by_path(ns->device_path,
>>> -			FMODE_READ | FMODE_WRITE, NULL);
>>> +	ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL);
>>>     	if (IS_ERR(ns->bdev)) {
>>>     		ret = PTR_ERR(ns->bdev);
>>>     		if (ret != -ENOTBLK) {
>>> @@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>>>     					ns->device_path, PTR_ERR(ns->bdev));
>>>     		}
>>>     		ns->bdev = NULL;
>>> -		return ret;
>>> +		goto out;
>>>     	}
>>>     	ns->size = i_size_read(ns->bdev->bd_inode);
>>>     	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
>>> -	return 0;
>>> +	q = bdev_get_queue(ns->bdev);
>>> +	ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
>>> +	ret = ns->poll ? nvmet_ns_start_poll(ns) : 0;
>>> +out:
>>> +	return ret;
>>>     }
>>>
>>>     void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
>>>     {
>>>     	if (ns->bdev) {
>>> +		ns->poll ? nvmet_ns_stop_poll(ns) : 0;
>> Can you please use an "if" statement instead of the above convention?
>>
> This is just an RFC; I will send out the formal patch series with all
> the coding-style fixes.
>
>>>     		blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ);
>>>     		ns->bdev = NULL;
>>>     	}
>>> @@ -133,15 +139,39 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts)
>>>     	return status;
>>>     }
>>>
>>> -static void nvmet_bio_done(struct bio *bio)
>>> +void nvmet_bdev_req_complete(struct nvmet_req *req)
>>>     {
>>> -	struct nvmet_req *req = bio->bi_private;
>>> +	struct bio *bio = req->b.last_bio;
>>>
>>>     	nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status));
>>>     	if (bio != &req->b.inline_bio)
>>>     		bio_put(bio);
>>>     }
>>>
>>> +static void nvmet_bio_done(struct bio *bio)
>>> +{
>>> +	struct nvmet_req *req = bio->bi_private;
>>> +
>>> +	req->b.last_bio = bio;
>>> +
>>> +	req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req);
>> Same here for the "if". Let's keep the code as readable as we can.
>>
> Same as above.
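
For reference, the if-based form being asked for in both spots would look
roughly like this (untested sketch):

	/* in nvmet_bdev_ns_disable(): */
	if (ns->poll)
		nvmet_ns_stop_poll(ns);

	/* in nvmet_bio_done(): */
	if (req->poll)
		complete(&req->wait);
	else
		nvmet_bdev_req_complete(req);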
>>> +}
>>> +
>>> +void nvmet_bdev_poll_complete(struct nvmet_req *req)
>>> +{
>>> +	struct request_queue *q = bdev_get_queue(req->ns->bdev);
>>> +
>>> +	while (!completion_done(&req->wait)) {
>>> +		int ret = blk_poll(q, req->b.cookie, true);
>>> +
>>> +		if (ret < 0) {
>>> +			pr_err("tid %d poll error %d\n", req->t->id, ret);
>>> +			break;
>>> +		}
>>> +	}
>>> +	nvmet_bdev_req_complete(req);
>>> +}
>>> +
>>>     static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>>     {
>>>     	int sg_cnt = req->sg_cnt;
>>> @@ -185,7 +215,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>>     	bio->bi_end_io = nvmet_bio_done;
>>>     	bio->bi_opf = op;
>>>
>>> -	blk_start_plug(&plug);
>>> +	if (!req->poll)
>>> +		blk_start_plug(&plug);
>>>     	for_each_sg(req->sg, sg, req->sg_cnt, i) {
>>>     		while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
>>>     				!= sg->length) {
>>> @@ -204,8 +235,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>>     		sg_cnt--;
>>>     	}
>>>
>>> -	submit_bio(bio);
>>> -	blk_finish_plug(&plug);
>>> +	req->b.last_bio = bio;
>>> +	if (req->poll)
>>> +		req->b.last_bio->bi_opf |= REQ_HIPRI;
>>> +
>>> +	req->b.cookie = submit_bio(bio);
>>> +
>>> +	nvmet_req_prep_poll(req);
>>> +	nvmet_req_issue_poll(req);
>>> +	if (!req->poll)
>>> +		blk_finish_plug(&plug);
>>>     }
>>>
>>>     static void nvmet_bdev_execute_flush(struct nvmet_req *req)
>>> @@ -330,15 +369,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
>>>     	switch (cmd->common.opcode) {
>>>     	case nvme_cmd_read:
>>>     	case nvme_cmd_write:
>>> +		req->poll = req->ns->poll;
>>>     		req->execute = nvmet_bdev_execute_rw;
>>>     		return 0;
>>>     	case nvme_cmd_flush:
>>> +		req->poll = false;
>> This should be done in some general place for req initialization.
>>
>>
> This needs to be done here for better readability, since it depends on
> two factors: backend support and the user setting.
>
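To illustrate the suggestion: if req->poll were defaulted to false once in
the generic request setup (e.g. in nvmet_req_init() in core.c), only the
READ/WRITE case would need to touch it here, roughly (untested sketch):

	/* in the generic init path, e.g. nvmet_req_init(): */
	req->poll = false;	/* polling is opt-in per opcode */

	/* nvmet_bdev_parse_io_cmd() then only sets it where supported: */
	case nvme_cmd_read:
	case nvme_cmd_write:
		req->poll = req->ns->poll;
		req->execute = nvmet_bdev_execute_rw;
		return 0;
	case nvme_cmd_flush:
		req->execute = nvmet_bdev_execute_flush;
		return 0;
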
>>>     		req->execute = nvmet_bdev_execute_flush;
>>>     		return 0;
>>>     	case nvme_cmd_dsm:
>>> +		req->poll = false;
>>>     		req->execute = nvmet_bdev_execute_dsm;
>>>     		return 0;
>>>     	case nvme_cmd_write_zeroes:
>>> +		req->poll = false;
>>>     		req->execute = nvmet_bdev_execute_write_zeroes;
>>>     		return 0;
>>>     	default:
>>> diff --git a/drivers/nvme/target/io-poll.c b/drivers/nvme/target/io-poll.c
>>> new file mode 100644
>>> index 000000000000..175c939c22ff
>>> --- /dev/null
>>> +++ b/drivers/nvme/target/io-poll.c
>>> @@ -0,0 +1,165 @@
>>> +#include <linux/blkdev.h>
>>> +#include <linux/module.h>
>>> +#include <linux/sched/signal.h>
>>> +#include <linux/kthread.h>
>>> +#include <uapi/linux/sched/types.h>
>>> +
>>> +#include "nvmet.h"
>>> +#define THREAD_PER_CPU	(num_online_cpus() * 2)
>>> +
>>> +static int nvmet_poll_thread(void *data);
>>> +
>>> +int nvmet_ns_start_poll(struct nvmet_ns *ns)
>>> +{
>>> +	struct nvmet_poll_data *t;
>>> +	int ret = 0;
>>> +	int i;
>>> +
>>> +	t = kzalloc(sizeof(*t) * THREAD_PER_CPU, GFP_KERNEL);
>>> +	if (!t) {
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>> +	}
>>> +
>>> +	for (i = 0; i < THREAD_PER_CPU; i++) {
>>> +		init_completion(&t[i].thread_complete);
>>> +		init_waitqueue_head(&t[i].poll_waitq);
>>> +		INIT_LIST_HEAD(&t[i].list);
>>> +		INIT_LIST_HEAD(&t[i].done);
>>> +		mutex_init(&t[i].list_lock);
>>> +		t[i].id = i;
>>> +
>>> +		t[i].thread = kthread_create(nvmet_poll_thread, &t[i],
>>> +					     "nvmet_poll_thread%s/%d",
>>> +					     ns->device_path, i);
>>> +
>>> +		if (IS_ERR(t[i].thread)) {
>>> +			ret = PTR_ERR(t[i].thread);
>>> +			goto err;
>>> +		}
>>> +
>>> +		kthread_bind(t[i].thread, i % num_online_cpus());
>>> +		wake_up_process(t[i].thread);
>>> +		wait_for_completion(&t[i].thread_complete);
>>> +	}
>>> +
>>> +	ns->t = t;
>>> +out:
>>> +	return ret;
>>> +err:
>>> +	nvmet_ns_stop_poll(ns);
>>> +	goto out;
>>> +}
>>> +
>>> +void nvmet_ns_stop_poll(struct nvmet_ns *ns)
>>> +{
>>> +	struct nvmet_poll_data *t = ns->t;
>>> +	int i;
>>> +
>>> +	for (i = 0; i < THREAD_PER_CPU; i++) {
>>> +		if (!t[i].thread)
>>> +			continue;
>>> +
>>> +		if (wq_has_sleeper(&t[i].poll_waitq))
>>> +			wake_up(&t[i].poll_waitq);
>>> +		kthread_park(t[i].thread);
>>> +		kthread_stop(t[i].thread);
>>> +	}
>>> +}
>>> +
>>> +static void nvmet_poll(struct nvmet_poll_data *t)
>>> +{
>>> +	struct nvmet_req *req, *tmp;
>>> +
>>> +	lockdep_assert_held(&t->list_lock);
>>> +
>>> +	list_for_each_entry_safe(req, tmp, &t->list, poll_entry) {
>>> +
>>> +		if (completion_done(&req->wait)) {
>>> +			list_move_tail(&req->poll_entry, &t->done);
>>> +			continue;
>>> +		}
>>> +
>>> +		if (!list_empty(&t->done))
>>> +			break;
>>> +
>>> +		list_del(&req->poll_entry);
>>> +		mutex_unlock(&t->list_lock);
>>> +		nvmet_req_poll_complete(req);
>>> +		mutex_lock(&t->list_lock);
>>> +	}
>>> +	mutex_unlock(&t->list_lock);
>>> +	/*
>>> +	 * In the future we can also add a batch timeout or a number of
>>> +	 * requests to complete.
>>> +	 */
>>> +	while (!list_empty(&t->done)) {
>>> +		/*
>>> +		 * We lock and unlock t->list_lock, which guarantees progress
>>> +		 * of nvmet_xxx_rw_execute() under pressure while we complete
>>> +		 * the request.
>>> +		 */
>>> +		req = list_first_entry(&t->done, struct nvmet_req, poll_entry);
>>> +		list_del(&req->poll_entry);
>>> +		nvmet_req_done(req);
>>> +	}
>>> +
>>> +	mutex_lock(&t->list_lock);
>>> +}
>>> +
>>> +static int nvmet_poll_thread(void *data)
>>> +{
>>> +	struct nvmet_poll_data *t = (struct nvmet_poll_data *) data;
>>> +	DEFINE_WAIT(wait);
>>> +
>>> +	complete(&t->thread_complete);
>>> +
>>> +	while (!kthread_should_park()) {
>>> +
>>> +		mutex_lock(&t->list_lock);
>>> +		while (!list_empty(&t->list) && !need_resched())
>>> +			nvmet_poll(t);
>>> +		mutex_unlock(&t->list_lock);
>>> +
>>> +		prepare_to_wait(&t->poll_waitq, &wait, TASK_INTERRUPTIBLE);
>>> +		if (signal_pending(current))
>>> +			flush_signals(current);
>>> +		smp_mb();
>>> +		schedule();
>>> +
>>> +		finish_wait(&t->poll_waitq, &wait);
>>> +	}
>>> +
>>> +	kthread_parkme();
>>> +	return 0;
>>> +}
>>> +
>>> +inline void nvmet_req_prep_poll(struct nvmet_req *req)
>>> +{
>>> +	if (!req->poll)
>>> +		return;
>>> +
>>> +	init_completion(&req->wait);
>>> +	req->t = &req->ns->t[smp_processor_id()];
>>> +}
>>> +
>>> +inline void nvmet_req_issue_poll(struct nvmet_req *req)
>>> +{
>>> +	if (!req->poll)
>>> +		return;
>>> +
>>> +	while (!mutex_trylock(&req->t->list_lock)) {
>>> +		if (req->t->id < num_online_cpus())
>>> +			req->t = &req->ns->t[req->t->id + num_online_cpus()];
>>> +		else
>>> +			req->t = &req->ns->t[req->t->id - num_online_cpus()];
>>> +	}
>>> +
>>> +	if (completion_done(&req->wait))
>>> +		list_add(&req->poll_entry, &req->t->list);
>>> +	else
>>> +		list_add_tail(&req->poll_entry, &req->t->list);
>>> +	mutex_unlock(&req->t->list_lock);
>>> +
>>> +	if (wq_has_sleeper(&req->t->poll_waitq))
>>> +		wake_up(&req->t->poll_waitq);
>>> +}
>>> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
>>> index 46df45e837c9..ef2919e23e0b 100644
>>> --- a/drivers/nvme/target/nvmet.h
>>> +++ b/drivers/nvme/target/nvmet.h
>>> @@ -49,11 +49,22 @@
>>>     #define IPO_IATTR_CONNECT_SQE(x)	\
>>>     	(cpu_to_le32(offsetof(struct nvmf_connect_command, x)))
>>>
>>> +struct nvmet_poll_data {
>>> +	struct completion		thread_complete;
>>> +	wait_queue_head_t		poll_waitq;
>>> +	struct mutex			list_lock;
>>> +	struct task_struct		*thread;
>>> +	struct list_head		list;
>>> +	struct list_head		done;
>>> +	unsigned int			id;
>>> +};
>>> +
>>>     struct nvmet_ns {
>>>     	struct list_head	dev_link;
>>>     	struct percpu_ref	ref;
>>>     	struct block_device	*bdev;
>>>     	struct file		*file;
>>> +	struct nvmet_poll_data	*t;
>>>     	bool			readonly;
>>>     	u32			nsid;
>>>     	u32			blksize_shift;
>>> @@ -63,6 +74,8 @@ struct nvmet_ns {
>>>     	u32			anagrpid;
>>>
>>>     	bool			buffered_io;
>>> +	bool			use_poll;
>>> +	bool			poll;
>>>     	bool			enabled;
>>>     	struct nvmet_subsys	*subsys;
>>>     	const char		*device_path;
>>> @@ -292,9 +305,15 @@ struct nvmet_req {
>>>     	struct nvmet_ns		*ns;
>>>     	struct scatterlist	*sg;
>>>     	struct bio_vec		inline_bvec[NVMET_MAX_INLINE_BIOVEC];
>>> +	struct completion	wait;
>>> +	bool			poll;
>>> +	struct nvmet_poll_data  *t;
>>> +	struct list_head	poll_entry;
>>>     	union {
>>>     		struct {
>>> -			struct bio      inline_bio;
>>> +			struct bio		inline_bio;
>>> +			blk_qc_t		cookie;
>>> +			struct bio		*last_bio;
>>>     		} b;
>>>     		struct {
>>>     			bool			mpool_alloc;
>>> @@ -442,6 +461,16 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys,
>>>     void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>>     		u8 event_info, u8 log_page);
>>>
>>> +int nvmet_ns_start_poll(struct nvmet_ns *ns);
>>> +void nvmet_ns_stop_poll(struct nvmet_ns *ns);
>>> +void nvmet_req_prep_poll(struct nvmet_req *req);
>>> +void nvmet_req_issue_poll(struct nvmet_req *req);
>>> +
>>> +void nvmet_req_poll_complete(struct nvmet_req *req);
>>> +void nvmet_bdev_poll_complete(struct nvmet_req *req);
>>> +void nvmet_bdev_req_complete(struct nvmet_req *req);
>>> +void nvmet_req_done(struct nvmet_req *req);
>>> +
>>>     #define NVMET_QUEUE_SIZE	1024
>>>     #define NVMET_NR_QUEUES		128
>>>     #define NVMET_MAX_CMD		NVMET_QUEUE_SIZE


Thread overview: 12+ messages
2019-12-10  6:25 [RFC PATCH 0/2] nvmet: add polling support Chaitanya Kulkarni
2019-12-10  6:25 ` [RFC PATCH 1/2] nvmet: add bdev-ns " Chaitanya Kulkarni
2020-01-20 12:52   ` Max Gurtovoy
2020-01-21 19:22     ` Chaitanya Kulkarni
2020-01-23 14:23       ` Max Gurtovoy [this message]
2020-01-30 18:19         ` Chaitanya Kulkarni
2019-12-10  6:25 ` [RFC PATCH 2/2] nvmet: add file-ns " Chaitanya Kulkarni
2019-12-12  1:01 ` [RFC PATCH 0/2] nvmet: add " Sagi Grimberg
2019-12-12  5:44   ` Chaitanya Kulkarni
2019-12-12 20:32     ` Sagi Grimberg
2020-01-20  5:13       ` Chaitanya Kulkarni
2020-01-20  4:48   ` Chaitanya Kulkarni
