From: Max Gurtovoy <maxg@mellanox.com>
To: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>,
"sagi@grimberg.me" <sagi@grimberg.me>, "hch@lst.de" <hch@lst.de>
Cc: "linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Subject: Re: [RFC PATCH 1/2] nvmet: add bdev-ns polling support
Date: Thu, 23 Jan 2020 16:23:30 +0200
Message-ID: <03494a32-b31a-432b-870d-7347f5cc9596@mellanox.com>
In-Reply-To: <DM6PR04MB575431D7764F7D4BB98AFA5B860D0@DM6PR04MB5754.namprd04.prod.outlook.com>
On 1/21/2020 9:22 PM, Chaitanya Kulkarni wrote:
> Please see my in-line comments.
> On 01/20/2020 04:52 AM, Max Gurtovoy wrote:
>> On 12/10/2019 8:25 AM, Chaitanya Kulkarni wrote:
>>> This patch adds support for bdev-ns polling. We also add a new
>>> file (io-poll.c) to keep the polling code separate. The newly added
>>> configfs attribute allows the user to enable/disable polling.
>>>
>>> We only enable polling for READ/WRITE operations, based on support
>>> from the bdev and the use_poll configfs attribute.
>>>
>>> We configure polling per namespace. For each namespace we create
>>> polling threads. For the general approach, please have a look at the
>>> cover-letter of this patch-series.
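[For reference, the new attribute would presumably be driven like the existing per-namespace configfs attributes; a usage sketch, with hypothetical subsystem/namespace paths ("testnqn", namespace 1):

```shell
# use_poll can only be changed while the namespace is disabled
echo 0 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/use_poll
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable
```

Polling then actually kicks in only if the backing queue also advertises QUEUE_FLAG_POLL.]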
>> It would be great to get some numbers here to learn about the trade-off
>> for this approach.
>>
>>
> I already posted these numbers for QEMU with nvme-loop here:
> http://lists.infradead.org/pipermail/linux-nvme/2020-January/028692.html
>
> IOPS/BW:-
>
> Default:-
> read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec)
> read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec)
> read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec)
> Poll:-
> read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec)
> read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec)
> read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec)
>
> mpstat:-
> Default:-
> CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
> ---------------------------------------------------------
> all 1.18 0.00 |60.14| 0.00 0.00 0.00 0.08 0.00 0.00 38.60
> ---------------------------------------------------------
> 0 0.00 0.00 | 32.00 |0.00 0.00 0.00 0.00 0.00 0.00 68.00
> 1 1.01 0.00 | 42.42 |0.00 0.00 0.00 0.00 0.00 0.00 56.57
> 2 1.01 0.00 | 57.58 |0.00 0.00 0.00 0.00 0.00 0.00 41.41
> 3 2.02 0.00 | 79.80 |0.00 0.00 0.00 0.00 0.00 0.00 18.18
> 4 1.01 0.00 | 63.64 |0.00 0.00 0.00 0.00 0.00 0.00 35.35
> 5 3.09 0.00 | 63.92 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
> 6 2.02 0.00 | 75.76 |0.00 0.00 0.00 0.00 0.00 0.00 22.22
> 7 1.02 0.00 | 57.14 |0.00 0.00 0.00 0.00 0.00 0.00 41.84
> 8 0.00 0.00 | 67.01 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
> 9 1.01 0.00 | 59.60 |0.00 0.00 0.00 0.00 0.00 0.00 39.39
> 10 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73
> 11 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73
>
> Poll:-
> CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
> ---------------------------------------------------------
> all 1.08 0.00 98.08 0.00 0.00 0.00 0.08 0.00 0.00 0.75
> ---------------------------------------------------------
> 0 2.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 1.00
> 1 0.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 3.00
> 2 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 3 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 4 1.01 0.00 | 98.99 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 5 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 6 0.99 0.00 | 99.01 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 7 1.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 2.00
> 8 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 9 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 10 1.98 0.00| 94.06 |0.00 0.00 0.00 0.00 0.00 0.00 3.96
> 11 1.00 0.00| 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
With all due respect to nvmet-loop, it isn't a real use case, right?
Can you re-run this using RDMA and TCP?
What is the backing device for these tests, NVMe or null_blk?
>>> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
>>> ---
>>> drivers/nvme/target/Makefile | 3 +-
>>> drivers/nvme/target/configfs.c | 29 ++++++
>>> drivers/nvme/target/core.c | 13 +++
>>> drivers/nvme/target/io-cmd-bdev.c | 61 +++++++++--
>>> drivers/nvme/target/io-poll.c | 165 ++++++++++++++++++++++++++++++
>>> drivers/nvme/target/nvmet.h | 31 +++++-
>>> 6 files changed, 291 insertions(+), 11 deletions(-)
>>> create mode 100644 drivers/nvme/target/io-poll.c
>>>
>>> diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
>>> index 2b33836f3d3e..8877ba16305c 100644
>>> --- a/drivers/nvme/target/Makefile
>>> +++ b/drivers/nvme/target/Makefile
>>> @@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP) += nvme-fcloop.o
>>> obj-$(CONFIG_NVME_TARGET_TCP) += nvmet-tcp.o
>>>
>>> nvmet-y += core.o configfs.o admin-cmd.o fabrics-cmd.o \
>>> - discovery.o io-cmd-file.o io-cmd-bdev.o
>>> + discovery.o io-cmd-file.o io-cmd-bdev.o \
>>> + io-poll.o
>>> nvme-loop-y += loop.o
>>> nvmet-rdma-y += rdma.o
>>> nvmet-fc-y += fc.o
>>> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
>>> index 98613a45bd3b..0e6e8b0dbf79 100644
>>> --- a/drivers/nvme/target/configfs.c
>>> +++ b/drivers/nvme/target/configfs.c
>>> @@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item,
>>>
>>> CONFIGFS_ATTR(nvmet_ns_, buffered_io);
>>>
>>> +static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page)
>>> +{
>>> + return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll);
>>> +}
>>> +
>>> +static ssize_t nvmet_ns_use_poll_store(struct config_item *item,
>>> + const char *page, size_t count)
>>> +{
>>> + struct nvmet_ns *ns = to_nvmet_ns(item);
>>> + bool val;
>>> +
>>> + if (strtobool(page, &val))
>>> + return -EINVAL;
>>> +
>>> + mutex_lock(&ns->subsys->lock);
>>> + if (ns->enabled) {
>>> + pr_err("disable ns before setting use_poll value.\n");
>>> + mutex_unlock(&ns->subsys->lock);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + ns->use_poll = val;
>>> + mutex_unlock(&ns->subsys->lock);
>>> + return count;
>>> +}
>>> +
>>> +CONFIGFS_ATTR(nvmet_ns_, use_poll);
>>> +
>>> static struct configfs_attribute *nvmet_ns_attrs[] = {
>>> &nvmet_ns_attr_device_path,
>>> &nvmet_ns_attr_device_nguid,
>>> @@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
>>> &nvmet_ns_attr_ana_grpid,
>>> &nvmet_ns_attr_enable,
>>> &nvmet_ns_attr_buffered_io,
>>> + &nvmet_ns_attr_use_poll,
>>> #ifdef CONFIG_PCI_P2PDMA
>>> &nvmet_ns_attr_p2pmem,
>>> #endif
>>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
>>> index 28438b833c1b..d8f9130d1cd1 100644
>>> --- a/drivers/nvme/target/core.c
>>> +++ b/drivers/nvme/target/core.c
>>> @@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl,
>>> ns->nsid);
>>> }
>>>
>>> +inline void nvmet_req_done(struct nvmet_req *req)
>>> +{
>>> + if (req->ns->bdev)
>>> + nvmet_bdev_req_complete(req);
>>> +}
>>> +
>>> +inline void nvmet_req_poll_complete(struct nvmet_req *req)
>>> +{
>>> + if (req->ns->bdev)
>>> + nvmet_bdev_poll_complete(req);
>>> +}
>>> +
>>> int nvmet_ns_enable(struct nvmet_ns *ns)
>>> {
>>> struct nvmet_subsys *subsys = ns->subsys;
>>> @@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid)
>>>
>>> uuid_gen(&ns->uuid);
>>> ns->buffered_io = false;
>>> + ns->use_poll = false;
>>>
>>> return ns;
>>> }
>>> diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
>>> index b6fca0e421ef..13507e0cbc1d 100644
>>> --- a/drivers/nvme/target/io-cmd-bdev.c
>>> +++ b/drivers/nvme/target/io-cmd-bdev.c
>>> @@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
>>>
>>> int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>>> {
>>> + int fmode = FMODE_READ | FMODE_WRITE;
>>> + struct request_queue *q;
>>> int ret;
>>>
>>> - ns->bdev = blkdev_get_by_path(ns->device_path,
>>> - FMODE_READ | FMODE_WRITE, NULL);
>>> + ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL);
>>> if (IS_ERR(ns->bdev)) {
>>> ret = PTR_ERR(ns->bdev);
>>> if (ret != -ENOTBLK) {
>>> @@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>>> ns->device_path, PTR_ERR(ns->bdev));
>>> }
>>> ns->bdev = NULL;
>>> - return ret;
>>> + goto out;
>>> }
>>> ns->size = i_size_read(ns->bdev->bd_inode);
>>> ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
>>> - return 0;
>>> + q = bdev_get_queue(ns->bdev);
>>> + ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
>>> + ret = ns->poll ? nvmet_ns_start_poll(ns) : 0;
>>> +out:
>>> + return ret;
>>> }
>>>
>>> void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
>>> {
>>> if (ns->bdev) {
>>> + ns->poll ? nvmet_ns_stop_poll(ns) : 0;
>> Can you please use an "if" statement instead of the above convention?
>>
> This is just an RFC; I will send out a formal patch series with all
> the coding-style fixes.
>
>>> blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ);
>>> ns->bdev = NULL;
>>> }
>>> @@ -133,15 +139,39 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts)
>>> return status;
>>> }
>>>
>>> -static void nvmet_bio_done(struct bio *bio)
>>> +void nvmet_bdev_req_complete(struct nvmet_req *req)
>>> {
>>> - struct nvmet_req *req = bio->bi_private;
>>> + struct bio *bio = req->b.last_bio;
>>>
>>> nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status));
>>> if (bio != &req->b.inline_bio)
>>> bio_put(bio);
>>> }
>>>
>>> +static void nvmet_bio_done(struct bio *bio)
>>> +{
>>> + struct nvmet_req *req = bio->bi_private;
>>> +
>>> + req->b.last_bio = bio;
>>> +
>>> + req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req);
>> Same here for the "if". Let's keep the code as readable as we can.
>>
> Same as above.
>>> +}
>>> +
>>> +void nvmet_bdev_poll_complete(struct nvmet_req *req)
>>> +{
>>> + struct request_queue *q = bdev_get_queue(req->ns->bdev);
>>> +
>>> + while (!completion_done(&req->wait)) {
>>> + int ret = blk_poll(q, req->b.cookie, true);
>>> +
>>> + if (ret < 0) {
>>> + pr_err("tid %d poll error %d", req->t->id, ret);
>>> + break;
>>> + }
>>> + }
>>> + nvmet_bdev_req_complete(req);
>>> +}
>>> +
>>> static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>> {
>>> int sg_cnt = req->sg_cnt;
>>> @@ -185,7 +215,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>> bio->bi_end_io = nvmet_bio_done;
>>> bio->bi_opf = op;
>>>
>>> - blk_start_plug(&plug);
>>> + if (!req->poll)
>>> + blk_start_plug(&plug);
>>> for_each_sg(req->sg, sg, req->sg_cnt, i) {
>>> while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
>>> != sg->length) {
>>> @@ -204,8 +235,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>> sg_cnt--;
>>> }
>>>
>>> - submit_bio(bio);
>>> - blk_finish_plug(&plug);
>>> + req->b.last_bio = bio;
>>> + if (req->poll)
>>> + req->b.last_bio->bi_opf |= REQ_HIPRI;
>>> +
>>> + req->b.cookie = submit_bio(bio);
>>> +
>>> + nvmet_req_prep_poll(req);
>>> + nvmet_req_issue_poll(req);
>>> + if (!req->poll)
>>> + blk_finish_plug(&plug);
>>> }
>>>
>>> static void nvmet_bdev_execute_flush(struct nvmet_req *req)
>>> @@ -330,15 +369,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
>>> switch (cmd->common.opcode) {
>>> case nvme_cmd_read:
>>> case nvme_cmd_write:
>>> + req->poll = req->ns->poll;
>>> req->execute = nvmet_bdev_execute_rw;
>>> return 0;
>>> case nvme_cmd_flush:
>>> + req->poll = false;
>> This should be done in some general place for req initialization.
>>
>>
> This needs to be done here for better readability, since it depends
> on two factors: backend support and the user setting.
>
>>> req->execute = nvmet_bdev_execute_flush;
>>> return 0;
>>> case nvme_cmd_dsm:
>>> + req->poll = false;
>>> req->execute = nvmet_bdev_execute_dsm;
>>> return 0;
>>> case nvme_cmd_write_zeroes:
>>> + req->poll = false;
>>> req->execute = nvmet_bdev_execute_write_zeroes;
>>> return 0;
>>> default:
>>> diff --git a/drivers/nvme/target/io-poll.c b/drivers/nvme/target/io-poll.c
>>> new file mode 100644
>>> index 000000000000..175c939c22ff
>>> --- /dev/null
>>> +++ b/drivers/nvme/target/io-poll.c
>>> @@ -0,0 +1,165 @@
>>> +#include <linux/blkdev.h>
>>> +#include <linux/module.h>
>>> +#include <linux/sched/signal.h>
>>> +#include <linux/kthread.h>
>>> +#include <uapi/linux/sched/types.h>
>>> +
>>> +#include "nvmet.h"
>>> +#define THREAD_PER_CPU (num_online_cpus() * 2)
>>> +
>>> +static int nvmet_poll_thread(void *data);
>>> +
>>> +int nvmet_ns_start_poll(struct nvmet_ns *ns)
>>> +{
>>> + struct nvmet_poll_data *t;
>>> + int ret = 0;
>>> + int i;
>>> +
>>> + t = kzalloc(sizeof(*t) * THREAD_PER_CPU, GFP_KERNEL);
>>> + if (!t) {
>>> + ret = -ENOMEM;
>>> + goto out;
>>> + }
>>> +
>>> + for (i = 0; i < THREAD_PER_CPU; i++) {
>>> + init_completion(&t[i].thread_complete);
>>> + init_waitqueue_head(&t[i].poll_waitq);
>>> + INIT_LIST_HEAD(&t[i].list);
>>> + INIT_LIST_HEAD(&t[i].done);
>>> + mutex_init(&t[i].list_lock);
>>> + t[i].id = i;
>>> +
>>> + t[i].thread = kthread_create(nvmet_poll_thread, &t[i],
>>> + "nvmet_poll_thread%s/%d",
>>> + ns->device_path, i);
>>> +
>>> + if (IS_ERR(t[i].thread)) {
>>> + ret = PTR_ERR(t[i].thread);
>>> + goto err;
>>> + }
>>> +
>>> + kthread_bind(t[i].thread, i % num_online_cpus());
>>> + wake_up_process(t[i].thread);
>>> + wait_for_completion(&t[i].thread_complete);
>>> + }
>>> +
>>> + ns->t = t;
>>> +out:
>>> + return ret;
>>> +err:
>>> + nvmet_ns_stop_poll(ns);
>>> + goto out;
>>> +}
>>> +
>>> +void nvmet_ns_stop_poll(struct nvmet_ns *ns)
>>> +{
>>> + struct nvmet_poll_data *t = ns->t;
>>> + int i;
>>> +
>>> + for (i = 0; i < THREAD_PER_CPU; i++) {
>>> + if (!t[i].thread)
>>> + continue;
>>> +
>>> + if (wq_has_sleeper(&t[i].poll_waitq))
>>> + wake_up(&t[i].poll_waitq);
>>> + kthread_park(t[i].thread);
>>> + kthread_stop(t[i].thread);
>>> + }
>>> +}
>>> +
>>> +static void nvmet_poll(struct nvmet_poll_data *t)
>>> +{
>>> + struct nvmet_req *req, *tmp;
>>> +
>>> + lockdep_assert_held(&t->list_lock);
>>> +
>>> + list_for_each_entry_safe(req, tmp, &t->list, poll_entry) {
>>> +
>>> + if (completion_done(&req->wait)) {
>>> + list_move_tail(&req->poll_entry, &t->done);
>>> + continue;
>>> + }
>>> +
>>> + if (!list_empty(&t->done))
>>> + break;
>>> +
>>> + list_del(&req->poll_entry);
>>> + mutex_unlock(&t->list_lock);
>>> + nvmet_req_poll_complete(req);
>>> + mutex_lock(&t->list_lock);
>>> + }
>>> + mutex_unlock(&t->list_lock);
>>> + /*
>>> + * In the future we can also add a batch timeout or nr of requests to complete.
>>> + */
>>> + while (!list_empty(&t->done)) {
>>> + /*
>>> + * We lock and unlock t->list_lock, which guarantees progress of
>>> + * nvmet_xxx_rw_execute() when under pressure while we complete
>>> + * the request.
>>> + */
>>> + req = list_first_entry(&t->done, struct nvmet_req, poll_entry);
>>> + list_del(&req->poll_entry);
>>> + nvmet_req_done(req);
>>> + }
>>> +
>>> + mutex_lock(&t->list_lock);
>>> +}
>>> +
>>> +static int nvmet_poll_thread(void *data)
>>> +{
>>> + struct nvmet_poll_data *t = (struct nvmet_poll_data *) data;
>>> + DEFINE_WAIT(wait);
>>> +
>>> + complete(&t->thread_complete);
>>> +
>>> + while (!kthread_should_park()) {
>>> +
>>> + mutex_lock(&t->list_lock);
>>> + while (!list_empty(&t->list) && !need_resched())
>>> + nvmet_poll(t);
>>> + mutex_unlock(&t->list_lock);
>>> +
>>> + prepare_to_wait(&t->poll_waitq, &wait, TASK_INTERRUPTIBLE);
>>> + if (signal_pending(current))
>>> + flush_signals(current);
>>> + smp_mb();
>>> + schedule();
>>> +
>>> + finish_wait(&t->poll_waitq, &wait);
>>> + }
>>> +
>>> + kthread_parkme();
>>> + return 0;
>>> +}
>>> +
>>> +inline void nvmet_req_prep_poll(struct nvmet_req *req)
>>> +{
>>> + if (!req->poll)
>>> + return;
>>> +
>>> + init_completion(&req->wait);
>>> + req->t = &req->ns->t[smp_processor_id()];
>>> +}
>>> +
>>> +inline void nvmet_req_issue_poll(struct nvmet_req *req)
>>> +{
>>> + if (!req->poll)
>>> + return;
>>> +
>>> + while (!mutex_trylock(&req->t->list_lock)) {
>>> + if (req->t->id < num_online_cpus())
>>> + req->t = &req->ns->t[req->t->id + num_online_cpus()];
>>> + else
>>> + req->t = &req->ns->t[req->t->id - num_online_cpus()];
>>> + }
>>> +
>>> + if (completion_done(&req->wait))
>>> + list_add(&req->poll_entry, &req->t->list);
>>> + else
>>> + list_add_tail(&req->poll_entry, &req->t->list);
>>> + mutex_unlock(&req->t->list_lock);
>>> +
>>> + if (wq_has_sleeper(&req->t->poll_waitq))
>>> + wake_up(&req->t->poll_waitq);
>>> +}
>>> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
>>> index 46df45e837c9..ef2919e23e0b 100644
>>> --- a/drivers/nvme/target/nvmet.h
>>> +++ b/drivers/nvme/target/nvmet.h
>>> @@ -49,11 +49,22 @@
>>> #define IPO_IATTR_CONNECT_SQE(x) \
>>> (cpu_to_le32(offsetof(struct nvmf_connect_command, x)))
>>>
>>> +struct nvmet_poll_data {
>>> + struct completion thread_complete;
>>> + wait_queue_head_t poll_waitq;
>>> + struct mutex list_lock;
>>> + struct task_struct *thread;
>>> + struct list_head list;
>>> + struct list_head done;
>>> + unsigned int id;
>>> +};
>>> +
>>> struct nvmet_ns {
>>> struct list_head dev_link;
>>> struct percpu_ref ref;
>>> struct block_device *bdev;
>>> struct file *file;
>>> + struct nvmet_poll_data *t;
>>> bool readonly;
>>> u32 nsid;
>>> u32 blksize_shift;
>>> @@ -63,6 +74,8 @@ struct nvmet_ns {
>>> u32 anagrpid;
>>>
>>> bool buffered_io;
>>> + bool use_poll;
>>> + bool poll;
>>> bool enabled;
>>> struct nvmet_subsys *subsys;
>>> const char *device_path;
>>> @@ -292,9 +305,15 @@ struct nvmet_req {
>>> struct nvmet_ns *ns;
>>> struct scatterlist *sg;
>>> struct bio_vec inline_bvec[NVMET_MAX_INLINE_BIOVEC];
>>> + struct completion wait;
>>> + bool poll;
>>> + struct nvmet_poll_data *t;
>>> + struct list_head poll_entry;
>>> union {
>>> struct {
>>> - struct bio inline_bio;
>>> + struct bio inline_bio;
>>> + blk_qc_t cookie;
>>> + struct bio *last_bio;
>>> } b;
>>> struct {
>>> bool mpool_alloc;
>>> @@ -442,6 +461,16 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys,
>>> void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>> u8 event_info, u8 log_page);
>>>
>>> +int nvmet_ns_start_poll(struct nvmet_ns *ns);
>>> +void nvmet_ns_stop_poll(struct nvmet_ns *ns);
>>> +void nvmet_req_prep_poll(struct nvmet_req *req);
>>> +void nvmet_req_issue_poll(struct nvmet_req *req);
>>> +
>>> +void nvmet_req_poll_complete(struct nvmet_req *req);
>>> +void nvmet_bdev_poll_complete(struct nvmet_req *req);
>>> +void nvmet_bdev_req_complete(struct nvmet_req *req);
>>> +void nvmet_req_done(struct nvmet_req *req);
>>> +
>>> #define NVMET_QUEUE_SIZE 1024
>>> #define NVMET_NR_QUEUES 128
>>> #define NVMET_MAX_CMD NVMET_QUEUE_SIZE
_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme