* [RFC PATCH 0/2] nvmet: add polling support
@ 2019-12-10  6:25 Chaitanya Kulkarni
  2019-12-10  6:25 ` [RFC PATCH 1/2] nvmet: add bdev-ns " Chaitanya Kulkarni
  ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Chaitanya Kulkarni @ 2019-12-10  6:25 UTC (permalink / raw)
To: sagi, hch; +Cc: Chaitanya Kulkarni, linux-nvme

Hi Christoph/Sagi/Keith,

This patch series implements polling support for the NVMeOF target,
maintaining separate kthreads for polling. By implementing polling for
both the file and bdev backends I was able to create stubs for target
polling which will be useful for the passthru code.

1. General approach:-

After submitting the request in (nvmet_file|bdev_execute_rw()) we place
the request on the available polling thread's list (poll_list), which is
shared between the submission thread and the polling thread. We create
two polling threads per CPU to improve the scalability of I/O
submission.

When a polling thread is woken up by the submission thread, it moves any
request it finds completed (by the backend bdev/file) to its private
list (done list) for processing. This allows outstanding submission
requests to make forward progress under pressure. When there are no
completed requests on the poll_list, the polling thread removes a
request from the poll_list and calls the backend-specific poll function
on it until it completes. On completion, it finishes the request in the
same context over the transport.

2. Testing and verification:-

I ran fio verification tests for a couple of days and didn't find any
issues with data verification. Also, for both bdev and file I could see
the io_poll stats under debugfs, with the considered/invoked/success
numbers increasing.

3. Performance improvements:-

There are still some performance improvements that can be done.

4. User interface:-

Users can optionally enable/disable polling with the newly added
configfs attribute use_poll.
This only enables polling if the backend (bdev/file) supports it;
otherwise we just ignore this value.

It will be great if you can provide some feedback. I'm aware of the
io-wq code for io_uring posted on the mailing list. I was just wondering
whether we should explore that to implement the polling?

Patchset info:-
Repo:- git://git.infradead.org/nvme.git
Branch:- nvme/for-5.5
HEAD:- commit 7e4c6b9a5d22485acf009b3c3510a370f096dd54 (origin/nvme/for-5.5)
Author: Keith Busch <kbusch@kernel.org>
Date:   Fri Dec 6 08:11:17 2019 +0900

    nvme/pci: Fix read queue count

Regards,
Chaitanya

Chaitanya Kulkarni (2):
  nvmet: add bdev-ns polling support
  nvmet: add file-ns polling support

 drivers/nvme/target/Makefile      |   3 +-
 drivers/nvme/target/configfs.c    |  29 ++++++
 drivers/nvme/target/core.c        |  17 +++
 drivers/nvme/target/io-cmd-bdev.c |  61 +++++++++--
 drivers/nvme/target/io-cmd-file.c |  60 +++++++++--
 drivers/nvme/target/io-poll.c     | 165 ++++++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h       |  40 +++++++-
 7 files changed, 354 insertions(+), 21 deletions(-)
 create mode 100644 drivers/nvme/target/io-poll.c

--
2.22.1

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [RFC PATCH 1/2] nvmet: add bdev-ns polling support
  2019-12-10  6:25 [RFC PATCH 0/2] nvmet: add polling support Chaitanya Kulkarni
@ 2019-12-10  6:25 ` Chaitanya Kulkarni
  2020-01-20 12:52   ` Max Gurtovoy
  2019-12-10  6:25 ` [RFC PATCH 2/2] nvmet: add file-ns " Chaitanya Kulkarni
  2019-12-12  1:01 ` [RFC PATCH 0/2] nvmet: add " Sagi Grimberg
  2 siblings, 1 reply; 12+ messages in thread
From: Chaitanya Kulkarni @ 2019-12-10  6:25 UTC (permalink / raw)
To: sagi, hch; +Cc: Chaitanya Kulkarni, linux-nvme

This patch adds support for bdev-ns polling. We also add a new file to
keep the polling code separate (io-poll.c). The newly added configfs
attribute allows the user to enable/disable polling. We only enable
polling for READ/WRITE operations, based on support from the bdev and
the use_poll configfs attribute.

We configure polling per namespace. For each namespace we create
polling threads. For the general approach please have a look at the
cover letter of this patch series.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 drivers/nvme/target/Makefile      |   3 +-
 drivers/nvme/target/configfs.c    |  29 ++++++
 drivers/nvme/target/core.c        |  13 +++
 drivers/nvme/target/io-cmd-bdev.c |  61 +++++++++--
 drivers/nvme/target/io-poll.c     | 165 ++++++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h       |  31 +++++-
 6 files changed, 291 insertions(+), 11 deletions(-)
 create mode 100644 drivers/nvme/target/io-poll.c

diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
index 2b33836f3d3e..8877ba16305c 100644
--- a/drivers/nvme/target/Makefile
+++ b/drivers/nvme/target/Makefile
@@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP) += nvme-fcloop.o
 obj-$(CONFIG_NVME_TARGET_TCP) += nvmet-tcp.o

 nvmet-y += core.o configfs.o admin-cmd.o fabrics-cmd.o \
-			discovery.o io-cmd-file.o io-cmd-bdev.o
+			discovery.o io-cmd-file.o io-cmd-bdev.o \
+			io-poll.o
 nvme-loop-y += loop.o
 nvmet-rdma-y += rdma.o
 nvmet-fc-y += fc.o
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index
98613a45bd3b..0e6e8b0dbf79 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item, CONFIGFS_ATTR(nvmet_ns_, buffered_io); +static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page) +{ + return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll); +} + +static ssize_t nvmet_ns_use_poll_store(struct config_item *item, + const char *page, size_t count) +{ + struct nvmet_ns *ns = to_nvmet_ns(item); + bool val; + + if (strtobool(page, &val)) + return -EINVAL; + + mutex_lock(&ns->subsys->lock); + if (ns->enabled) { + pr_err("disable ns before setting use_poll value.\n"); + mutex_unlock(&ns->subsys->lock); + return -EINVAL; + } + + ns->use_poll = val; + mutex_unlock(&ns->subsys->lock); + return count; +} + +CONFIGFS_ATTR(nvmet_ns_, use_poll); + static struct configfs_attribute *nvmet_ns_attrs[] = { &nvmet_ns_attr_device_path, &nvmet_ns_attr_device_nguid, @@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = { &nvmet_ns_attr_ana_grpid, &nvmet_ns_attr_enable, &nvmet_ns_attr_buffered_io, + &nvmet_ns_attr_use_poll, #ifdef CONFIG_PCI_P2PDMA &nvmet_ns_attr_p2pmem, #endif diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 28438b833c1b..d8f9130d1cd1 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl, ns->nsid); } +inline void nvmet_req_done(struct nvmet_req *req) +{ + if (req->ns->bdev) + nvmet_bdev_req_complete(req); +} + +inline void nvmet_req_poll_complete(struct nvmet_req *req) +{ + if (req->ns->bdev) + nvmet_bdev_poll_complete(req); +} + int nvmet_ns_enable(struct nvmet_ns *ns) { struct nvmet_subsys *subsys = ns->subsys; @@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid) uuid_gen(&ns->uuid); ns->buffered_io = false; + ns->use_poll = false; return ns; } diff --git 
a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index b6fca0e421ef..13507e0cbc1d 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id) int nvmet_bdev_ns_enable(struct nvmet_ns *ns) { + int fmode = FMODE_READ | FMODE_WRITE; + struct request_queue *q; int ret; - ns->bdev = blkdev_get_by_path(ns->device_path, - FMODE_READ | FMODE_WRITE, NULL); + ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL); if (IS_ERR(ns->bdev)) { ret = PTR_ERR(ns->bdev); if (ret != -ENOTBLK) { @@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns) ns->device_path, PTR_ERR(ns->bdev)); } ns->bdev = NULL; - return ret; + goto out; } ns->size = i_size_read(ns->bdev->bd_inode); ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev)); - return 0; + q = bdev_get_queue(ns->bdev); + ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags); + ret = ns->poll ? nvmet_ns_start_poll(ns) : 0; +out: + return ret; } void nvmet_bdev_ns_disable(struct nvmet_ns *ns) { if (ns->bdev) { + ns->poll ? nvmet_ns_stop_poll(ns) : 0; blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ); ns->bdev = NULL; } @@ -133,15 +139,39 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts) return status; } -static void nvmet_bio_done(struct bio *bio) +void nvmet_bdev_req_complete(struct nvmet_req *req) { - struct nvmet_req *req = bio->bi_private; + struct bio *bio = req->b.last_bio; nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status)); if (bio != &req->b.inline_bio) bio_put(bio); } +static void nvmet_bio_done(struct bio *bio) +{ + struct nvmet_req *req = bio->bi_private; + + req->b.last_bio = bio; + + req->poll ? 
complete(&req->wait) : nvmet_bdev_req_complete(req); +} + +void nvmet_bdev_poll_complete(struct nvmet_req *req) +{ + struct request_queue *q = bdev_get_queue(req->ns->bdev); + + while (!completion_done(&req->wait)) { + int ret = blk_poll(q, req->b.cookie, true); + + if (ret < 0) { + pr_err("tid %d poll error %d", req->t->id, ret); + break; + } + } + nvmet_bdev_req_complete(req); +} + static void nvmet_bdev_execute_rw(struct nvmet_req *req) { int sg_cnt = req->sg_cnt; @@ -185,7 +215,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) bio->bi_end_io = nvmet_bio_done; bio->bi_opf = op; - blk_start_plug(&plug); + if (!req->poll) + blk_start_plug(&plug); for_each_sg(req->sg, sg, req->sg_cnt, i) { while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset) != sg->length) { @@ -204,8 +235,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) sg_cnt--; } - submit_bio(bio); - blk_finish_plug(&plug); + req->b.last_bio = bio; + if (req->poll) + req->b.last_bio->bi_opf |= REQ_HIPRI; + + req->b.cookie = submit_bio(bio); + + nvmet_req_prep_poll(req); + nvmet_req_issue_poll(req); + if (!req->poll) + blk_finish_plug(&plug); } static void nvmet_bdev_execute_flush(struct nvmet_req *req) @@ -330,15 +369,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req) switch (cmd->common.opcode) { case nvme_cmd_read: case nvme_cmd_write: + req->poll = req->ns->poll; req->execute = nvmet_bdev_execute_rw; return 0; case nvme_cmd_flush: + req->poll = false; req->execute = nvmet_bdev_execute_flush; return 0; case nvme_cmd_dsm: + req->poll = false; req->execute = nvmet_bdev_execute_dsm; return 0; case nvme_cmd_write_zeroes: + req->poll = false; req->execute = nvmet_bdev_execute_write_zeroes; return 0; default: diff --git a/drivers/nvme/target/io-poll.c b/drivers/nvme/target/io-poll.c new file mode 100644 index 000000000000..175c939c22ff --- /dev/null +++ b/drivers/nvme/target/io-poll.c @@ -0,0 +1,165 @@ +#include <linux/blkdev.h> +#include <linux/module.h> +#include 
<linux/sched/signal.h> +#include <linux/kthread.h> +#include <uapi/linux/sched/types.h> + +#include "nvmet.h" +#define THREAD_PER_CPU (num_online_cpus() * 2) + +static int nvmet_poll_thread(void *data); + +int nvmet_ns_start_poll(struct nvmet_ns *ns) +{ + struct nvmet_poll_data *t; + int ret = 0; + int i; + + t = kzalloc(sizeof(*t) * THREAD_PER_CPU, GFP_KERNEL); + if (!t) { + ret = -ENOMEM; + goto out; + } + + for (i = 0; i < THREAD_PER_CPU; i++) { + init_completion(&t[i].thread_complete); + init_waitqueue_head(&t[i].poll_waitq); + INIT_LIST_HEAD(&t[i].list); + INIT_LIST_HEAD(&t[i].done); + mutex_init(&t[i].list_lock); + t[i].id = i; + + t[i].thread = kthread_create(nvmet_poll_thread, &t[i], + "nvmet_poll_thread%s/%d", + ns->device_path, i); + + if (IS_ERR(t[i].thread)) { + ret = PTR_ERR(t[i].thread); + goto err; + } + + kthread_bind(t[i].thread, i % num_online_cpus()); + wake_up_process(t[i].thread); + wait_for_completion(&t[i].thread_complete); + } + + ns->t = t; +out: + return ret; +err: + nvmet_ns_stop_poll(ns); + goto out; +} + +void nvmet_ns_stop_poll(struct nvmet_ns *ns) +{ + struct nvmet_poll_data *t = ns->t; + int i; + + for (i = 0; i < THREAD_PER_CPU; i++) { + if (!t[i].thread) + continue; + + if (wq_has_sleeper(&t[i].poll_waitq)) + wake_up(&t[i].poll_waitq); + kthread_park(t[i].thread); + kthread_stop(t[i].thread); + } +} + +static void nvmet_poll(struct nvmet_poll_data *t) +{ + struct nvmet_req *req, *tmp; + + lockdep_assert_held(&t->list_lock); + + list_for_each_entry_safe(req, tmp, &t->list, poll_entry) { + + if (completion_done(&req->wait)) { + list_move_tail(&req->poll_entry, &t->done); + continue; + } + + if (!list_empty(&t->done)) + break; + + list_del(&req->poll_entry); + mutex_unlock(&t->list_lock); + nvmet_req_poll_complete(req); + mutex_lock(&t->list_lock); + } + mutex_unlock(&t->list_lock); + /* + * In future we can also add batch timeout or nr request to complete. 
+ */ + while (!list_empty(&t->done)) { + /* + * We lock and unlock for t->list which gurantee progress of + * nvmet_xxx_rw_execute() when under pressure while we complete + * the request. + */ + req = list_first_entry(&t->done, struct nvmet_req, poll_entry); + list_del(&req->poll_entry); + nvmet_req_done(req); + } + + mutex_lock(&t->list_lock); +} + +static int nvmet_poll_thread(void *data) +{ + struct nvmet_poll_data *t = (struct nvmet_poll_data *) data; + DEFINE_WAIT(wait); + + complete(&t->thread_complete); + + while (!kthread_should_park()) { + + mutex_lock(&t->list_lock); + while (!list_empty(&t->list) && !need_resched()) + nvmet_poll(t); + mutex_unlock(&t->list_lock); + + prepare_to_wait(&t->poll_waitq, &wait, TASK_INTERRUPTIBLE); + if (signal_pending(current)) + flush_signals(current); + smp_mb(); + schedule(); + + finish_wait(&t->poll_waitq, &wait); + } + + kthread_parkme(); + return 0; +} + +inline void nvmet_req_prep_poll(struct nvmet_req *req) +{ + if (!req->poll) + return; + + init_completion(&req->wait); + req->t = &req->ns->t[smp_processor_id()]; +} + +inline void nvmet_req_issue_poll(struct nvmet_req *req) +{ + if (!req->poll) + return; + + while (!mutex_trylock(&req->t->list_lock)) { + if (req->t->id < num_online_cpus()) + req->t = &req->ns->t[req->t->id + num_online_cpus()]; + else + req->t = &req->ns->t[req->t->id - num_online_cpus()]; + } + + if (completion_done(&req->wait)) + list_add(&req->poll_entry, &req->t->list); + else + list_add_tail(&req->poll_entry, &req->t->list); + mutex_unlock(&req->t->list_lock); + + if (wq_has_sleeper(&req->t->poll_waitq)) + wake_up(&req->t->poll_waitq); +} diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 46df45e837c9..ef2919e23e0b 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -49,11 +49,22 @@ #define IPO_IATTR_CONNECT_SQE(x) \ (cpu_to_le32(offsetof(struct nvmf_connect_command, x))) +struct nvmet_poll_data { + struct completion thread_complete; + 
wait_queue_head_t poll_waitq; + struct mutex list_lock; + struct task_struct *thread; + struct list_head list; + struct list_head done; + unsigned int id; +}; + struct nvmet_ns { struct list_head dev_link; struct percpu_ref ref; struct block_device *bdev; struct file *file; + struct nvmet_poll_data *t; bool readonly; u32 nsid; u32 blksize_shift; @@ -63,6 +74,8 @@ struct nvmet_ns { u32 anagrpid; bool buffered_io; + bool use_poll; + bool poll; bool enabled; struct nvmet_subsys *subsys; const char *device_path; @@ -292,9 +305,15 @@ struct nvmet_req { struct nvmet_ns *ns; struct scatterlist *sg; struct bio_vec inline_bvec[NVMET_MAX_INLINE_BIOVEC]; + struct completion wait; + bool poll; + struct nvmet_poll_data *t; + struct list_head poll_entry; union { struct { - struct bio inline_bio; + struct bio inline_bio; + blk_qc_t cookie; + struct bio *last_bio; } b; struct { bool mpool_alloc; @@ -442,6 +461,16 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys, void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type, u8 event_info, u8 log_page); +int nvmet_ns_start_poll(struct nvmet_ns *ns); +void nvmet_ns_stop_poll(struct nvmet_ns *ns); +void nvmet_req_prep_poll(struct nvmet_req *req); +void nvmet_req_issue_poll(struct nvmet_req *req); + +void nvmet_req_poll_complete(struct nvmet_req *req); +void nvmet_bdev_poll_complete(struct nvmet_req *req); +void nvmet_bdev_req_complete(struct nvmet_req *req); +void nvmet_req_done(struct nvmet_req *req); + #define NVMET_QUEUE_SIZE 1024 #define NVMET_NR_QUEUES 128 #define NVMET_MAX_CMD NVMET_QUEUE_SIZE -- 2.22.1 _______________________________________________ linux-nvme mailing list linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/2] nvmet: add bdev-ns polling support 2019-12-10 6:25 ` [RFC PATCH 1/2] nvmet: add bdev-ns " Chaitanya Kulkarni @ 2020-01-20 12:52 ` Max Gurtovoy 2020-01-21 19:22 ` Chaitanya Kulkarni 0 siblings, 1 reply; 12+ messages in thread From: Max Gurtovoy @ 2020-01-20 12:52 UTC (permalink / raw) To: Chaitanya Kulkarni, sagi, hch; +Cc: linux-nvme On 12/10/2019 8:25 AM, Chaitanya Kulkarni wrote: > This patch adds support for the bdev-ns polling. We also add a new > file to keep polling code separate (io-poll.c). The newly added > configfs attribute allows user to enable/disable polling. > > We only enable polling for the READ/WRITE operation based on support > from bdev and and use_poll configfs attr. > > We configure polling per namespace. For each namespace we create > polling threads. For general approach please have a look at the > cover-letter of this patch-series. It would be great to get some numbers here to learn about the trade-off for this approach. > > Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> > --- > drivers/nvme/target/Makefile | 3 +- > drivers/nvme/target/configfs.c | 29 ++++++ > drivers/nvme/target/core.c | 13 +++ > drivers/nvme/target/io-cmd-bdev.c | 61 +++++++++-- > drivers/nvme/target/io-poll.c | 165 ++++++++++++++++++++++++++++++ > drivers/nvme/target/nvmet.h | 31 +++++- > 6 files changed, 291 insertions(+), 11 deletions(-) > create mode 100644 drivers/nvme/target/io-poll.c > > diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile > index 2b33836f3d3e..8877ba16305c 100644 > --- a/drivers/nvme/target/Makefile > +++ b/drivers/nvme/target/Makefile > @@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP) += nvme-fcloop.o > obj-$(CONFIG_NVME_TARGET_TCP) += nvmet-tcp.o > > nvmet-y += core.o configfs.o admin-cmd.o fabrics-cmd.o \ > - discovery.o io-cmd-file.o io-cmd-bdev.o > + discovery.o io-cmd-file.o io-cmd-bdev.o \ > + io-poll.o > nvme-loop-y += loop.o > nvmet-rdma-y += rdma.o > nvmet-fc-y += fc.o > diff 
--git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c > index 98613a45bd3b..0e6e8b0dbf79 100644 > --- a/drivers/nvme/target/configfs.c > +++ b/drivers/nvme/target/configfs.c > @@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item, > > CONFIGFS_ATTR(nvmet_ns_, buffered_io); > > +static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page) > +{ > + return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll); > +} > + > +static ssize_t nvmet_ns_use_poll_store(struct config_item *item, > + const char *page, size_t count) > +{ > + struct nvmet_ns *ns = to_nvmet_ns(item); > + bool val; > + > + if (strtobool(page, &val)) > + return -EINVAL; > + > + mutex_lock(&ns->subsys->lock); > + if (ns->enabled) { > + pr_err("disable ns before setting use_poll value.\n"); > + mutex_unlock(&ns->subsys->lock); > + return -EINVAL; > + } > + > + ns->use_poll = val; > + mutex_unlock(&ns->subsys->lock); > + return count; > +} > + > +CONFIGFS_ATTR(nvmet_ns_, use_poll); > + > static struct configfs_attribute *nvmet_ns_attrs[] = { > &nvmet_ns_attr_device_path, > &nvmet_ns_attr_device_nguid, > @@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = { > &nvmet_ns_attr_ana_grpid, > &nvmet_ns_attr_enable, > &nvmet_ns_attr_buffered_io, > + &nvmet_ns_attr_use_poll, > #ifdef CONFIG_PCI_P2PDMA > &nvmet_ns_attr_p2pmem, > #endif > diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c > index 28438b833c1b..d8f9130d1cd1 100644 > --- a/drivers/nvme/target/core.c > +++ b/drivers/nvme/target/core.c > @@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl, > ns->nsid); > } > > +inline void nvmet_req_done(struct nvmet_req *req) > +{ > + if (req->ns->bdev) > + nvmet_bdev_req_complete(req); > +} > + > +inline void nvmet_req_poll_complete(struct nvmet_req *req) > +{ > + if (req->ns->bdev) > + nvmet_bdev_poll_complete(req); > +} > + > int nvmet_ns_enable(struct nvmet_ns *ns) > { > struct 
nvmet_subsys *subsys = ns->subsys; > @@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid) > > uuid_gen(&ns->uuid); > ns->buffered_io = false; > + ns->use_poll = false; > > return ns; > } > diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c > index b6fca0e421ef..13507e0cbc1d 100644 > --- a/drivers/nvme/target/io-cmd-bdev.c > +++ b/drivers/nvme/target/io-cmd-bdev.c > @@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id) > > int nvmet_bdev_ns_enable(struct nvmet_ns *ns) > { > + int fmode = FMODE_READ | FMODE_WRITE; > + struct request_queue *q; > int ret; > > - ns->bdev = blkdev_get_by_path(ns->device_path, > - FMODE_READ | FMODE_WRITE, NULL); > + ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL); > if (IS_ERR(ns->bdev)) { > ret = PTR_ERR(ns->bdev); > if (ret != -ENOTBLK) { > @@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns) > ns->device_path, PTR_ERR(ns->bdev)); > } > ns->bdev = NULL; > - return ret; > + goto out; > } > ns->size = i_size_read(ns->bdev->bd_inode); > ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev)); > - return 0; > + q = bdev_get_queue(ns->bdev); > + ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags); > + ret = ns->poll ? nvmet_ns_start_poll(ns) : 0; > +out: > + return ret; > } > > void nvmet_bdev_ns_disable(struct nvmet_ns *ns) > { > if (ns->bdev) { > + ns->poll ? nvmet_ns_stop_poll(ns) : 0; can you please use an "if" statement instead of the above convention ? 
> blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ); > ns->bdev = NULL; > } > @@ -133,15 +139,39 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts) > return status; > } > > -static void nvmet_bio_done(struct bio *bio) > +void nvmet_bdev_req_complete(struct nvmet_req *req) > { > - struct nvmet_req *req = bio->bi_private; > + struct bio *bio = req->b.last_bio; > > nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status)); > if (bio != &req->b.inline_bio) > bio_put(bio); > } > > +static void nvmet_bio_done(struct bio *bio) > +{ > + struct nvmet_req *req = bio->bi_private; > + > + req->b.last_bio = bio; > + > + req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req); Same here for the "if". Lets keep the code as readable as we can > +} > + > +void nvmet_bdev_poll_complete(struct nvmet_req *req) > +{ > + struct request_queue *q = bdev_get_queue(req->ns->bdev); > + > + while (!completion_done(&req->wait)) { > + int ret = blk_poll(q, req->b.cookie, true); > + > + if (ret < 0) { > + pr_err("tid %d poll error %d", req->t->id, ret); > + break; > + } > + } > + nvmet_bdev_req_complete(req); > +} > + > static void nvmet_bdev_execute_rw(struct nvmet_req *req) > { > int sg_cnt = req->sg_cnt; > @@ -185,7 +215,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) > bio->bi_end_io = nvmet_bio_done; > bio->bi_opf = op; > > - blk_start_plug(&plug); > + if (!req->poll) > + blk_start_plug(&plug); > for_each_sg(req->sg, sg, req->sg_cnt, i) { > while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset) > != sg->length) { > @@ -204,8 +235,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) > sg_cnt--; > } > > - submit_bio(bio); > - blk_finish_plug(&plug); > + req->b.last_bio = bio; > + if (req->poll) > + req->b.last_bio->bi_opf |= REQ_HIPRI; > + > + req->b.cookie = submit_bio(bio); > + > + nvmet_req_prep_poll(req); > + nvmet_req_issue_poll(req); > + if (!req->poll) > + blk_finish_plug(&plug); > } > > static void 
nvmet_bdev_execute_flush(struct nvmet_req *req) > @@ -330,15 +369,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req) > switch (cmd->common.opcode) { > case nvme_cmd_read: > case nvme_cmd_write: > + req->poll = req->ns->poll; > req->execute = nvmet_bdev_execute_rw; > return 0; > case nvme_cmd_flush: > + req->poll = false; should be done in some general place for req initialization. > req->execute = nvmet_bdev_execute_flush; > return 0; > case nvme_cmd_dsm: > + req->poll = false; > req->execute = nvmet_bdev_execute_dsm; > return 0; > case nvme_cmd_write_zeroes: > + req->poll = false; > req->execute = nvmet_bdev_execute_write_zeroes; > return 0; > default: > diff --git a/drivers/nvme/target/io-poll.c b/drivers/nvme/target/io-poll.c > new file mode 100644 > index 000000000000..175c939c22ff > --- /dev/null > +++ b/drivers/nvme/target/io-poll.c > @@ -0,0 +1,165 @@ > +#include <linux/blkdev.h> > +#include <linux/module.h> > +#include <linux/sched/signal.h> > +#include <linux/kthread.h> > +#include <uapi/linux/sched/types.h> > + > +#include "nvmet.h" > +#define THREAD_PER_CPU (num_online_cpus() * 2) > + > +static int nvmet_poll_thread(void *data); > + > +int nvmet_ns_start_poll(struct nvmet_ns *ns) > +{ > + struct nvmet_poll_data *t; > + int ret = 0; > + int i; > + > + t = kzalloc(sizeof(*t) * THREAD_PER_CPU, GFP_KERNEL); > + if (!t) { > + ret = -ENOMEM; > + goto out; > + } > + > + for (i = 0; i < THREAD_PER_CPU; i++) { > + init_completion(&t[i].thread_complete); > + init_waitqueue_head(&t[i].poll_waitq); > + INIT_LIST_HEAD(&t[i].list); > + INIT_LIST_HEAD(&t[i].done); > + mutex_init(&t[i].list_lock); > + t[i].id = i; > + > + t[i].thread = kthread_create(nvmet_poll_thread, &t[i], > + "nvmet_poll_thread%s/%d", > + ns->device_path, i); > + > + if (IS_ERR(t[i].thread)) { > + ret = PTR_ERR(t[i].thread); > + goto err; > + } > + > + kthread_bind(t[i].thread, i % num_online_cpus()); > + wake_up_process(t[i].thread); > + wait_for_completion(&t[i].thread_complete); > + } > + > 
+ ns->t = t; > +out: > + return ret; > +err: > + nvmet_ns_stop_poll(ns); > + goto out; > +} > + > +void nvmet_ns_stop_poll(struct nvmet_ns *ns) > +{ > + struct nvmet_poll_data *t = ns->t; > + int i; > + > + for (i = 0; i < THREAD_PER_CPU; i++) { > + if (!t[i].thread) > + continue; > + > + if (wq_has_sleeper(&t[i].poll_waitq)) > + wake_up(&t[i].poll_waitq); > + kthread_park(t[i].thread); > + kthread_stop(t[i].thread); > + } > +} > + > +static void nvmet_poll(struct nvmet_poll_data *t) > +{ > + struct nvmet_req *req, *tmp; > + > + lockdep_assert_held(&t->list_lock); > + > + list_for_each_entry_safe(req, tmp, &t->list, poll_entry) { > + > + if (completion_done(&req->wait)) { > + list_move_tail(&req->poll_entry, &t->done); > + continue; > + } > + > + if (!list_empty(&t->done)) > + break; > + > + list_del(&req->poll_entry); > + mutex_unlock(&t->list_lock); > + nvmet_req_poll_complete(req); > + mutex_lock(&t->list_lock); > + } > + mutex_unlock(&t->list_lock); > + /* > + * In future we can also add batch timeout or nr request to complete. > + */ > + while (!list_empty(&t->done)) { > + /* > + * We lock and unlock for t->list which gurantee progress of > + * nvmet_xxx_rw_execute() when under pressure while we complete > + * the request. 
> + */ > + req = list_first_entry(&t->done, struct nvmet_req, poll_entry); > + list_del(&req->poll_entry); > + nvmet_req_done(req); > + } > + > + mutex_lock(&t->list_lock); > +} > + > +static int nvmet_poll_thread(void *data) > +{ > + struct nvmet_poll_data *t = (struct nvmet_poll_data *) data; > + DEFINE_WAIT(wait); > + > + complete(&t->thread_complete); > + > + while (!kthread_should_park()) { > + > + mutex_lock(&t->list_lock); > + while (!list_empty(&t->list) && !need_resched()) > + nvmet_poll(t); > + mutex_unlock(&t->list_lock); > + > + prepare_to_wait(&t->poll_waitq, &wait, TASK_INTERRUPTIBLE); > + if (signal_pending(current)) > + flush_signals(current); > + smp_mb(); > + schedule(); > + > + finish_wait(&t->poll_waitq, &wait); > + } > + > + kthread_parkme(); > + return 0; > +} > + > +inline void nvmet_req_prep_poll(struct nvmet_req *req) > +{ > + if (!req->poll) > + return; > + > + init_completion(&req->wait); > + req->t = &req->ns->t[smp_processor_id()]; > +} > + > +inline void nvmet_req_issue_poll(struct nvmet_req *req) > +{ > + if (!req->poll) > + return; > + > + while (!mutex_trylock(&req->t->list_lock)) { > + if (req->t->id < num_online_cpus()) > + req->t = &req->ns->t[req->t->id + num_online_cpus()]; > + else > + req->t = &req->ns->t[req->t->id - num_online_cpus()]; > + } > + > + if (completion_done(&req->wait)) > + list_add(&req->poll_entry, &req->t->list); > + else > + list_add_tail(&req->poll_entry, &req->t->list); > + mutex_unlock(&req->t->list_lock); > + > + if (wq_has_sleeper(&req->t->poll_waitq)) > + wake_up(&req->t->poll_waitq); > +} > diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h > index 46df45e837c9..ef2919e23e0b 100644 > --- a/drivers/nvme/target/nvmet.h > +++ b/drivers/nvme/target/nvmet.h > @@ -49,11 +49,22 @@ > #define IPO_IATTR_CONNECT_SQE(x) \ > (cpu_to_le32(offsetof(struct nvmf_connect_command, x))) > > +struct nvmet_poll_data { > + struct completion thread_complete; > + wait_queue_head_t poll_waitq; > + struct 
mutex list_lock; > + struct task_struct *thread; > + struct list_head list; > + struct list_head done; > + unsigned int id; > +}; > + > struct nvmet_ns { > struct list_head dev_link; > struct percpu_ref ref; > struct block_device *bdev; > struct file *file; > + struct nvmet_poll_data *t; > bool readonly; > u32 nsid; > u32 blksize_shift; > @@ -63,6 +74,8 @@ struct nvmet_ns { > u32 anagrpid; > > bool buffered_io; > + bool use_poll; > + bool poll; > bool enabled; > struct nvmet_subsys *subsys; > const char *device_path; > @@ -292,9 +305,15 @@ struct nvmet_req { > struct nvmet_ns *ns; > struct scatterlist *sg; > struct bio_vec inline_bvec[NVMET_MAX_INLINE_BIOVEC]; > + struct completion wait; > + bool poll; > + struct nvmet_poll_data *t; > + struct list_head poll_entry; > union { > struct { > - struct bio inline_bio; > + struct bio inline_bio; > + blk_qc_t cookie; > + struct bio *last_bio; > } b; > struct { > bool mpool_alloc; > @@ -442,6 +461,16 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys, > void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type, > u8 event_info, u8 log_page); > > +int nvmet_ns_start_poll(struct nvmet_ns *ns); > +void nvmet_ns_stop_poll(struct nvmet_ns *ns); > +void nvmet_req_prep_poll(struct nvmet_req *req); > +void nvmet_req_issue_poll(struct nvmet_req *req); > + > +void nvmet_req_poll_complete(struct nvmet_req *req); > +void nvmet_bdev_poll_complete(struct nvmet_req *req); > +void nvmet_bdev_req_complete(struct nvmet_req *req); > +void nvmet_req_done(struct nvmet_req *req); > + > #define NVMET_QUEUE_SIZE 1024 > #define NVMET_NR_QUEUES 128 > #define NVMET_MAX_CMD NVMET_QUEUE_SIZE _______________________________________________ linux-nvme mailing list linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/2] nvmet: add bdev-ns polling support 2020-01-20 12:52 ` Max Gurtovoy @ 2020-01-21 19:22 ` Chaitanya Kulkarni 2020-01-23 14:23 ` Max Gurtovoy 0 siblings, 1 reply; 12+ messages in thread From: Chaitanya Kulkarni @ 2020-01-21 19:22 UTC (permalink / raw) To: Max Gurtovoy, sagi, hch; +Cc: linux-nvme Please see my in-line comments. On 01/20/2020 04:52 AM, Max Gurtovoy wrote: > > On 12/10/2019 8:25 AM, Chaitanya Kulkarni wrote: >> This patch adds support for the bdev-ns polling. We also add a new >> file to keep polling code separate (io-poll.c). The newly added >> configfs attribute allows user to enable/disable polling. >> >> We only enable polling for the READ/WRITE operation based on support >> from bdev and and use_poll configfs attr. >> >> We configure polling per namespace. For each namespace we create >> polling threads. For general approach please have a look at the >> cover-letter of this patch-series. > > It would be great to get some numbers here to learn about the trade-off > for this approach. 
>
>
I already posted these numbers with QEMU nvme-loop here:
http://lists.infradead.org/pipermail/linux-nvme/2020-January/028692.html

IOPS/BW:-

Default:-
read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec)
read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec)
read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec)
Poll:-
read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec)
read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec)
read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec)

mpstat:-

Default:-
CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
---------------------------------------------------------
all 1.18 0.00 |60.14| 0.00 0.00 0.00 0.08 0.00 0.00 38.60
---------------------------------------------------------
0 0.00 0.00 | 32.00 |0.00 0.00 0.00 0.00 0.00 0.00 68.00
1 1.01 0.00 | 42.42 |0.00 0.00 0.00 0.00 0.00 0.00 56.57
2 1.01 0.00 | 57.58 |0.00 0.00 0.00 0.00 0.00 0.00 41.41
3 2.02 0.00 | 79.80 |0.00 0.00 0.00 0.00 0.00 0.00 18.18
4 1.01 0.00 | 63.64 |0.00 0.00 0.00 0.00 0.00 0.00 35.35
5 3.09 0.00 | 63.92 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
6 2.02 0.00 | 75.76 |0.00 0.00 0.00 0.00 0.00 0.00 22.22
7 1.02 0.00 | 57.14 |0.00 0.00 0.00 0.00 0.00 0.00 41.84
8 0.00 0.00 | 67.01 |0.00 0.00 0.00 0.00 0.00 0.00 32.99
9 1.01 0.00 | 59.60 |0.00 0.00 0.00 0.00 0.00 0.00 39.39
10 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73
11 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73

Poll:-
CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle
---------------------------------------------------------
all 1.08 0.00 98.08 0.00 0.00 0.00 0.08 0.00 0.00 0.75
---------------------------------------------------------
0 2.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 1.00
1 0.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 3.00
2 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
3 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 1.01 0.00 | 98.99 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
6 0.99 0.00 | 99.01 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
7 1.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 2.00
8 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
9 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
10 1.98 0.00| 94.06 |0.00 0.00 0.00 0.00 0.00 0.00 3.96
11 1.00 0.00| 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00
>> >> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> >> --- >> drivers/nvme/target/Makefile | 3 +- >> drivers/nvme/target/configfs.c | 29 ++++++ >> drivers/nvme/target/core.c | 13 +++ >> drivers/nvme/target/io-cmd-bdev.c | 61 +++++++++-- >> drivers/nvme/target/io-poll.c | 165 ++++++++++++++++++++++++++++++ >> drivers/nvme/target/nvmet.h | 31 +++++- >> 6 files changed, 291 insertions(+), 11 deletions(-) >> create mode 100644 drivers/nvme/target/io-poll.c >> >> diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile >> index 2b33836f3d3e..8877ba16305c 100644 >> --- a/drivers/nvme/target/Makefile >> +++ b/drivers/nvme/target/Makefile >> @@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP) += nvme-fcloop.o >> obj-$(CONFIG_NVME_TARGET_TCP) += nvmet-tcp.o >> >> nvmet-y += core.o configfs.o admin-cmd.o fabrics-cmd.o \ >> - discovery.o io-cmd-file.o io-cmd-bdev.o >> + discovery.o io-cmd-file.o io-cmd-bdev.o \ >> + io-poll.o >> nvme-loop-y += loop.o >> nvmet-rdma-y += rdma.o >> nvmet-fc-y += fc.o >> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c >> index 98613a45bd3b..0e6e8b0dbf79 100644 >> --- a/drivers/nvme/target/configfs.c >> +++ b/drivers/nvme/target/configfs.c >> @@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item, >> >> CONFIGFS_ATTR(nvmet_ns_, buffered_io); >> >> +static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page) >> +{ >> + return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll); >> +} >> + >> +static ssize_t nvmet_ns_use_poll_store(struct config_item *item, >> + const
char *page, size_t count) >> +{ >> + struct nvmet_ns *ns = to_nvmet_ns(item); >> + bool val; >> + >> + if (strtobool(page, &val)) >> + return -EINVAL; >> + >> + mutex_lock(&ns->subsys->lock); >> + if (ns->enabled) { >> + pr_err("disable ns before setting use_poll value.\n"); >> + mutex_unlock(&ns->subsys->lock); >> + return -EINVAL; >> + } >> + >> + ns->use_poll = val; >> + mutex_unlock(&ns->subsys->lock); >> + return count; >> +} >> + >> +CONFIGFS_ATTR(nvmet_ns_, use_poll); >> + >> static struct configfs_attribute *nvmet_ns_attrs[] = { >> &nvmet_ns_attr_device_path, >> &nvmet_ns_attr_device_nguid, >> @@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = { >> &nvmet_ns_attr_ana_grpid, >> &nvmet_ns_attr_enable, >> &nvmet_ns_attr_buffered_io, >> + &nvmet_ns_attr_use_poll, >> #ifdef CONFIG_PCI_P2PDMA >> &nvmet_ns_attr_p2pmem, >> #endif >> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c >> index 28438b833c1b..d8f9130d1cd1 100644 >> --- a/drivers/nvme/target/core.c >> +++ b/drivers/nvme/target/core.c >> @@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl, >> ns->nsid); >> } >> >> +inline void nvmet_req_done(struct nvmet_req *req) >> +{ >> + if (req->ns->bdev) >> + nvmet_bdev_req_complete(req); >> +} >> + >> +inline void nvmet_req_poll_complete(struct nvmet_req *req) >> +{ >> + if (req->ns->bdev) >> + nvmet_bdev_poll_complete(req); >> +} >> + >> int nvmet_ns_enable(struct nvmet_ns *ns) >> { >> struct nvmet_subsys *subsys = ns->subsys; >> @@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid) >> >> uuid_gen(&ns->uuid); >> ns->buffered_io = false; >> + ns->use_poll = false; >> >> return ns; >> } >> diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c >> index b6fca0e421ef..13507e0cbc1d 100644 >> --- a/drivers/nvme/target/io-cmd-bdev.c >> +++ b/drivers/nvme/target/io-cmd-bdev.c >> @@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct 
block_device *bdev, struct nvme_id_ns *id) >> >> int nvmet_bdev_ns_enable(struct nvmet_ns *ns) >> { >> + int fmode = FMODE_READ | FMODE_WRITE; >> + struct request_queue *q; >> int ret; >> >> - ns->bdev = blkdev_get_by_path(ns->device_path, >> - FMODE_READ | FMODE_WRITE, NULL); >> + ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL); >> if (IS_ERR(ns->bdev)) { >> ret = PTR_ERR(ns->bdev); >> if (ret != -ENOTBLK) { >> @@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns) >> ns->device_path, PTR_ERR(ns->bdev)); >> } >> ns->bdev = NULL; >> - return ret; >> + goto out; >> } >> ns->size = i_size_read(ns->bdev->bd_inode); >> ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev)); >> - return 0; >> + q = bdev_get_queue(ns->bdev); >> + ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags); >> + ret = ns->poll ? nvmet_ns_start_poll(ns) : 0; >> +out: >> + return ret; >> } >> >> void nvmet_bdev_ns_disable(struct nvmet_ns *ns) >> { >> if (ns->bdev) { >> + ns->poll ? nvmet_ns_stop_poll(ns) : 0; > > can you please use an "if" statement instead of the above convention ? > This is just as RFC, will send out formal patch series with all the coding style fixes. > >> blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ); >> ns->bdev = NULL; >> } >> @@ -133,15 +139,39 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts) >> return status; >> } >> >> -static void nvmet_bio_done(struct bio *bio) >> +void nvmet_bdev_req_complete(struct nvmet_req *req) >> { >> - struct nvmet_req *req = bio->bi_private; >> + struct bio *bio = req->b.last_bio; >> >> nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status)); >> if (bio != &req->b.inline_bio) >> bio_put(bio); >> } >> >> +static void nvmet_bio_done(struct bio *bio) >> +{ >> + struct nvmet_req *req = bio->bi_private; >> + >> + req->b.last_bio = bio; >> + >> + req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req); > > Same here for the "if". 
Let's keep the code as readable as we can > Same as above. > >> +} >> + >> +void nvmet_bdev_poll_complete(struct nvmet_req *req) >> +{ >> + struct request_queue *q = bdev_get_queue(req->ns->bdev); >> + >> + while (!completion_done(&req->wait)) { >> + int ret = blk_poll(q, req->b.cookie, true); >> + >> + if (ret < 0) { >> + pr_err("tid %d poll error %d", req->t->id, ret); >> + break; >> + } >> + } >> + nvmet_bdev_req_complete(req); >> +} >> static void nvmet_bdev_execute_rw(struct nvmet_req *req) >> { >> int sg_cnt = req->sg_cnt; >> @@ -185,7 +215,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) >> bio->bi_end_io = nvmet_bio_done; >> bio->bi_opf = op; >> >> - blk_start_plug(&plug); >> + if (!req->poll) >> + blk_start_plug(&plug); >> for_each_sg(req->sg, sg, req->sg_cnt, i) { >> while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset) >> != sg->length) { >> @@ -204,8 +235,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) >> sg_cnt--; >> } >> >> - submit_bio(bio); >> - blk_finish_plug(&plug); >> + req->b.last_bio = bio; >> + if (req->poll) >> + req->b.last_bio->bi_opf |= REQ_HIPRI; >> + >> + req->b.cookie = submit_bio(bio); >> + >> + nvmet_req_prep_poll(req); >> + nvmet_req_issue_poll(req); >> + if (!req->poll) >> + blk_finish_plug(&plug); >> } >> >> static void nvmet_bdev_execute_flush(struct nvmet_req *req) >> @@ -330,15 +369,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req) >> switch (cmd->common.opcode) { >> case nvme_cmd_read: >> case nvme_cmd_write: >> + req->poll = req->ns->poll; >> req->execute = nvmet_bdev_execute_rw; >> return 0; >> case nvme_cmd_flush: >> + req->poll = false; > > should be done in some general place for req
initialization. > > This needs to be done here for better readability, since it depends on two factors: backend support and the user setting. >> req->execute = nvmet_bdev_execute_flush; >> return 0; >> case nvme_cmd_dsm: >> + req->poll = false; >> req->execute = nvmet_bdev_execute_dsm; >> return 0; >> case nvme_cmd_write_zeroes: >> + req->poll = false; >> req->execute = nvmet_bdev_execute_write_zeroes; >> return 0; >> default: >> diff --git a/drivers/nvme/target/io-poll.c b/drivers/nvme/target/io-poll.c >> new file mode 100644 >> index 000000000000..175c939c22ff >> --- /dev/null >> +++ b/drivers/nvme/target/io-poll.c >> @@ -0,0 +1,165 @@ >> +#include <linux/blkdev.h> >> +#include <linux/module.h> >> +#include <linux/sched/signal.h> >> +#include <linux/kthread.h> >> +#include <uapi/linux/sched/types.h> >> + >> +#include "nvmet.h" >> +#define THREAD_PER_CPU (num_online_cpus() * 2) >> + >> +static int nvmet_poll_thread(void *data); >> + >> +int nvmet_ns_start_poll(struct nvmet_ns *ns) >> +{ >> + struct nvmet_poll_data *t; >> + int ret = 0; >> + int i; >> + >> + t = kzalloc(sizeof(*t) * THREAD_PER_CPU, GFP_KERNEL); >> + if (!t) { >> + ret = -ENOMEM; >> + goto out; >> + } >> + >> + for (i = 0; i < THREAD_PER_CPU; i++) { >> + init_completion(&t[i].thread_complete); >> + init_waitqueue_head(&t[i].poll_waitq); >> + INIT_LIST_HEAD(&t[i].list); >> + INIT_LIST_HEAD(&t[i].done); >> + mutex_init(&t[i].list_lock); >> + t[i].id = i; >> + >> + t[i].thread = kthread_create(nvmet_poll_thread, &t[i], >> + "nvmet_poll_thread%s/%d", >> + ns->device_path, i); >> + >> + if (IS_ERR(t[i].thread)) { >> + ret = PTR_ERR(t[i].thread); >> + goto err; >> + } >> + >> + kthread_bind(t[i].thread, i % num_online_cpus()); >> + wake_up_process(t[i].thread); >> + wait_for_completion(&t[i].thread_complete); >> + } >> + >> + ns->t = t; >> +out: >> + return ret; >> +err: >> + nvmet_ns_stop_poll(ns); >> + goto out; >> +} >> + >> +void nvmet_ns_stop_poll(struct nvmet_ns *ns) >> +{ >> + struct
nvmet_poll_data *t = ns->t; >> + int i; >> + >> + for (i = 0; i < THREAD_PER_CPU; i++) { >> + if (!t[i].thread) >> + continue; >> + >> + if (wq_has_sleeper(&t[i].poll_waitq)) >> + wake_up(&t[i].poll_waitq); >> + kthread_park(t[i].thread); >> + kthread_stop(t[i].thread); >> + } >> +} >> + >> +static void nvmet_poll(struct nvmet_poll_data *t) >> +{ >> + struct nvmet_req *req, *tmp; >> + >> + lockdep_assert_held(&t->list_lock); >> + >> + list_for_each_entry_safe(req, tmp, &t->list, poll_entry) { >> + >> + if (completion_done(&req->wait)) { >> + list_move_tail(&req->poll_entry, &t->done); >> + continue; >> + } >> + >> + if (!list_empty(&t->done)) >> + break; >> + >> + list_del(&req->poll_entry); >> + mutex_unlock(&t->list_lock); >> + nvmet_req_poll_complete(req); >> + mutex_lock(&t->list_lock); >> + } >> + mutex_unlock(&t->list_lock); >> + /* >> + * In future we can also add batch timeout or nr request to complete. >> + */ >> + while (!list_empty(&t->done)) { >> + /* >> + * We lock and unlock for t->list which guarantees progress of >> + * nvmet_xxx_rw_execute() when under pressure while we complete >> + * the request.
>> + */ >> + req = list_first_entry(&t->done, struct nvmet_req, poll_entry); >> + list_del(&req->poll_entry); >> + nvmet_req_done(req); >> + } >> + >> + mutex_lock(&t->list_lock); >> +} >> + >> +static int nvmet_poll_thread(void *data) >> +{ >> + struct nvmet_poll_data *t = (struct nvmet_poll_data *) data; >> + DEFINE_WAIT(wait); >> + >> + complete(&t->thread_complete); >> + >> + while (!kthread_should_park()) { >> + >> + mutex_lock(&t->list_lock); >> + while (!list_empty(&t->list) && !need_resched()) >> + nvmet_poll(t); >> + mutex_unlock(&t->list_lock); >> + >> + prepare_to_wait(&t->poll_waitq, &wait, TASK_INTERRUPTIBLE); >> + if (signal_pending(current)) >> + flush_signals(current); >> + smp_mb(); >> + schedule(); >> + >> + finish_wait(&t->poll_waitq, &wait); >> + } >> + >> + kthread_parkme(); >> + return 0; >> +} >> + >> +inline void nvmet_req_prep_poll(struct nvmet_req *req) >> +{ >> + if (!req->poll) >> + return; >> + >> + init_completion(&req->wait); >> + req->t = &req->ns->t[smp_processor_id()]; >> +} >> + >> +inline void nvmet_req_issue_poll(struct nvmet_req *req) >> +{ >> + if (!req->poll) >> + return; >> + >> + while (!mutex_trylock(&req->t->list_lock)) { >> + if (req->t->id < num_online_cpus()) >> + req->t = &req->ns->t[req->t->id + num_online_cpus()]; >> + else >> + req->t = &req->ns->t[req->t->id - num_online_cpus()]; >> + } >> + >> + if (completion_done(&req->wait)) >> + list_add(&req->poll_entry, &req->t->list); >> + else >> + list_add_tail(&req->poll_entry, &req->t->list); >> + mutex_unlock(&req->t->list_lock); >> + >> + if (wq_has_sleeper(&req->t->poll_waitq)) >> + wake_up(&req->t->poll_waitq); >> +} >> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h >> index 46df45e837c9..ef2919e23e0b 100644 >> --- a/drivers/nvme/target/nvmet.h >> +++ b/drivers/nvme/target/nvmet.h >> @@ -49,11 +49,22 @@ >> #define IPO_IATTR_CONNECT_SQE(x) \ >> (cpu_to_le32(offsetof(struct nvmf_connect_command, x))) >> >> +struct nvmet_poll_data { >> + struct 
completion thread_complete; >> + wait_queue_head_t poll_waitq; >> + struct mutex list_lock; >> + struct task_struct *thread; >> + struct list_head list; >> + struct list_head done; >> + unsigned int id; >> +}; >> + >> struct nvmet_ns { >> struct list_head dev_link; >> struct percpu_ref ref; >> struct block_device *bdev; >> struct file *file; >> + struct nvmet_poll_data *t; >> bool readonly; >> u32 nsid; >> u32 blksize_shift; >> @@ -63,6 +74,8 @@ struct nvmet_ns { >> u32 anagrpid; >> >> bool buffered_io; >> + bool use_poll; >> + bool poll; >> bool enabled; >> struct nvmet_subsys *subsys; >> const char *device_path; >> @@ -292,9 +305,15 @@ struct nvmet_req { >> struct nvmet_ns *ns; >> struct scatterlist *sg; >> struct bio_vec inline_bvec[NVMET_MAX_INLINE_BIOVEC]; >> + struct completion wait; >> + bool poll; >> + struct nvmet_poll_data *t; >> + struct list_head poll_entry; >> union { >> struct { >> - struct bio inline_bio; >> + struct bio inline_bio; >> + blk_qc_t cookie; >> + struct bio *last_bio; >> } b; >> struct { >> bool mpool_alloc; >> @@ -442,6 +461,16 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys, >> void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type, >> u8 event_info, u8 log_page); >> >> +int nvmet_ns_start_poll(struct nvmet_ns *ns); >> +void nvmet_ns_stop_poll(struct nvmet_ns *ns); >> +void nvmet_req_prep_poll(struct nvmet_req *req); >> +void nvmet_req_issue_poll(struct nvmet_req *req); >> + >> +void nvmet_req_poll_complete(struct nvmet_req *req); >> +void nvmet_bdev_poll_complete(struct nvmet_req *req); >> +void nvmet_bdev_req_complete(struct nvmet_req *req); >> +void nvmet_req_done(struct nvmet_req *req); >> + >> #define NVMET_QUEUE_SIZE 1024 >> #define NVMET_NR_QUEUES 128 >> #define NVMET_MAX_CMD NVMET_QUEUE_SIZE > _______________________________________________ linux-nvme mailing list linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 12+ messages in 
thread
* Re: [RFC PATCH 1/2] nvmet: add bdev-ns polling support 2020-01-21 19:22 ` Chaitanya Kulkarni @ 2020-01-23 14:23 ` Max Gurtovoy 2020-01-30 18:19 ` Chaitanya Kulkarni 0 siblings, 1 reply; 12+ messages in thread From: Max Gurtovoy @ 2020-01-23 14:23 UTC (permalink / raw) To: Chaitanya Kulkarni, sagi, hch; +Cc: linux-nvme On 1/21/2020 9:22 PM, Chaitanya Kulkarni wrote: > Please see my in-line comments. > On 01/20/2020 04:52 AM, Max Gurtovoy wrote: >> On 12/10/2019 8:25 AM, Chaitanya Kulkarni wrote: >>> This patch adds support for the bdev-ns polling. We also add a new >>> file to keep polling code separate (io-poll.c). The newly added >>> configfs attribute allows user to enable/disable polling. >>> >>> We only enable polling for the READ/WRITE operation based on support >>> from bdev and use_poll configfs attr. >>> >>> We configure polling per namespace. For each namespace we create >>> polling threads. For general approach please have a look at the >>> cover-letter of this patch-series. >> It would be great to get some numbers here to learn about the trade-off >> for this approach.
>> >> > Already posted these numbers with QEMU nvme-loop here :- > http://lists.infradead.org/pipermail/linux-nvme/2020-January/028692.html > > IOPS/BW:- > > Default:- > read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec) > read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec) > read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec) > Poll:- > read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec) > read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec) > read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec) > > mpstat:- > Default:- > CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle > --------------------------------------------------------- > all 1.18 0.00 |60.14| 0.00 0.00 0.00 0.08 0.00 0.00 38.60 > --------------------------------------------------------- > 0 0.00 0.00 | 32.00 |0.00 0.00 0.00 0.00 0.00 0.00 68.00 > 1 1.01 0.00 | 42.42 |0.00 0.00 0.00 0.00 0.00 0.00 56.57 > 2 1.01 0.00 | 57.58 |0.00 0.00 0.00 0.00 0.00 0.00 41.41 > 3 2.02 0.00 | 79.80 |0.00 0.00 0.00 0.00 0.00 0.00 18.18 > 4 1.01 0.00 | 63.64 |0.00 0.00 0.00 0.00 0.00 0.00 35.35 > 5 3.09 0.00 | 63.92 |0.00 0.00 0.00 0.00 0.00 0.00 32.99 > 6 2.02 0.00 | 75.76 |0.00 0.00 0.00 0.00 0.00 0.00 22.22 > 7 1.02 0.00 | 57.14 |0.00 0.00 0.00 0.00 0.00 0.00 41.84 > 8 0.00 0.00 | 67.01 |0.00 0.00 0.00 0.00 0.00 0.00 32.99 > 9 1.01 0.00 | 59.60 |0.00 0.00 0.00 0.00 0.00 0.00 39.39 > 10 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73 > 11 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73 > > Poll:- > CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle > --------------------------------------------------------- > all 1.08 0.00 98.08 0.00 0.00 0.00 0.08 0.00 0.00 0.75 > 
--------------------------------------------------------- > 0 2.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 1.00 > 1 0.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 3.00 > 2 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 3 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 4 1.01 0.00 | 98.99 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 5 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 6 0.99 0.00 | 99.01 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 7 1.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 2.00 > 8 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 9 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 10 1.98 0.00| 94.06 |0.00 0.00 0.00 0.00 0.00 0.00 3.96 > 11 1.00 0.00| 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 with all the respect to nvmet-loop, it isn't a real use case, right ? Can you re-run it using RDMA and TCP ? what is the backing device for these tests ? NVMe/NULL_BLK ? >>> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> >>> --- >>> drivers/nvme/target/Makefile | 3 +- >>> drivers/nvme/target/configfs.c | 29 ++++++ >>> drivers/nvme/target/core.c | 13 +++ >>> drivers/nvme/target/io-cmd-bdev.c | 61 +++++++++-- >>> drivers/nvme/target/io-poll.c | 165 ++++++++++++++++++++++++++++++ >>> drivers/nvme/target/nvmet.h | 31 +++++- >>> 6 files changed, 291 insertions(+), 11 deletions(-) >>> create mode 100644 drivers/nvme/target/io-poll.c >>> >>> diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile >>> index 2b33836f3d3e..8877ba16305c 100644 >>> --- a/drivers/nvme/target/Makefile >>> +++ b/drivers/nvme/target/Makefile >>> @@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP) += nvme-fcloop.o >>> obj-$(CONFIG_NVME_TARGET_TCP) += nvmet-tcp.o >>> >>> nvmet-y += core.o configfs.o admin-cmd.o fabrics-cmd.o \ >>> - discovery.o io-cmd-file.o io-cmd-bdev.o >>> + discovery.o io-cmd-file.o io-cmd-bdev.o \ >>> + io-poll.o >>> nvme-loop-y += loop.o >>> nvmet-rdma-y += rdma.o >>> nvmet-fc-y += fc.o >>> diff --git 
a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c >>> index 98613a45bd3b..0e6e8b0dbf79 100644 >>> --- a/drivers/nvme/target/configfs.c >>> +++ b/drivers/nvme/target/configfs.c >>> @@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item, >>> >>> CONFIGFS_ATTR(nvmet_ns_, buffered_io); >>> >>> +static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page) >>> +{ >>> + return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll); >>> +} >>> + >>> +static ssize_t nvmet_ns_use_poll_store(struct config_item *item, >>> + const char *page, size_t count) >>> +{ >>> + struct nvmet_ns *ns = to_nvmet_ns(item); >>> + bool val; >>> + >>> + if (strtobool(page, &val)) >>> + return -EINVAL; >>> + >>> + mutex_lock(&ns->subsys->lock); >>> + if (ns->enabled) { >>> + pr_err("disable ns before setting use_poll value.\n"); >>> + mutex_unlock(&ns->subsys->lock); >>> + return -EINVAL; >>> + } >>> + >>> + ns->use_poll = val; >>> + mutex_unlock(&ns->subsys->lock); >>> + return count; >>> +} >>> + >>> +CONFIGFS_ATTR(nvmet_ns_, use_poll); >>> + >>> static struct configfs_attribute *nvmet_ns_attrs[] = { >>> &nvmet_ns_attr_device_path, >>> &nvmet_ns_attr_device_nguid, >>> @@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = { >>> &nvmet_ns_attr_ana_grpid, >>> &nvmet_ns_attr_enable, >>> &nvmet_ns_attr_buffered_io, >>> + &nvmet_ns_attr_use_poll, >>> #ifdef CONFIG_PCI_P2PDMA >>> &nvmet_ns_attr_p2pmem, >>> #endif >>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c >>> index 28438b833c1b..d8f9130d1cd1 100644 >>> --- a/drivers/nvme/target/core.c >>> +++ b/drivers/nvme/target/core.c >>> @@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl, >>> ns->nsid); >>> } >>> >>> +inline void nvmet_req_done(struct nvmet_req *req) >>> +{ >>> + if (req->ns->bdev) >>> + nvmet_bdev_req_complete(req); >>> +} >>> + >>> +inline void nvmet_req_poll_complete(struct nvmet_req *req) >>> +{ >>> + if 
(req->ns->bdev) >>> + nvmet_bdev_poll_complete(req); >>> +} >>> + >>> int nvmet_ns_enable(struct nvmet_ns *ns) >>> { >>> struct nvmet_subsys *subsys = ns->subsys; >>> @@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid) >>> >>> uuid_gen(&ns->uuid); >>> ns->buffered_io = false; >>> + ns->use_poll = false; >>> >>> return ns; >>> } >>> diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c >>> index b6fca0e421ef..13507e0cbc1d 100644 >>> --- a/drivers/nvme/target/io-cmd-bdev.c >>> +++ b/drivers/nvme/target/io-cmd-bdev.c >>> @@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id) >>> >>> int nvmet_bdev_ns_enable(struct nvmet_ns *ns) >>> { >>> + int fmode = FMODE_READ | FMODE_WRITE; >>> + struct request_queue *q; >>> int ret; >>> >>> - ns->bdev = blkdev_get_by_path(ns->device_path, >>> - FMODE_READ | FMODE_WRITE, NULL); >>> + ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL); >>> if (IS_ERR(ns->bdev)) { >>> ret = PTR_ERR(ns->bdev); >>> if (ret != -ENOTBLK) { >>> @@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns) >>> ns->device_path, PTR_ERR(ns->bdev)); >>> } >>> ns->bdev = NULL; >>> - return ret; >>> + goto out; >>> } >>> ns->size = i_size_read(ns->bdev->bd_inode); >>> ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev)); >>> - return 0; >>> + q = bdev_get_queue(ns->bdev); >>> + ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags); >>> + ret = ns->poll ? nvmet_ns_start_poll(ns) : 0; >>> +out: >>> + return ret; >>> } >>> >>> void nvmet_bdev_ns_disable(struct nvmet_ns *ns) >>> { >>> if (ns->bdev) { >>> + ns->poll ? nvmet_ns_stop_poll(ns) : 0; >> can you please use an "if" statement instead of the above convention ? >> > This is just as RFC, will send out formal patch series with all the > coding style fixes. 
> >>> blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ); >>> ns->bdev = NULL; >>> } >>> @@ -133,15 +139,39 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts) >>> return status; >>> } >>> >>> -static void nvmet_bio_done(struct bio *bio) >>> +void nvmet_bdev_req_complete(struct nvmet_req *req) >>> { >>> - struct nvmet_req *req = bio->bi_private; >>> + struct bio *bio = req->b.last_bio; >>> >>> nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status)); >>> if (bio != &req->b.inline_bio) >>> bio_put(bio); >>> } >>> >>> +static void nvmet_bio_done(struct bio *bio) >>> +{ >>> + struct nvmet_req *req = bio->bi_private; >>> + >>> + req->b.last_bio = bio; >>> + >>> + req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req); >> Same here for the "if". Lets keep the code as readable as we can >> > Same as above. >>> +} >>> + >>> +void nvmet_bdev_poll_complete(struct nvmet_req *req) >>> +{ >>> + struct request_queue *q = bdev_get_queue(req->ns->bdev); >>> + >>> + while (!completion_done(&req->wait)) { >>> + int ret = blk_poll(q, req->b.cookie, true); >>> + >>> + if (ret < 0) { >>> + pr_err("tid %d poll error %d", req->t->id, ret); >>> + break; >>> + } >>> + } >>> + nvmet_bdev_req_complete(req); >>> +} >>> +IOPS/BW:- > Default:- > read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec) > read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec) > read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec) > Poll:- > read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec) > read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec) > read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec) > > mpstat:- > Default:- > CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle > --------------------------------------------------------- > all 1.18 0.00 |60.14| 0.00 0.00 0.00 0.08 0.00 0.00 38.60 > --------------------------------------------------------- > 0 0.00 0.00 | 32.00 |0.00 0.00 0.00 0.00 0.00 0.00 68.00 > 1 1.01 0.00 | 
42.42 |0.00 0.00 0.00 0.00 0.00 0.00 56.57 > 2 1.01 0.00 | 57.58 |0.00 0.00 0.00 0.00 0.00 0.00 41.41 > 3 2.02 0.00 | 79.80 |0.00 0.00 0.00 0.00 0.00 0.00 18.18 > 4 1.01 0.00 | 63.64 |0.00 0.00 0.00 0.00 0.00 0.00 35.35 > 5 3.09 0.00 | 63.92 |0.00 0.00 0.00 0.00 0.00 0.00 32.99 > 6 2.02 0.00 | 75.76 |0.00 0.00 0.00 0.00 0.00 0.00 22.22 > 7 1.02 0.00 | 57.14 |0.00 0.00 0.00 0.00 0.00 0.00 41.84 > 8 0.00 0.00 | 67.01 |0.00 0.00 0.00 0.00 0.00 0.00 32.99 > 9 1.01 0.00 | 59.60 |0.00 0.00 0.00 0.00 0.00 0.00 39.39 > 10 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73 > 11 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73 > > Poll:- > CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle > --------------------------------------------------------- > all 1.08 0.00 98.08 0.00 0.00 0.00 0.08 0.00 0.00 0.75 > --------------------------------------------------------- > 0 2.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 1.00 > 1 0.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 3.00 > 2 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 3 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 4 1.01 0.00 | 98.99 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 5 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 6 0.99 0.00 | 99.01 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 7 1.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 2.00 > 8 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 9 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 10 1.98 0.00| 94.06 |0.00 0.00 0.00 0.00 0.00 0.00 3.96 > 11 1.00 0.00| 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> static void nvmet_bdev_execute_rw(struct nvmet_req *req) >>> { >>> int sg_cnt = req->sg_cnt; >>> @@ -185,7 +215,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) >>> bio->bi_end_io = nvmet_bio_done; >>> bio->bi_opf = op; >>> >>> - blk_start_plug(&plug); >>> + if (!req->poll) >>> + blk_start_plug(&plug); >>> for_each_sg(req->sg, sg, req->sg_cnt, i) { >>> while (bio_add_page(bio, sg_page(sg), 
sg->length, sg->offset) >>> != sg->length) { >>> @@ -204,8 +235,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) >>> sg_cnt--; >>> } >>> >>> - submit_bio(bio); >>> - blk_finish_plug(&plug); >>> + req->b.last_bio = bio; >>> + if (req->poll) >>> + req->b.last_bio->bi_opf |= REQ_HIPRI; >>> + >>> + req->b.cookie = submit_bio(bio); >>> + >>> + nvmet_req_prep_poll(req); >>> + nvmet_req_issue_poll(req); >>> + if (!req->poll) >>> + blk_finish_plug(&plug); >>> } >>> >>> static void nvmet_bdev_execute_flush(struct nvmet_req *req) >>> @@ -330,15 +369,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req) >>> switch (cmd->common.opcode) { >>> case nvme_cmd_read: >>> case nvme_cmd_write: >>> + req->poll = req->ns->poll; >>> req->execute = nvmet_bdev_execute_rw; >>> return 0; >>> case nvme_cmd_flush: >>> + req->poll = false; >> should be done in some general place for req initialization. >> >> > This needs to be done here for better readability, since it > depends on two factors: backend support and the user setting.
> >>> req->execute = nvmet_bdev_execute_flush; >>> return 0; >>> case nvme_cmd_dsm: >>> + req->poll = false; >>> req->execute = nvmet_bdev_execute_dsm; >>> return 0; >>> case nvme_cmd_write_zeroes: >>> + req->poll = false; >>> req->execute = nvmet_bdev_execute_write_zeroes; >>> return 0; >>> default: >>> diff --git a/drivers/nvme/target/io-poll.c b/drivers/nvme/target/io-poll.c >>> new file mode 100644 >>> index 000000000000..175c939c22ff >>> --- /dev/null >>> +++ b/drivers/nvme/target/io-poll.c >>> @@ -0,0 +1,165 @@ >>> +#include <linux/blkdev.h> >>> +#include <linux/module.h> >>> +#include <linux/sched/signal.h> >>> +#include <linux/kthread.h> >>> +#include <uapi/linux/sched/types.h> >>> + >>> +#include "nvmet.h" >>> +#define THREAD_PER_CPU (num_online_cpus() * 2) >>> + >>> +static int nvmet_poll_thread(void *data); >>> + >>> +int nvmet_ns_start_poll(struct nvmet_ns *ns) >>> +{ >>> + struct nvmet_poll_data *t; >>> + int ret = 0; >>> + int i; >>> + >>> + t = kzalloc(sizeof(*t) * THREAD_PER_CPU, GFP_KERNEL); >>> + if (!t) { >>> + ret = -ENOMEM; >>> + goto out; >>> + } >>> + >>> + for (i = 0; i < THREAD_PER_CPU; i++) { >>> + init_completion(&t[i].thread_complete); >>> + init_waitqueue_head(&t[i].poll_waitq); >>> + INIT_LIST_HEAD(&t[i].list); >>> + INIT_LIST_HEAD(&t[i].done); >>> + mutex_init(&t[i].list_lock); >>> + t[i].id = i; >>> + >>> + t[i].thread = kthread_create(nvmet_poll_thread, &t[i], >>> + "nvmet_poll_thread%s/%d", >>> + ns->device_path, i); >>> + >>> + if (IS_ERR(t[i].thread)) { >>> + ret = PTR_ERR(t[i].thread); >>> + goto err; >>> + } >>> + >>> + kthread_bind(t[i].thread, i % num_online_cpus()); >>> + wake_up_process(t[i].thread); >>> + wait_for_completion(&t[i].thread_complete); >>> + } >>> + >>> + ns->t = t; >>> +out: >>> + return ret; >>> +err: >>> + nvmet_ns_stop_poll(ns); >>> + goto out; >>> +} >>> + >>> +void nvmet_ns_stop_poll(struct nvmet_ns *ns) >>> +{ >>> + struct nvmet_poll_data *t = ns->t; >>> + int i; >>> + >>> + for (i = 0; i < 
THREAD_PER_CPU; i++) { >>> + if (!t[i].thread) >>> + continue; >>> + >>> + if (wq_has_sleeper(&t[i].poll_waitq)) >>> + wake_up(&t[i].poll_waitq); >>> + kthread_park(t[i].thread); >>> + kthread_stop(t[i].thread); >>> + } >>> +} >>> + >>> +static void nvmet_poll(struct nvmet_poll_data *t) >>> +{ >>> + struct nvmet_req *req, *tmp; >>> + >>> + lockdep_assert_held(&t->list_lock); >>> + >>> + list_for_each_entry_safe(req, tmp, &t->list, poll_entry) { >>> + >>> + if (completion_done(&req->wait)) { >>> + list_move_tail(&req->poll_entry, &t->done); >>> + continue; >>> + } >>> + >>> + if (!list_empty(&t->done)) >>> + break; >>> + >>> + list_del(&req->poll_entry); >>> + mutex_unlock(&t->list_lock); >>> + nvmet_req_poll_complete(req); >>> + mutex_lock(&t->list_lock); >>> + } >>> + mutex_unlock(&t->list_lock); >>> + /* >>> + * In future we can also add a batch timeout or nr of requests to complete. >>> + */ >>> + while (!list_empty(&t->done)) { >>> + /* >>> + * We lock and unlock for t->list which guarantees progress of >>> + * nvmet_xxx_rw_execute() when under pressure while we complete >>> + * the request.
>>> + */ >>> + req = list_first_entry(&t->done, struct nvmet_req, poll_entry); >>> + list_del(&req->poll_entry); >>> + nvmet_req_done(req); >>> + } >>> + >>> + mutex_lock(&t->list_lock); >>> +} >>> + >>> +static int nvmet_poll_thread(void *data) >>> +{ >>> + struct nvmet_poll_data *t = (struct nvmet_poll_data *) data; >>> + DEFINE_WAIT(wait); >>> + >>> + complete(&t->thread_complete); >>> + >>> + while (!kthread_should_park()) { >>> + >>> + mutex_lock(&t->list_lock); >>> + while (!list_empty(&t->list) && !need_resched()) >>> + nvmet_poll(t); >>> + mutex_unlock(&t->list_lock); >>> + >>> + prepare_to_wait(&t->poll_waitq, &wait, TASK_INTERRUPTIBLE); >>> + if (signal_pending(current)) >>> + flush_signals(current); >>> + smp_mb(); >>> + schedule(); >>> + >>> + finish_wait(&t->poll_waitq, &wait); >>> + } >>> + >>> + kthread_parkme(); >>> + return 0; >>> +} >>> + >>> +inline void nvmet_req_prep_poll(struct nvmet_req *req) >>> +{ >>> + if (!req->poll) >>> + return; >>> + >>> + init_completion(&req->wait); >>> + req->t = &req->ns->t[smp_processor_id()]; >>> +} >>> + >>> +inline void nvmet_req_issue_poll(struct nvmet_req *req) >>> +{ >>> + if (!req->poll) >>> + return; >>> + >>> + while (!mutex_trylock(&req->t->list_lock)) { >>> + if (req->t->id < num_online_cpus()) >>> + req->t = &req->ns->t[req->t->id + num_online_cpus()]; >>> + else >>> + req->t = &req->ns->t[req->t->id - num_online_cpus()]; >>> + } >>> + >>> + if (completion_done(&req->wait)) >>> + list_add(&req->poll_entry, &req->t->list); >>> + else >>> + list_add_tail(&req->poll_entry, &req->t->list); >>> + mutex_unlock(&req->t->list_lock); >>> + >>> + if (wq_has_sleeper(&req->t->poll_waitq)) >>> + wake_up(&req->t->poll_waitq); >>> +} >>> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h >>> index 46df45e837c9..ef2919e23e0b 100644 >>> --- a/drivers/nvme/target/nvmet.h >>> +++ b/drivers/nvme/target/nvmet.h >>> @@ -49,11 +49,22 @@ >>> #define IPO_IATTR_CONNECT_SQE(x) \ >>> 
(cpu_to_le32(offsetof(struct nvmf_connect_command, x))) >>> >>> +struct nvmet_poll_data { >>> + struct completion thread_complete; >>> + wait_queue_head_t poll_waitq; >>> + struct mutex list_lock; >>> + struct task_struct *thread; >>> + struct list_head list; >>> + struct list_head done; >>> + unsigned int id; >>> +}; >>> + >>> struct nvmet_ns { >>> struct list_head dev_link; >>> struct percpu_ref ref; >>> struct block_device *bdev; >>> struct file *file; >>> + struct nvmet_poll_data *t; >>> bool readonly; >>> u32 nsid; >>> u32 blksize_shift; >>> @@ -63,6 +74,8 @@ struct nvmet_ns { >>> u32 anagrpid; >>> >>> bool buffered_io; >>> + bool use_poll; >>> + bool poll; >>> bool enabled; >>> struct nvmet_subsys *subsys; >>> const char *device_path; >>> @@ -292,9 +305,15 @@ struct nvmet_req { >>> struct nvmet_ns *ns; >>> struct scatterlist *sg; >>> struct bio_vec inline_bvec[NVMET_MAX_INLINE_BIOVEC]; >>> + struct completion wait; >>> + bool poll; >>> + struct nvmet_poll_data *t; >>> + struct list_head poll_entry; >>> union { >>> struct { >>> - struct bio inline_bio; >>> + struct bio inline_bio; >>> + blk_qc_t cookie; >>> + struct bio *last_bio; >>> } b; >>> struct { >>> bool mpool_alloc; >>> @@ -442,6 +461,16 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys, >>> void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type, >>> u8 event_info, u8 log_page); >>> >>> +int nvmet_ns_start_poll(struct nvmet_ns *ns); >>> +void nvmet_ns_stop_poll(struct nvmet_ns *ns); >>> +void nvmet_req_prep_poll(struct nvmet_req *req); >>> +void nvmet_req_issue_poll(struct nvmet_req *req); >>> + >>> +void nvmet_req_poll_complete(struct nvmet_req *req); >>> +void nvmet_bdev_poll_complete(struct nvmet_req *req); >>> +void nvmet_bdev_req_complete(struct nvmet_req *req); >>> +void nvmet_req_done(struct nvmet_req *req); >>> + >>> #define NVMET_QUEUE_SIZE 1024 >>> #define NVMET_NR_QUEUES 128 >>> #define NVMET_MAX_CMD NVMET_QUEUE_SIZE _______________________________________________ 
^ permalink raw reply [flat|nested] 12+ messages in thread
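The thread layout in nvmet_ns_start_poll()/nvmet_req_issue_poll() quoted above is easy to miss: 2 * num_online_cpus() threads are created, thread i is bound to CPU i % num_online_cpus(), and the submission path hops to the sibling thread (id ± num_online_cpus()) when the first thread's list_lock is contended. A minimal userspace model of that arithmetic (plain C; the function names here are invented for illustration, this is not kernel code):

```c
#include <assert.h>

/* Model of the patch's per-CPU pairing: THREAD_PER_CPU = 2 * ncpus
 * poll threads, thread tid bound to CPU tid % ncpus, so CPU c owns
 * the sibling pair {c, c + ncpus}. */
static unsigned int bound_cpu(unsigned int tid, unsigned int ncpus)
{
	return tid % ncpus;	/* mirrors kthread_bind(t[i].thread, i % num_online_cpus()) */
}

/* Sibling hop from nvmet_req_issue_poll(): when mutex_trylock() on the
 * current thread's list_lock fails, retry on the other thread of the
 * same CPU instead of blocking. */
static unsigned int sibling(unsigned int tid, unsigned int ncpus)
{
	return tid < ncpus ? tid + ncpus : tid - ncpus;
}
```

Both threads of a pair are bound to the same CPU, so hopping between siblings never moves a request's polling to another CPU's thread set, which is what keeps the trylock loop in nvmet_req_issue_poll() bounded to two candidates.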
* Re: [RFC PATCH 1/2] nvmet: add bdev-ns polling support 2020-01-23 14:23 ` Max Gurtovoy @ 2020-01-30 18:19 ` Chaitanya Kulkarni 0 siblings, 0 replies; 12+ messages in thread From: Chaitanya Kulkarni @ 2020-01-30 18:19 UTC (permalink / raw) To: Max Gurtovoy, sagi, hch; +Cc: linux-nvme On 01/23/2020 06:23 AM, Max Gurtovoy wrote: > On 1/21/2020 9:22 PM, Chaitanya Kulkarni wrote: >> >Please see my in-line comments. >> >On 01/20/2020 04:52 AM, Max Gurtovoy wrote: >>> >>On 12/10/2019 8:25 AM, Chaitanya Kulkarni wrote: >>>> >>>This patch adds support for the bdev-ns polling. We also add a new >>>> >>>file to keep polling code separate (io-poll.c). The newly added >>>> >>>configfs attribute allows user to enable/disable polling. >>>> >>> >>>> >>>We only enable polling for the READ/WRITE operation based on support >>>> >>>from bdev and and use_poll configfs attr. >>>> >>> >>>> >>>We configure polling per namespace. For each namespace we create >>>> >>>polling threads. For general approach please have a look at the >>>> >>>cover-letter of this patch-series. >>> >>It would be great to get some numbers here to learn about the trade-off >>> >>for this approach. 
>>> >> >>> >> >> >Already posted these numbers with QEMU nvme-loop here :- >> >http://lists.infradead.org/pipermail/linux-nvme/2020-January/028692.html >> > >> >IOPS/BW:- >> > >> >Default:- >> >read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec) >> >read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec) >> >read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec) >> >Poll:- >> >read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec) >> >read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec) >> >read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec) >> > >> >mpstat:- >> >Default:- >> >CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle >> >--------------------------------------------------------- >> >all 1.18 0.00 |60.14| 0.00 0.00 0.00 0.08 0.00 0.00 38.60 >> >--------------------------------------------------------- >> >0 0.00 0.00 | 32.00 |0.00 0.00 0.00 0.00 0.00 0.00 68.00 >> >1 1.01 0.00 | 42.42 |0.00 0.00 0.00 0.00 0.00 0.00 56.57 >> >2 1.01 0.00 | 57.58 |0.00 0.00 0.00 0.00 0.00 0.00 41.41 >> >3 2.02 0.00 | 79.80 |0.00 0.00 0.00 0.00 0.00 0.00 18.18 >> >4 1.01 0.00 | 63.64 |0.00 0.00 0.00 0.00 0.00 0.00 35.35 >> >5 3.09 0.00 | 63.92 |0.00 0.00 0.00 0.00 0.00 0.00 32.99 >> >6 2.02 0.00 | 75.76 |0.00 0.00 0.00 0.00 0.00 0.00 22.22 >> >7 1.02 0.00 | 57.14 |0.00 0.00 0.00 0.00 0.00 0.00 41.84 >> >8 0.00 0.00 | 67.01 |0.00 0.00 0.00 0.00 0.00 0.00 32.99 >> >9 1.01 0.00 | 59.60 |0.00 0.00 0.00 0.00 0.00 0.00 39.39 >> >10 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73 >> >11 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73 >> > >> >Poll:- >> >CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle >>
>--------------------------------------------------------- >> >all 1.08 0.00 98.08 0.00 0.00 0.00 0.08 0.00 0.00 0.75 >> >--------------------------------------------------------- >> >0 2.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 1.00 >> >1 0.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 3.00 >> >2 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 >> >3 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 >> >4 1.01 0.00 | 98.99 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 >> >5 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 >> >6 0.99 0.00 | 99.01 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 >> >7 1.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 2.00 >> >8 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 >> >9 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 >> >10 1.98 0.00| 94.06 |0.00 0.00 0.00 0.00 0.00 0.00 3.96 >> >11 1.00 0.00| 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 > with all the respect to nvmet-loop, it isn't a real use case, right ? > > Can you re-run it using RDMA and TCP ? > > what is the backing device for these tests ? NVMe/NULL_BLK ? > > I don't think null_blk supports polling. This is NVMe QEMU Device. Don't have the RDMA setup, I'll create one and report numbers. _______________________________________________ linux-nvme mailing list linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC PATCH 2/2] nvmet: add file-ns polling support 2019-12-10 6:25 [RFC PATCH 0/2] nvmet: add polling support Chaitanya Kulkarni 2019-12-10 6:25 ` [RFC PATCH 1/2] nvmet: add bdev-ns " Chaitanya Kulkarni @ 2019-12-10 6:25 ` Chaitanya Kulkarni 2019-12-12 1:01 ` [RFC PATCH 0/2] nvmet: add " Sagi Grimberg 2 siblings, 0 replies; 12+ messages in thread From: Chaitanya Kulkarni @ 2019-12-10 6:25 UTC (permalink / raw) To: sagi, hch; +Cc: Chaitanya Kulkarni, linux-nvme This patch adds polling support for file-ns. We only enable polling based on underlying fs support and the use_poll attr. By default we don't poll on any operation; we only poll on read/write requests when we receive -EIOCBQUEUED, i.e. for direct I/O. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> --- drivers/nvme/target/core.c | 4 +++ drivers/nvme/target/io-cmd-file.c | 60 +++++++++++++++++++++++++++---- drivers/nvme/target/nvmet.h | 9 +++-- 3 files changed, 63 insertions(+), 10 deletions(-) diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index d8f9130d1cd1..cd2f5c6f896e 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -514,12 +514,16 @@ inline void nvmet_req_done(struct nvmet_req *req) { if (req->ns->bdev) nvmet_bdev_req_complete(req); + if (req->ns->file) + nvmet_file_req_complete(req); } inline void nvmet_req_poll_complete(struct nvmet_req *req) { if (req->ns->bdev) nvmet_bdev_poll_complete(req); + if (req->ns->file) + nvmet_file_poll_complete(req); } int nvmet_ns_enable(struct nvmet_ns *ns) diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c index caebfce06605..1f49a02fd437 100644 --- a/drivers/nvme/target/io-cmd-file.c +++ b/drivers/nvme/target/io-cmd-file.c @@ -16,6 +16,7 @@ void nvmet_file_ns_disable(struct nvmet_ns *ns) { if (ns->file) { + ns->poll ?
nvmet_ns_stop_poll(ns) : 0; if (ns->buffered_io) flush_workqueue(buffered_io_wq); mempool_destroy(ns->bvec_pool); @@ -72,6 +73,11 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns) goto err; } + ns->poll = ns->use_poll && ns->file->f_op->iopoll; + ret = ns->poll ? nvmet_ns_start_poll(ns) : 0; + if (ret) + goto err; + return ret; err: ns->size = 0; @@ -114,9 +120,8 @@ static ssize_t nvmet_file_submit_bvec(struct nvmet_req *req, loff_t pos, return call_iter(iocb, &iter); } -static void nvmet_file_io_done(struct kiocb *iocb, long ret, long ret2) +void nvmet_file_req_complete(struct nvmet_req *req) { - struct nvmet_req *req = container_of(iocb, struct nvmet_req, f.iocb); u16 status = NVME_SC_SUCCESS; if (req->f.bvec != req->inline_bvec) { @@ -126,13 +131,39 @@ static void nvmet_file_io_done(struct kiocb *iocb, long ret, long ret2) mempool_free(req->f.bvec, req->ns->bvec_pool); } - if (unlikely(ret != req->transfer_len)) - status = errno_to_nvme_status(req, ret); + if (unlikely(req->f.iosize != req->transfer_len)) + status = errno_to_nvme_status(req, req->f.iosize); nvmet_req_complete(req, status); } +static void nvmet_file_io_done(struct kiocb *iocb, long ret, long ret2) +{ + struct nvmet_req *req = container_of(iocb, struct nvmet_req, f.iocb); + + req->f.iosize = ret; + req->poll ? complete(&req->wait) : nvmet_file_req_complete(req); +} + +void nvmet_file_poll_complete(struct nvmet_req *req) +{ + while (!completion_done(&req->wait)) { + int ret = req->f.iocb.ki_filp->f_op->iopoll(&req->f.iocb, true); + + if (ret < 0) + pr_err("tid %d poll error %d", req->t->id, ret); + } + /* + * We are out of the lock anyway, by completing the polled request here + * we reduce lock contention and decrease the size of done list which + * reduces the size of list_lock. This allows nvmet_file_execute_rw() + * to make progress as and when we scheduled out. 
+ */ + nvmet_file_req_complete(req); +} + static bool nvmet_file_execute_io(struct nvmet_req *req, int ki_flags) { + struct kiocb *iocb = &req->f.iocb; ssize_t nr_bvec = req->sg_cnt; unsigned long bv_cnt = 0; bool is_sync = false; @@ -151,7 +182,7 @@ static bool nvmet_file_execute_io(struct nvmet_req *req, int ki_flags) return true; } - memset(&req->f.iocb, 0, sizeof(struct kiocb)); + memset(iocb, 0, sizeof(struct kiocb)); for_each_sg(req->sg, sg, req->sg_cnt, i) { nvmet_file_init_bvec(&req->f.bvec[bv_cnt], sg); len += req->f.bvec[bv_cnt].bv_len; @@ -187,13 +218,20 @@ static bool nvmet_file_execute_io(struct nvmet_req *req, int ki_flags) * A NULL ki_complete ask for synchronous execution, which we want * for the IOCB_NOWAIT case. */ - if (!(ki_flags & IOCB_NOWAIT)) - req->f.iocb.ki_complete = nvmet_file_io_done; + if (!(ki_flags & IOCB_NOWAIT)) { + iocb->ki_complete = nvmet_file_io_done; + ki_flags |= req->ns->poll ? IOCB_HIPRI : 0; + } ret = nvmet_file_submit_bvec(req, pos, bv_cnt, total_len, ki_flags); switch (ret) { case -EIOCBQUEUED: + if (req->ns->poll) { + req->poll = true; + nvmet_req_prep_poll(req); + nvmet_req_issue_poll(req); + } return true; case -EAGAIN: if (WARN_ON_ONCE(!(ki_flags & IOCB_NOWAIT))) @@ -211,6 +249,10 @@ static bool nvmet_file_execute_io(struct nvmet_req *req, int ki_flags) } complete: + /* + * If we are here, then I/O is synchronously completed and ret holds + * number of bytes transferred. 
+ */ nvmet_file_io_done(&req->f.iocb, ret, 0); return true; } @@ -379,15 +421,19 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req) switch (cmd->common.opcode) { case nvme_cmd_read: case nvme_cmd_write: + req->poll = false; req->execute = nvmet_file_execute_rw; return 0; case nvme_cmd_flush: + req->poll = false; req->execute = nvmet_file_execute_flush; return 0; case nvme_cmd_dsm: + req->poll = false; req->execute = nvmet_file_execute_dsm; return 0; case nvme_cmd_write_zeroes: + req->poll = false; req->execute = nvmet_file_execute_write_zeroes; return 0; default: diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index ef2919e23e0b..e7e0e0de705e 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -317,9 +317,10 @@ struct nvmet_req { } b; struct { bool mpool_alloc; - struct kiocb iocb; - struct bio_vec *bvec; - struct work_struct work; + struct kiocb iocb; + struct bio_vec *bvec; + struct work_struct work; + long iosize; } f; }; int sg_cnt; @@ -469,6 +470,8 @@ void nvmet_req_issue_poll(struct nvmet_req *req); void nvmet_req_poll_complete(struct nvmet_req *req); void nvmet_bdev_poll_complete(struct nvmet_req *req); void nvmet_bdev_req_complete(struct nvmet_req *req); +void nvmet_file_poll_complete(struct nvmet_req *req); +void nvmet_file_req_complete(struct nvmet_req *req); void nvmet_req_done(struct nvmet_req *req); #define NVMET_QUEUE_SIZE 1024 -- 2.22.1 _______________________________________________ linux-nvme mailing list linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 12+ messages in thread
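The file-ns flow in the patch above reduces to: submission returns -EIOCBQUEUED, nvmet_file_io_done() later completes req->wait, and nvmet_file_poll_complete() keeps calling ->iopoll until that completion is observed. A plain-C userspace sketch of that loop (fake_req/fake_iopoll are invented stand-ins; real kiocb/->iopoll semantics are richer):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a polled request: 'done' models completion_done(&req->wait),
 * 'polls_left' models how many ->iopoll passes the backend needs (>= 1). */
struct fake_req {
	bool done;
	int polls_left;
};

/* Stand-in for f_op->iopoll(): drives the backend forward and, on the
 * final pass, fires the completion that nvmet_file_io_done() would post. */
static int fake_iopoll(struct fake_req *req)
{
	if (--req->polls_left == 0)
		req->done = true;
	return 1;	/* > 0: progress was made */
}

/* Model of nvmet_file_poll_complete(): spin on ->iopoll until the
 * request's completion is observed, then the request can be finished. */
static int poll_until_complete(struct fake_req *req)
{
	int passes = 0;

	while (!req->done) {	/* !completion_done(&req->wait) */
		fake_iopoll(req);
		passes++;
	}
	return passes;
}
```

As in the patch, the spin happens outside the poll thread's list_lock, so other submissions can keep adding requests while one polled I/O is being driven to completion.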
* Re: [RFC PATCH 0/2] nvmet: add polling support 2019-12-10 6:25 [RFC PATCH 0/2] nvmet: add polling support Chaitanya Kulkarni 2019-12-10 6:25 ` [RFC PATCH 1/2] nvmet: add bdev-ns " Chaitanya Kulkarni 2019-12-10 6:25 ` [RFC PATCH 2/2] nvmet: add file-ns " Chaitanya Kulkarni @ 2019-12-12 1:01 ` Sagi Grimberg 2019-12-12 5:44 ` Chaitanya Kulkarni 2020-01-20 4:48 ` Chaitanya Kulkarni 2 siblings, 2 replies; 12+ messages in thread From: Sagi Grimberg @ 2019-12-12 1:01 UTC (permalink / raw) To: Chaitanya Kulkarni, hch; +Cc: linux-nvme percpu threads per namespace? Sounds like the wrong approach. These threads will compete for cpu time with the main nvmet contexts. Have you considered having the main nvmet contexts incorporate polling activity between I/Os? Don't have a great idea on how to do it from first thought... ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 0/2] nvmet: add polling support 2019-12-12 1:01 ` [RFC PATCH 0/2] nvmet: add " Sagi Grimberg @ 2019-12-12 5:44 ` Chaitanya Kulkarni 2019-12-12 20:32 ` Sagi Grimberg 1 sibling, 1 reply; 12+ messages in thread From: Chaitanya Kulkarni @ 2019-12-12 5:44 UTC (permalink / raw) To: Sagi Grimberg, hch; +Cc: linux-nvme On 12/11/2019 05:01 PM, Sagi Grimberg wrote: > percpu threads per namespace? Sounds like the wrong approach. These > threads will compete for cpu time with the main nvmet contexts. > That makes sense; how about a global threadpool for the target which can be shared between all the subsystems and their namespaces, just like we have buffered_io_wq? > Have you considered having the main nvmet contexts incorporate polling > activity between I/Os? Don't have a great idea on how to do it from first > thought... > I am not able to understand the nvmet context, can you please elaborate? Are you referring to the pattern we have in nvme_execute_rq_polled()? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 0/2] nvmet: add polling support 2019-12-12 5:44 ` Chaitanya Kulkarni @ 2019-12-12 20:32 ` Sagi Grimberg 2020-01-20 5:13 ` Chaitanya Kulkarni 0 siblings, 1 reply; 12+ messages in thread From: Sagi Grimberg @ 2019-12-12 20:32 UTC (permalink / raw) To: Chaitanya Kulkarni, hch; +Cc: linux-nvme >> percpu threads per namespace? Sounds like the wrong approach. These >> threads will compete for cpu time with the main nvmet contexts. >> > That makes sense; how about a global threadpool for the target which can be > shared between all the subsystems and their namespaces, just like we > have buffered_io_wq? >> Have you considered having the main nvmet contexts incorporate polling >> activity between I/Os? Don't have a great idea on how to do it from first >> thought... >> > > I am not able to understand the nvmet context, can you please elaborate? > Are you referring to the pattern we have in > nvme_execute_rq_polled()? No, we would want non-selective polling. Right now we have nvmet contexts starting from the transport going to submit I/O to the backend, or starting from the backend going to submit to the transport. Ideally, we'd have these contexts do the polling instead of a different thread that will poll for as much as it can, taking away cpu time. One way to do it is to place an intermediate thread that will sit between the transport and the backend, but that would yield an additional context switch in the I/O path (not ideal). ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 0/2] nvmet: add polling support 2019-12-12 20:32 ` Sagi Grimberg @ 2020-01-20 5:13 ` Chaitanya Kulkarni 0 siblings, 0 replies; 12+ messages in thread From: Chaitanya Kulkarni @ 2020-01-20 5:13 UTC (permalink / raw) To: Sagi Grimberg, hch; +Cc: linux-nvme On 12/12/2019 12:32 PM, Sagi Grimberg wrote: >>> >>percpu threads per namespace? Sounds like the wrong approach. These >>> >>threads will compete for cpu time with the main nvmet contexts. >>> >> >> >That makes sense; how about a global threadpool for the target which can be >> >shared between all the subsystems and their namespaces, just like we >> >have buffered_io_wq? >>> >>Have you considered having the main nvmet contexts incorporate polling >>> >>activity between I/Os? Don't have a great idea on how to do it from first >>> >>thought... >>> >> >> > >> >I am not able to understand the nvmet context, can you please elaborate? >> >Are you referring to the pattern we have in >> >nvme_execute_rq_polled()? > No, we would want non-selective polling. Right now we have nvmet contexts > starting from the transport going to submit I/O to the backend, or > starting from the backend going to submit to the transport. > > Ideally, we'd have these contexts do the polling instead of a > different thread that will poll for as much as it can, taking away > cpu time. > > One way to do it is to place an intermediate thread that will sit between > the transport and the backend, but that would yield an additional context > switch in the I/O path (not ideal). > Yes, not ideal. I took numbers on the following patch and I can see performance improved when polling is enabled in QEMU with nvme-loop. #nvme list | tr -s ' ' ' ' /dev/nvme0n1 foo QEMU NVMe Ctrl 1 2.10 MB / 2.10 MB 1 B + 9 B 1.0 /dev/nvme1n1 6e79351962f851ad Linux 1 2.10 MB / 2.10 MB 1 B + 9 B 5.5.0-rc We can see that the CPU percentage has gone up to the saturation point, with an increase in BW and IOPS.
IOPS/BW:- Default:- read: IOPS=52.8k, BW=206MiB/s (216MB/s)(6188MiB/30001msec) read: IOPS=54.3k, BW=212MiB/s (223MB/s)(6369MiB/30001msec) read: IOPS=53.5k, BW=209MiB/s (219MB/s)(6274MiB/30001msec) Poll:- read: IOPS=68.4k, BW=267MiB/s (280MB/s)(8011MiB/30001msec) read: IOPS=69.3k, BW=271MiB/s (284MB/s)(8124MiB/30001msec) read: IOPS=69.4k, BW=271MiB/s (284MB/s)(8132MiB/30001msec) mpstat:- Default:- CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle --------------------------------------------------------- all 1.18 0.00 |60.14| 0.00 0.00 0.00 0.08 0.00 0.00 38.60 --------------------------------------------------------- 0 0.00 0.00 | 32.00 |0.00 0.00 0.00 0.00 0.00 0.00 68.00 1 1.01 0.00 | 42.42 |0.00 0.00 0.00 0.00 0.00 0.00 56.57 2 1.01 0.00 | 57.58 |0.00 0.00 0.00 0.00 0.00 0.00 41.41 3 2.02 0.00 | 79.80 |0.00 0.00 0.00 0.00 0.00 0.00 18.18 4 1.01 0.00 | 63.64 |0.00 0.00 0.00 0.00 0.00 0.00 35.35 5 3.09 0.00 | 63.92 |0.00 0.00 0.00 0.00 0.00 0.00 32.99 6 2.02 0.00 | 75.76 |0.00 0.00 0.00 0.00 0.00 0.00 22.22 7 1.02 0.00 | 57.14 |0.00 0.00 0.00 0.00 0.00 0.00 41.84 8 0.00 0.00 | 67.01 |0.00 0.00 0.00 0.00 0.00 0.00 32.99 9 1.01 0.00 | 59.60 |0.00 0.00 0.00 0.00 0.00 0.00 39.39 10 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73 11 1.02 0.00| 62.24 |0.00 0.00 0.00 0.00 0.00 0.00 36.73 Poll:- CPU %usr %nice |%sys| %iowait %irq %soft %steal %guest %gnice %idle --------------------------------------------------------- all 1.08 0.00 98.08 0.00 0.00 0.00 0.08 0.00 0.00 0.75 --------------------------------------------------------- 0 2.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 1.00 1 0.00 0.00 | 97.00 |0.00 0.00 0.00 0.00 0.00 0.00 3.00 2 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 3 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 4 1.01 0.00 | 98.99 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 5 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 6 0.99 0.00 | 99.01 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 7 1.00 0.00 | 97.00 |0.00 0.00 0.00 
0.00 0.00 0.00 2.00 8 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 9 1.00 0.00 | 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 10 1.98 0.00| 94.06 |0.00 0.00 0.00 0.00 0.00 0.00 3.96 11 1.00 0.00| 99.00 |0.00 0.00 0.00 0.00 0.00 0.00 0.00 Any thoughts? Patch :- diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile index 2b33836f3d3e..8877ba16305c 100644 --- a/drivers/nvme/target/Makefile +++ b/drivers/nvme/target/Makefile @@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP) += nvme-fcloop.o obj-$(CONFIG_NVME_TARGET_TCP) += nvmet-tcp.o nvmet-y += core.o configfs.o admin-cmd.o fabrics-cmd.o \ - discovery.o io-cmd-file.o io-cmd-bdev.o + discovery.o io-cmd-file.o io-cmd-bdev.o \ + io-poll.o nvme-loop-y += loop.o nvmet-rdma-y += rdma.o nvmet-fc-y += fc.o diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index 98613a45bd3b..0e6e8b0dbf79 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item, CONFIGFS_ATTR(nvmet_ns_, buffered_io); +static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page) +{ + return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll); +} + +static ssize_t nvmet_ns_use_poll_store(struct config_item *item, + const char *page, size_t count) +{ + struct nvmet_ns *ns = to_nvmet_ns(item); + bool
val; + + if (strtobool(page, &val)) + return -EINVAL; + + mutex_lock(&ns->subsys->lock); + if (ns->enabled) { + pr_err("disable ns before setting use_poll value.\n"); + mutex_unlock(&ns->subsys->lock); + return -EINVAL; + } + + ns->use_poll = val; + mutex_unlock(&ns->subsys->lock); + return count; +} + +CONFIGFS_ATTR(nvmet_ns_, use_poll); + static struct configfs_attribute *nvmet_ns_attrs[] = { &nvmet_ns_attr_device_path, &nvmet_ns_attr_device_nguid, @@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = { &nvmet_ns_attr_ana_grpid, &nvmet_ns_attr_enable, &nvmet_ns_attr_buffered_io, + &nvmet_ns_attr_use_poll, #ifdef CONFIG_PCI_P2PDMA &nvmet_ns_attr_p2pmem, #endif diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 28438b833c1b..af91240b1a1f 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl, ns->nsid); } +inline void nvmet_req_poll(struct nvmet_req *req) +{ + if (req->ns->bdev) + nvmet_req_bdev_poll(req); +} + +inline void nvmet_req_poll_complete(struct nvmet_req *req) +{ + if (req->ns->bdev) + nvmet_bdev_req_complete(req); +} + int nvmet_ns_enable(struct nvmet_ns *ns) { struct nvmet_subsys *subsys = ns->subsys; @@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid) uuid_gen(&ns->uuid); ns->buffered_io = false; + ns->use_poll = false; return ns; } diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index b6fca0e421ef..317c9b427b71 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id) int nvmet_bdev_ns_enable(struct nvmet_ns *ns) { + int fmode = FMODE_READ | FMODE_WRITE; + struct request_queue *q; int ret; - ns->bdev = blkdev_get_by_path(ns->device_path, - FMODE_READ | FMODE_WRITE, NULL); + ns->bdev = 
blkdev_get_by_path(ns->device_path, fmode, NULL); if (IS_ERR(ns->bdev)) { ret = PTR_ERR(ns->bdev); if (ret != -ENOTBLK) { @@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns) ns->device_path, PTR_ERR(ns->bdev)); } ns->bdev = NULL; - return ret; + goto out; } ns->size = i_size_read(ns->bdev->bd_inode); ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev)); - return 0; + q = bdev_get_queue(ns->bdev); + ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags); + ret = ns->poll ? nvmet_ns_start_poll(ns) : 0; +out: + return ret; } void nvmet_bdev_ns_disable(struct nvmet_ns *ns) { if (ns->bdev) { + ns->poll ? nvmet_ns_stop_poll(ns) : 0; blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ); ns->bdev = NULL; } @@ -133,15 +139,34 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts) return status; } -static void nvmet_bio_done(struct bio *bio) +void nvmet_bdev_req_complete(struct nvmet_req *req) { - struct nvmet_req *req = bio->bi_private; + struct bio *bio = req->b.last_bio; nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status)); if (bio != &req->b.inline_bio) bio_put(bio); } +static void nvmet_bio_done(struct bio *bio) +{ + struct nvmet_req *req = bio->bi_private; + + req->b.last_bio = bio; + + req->poll ? 
complete(&req->wait) : nvmet_bdev_req_complete(req); +} + +void nvmet_req_bdev_poll(struct nvmet_req *req) +{ + struct request_queue *q = bdev_get_queue(req->ns->bdev); + int ret = blk_poll(q, req->b.cookie, true); + + if (ret < 0) { + pr_err("tid %d poll error %d", req->t->id, ret); + } +} + static void nvmet_bdev_execute_rw(struct nvmet_req *req) { int sg_cnt = req->sg_cnt; @@ -185,7 +210,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) bio->bi_end_io = nvmet_bio_done; bio->bi_opf = op; - blk_start_plug(&plug); + if (!req->poll) + blk_start_plug(&plug); for_each_sg(req->sg, sg, req->sg_cnt, i) { while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset) != sg->length) { @@ -204,8 +230,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) sg_cnt--; } - submit_bio(bio); - blk_finish_plug(&plug); + req->b.last_bio = bio; + if (req->poll) + req->b.last_bio->bi_opf |= REQ_HIPRI; + + req->b.cookie = submit_bio(bio); + + nvmet_req_prep_poll(req); + nvmet_req_issue_poll(req); + if (!req->poll) + blk_finish_plug(&plug); } static void nvmet_bdev_execute_flush(struct nvmet_req *req) @@ -330,15 +364,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req) switch (cmd->common.opcode) { case nvme_cmd_read: case nvme_cmd_write: + req->poll = req->ns->poll; req->execute = nvmet_bdev_execute_rw; return 0; case nvme_cmd_flush: + req->poll = false; req->execute = nvmet_bdev_execute_flush; return 0; case nvme_cmd_dsm: + req->poll = false; req->execute = nvmet_bdev_execute_dsm; return 0; case nvme_cmd_write_zeroes: + req->poll = false; req->execute = nvmet_bdev_execute_write_zeroes; return 0; default: diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 46df45e837c9..e5991e36ed6f 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -49,11 +49,22 @@ #define IPO_IATTR_CONNECT_SQE(x) \ (cpu_to_le32(offsetof(struct nvmf_connect_command, x))) +struct nvmet_poll_data { + struct completion thread_complete; + 
wait_queue_head_t poll_waitq; + struct mutex list_lock; + struct task_struct *thread; + struct list_head list; + struct list_head done; + unsigned int id; +}; +diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile index 2b33836f3d3e..8877ba16305c 100644 --- a/drivers/nvme/target/Makefile +++ b/drivers/nvme/target/Makefile @@ -10,7 +10,8 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP) += nvme-fcloop.o obj-$(CONFIG_NVME_TARGET_TCP) += nvmet-tcp.o nvmet-y += core.o configfs.o admin-cmd.o fabrics-cmd.o \ - discovery.o io-cmd-file.o io-cmd-bdev.o + discovery.o io-cmd-file.o io-cmd-bdev.o \ + io-poll.o nvme-loop-y += loop.o nvmet-rdma-y += rdma.o nvmet-fc-y += fc.o diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index 98613a45bd3b..0e6e8b0dbf79 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -545,6 +545,34 @@ static ssize_t nvmet_ns_buffered_io_store(struct config_item *item, CONFIGFS_ATTR(nvmet_ns_, buffered_io); +static ssize_t nvmet_ns_use_poll_show(struct config_item *item, char *page) +{ + return sprintf(page, "%d\n", to_nvmet_ns(item)->use_poll); +} + +static ssize_t nvmet_ns_use_poll_store(struct config_item *item, + const char *page, size_t count) +{ + struct nvmet_ns *ns = to_nvmet_ns(item); + bool val; + + if (strtobool(page, &val)) + return -EINVAL; + + mutex_lock(&ns->subsys->lock); + if (ns->enabled) { + pr_err("disable ns before setting use_poll value.\n"); + mutex_unlock(&ns->subsys->lock); + return -EINVAL; + } + + ns->use_poll = val; + mutex_unlock(&ns->subsys->lock); + return count; +} + +CONFIGFS_ATTR(nvmet_ns_, use_poll); + static struct configfs_attribute *nvmet_ns_attrs[] = { &nvmet_ns_attr_device_path, &nvmet_ns_attr_device_nguid, @@ -552,6 +580,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = { &nvmet_ns_attr_ana_grpid, &nvmet_ns_attr_enable, &nvmet_ns_attr_buffered_io, + &nvmet_ns_attr_use_poll, #ifdef CONFIG_PCI_P2PDMA &nvmet_ns_attr_p2pmem, #endif diff --git 
a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 28438b833c1b..af91240b1a1f 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -510,6 +510,18 @@ static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl, ns->nsid); } +inline void nvmet_req_poll(struct nvmet_req *req) +{ + if (req->ns->bdev) + nvmet_req_bdev_poll(req); +} + +inline void nvmet_req_poll_complete(struct nvmet_req *req) +{ + if (req->ns->bdev) + nvmet_bdev_req_complete(req); +} + int nvmet_ns_enable(struct nvmet_ns *ns) { struct nvmet_subsys *subsys = ns->subsys; @@ -653,6 +665,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid) uuid_gen(&ns->uuid); ns->buffered_io = false; + ns->use_poll = false; return ns; } diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index b6fca0e421ef..317c9b427b71 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id) int nvmet_bdev_ns_enable(struct nvmet_ns *ns) { + int fmode = FMODE_READ | FMODE_WRITE; + struct request_queue *q; int ret; - ns->bdev = blkdev_get_by_path(ns->device_path, - FMODE_READ | FMODE_WRITE, NULL); + ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL); if (IS_ERR(ns->bdev)) { ret = PTR_ERR(ns->bdev); if (ret != -ENOTBLK) { @@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns) ns->device_path, PTR_ERR(ns->bdev)); } ns->bdev = NULL; - return ret; + goto out; } ns->size = i_size_read(ns->bdev->bd_inode); ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev)); - return 0; + q = bdev_get_queue(ns->bdev); + ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags); + ret = ns->poll ? nvmet_ns_start_poll(ns) : 0; +out: + return ret; } void nvmet_bdev_ns_disable(struct nvmet_ns *ns) { if (ns->bdev) { + ns->poll ? 
nvmet_ns_stop_poll(ns) : 0; blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ); ns->bdev = NULL; } @@ -133,15 +139,34 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts) return status; } -static void nvmet_bio_done(struct bio *bio) +void nvmet_bdev_req_complete(struct nvmet_req *req) { - struct nvmet_req *req = bio->bi_private; + struct bio *bio = req->b.last_bio; nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status)); if (bio != &req->b.inline_bio) bio_put(bio); } +static void nvmet_bio_done(struct bio *bio) +{ + struct nvmet_req *req = bio->bi_private; + + req->b.last_bio = bio; + + req->poll ? complete(&req->wait) : nvmet_bdev_req_complete(req); +} + +void nvmet_req_bdev_poll(struct nvmet_req *req) +{ + struct request_queue *q = bdev_get_queue(req->ns->bdev); + int ret = blk_poll(q, req->b.cookie, true); + + if (ret < 0) { + pr_err("tid %d poll error %d", req->t->id, ret); + } +} + static void nvmet_bdev_execute_rw(struct nvmet_req *req) { int sg_cnt = req->sg_cnt; @@ -185,7 +210,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) bio->bi_end_io = nvmet_bio_done; bio->bi_opf = op; - blk_start_plug(&plug); + if (!req->poll) + blk_start_plug(&plug); for_each_sg(req->sg, sg, req->sg_cnt, i) { while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset) != sg->length) { @@ -204,8 +230,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) sg_cnt--; } - submit_bio(bio); - blk_finish_plug(&plug); + req->b.last_bio = bio; + if (req->poll) + req->b.last_bio->bi_opf |= REQ_HIPRI; + + req->b.cookie = submit_bio(bio); + + nvmet_req_prep_poll(req); + nvmet_req_issue_poll(req); + if (!req->poll) + blk_finish_plug(&plug); } static void nvmet_bdev_execute_flush(struct nvmet_req *req) @@ -330,15 +364,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req) switch (cmd->common.opcode) { case nvme_cmd_read: case nvme_cmd_write: + req->poll = req->ns->poll; req->execute = nvmet_bdev_execute_rw; return 0; case nvme_cmd_flush: + 
req->poll = false; req->execute = nvmet_bdev_execute_flush; return 0; case nvme_cmd_dsm: + req->poll = false; req->execute = nvmet_bdev_execute_dsm; return 0; case nvme_cmd_write_zeroes: + req->poll = false; req->execute = nvmet_bdev_execute_write_zeroes; return 0; default: diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 46df45e837c9..e5991e36ed6f 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -49,11 +49,22 @@ #define IPO_IATTR_CONNECT_SQE(x) \ (cpu_to_le32(offsetof(struct nvmf_connect_command, x))) +struct nvmet_poll_data { + struct completion thread_complete; + wait_queue_head_t poll_waitq; + struct mutex list_lock; + struct task_struct *thread; + struct list_head list; + struct list_head done; + unsigned int id; +}; + struct nvmet_ns { struct list_head dev_link; struct percpu_ref ref; struct block_device *bdev; struct file *file; + struct nvmet_poll_data *t; bool readonly; u32 nsid; u32 blksize_shift; @@ -63,6 +74,8 @@ struct nvmet_ns { u32 anagrpid; bool buffered_io; + bool use_poll; + bool poll; bool enabled; struct nvmet_subsys *subsys; const char *device_path; @@ -292,9 +305,15 @@ struct nvmet_req { struct nvmet_ns *ns; struct scatterlist *sg; struct bio_vec inline_bvec[NVMET_MAX_INLINE_BIOVEC]; + struct completion wait; + bool poll; + struct nvmet_poll_data *t; + struct list_head poll_entry; union { struct { - struct bio inline_bio; + struct bio inline_bio; + blk_qc_t cookie; + struct bio *last_bio; } b; struct { bool mpool_alloc; @@ -442,6 +461,17 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys, void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type, u8 event_info, u8 log_page); +int nvmet_ns_start_poll(struct nvmet_ns *ns); +void nvmet_ns_stop_poll(struct nvmet_ns *ns); +void nvmet_req_prep_poll(struct nvmet_req *req); +void nvmet_req_issue_poll(struct nvmet_req *req); + +void nvmet_req_bdev_poll(struct nvmet_req *req); +void nvmet_req_poll(struct nvmet_req *req); +void 
nvmet_req_poll_complete(struct nvmet_req *req); +void nvmet_bdev_req_complete(struct nvmet_req *req); +void nvmet_req_done(struct nvmet_req *req); + #define NVMET_QUEUE_SIZE 1024 #define NVMET_NR_QUEUES 128 #define NVMET_MAX_CMD NVMET_QUEUE_SIZE -- 2.22.1 struct nvmet_ns { struct list_head dev_link; struct percpu_ref ref; struct block_device *bdev; struct file *file; + struct nvmet_poll_data *t; bool readonly; u32 nsid; u32 blksize_shift; @@ -63,6 +74,8 @@ struct nvmet_ns { u32 anagrpid; bool buffered_io; + bool use_poll; + bool poll; bool enabled; struct nvmet_subsys *subsys; const char *device_path; @@ -292,9 +305,15 @@ struct nvmet_req { struct nvmet_ns *ns; struct scatterlist *sg; struct bio_vec inline_bvec[NVMET_MAX_INLINE_BIOVEC]; + struct completion wait; + bool poll; + struct nvmet_poll_data *t; + struct list_head poll_entry; union { struct { - struct bio inline_bio; + struct bio inline_bio; + blk_qc_t cookie; + struct bio *last_bio; } b; struct { bool mpool_alloc; @@ -442,6 +461,17 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys, void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type, u8 event_info, u8 log_page); +int nvmet_ns_start_poll(struct nvmet_ns *ns); +void nvmet_ns_stop_poll(struct nvmet_ns *ns); +void nvmet_req_prep_poll(struct nvmet_req *req); +void nvmet_req_issue_poll(struct nvmet_req *req); + +void nvmet_req_bdev_poll(struct nvmet_req *req); +void nvmet_req_poll(struct nvmet_req *req); +void nvmet_req_poll_complete(struct nvmet_req *req); +void nvmet_bdev_req_complete(struct nvmet_req *req); +void nvmet_req_done(struct nvmet_req *req); + #define NVMET_QUEUE_SIZE 1024 #define NVMET_NR_QUEUES 128 #define NVMET_MAX_CMD NVMET_QUEUE_SIZE -- 2.22.1 } diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index b6fca0e421ef..317c9b427b71 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -49,10 +49,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, 
struct nvme_id_ns *id) int nvmet_bdev_ns_enable(struct nvmet_ns *ns) { + int fmode = FMODE_READ | FMODE_WRITE; + struct request_queue *q; int ret; - ns->bdev = blkdev_get_by_path(ns->device_path, - FMODE_READ | FMODE_WRITE, NULL); + ns->bdev = blkdev_get_by_path(ns->device_path, fmode, NULL); if (IS_ERR(ns->bdev)) { ret = PTR_ERR(ns->bdev); if (ret != -ENOTBLK) { @@ -60,16 +61,21 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns) ns->device_path, PTR_ERR(ns->bdev)); } ns->bdev = NULL; - return ret; + goto out; } ns->size = i_size_read(ns->bdev->bd_inode); ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev)); - return 0; + q = bdev_get_queue(ns->bdev); + ns->poll = ns->use_poll && test_bit(QUEUE_FLAG_POLL, &q->queue_flags); + ret = ns->poll ? nvmet_ns_start_poll(ns) : 0; +out: + return ret; } void nvmet_bdev_ns_disable(struct nvmet_ns *ns) { if (ns->bdev) { + ns->poll ? nvmet_ns_stop_poll(ns) : 0; blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ); ns->bdev = NULL; } @@ -133,15 +139,34 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts) return status; } -static void nvmet_bio_done(struct bio *bio) +void nvmet_bdev_req_complete(struct nvmet_req *req) { - struct nvmet_req *req = bio->bi_private; + struct bio *bio = req->b.last_bio; nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status)); if (bio != &req->b.inline_bio) bio_put(bio); } +static void nvmet_bio_done(struct bio *bio) +{ + struct nvmet_req *req = bio->bi_private; + + req->b.last_bio = bio; + + req->poll ? 
complete(&req->wait) : nvmet_bdev_req_complete(req); +} + +void nvmet_req_bdev_poll(struct nvmet_req *req) +{ + struct request_queue *q = bdev_get_queue(req->ns->bdev); + int ret = blk_poll(q, req->b.cookie, true); + + if (ret < 0) { + pr_err("tid %d poll error %d", req->t->id, ret); + } +} + static void nvmet_bdev_execute_rw(struct nvmet_req *req) { int sg_cnt = req->sg_cnt; @@ -185,7 +210,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) bio->bi_end_io = nvmet_bio_done; bio->bi_opf = op; - blk_start_plug(&plug); + if (!req->poll) + blk_start_plug(&plug); for_each_sg(req->sg, sg, req->sg_cnt, i) { while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset) != sg->length) { @@ -204,8 +230,16 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req) sg_cnt--; } - submit_bio(bio); - blk_finish_plug(&plug); + req->b.last_bio = bio; + if (req->poll) + req->b.last_bio->bi_opf |= REQ_HIPRI; + + req->b.cookie = submit_bio(bio); + + nvmet_req_prep_poll(req); + nvmet_req_issue_poll(req); + if (!req->poll) + blk_finish_plug(&plug); } static void nvmet_bdev_execute_flush(struct nvmet_req *req) @@ -330,15 +364,19 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req) switch (cmd->common.opcode) { case nvme_cmd_read: case nvme_cmd_write: + req->poll = req->ns->poll; req->execute = nvmet_bdev_execute_rw; return 0; case nvme_cmd_flush: + req->poll = false; req->execute = nvmet_bdev_execute_flush; return 0; case nvme_cmd_dsm: + req->poll = false; req->execute = nvmet_bdev_execute_dsm; return 0; case nvme_cmd_write_zeroes: + req->poll = false; req->execute = nvmet_bdev_execute_write_zeroes; return 0; default: diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 46df45e837c9..e5991e36ed6f 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -49,11 +49,22 @@ #define IPO_IATTR_CONNECT_SQE(x) \ (cpu_to_le32(offsetof(struct nvmf_connect_command, x))) +struct nvmet_poll_data { + struct completion thread_complete; + 
wait_queue_head_t poll_waitq; + struct mutex list_lock; + struct task_struct *thread; + struct list_head list; + struct list_head done; + unsigned int id; +}; + struct nvmet_ns { struct list_head dev_link; struct percpu_ref ref; struct block_device *bdev; struct file *file; + struct nvmet_poll_data *t; bool readonly; u32 nsid; u32 blksize_shift; @@ -63,6 +74,8 @@ struct nvmet_ns { u32 anagrpid; bool buffered_io; + bool use_poll; + bool poll; bool enabled; struct nvmet_subsys *subsys; const char *device_path; @@ -292,9 +305,15 @@ struct nvmet_req { struct nvmet_ns *ns; struct scatterlist *sg; struct bio_vec inline_bvec[NVMET_MAX_INLINE_BIOVEC]; + struct completion wait; + bool poll; + struct nvmet_poll_data *t; + struct list_head poll_entry; union { struct { - struct bio inline_bio; + struct bio inline_bio; + blk_qc_t cookie; + struct bio *last_bio; } b; struct { bool mpool_alloc; @@ -442,6 +461,17 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys, void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type, u8 event_info, u8 log_page); +int nvmet_ns_start_poll(struct nvmet_ns *ns); +void nvmet_ns_stop_poll(struct nvmet_ns *ns); +void nvmet_req_prep_poll(struct nvmet_req *req); +void nvmet_req_issue_poll(struct nvmet_req *req); + +void nvmet_req_bdev_poll(struct nvmet_req *req); +void nvmet_req_poll(struct nvmet_req *req); +void nvmet_req_poll_complete(struct nvmet_req *req); +void nvmet_bdev_req_complete(struct nvmet_req *req); +void nvmet_req_done(struct nvmet_req *req); + #define NVMET_QUEUE_SIZE 1024 #define NVMET_NR_QUEUES 128 #define NVMET_MAX_CMD NVMET_QUEUE_SIZE -- 2.22.1 _______________________________________________ linux-nvme mailing list linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 12+ messages in thread
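For anyone trying this out, the new per-namespace attribute sits in the usual nvmet configfs tree next to `buffered_io`. A sketch of the toggle sequence follows; the subsystem NQN `testnqn` and namespace id `1` are placeholders, and note that the store handler rejects the write while the namespace is enabled, so it must be toggled before `enable`:

```shell
# Illustrative paths only; substitute your own subsystem NQN and nsid.
NS=/sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1

cat $NS/use_poll        # defaults to 0
echo 0 > $NS/enable     # use_poll can only change while the ns is disabled
echo 1 > $NS/use_poll
echo 1 > $NS/enable     # polling takes effect only if QUEUE_FLAG_POLL is set
```

If the backing bdev queue does not advertise `QUEUE_FLAG_POLL`, `use_poll` is accepted but `ns->poll` stays false and I/O falls back to the interrupt path.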
* Re: [RFC PATCH 0/2] nvmet: add polling support
  2019-12-12  1:01 ` [RFC PATCH 0/2] nvmet: add " Sagi Grimberg
  2019-12-12  5:44   ` Chaitanya Kulkarni
@ 2020-01-20  4:48   ` Chaitanya Kulkarni
  1 sibling, 0 replies; 12+ messages in thread
From: Chaitanya Kulkarni @ 2020-01-20 4:48 UTC (permalink / raw)
To: Sagi Grimberg, hch; +Cc: linux-nvme

On 12/11/2019 05:01 PM, Sagi Grimberg wrote:
> percpu threads per namespace? Sounds like the wrong approach. These
> threads will compete for cpu time with the main nvmet contexts.

Shouldn't placing a need_resched() check in the polling thread help with one thread hogging the CPU? Please have a look at my reply and the performance numbers for the new patch.
end of thread, other threads: [~2020-01-30 18:20 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-10  6:25 [RFC PATCH 0/2] nvmet: add polling support Chaitanya Kulkarni
2019-12-10  6:25 ` [RFC PATCH 1/2] nvmet: add bdev-ns " Chaitanya Kulkarni
2020-01-20 12:52   ` Max Gurtovoy
2020-01-21 19:22     ` Chaitanya Kulkarni
2020-01-23 14:23       ` Max Gurtovoy
2020-01-30 18:19         ` Chaitanya Kulkarni
2019-12-10  6:25 ` [RFC PATCH 2/2] nvmet: add file-ns " Chaitanya Kulkarni
2019-12-12  1:01 ` [RFC PATCH 0/2] nvmet: add " Sagi Grimberg
2019-12-12  5:44   ` Chaitanya Kulkarni
2019-12-12 20:32     ` Sagi Grimberg
2020-01-20  5:13       ` Chaitanya Kulkarni
2020-01-20  4:48   ` Chaitanya Kulkarni