From: "Nadolski, Edmund" <edmund.nadolski@intel.com> To: Ming Lei <ming.lei@redhat.com>, linux-nvme@lists.infradead.org Cc: Jens Axboe <axboe@fb.com>, Keith Busch <kbusch@kernel.org>, Long Li <longli@microsoft.com>, Sagi Grimberg <sagi@grimberg.me>, Christoph Hellwig <hch@lst.de> Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue Date: Tue, 12 Nov 2019 11:11:40 -0700 Message-ID: <c4a160fc-5440-c245-ec42-71fee027ba45@intel.com> (raw) In-Reply-To: <20191108035508.26395-3-ming.lei@redhat.com> On 11/7/2019 8:55 PM, Ming Lei wrote: > f9dde187fa92("nvme-pci: remove cq check after submission") removes > cq check after submission, this change actually causes performance > regression on some NVMe drive in which single nvmeq handles requests > originated from more than one blk-mq sw queues(call it multi-mapping > queue). > > Actually polling IO after submission can handle IO more efficiently, > especially for multi-mapping queue: > > 1) the poll itself is very cheap, and lockless check on cq is enough, > see nvme_cqe_pending(). Especially the check can be done after batch > submission is done. > > 2) when IO completion is observed via the poll in submission, the requst > may be completed without interrupt involved, or the interrupt handler > overload can be decreased. > > 3) when single sw queue is submiting IOs to this hw queue, if IOs completion > is observed via this poll, the IO can be completed locally, which is > cheaper than remote completion. > > Follows test result done on Azure L80sv2 guest with NVMe drive( > Microsoft Corporation Device b111). This guest has 80 CPUs and 10 > numa nodes, and each NVMe drive supports 8 hw queues. > > 1) test script: > fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1 \ > --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \ > --direct=1 --runtime=30 --numjobs=1 --rw=randread \ > --name=test --group_reporting --gtod_reduce=1 > > 2) test result: > | v5.3 | v5.3 with this patchset > ------------------------------------------- > IOPS | 130K | 424K > > Given IO is handled more efficiently in this way, the original report > of CPU lockup[1] on Hyper-V can't be observed any more after this patch > is applied. This issue is usually triggered when running IO from all > CPUs concurrently. > > I also run test on Optane(32 hw queues) in big machine(96 cores, 2 numa > nodes), small improvement is observed on running the above fio over two > NVMe drive with batch 1. > > [1] https://lore.kernel.org/lkml/1566281669-48212-1-git-send-email-longli@linuxonhyperv.com > > Cc: Keith Busch <kbusch@kernel.org> > Cc: Jens Axboe <axboe@fb.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Sagi Grimberg <sagi@grimberg.me> > Cc: Long Li <longli@microsoft.com> > Fixes: f9dde187fa92("nvme-pci: remove cq check after submission") > Signed-off-by: Ming Lei <ming.lei@redhat.com> > --- > drivers/nvme/host/pci.c | 90 ++++++++++++++++++++++++++++++++++++----- > 1 file changed, 80 insertions(+), 10 deletions(-) > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index 5b20ab4d21d3..880376f43897 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -83,6 +83,7 @@ struct nvme_queue; > > static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown); > static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode); > +static void nvme_poll_in_submission(struct nvme_queue *nvmeq); > > /* > * Represents an NVM Express device. Each nvme_dev is a PCI function. 
> @@ -165,7 +166,10 @@ struct nvme_queue {
>         spinlock_t sq_lock;
>         void *sq_cmds;
>          /* only used for poll queues: */
> -       spinlock_t cq_poll_lock ____cacheline_aligned_in_smp;
> +       union {
> +               spinlock_t cq_poll_lock;
> +               spinlock_t cq_lock;
> +       }____cacheline_aligned_in_smp;

Is the new lock intended to protect anything differently than the old
lock?

>         volatile struct nvme_completion *cqes;
>         struct blk_mq_tags **tags;
>         dma_addr_t sq_dma_addr;
> @@ -185,6 +189,7 @@ struct nvme_queue {
>  #define NVMEQ_SQ_CMB           1
>  #define NVMEQ_DELETE_ERROR     2
>  #define NVMEQ_POLLED           3
> +#define NVMEQ_MULTI_MAPPING    4
>         u32 *dbbuf_sq_db;
>         u32 *dbbuf_cq_db;
>         u32 *dbbuf_sq_ei;
> @@ -911,6 +916,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>
>         blk_mq_start_request(req);
>         nvme_submit_cmd(nvmeq, &cmnd, bd->last);
> +
> +       if (bd->last)
> +               nvme_poll_in_submission(nvmeq);
> +
>         return BLK_STS_OK;
>  out_unmap_data:
>         nvme_unmap_data(dev, req);
> @@ -1016,6 +1025,19 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
>         return found;
>  }
>
> +static inline irqreturn_t
> +nvme_update_cq(struct nvme_queue *nvmeq, u16 *start, u16 *end)
> +{
> +       irqreturn_t ret = IRQ_NONE;
> +
> +       if (nvmeq->cq_head != nvmeq->last_cq_head)
> +               ret = IRQ_HANDLED;
> +       nvme_process_cq(nvmeq, start, end, -1);
> +       nvmeq->last_cq_head = nvmeq->cq_head;
> +
> +       return ret;
> +}
> +
>  static irqreturn_t nvme_irq(int irq, void *data)
>  {
>         struct nvme_queue *nvmeq = data;
> @@ -1027,10 +1049,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
>          * the irq handler, even if that was on another CPU.
>          */
>         rmb();
> -       if (nvmeq->cq_head != nvmeq->last_cq_head)
> -               ret = IRQ_HANDLED;
> -       nvme_process_cq(nvmeq, &start, &end, -1);
> -       nvmeq->last_cq_head = nvmeq->cq_head;
> +       ret = nvme_update_cq(nvmeq, &start, &end);
>         wmb();
>
>         if (start != end) {
> @@ -1041,6 +1060,24 @@ static irqreturn_t nvme_irq(int irq, void *data)
>         return ret;
>  }
>
> +static irqreturn_t nvme_irq_multi_mapping(int irq, void *data)
> +{
> +       struct nvme_queue *nvmeq = data;
> +       irqreturn_t ret = IRQ_NONE;
> +       u16 start, end;
> +
> +       spin_lock(&nvmeq->cq_lock);
> +       ret = nvme_update_cq(nvmeq, &start, &end);
> +       spin_unlock(&nvmeq->cq_lock);
> +
> +       if (start != end) {
> +               nvme_complete_cqes(nvmeq, start, end);
> +               return IRQ_HANDLED;
> +       }
> +
> +       return ret;
> +}
> +
>  static irqreturn_t nvme_irq_check(int irq, void *data)
>  {
>         struct nvme_queue *nvmeq = data;
> @@ -1049,6 +1086,24 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
>         return IRQ_NONE;
>  }
>
> +static void nvme_poll_in_submission(struct nvme_queue *nvmeq)
> +{
> +       if (test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags) &&
> +                       nvme_cqe_pending(nvmeq)) {
> +               unsigned long flags;
> +
> +               if (spin_trylock_irqsave(&nvmeq->cq_lock, flags)) {
> +                       u16 start, end;
> +
> +                       nvme_update_cq(nvmeq, &start, &end);
> +                       spin_unlock_irqrestore(&nvmeq->cq_lock, flags);
> +
> +                       if (start != end)
> +                               nvme_complete_cqes(nvmeq, start, end);
> +               }
> +       }
> +}

Just a nit: to me it reads simpler to return right away when the first
test is false, rather than enclose the true path in an additional
nesting level.
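Something along these lines, for example (completely untested, and just
a rearrangement of the function quoted above, so only a sketch of the
shape I mean):

static void nvme_poll_in_submission(struct nvme_queue *nvmeq)
{
	unsigned long flags;
	u16 start, end;

	/* Only poll multi-mapping queues, and only when a CQE is pending. */
	if (!test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags) ||
	    !nvme_cqe_pending(nvmeq))
		return;

	/* Don't spin in the submission path; let the irq handler reap instead. */
	if (!spin_trylock_irqsave(&nvmeq->cq_lock, flags))
		return;

	nvme_update_cq(nvmeq, &start, &end);
	spin_unlock_irqrestore(&nvmeq->cq_lock, flags);

	if (start != end)
		nvme_complete_cqes(nvmeq, start, end);
}

Functionally the same either way, of course; it is purely a readability
preference.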
Thanks,
Ed

>  /*
>   * Poll for completions any queue, including those not dedicated to polling.
>   * Can be called from any context.
> @@ -1499,12 +1554,14 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
>  {
>         struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>         int nr = nvmeq->dev->ctrl.instance;
> +       irq_handler_t handler = test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags) ?
> +               nvme_irq_multi_mapping : nvme_irq;
>
>         if (use_threaded_interrupts) {
>                 return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> -                               nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> +                               handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>         } else {
> -               return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> +               return pci_request_irq(pdev, nvmeq->cq_vector, handler,
>                                 NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>         }
>  }
> @@ -1514,7 +1571,13 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
>         struct nvme_dev *dev = nvmeq->dev;
>
>         spin_lock_init(&nvmeq->sq_lock);
> -       spin_lock_init(&nvmeq->cq_poll_lock);
> +
> +       if (test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags)) {
> +               WARN_ON_ONCE(test_bit(NVMEQ_POLLED, &nvmeq->flags));
> +               spin_lock_init(&nvmeq->cq_lock);
> +       } else {
> +               spin_lock_init(&nvmeq->cq_poll_lock);
> +       }
>
>         nvmeq->sq_tail = 0;
>         nvmeq->last_sq_tail = 0;
> @@ -1534,15 +1597,22 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
>         u16 vector = 0;
>
>         clear_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
> +       clear_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags);
>
>         /*
>          * A queue's vector matches the queue identifier unless the controller
>          * has only one vector available.
>          */
> -       if (!polled)
> +       if (!polled) {
> +               struct pci_dev *pdev = to_pci_dev(dev->dev);
> +
>                 vector = dev->num_vecs == 1 ? 0 : qid;
> -       else
> +               if (vector && cpumask_weight(pci_irq_get_affinity(pdev,
> +                                               vector)) > 1)
> +                       set_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags);
> +       } else {
>                 set_bit(NVMEQ_POLLED, &nvmeq->flags);
> +       }
>
>         result = adapter_alloc_cq(dev, qid, nvmeq, vector);
>         if (result)
>

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme