From: Ming Lei <ming.lei@redhat.com>
To: linux-nvme@lists.infradead.org
Cc: Sagi Grimberg <sagi@grimberg.me>, Long Li <longli@microsoft.com>,
	Ming Lei <ming.lei@redhat.com>, Jens Axboe <axboe@fb.com>,
	Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>
Subject: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
Date: Fri, 8 Nov 2019 11:55:08 +0800
Message-ID: <20191108035508.26395-3-ming.lei@redhat.com> (raw)
In-Reply-To: <20191108035508.26395-1-ming.lei@redhat.com>

Commit f9dde187fa92 ("nvme-pci: remove cq check after submission") removed
the cq check after submission. That change actually causes a performance
regression on some NVMe drives in which a single nvmeq handles requests
originating from more than one blk-mq sw queue (call it a multi-mapping
queue).

Polling IO after submission can handle IO more efficiently, especially
for a multi-mapping queue:

1) the poll itself is very cheap, and a lockless check on the cq is
enough, see nvme_cqe_pending(). In particular, the check can be done
after batch submission is done.

2) when IO completion is observed via the poll in submission, the
request may be completed without any interrupt involved, or the
interrupt handler load can be decreased.

3) when a single sw queue is submitting IOs to this hw queue, if IO
completion is observed via this poll, the IO can be completed locally,
which is cheaper than remote completion.

The following test was done on an Azure L80sv2 guest with an NVMe drive
(Microsoft Corporation Device b111). This guest has 80 CPUs and 10 NUMA
nodes, and each NVMe drive supports 8 hw queues.

1) test script:
fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1 \
	--iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
	--direct=1 --runtime=30 --numjobs=1 --rw=randread \
	--name=test --group_reporting --gtod_reduce=1

2) test result:
     | v5.3 | v5.3 with this patchset
-------------------------------------------
IOPS | 130K | 424K

Given that IO is handled more efficiently this way, the originally
reported CPU lockup [1] on Hyper-V can no longer be observed after this
patch is applied. That issue is usually triggered by running IO from all
CPUs concurrently.

I also ran the test on Optane (32 hw queues) in a big machine (96 cores,
2 NUMA nodes); a small improvement is observed when running the above
fio over two NVMe drives with batch 1.

[1] https://lore.kernel.org/lkml/1566281669-48212-1-git-send-email-longli@linuxonhyperv.com

Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Long Li <longli@microsoft.com>
Fixes: f9dde187fa92 ("nvme-pci: remove cq check after submission")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 90 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 80 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5b20ab4d21d3..880376f43897 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -83,6 +83,7 @@ struct nvme_queue;
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
+static void nvme_poll_in_submission(struct nvme_queue *nvmeq);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -165,7 +166,10 @@ struct nvme_queue {
 	spinlock_t sq_lock;
 	void *sq_cmds;
 	 /* only used for poll queues: */
-	spinlock_t cq_poll_lock ____cacheline_aligned_in_smp;
+	union {
+		spinlock_t cq_poll_lock;
+		spinlock_t cq_lock;
+	}____cacheline_aligned_in_smp;
 	volatile struct nvme_completion *cqes;
 	struct blk_mq_tags **tags;
 	dma_addr_t sq_dma_addr;
@@ -185,6 +189,7 @@ struct nvme_queue {
 #define NVMEQ_SQ_CMB		1
 #define NVMEQ_DELETE_ERROR	2
 #define NVMEQ_POLLED		3
+#define NVMEQ_MULTI_MAPPING	4
 	u32 *dbbuf_sq_db;
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
@@ -911,6 +916,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(req);
 	nvme_submit_cmd(nvmeq, &cmnd, bd->last);
+
+	if (bd->last)
+		nvme_poll_in_submission(nvmeq);
+
 	return BLK_STS_OK;
 out_unmap_data:
 	nvme_unmap_data(dev, req);
@@ -1016,6 +1025,19 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
 	return found;
 }
 
+static inline irqreturn_t
+nvme_update_cq(struct nvme_queue *nvmeq, u16 *start, u16 *end)
+{
+	irqreturn_t ret = IRQ_NONE;
+
+	if (nvmeq->cq_head != nvmeq->last_cq_head)
+		ret = IRQ_HANDLED;
+	nvme_process_cq(nvmeq, start, end, -1);
+	nvmeq->last_cq_head = nvmeq->cq_head;
+
+	return ret;
+}
+
 static irqreturn_t nvme_irq(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
@@ -1027,10 +1049,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	 * the irq handler, even if that was on another CPU.
 	 */
 	rmb();
-	if (nvmeq->cq_head != nvmeq->last_cq_head)
-		ret = IRQ_HANDLED;
-	nvme_process_cq(nvmeq, &start, &end, -1);
-	nvmeq->last_cq_head = nvmeq->cq_head;
+	ret = nvme_update_cq(nvmeq, &start, &end);
 	wmb();
 
 	if (start != end) {
@@ -1041,6 +1060,24 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	return ret;
 }
 
+static irqreturn_t nvme_irq_multi_mapping(int irq, void *data)
+{
+	struct nvme_queue *nvmeq = data;
+	irqreturn_t ret = IRQ_NONE;
+	u16 start, end;
+
+	spin_lock(&nvmeq->cq_lock);
+	ret = nvme_update_cq(nvmeq, &start, &end);
+	spin_unlock(&nvmeq->cq_lock);
+
+	if (start != end) {
+		nvme_complete_cqes(nvmeq, start, end);
+		return IRQ_HANDLED;
+	}
+
+	return ret;
+}
+
 static irqreturn_t nvme_irq_check(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
@@ -1049,6 +1086,24 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 	return IRQ_NONE;
 }
 
+static void nvme_poll_in_submission(struct nvme_queue *nvmeq)
+{
+	if (test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags) &&
+			nvme_cqe_pending(nvmeq)) {
+		unsigned long flags;
+
+		if (spin_trylock_irqsave(&nvmeq->cq_lock, flags)) {
+			u16 start, end;
+
+			nvme_update_cq(nvmeq, &start, &end);
+			spin_unlock_irqrestore(&nvmeq->cq_lock, flags);
+
+			if (start != end)
+				nvme_complete_cqes(nvmeq, start, end);
+		}
+	}
+}
+
 /*
  * Poll for completions any queue, including those not dedicated to polling.
  * Can be called from any context.
@@ -1499,12 +1554,14 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
 {
 	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
 	int nr = nvmeq->dev->ctrl.instance;
+	irq_handler_t handler = test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags) ?
+		nvme_irq_multi_mapping : nvme_irq;
 
 	if (use_threaded_interrupts) {
 		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
-				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
+				handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 	} else {
-		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
+		return pci_request_irq(pdev, nvmeq->cq_vector, handler,
 				NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 	}
 }
@@ -1514,7 +1571,13 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	struct nvme_dev *dev = nvmeq->dev;
 
 	spin_lock_init(&nvmeq->sq_lock);
-	spin_lock_init(&nvmeq->cq_poll_lock);
+
+	if (test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags)) {
+		WARN_ON_ONCE(test_bit(NVMEQ_POLLED, &nvmeq->flags));
+		spin_lock_init(&nvmeq->cq_lock);
+	} else {
+		spin_lock_init(&nvmeq->cq_poll_lock);
+	}
 
 	nvmeq->sq_tail = 0;
 	nvmeq->last_sq_tail = 0;
@@ -1534,15 +1597,22 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 	u16 vector = 0;
 
 	clear_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
+	clear_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags);
 
 	/*
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
	 */
-	if (!polled)
+	if (!polled) {
+		struct pci_dev *pdev = to_pci_dev(dev->dev);
+
 		vector = dev->num_vecs == 1 ? 0 : qid;
-	else
+		if (vector && cpumask_weight(pci_irq_get_affinity(pdev,
+						vector)) > 1)
+			set_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags);
+	} else {
 		set_bit(NVMEQ_POLLED, &nvmeq->flags);
+	}
 
 	result = adapter_alloc_cq(dev, qid, nvmeq, vector);
 	if (result)
-- 
2.20.1
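
[Editorial note for context, not part of the patch] Point 1) of the commit
message relies on nvme_cqe_pending() for the lockless check; that helper is
not modified by this diff. In the v5.3-era driver it looks roughly like the
sketch below (details may differ slightly between kernel versions): it only
compares the phase bit in the CQ entry at the current head against the
queue's expected phase, so no lock is needed just to see whether a
completion is waiting.

	static inline bool nvme_cqe_pending(struct nvme_queue *nvmeq)
	{
		/* the low bit of the CQE status word is the phase bit */
		return (le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].status) & 1) ==
				nvmeq->cq_phase;
	}

Only when this check fires does nvme_poll_in_submission() above attempt to
take cq_lock, so the common case of an empty CQ costs essentially one read
of the current CQ entry.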