Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
To: Ming Lei
References: <6b88719c-782a-4a63-db9f-bf62734a7874@linaro.org>
 <20190903072848.GA22170@ming.t460p>
 <6f3b6557-1767-8c80-f786-1ea667179b39@acm.org>
 <2a8bd278-5384-d82f-c09b-4fce236d2d95@linaro.org>
 <20190905090617.GB4432@ming.t460p>
 <6a36ccc7-24cd-1d92-fef1-2c5e0f798c36@linaro.org>
 <20190906014819.GB27116@ming.t460p>
 <6eb2a745-7b92-73ce-46f5-cc6a5ef08abc@grimberg.me>
 <20190907000100.GC12290@ming.t460p>
From: Sagi Grimberg
Date: Mon, 9 Sep 2019 20:10:07 -0700
In-Reply-To: <20190907000100.GC12290@ming.t460p>
Cc: Keith Busch, Hannes Reinecke, Daniel Lezcano, Bart Van Assche,
 linux-scsi@vger.kernel.org, Peter Zijlstra, Long Li, John Garry, LKML,
 linux-nvme@lists.infradead.org, Jens Axboe, Ingo Molnar, Thomas Gleixner,
 Christoph Hellwig

Hey Ming,

>>> Ok, so the real problem is per-cpu bounded tasks.
>>>
>>> I share Thomas opinion about a NAPI like approach.
>>
>> We already have that, its irq_poll, but it seems that for this
>> use-case, we get lower performance for some reason. I'm not
>> entirely sure why that is, maybe its because we need to mask interrupts
>> because we don't have an "arm" register in nvme like network devices
>> have?
>
> Long observed that IOPS drops much too by switching to threaded irq. If
> softirqd is waken up for handing softirq, the performance shouldn't
> be better than threaded irq.

It's true that it shouldn't be any faster, but what irqpoll already has,
and we don't need to reinvent, is a proper budgeting mechanism that needs
to kick in when multiple devices map irq vectors to the same cpu core.

irqpoll already maintains a per-cpu list and dispatches each ->poll with
a budget that the backend enforces, while irqpoll multiplexes between
them. Having this mechanism in irq (hard or threaded) context sounds a
bit unnecessary.

It seems like we're attempting to stay in irq context for as long as we
can instead of scheduling to softirq/thread context if we have more than
a minimal amount of work to do. Without at least understanding why
softirq/thread degrades us so much, this code seems like the wrong
approach to me.

Interrupt context will always be faster, but that is not a sufficient
reason to spend as much time as possible there, is it?

We should also keep in mind that the networking stack has been doing
this for years; I would try to understand why this cannot work for nvme
before dismissing it.
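Just to spell out the pattern I'm referring to, below is a minimal
driver-side sketch of the irq_poll flow (illustrative only: "my_queue"
and its helpers are made-up placeholders; only the irq_poll_*,
disable_irq_nosync and enable_irq calls are the real interfaces):

--
#include <linux/interrupt.h>
#include <linux/irq_poll.h>

/* Hypothetical queue; a real driver has its own per-queue structure. */
struct my_queue {
	struct irq_poll iop;
	int irq;
};

/* Placeholder for device-specific completion reaping (hypothetical). */
static int my_queue_reap_completions(struct my_queue *q, int budget)
{
	return 0;	/* a real driver would consume up to 'budget' CQEs here */
}

/* Hard irq handler: mask the source and hand the work to irq_poll. */
static irqreturn_t my_irq(int irq, void *data)
{
	struct my_queue *q = data;

	disable_irq_nosync(irq);	/* no "arm" register, so mask in software */
	irq_poll_sched(&q->iop);	/* queue our ->poll() on this cpu's list */
	return IRQ_HANDLED;
}

/* ->poll() callback: reap up to 'budget', then yield to the next instance. */
static int my_poll(struct irq_poll *iop, int budget)
{
	struct my_queue *q = container_of(iop, struct my_queue, iop);
	int done = my_queue_reap_completions(q, budget);

	if (done < budget) {
		/* drained: drop off the per-cpu list and unmask the irq */
		irq_poll_complete(iop);
		enable_irq(q->irq);
	}
	return done;
}

static void my_queue_irqpoll_init(struct my_queue *q)
{
	/* 64 is an arbitrary per-invocation budget for this sketch */
	irq_poll_init(&q->iop, 64, my_poll);
}
--

The point being that the budget enforcement and the multiplexing between
instances sharing a cpu live in lib/irq_poll.c, not in each driver's
hard irq handler.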
> Especially, Long found that context
> switch is increased a lot after applying your irq poll patch.
>
> http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html

Oh, I didn't see that one, wonder why... thanks!

5% improvement, I guess we can buy that for other users as is :)

If we suffer from lots of context switches while the CPU is flooded with
interrupts, then I would argue that we're re-raising softirq too much.
In this use-case, my assumption is that the cpu cannot keep up with the
interrupts, not that it doesn't reap enough (we also reap the first
batch in interrupt context...).

Perhaps making irqpoll continue until it must resched would improve
things further? Although this is a latency vs. efficiency tradeoff, it
looks like MAX_SOFTIRQ_TIME is set to 2ms:

"
 * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
 * certain cases, such as stop_machine(), jiffies may cease to
 * increment and so we need the MAX_SOFTIRQ_RESTART limit as
 * well to make sure we eventually return from this method.
 *
 * These limits have been established via experimentation.
 * The two things to balance is latency against fairness -
 * we want to handle softirqs as soon as possible, but they
 * should not be able to lock up the box.
"

Long, does this patch make any difference?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 2f17b488d58e..d8eab563fa77 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -12,8 +12,6 @@
 #include
 #include
 
-static unsigned int irq_poll_budget __read_mostly = 256;
-
 static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);
 
 /**
@@ -77,42 +75,29 @@ EXPORT_SYMBOL(irq_poll_complete);
 
 static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 {
-	struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
-	int rearm = 0, budget = irq_poll_budget;
-	unsigned long start_time = jiffies;
+	struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
+	LIST_HEAD(list);
 
 	local_irq_disable();
+	list_splice_init(irqpoll_list, &list);
+	local_irq_enable();
 
-	while (!list_empty(list)) {
+	while (!list_empty(&list)) {
 		struct irq_poll *iop;
 		int work, weight;
 
-		/*
-		 * If softirq window is exhausted then punt.
-		 */
-		if (budget <= 0 || time_after(jiffies, start_time)) {
-			rearm = 1;
-			break;
-		}
-
-		local_irq_enable();
-
 		/* Even though interrupts have been re-enabled, this
 		 * access is safe because interrupts can only add new
 		 * entries to the tail of this list, and only ->poll()
 		 * calls can remove this head entry from the list.
 		 */
-		iop = list_entry(list->next, struct irq_poll, list);
+		iop = list_first_entry(&list, struct irq_poll, list);
 
 		weight = iop->weight;
 		work = 0;
 		if (test_bit(IRQ_POLL_F_SCHED, &iop->state))
 			work = iop->poll(iop, weight);
 
-		budget -= work;
-
-		local_irq_disable();
-
 		/*
 		 * Drivers must not modify the iopoll state, if they
 		 * consume their assigned weight (or more, some drivers can't
@@ -125,11 +110,21 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 			if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
 				__irq_poll_complete(iop);
 			else
-				list_move_tail(&iop->list, list);
+				list_move_tail(&iop->list, &list);
 		}
+
+		/*
+		 * If softirq window is exhausted then punt.
+		 */
+		if (need_resched())
+			break;
 	}
 
-	if (rearm)
+	local_irq_disable();
+
+	list_splice_tail_init(irqpoll_list, &list);
+	list_splice(&list, irqpoll_list);
+	if (!list_empty(irqpoll_list))
 		__raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
 
 	local_irq_enable();
--

Reminder to the nvme side (slightly modified):
--
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 52205f8d90b4..09dc6da67b05 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 
 #include "trace.h"
 #include "nvme.h"
@@ -32,6 +33,7 @@
 
 #define CQ_SIZE(q)	((q)->q_depth * sizeof(struct nvme_completion))
 #define SGES_PER_PAGE	(PAGE_SIZE / sizeof(struct nvme_sgl_desc))
+#define NVME_POLL_BUDGET_IRQ	256
 
 /*
  * These can be higher, but we need to ensure that any command doesn't
@@ -189,6 +191,7 @@ struct nvme_queue {
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
 	u32 *dbbuf_cq_ei;
+	struct irq_poll iop;
 	struct completion delete_done;
 };
 
@@ -1014,11 +1017,29 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
 	return found;
 }
 
+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
+{
+	struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, iop);
+	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
+	u16 start, end;
+	int completed;
+
+	completed = nvme_process_cq(nvmeq, &start, &end, budget);
+	nvme_complete_cqes(nvmeq, start, end);
+	if (completed < budget) {
+		irq_poll_complete(&nvmeq->iop);
+		enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
+	}
+
+	return completed;
+}
+
 static irqreturn_t nvme_irq(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
 	irqreturn_t ret = IRQ_NONE;
 	u16 start, end;
+	int budget = nvmeq->q_depth;
 
 	/*
 	 * The rmb/wmb pair ensures we see all updates from a previous run of
@@ -1027,13 +1048,23 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	rmb();
 	if (nvmeq->cq_head != nvmeq->last_cq_head)
 		ret = IRQ_HANDLED;
-	nvme_process_cq(nvmeq, &start, &end, -1);
+
+	/* reap here up to a budget of the size the queue depth */
+	do {
+		budget -= nvme_process_cq(nvmeq, &start, &end, budget);
+		if (start != end) {
+			nvme_complete_cqes(nvmeq, start, end);
+			ret = IRQ_HANDLED;
+		}
+	} while (start != end && budget > 0);
+
 	nvmeq->last_cq_head = nvmeq->cq_head;
 	wmb();
 
-	if (start != end) {
-		nvme_complete_cqes(nvmeq, start, end);
-		return IRQ_HANDLED;
+	/* if we still have cqes to reap, schedule irqpoll */
+	if (start != end && nvme_cqe_pending(nvmeq)) {
+		disable_irq_nosync(irq);
+		irq_poll_sched(&nvmeq->iop);
 	}
 
 	return ret;
@@ -1346,6 +1377,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 
 static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
+	irq_poll_disable(&nvmeq->iop);
 	dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
 				(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
 	if (!nvmeq->sq_cmds)
@@ -1480,6 +1512,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth)
 	nvmeq->dev = dev;
 	spin_lock_init(&nvmeq->sq_lock);
 	spin_lock_init(&nvmeq->cq_poll_lock);
+	irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ, nvme_irqpoll_handler);
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
--

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme