linux-scsi.vger.kernel.org archive mirror
From: Long Li <longli@microsoft.com>
To: Sagi Grimberg <sagi@grimberg.me>, Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@fb.com>, Hannes Reinecke <hare@suse.com>,
	John Garry <john.garry@huawei.com>,
	Bart Van Assche <bvanassche@acm.org>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Daniel Lezcano <daniel.lezcano@linaro.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	Keith Busch <keith.busch@intel.com>,
	Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Christoph Hellwig <hch@lst.de>
Subject: RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
Date: Wed, 18 Sep 2019 00:00:16 +0000	[thread overview]
Message-ID: <CY4PR21MB0741838CE0C9D52556AA4558CE8E0@CY4PR21MB0741.namprd21.prod.outlook.com> (raw)
In-Reply-To: <f5685543-8cd5-6c6a-5b80-c77ef09c6b3b@grimberg.me>

>Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>
>Hey Ming,
>
>>>> Ok, so the real problem is per-cpu bound tasks.
>>>>
>>>> I share Thomas' opinion about a NAPI-like approach.
>>>
>>> We already have that, it's irq_poll, but it seems that for this
>>> use-case, we get lower performance for some reason. I'm not entirely
>>> sure why that is; maybe it's because we need to mask interrupts
>>> because we don't have an "arm" register in nvme like network devices
>>> have?
>>
>> Long observed that IOPS also drops a lot when switching to threaded irq.
>> If ksoftirqd is woken up to handle the softirq, the performance shouldn't
>> be better than threaded irq.
>
>It's true that it shouldn't be any faster, but what irqpoll already has, and
>we don't need to reinvent, is a proper budgeting mechanism that needs to kick
>in when multiple devices map irq vectors to the same cpu core.
>
>irqpoll already maintains a percpu list and dispatches ->poll with a budget
>that the backend enforces, multiplexing between the registered instances.
>Having this mechanism in irq (hard or threaded) context sounds a bit
>unnecessary.
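>
>(A minimal sketch of that existing dispatch loop, condensed from
>lib/irq_poll.c -- the full version also honors a one-jiffy time limit and
>the IRQ_POLL_F_SCHED state bit:)
>
>  struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
>  int budget = irq_poll_budget;   /* 256 by default */
>
>  while (!list_empty(list)) {
>          struct irq_poll *iop = list_entry(list->next, struct irq_poll, list);
>
>          if (budget <= 0) {      /* softirq window exhausted: re-raise */
>                  __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
>                  break;
>          }
>          budget -= iop->poll(iop, iop->weight);  /* backend enforces weight */
>          list_move_tail(&iop->list, list);       /* round-robin multiplexing */
>  }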
>
>It seems like we're attempting to stay in irq context for as long as we can
>instead of scheduling to softirq/thread context if we have more than a
>minimal amount of work to do. Without at least understanding why
>softirq/thread context degrades us so much, this code seems like the wrong
>approach to me. Interrupt context will always be faster, but that is not a
>sufficient reason to spend as much time as possible there, is it?
>
>We should also keep in mind that the networking stack has been doing this
>for years; I would try to understand why this cannot work for nvme before
>dismissing it.
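>
>(For illustration, a minimal sketch of that NAPI pattern; the mydev_* names
>and helpers are hypothetical, while the napi_* calls are the real API:)
>
>  #include <linux/interrupt.h>
>  #include <linux/netdevice.h>
>
>  struct mydev {
>          struct napi_struct napi;
>          /* device state, ring pointers, etc. */
>  };
>
>  /* softirq-context poll callback, registered via netif_napi_add() */
>  static int mydev_poll(struct napi_struct *napi, int budget)
>  {
>          struct mydev *dev = container_of(napi, struct mydev, napi);
>          int done = mydev_reap_completions(dev, budget); /* hypothetical */
>
>          if (done < budget) {
>                  /* ring drained: stop polling and re-arm the interrupt */
>                  napi_complete_done(napi, done);
>                  mydev_enable_irq(dev); /* the "arm" register mentioned above */
>          }
>          return done;
>  }
>
>  static irqreturn_t mydev_irq(int irq, void *data)
>  {
>          struct mydev *dev = data;
>
>          mydev_disable_irq(dev);    /* mask until polling completes */
>          napi_schedule(&dev->napi); /* defer the reaping to softirq */
>          return IRQ_HANDLED;
>  }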
>
>> In particular, Long found that context switches increased a lot after
>> applying your irq poll patch.
>>
>> http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html
>
>Oh, I didn't see that one, wonder why... thanks!
>
>5% improvement, I guess we can buy that for other users as is :)
>
>If we suffer from lots of context switches while the CPU is flooded with
>interrupts, then I would argue that we're re-raising softirq too much.
>In this use-case, my assumption is that the cpu cannot keep up with the
>interrupts and not that it doesn't reap enough (we also reap the first batch in
>interrupt context...)
>
>Perhaps making irqpoll continue until it must reschedule would improve things
>further? Although this is a latency vs. efficiency tradeoff, it looks like
>MAX_SOFTIRQ_TIME is set to 2ms:
>
>"
>  * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
>  * certain cases, such as stop_machine(), jiffies may cease to
>  * increment and so we need the MAX_SOFTIRQ_RESTART limit as
>  * well to make sure we eventually return from this method.
>  *
>  * These limits have been established via experimentation.
>  * The two things to balance is latency against fairness -
>  * we want to handle softirqs as soon as possible, but they
>  * should not be able to lock up the box.
>"
>
>Long, does this patch make any difference?

Sagi,

Sorry it took a while to bring my system back online.

With the patch, IOPS shows about the same drop as with the 1st patch. I think the excessive context switches are causing the drop in IOPS.

The following were captured by "perf sched record" for 30 seconds during the tests.

"perf sched latency"
With patch:
  fio:(82)              | 937632.706 ms |  1782255 | avg:    0.209 ms | max:   63.123 ms | max at:    768.274023 s

without patch:
  fio:(82)              |2348323.432 ms |    18848 | avg:    0.295 ms | max:   28.446 ms | max at:   6447.310255 s

(the columns are total runtime, context-switch count, and average/maximum scheduling delay: with the patch the fio tasks are switched 1,782,255 times in the 30-second window versus 18,848 without, nearly 100x as often)

Looking closer at each CPU, we can see ksoftirqd competing for CPU with fio (and effectively throttling the other fio processes).
(captured in /sys/kernel/debug/tracing via "echo sched:* > set_event"; a sketch of the full capture steps follows the traces below)

On CPU1 with patch: (note that the prev_state for fio is "R", i.e. it was preempted)
           <...>-4077  [001] d... 66456.805062: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
           <...>-17    [001] d... 66456.805859: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120
           <...>-4077  [001] d... 66456.844049: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
           <...>-17    [001] d... 66456.844607: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120

On CPU1 without patch: (the prev_state for fio is "S", i.e. it scheduled voluntarily)
          <idle>-0     [001] d...  6725.392308: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
             fio-14342 [001] d...  6725.392332: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d...  6725.392356: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
             fio-14342 [001] d...  6725.392425: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
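
(For reference, a sketch of the capture steps described above; the exact fio invocation is omitted and the 30-second window is approximated with sleep:)

  # perf side: record scheduling events for 30 seconds, then summarize
  perf sched record -- sleep 30
  perf sched latency

  # ftrace side: enable the sched:* tracepoints and collect the log
  cd /sys/kernel/debug/tracing
  echo 'sched:*' > set_event
  echo 1 > tracing_on
  cat trace > /tmp/sched-trace.txt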

Thanks

Long

>--
>diff --git a/lib/irq_poll.c b/lib/irq_poll.c
>index 2f17b488d58e..d8eab563fa77 100644
>--- a/lib/irq_poll.c
>+++ b/lib/irq_poll.c
>@@ -12,8 +12,6 @@
>  #include <linux/irq_poll.h>
>  #include <linux/delay.h>
>
>-static unsigned int irq_poll_budget __read_mostly = 256;
>-
>  static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);
>
>  /**
>@@ -77,42 +75,29 @@ EXPORT_SYMBOL(irq_poll_complete);
>
>  static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>  {
>-       struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
>-       int rearm = 0, budget = irq_poll_budget;
>-       unsigned long start_time = jiffies;
>+       struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
>+       LIST_HEAD(list);
>
>         local_irq_disable();
>+       list_splice_init(irqpoll_list, &list);
>+       local_irq_enable();
>
>-       while (!list_empty(list)) {
>+       while (!list_empty(&list)) {
>                 struct irq_poll *iop;
>                 int work, weight;
>
>-               /*
>-                * If softirq window is exhausted then punt.
>-                */
>-               if (budget <= 0 || time_after(jiffies, start_time)) {
>-                       rearm = 1;
>-                       break;
>-               }
>-
>-               local_irq_enable();
>-
>                 /* Even though interrupts have been re-enabled, this
>                  * access is safe because interrupts can only add new
>                  * entries to the tail of this list, and only ->poll()
>                  * calls can remove this head entry from the list.
>                  */
>-               iop = list_entry(list->next, struct irq_poll, list);
>+               iop = list_first_entry(&list, struct irq_poll, list);
>
>                 weight = iop->weight;
>                 work = 0;
>                 if (test_bit(IRQ_POLL_F_SCHED, &iop->state))
>                         work = iop->poll(iop, weight);
>
>-               budget -= work;
>-
>-               local_irq_disable();
>-
>                 /*
>                  * Drivers must not modify the iopoll state, if they
>                  * consume their assigned weight (or more, some drivers can't
>@@ -125,11 +110,21 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>                         if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
>                                 __irq_poll_complete(iop);
>                         else
>-                               list_move_tail(&iop->list, list);
>+                               list_move_tail(&iop->list, &list);
>                 }
>+
>+               /*
>+                * If softirq window is exhausted then punt.
>+                */
>+               if (need_resched())
>+                       break;
>         }
>
>-       if (rearm)
>+       local_irq_disable();
>+
>+       list_splice_tail_init(irqpoll_list, &list);
>+       list_splice(&list, irqpoll_list);
>+       if (!list_empty(irqpoll_list))
>                 __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
>
>         local_irq_enable();
>--
>
>Reminder to the nvme side (slightly modified):
>--
>diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>index 52205f8d90b4..09dc6da67b05 100644
>--- a/drivers/nvme/host/pci.c
>+++ b/drivers/nvme/host/pci.c
>@@ -24,6 +24,7 @@
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/sed-opal.h>
>  #include <linux/pci-p2pdma.h>
>+#include <linux/irq_poll.h>
>
>  #include "trace.h"
>  #include "nvme.h"
>@@ -32,6 +33,7 @@
>  #define CQ_SIZE(q)     ((q)->q_depth * sizeof(struct nvme_completion))
>
>  #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>+#define NVME_POLL_BUDGET_IRQ   256
>
>  /*
>   * These can be higher, but we need to ensure that any command doesn't
>@@ -189,6 +191,7 @@ struct nvme_queue {
>         u32 *dbbuf_cq_db;
>         u32 *dbbuf_sq_ei;
>         u32 *dbbuf_cq_ei;
>+       struct irq_poll iop;
>         struct completion delete_done;
>  };
>
>@@ -1014,11 +1017,29 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
>         return found;
>  }
>
>+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
>+{
>+       struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, iop);
>+       struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>+       u16 start, end;
>+       int completed;
>+
>+       completed = nvme_process_cq(nvmeq, &start, &end, budget);
>+       nvme_complete_cqes(nvmeq, start, end);
>+       if (completed < budget) {
>+               irq_poll_complete(&nvmeq->iop);
>+               enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
>+       }
>+
>+       return completed;
>+}
>+
>  static irqreturn_t nvme_irq(int irq, void *data)
>  {
>         struct nvme_queue *nvmeq = data;
>         irqreturn_t ret = IRQ_NONE;
>         u16 start, end;
>+       int budget = nvmeq->q_depth;
>
>         /*
>          * The rmb/wmb pair ensures we see all updates from a previous run of
>@@ -1027,13 +1048,23 @@ static irqreturn_t nvme_irq(int irq, void *data)
>         rmb();
>         if (nvmeq->cq_head != nvmeq->last_cq_head)
>                 ret = IRQ_HANDLED;
>-       nvme_process_cq(nvmeq, &start, &end, -1);
>+
>+       /* reap here up to a budget the size of the queue depth */
>+       do {
>+               budget -= nvme_process_cq(nvmeq, &start, &end, budget);
>+               if (start != end) {
>+                       nvme_complete_cqes(nvmeq, start, end);
>+                       ret = IRQ_HANDLED;
>+               }
>+       } while (start != end && budget > 0);
>+
>         nvmeq->last_cq_head = nvmeq->cq_head;
>         wmb();
>
>-       if (start != end) {
>-               nvme_complete_cqes(nvmeq, start, end);
>-               return IRQ_HANDLED;
>+       /* if we still have cqes to reap, schedule irqpoll */
>+       if (start != end && nvme_cqe_pending(nvmeq)) {
>+               disable_irq_nosync(irq);
>+               irq_poll_sched(&nvmeq->iop);
>         }
>
>         return ret;
>@@ -1346,6 +1377,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>
>  static void nvme_free_queue(struct nvme_queue *nvmeq)
>  {
>+       irq_poll_disable(&nvmeq->iop);
>         dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
>                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
>         if (!nvmeq->sq_cmds)
>@@ -1480,6 +1512,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth)
>         nvmeq->dev = dev;
>         spin_lock_init(&nvmeq->sq_lock);
>         spin_lock_init(&nvmeq->cq_poll_lock);
>+       irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ, nvme_irqpoll_handler);
>         nvmeq->cq_head = 0;
>         nvmeq->cq_phase = 1;
>         nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
>--


Thread overview: 65+ messages
2019-08-27  8:53 [PATCH 0/4] genirq/nvme: add IRQF_RESCUE_THREAD for avoiding IRQ flood Ming Lei
2019-08-27  8:53 ` [PATCH 1/4] softirq: implement IRQ flood detection mechanism Ming Lei
2019-08-27 14:42   ` Thomas Gleixner
2019-08-27 16:19     ` Thomas Gleixner
2019-08-27 23:04       ` Ming Lei
2019-08-27 23:12         ` Thomas Gleixner
2019-08-27 22:58     ` Ming Lei
2019-08-27 23:09       ` Thomas Gleixner
2019-08-28 11:06         ` Ming Lei
2019-08-28 11:23           ` Thomas Gleixner
2019-08-28 13:50             ` Ming Lei
2019-08-28 14:07               ` Thomas Gleixner
2019-09-03  3:30                 ` Ming Lei
2019-09-03  5:59                   ` Daniel Lezcano
2019-09-03  6:31                     ` Ming Lei
2019-09-03  6:40                       ` Daniel Lezcano
2019-09-03  7:28                         ` Ming Lei
2019-09-03  7:50                           ` Daniel Lezcano
2019-09-03  9:30                             ` Ming Lei
2019-09-04 17:07                             ` Bart Van Assche
2019-09-04 17:31                               ` Daniel Lezcano
2019-09-04 17:38                                 ` Bart Van Assche
2019-09-04 18:02                                   ` Peter Zijlstra
2019-09-04 19:47                                     ` Bart Van Assche
2019-09-05  9:11                                       ` Ming Lei
2019-09-05  9:06                                 ` Ming Lei
2019-09-05 10:37                                   ` Daniel Lezcano
2019-09-06  1:22                                     ` Long Li
2019-09-06  4:36                                       ` Daniel Lezcano
2019-09-06  4:44                                         ` Long Li
2019-09-06  1:48                                     ` Ming Lei
2019-09-06  5:14                                       ` Daniel Lezcano
2019-09-06 18:30                                         ` Sagi Grimberg
2019-09-06 18:52                                           ` Keith Busch
2019-09-07  0:01                                           ` Ming Lei
2019-09-10  3:10                                             ` Sagi Grimberg
2019-09-18  0:00                                               ` Long Li [this message]
2019-09-20 17:14                                                 ` Sagi Grimberg
2019-09-20 19:12                                                   ` Long Li
2019-09-20 20:45                                                     ` Sagi Grimberg
2019-09-24  0:57                                                       ` Long Li
2019-09-18 14:37                                               ` Ming Lei
2019-09-20 17:09                                                 ` Sagi Grimberg
2019-09-06 14:18                                       ` Keith Busch
2019-09-06 17:50                                         ` Long Li
2019-09-06 22:19                                           ` Ming Lei
2019-09-06 22:25                                             ` Keith Busch
2019-09-06 23:13                                               ` Ming Lei
2019-09-10  0:24                                             ` Ming Lei
2019-09-03  8:09                           ` Thomas Gleixner
2019-09-03  9:24                             ` Ming Lei
2019-08-29  6:15   ` Long Li
2019-08-30  0:55     ` Ming Lei
2019-08-27  8:53 ` [PATCH 2/4] genirq: add IRQF_RESCUE_THREAD Ming Lei
2019-08-27  8:53 ` [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq Ming Lei
2019-08-27  9:06   ` Johannes Thumshirn
2019-08-27  9:09     ` Ming Lei
2019-08-27  9:12       ` Johannes Thumshirn
2019-08-27 14:34       ` Keith Busch
2019-08-27 14:44         ` Keith Busch
2019-08-27 15:10   ` Bart Van Assche
2019-08-28  1:45     ` Ming Lei
2019-08-27  8:53 ` [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD Ming Lei
2019-08-27 14:35   ` Keith Busch
2019-09-06  8:50   ` John Garry
