From: Long Li <longli@microsoft.com>
To: Sagi Grimberg <sagi@grimberg.me>,
	"longli@linuxonhyperv.com" <longli@linuxonhyperv.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Keith Busch <keith.busch@intel.com>, Jens Axboe <axboe@fb.com>,
	Christoph Hellwig <hch@lst.de>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
Date: Sat, 24 Aug 2019 00:13:47 +0000	[thread overview]
Message-ID: <CY4PR21MB0741292F0C535156DAC690D7CEA70@CY4PR21MB0741.namprd21.prod.outlook.com> (raw)
In-Reply-To: <7fbdf43a-9499-8fb3-f6ec-5f1027b9fb65@grimberg.me>

>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>
>>>> Sagi,
>>>>
>>>> Here are the test results.
>>>>
>>>> Benchmark command:
>>>> fio --bs=4k --ioengine=libaio --iodepth=64 \
>>>>   --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 \
>>>>   --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test \
>>>>   --group_reporting --gtod_reduce=1
>>>>
>>>> With your patch: 1720k IOPS
>>>> With threaded interrupts: 1320k IOPS
>>>> With just interrupts: 3720k IOPS
>>>>
>>>> Interrupts are the fastest but we need to find a way to throttle it.
>>>
>>>This is the workload that generates the flood?
>>>If so I did not expect that this would be the perf difference..
>>>
>>>If completions keep pounding on the cpu, I would expect irqpoll to simply
>>>keep running forever and poll the cqs. There is no fundamental reason why
>>>polling would be faster in an interrupt, what matters could be:
>>>1. we reschedule more than we need to
>>>2. we don't reap enough completions in every poll round, which will trigger
>>>rearming the interrupt and then when it fires reschedule another softirq...
>>>

Yes, I think it's the rescheduling that takes up the time. With the patch there is a lot of ksoftirqd activity (compared to nearly none without the patch).
A 90-second fio run shows a big difference in the number of context switches summed over all CPUs:
With patch: 5755849
Without patch: 1462931
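
For anyone reproducing the comparison: the totals above are system-wide context switch counts over the 90-second run. A minimal sketch of how such a number can be collected, assuming the "ctxt" counter in /proc/stat is an acceptable source (this is an illustration, not the exact script used for the numbers above):

#include <stdio.h>

/*
 * Sample the system-wide context switch total from the "ctxt" line of
 * /proc/stat; read it before and after the fio run and subtract.
 */
static unsigned long long read_ctxt(void)
{
	FILE *f = fopen("/proc/stat", "r");
	char line[256];
	unsigned long long ctxt = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "ctxt %llu", &ctxt) == 1)
			break;
	fclose(f);
	return ctxt;
}

int main(void)
{
	unsigned long long before = read_ctxt();

	/* ... run the 90-second fio job here ... */

	printf("context switches during run: %llu\n",
	       read_ctxt() - before);
	return 0;
}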

>>>Maybe we need to take care of some irq_poll optimizations?
>>>
>>>Does this (untested) patch make any difference?
>>>--
>>>diff --git a/lib/irq_poll.c b/lib/irq_poll.c
>>>index 2f17b488d58e..0e94183eba15 100644
>>>--- a/lib/irq_poll.c
>>>+++ b/lib/irq_poll.c
>>>@@ -12,7 +12,8 @@
>>>  #include <linux/irq_poll.h>
>>>  #include <linux/delay.h>
>>>
>>>-static unsigned int irq_poll_budget __read_mostly = 256;
>>>+static unsigned int irq_poll_budget __read_mostly = 3000;
>>>+unsigned int __read_mostly irqpoll_budget_usecs = 2000;
>>>
>>>  static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);
>>>
>>>@@ -77,32 +78,26 @@ EXPORT_SYMBOL(irq_poll_complete);
>>>
>>>  static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>>>  {
>>>-       struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
>>>-       int rearm = 0, budget = irq_poll_budget;
>>>-       unsigned long start_time = jiffies;
>>>+       struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
>>>+       unsigned int budget = irq_poll_budget;
>>>+       unsigned long time_limit =
>>>+               jiffies + usecs_to_jiffies(irqpoll_budget_usecs);
>>>+       LIST_HEAD(list);
>>>
>>>         local_irq_disable();
>>>+       list_splice_init(irqpoll_list, &list);
>>>+       local_irq_enable();
>>>
>>>-       while (!list_empty(list)) {
>>>+       while (!list_empty(&list)) {
>>>                 struct irq_poll *iop;
>>>                 int work, weight;
>>>
>>>-               /*
>>>-                * If softirq window is exhausted then punt.
>>>-                */
>>>-               if (budget <= 0 || time_after(jiffies, start_time)) {
>>>-                       rearm = 1;
>>>-                       break;
>>>-               }
>>>-
>>>-               local_irq_enable();
>>>-
>>>                 /* Even though interrupts have been re-enabled, this
>>>                  * access is safe because interrupts can only add new
>>>                  * entries to the tail of this list, and only ->poll()
>>>                  * calls can remove this head entry from the list.
>>>                  */
>>>-               iop = list_entry(list->next, struct irq_poll, list);
>>>+               iop = list_first_entry(&list, struct irq_poll, list);
>>>
>>>                 weight = iop->weight;
>>>                 work = 0;
>>>@@ -111,8 +106,6 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>>>
>>>                 budget -= work;
>>>
>>>-               local_irq_disable();
>>>-
>>>                 /*
>>>                  * Drivers must not modify the iopoll state, if they
>>>                  * consume their assigned weight (or more, some drivers can't
>>>@@ -125,11 +118,21 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>>>                         if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
>>>                                 __irq_poll_complete(iop);
>>>                         else
>>>-                               list_move_tail(&iop->list, list);
>>>+                               list_move_tail(&iop->list, &list);
>>>                 }
>>>+
>>>+               /*
>>>+                * If softirq window is exhausted then punt.
>>>+                */
>>>+               if (budget <= 0 || time_after_eq(jiffies, time_limit))
>>>+                       break;
>>>         }
>>>
>>>-       if (rearm)
>>>+       local_irq_disable();
>>>+
>>>+       list_splice_tail_init(irqpoll_list, &list);
>>>+       list_splice(&list, irqpoll_list);
>>>+       if (!list_empty(irqpoll_list))
>>>                 __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
>>>
>>>         local_irq_enable();
>>>--

It gets slightly better IOPS. With this 2nd patch, the number of context switches drops to 5445863 (~5% fewer than with the 1st patch).
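
A side note on the punt logic for anyone comparing the two versions: the old check, time_after(jiffies, start_time), fires as soon as jiffies advances at all, so the softirq window was at most one jiffy; the patched check compares against an explicit time_limit derived from irqpoll_budget_usecs. A userspace rendering of the wrap-safe jiffies comparisons (my paraphrase of the include/linux/jiffies.h macros, for illustration only):

#include <stdio.h>

/*
 * Paraphrase of the kernel's wrap-safe time comparisons: time_after(a, b)
 * is true when a is strictly later than b, and stays correct across
 * counter wraparound.
 */
#define time_after(a, b)	((long)((b) - (a)) < 0)
#define time_after_eq(a, b)	((long)((a) - (b)) >= 0)

int main(void)
{
	unsigned long start_time = (unsigned long)-2;	/* counter about to wrap */
	unsigned long jiffies = start_time + 3;		/* wrapped past zero */
	unsigned long time_limit = start_time + 2;

	/* Old check: true once jiffies moves past start_time at all. */
	printf("time_after(jiffies, start_time) = %d\n",
	       time_after(jiffies, start_time));

	/* New check: true only once the explicit budget expires. */
	printf("time_after_eq(jiffies, time_limit) = %d\n",
	       time_after_eq(jiffies, time_limit));
	return 0;
}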


Thread overview: 30+ messages
2019-08-20  6:14 [PATCH 0/3] fix interrupt swamp in NVMe longli
2019-08-20  6:14 ` [PATCH 1/3] sched: define a function to report the number of context switches on a CPU longli
2019-08-20  9:38   ` Peter Zijlstra
2019-08-21  8:20     ` Long Li
2019-08-21 10:34       ` Peter Zijlstra
2019-08-20  9:39   ` Peter Zijlstra
2019-08-20  6:14 ` [PATCH 2/3] sched: export idle_cpu() longli
2019-08-20  6:14 ` [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts longli
2019-08-20  9:52   ` Peter Zijlstra
2019-08-21  8:37     ` Long Li
2019-08-21 10:35       ` Peter Zijlstra
2019-08-20 17:33   ` Sagi Grimberg
2019-08-21  8:39     ` Long Li
2019-08-21 17:36       ` Long Li
2019-08-21 21:54         ` Sagi Grimberg
2019-08-24  0:13           ` Long Li [this message]
2019-08-23  3:21     ` Ming Lei
2019-08-24  0:27       ` Long Li
2019-08-24 12:55         ` Ming Lei
2019-08-20  8:25 ` [PATCH 0/3] fix interrupt swamp in NVMe Ming Lei
2019-08-20  8:59   ` John Garry
2019-08-20 15:05     ` Keith Busch
2019-08-21  7:47     ` Long Li
2019-08-21  9:44       ` Ming Lei
2019-08-21 10:03         ` John Garry
2019-08-21 16:27         ` Long Li
2019-08-22  1:33           ` Ming Lei
2019-08-22  2:00             ` Keith Busch
2019-08-22  2:23               ` Ming Lei
2019-08-22  9:48               ` Thomas Gleixner
