From: Ming Lei <ming.lei@redhat.com>
To: John Garry <john.garry@huawei.com>
Cc: "tglx@linutronix.de" <tglx@linutronix.de>,
	"chenxiang (M)" <chenxiang66@hisilicon.com>,
	"bigeasy@linutronix.de" <bigeasy@linutronix.de>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"maz@kernel.org" <maz@kernel.org>,
	"hare@suse.com" <hare@suse.com>, "hch@lst.de" <hch@lst.de>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"bvanassche@acm.org" <bvanassche@acm.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"mingo@redhat.com" <mingo@redhat.com>
Subject: Re: [PATCH RFC 1/1] genirq: Make threaded handler use irq affinity for managed interrupt
Date: Sat, 14 Dec 2019 01:12:22 +0800	[thread overview]
Message-ID: <20191213171222.GA17267@ming.t460p> (raw)
In-Reply-To: <b7f3bcea-84ec-f9f6-a3aa-007ae712415f@huawei.com>

On Fri, Dec 13, 2019 at 03:43:07PM +0000, John Garry wrote:
> On 13/12/2019 13:18, Ming Lei wrote:
> 
> Hi Ming,
> 
> > 
> > On Fri, Dec 13, 2019 at 11:12:49AM +0000, John Garry wrote:
> > > Hi Ming,
> > > 
> > > > > I am running some NVMe perf tests with Marc's patch.
> > > > 
> > > > We need to confirm whether Marc's patch works as expected; could you
> > > > collect a log via the attached script?
> > > 
> > > As immediately below, I see this on vanilla mainline, so let's see what the
> > > issue is without that patch.
> > 
> > IMO, the interrupt load needs to be distributed in the same way that the
> > x86 IRQ matrix does it. If the ARM64 server doesn't do that, the first
> > step should be to align with that.
> 
> That would make sense. But still, I would like to think that a CPU could
> sink the interrupts from 2x queues.
> 
> > 
> > Also do you pass 'use_threaded_interrupts=1' in your test?
> 
> When I set this then, as I anticipated, there is no lockup. But IOPS drops
> from ~1M to ~800K.
> 
> > 
> > > 
> > > >   >
> > > > You never provided the test details (how many drives, how many disks
> > > > attached to each drive) as I asked, so I can't comment on the reason;
> > > > also, nothing shows that the patch is a good fix.
> > > 
> > > So I have only 2x ES3000 V3s. This looks like the same one:
> > > https://actfornet.com/HUAWEI_SERVER_DOCS/PCIeSSD/Huawei%20ES3000%20V3%20NVMe%20PCIe%20SSD%20Data%20Sheet.pdf
> > > 
> > > > 
> > > > My theory is simple: so far, the CPU is still much quicker than
> > > > current storage, as long as the IO isn't coming from multiple disks
> > > > which are connected to the same drive.
> > > 
> 
> [...]
> 
> > > irq 98, cpu list 88-91, effective list 88
> > > irq 99, cpu list 92-95, effective list 92
> > The above log shows there are two nvme drives, and each drive has 24 hw
> > queues.
> > 
> > Also the system has 96 cores, and 96 > 24 * 2, so if everything is fine,
> > each hw queue can be assigned one unique effective CPU for handling
> > the queue's interrupt.
> > 
> > Because arm64's gic driver doesn't distribute the irq's effective cpu
> > affinity, each hw queue is assigned the same CPU to handle its interrupt.
> > 
> > As you saw, the detected RCU stall is on CPU0, which is handling
> > both irq 77 and irq 100.
> > 
> > Please apply Marc's patch and observe whether a unique effective CPU is
> > assigned to each hw queue's irq.
> > 
> 
> Same issue:
> 
> 979826] hid-generic 0003:12D1:0003.0002: input: USB HID v1.10 Mouse
> [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input1
> [   38.772536] IRQ25 CPU14 -> CPU3
> [   38.777138] IRQ58 CPU8 -> CPU17
> [  119.499459] rcu: INFO: rcu_preempt self-detected stall on CPU
> [  119.505202] rcu: 16-....: (1 GPs behind) idle=a8a/1/0x4000000000000002
> softirq=952/1211 fqs=2625
> [  119.514188] (t=5253 jiffies g=2613 q=4573)
> [  119.514193] Task dump for CPU 16:
> [  119.514197] ksoftirqd/16    R  running task        0    91      2
> 0x0000002a
> [  119.514206] Call trace:
> [  119.514224]  dump_backtrace+0x0/0x1a0
> [  119.514228]  show_stack+0x14/0x20
> [  119.514236]  sched_show_task+0x164/0x1a0
> [  119.514240]  dump_cpu_task+0x40/0x2e8
> [  119.514245]  rcu_dump_cpu_stacks+0xa0/0xe0
> [  119.514247]  rcu_sched_clock_irq+0x6d8/0xaa8
> [  119.514251]  update_process_times+0x2c/0x50
> [  119.514258]  tick_sched_handle.isra.14+0x30/0x50
> [  119.514261]  tick_sched_timer+0x48/0x98
> [  119.514264]  __hrtimer_run_queues+0x120/0x1b8
> [  119.514266]  hrtimer_interrupt+0xd4/0x250
> [  119.514277]  arch_timer_handler_phys+0x28/0x40
> [  119.514280]  handle_percpu_devid_irq+0x80/0x140
> [  119.514283]  generic_handle_irq+0x24/0x38
> [  119.514285]  __handle_domain_irq+0x5c/0xb0
> [  119.514299]  gic_handle_irq+0x5c/0x148
> [  119.514301]  el1_irq+0xb8/0x180
> [  119.514305]  load_balance+0x478/0xb98
> [  119.514308]  rebalance_domains+0x1cc/0x2f8
> [  119.514311]  run_rebalance_domains+0x78/0xe0
> [  119.514313]  efi_header_end+0x114/0x234
> [  119.514317]  run_ksoftirqd+0x38/0x48
> [  119.514322]  smpboot_thread_fn+0x16c/0x270
> [  119.514324]  kthread+0x118/0x120
> [  119.514326]  ret_from_fork+0x10/0x18
> john@ubuntu:~$ ./dump-io-irq-affinity
> kernel version:
> Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT Fri Dec
> 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux
> PCI name is 04:00.0: nvme0n1
> irq 56, cpu list 75, effective list 5
> irq 60, cpu list 24-28, effective list 10

The effective list is supposed to be a subset of the irq's affinity (24-28).
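
Stating that invariant with the kernel's cpumask helpers (only a rough
sketch of the check, not code from any posted patch):

    #include <linux/cpumask.h>
    #include <linux/irq.h>

    /*
     * For any irq, the effective affinity reported in
     * /proc/irq/N/effective_affinity_list should be contained in the
     * affinity mask shown in /proc/irq/N/smp_affinity_list.
     */
    static bool effective_affinity_is_sane(struct irq_data *d)
    {
            const struct cpumask *aff = irq_data_get_affinity_mask(d);
            const struct cpumask *eff = irq_data_get_effective_affinity_mask(d);

            /* cpumask_subset(a, b): true if every cpu in a is also in b */
            return cpumask_subset(eff, aff);
    }

irq 60 above clearly breaks that: effective CPU 10 is not in 24-28.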

> irq 61, cpu list 29-33, effective list 7
> irq 62, cpu list 34-38, effective list 5
> irq 63, cpu list 39-43, effective list 6
> irq 64, cpu list 44-47, effective list 8
> irq 65, cpu list 48-51, effective list 9
> irq 66, cpu list 52-55, effective list 10
> irq 67, cpu list 56-59, effective list 11
> irq 68, cpu list 60-63, effective list 12
> irq 69, cpu list 64-67, effective list 13
> irq 70, cpu list 68-71, effective list 14
> irq 71, cpu list 72-75, effective list 15
> irq 72, cpu list 76-79, effective list 16
> irq 73, cpu list 80-83, effective list 17
> irq 74, cpu list 84-87, effective list 18
> irq 75, cpu list 88-91, effective list 19
> irq 76, cpu list 92-95, effective list 20

Same as above, so it looks like Marc's patch is wrong.

> irq 77, cpu list 0-3, effective list 3
> irq 78, cpu list 4-7, effective list 4
> irq 79, cpu list 8-11, effective list 8
> irq 80, cpu list 12-15, effective list 12
> irq 81, cpu list 16-19, effective list 16
> irq 82, cpu list 20-23, effective list 23
> PCI name is 81:00.0: nvme1n1
> irq 100, cpu list 0-3, effective list 0
> irq 101, cpu list 4-7, effective list 5
> irq 102, cpu list 8-11, effective list 9
> irq 103, cpu list 12-15, effective list 13
> irq 104, cpu list 16-19, effective list 17
> irq 105, cpu list 20-23, effective list 21
> irq 57, cpu list 63, effective list 7
> irq 83, cpu list 24-28, effective list 5
> irq 84, cpu list 29-33, effective list 6
> irq 85, cpu list 34-38, effective list 8
> irq 86, cpu list 39-43, effective list 9
> irq 87, cpu list 44-47, effective list 10
> irq 88, cpu list 48-51, effective list 11
> irq 89, cpu list 52-55, effective list 12
> irq 90, cpu list 56-59, effective list 13
> irq 91, cpu list 60-63, effective list 14
> irq 92, cpu list 64-67, effective list 15
> irq 93, cpu list 68-71, effective list 16
> irq 94, cpu list 72-75, effective list 17
> irq 95, cpu list 76-79, effective list 18
> irq 96, cpu list 80-83, effective list 19
> irq 97, cpu list 84-87, effective list 20
> irq 98, cpu list 88-91, effective list 21
> irq 99, cpu list 92-95, effective list 22

More of these are wrong.

> john@ubuntu:~$
> 
> But you can see that CPU16 is handling irqs 72, 81, and 93.

As I mentioned, the effective affinity has to be a subset of the irq's
affinity.
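
If it helps, here is a small userspace checker (just a quick sketch, I have
not run it on your box) that walks /proc/irq and flags every irq whose
effective_affinity_list is not contained in its smp_affinity_list, so you
don't need to eyeball the dump script output:

    /* check-eff-aff.c: hypothetical helper, not part of any posted patch */
    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_CPUS 1024

    /* Parse a cpulist such as "24-28,92-95" into a per-CPU 0/1 map. */
    static int parse_cpulist(const char *path, unsigned char *map)
    {
            char buf[4096];
            FILE *f = fopen(path, "r");

            if (!f)
                    return -1;
            if (!fgets(buf, sizeof(buf), f)) {
                    fclose(f);
                    return -1;
            }
            fclose(f);

            memset(map, 0, MAX_CPUS);
            for (char *tok = strtok(buf, ",\n"); tok; tok = strtok(NULL, ",\n")) {
                    int lo, hi;

                    if (sscanf(tok, "%d-%d", &lo, &hi) != 2)
                            lo = hi = atoi(tok);
                    for (int c = lo; c <= hi && c < MAX_CPUS; c++)
                            map[c] = 1;
            }
            return 0;
    }

    int main(void)
    {
            DIR *dir = opendir("/proc/irq");
            struct dirent *e;

            if (!dir)
                    return 1;

            while ((e = readdir(dir))) {
                    char path[280];
                    unsigned char aff[MAX_CPUS], eff[MAX_CPUS];

                    if (e->d_name[0] < '0' || e->d_name[0] > '9')
                            continue;   /* skip ".", ".." and non-irq entries */

                    snprintf(path, sizeof(path), "/proc/irq/%s/smp_affinity_list",
                             e->d_name);
                    if (parse_cpulist(path, aff))
                            continue;
                    snprintf(path, sizeof(path), "/proc/irq/%s/effective_affinity_list",
                             e->d_name);
                    if (parse_cpulist(path, eff))
                            continue;

                    for (int c = 0; c < MAX_CPUS; c++)
                            if (eff[c] && !aff[c])
                                    printf("irq %s: effective cpu %d not in affinity mask\n",
                                           e->d_name, c);
            }
            closedir(dir);
            return 0;
    }

Every line it prints is an irq whose effective affinity escaped its affinity
mask.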

> 
> > If a unique effective CPU is assigned to each hw queue's irq and the RCU
> > stall can still be triggered, let's investigate further, given that one
> > single ARM64 CPU core should be quick enough to handle IO completion from
> > a single NVMe drive.
> 
> If I remove the code for bringing the affinity within the ITS numa node mask
> - as Marc hinted - then I still get a lockup, but we still have CPUs serving
> multiple interrupts:
> 
>   116.166881] rcu: INFO: rcu_preempt self-detected stall on CPU
> [  116.181432] Task dump for CPU 4:
> [  116.181502] Task dump for CPU 8:
> john@ubuntu:~$ ./dump-io-irq-affinity
> kernel version:
> Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1443 SMP PREEMPT Fri Dec
> 13 15:29:55 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux
> PCI name is 04:00.0: nvme0n1
> irq 56, cpu list 75, effective list 75
> irq 60, cpu list 24-28, effective list 25
> irq 61, cpu list 29-33, effective list 29
> irq 62, cpu list 34-38, effective list 34
> irq 63, cpu list 39-43, effective list 39
> irq 64, cpu list 44-47, effective list 44
> irq 65, cpu list 48-51, effective list 49
> irq 66, cpu list 52-55, effective list 55
> irq 67, cpu list 56-59, effective list 56
> irq 68, cpu list 60-63, effective list 61
> irq 69, cpu list 64-67, effective list 64
> irq 70, cpu list 68-71, effective list 68
> irq 71, cpu list 72-75, effective list 73
> irq 72, cpu list 76-79, effective list 76
> irq 73, cpu list 80-83, effective list 80
> irq 74, cpu list 84-87, effective list 85
> irq 75, cpu list 88-91, effective list 88
> irq 76, cpu list 92-95, effective list 92
> irq 77, cpu list 0-3, effective list 1
> irq 78, cpu list 4-7, effective list 4
> irq 79, cpu list 8-11, effective list 8
> irq 80, cpu list 12-15, effective list 14
> irq 81, cpu list 16-19, effective list 16
> irq 82, cpu list 20-23, effective list 20
> PCI name is 81:00.0: nvme1n1
> irq 100, cpu list 0-3, effective list 0
> irq 101, cpu list 4-7, effective list 4
> irq 102, cpu list 8-11, effective list 8
> irq 103, cpu list 12-15, effective list 13
> irq 104, cpu list 16-19, effective list 16
> irq 105, cpu list 20-23, effective list 20
> irq 57, cpu list 63, effective list 63
> irq 83, cpu list 24-28, effective list 26
> irq 84, cpu list 29-33, effective list 31
> irq 85, cpu list 34-38, effective list 35
> irq 86, cpu list 39-43, effective list 40
> irq 87, cpu list 44-47, effective list 45
> irq 88, cpu list 48-51, effective list 50
> irq 89, cpu list 52-55, effective list 52
> irq 90, cpu list 56-59, effective list 57
> irq 91, cpu list 60-63, effective list 62
> irq 92, cpu list 64-67, effective list 65
> irq 93, cpu list 68-71, effective list 69
> irq 94, cpu list 72-75, effective list 74
> irq 95, cpu list 76-79, effective list 77
> irq 96, cpu list 80-83, effective list 81
> irq 97, cpu list 84-87, effective list 86
> irq 98, cpu list 88-91, effective list 89
> irq 99, cpu list 92-95, effective list 93
> john@ubuntu:~$
> 
> I'm now thinking that we should just attempt this intelligent CPU affinity
> assignment for managed interrupts.

Right, the rule is simple: distribute the effective CPUs evenly among the
CPUs, and at the same time always select each effective CPU from the irq's
affinity mask.
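
A minimal sketch of that rule in kernel-style C (this is not Marc's patch,
and managed_irq_count is a made-up per-CPU counter, only there to show the
idea): pick, from the irq's affinity mask, the online CPU that currently
serves the fewest managed irqs.

    #include <linux/atomic.h>
    #include <linux/cpumask.h>
    #include <linux/kernel.h>
    #include <linux/percpu.h>

    /* Hypothetical counter of managed irqs targeted at each CPU. */
    static DEFINE_PER_CPU(atomic_t, managed_irq_count);

    static unsigned int pick_effective_cpu(const struct cpumask *affinity)
    {
            unsigned int cpu, best = cpumask_first(affinity);
            unsigned int best_cnt = UINT_MAX;

            /* Walk only the online CPUs inside the irq's affinity mask. */
            for_each_cpu_and(cpu, affinity, cpu_online_mask) {
                    unsigned int cnt = atomic_read(per_cpu_ptr(&managed_irq_count, cpu));

                    if (cnt < best_cnt) {
                            best_cnt = cnt;
                            best = cpu;
                    }
            }

            /* Remember the choice so the next irq is pushed to another CPU. */
            atomic_inc(per_cpu_ptr(&managed_irq_count, best));
            return best;
    }

Then whoever writes the effective affinity (the managed-irq core or the GIC
driver) would call something like this instead of always picking the first
CPU of the mask.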


Thanks,
Ming



Thread overview: 57+ messages
2019-12-06 14:35 [PATCH RFC 0/1] Threaded handler uses irq affinity for when the interrupt is managed John Garry
2019-12-06 14:35 ` [PATCH RFC 1/1] genirq: Make threaded handler use irq affinity for managed interrupt John Garry
2019-12-06 15:22   ` Marc Zyngier
2019-12-06 16:16     ` John Garry
2019-12-07  8:03   ` Ming Lei
2019-12-09 14:30     ` John Garry
2019-12-09 15:09       ` Hannes Reinecke
2019-12-09 15:17         ` Marc Zyngier
2019-12-09 15:25           ` Hannes Reinecke
2019-12-09 15:36             ` Marc Zyngier
2019-12-09 15:49           ` Qais Yousef
2019-12-09 15:55             ` Marc Zyngier
2019-12-10  1:43       ` Ming Lei
2019-12-10  9:45         ` John Garry
2019-12-10 10:06           ` Ming Lei
2019-12-10 10:28           ` Marc Zyngier
2019-12-10 10:59             ` John Garry
2019-12-10 11:36               ` Marc Zyngier
2019-12-10 12:05                 ` John Garry
2019-12-10 18:32                   ` Marc Zyngier
2019-12-11  9:41                     ` John Garry
2019-12-13 10:07                       ` John Garry
2019-12-13 10:31                         ` Marc Zyngier
2019-12-13 12:08                           ` John Garry
2019-12-14 10:59                             ` Marc Zyngier
2019-12-11 17:09         ` John Garry
2019-12-12 22:38           ` Ming Lei
2019-12-13 11:12             ` John Garry
2019-12-13 13:18               ` Ming Lei
2019-12-13 15:43                 ` John Garry
2019-12-13 17:12                   ` Ming Lei [this message]
2019-12-13 17:50                     ` John Garry
2019-12-14 13:56                   ` Marc Zyngier
2019-12-16 10:47                     ` John Garry
2019-12-16 11:40                       ` Marc Zyngier
2019-12-16 14:17                         ` John Garry
2019-12-16 18:00                           ` Marc Zyngier
2019-12-16 18:50                             ` John Garry
2019-12-20 11:30                               ` John Garry
2019-12-20 14:43                                 ` Marc Zyngier
2019-12-20 15:38                                   ` John Garry
2019-12-20 16:16                                     ` Marc Zyngier
2019-12-20 23:31                                     ` Ming Lei
2019-12-23  9:07                                       ` Marc Zyngier
2019-12-23 10:26                                         ` John Garry
2019-12-23 10:47                                           ` Marc Zyngier
2019-12-23 11:35                                             ` John Garry
2019-12-24  1:59                                             ` Ming Lei
2019-12-24 11:20                                               ` Marc Zyngier
2019-12-25  0:48                                                 ` Ming Lei
2020-01-02 10:35                                                   ` John Garry
2020-01-03  0:46                                                     ` Ming Lei
2020-01-03 10:41                                                       ` John Garry
2020-01-03 11:29                                                         ` Ming Lei
2020-01-03 11:50                                                           ` John Garry
2020-01-04 12:03                                                             ` Ming Lei
2020-05-30  7:46 ` [tip: irq/core] irqchip/gic-v3-its: Balance initial LPI affinity across CPUs tip-bot2 for Marc Zyngier
