From: John Garry <john.garry@huawei.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: "tglx@linutronix.de" <tglx@linutronix.de>,
	"chenxiang (M)" <chenxiang66@hisilicon.com>,
	"bigeasy@linutronix.de" <bigeasy@linutronix.de>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"maz@kernel.org" <maz@kernel.org>,
	"hare@suse.com" <hare@suse.com>, "hch@lst.de" <hch@lst.de>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"bvanassche@acm.org" <bvanassche@acm.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"mingo@redhat.com" <mingo@redhat.com>
Subject: Re: [PATCH RFC 1/1] genirq: Make threaded handler use irq affinity for managed interrupt
Date: Fri, 13 Dec 2019 15:43:07 +0000
Message-ID: <b7f3bcea-84ec-f9f6-a3aa-007ae712415f@huawei.com>
In-Reply-To: <20191213131822.GA19876@ming.t460p>

On 13/12/2019 13:18, Ming Lei wrote:

Hi Ming,

> 
> On Fri, Dec 13, 2019 at 11:12:49AM +0000, John Garry wrote:
>> Hi Ming,
>>
>>>> I am running some NVMe perf tests with Marc's patch.
>>>
>>> We need to confirm whether Marc's patch works as expected; could you
>>> collect a log via the attached script?
>>
>> As shown immediately below, I see this on vanilla mainline, so let's see
>> what the issue is without that patch.
> 
> IMO, the interrupt load needs to be distributed in the way the x86 IRQ
> matrix does it. If the ARM64 server doesn't do that, the first step should
> be to align with that.

That would make sense. But still, I would like to think that a single CPU
could sink the interrupts from two queues.

> 
> Also, do you pass 'use_threaded_interrupts=1' in your test?

When I set this then, as I anticipated, there is no lockup. But IOPS drops
from ~1M to ~800K.
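
(For anyone reproducing this: use_threaded_interrupts is an nvme module
parameter, so it can be passed as nvme.use_threaded_interrupts=1 on the
kernel command line for a built-in driver, or via
"modprobe nvme use_threaded_interrupts=1" when nvme is built as a module.)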

> 
>>
>>>   >
>>> You never provided the test details (how many drives, how many disks
>>> attached to each drive) that I asked for, so I can't comment on the
>>> reason; nothing shows that the patch is a good fix either.
>>
>> So I have only 2x ES3000 V3s. This looks like the same one:
>> https://actfornet.com/HUAWEI_SERVER_DOCS/PCIeSSD/Huawei%20ES3000%20V3%20NVMe%20PCIe%20SSD%20Data%20Sheet.pdf
>>
>>>
>>> My theory is simple: so far, the CPU is still much quicker than
>>> current storage, as long as the IO isn't coming from multiple disks
>>> connected to the same drive.
>>

[...]

>> irq 98, cpu list 88-91, effective list 88
>> irq 99, cpu list 92-95, effective list 92
>   
> The above log shows there are two nvme drives, each with 24 hw
> queues.
> 
> Also, the system has 96 cores, and 96 > 24 * 2, so if everything is fine,
> each hw queue can be assigned a unique effective CPU to handle
> its interrupt.
> 
> Because arm64's gic driver doesn't distribute each irq's effective cpu
> affinity, each hw queue is assigned the same CPU to handle its interrupt.
> 
> As you saw, the detected RCU stall is on CPU0, which handles
> both irq 77 and irq 100.
> 
> Please apply Marc's patch and observe whether a unique effective CPU is
> assigned to each hw queue's irq.
> 
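
For my own understanding, the spreading being asked for amounts to
something like the sketch below - illustrative kernel-style C using only
the cpumask helpers, not the actual genirq/GIC code, and the nth counter
is a stand-in for whatever per-mask state the real code would keep:

/*
 * Illustrative sketch: pick the nth CPU of a queue's managed-affinity
 * mask, round-robin, so that queues sharing a mask end up with
 * distinct effective CPUs.
 */
static unsigned int pick_effective_cpu(const struct cpumask *mask,
				       unsigned int nth)
{
	unsigned int w = cpumask_weight(mask);
	unsigned int cpu;

	if (!w)
		return nr_cpu_ids;	/* empty mask: no valid CPU */

	nth %= w;
	for_each_cpu(cpu, mask)
		if (nth-- == 0)
			break;
	return cpu;
}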

Anyway, with Marc's patch applied, same issue:

979826] hid-generic 0003:12D1:0003.0002: input: USB HID v1.10 Mouse [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input1
[   38.772536] IRQ25 CPU14 -> CPU3
[   38.777138] IRQ58 CPU8 -> CPU17
[  119.499459] rcu: INFO: rcu_preempt self-detected stall on CPU
[  119.505202] rcu: 16-....: (1 GPs behind) idle=a8a/1/0x4000000000000002 softirq=952/1211 fqs=2625
[  119.514188] (t=5253 jiffies g=2613 q=4573)
[  119.514193] Task dump for CPU 16:
[  119.514197] ksoftirqd/16    R  running task        0    91      2 0x0000002a
[  119.514206] Call trace:
[  119.514224]  dump_backtrace+0x0/0x1a0
[  119.514228]  show_stack+0x14/0x20
[  119.514236]  sched_show_task+0x164/0x1a0
[  119.514240]  dump_cpu_task+0x40/0x2e8
[  119.514245]  rcu_dump_cpu_stacks+0xa0/0xe0
[  119.514247]  rcu_sched_clock_irq+0x6d8/0xaa8
[  119.514251]  update_process_times+0x2c/0x50
[  119.514258]  tick_sched_handle.isra.14+0x30/0x50
[  119.514261]  tick_sched_timer+0x48/0x98
[  119.514264]  __hrtimer_run_queues+0x120/0x1b8
[  119.514266]  hrtimer_interrupt+0xd4/0x250
[  119.514277]  arch_timer_handler_phys+0x28/0x40
[  119.514280]  handle_percpu_devid_irq+0x80/0x140
[  119.514283]  generic_handle_irq+0x24/0x38
[  119.514285]  __handle_domain_irq+0x5c/0xb0
[  119.514299]  gic_handle_irq+0x5c/0x148
[  119.514301]  el1_irq+0xb8/0x180
[  119.514305]  load_balance+0x478/0xb98
[  119.514308]  rebalance_domains+0x1cc/0x2f8
[  119.514311]  run_rebalance_domains+0x78/0xe0
[  119.514313]  efi_header_end+0x114/0x234
[  119.514317]  run_ksoftirqd+0x38/0x48
[  119.514322]  smpboot_thread_fn+0x16c/0x270
[  119.514324]  kthread+0x118/0x120
[  119.514326]  ret_from_fork+0x10/0x18
john@ubuntu:~$ ./dump-io-irq-affinity
kernel version:
Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT Fri Dec 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux
PCI name is 04:00.0: nvme0n1
irq 56, cpu list 75, effective list 5
irq 60, cpu list 24-28, effective list 10
irq 61, cpu list 29-33, effective list 7
irq 62, cpu list 34-38, effective list 5
irq 63, cpu list 39-43, effective list 6
irq 64, cpu list 44-47, effective list 8
irq 65, cpu list 48-51, effective list 9
irq 66, cpu list 52-55, effective list 10
irq 67, cpu list 56-59, effective list 11
irq 68, cpu list 60-63, effective list 12
irq 69, cpu list 64-67, effective list 13
irq 70, cpu list 68-71, effective list 14
irq 71, cpu list 72-75, effective list 15
irq 72, cpu list 76-79, effective list 16
irq 73, cpu list 80-83, effective list 17
irq 74, cpu list 84-87, effective list 18
irq 75, cpu list 88-91, effective list 19
irq 76, cpu list 92-95, effective list 20
irq 77, cpu list 0-3, effective list 3
irq 78, cpu list 4-7, effective list 4
irq 79, cpu list 8-11, effective list 8
irq 80, cpu list 12-15, effective list 12
irq 81, cpu list 16-19, effective list 16
irq 82, cpu list 20-23, effective list 23
PCI name is 81:00.0: nvme1n1
irq 100, cpu list 0-3, effective list 0
irq 101, cpu list 4-7, effective list 5
irq 102, cpu list 8-11, effective list 9
irq 103, cpu list 12-15, effective list 13
irq 104, cpu list 16-19, effective list 17
irq 105, cpu list 20-23, effective list 21
irq 57, cpu list 63, effective list 7
irq 83, cpu list 24-28, effective list 5
irq 84, cpu list 29-33, effective list 6
irq 85, cpu list 34-38, effective list 8
irq 86, cpu list 39-43, effective list 9
irq 87, cpu list 44-47, effective list 10
irq 88, cpu list 48-51, effective list 11
irq 89, cpu list 52-55, effective list 12
irq 90, cpu list 56-59, effective list 13
irq 91, cpu list 60-63, effective list 14
irq 92, cpu list 64-67, effective list 15
irq 93, cpu list 68-71, effective list 16
irq 94, cpu list 72-75, effective list 17
irq 95, cpu list 76-79, effective list 18
irq 96, cpu list 80-83, effective list 19
irq 97, cpu list 84-87, effective list 20
irq 98, cpu list 88-91, effective list 21
irq 99, cpu list 92-95, effective list 22
john@ubuntu:~$

But you can see that CPU16 is handling irq 72, irq 81, and irq 93.

> If a unique effective CPU is assigned to each hw queue's irq and the RCU
> stall can still be triggered, let's investigate further, given that a
> single ARM64 CPU core should be quick enough to handle IO completion from
> a single NVMe drive.

If I remove the code for bringing the affinity within the ITS NUMA node
mask - as Marc hinted - then I still get a lockup, and we still have
CPUs serving multiple interrupts:

[  116.166881] rcu: INFO: rcu_preempt self-detected stall on CPU
[  116.181432] Task dump for CPU 4:
[  116.181502] Task dump for CPU 8:
john@ubuntu:~$ ./dump-io-irq-affinity
kernel version:
Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1443 SMP PREEMPT Fri Dec 13 15:29:55 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux
PCI name is 04:00.0: nvme0n1
irq 56, cpu list 75, effective list 75
irq 60, cpu list 24-28, effective list 25
irq 61, cpu list 29-33, effective list 29
irq 62, cpu list 34-38, effective list 34
irq 63, cpu list 39-43, effective list 39
irq 64, cpu list 44-47, effective list 44
irq 65, cpu list 48-51, effective list 49
irq 66, cpu list 52-55, effective list 55
irq 67, cpu list 56-59, effective list 56
irq 68, cpu list 60-63, effective list 61
irq 69, cpu list 64-67, effective list 64
irq 70, cpu list 68-71, effective list 68
irq 71, cpu list 72-75, effective list 73
irq 72, cpu list 76-79, effective list 76
irq 73, cpu list 80-83, effective list 80
irq 74, cpu list 84-87, effective list 85
irq 75, cpu list 88-91, effective list 88
irq 76, cpu list 92-95, effective list 92
irq 77, cpu list 0-3, effective list 1
irq 78, cpu list 4-7, effective list 4
irq 79, cpu list 8-11, effective list 8
irq 80, cpu list 12-15, effective list 14
irq 81, cpu list 16-19, effective list 16
irq 82, cpu list 20-23, effective list 20
PCI name is 81:00.0: nvme1n1
irq 100, cpu list 0-3, effective list 0
irq 101, cpu list 4-7, effective list 4
irq 102, cpu list 8-11, effective list 8
irq 103, cpu list 12-15, effective list 13
irq 104, cpu list 16-19, effective list 16
irq 105, cpu list 20-23, effective list 20
irq 57, cpu list 63, effective list 63
irq 83, cpu list 24-28, effective list 26
irq 84, cpu list 29-33, effective list 31
irq 85, cpu list 34-38, effective list 35
irq 86, cpu list 39-43, effective list 40
irq 87, cpu list 44-47, effective list 45
irq 88, cpu list 48-51, effective list 50
irq 89, cpu list 52-55, effective list 52
irq 90, cpu list 56-59, effective list 57
irq 91, cpu list 60-63, effective list 62
irq 92, cpu list 64-67, effective list 65
irq 93, cpu list 68-71, effective list 69
irq 94, cpu list 72-75, effective list 74
irq 95, cpu list 76-79, effective list 77
irq 96, cpu list 80-83, effective list 81
irq 97, cpu list 84-87, effective list 86
irq 98, cpu list 88-91, effective list 89
irq 99, cpu list 92-95, effective list 93
john@ubuntu:~$

I'm now thinking that we should just attempt this sort of intelligent CPU
affinity assignment for managed interrupts.
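
Roughly what I mean - a hand-wavy sketch only, where irqs_per_cpu[] is
bookkeeping invented for illustration rather than anything the ITS driver
actually tracks - is to pick the least-loaded CPU within each managed mask,
so that overlapping masks from different devices (e.g. irq 78 and irq 101,
which both ended up on CPU4 above) still land on distinct CPUs:

/*
 * Illustrative sketch: irqs_per_cpu[] counts how many managed irqs
 * each CPU already serves; pick the least-loaded CPU in the queue's
 * affinity mask.
 */
static atomic_t irqs_per_cpu[NR_CPUS];

static unsigned int pick_least_loaded_cpu(const struct cpumask *mask)
{
	unsigned int cpu, best;

	if (cpumask_empty(mask))
		return nr_cpu_ids;	/* no valid CPU */

	best = cpumask_first(mask);
	for_each_cpu(cpu, mask)
		if (atomic_read(&irqs_per_cpu[cpu]) <
		    atomic_read(&irqs_per_cpu[best]))
			best = cpu;
	atomic_inc(&irqs_per_cpu[best]);
	return best;
}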

Thanks,
John
