LKML Archive on lore.kernel.org
From: John Garry <john.garry@huawei.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: "tglx@linutronix.de" <tglx@linutronix.de>,
	"chenxiang (M)" <chenxiang66@hisilicon.com>,
	"bigeasy@linutronix.de" <bigeasy@linutronix.de>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"maz@kernel.org" <maz@kernel.org>,
	"hare@suse.com" <hare@suse.com>, "hch@lst.de" <hch@lst.de>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"bvanassche@acm.org" <bvanassche@acm.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"mingo@redhat.com" <mingo@redhat.com>
Subject: Re: [PATCH RFC 1/1] genirq: Make threaded handler use irq affinity for managed interrupt
Date: Fri, 13 Dec 2019 15:43:07 +0000
Message-ID: <b7f3bcea-84ec-f9f6-a3aa-007ae712415f@huawei.com>
In-Reply-To: <20191213131822.GA19876@ming.t460p>

On 13/12/2019 13:18, Ming Lei wrote:

Hi Ming,

> 
> On Fri, Dec 13, 2019 at 11:12:49AM +0000, John Garry wrote:
>> Hi Ming,
>>
>>>> I am running some NVMe perf tests with Marc's patch.
>>>
>>> We need to confirm that if Marc's patch works as expected, could you
>>> collect log via the attached script?
>>
>> As immediately below, I see this on vanilla mainline, so let's see what the
>> issue is without that patch.
> 
> IMO, the interrupt load needs to be distributed as what X86 IRQ matrix
> does. If the ARM64 server doesn't do that, the 1st step should align to
> that.

That would make sense. But still, I would like to think that a CPU could 
sink the interrupts from 2x queues.

> 
> Also do you pass 'use_threaded_interrupts=1' in your test?

When I set this then, as I anticipated, there is no lockup. But 
throughput drops from ~1M IOPS to 800K IOPS.

> 
>>
>>>
>>> You never provide the test details(how many drives, how many disks
>>> attached to each drive) as I asked, so I can't comment on the reason,
>>> also no reason shows that the patch is a good fix.
>>
>> So I have only 2x ES3000 V3s. This looks like the same one:
>> https://actfornet.com/HUAWEI_SERVER_DOCS/PCIeSSD/Huawei%20ES3000%20V3%20NVMe%20PCIe%20SSD%20Data%20Sheet.pdf
>>
>>>
>>> My theory is simple, so far, the CPU is still much quicker than
>>> current storage in case that IO aren't from multiple disks which are
>>> connected to same drive.
>>

[...]

>> irq 98, cpu list 88-91, effective list 88
>> irq 99, cpu list 92-95, effective list 92
>   
> The above log shows there are two nvme drives, each drive has 24 hw
> queues.
> 
> Also the system has 96 cores, and 96 > 24 * 2, so if everything is fine,
> each hw queue can be assigned one unique effective CPU for handling
> the queue's interrupt.
> 
> Because arm64's gic driver doesn't distribute irq's effective cpu affinity,
> each hw queue is assigned same CPU to handle its interrupt.
> 
> As you saw, the detected RCU stall is on CPU0, which is for handling
> both irq 77 and irq 100.
> 
> Please apply Marc's patch and observe if unique effective CPU is
> assigned to each hw queue's irq.
> 

Same issue:

979826] hid-generic 0003:12D1:0003.0002: input: USB HID v1.10 Mouse [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input1
[   38.772536] IRQ25 CPU14 -> CPU3
[   38.777138] IRQ58 CPU8 -> CPU17
[  119.499459] rcu: INFO: rcu_preempt self-detected stall on CPU
[  119.505202] rcu: 16-....: (1 GPs behind) idle=a8a/1/0x4000000000000002 softirq=952/1211 fqs=2625
[  119.514188] (t=5253 jiffies g=2613 q=4573)
[  119.514193] Task dump for CPU 16:
[  119.514197] ksoftirqd/16    R  running task        0    91      2 0x0000002a
[  119.514206] Call trace:
[  119.514224]  dump_backtrace+0x0/0x1a0
[  119.514228]  show_stack+0x14/0x20
[  119.514236]  sched_show_task+0x164/0x1a0
[  119.514240]  dump_cpu_task+0x40/0x2e8
[  119.514245]  rcu_dump_cpu_stacks+0xa0/0xe0
[  119.514247]  rcu_sched_clock_irq+0x6d8/0xaa8
[  119.514251]  update_process_times+0x2c/0x50
[  119.514258]  tick_sched_handle.isra.14+0x30/0x50
[  119.514261]  tick_sched_timer+0x48/0x98
[  119.514264]  __hrtimer_run_queues+0x120/0x1b8
[  119.514266]  hrtimer_interrupt+0xd4/0x250
[  119.514277]  arch_timer_handler_phys+0x28/0x40
[  119.514280]  handle_percpu_devid_irq+0x80/0x140
[  119.514283]  generic_handle_irq+0x24/0x38
[  119.514285]  __handle_domain_irq+0x5c/0xb0
[  119.514299]  gic_handle_irq+0x5c/0x148
[  119.514301]  el1_irq+0xb8/0x180
[  119.514305]  load_balance+0x478/0xb98
[  119.514308]  rebalance_domains+0x1cc/0x2f8
[  119.514311]  run_rebalance_domains+0x78/0xe0
[  119.514313]  efi_header_end+0x114/0x234
[  119.514317]  run_ksoftirqd+0x38/0x48
[  119.514322]  smpboot_thread_fn+0x16c/0x270
[  119.514324]  kthread+0x118/0x120
[  119.514326]  ret_from_fork+0x10/0x18
john@ubuntu:~$ ./dump-io-irq-affinity
kernel version:
Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT Fri Dec 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux
PCI name is 04:00.0: nvme0n1
irq 56, cpu list 75, effective list 5
irq 60, cpu list 24-28, effective list 10
irq 61, cpu list 29-33, effective list 7
irq 62, cpu list 34-38, effective list 5
irq 63, cpu list 39-43, effective list 6
irq 64, cpu list 44-47, effective list 8
irq 65, cpu list 48-51, effective list 9
irq 66, cpu list 52-55, effective list 10
irq 67, cpu list 56-59, effective list 11
irq 68, cpu list 60-63, effective list 12
irq 69, cpu list 64-67, effective list 13
irq 70, cpu list 68-71, effective list 14
irq 71, cpu list 72-75, effective list 15
irq 72, cpu list 76-79, effective list 16
irq 73, cpu list 80-83, effective list 17
irq 74, cpu list 84-87, effective list 18
irq 75, cpu list 88-91, effective list 19
irq 76, cpu list 92-95, effective list 20
irq 77, cpu list 0-3, effective list 3
irq 78, cpu list 4-7, effective list 4
irq 79, cpu list 8-11, effective list 8
irq 80, cpu list 12-15, effective list 12
irq 81, cpu list 16-19, effective list 16
irq 82, cpu list 20-23, effective list 23
PCI name is 81:00.0: nvme1n1
irq 100, cpu list 0-3, effective list 0
irq 101, cpu list 4-7, effective list 5
irq 102, cpu list 8-11, effective list 9
irq 103, cpu list 12-15, effective list 13
irq 104, cpu list 16-19, effective list 17
irq 105, cpu list 20-23, effective list 21
irq 57, cpu list 63, effective list 7
irq 83, cpu list 24-28, effective list 5
irq 84, cpu list 29-33, effective list 6
irq 85, cpu list 34-38, effective list 8
irq 86, cpu list 39-43, effective list 9
irq 87, cpu list 44-47, effective list 10
irq 88, cpu list 48-51, effective list 11
irq 89, cpu list 52-55, effective list 12
irq 90, cpu list 56-59, effective list 13
irq 91, cpu list 60-63, effective list 14
irq 92, cpu list 64-67, effective list 15
irq 93, cpu list 68-71, effective list 16
irq 94, cpu list 72-75, effective list 17
irq 95, cpu list 76-79, effective list 18
irq 96, cpu list 80-83, effective list 19
irq 97, cpu list 84-87, effective list 20
irq 98, cpu list 88-91, effective list 21
irq 99, cpu list 92-95, effective list 22
john@ubuntu:~$

but you can see that CPU16 is handling irq72, 81, and 93.
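
To double-check that reading of the dump, a minimal Python sketch (not 
part of the dump-io-irq-affinity script; the regex and the function name 
are assumptions for illustration) can group irqs by effective CPU. It 
assumes each effective list holds a single CPU, as in the output above, 
and uses only a few lines copied from that output:

```python
import re
from collections import defaultdict

# A subset of lines copied from the dump above, for brevity.
SAMPLE = """\
irq 72, cpu list 76-79, effective list 16
irq 81, cpu list 16-19, effective list 16
irq 93, cpu list 68-71, effective list 16
irq 82, cpu list 20-23, effective list 23
"""

# Assumes the effective list is a single CPU number, as in this dump.
LINE_RE = re.compile(r"irq (\d+), cpu list (\S+), effective list (\d+)")


def shared_effective_cpus(text):
    """Map each effective CPU to its irqs, keeping only CPUs with >1 irq."""
    by_cpu = defaultdict(list)
    for match in LINE_RE.finditer(text):
        irq, _mask, cpu = match.groups()
        by_cpu[int(cpu)].append(int(irq))
    return {cpu: irqs for cpu, irqs in by_cpu.items() if len(irqs) > 1}


print(shared_effective_cpus(SAMPLE))  # {16: [72, 81, 93]}
```

Feeding the full dump through the same function would list every CPU 
that is the effective target of more than one queue interrupt.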

> If unique effective CPU is assigned to each hw queue's irq, and the RCU
> stall can still be triggered, let's investigate further, given one single
> ARM64 CPU core should be quick enough to handle IO completion from single
> NVMe drive.

If I remove the code for bringing the affinity within the ITS NUMA node 
mask - as Marc hinted - then I still get a lockup, and we still have 
CPUs serving multiple interrupts:

[  116.166881] rcu: INFO: rcu_preempt self-detected stall on CPU
[  116.181432] Task dump for CPU 4:
[  116.181502] Task dump for CPU 8:
john@ubuntu:~$ ./dump-io-irq-affinity
kernel version:
Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1443 SMP PREEMPT Fri Dec 13 15:29:55 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux
PCI name is 04:00.0: nvme0n1
irq 56, cpu list 75, effective list 75
irq 60, cpu list 24-28, effective list 25
irq 61, cpu list 29-33, effective list 29
irq 62, cpu list 34-38, effective list 34
irq 63, cpu list 39-43, effective list 39
irq 64, cpu list 44-47, effective list 44
irq 65, cpu list 48-51, effective list 49
irq 66, cpu list 52-55, effective list 55
irq 67, cpu list 56-59, effective list 56
irq 68, cpu list 60-63, effective list 61
irq 69, cpu list 64-67, effective list 64
irq 70, cpu list 68-71, effective list 68
irq 71, cpu list 72-75, effective list 73
irq 72, cpu list 76-79, effective list 76
irq 73, cpu list 80-83, effective list 80
irq 74, cpu list 84-87, effective list 85
irq 75, cpu list 88-91, effective list 88
irq 76, cpu list 92-95, effective list 92
irq 77, cpu list 0-3, effective list 1
irq 78, cpu list 4-7, effective list 4
irq 79, cpu list 8-11, effective list 8
irq 80, cpu list 12-15, effective list 14
irq 81, cpu list 16-19, effective list 16
irq 82, cpu list 20-23, effective list 20
PCI name is 81:00.0: nvme1n1
irq 100, cpu list 0-3, effective list 0
irq 101, cpu list 4-7, effective list 4
irq 102, cpu list 8-11, effective list 8
irq 103, cpu list 12-15, effective list 13
irq 104, cpu list 16-19, effective list 16
irq 105, cpu list 20-23, effective list 20
irq 57, cpu list 63, effective list 63
irq 83, cpu list 24-28, effective list 26
irq 84, cpu list 29-33, effective list 31
irq 85, cpu list 34-38, effective list 35
irq 86, cpu list 39-43, effective list 40
irq 87, cpu list 44-47, effective list 45
irq 88, cpu list 48-51, effective list 50
irq 89, cpu list 52-55, effective list 52
irq 90, cpu list 56-59, effective list 57
irq 91, cpu list 60-63, effective list 62
irq 92, cpu list 64-67, effective list 65
irq 93, cpu list 68-71, effective list 69
irq 94, cpu list 72-75, effective list 74
irq 95, cpu list 76-79, effective list 77
irq 96, cpu list 80-83, effective list 81
irq 97, cpu list 84-87, effective list 86
irq 98, cpu list 88-91, effective list 89
irq 99, cpu list 92-95, effective list 93
john@ubuntu:~$

I'm now thinking that we should just attempt this intelligent CPU 
affinity assignment for managed interrupts.
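
As a rough userspace illustration of the idea only (this is not kernel 
code and not any posted patch; parse_cpu_list() and assign_balanced() 
are hypothetical names), one possible policy is to pick, for each 
managed interrupt, the CPU within its affinity mask that currently 
serves the fewest interrupts:

```python
def parse_cpu_list(spec):
    """Expand a cpulist string such as "24-28" or "75" into CPU ids."""
    cpus = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def assign_balanced(irq_masks):
    """For each (irq, cpulist) pair, choose the least-loaded CPU in the mask.

    Ties are broken by the lowest CPU number. This is an illustrative
    policy, assumed here, not what the kernel does.
    """
    load = {}
    chosen = {}
    for irq, spec in irq_masks:
        cpu = min(parse_cpu_list(spec), key=lambda c: (load.get(c, 0), c))
        load[cpu] = load.get(cpu, 0) + 1
        chosen[irq] = cpu
    return chosen


# Masks taken from the dumps above: two queues per 4-CPU group.
masks = [(77, "0-3"), (100, "0-3"), (78, "4-7"), (101, "4-7")]
print(assign_balanced(masks))  # {77: 0, 100: 1, 78: 4, 101: 5}
```

With this policy the two queues sharing a mask land on different CPUs, 
which is the property missing from both dumps above.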

Thanks,
John
