From: Marc Zyngier <marc.zyngier@arm.com>
To: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: tglx@linutronix.de, axboe@kernel.dk, mpe@ellerman.id.au,
	keith.busch@intel.com, peterz@infradead.org,
	LKML <linux-kernel@vger.kernel.org>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: system hung up when offlining CPUs
Date: Thu, 10 Aug 2017 12:54:33 +0100	[thread overview]
Message-ID: <c1c7e0d6-d908-b511-8418-bca288a0d20a@arm.com> (raw)
In-Reply-To: <cd524af7-1f20-1956-1e44-92a451053387@gmail.com>

+ Christoph, since he's the one who came up with the idea

On 09/08/17 20:09, YASUAKI ISHIMATSU wrote:
> Hi Marc,
> 
> On 08/09/2017 07:42 AM, Marc Zyngier wrote:
>> On Tue, 8 Aug 2017 15:25:35 -0400
>> YASUAKI ISHIMATSU <yasu.isimatu@gmail.com> wrote:
>>
>>> Hi Thomas,
>>>
>>> When offlining all CPUs except cpu0, the system hung up with the following message.
>>>
>>> [...] INFO: task kworker/u384:1:1234 blocked for more than 120 seconds.
>>> [...]       Not tainted 4.12.0-rc6+ #19
>>> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [...] kworker/u384:1  D    0  1234      2 0x00000000
>>> [...] Workqueue: writeback wb_workfn (flush-253:0)
>>> [...] Call Trace:
>>> [...]  __schedule+0x28a/0x880
>>> [...]  schedule+0x36/0x80
>>> [...]  schedule_timeout+0x249/0x300
>>> [...]  ? __schedule+0x292/0x880
>>> [...]  __down_common+0xfc/0x132
>>> [...]  ? _xfs_buf_find+0x2bb/0x510 [xfs]
>>> [...]  __down+0x1d/0x1f
>>> [...]  down+0x41/0x50
>>> [...]  xfs_buf_lock+0x3c/0xf0 [xfs]
>>> [...]  _xfs_buf_find+0x2bb/0x510 [xfs]
>>> [...]  xfs_buf_get_map+0x2a/0x280 [xfs]
>>> [...]  xfs_buf_read_map+0x2d/0x180 [xfs]
>>> [...]  xfs_trans_read_buf_map+0xf5/0x310 [xfs]
>>> [...]  xfs_btree_read_buf_block.constprop.35+0x78/0xc0 [xfs]
>>> [...]  xfs_btree_lookup_get_block+0x88/0x160 [xfs]
>>> [...]  xfs_btree_lookup+0xd0/0x3b0 [xfs]
>>> [...]  ? xfs_allocbt_init_cursor+0x41/0xe0 [xfs]
>>> [...]  xfs_alloc_ag_vextent_near+0xaf/0xaa0 [xfs]
>>> [...]  xfs_alloc_ag_vextent+0x13c/0x150 [xfs]
>>> [...]  xfs_alloc_vextent+0x425/0x590 [xfs]
>>> [...]  xfs_bmap_btalloc+0x448/0x770 [xfs]
>>> [...]  xfs_bmap_alloc+0xe/0x10 [xfs]
>>> [...]  xfs_bmapi_write+0x61d/0xc10 [xfs]
>>> [...]  ? kmem_zone_alloc+0x96/0x100 [xfs]
>>> [...]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
>>> [...]  xfs_map_blocks+0x1e8/0x260 [xfs]
>>> [...]  xfs_do_writepage+0x1ca/0x680 [xfs]
>>> [...]  write_cache_pages+0x26f/0x510
>>> [...]  ? xfs_vm_set_page_dirty+0x1d0/0x1d0 [xfs]
>>> [...]  ? blk_mq_dispatch_rq_list+0x305/0x410
>>> [...]  ? deadline_remove_request+0x7d/0xc0
>>> [...]  xfs_vm_writepages+0xb6/0xd0 [xfs]
>>> [...]  do_writepages+0x1c/0x70
>>> [...]  __writeback_single_inode+0x45/0x320
>>> [...]  writeback_sb_inodes+0x280/0x570
>>> [...]  __writeback_inodes_wb+0x8c/0xc0
>>> [...]  wb_writeback+0x276/0x310
>>> [...]  ? get_nr_dirty_inodes+0x4d/0x80
>>> [...]  wb_workfn+0x2d4/0x3b0
>>> [...]  process_one_work+0x149/0x360
>>> [...]  worker_thread+0x4d/0x3c0
>>> [...]  kthread+0x109/0x140
>>> [...]  ? rescuer_thread+0x380/0x380
>>> [...]  ? kthread_park+0x60/0x60
>>> [...]  ret_from_fork+0x25/0x30
>>>
>>>
>>> I bisected the upstream kernel and found that the following commit
>>> introduced the issue.
>>>
>>> commit c5cb83bb337c25caae995d992d1cdf9b317f83de
>>> Author: Thomas Gleixner <tglx@linutronix.de>
>>> Date:   Tue Jun 20 01:37:51 2017 +0200
>>>
>>>     genirq/cpuhotplug: Handle managed IRQs on CPU hotplug
>>
>> Can you please post your /proc/interrupts and details of which
>> interrupt you think goes wrong? This backtrace is not telling us much
>> in terms of where to start looking...
> 
> Thank you for the advice.
> 
> The issue is easily reproduced on a physical or virtual machine by offlining all CPUs except cpu0.
> Here is my /proc/interrupts on a KVM guest before reproducing the issue. When offlining
> cpu1, the issue occurred; when offlining cpu0, it did not.
> 
>            CPU0       CPU1
>   0:        127          0   IO-APIC   2-edge      timer
>   1:         10          0   IO-APIC   1-edge      i8042
>   4:        227          0   IO-APIC   4-edge      ttyS0
>   6:          3          0   IO-APIC   6-edge      floppy
>   8:          0          0   IO-APIC   8-edge      rtc0
>   9:          0          0   IO-APIC   9-fasteoi   acpi
>  10:      10822          0   IO-APIC  10-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, virtio3
>  11:         23          0   IO-APIC  11-fasteoi   uhci_hcd:usb3, uhci_hcd:usb4, qxl
>  12:         15          0   IO-APIC  12-edge      i8042
>  14:        218          0   IO-APIC  14-edge      ata_piix
>  15:          0          0   IO-APIC  15-edge      ata_piix
>  24:          0          0   PCI-MSI 49152-edge      virtio0-config
>  25:        359          0   PCI-MSI 49153-edge      virtio0-input.0
>  26:          1          0   PCI-MSI 49154-edge      virtio0-output.0
>  27:          0          0   PCI-MSI 114688-edge      virtio2-config
>  28:          1       3639   PCI-MSI 114689-edge      virtio2-req.0
>  29:          0          0   PCI-MSI 98304-edge      virtio1-config
>  30:          4          0   PCI-MSI 98305-edge      virtio1-virtqueues
>  31:        189          0   PCI-MSI 65536-edge      snd_hda_intel:card0
> NMI:          0          0   Non-maskable interrupts
> LOC:      16115      12845   Local timer interrupts
> SPU:          0          0   Spurious interrupts
> PMI:          0          0   Performance monitoring interrupts
> IWI:          0          0   IRQ work interrupts
> RTR:          0          0   APIC ICR read retries
> RES:       3016       2135   Rescheduling interrupts
> CAL:       3666        557   Function call interrupts
> TLB:         65         12   TLB shootdowns
> TRM:          0          0   Thermal event interrupts
> THR:          0          0   Threshold APIC interrupts
> DFR:          0          0   Deferred Error APIC interrupts
> MCE:          0          0   Machine check exceptions
> MCP:          1          1   Machine check polls
> ERR:          0
> MIS:          0
> PIN:          0          0   Posted-interrupt notification event
> NPI:          0          0   Nested posted-interrupt event
> PIW:          0          0   Posted-interrupt wakeup event

I was able to reproduce this with an arm64 VM pretty easily. The issue
is that, in the above case, IRQ27 is affine to CPU0 and IRQ28 to CPU1.
If you have more CPUs, IRQ27 is affine to the first half of the CPUs,
and IRQ28 to the others.

When CPU1 is offlined, the fact that we have a "managed" interrupt
affinity prevents the interrupt from being moved to CPU0, and you lose
your disk. I don't think that's the expected effect... The opposite case
(offlining CPU0) only "works" because you're not getting any config
interrupt (IOW, you're lucky).

I'm not sure how this is supposed to work. Shutting down the interrupt
in migrate_one_irq() just breaks everything (unless the driver somehow
knows about it, which doesn't seem to be the case).

Christoph?

	M.
-- 
Jazz is not dead. It just smells funny...

Thread overview: 43+ messages
2017-08-08 19:25 system hung up when offlining CPUs YASUAKI ISHIMATSU
2017-08-09 11:42 ` Marc Zyngier
2017-08-09 19:09   ` YASUAKI ISHIMATSU
2017-08-10 11:54     ` Marc Zyngier [this message]
2017-08-21 12:07       ` Christoph Hellwig
2017-08-21 13:18       ` Christoph Hellwig
2017-08-21 13:37         ` Marc Zyngier
2017-09-07 20:23           ` YASUAKI ISHIMATSU
2017-09-12 18:15             ` YASUAKI ISHIMATSU
2017-09-13 11:13               ` Hannes Reinecke
2017-09-13 11:35                 ` Kashyap Desai
2017-09-13 13:33                   ` Thomas Gleixner
2017-09-14 16:28                     ` YASUAKI ISHIMATSU
2017-09-16 10:15                       ` Thomas Gleixner
2017-09-16 15:02                         ` Thomas Gleixner
2017-10-02 16:36                           ` YASUAKI ISHIMATSU
2017-10-03 21:44                             ` Thomas Gleixner
2017-10-04 21:04                               ` Thomas Gleixner
2017-10-09 11:35                                 ` [tip:irq/urgent] genirq/cpuhotplug: Add sanity check for effective affinity mask tip-bot for Thomas Gleixner
2017-10-09 11:35                                 ` [tip:irq/urgent] genirq/cpuhotplug: Enforce affinity setting on startup of managed irqs tip-bot for Thomas Gleixner
2017-10-10 16:30                                 ` system hung up when offlining CPUs YASUAKI ISHIMATSU
2017-10-16 18:59                                   ` YASUAKI ISHIMATSU
2017-10-16 20:27                                     ` Thomas Gleixner
2017-10-30  9:08                                       ` Shivasharan Srikanteshwara
2017-11-01  0:47                                         ` Thomas Gleixner
2017-11-01 11:01                                           ` Hannes Reinecke
2017-10-04 21:10                             ` Thomas Gleixner
