Subject: Re: system hung up when offlining CPUs
To: Marc Zyngier
Cc: tglx@linutronix.de, axboe@kernel.dk, mpe@ellerman.id.au,
    keith.busch@intel.com, peterz@infradead.org, LKML, yasu.isimatu@gmail.com
From: YASUAKI ISHIMATSU
Date: Wed, 9 Aug 2017 15:09:33 -0400
In-Reply-To: <20170809124213.0d9518bb@why.wild-wind.fr.eu.org>

Hi Marc,

On 08/09/2017 07:42 AM, Marc Zyngier wrote:
> On Tue, 8 Aug 2017 15:25:35 -0400
> YASUAKI ISHIMATSU wrote:
>
>> Hi Thomas,
>>
>> When offlining all CPUs except cpu0, the system hung up with the
>> following message.
>>
>> [...] INFO: task kworker/u384:1:1234 blocked for more than 120 seconds.
>> [...]       Not tainted 4.12.0-rc6+ #19
>> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [...] kworker/u384:1  D    0  1234      2 0x00000000
>> [...] Workqueue: writeback wb_workfn (flush-253:0)
>> [...] Call Trace:
>> [...]  __schedule+0x28a/0x880
>> [...]  schedule+0x36/0x80
>> [...]  schedule_timeout+0x249/0x300
>> [...]  ? __schedule+0x292/0x880
>> [...]  __down_common+0xfc/0x132
>> [...]  ? _xfs_buf_find+0x2bb/0x510 [xfs]
>> [...]  __down+0x1d/0x1f
>> [...]  down+0x41/0x50
>> [...]  xfs_buf_lock+0x3c/0xf0 [xfs]
>> [...]  _xfs_buf_find+0x2bb/0x510 [xfs]
>> [...]  xfs_buf_get_map+0x2a/0x280 [xfs]
>> [...]  xfs_buf_read_map+0x2d/0x180 [xfs]
>> [...]  xfs_trans_read_buf_map+0xf5/0x310 [xfs]
>> [...]  xfs_btree_read_buf_block.constprop.35+0x78/0xc0 [xfs]
>> [...]  xfs_btree_lookup_get_block+0x88/0x160 [xfs]
>> [...]  xfs_btree_lookup+0xd0/0x3b0 [xfs]
>> [...]  ? xfs_allocbt_init_cursor+0x41/0xe0 [xfs]
>> [...]  xfs_alloc_ag_vextent_near+0xaf/0xaa0 [xfs]
>> [...]  xfs_alloc_ag_vextent+0x13c/0x150 [xfs]
>> [...]  xfs_alloc_vextent+0x425/0x590 [xfs]
>> [...]  xfs_bmap_btalloc+0x448/0x770 [xfs]
>> [...]  xfs_bmap_alloc+0xe/0x10 [xfs]
>> [...]  xfs_bmapi_write+0x61d/0xc10 [xfs]
>> [...]  ? kmem_zone_alloc+0x96/0x100 [xfs]
>> [...]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
>> [...]  xfs_map_blocks+0x1e8/0x260 [xfs]
>> [...]  xfs_do_writepage+0x1ca/0x680 [xfs]
>> [...]  write_cache_pages+0x26f/0x510
>> [...]  ? xfs_vm_set_page_dirty+0x1d0/0x1d0 [xfs]
>> [...]  ? blk_mq_dispatch_rq_list+0x305/0x410
>> [...]  ? deadline_remove_request+0x7d/0xc0
>> [...]  xfs_vm_writepages+0xb6/0xd0 [xfs]
>> [...]  do_writepages+0x1c/0x70
>> [...]  __writeback_single_inode+0x45/0x320
>> [...]  writeback_sb_inodes+0x280/0x570
>> [...]  __writeback_inodes_wb+0x8c/0xc0
>> [...]  wb_writeback+0x276/0x310
>> [...]  ? get_nr_dirty_inodes+0x4d/0x80
>> [...]  wb_workfn+0x2d4/0x3b0
>> [...]  process_one_work+0x149/0x360
>> [...]  worker_thread+0x4d/0x3c0
>> [...]  kthread+0x109/0x140
>> [...]  ? rescuer_thread+0x380/0x380
>> [...]  ? kthread_park+0x60/0x60
>> [...]  ret_from_fork+0x25/0x30
>>
>> I bisected the upstream kernel and found that the following commit
>> leads to the issue.
>>
>> commit c5cb83bb337c25caae995d992d1cdf9b317f83de
>> Author: Thomas Gleixner
>> Date:   Tue Jun 20 01:37:51 2017 +0200
>>
>>     genirq/cpuhotplug: Handle managed IRQs on CPU hotplug
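For reference, my rough understanding of what that change does (paraphrased
from memory, not the actual diff): when a CPU goes offline, a managed
interrupt whose affinity mask no longer contains any online CPU is shut
down instead of having its affinity broken and moved to some other CPU.
A simplified sketch along the lines of kernel/irq/cpuhotplug.c; the
function name here is made up for illustration:

#include <linux/irq.h>
#include <linux/cpumask.h>
#include "internals.h"	/* irq_shutdown() */

/* Sketch only: what happens to one IRQ when its CPU goes down. */
static bool migrate_one_irq_sketch(struct irq_desc *desc)
{
	struct irq_data *d = irq_desc_get_irq_data(desc);
	const struct cpumask *affinity = irq_data_get_affinity_mask(d);

	/*
	 * Managed interrupt with no online CPU left in its mask: leave
	 * the affinity untouched and shut the interrupt down. It is
	 * expected to be restarted when one of its CPUs comes back.
	 */
	if (irqd_affinity_is_managed(d) &&
	    cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) {
		irq_shutdown(desc);
		return false;
	}

	/*
	 * Otherwise the interrupt is moved to a still-online CPU as
	 * before (that part is elided from this sketch).
	 */
	return true;
}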
>
> Can you please post your /proc/interrupts and details of which
> interrupt you think goes wrong? This backtrace is not telling us much
> in terms of where to start looking...

Thank you for the advice.

The issue is easily reproduced on a physical or virtual machine by
offlining all CPUs except cpu0 (a minimal repro sketch is at the end of
this mail).

Here is my /proc/interrupts on a KVM guest before reproducing the issue.
When offlining cpu1, the issue occurred; when offlining cpu0, it did not.

           CPU0       CPU1
  0:        127          0   IO-APIC    2-edge      timer
  1:         10          0   IO-APIC    1-edge      i8042
  4:        227          0   IO-APIC    4-edge      ttyS0
  6:          3          0   IO-APIC    6-edge      floppy
  8:          0          0   IO-APIC    8-edge      rtc0
  9:          0          0   IO-APIC    9-fasteoi   acpi
 10:      10822          0   IO-APIC   10-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, virtio3
 11:         23          0   IO-APIC   11-fasteoi   uhci_hcd:usb3, uhci_hcd:usb4, qxl
 12:         15          0   IO-APIC   12-edge      i8042
 14:        218          0   IO-APIC   14-edge      ata_piix
 15:          0          0   IO-APIC   15-edge      ata_piix
 24:          0          0   PCI-MSI  49152-edge    virtio0-config
 25:        359          0   PCI-MSI  49153-edge    virtio0-input.0
 26:          1          0   PCI-MSI  49154-edge    virtio0-output.0
 27:          0          0   PCI-MSI 114688-edge    virtio2-config
 28:          1       3639   PCI-MSI 114689-edge    virtio2-req.0
 29:          0          0   PCI-MSI  98304-edge    virtio1-config
 30:          4          0   PCI-MSI  98305-edge    virtio1-virtqueues
 31:        189          0   PCI-MSI  65536-edge    snd_hda_intel:card0
NMI:          0          0   Non-maskable interrupts
LOC:      16115      12845   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
IWI:          0          0   IRQ work interrupts
RTR:          0          0   APIC ICR read retries
RES:       3016       2135   Rescheduling interrupts
CAL:       3666        557   Function call interrupts
TLB:         65         12   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
DFR:          0          0   Deferred Error APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:          1          1   Machine check polls
ERR:          0
MIS:          0
PIN:          0          0   Posted-interrupt notification event
NPI:          0          0   Nested posted-interrupt event
PIW:          0          0   Posted-interrupt wakeup event

Thanks,
Yasuaki Ishimatsu

>
> Thanks,
>
> 	M.
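P.S. For completeness, this is how a CPU is taken offline when reproducing
the problem; on the two-CPU KVM guest above, offlining cpu1 is enough. It
is nothing more than the standard sysfs hotplug interface (the same as
"echo 0 > /sys/devices/system/cpu/cpu1/online"), wrapped in a tiny C
program for clarity:

#include <stdio.h>

int main(void)
{
	/* Take cpu1 offline; writing "1" instead would bring it back. */
	FILE *f = fopen("/sys/devices/system/cpu/cpu1/online", "w");

	if (!f) {
		perror("cpu1/online");
		return 1;
	}
	fputs("0", f);
	fclose(f);
	return 0;
}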