From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161648AbaKNRtx (ORCPT ); Fri, 14 Nov 2014 12:49:53 -0500
Received: from mga02.intel.com ([134.134.136.20]:22017 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1161180AbaKNRtv (ORCPT ); Fri, 14 Nov 2014 12:49:51 -0500
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.07,386,1413270000"; d="scan'208";a="607993195"
From: "Luck, Tony"
To: Andy Lutomirski
CC: Oleg Nesterov, Borislav Petkov, X86 ML, "linux-kernel@vger.kernel.org", Peter Zijlstra, "Andi Kleen"
Subject: RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace
Thread-Topic: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace
Thread-Index: AQHP/s7+MsolZCiViEiCPzqCVeTRAJxdoOwAgACR1oD//3yvMIAAtdCAgAB2NOCAAM3pAP//e6+AgACK+ACAAAdiAIAAGwAAgACUXsA=
Date: Fri, 14 Nov 2014 17:49:29 +0000
Message-ID: <3908561D78D1C84285E8C5FCA982C28F3293A5D8@ORSMSX114.amr.corp.intel.com>
References: <20141112220058.GA5295@redhat.com> <3908561D78D1C84285E8C5FCA982C28F3292BAB4@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F3292BD44@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F3292CB9A@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F3292D57B@ORSMSX114.amr.corp.intel.com>
In-Reply-To:
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
x-originating-ip: [10.22.254.140]
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from base64 to 8bit by nfs id sAEHo0Pq002655

> Can you also try rebasing onto what will probably be v3?
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/tag/?id=paranoid-stack-v2.9

Built that - with none of my other changes ... i.e. still use TIF_NOTIFY_MCE etc.
No printk() in the MCE context. System ran 736 injection/consumption/recovery cycles and then got an RCU stall - followed by a zillion soft lockups.

[ 203.326117] mce: Uncorrected hardware memory error in user-access at 100f07f800
[ 203.326193] MCE 0x100f07f: Killing harderrors:12052 due to hardware memory corruption
[ 203.326195] MCE 0x100f07f: dirty LRU page recovery: Recovered
[ 204.721893] mce: Uncorrected hardware memory error in user-access at 100f7073c0
[ 204.721906] INFO: rcu_sched self-detected stall on CPU { 91} (t=60002 jiffies g=5125 c=5124 q=0)
[ 204.721908] Task dump for CPU 91:
[ 204.721911] kworker/91:1 R running task 0 1033 2 0x00000008
[ 204.721925] Workqueue: events_power_efficient fb_flashcursor
[ 204.721929] ffff880c6767def0 00000000c74bfa96 ffff880c6fa63d68 ffffffff81099d68
[ 204.721930] 000000000000005b ffffffff819d1140 ffff880c6fa63d88 ffffffff8109d38d
[ 204.721932] 0000000000000087 000000000000000c ffff880c6fa63db8 ffffffff810caed0
[ 204.721933] Call Trace:
[ 204.721946] [] sched_show_task+0xa8/0x110
[ 204.721951] [] dump_cpu_task+0x3d/0x50
[ 204.721961] [] rcu_dump_cpu_stacks+0x90/0xd0
[ 204.721967] [] rcu_check_callbacks+0x497/0x710
[ 204.721974] [] update_process_times+0x4b/0x80
[ 204.721986] [] tick_sched_handle.isra.19+0x25/0x60
[ 204.721989] [] tick_sched_timer+0x45/0x80
[ 204.721992] [] __run_hrtimer+0x77/0x1d0
[ 204.721995] [] ? tick_sched_handle.isra.19+0x60/0x60
[ 204.721997] [] hrtimer_interrupt+0xf7/0x240
[ 204.722008] [] local_apic_timer_interrupt+0x3b/0x70
[ 204.722018] [] smp_apic_timer_interrupt+0x45/0x60
[ 204.722020] [] apic_timer_interrupt+0x6d/0x80
[ 204.722034] [] ? console_unlock+0x418/0x460
[ 204.722037] [] fb_flashcursor+0x5d/0x140
[ 204.722040] [] ? bit_clear+0x120/0x120
[ 204.722049] [] process_one_work+0x14e/0x3f0
[ 204.722051] [] worker_thread+0x11b/0x510
[ 204.722053] [] ? rescuer_thread+0x350/0x350
[ 204.722057] [] kthread+0xe1/0x100
[ 204.722059] [] ? kthread_create_on_node+0x1b0/0x1b0
[ 204.722074] [] ret_from_fork+0x7c/0xb0
[ 204.722076] [] ? kthread_create_on_node+0x1b0/0x1b0
[ 227.462386] NMI watchdog: BUG: soft lockup - CPU#18 stuck for 22s! [migration/18:134]
[ 227.462452] Modules linked in: einj ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw sg iptable_filter ip_tables vfat fat iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal coretemp kvm ixgbe crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ptp lrw gf128mul pps_core glue_helper mdio dca ablk_helper sb_edac cryptd edac_core lpc_ich pcspkr shpchp i2c_i801 mfd_core ipmi_si wmi ipmi_msghandler acpi_pad xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper sr_mod cdrom ttm drm ahci libahci mpt2sas libata raid_class i2c_core scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[ 227.462470] CPU: 18 PID: 134 Comm: migration/18 Tainted: G M W 3.18.0-rc3 #1
[ 227.462472] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0058.D01.1410201505 10/20/2014
[ 227.462474] task: ffff880c68605ef0 ti: ffff880c67d9c000 task.ti: ffff880c67d9c000
[ 227.462484] RIP: 0010:[] [] multi_cpu_stop+0x70/0xf0
[ 227.462485] RSP: 0018:ffff880c67d9fd68 EFLAGS: 00000293
[ 227.462487] RAX: 0000000000000000 RBX: ffff880c6f814840 RCX: ffffffffffffffff
[ 227.462488] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff81ab3320
[ 227.462489] RBP: ffff880c67d9fd88 R08: ffffffff81ab3328 R09: ffff881467e58d90
[ 227.462490] R10: ffffffff81ab3320 R11: 0000000000000001 R12: 0000000000000000
[ 227.462492] R13: ffff880c677c7800 R14: ffff880c67000800 R15: ffff880c00000000
[ 227.462494] FS: 0000000000000000(0000) GS:ffff880c6f800000(0000) knlGS:0000000000000000
[ 227.462495] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 227.462496] CR2: 00007f2147fcce90 CR3: 0000000001978000 CR4: 00000000001407e0
[ 227.462498] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 227.462500] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 227.462500] Stack:
[ 227.462503] ffff880c65a8fd20 ffff880c6f80f0a0 ffff880c65a8fdb8 ffff880c6f80f0a8
[ 227.462505] ffff880c67d9fe58 ffffffff81105778 ffffffff81095387 0000000000000010
[ 227.462507] 0000000000000282 ffff880c67d9fdc8 0000000000000018 0000000000000000
[ 227.462508] Call Trace:
[ 227.462512] [] cpu_stopper_thread+0x78/0x150
[ 227.462516] [] ? finish_task_switch+0x57/0x180
[ 227.462522] [] ? __schedule+0x2f7/0x7e0
[ 227.462531] [] smpboot_thread_fn+0xff/0x1b0
[ 227.462534] [] ? SyS_setgroups+0x1a0/0x1a0
[ 227.462537] [] kthread+0xe1/0x100
[ 227.462539] [] ? kthread_create_on_node+0x1b0/0x1b0
[ 227.462544] [] ret_from_fork+0x7c/0xb0
[ 227.462547] [] ? kthread_create_on_node+0x1b0/0x1b0
[ 227.462572] Code: 23 66 2e 0f 1f 84 00 00 00 00 00 83 fb 03 75 05 45 84 ed 75 66 f0 41 ff 4c 24 24 74 26 89 da 83 fa 04 74 3d f3 90 41 8b 5c 24 20 <39> d3 74 f0 83 fb 02 75 d7 fa 66 0f 1f 44 00 00 eb d8 66 0f 1f
[ 227.478401] NMI watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:142]
[ 227.478437] Modules linked in: einj ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw sg iptable_filter ip_tables vfat fat iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal coretemp kvm ixgbe crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ptp lrw gf128mul pps_core glue_helper mdio dca ablk_helper sb_edac cryptd edac_core lpc_ich pcspkr shpchp i2c_i801 mfd_core ipmi_si wmi ipmi_msghandler acpi_pad xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper sr_mod cdrom ttm drm ahci libahci mpt2sas libata raid_class i2c_core scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[ 227.478448] CPU: 19 PID: 142 Comm: migration/19 Tainted: G M W L 3.18.0-rc3 #1
[ 227.478449] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0058.D01.1410201505 10/20/2014
[ 227.478451] task: ffff880c67dc1b20 ti: ffff880c67dd0000 task.ti: ffff880c67dd0000
[ 227.478456] RIP: 0010:[] [] multi_cpu_stop+0x70/0xf0
[ 227.478457] RSP: 0018:ffff880c67dd3d68 EFLAGS: 00000293
[ 227.478459] RAX: 0000000000000000 RBX: ffff880c6f834840 RCX: ffffffffffffffff
[ 227.478460] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff81ab3320
[ 227.478461] RBP: ffff880c67dd3d88 R08: ffffffff81ab3328 R09: ffff881467e59b20
[ 227.478462] R10: 0000000000000004 R11: 0000000000000005 R12: 0000000000000000
[ 227.478463] R13: ffff880c677c6000 R14: ffff880c67002800 R15: ffff880c00000000
[ 227.478464] FS: 0000000000000000(0000) GS:ffff880c6f820000(0000) knlGS:0000000000000000
[ 227.478466] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 227.478467] CR2: 00007f09b6e2eef0 CR3: 0000000001978000 CR4: 00000000001407e0
[ 227.478468] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 227.478469] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 227.478470] Stack:
[ 227.478472] ffff880c65a8fd20 ffff880c6f82f0a0 ffff880c65a8fdb8 ffff880c6f82f0a8
[ 227.478474] ffff880c67dd3e58 ffffffff81105778 ffffffff81095387 0000000000000010
[ 227.478476] 0000000000000216 ffff880c67dd3dc8 0000000000000018 0000000000000000
[ 227.478477] Call Trace:
[ 227.478480] [] cpu_stopper_thread+0x78/0x150
[ 227.478483] [] ? finish_task_switch+0x57/0x180
[ 227.478486] [] ? __schedule+0x2f7/0x7e0
[ 227.478491] [] smpboot_thread_fn+0xff/0x1b0
[ 227.478494] [] ? SyS_setgroups+0x1a0/0x1a0
[ 227.478496] [] kthread+0xe1/0x100
[ 227.478498] [] ? kthread_create_on_node+0x1b0/0x1b0
[ 227.478502] [] ret_from_fork+0x7c/0xb0
[ 227.478504] [] ? kthread_create_on_node+0x1b0/0x1b0
[ 227.478526] Code: 23 66 2e 0f 1f 84 00 00 00 00 00 83 fb 03 75 05 45 84 ed 75 66 f0 41 ff 4c 24 24 74 26 89 da 83 fa 04 74 3d f3 90 41 8b 5c 24 20 <39> d3 74 f0 83 fb 02 75 d7 fa 66 0f 1f 44 00 00 eb d8 66 0f 1f
[ 227.493414] NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [migration/20:149]
[ 227.493448] Modules linked in: einj ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw sg iptable_filter ip_tables vfat fat iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal coretemp kvm ixgbe crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ptp lrw gf128mul pps_core glue_helper mdio dca ablk_helper sb_edac cryptd edac_core lpc_ich pcspkr shpchp i2c_i801 mfd_core ipmi_si wmi ipmi_msghandler acpi_pad xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper sr_mod cdrom ttm drm ahci libahci mpt2sas libata raid_class i2c_core scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[ 227.493460] CPU: 20 PID: 149 Comm: migration/20 Tainted: G M W L 3.18.0-rc3 #1

> It adds debugging for inappropriate reschedules from the wrong stack.
> Setting CONFIG_DEBUG_ATOMIC_SLEEP might also be a good idea.

Will add that for next build/test

-Tony