From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754703AbcAYBnr (ORCPT <rfc822;w@1wt.eu>);
	Sun, 24 Jan 2016 20:43:47 -0500
Received: from mail-io0-f174.google.com ([209.85.223.174]:36551 "EHLO
	mail-io0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752543AbcAYBno (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Sun, 24 Jan 2016 20:43:44 -0500
MIME-Version: 1.0
In-Reply-To: <CAO6TR8WKrjy7T_1u-aQTPTpKXyNYG5sfYgJUS6FFFG3RYjUnrw@mail.gmail.com>
References: <CAO6TR8WKrjy7T_1u-aQTPTpKXyNYG5sfYgJUS6FFFG3RYjUnrw@mail.gmail.com>
Date: Sun, 24 Jan 2016 18:43:43 -0700
Message-ID: <CAO6TR8UXOZJgyFP8Xd01MjPxj+kMYWfpeA7B9ASuntsCDuTn5w@mail.gmail.com>
Subject: Re: [BUG REPORT] Soft Lockup in smp_call_function_single+0xD8
From: Jeff Merkey <linux.mdb@gmail.com>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>, Andrew Morton <akpm@linux-foundation.org>,
        Vlastimil Babka <vbabka@suse.cz>,
        "Peter Zijlstra (Intel)" <peterz@infradead.org>,
        Mel Gorman <mgorman@techsingularity.net>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 1/24/16, Jeff Merkey <linux.mdb@gmail.com> wrote:
> If I single step with the kdb, kgdb, or mdb kernel debuggers over
> a sysret instruction anywhere in the OS, the system hard hangs in
> smp_call_function_single after the debugger releases the system and it
> resumes normal operation. The specific place the kernel hangs is in
> the loop below. The softlockup detector sometimes detects this
> condition when it occurs, but not always; most of the time the
> system is just hung and unresponsive.
>
> (2)> u smp_call_function_single+d8
> <<<< hard hang in this loop with EDX=3
> 0xffffffff810fce48 8B55E0          mov    edx,DWORD PTR [rbp-32]=0xCE037DC0
> 0xffffffff810fce4b 83E201          and    edx,0x1
> 0xffffffff810fce4e 75F6            jne    smp_call_function_single+0xd6 (0xffffffff810fce46) (up)
> <<<<<
> 0xffffffff810fce50 EBC3            jmp    smp_call_function_single+0xa5 (0xffffffff810fce15) (up)
> 0xffffffff810fce52 8B05E08EC700    mov    eax,[oops_in_progress]=0x0
> 0xffffffff810fce58 85C0            test   eax,eax
> 0xffffffff810fce5a 7585            jne    smp_call_function_single+0x71 (0xffffffff810fcde1) (up)
> 0xffffffff810fce5c 803D8E0C9D0000  cmp    [__warned.20610]=0x00,0x0
> 0xffffffff810fce63 0F8578FFFFFF    jne    smp_call_function_single+0x71 (0xffffffff810fcde1) (up)
> 0xffffffff810fce69 BE24010000      mov    esi,0x124
> 0xffffffff810fce6e 48C7C796B08C81  mov    rdi,0xffffffff818cb096
> 0xffffffff810fce75 894DBC          mov    DWORD PTR [rbp-68]=0x0,ecx
> 0xffffffff810fce78 488955C0        mov    QWORD PTR [rbp-64]=0xFFFFFFFFFFFFFF10,rdx
> 0xffffffff810fce7c E8FF21F8FF      call   warn_slowpath_null
> 0xffffffff810fce81 C605690C9D0001  mov    [__warned.20610]=0x00,0x1
> 0xffffffff810fce88 8B4DBC          mov    ecx,DWORD PTR [rbp-68]=0x0
> 0xffffffff810fce8b 488B55C0        mov    rdx,QWORD PTR [rbp-64]=0xFFFFFFFFFFFFFF10
> 0xffffffff810fce8f E94DFFFFFF      jmp    smp_call_function_single+0x71 (0xffffffff810fcde1) (up)
> 0xffffffff810fce94 E8A71EF8FF      call   __stack_chk_fail
> 0xffffffff810fce99 0F1F8000000000  nop    DWORD PTR [rax]=0x0
> (2)> g
>
>
> The stack backtrace when the bug occurs is:
>
> smp_call_function_single+0xd8
> unmap_page_range+0x613
> flush_tlb_func+0x0
> smp_call_function_many+215
> native_flush_tlb_others+0x118
> flush_tlb_mm_range+0x61
> tlb_flush_mmu_tlbonly+0x6b
> tlb_finish_mmu+0x14
> unmap_region+0xe2
> vma_rb_erase+0x10f
> do_unmap+0x217
> vm_unmap+0x41
> SyS_munmap+0x22
> entry_SYSCALL_64_fastpath+0x12
>
> I traced through this code a bunch of times in normal operation,
> without triggering the bug, to get a feel for what it normally sees in
> EDX. It looks like someone has coded a spin loop that has EDX=0 in
> every case I saw, except when this bug occurs.
>
> So the exact C code this maps to, from objdump of kernel/smp.o, is:
>
>  469:	e8 62 fe ff ff       	callq  2d0 <generic_exec_single>
>  46e:	8b 55 e0             	mov    -0x20(%rbp),%edx
>  * previous function call. For multi-cpu calls its even more interesting
>  * as we'll have to ensure no other cpu is observing our csd.
>  */
> static void csd_lock_wait(struct call_single_data *csd)
> {
> 	while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
>  471:	83 e2 01             	and    $0x1,%edx
>  474:	74 cf                	je     445 <smp_call_function_single+0xa5>
>  476:	f3 90                	pause
> <<<<<<<<<<
>  478:	8b 55 e0             	mov    -0x20(%rbp),%edx
>  47b:	83 e2 01             	and    $0x1,%edx
>  47e:	75 f6                	jne    476 <smp_call_function_single+0xd6>
> <<<<<<<<<<<
>  480:	eb c3                	jmp    445 <smp_call_function_single+0xa5>
> 	 * Can deadlock when called with interrupts disabled.
> 	 * We allow cpu's that are not yet online though, as no one else can
> 	 * send smp call function interrupt to this cpu and as such deadlocks
> 	 * can't happen.
> 	 */
> 	WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
>
> Each time this bug occurs, csd->flags is set to a value of 3 and
> never changes. When the system is running normally, it seems to
> be 0 the rest of the time. Setting EDX=0 from the debugger console
> clears the hang condition and the system seems to recover, except that
> the console reports this error when you attempt to load programs,
> indicating that the system's ability to load shared objects is broken.
>
> #
> # ls -l
> /lib64/libc.so.6 version GLI not found   << this error and no shared
> objects will load
> #
> #
>
> Jeff
>
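
For reference, here is a rough user-space sketch of how the stuck value
csd->flags=3 reads against the loop condition in csd_lock_wait().
CSD_FLAG_LOCK=0x01 matches the "and $0x1,%edx" in the listing above.
CSD_FLAG_SYNCHRONOUS=0x02 is my assumption about this kernel version, and
the helper names are made up for illustration; none of this is kernel code.

```c
#include <assert.h>
#include <stdio.h>

/* Flag bits for csd->flags.
 * CSD_FLAG_LOCK = 0x01 is confirmed by the "and $0x1" in the disassembly.
 * CSD_FLAG_SYNCHRONOUS = 0x02 is an ASSUMPTION about this kernel version. */
enum {
	CSD_FLAG_LOCK        = 0x01,
	CSD_FLAG_SYNCHRONOUS = 0x02,
};

/* Hypothetical helper: nonzero while csd_lock_wait() would keep spinning,
 * i.e. while the LOCK bit is still set. */
int csd_would_spin(unsigned int flags)
{
	return (flags & CSD_FLAG_LOCK) != 0;
}

/* Hypothetical helper: nonzero if the (assumed) SYNCHRONOUS bit is set. */
int csd_is_synchronous(unsigned int flags)
{
	return (flags & CSD_FLAG_SYNCHRONOUS) != 0;
}

/* Hypothetical helper: print a one-line decode of a flags value. */
void csd_describe(unsigned int flags, FILE *out)
{
	assert(out != NULL);
	fprintf(out, "flags=0x%x lock=%d synchronous=%d spin=%s\n",
		flags, csd_would_spin(flags), csd_is_synchronous(flags),
		csd_would_spin(flags) ? "yes" : "no");
}
```

Calling csd_describe(3, stdout) reports the LOCK bit as set, so the wait
loop keeps spinning until something clears bit 0; a value of 0 falls
straight through, which fits the EDX=0 seen in normal operation.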

I am running down a trace of the MSR values for swapgs.  It looks like
swapgs got nested somewhere down in the entry_64 code.  If so, then this
hang is just a symptom and not the sickness.

Jeff