From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754703AbcAYBnr (ORCPT ); Sun, 24 Jan 2016 20:43:47 -0500 Received: from mail-io0-f174.google.com ([209.85.223.174]:36551 "EHLO mail-io0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752543AbcAYBno (ORCPT ); Sun, 24 Jan 2016 20:43:44 -0500 MIME-Version: 1.0 In-Reply-To: References: Date: Sun, 24 Jan 2016 18:43:43 -0700 Message-ID: Subject: Re: [BUG REPORT] Soft Lockup in smp_call_function_single+0xD8 From: Jeff Merkey To: LKML Cc: Ingo Molnar , Andrew Morton , Vlastimil Babka , "Peter Zijlstra (Intel)" , Mel Gorman Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 1/24/16, Jeff Merkey wrote: > If I single step with either kgdb, kgdb, or mdb kernel debuggers over > a sysret instruction anywhere in the OS, the system hard hangs in > smp_call_function_single after the debugger releases the system and it > resumes normal operation. The specific place the kernel hangs is in > the loop below. The softlockup detector will periodically detect > this condition when it occurs, but not always, most of the time the > system is just hung and unresponsive. > > (2)> u smp_call_function_single+d8 > <<<< hard hang in this loop with EDX=3 > 0xffffffff810fce48 8B55E0 mov edx,DWORD PTR [rbp-32]=0xCE037DC0 > 0xffffffff810fce4b 83E201 and edx,0x1 > 0xffffffff810fce4e 75F6 jne > smp_call_function_single+0xd6 (0xffffffff810fce46) (up) > <<<<< > 0xffffffff810fce50 EBC3 jmp > smp_call_function_single+0xa5 (0xffffffff810fce15) (up) > 0xffffffff810fce52 8B05E08EC700 mov eax,[oops_in_progress]=0x0 > 0xffffffff810fce58 85C0 test eax,eax > 0xffffffff810fce5a 7585 jne > smp_call_function_single+0x71 (0xffffffff810fcde1) (up) > 0xffffffff810fce5c 803D8E0C9D0000 cmp [__warned.20610]=0x00,0x0 > 0xffffffff810fce63 0F8578FFFFFF jne > smp_call_function_single+0x71 (0xffffffff810fcde1) (up) > 0xffffffff810fce69 BE24010000 mov esi,0x124 > 0xffffffff810fce6e 48C7C796B08C81 mov rdi,0xffffffff818cb096 > 0xffffffff810fce75 894DBC mov DWORD PTR [rbp-68]=0x0,ecx > 0xffffffff810fce78 488955C0 mov QWORD PTR > [rbp-64]=0xFFFFFFFFFFFFFF10,rdx > 0xffffffff810fce7c E8FF21F8FF call warn_slowpath_null > 0xffffffff810fce81 C605690C9D0001 mov [__warned.20610]=0x00,0x1 > 0xffffffff810fce88 8B4DBC mov ecx,DWORD PTR [rbp-68]=0x0 > 0xffffffff810fce8b 488B55C0 mov rdx,QWORD PTR > [rbp-64]=0xFFFFFFFFFFFFFF10 > 0xffffffff810fce8f E94DFFFFFF jmp > smp_call_function_single+0x71 (0xffffffff810fcde1) (up) > 0xffffffff810fce94 E8A71EF8FF call __stack_chk_fail > 0xffffffff810fce99 0F1F8000000000 nop DWORD PTR [rax]=0x0 > (2)> g > > > The stack backtrace when the bug occurs is: > > smp_call_function_single+0xd8 > unmap_page_range+0x613 > flush_tlb_func+0x0 > smp_call_function_many+215 > native_flush_tlb_others+0x118 > flush_tlb_mm_range+0x61 > tlb_flush_mmu_tlbonly+0x6b > tlb_finish_mmu+0x14 > unmap_region+0xe2 > vma_rb_erase+0x10f > do_unmap+0x217 > vm_unmap+0x41 > SyS_munmap+0x22 > entry_SYSCALL_64_fastpath+0x12 > > I traced through this code a bunch of times in just normal operations > without triggering the bug to get a feel for what it normally sees in > EDX and it looks like someone has coded a looping function that always > has EDX=0 in every case I saw in the except for when this bug occurs. > > So the exact C code this maps fro objdump of kernel/smp.o is: > > 469: e8 62 fe ff ff callq 2d0 > 46e: 8b 55 e0 mov -0x20(%rbp),%edx > * previous function call. For multi-cpu calls its even more interesting > * as we'll have to ensure no other cpu is observing our csd. > */ > static void csd_lock_wait(struct call_single_data *csd) > { > while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK) > 471: 83 e2 01 and $0x1,%edx > 474: 74 cf je 445 > 476: f3 90 pause > <<<<<<<<<< > 478: 8b 55 e0 mov -0x20(%rbp),%edx > 47b: 83 e2 01 and $0x1,%edx > 47e: 75 f6 jne 476 > <<<<<<<<<<< > 480: eb c3 jmp 445 > * Can deadlock when called with interrupts disabled. > * We allow cpu's that are not yet online though, as no one else can > * send smp call function interrupt to this cpu and as such deadlocks > * can't happen. > */ > WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() > > Each time this bug occurs csd->flags is always set to a value of 3 and > never changes. When the system is just running normally, it seems to > be 0 the rest of the time. Setting EDX=0 from the debugger console > clears the hang condition and the system seems to recover except the > system reports this error from the console when you attempt to load > programs, indicating the ability of the system to load shared objects > is fritzed. > > # > # ls -l > /lib64/libc.so.6 version GLI not found << this error and no shared > objects will load > # > # > > Jeff > I am running down a trace of the MSR values for swapgs. Looks like it got nested somewhere down in the entry_64 code. If so, then this is just a symptom and not the sickness. Jeff