From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754367AbcAYAtr (ORCPT ); Sun, 24 Jan 2016 19:49:47 -0500 Received: from mail-io0-f179.google.com ([209.85.223.179]:36262 "EHLO mail-io0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752353AbcAYAtn (ORCPT ); Sun, 24 Jan 2016 19:49:43 -0500 MIME-Version: 1.0 Date: Sun, 24 Jan 2016 17:49:43 -0700 Message-ID: Subject: [BUG REPORT] Soft Lockup in smp_call_function_single+0xD8 From: Jeff Merkey To: LKML Cc: Ingo Molnar , Andrew Morton , Vlastimil Babka , "Peter Zijlstra (Intel)" , Mel Gorman Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org If I single step with either kgdb, kgdb, or mdb kernel debuggers over a sysret instruction anywhere in the OS, the system hard hangs in smp_call_function_single after the debugger releases the system and it resumes normal operation. The specific place the kernel hangs is in the loop below. The softlockup detector will periodically detect this condition when it occurs, but not always, most of the time the system is just hung and unresponsive. (2)> u smp_call_function_single+d8 <<<< hard hang in this loop with EDX=3 0xffffffff810fce48 8B55E0 mov edx,DWORD PTR [rbp-32]=0xCE037DC0 0xffffffff810fce4b 83E201 and edx,0x1 0xffffffff810fce4e 75F6 jne smp_call_function_single+0xd6 (0xffffffff810fce46) (up) <<<<< 0xffffffff810fce50 EBC3 jmp smp_call_function_single+0xa5 (0xffffffff810fce15) (up) 0xffffffff810fce52 8B05E08EC700 mov eax,[oops_in_progress]=0x0 0xffffffff810fce58 85C0 test eax,eax 0xffffffff810fce5a 7585 jne smp_call_function_single+0x71 (0xffffffff810fcde1) (up) 0xffffffff810fce5c 803D8E0C9D0000 cmp [__warned.20610]=0x00,0x0 0xffffffff810fce63 0F8578FFFFFF jne smp_call_function_single+0x71 (0xffffffff810fcde1) (up) 0xffffffff810fce69 BE24010000 mov esi,0x124 0xffffffff810fce6e 48C7C796B08C81 mov rdi,0xffffffff818cb096 0xffffffff810fce75 894DBC mov DWORD PTR [rbp-68]=0x0,ecx 0xffffffff810fce78 488955C0 mov QWORD PTR [rbp-64]=0xFFFFFFFFFFFFFF10,rdx 0xffffffff810fce7c E8FF21F8FF call warn_slowpath_null 0xffffffff810fce81 C605690C9D0001 mov [__warned.20610]=0x00,0x1 0xffffffff810fce88 8B4DBC mov ecx,DWORD PTR [rbp-68]=0x0 0xffffffff810fce8b 488B55C0 mov rdx,QWORD PTR [rbp-64]=0xFFFFFFFFFFFFFF10 0xffffffff810fce8f E94DFFFFFF jmp smp_call_function_single+0x71 (0xffffffff810fcde1) (up) 0xffffffff810fce94 E8A71EF8FF call __stack_chk_fail 0xffffffff810fce99 0F1F8000000000 nop DWORD PTR [rax]=0x0 (2)> g The stack backtrace when the bug occurs is: smp_call_function_single+0xd8 unmap_page_range+0x613 flush_tlb_func+0x0 smp_call_function_many+215 native_flush_tlb_others+0x118 flush_tlb_mm_range+0x61 tlb_flush_mmu_tlbonly+0x6b tlb_finish_mmu+0x14 unmap_region+0xe2 vma_rb_erase+0x10f do_unmap+0x217 vm_unmap+0x41 SyS_munmap+0x22 entry_SYSCALL_64_fastpath+0x12 I traced through this code a bunch of times in just normal operations without triggering the bug to get a feel for what it normally sees in EDX and it looks like someone has coded a looping function that always has EDX=0 in every case I saw in the except for when this bug occurs. So the exact C code this maps fro objdump of kernel/smp.o is: 469: e8 62 fe ff ff callq 2d0 46e: 8b 55 e0 mov -0x20(%rbp),%edx * previous function call. For multi-cpu calls its even more interesting * as we'll have to ensure no other cpu is observing our csd. */ static void csd_lock_wait(struct call_single_data *csd) { while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK) 471: 83 e2 01 and $0x1,%edx 474: 74 cf je 445 476: f3 90 pause <<<<<<<<<< 478: 8b 55 e0 mov -0x20(%rbp),%edx 47b: 83 e2 01 and $0x1,%edx 47e: 75 f6 jne 476 <<<<<<<<<<< 480: eb c3 jmp 445 * Can deadlock when called with interrupts disabled. * We allow cpu's that are not yet online though, as no one else can * send smp call function interrupt to this cpu and as such deadlocks * can't happen. */ WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() Each time this bug occurs csd->flags is always set to a value of 3 and never changes. When the system is just running normally, it seems to be 0 the rest of the time. Setting EDX=0 from the debugger console clears the hang condition and the system seems to recover except the system reports this error from the console when you attempt to load programs, indicating the ability of the system to load shared objects is fritzed. # # ls -l /lib64/libc.so.6 version GLI not found << this error and no shared objects will load # # Jeff