[BUG REPORT] Soft Lockup in smp_call_function_single+0xD8

* [BUG REPORT] Soft Lockup in smp_call_function_single+0xD8
@ 2016-01-25  0:49 Jeff Merkey
  2016-01-25  1:43 ` Jeff Merkey
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Merkey @ 2016-01-25  0:49 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Andrew Morton, Vlastimil Babka,
	Peter Zijlstra (Intel),
	Mel Gorman

If I single step with either kgdb, kgdb, or mdb kernel debuggers over
a sysret instruction anywhere in the OS,  the system hard hangs in
smp_call_function_single after the debugger releases the system and it
resumes normal operation.    The specific place the kernel hangs is in
the loop below.   The softlockup detector will periodically detect
this condition when it occurs, but not always, most of the time the
system is just hung and unresponsive.

(2)> u smp_call_function_single+d8
<<<< hard hang in this loop with EDX=3
0xffffffff810fce48 8B55E0          mov    edx,DWORD PTR [rbp-32]=0xCE037DC0
0xffffffff810fce4b 83E201          and    edx,0x1
0xffffffff810fce4e 75F6            jne
smp_call_function_single+0xd6 (0xffffffff810fce46) (up)
<<<<<
0xffffffff810fce50 EBC3            jmp
smp_call_function_single+0xa5 (0xffffffff810fce15) (up)
0xffffffff810fce52 8B05E08EC700    mov    eax,[oops_in_progress]=0x0
0xffffffff810fce58 85C0            test   eax,eax
0xffffffff810fce5a 7585            jne
smp_call_function_single+0x71 (0xffffffff810fcde1) (up)
0xffffffff810fce5c 803D8E0C9D0000  cmp    [__warned.20610]=0x00,0x0
0xffffffff810fce63 0F8578FFFFFF    jne
smp_call_function_single+0x71 (0xffffffff810fcde1) (up)
0xffffffff810fce69 BE24010000      mov    esi,0x124
0xffffffff810fce6e 48C7C796B08C81  mov    rdi,0xffffffff818cb096
0xffffffff810fce75 894DBC          mov    DWORD PTR [rbp-68]=0x0,ecx
0xffffffff810fce78 488955C0        mov    QWORD PTR
[rbp-64]=0xFFFFFFFFFFFFFF10,rdx
0xffffffff810fce7c E8FF21F8FF      call   warn_slowpath_null
0xffffffff810fce81 C605690C9D0001  mov    [__warned.20610]=0x00,0x1
0xffffffff810fce88 8B4DBC          mov    ecx,DWORD PTR [rbp-68]=0x0
0xffffffff810fce8b 488B55C0        mov    rdx,QWORD PTR
[rbp-64]=0xFFFFFFFFFFFFFF10
0xffffffff810fce8f E94DFFFFFF      jmp
smp_call_function_single+0x71 (0xffffffff810fcde1) (up)
0xffffffff810fce94 E8A71EF8FF      call   __stack_chk_fail
0xffffffff810fce99 0F1F8000000000  nop    DWORD PTR [rax]=0x0
(2)> g

The stack backtrace when the bug occurs is:

smp_call_function_single+0xd8
unmap_page_range+0x613
flush_tlb_func+0x0
smp_call_function_many+215
native_flush_tlb_others+0x118
flush_tlb_mm_range+0x61
tlb_flush_mmu_tlbonly+0x6b
tlb_finish_mmu+0x14
unmap_region+0xe2
vma_rb_erase+0x10f
do_unmap+0x217
vm_unmap+0x41
SyS_munmap+0x22
entry_SYSCALL_64_fastpath+0x12

I traced through this code a bunch of times in just normal operations
without triggering the bug to get a feel for what it normally sees in
EDX and it looks like someone has coded a looping function that always
has EDX=0 in every case I saw in the except for when this bug occurs.

So the exact C code this maps fro objdump of kernel/smp.o is:

 469:	e8 62 fe ff ff       	callq  2d0 <generic_exec_single>
 46e:	8b 55 e0             	mov    -0x20(%rbp),%edx
 * previous function call. For multi-cpu calls its even more interesting
 * as we'll have to ensure no other cpu is observing our csd.
 */
static void csd_lock_wait(struct call_single_data *csd)
{
	while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
 471:	83 e2 01             	and    $0x1,%edx
 474:	74 cf                	je     445 <smp_call_function_single+0xa5>
 476:	f3 90                	pause
<<<<<<<<<<
 478:	8b 55 e0             	mov    -0x20(%rbp),%edx
 47b:	83 e2 01             	and    $0x1,%edx
 47e:	75 f6                	jne    476 <smp_call_function_single+0xd6>
<<<<<<<<<<<
 480:	eb c3                	jmp    445 <smp_call_function_single+0xa5>
	 * Can deadlock when called with interrupts disabled.
	 * We allow cpu's that are not yet online though, as no one else can
	 * send smp call function interrupt to this cpu and as such deadlocks
	 * can't happen.
	 */
	WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()

Each time this bug occurs csd->flags is always set to a value of 3 and
never changes.  When the system is just running normally, it seems to
be 0 the rest of the time.   Setting EDX=0 from the debugger console
clears the hang condition and the system seems to recover except the
system reports this error from the console when you attempt to load
programs, indicating the ability of the system to load shared objects
is fritzed.

#
# ls -l
/lib64/libc.so.6 version GLI not found   << this error and no shared
objects will load
#
#

Jeff

^ permalink raw reply	[flat|nested] 12+ messages in thread