From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dashi DS1 Cao <caods1@lenovo.com>
Subject: Kernel crashes in __migration_entry_wait
Date: Sun, 13 Nov 2016 12:39:38 +0000
Message-ID: <23B7B563BA4E9446B962B142C86EF24A3DCF87@CNMAILEX03.lenovo.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8BIT
Return-path: <linux-numa-owner@vger.kernel.org>
Content-Language: zh-CN
Sender: linux-numa-owner@vger.kernel.org
List-ID: <linux-numa.vger.kernel.org>
To: "'linux-x86_64@vger.kernel.org'" <linux-x86_64@vger.kernel.org>, "'linux-numa@vger.kernel.org'" <linux-numa@vger.kernel.org>

Hi all,
An x86_64 server crashes and dumps core every once in a while, with the following signature:

PID: 32577  TASK: ffff882d4351d080  CPU: 22  COMMAND: "vertica"
 #0 [ffff8812a2bdfba8] machine_kexec at ffffffff81051beb
 #1 [ffff8812a2bdfc08] crash_kexec at ffffffff810f2542
 #2 [ffff8812a2bdfcd8] oops_end at ffffffff8163e1a8
 #3 [ffff8812a2bdfd00] die at ffffffff8101859b
 #4 [ffff8812a2bdfd30] do_general_protection at ffffffff8163da9e
 #5 [ffff8812a2bdfd60] general_protection at ffffffff8163d3a8
    [exception RIP: __migration_entry_wait+148]
    RIP: ffffffff811c5f64  RSP: ffff8812a2bdfe18  RFLAGS: 00010203
    RAX: 01ffffffffffffff  RBX: ffffea0000000030  RCX: 0000000000000000
    RDX: 1e001897de001880  RSI: ffffea0000000030  RDI: f000c4bef000c4be
    RBP: ffff8812a2bdfe28   R8: 00003ffffffff000   R9: 00000000000000a9
    R10: 0000000000000000  R11: f000c4bef000c4be  R12: 1e000297de001880
    R13: ffff88195992d440  R14: ffff883be5ea0000  R15: 0000000000000080
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #6 [ffff8812a2bdfe10] __migration_entry_wait at ffffffff811c5eea
 #7 [ffff8812a2bdfe30] migration_entry_wait at ffffffff811c62b3
 #8 [ffff8812a2bdfe40] handle_mm_fault at ffffffff81197a12
 #9 [ffff8812a2bdfed0] __do_page_fault at ffffffff81640e22
#10 [ffff8812a2bdff28] do_page_fault at ffffffff81641113
#11 [ffff8812a2bdff50] page_fault at ffffffff8163d408
    RIP: 00000000022d80ba  RSP: 00007f2171bf7990  RFLAGS: 00010206
    RAX: 0000000000002000  RBX: 00007f1a2521fac0  RCX: 00007f1a2521fac0
    RDX: 0000000000000000  RSI: 00000000b9b0b802  RDI: 00000000000039d8
    RBP: 00007f2171bf79a0   R8: 0000000000000000   R9: 000000000de857de
    R10: 0000000000000000  R11: 00007f1a2521fac0  R12: 00007f1bad9215f0
    R13: 00007f21540c4710  R14: 00007f1e09b40d70  R15: 00007f2154040d78
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

      KERNEL: vmlinux
    DUMPFILE: 127.0.0.1-2016-10-03-09:59:36/vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Mon Oct  3 10:13:22 2016
      UPTIME: 4 days, 17:04:52
LOAD AVERAGE: 0.49, 0.26, 0.24
       TASKS: 657
    NODENAME: node04-priv
     RELEASE: 3.10.0-327.el7.x86_64
     VERSION: #1 SMP Thu Nov 19 22:10:57 UTC 2015
     MACHINE: x86_64  (2600 Mhz)
      MEMORY: 240 GB
       PANIC: "general protection fault: 0000 [#1] SMP "
         PID: 32577
     COMMAND: "vertica"
        TASK: ffff882d4351d080  [THREAD_INFO: ffff8812a2bdc000]
         CPU: 22
       STATE: TASK_RUNNING (PANIC)
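
A side note on the register dump: RDI and R11 both hold f000c4bef000c4be, which is not a canonical address if the kernel uses 48-bit virtual addressing (an assumption here, though it is the usual case for this kernel generation). Dereferencing a non-canonical address raises a general protection fault rather than a page fault, which would match the trace. I have not checked the disassembly, so whether the faulting instruction really dereferences RDI is also an assumption. The helper name is_canonical below is made up for illustration. A minimal sketch of the check:

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if addr is a canonical x86_64 virtual address.

    With va_bits-bit virtual addressing, bits 63..(va_bits - 1) must be
    all zeros or all ones. va_bits=48 is an assumption about this system.
    """
    top = addr >> (va_bits - 1)               # bits 63..47 when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits == 48
    return top == 0 or top == all_ones


# Values copied from the exception frame in the trace above.
regs = {
    "RDI": 0xF000C4BEF000C4BE,
    "R11": 0xF000C4BEF000C4BE,
    "RBX": 0xFFFFEA0000000030,
    "R13": 0xFFFF88195992D440,
}

for name, value in regs.items():
    state = "canonical" if is_canonical(value) else "non-canonical"
    print(f"{name}: {value:#018x} {state}")
```

Under that assumption, RDI and R11 are reported as non-canonical while RBX and R13 are canonical kernel addresses.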

This looks like a kernel bug. I'm not sure whether it has already been identified and fixed, but I could not find any report of it on the web. The customer was advised to disable NUMA balancing as a workaround, and I'm waiting for the latest results from them.
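
To confirm on the customer's machine that the workaround is actually in effect, something like the following could be used. It is only a sketch: it assumes the kernel exposes the setting as /proc/sys/kernel/numa_balancing (the kernel.numa_balancing sysctl), and the function name numa_balancing_enabled is made up for illustration.

```python
from pathlib import Path


def numa_balancing_enabled(path: str = "/proc/sys/kernel/numa_balancing"):
    """Return True or False for the sysctl value, or None if the file is absent.

    The default path is an assumption; kernels built without automatic
    NUMA balancing do not provide this file.
    """
    p = Path(path)
    if not p.exists():
        return None
    # The file holds a single integer; "0" means balancing is disabled.
    return p.read_text().strip() != "0"


if __name__ == "__main__":
    print("numa_balancing enabled:", numa_balancing_enabled())
```

A result of False would mean automatic NUMA balancing is off, so any further crash with the same signature would point away from it as the trigger.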

Thank you all!
Dashi Cao
181 0102 1741