From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752179AbdKXHdA (ORCPT ); Fri, 24 Nov 2017 02:33:00 -0500
Received: from cmccmta3.chinamobile.com ([221.176.66.81]:30130 "EHLO cmccmta3.chinamobile.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751577AbdKXHc6 (ORCPT ); Fri, 24 Nov 2017 02:32:58 -0500
X-Greylist: delayed 545 seconds by postgrey-1.27 at vger.kernel.org; Fri, 24 Nov 2017 02:32:58 EST
X-RM-TRANSID: 2eec5a17c9032f9-4f7bf
X-RM-TagInfo: emlType=0
X-RM-SPAM-FLAG: 00000000
X-RM-TRANSID: 2eea5a17c905767-f614a
Date: Fri, 24 Nov 2017 02:23:44 -0500
From: baiyaowei
To: tj@kernel.org, jiangshanlai@gmail.com
Cc: akpm@linux-foundation.org, baiyaowei@cmss.chinamobile.com, linux-kernel@vger.kernel.org
Subject: NULL pointer dereference in process_one_work
Message-ID: <20171124072344.GA313@byw>
Reply-To: baiyaowei@cmss.chinamobile.com
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

Hi tj and jiangshan,

I built a ceph storage pool to run some benchmarks with a 3.10 kernel.
Occasionally, when the CPUs' load is very high, some nodes crash with the
message below.

[292273.612014] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[292273.612057] IP: [] process_one_work+0x31/0x470
[292273.612087] PGD 0
[292273.612099] Oops: 0000 [#1] SMP
[292273.612117] Modules linked in: rbd(OE) bcache(OE) ip_vs xfs xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bonding intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mxm_wmi iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf pcspkr ipmi_ssif mei_me sg lpc_ich mei sb_edac ipmi_si mfd_core edac_core ipmi_msghandler shpchp wmi acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper
[292273.612495] crct10dif_pclmul crct10dif_common ttm crc32c_intel drm ahci nvme bnx2x libahci i2c_core libata mdio libcrc32c megaraid_sas ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
[292273.612580] CPU: 16 PID: 353223 Comm: kworker/16:2 Tainted: G OE ------------ 3.10.0-327.el7.x86_64 #1
[292273.612620] Hardware name: Dell Inc. PowerEdge R730xd/0WCJNT, BIOS 2.4.3 01/17/2017
[292273.612655] task: ffff8801f55e6780 ti: ffff882a199b0000 task.ti: ffff882a199b0000
[292273.612685] RIP: 0010:[] [] process_one_work+0x31/0x470
[292273.612721] RSP: 0018:ffff882a199b3e28 EFLAGS: 00010046
[292273.612743] RAX: 0000000000000000 RBX: ffff88088b273028 RCX: ffff882a199b3fd8
[292273.612771] RDX: 0000000000000000 RSI: ffff88088b273028 RDI: ffff88088b273000
[292273.612799] RBP: ffff882a199b3e60 R08: 0000000000000000 R09: 0000000000000770
[292273.612827] R10: ffff8822a3bb1f80 R11: ffff8822a3bb1f80 R12: ffff88088b273000
[292273.612855] R13: ffff881fff313fc0 R14: 0000000000000000 R15: ffff881fff313fc0
[292273.612883] FS: 0000000000000000(0000) GS:ffff881fff300000(0000) knlGS:0000000000000000
[292273.612914] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[292273.612937] CR2: 00000000000000b8 CR3: 000000000194a000 CR4: 00000000003407e0
[292273.612965] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[292273.612994] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[292273.613021] Stack:
[292273.613031] 00000000ff313fd8 0000000000000000 ffff881fff313fd8 000188088b273030
[292273.613069] ffff8801f55e6780 ffff88088b273000 ffff881fff313fc0 ffff882a199b3ec0
[292273.613108] ffffffff8109e4cc ffff882a199b3fd8 ffff882a199b3fd8 ffff8801f55e6780
[292273.613146] Call Trace:
[292273.613160] [] worker_thread+0x21c/0x400
[292273.613185] [] ? rescuer_thread+0x400/0x400
[292273.613212] [] kthread+0xcf/0xe0
[292273.613234] [] ? kthread_create_on_node+0x140/0x140
[292273.613263] [] ret_from_fork+0x58/0x90
[292273.613287] [] ? kthread_create_on_node+0x140/0x140
[292273.614303] Code: 48 89 e5 41 57 41 56 45 31 f6 41 55 41 54 49 89 fc 53 48 89 f3 48 83 ec 10 48 8b 06 4c 8b 6f 48 48 89 c2 30 d2 a8 04 4c 0f 45 f2 <49> 8b 46 08 44 8b b8 00 01 00 00 41 c1 ef 05 44 89 f8 83 e0 01
[292273.617971] RIP [] process_one_work+0x31/0x470
[292273.620011] RSP
[292273.621940] CR2: 0000000000000008

Some crash messages:

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/3.10.0-327.el7.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Wed Oct 18 05:21:14 2017
      UPTIME: 3 days, 09:07:25
LOAD AVERAGE: 221.70, 222.22, 224.96
       TASKS: 3115
    NODENAME: node121
     RELEASE: 3.10.0-327.el7.x86_64
     VERSION: #1 SMP Thu Nov 19 22:10:57 UTC 2015
     MACHINE: x86_64  (2099 Mhz)
      MEMORY: 255.9 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000008"

crash> bt
PID: 353223  TASK: ffff8801f55e6780  CPU: 16  COMMAND: "kworker/16:2"
 #0 [ffff882a199b3af0] machine_kexec at ffffffff81051beb
 #1 [ffff882a199b3b50] crash_kexec at ffffffff810f2542
 #2 [ffff882a199b3c20] oops_end at ffffffff8163e1a8
 #3 [ffff882a199b3c48] no_context at ffffffff8162e2b8
 #4 [ffff882a199b3c98] __bad_area_nosemaphore at ffffffff8162e34e
 #5 [ffff882a199b3ce0] bad_area_nosemaphore at ffffffff8162e4b8
 #6 [ffff882a199b3cf0] __do_page_fault at ffffffff81640fce
 #7 [ffff882a199b3d48] do_page_fault at ffffffff81641113
 #8 [ffff882a199b3d70] page_fault at ffffffff8163d408
    [exception RIP: process_one_work+49]
    RIP: ffffffff8109d4b1  RSP: ffff882a199b3e28  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: ffff88088b273028  RCX: ffff882a199b3fd8
    RDX: 0000000000000000  RSI: ffff88088b273028  RDI: ffff88088b273000
    RBP: ffff882a199b3e60   R8: 0000000000000000   R9: 0000000000000770
    R10: ffff8822a3bb1f80  R11: ffff8822a3bb1f80  R12: ffff88088b273000
    R13: ffff881fff313fc0  R14: 0000000000000000  R15: ffff881fff313fc0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff882a199b3e68] worker_thread at ffffffff8109e4cc
#10 [ffff882a199b3ec8] kthread at ffffffff810a5aef
#11 [ffff882a199b3f50] ret_from_fork at ffffffff81645858

crash> dis process_one_work
0xffffffff8109d480 :  nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff8109d485 :  push   %rbp
0xffffffff8109d486 :  mov    %rsp,%rbp
0xffffffff8109d489 :  push   %r15
0xffffffff8109d48b :  push   %r14
0xffffffff8109d48d :  xor    %r14d,%r14d
0xffffffff8109d490 :  push   %r13
0xffffffff8109d492 :  push   %r12
0xffffffff8109d494 :  mov    %rdi,%r12
0xffffffff8109d497 :  push   %rbx
0xffffffff8109d498 :  mov    %rsi,%rbx
0xffffffff8109d49b :  sub    $0x10,%rsp
0xffffffff8109d49f :  mov    (%rsi),%rax
0xffffffff8109d4a2 :  mov    0x48(%rdi),%r13
0xffffffff8109d4a6 :  mov    %rax,%rdx
0xffffffff8109d4a9 :  xor    %dl,%dl
0xffffffff8109d4ab :  test   $0x4,%al
0xffffffff8109d4ad :  cmovne %rdx,%r14
0xffffffff8109d4b1 :  mov    0x8(%r14),%rax
0xffffffff8109d4b5 :  mov    0x100(%rax),%r15d
0xffffffff8109d4bc :  shr    $0x5,%r15d
0xffffffff8109d4c0 :  mov    %r15d,%eax
0xffffffff8109d4c3 :  and    $0x1,%eax
0xffffffff8109d4c6 :  testb  $0x80,0x58(%rdi)
0xffffffff8109d4ca :  mov    %al,-0x30(%rbp)
0xffffffff8109d4cd :  jne    0xffffffff8109d4da
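The faulting instruction, process_one_work+0x31, is the "mov 0x8(%r14),%rax"
above, and %r14 is 0 in the exception frame. If I read the 3.10-era source
correctly, this part of the function corresponds roughly to the following
(a sketch from memory, not verbatim, just to show where the offset 0x8 comes
from):

static struct pool_workqueue *get_work_pwq(struct work_struct *work)
{
	unsigned long data = atomic_long_read(&work->data);

	if (data & WORK_STRUCT_PWQ)		/* the "test $0x4,%al" above */
		return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
	else
		return NULL;			/* %r14 is left as 0 */
}

static void process_one_work(struct worker *worker, struct work_struct *work)
{
	struct pool_workqueue *pwq = get_work_pwq(work);
	struct worker_pool *pool = worker->pool;
	bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;	/* pwq->wq is the load at offset 0x8 */
	...
}

RAX (which holds work->data here, loaded by the "mov (%rsi),%rax") and R14 are
both 0 in the exception frame, so WORK_STRUCT_PWQ is clear, get_work_pwq()
returns NULL, and the pwq->wq load faults at address 0000000000000008, which
matches the reported fault address. The dumps below show why work->data is 0.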
crash> *work_struct ffff88088b273028
struct work_struct {
  data = {
    counter = 0
  },
  entry = {
    next = 0xffff88088b273030,
    prev = 0xffff88088b273030
  },
  func = 0xffff8801f55e6780
}

crash> *worker ffff88088b273000
struct worker {
  {
    entry = {
      next = 0x0,
      prev = 0x0
    },
    hentry = {
      next = 0x0,
      pprev = 0x0
    }
  },
  current_work = 0x0,
  current_func = 0x0,
  current_pwq = 0x0,
  desc_valid = false,
  scheduled = {
    next = 0xffff88088b273030,
    prev = 0xffff88088b273030
  },
  task = 0xffff8801f55e6780,
  pool = 0xffff881fff313fc0,
  last_active = 4586712874,
  flags = 1,
  id = 2,
  desc = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
  rescue_wq = 0x0
}
crash>

It looks like the work parameter mistakenly points into the worker parameter
itself. I noticed one similar case reported before, but it is not exactly the
same:

https://lists.gt.net/linux/kernel/2022989

I wonder how this situation could happen. When process_one_work is called, the
work being processed should be on worker_pool->worklist, but that is not the
case here -- the worklist is empty:

crash> *worker_pool 0xffff881fff313fc0
struct worker_pool {
  lock = {
    {
      rlock = {
        raw_lock = {
          {
            head_tail = 4142331618,
            tickets = {
              head = 63202,
              tail = 63206
            }
          }
        }
      }
    }
  },
  cpu = 16,
  node = 0,
  id = 32,
  flags = 0,
  worklist = {
    next = 0xffff881fff313fd8,
    prev = 0xffff881fff313fd8
  },
  nr_workers = 3,
  nr_idle = 2,
  idle_list = {
    next = 0xffff880245a18280,
    prev = 0xffff88078d3c5780
  },
  ...

Maybe we should add an if (get_work_pwq(work)) check in process_one_work, or
add a BUG_ON like this:

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index dde6298..82a92e0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2012,11 +2012,15 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 __releases(&pool->lock)
 __acquires(&pool->lock)
 {
-	struct pool_workqueue *pwq = get_work_pwq(work);
+	struct pool_workqueue *pwq;
 	struct worker_pool *pool = worker->pool;
-	bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
+	bool cpu_intensive;
 	int work_color;
 	struct worker *collision;
+
+	BUG_ON(!(pwq = get_work_pwq(work)));
+	cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
+
 #ifdef CONFIG_LOCKDEP

I would really appreciate any ideas, and please let me know if you want any
more information from the crashed system.

Thanks,
Yaowei Bai