* MAX_LOCKDEP_ENTRIES too low (called from ioc_release_fn) @ 2012-10-18 1:53 Dave Jones 2012-10-18 5:53 ` Jens Axboe 0 siblings, 1 reply; 15+ messages in thread From: Dave Jones @ 2012-10-18 1:53 UTC (permalink / raw) To: axboe; +Cc: Linux Kernel Triggered while fuzz testing.. BUG: MAX_LOCKDEP_ENTRIES too low! turning off the locking correctness validator. Pid: 22788, comm: kworker/2:1 Not tainted 3.7.0-rc1+ #34 Call Trace: [<ffffffff810decdd>] add_lock_to_list.isra.29.constprop.45+0xdd/0xf0 [<ffffffff810e2871>] __lock_acquire+0x1121/0x1ba0 [<ffffffff810e3a12>] lock_acquire+0xa2/0x220 [<ffffffff8117bad2>] ? free_one_page+0x32/0x450 [<ffffffff816c5a59>] ? sub_preempt_count+0x79/0xd0 [<ffffffff816c0800>] _raw_spin_lock+0x40/0x80 [<ffffffff8117bad2>] ? free_one_page+0x32/0x450 [<ffffffff8117bad2>] free_one_page+0x32/0x450 [<ffffffff8117bf41>] ? __free_pages_ok.part.58+0x51/0x110 [<ffffffff8117bf9c>] __free_pages_ok.part.58+0xac/0x110 [<ffffffff8117cd73>] __free_pages+0x73/0x90 [<ffffffff811cb4f3>] __free_slab+0xd3/0x1b0 [<ffffffff811cb609>] discard_slab+0x39/0x50 [<ffffffff816b77db>] __slab_free+0x378/0x3a3 [<ffffffff81341289>] ? ioc_release_fn+0x99/0xe0 [<ffffffff81341289>] ? ioc_release_fn+0x99/0xe0 [<ffffffff811cd4e2>] kmem_cache_free+0x2f2/0x320 [<ffffffff816c5a59>] ? sub_preempt_count+0x79/0xd0 [<ffffffff81341289>] ioc_release_fn+0x99/0xe0 [<ffffffff81095a37>] process_one_work+0x207/0x780 [<ffffffff810959c7>] ? process_one_work+0x197/0x780 [<ffffffff813411f0>] ? get_io_context+0x20/0x20 [<ffffffff8109638e>] worker_thread+0x15e/0x440 [<ffffffff81096230>] ? rescuer_thread+0x240/0x240 [<ffffffff8109d0cd>] kthread+0xed/0x100 [<ffffffff810de02e>] ? put_lock_stats.isra.25+0xe/0x40 [<ffffffff8109cfe0>] ? kthread_create_on_node+0x160/0x160 [<ffffffff816c9dac>] ret_from_fork+0x7c/0xb0 [<ffffffff8109cfe0>] ? kthread_create_on_node+0x160/0x160 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: MAX_LOCKDEP_ENTRIES too low (called from ioc_release_fn) 2012-10-18 1:53 MAX_LOCKDEP_ENTRIES too low (called from ioc_release_fn) Dave Jones @ 2012-10-18 5:53 ` Jens Axboe 2012-10-19 5:21 ` Dave Jones 0 siblings, 1 reply; 15+ messages in thread From: Jens Axboe @ 2012-10-18 5:53 UTC (permalink / raw) To: Dave Jones, Linux Kernel, Peter Zijlstra, Ingo Molnar On 2012-10-18 03:53, Dave Jones wrote: > Triggered while fuzz testing.. > > > BUG: MAX_LOCKDEP_ENTRIES too low! > turning off the locking correctness validator. > Pid: 22788, comm: kworker/2:1 Not tainted 3.7.0-rc1+ #34 > Call Trace: > [<ffffffff810decdd>] add_lock_to_list.isra.29.constprop.45+0xdd/0xf0 > [<ffffffff810e2871>] __lock_acquire+0x1121/0x1ba0 > [<ffffffff810e3a12>] lock_acquire+0xa2/0x220 > [<ffffffff8117bad2>] ? free_one_page+0x32/0x450 > [<ffffffff816c5a59>] ? sub_preempt_count+0x79/0xd0 > [<ffffffff816c0800>] _raw_spin_lock+0x40/0x80 > [<ffffffff8117bad2>] ? free_one_page+0x32/0x450 > [<ffffffff8117bad2>] free_one_page+0x32/0x450 > [<ffffffff8117bf41>] ? __free_pages_ok.part.58+0x51/0x110 > [<ffffffff8117bf9c>] __free_pages_ok.part.58+0xac/0x110 > [<ffffffff8117cd73>] __free_pages+0x73/0x90 > [<ffffffff811cb4f3>] __free_slab+0xd3/0x1b0 > [<ffffffff811cb609>] discard_slab+0x39/0x50 > [<ffffffff816b77db>] __slab_free+0x378/0x3a3 > [<ffffffff81341289>] ? ioc_release_fn+0x99/0xe0 > [<ffffffff81341289>] ? ioc_release_fn+0x99/0xe0 > [<ffffffff811cd4e2>] kmem_cache_free+0x2f2/0x320 > [<ffffffff816c5a59>] ? sub_preempt_count+0x79/0xd0 > [<ffffffff81341289>] ioc_release_fn+0x99/0xe0 > [<ffffffff81095a37>] process_one_work+0x207/0x780 > [<ffffffff810959c7>] ? process_one_work+0x197/0x780 > [<ffffffff813411f0>] ? get_io_context+0x20/0x20 > [<ffffffff8109638e>] worker_thread+0x15e/0x440 > [<ffffffff81096230>] ? rescuer_thread+0x240/0x240 > [<ffffffff8109d0cd>] kthread+0xed/0x100 > [<ffffffff810de02e>] ? put_lock_stats.isra.25+0xe/0x40 > [<ffffffff8109cfe0>] ? 
kthread_create_on_node+0x160/0x160 > [<ffffffff816c9dac>] ret_from_fork+0x7c/0xb0 > [<ffffffff8109cfe0>] ? kthread_create_on_node+0x160/0x160 Not sure why you are CC'ing a call site, rather than the maintainers of the code. Just looks like lockdep is using too small a static value. Though it is pretty darn large... -- Jens Axboe ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: MAX_LOCKDEP_ENTRIES too low (called from ioc_release_fn) 2012-10-18 5:53 ` Jens Axboe @ 2012-10-19 5:21 ` Dave Jones 2012-10-19 12:49 ` Peter Zijlstra 0 siblings, 1 reply; 15+ messages in thread From: Dave Jones @ 2012-10-19 5:21 UTC (permalink / raw) To: Jens Axboe; +Cc: Linux Kernel, Peter Zijlstra, Ingo Molnar On Thu, Oct 18, 2012 at 07:53:08AM +0200, Jens Axboe wrote: > On 2012-10-18 03:53, Dave Jones wrote: > > Triggered while fuzz testing.. > > > > > > BUG: MAX_LOCKDEP_ENTRIES too low! > > turning off the locking correctness validator. > > Pid: 22788, comm: kworker/2:1 Not tainted 3.7.0-rc1+ #34 > > Call Trace: > > [<ffffffff810decdd>] add_lock_to_list.isra.29.constprop.45+0xdd/0xf0 > > [<ffffffff810e2871>] __lock_acquire+0x1121/0x1ba0 > > [<ffffffff810e3a12>] lock_acquire+0xa2/0x220 > > [<ffffffff8117bad2>] ? free_one_page+0x32/0x450 > > [<ffffffff816c5a59>] ? sub_preempt_count+0x79/0xd0 > > [<ffffffff816c0800>] _raw_spin_lock+0x40/0x80 > > [<ffffffff8117bad2>] ? free_one_page+0x32/0x450 > > [<ffffffff8117bad2>] free_one_page+0x32/0x450 > > [<ffffffff8117bf41>] ? __free_pages_ok.part.58+0x51/0x110 > > [<ffffffff8117bf9c>] __free_pages_ok.part.58+0xac/0x110 > > [<ffffffff8117cd73>] __free_pages+0x73/0x90 > > [<ffffffff811cb4f3>] __free_slab+0xd3/0x1b0 > > [<ffffffff811cb609>] discard_slab+0x39/0x50 > > [<ffffffff816b77db>] __slab_free+0x378/0x3a3 > > [<ffffffff81341289>] ? ioc_release_fn+0x99/0xe0 > > [<ffffffff81341289>] ? ioc_release_fn+0x99/0xe0 > > [<ffffffff811cd4e2>] kmem_cache_free+0x2f2/0x320 > > [<ffffffff816c5a59>] ? sub_preempt_count+0x79/0xd0 > > [<ffffffff81341289>] ioc_release_fn+0x99/0xe0 > > [<ffffffff81095a37>] process_one_work+0x207/0x780 > > [<ffffffff810959c7>] ? process_one_work+0x197/0x780 > > [<ffffffff813411f0>] ? get_io_context+0x20/0x20 > > [<ffffffff8109638e>] worker_thread+0x15e/0x440 > > [<ffffffff81096230>] ? rescuer_thread+0x240/0x240 > > [<ffffffff8109d0cd>] kthread+0xed/0x100 > > [<ffffffff810de02e>] ? 
put_lock_stats.isra.25+0xe/0x40 > > [<ffffffff8109cfe0>] ? kthread_create_on_node+0x160/0x160 > > [<ffffffff816c9dac>] ret_from_fork+0x7c/0xb0 > > [<ffffffff8109cfe0>] ? kthread_create_on_node+0x160/0x160 > > Not sure why you are CC'ing a call site, rather than the maintainers of > the code. Just looks like lockdep is using too small a static value. > Though it is pretty darn large... You're right, it's a huge chunk of memory. It looks like I can trigger this from multiple callsites.. Another different trace below. Not sure why this suddenly got a lot worse in 3.7 Peter, Ingo ? Dave BUG: MAX_LOCKDEP_ENTRIES too low! turning off the locking correctness validator. Pid: 22350, comm: trinity-child0 Not tainted 3.7.0-rc1+ #36 Call Trace: [<ffffffff810decdd>] add_lock_to_list.isra.29.constprop.45+0xdd/0xf0 [<ffffffff810e2871>] __lock_acquire+0x1121/0x1ba0 [<ffffffff810b6f89>] ? local_clock+0x89/0xa0 [<ffffffff811b84a5>] ? __swap_duplicate+0xb5/0x190 [<ffffffff810dea58>] ? trace_hardirqs_off_caller+0x28/0xd0 [<ffffffff810e3a12>] lock_acquire+0xa2/0x220 [<ffffffff811b663d>] ? __add_to_swap_cache+0x6d/0x180 [<ffffffff816c0f79>] ? _raw_spin_lock_irq+0x29/0x90 [<ffffffff816c0fa6>] _raw_spin_lock_irq+0x56/0x90 [<ffffffff811b663d>] ? __add_to_swap_cache+0x6d/0x180 [<ffffffff811b663d>] __add_to_swap_cache+0x6d/0x180 [<ffffffff811b6c45>] read_swap_cache_async+0xb5/0x220 [<ffffffff811b6e4e>] swapin_readahead+0x9e/0xf0 [<ffffffff811a2c86>] handle_pte_fault+0x6d6/0xae0 [<ffffffff816c5ad9>] ? sub_preempt_count+0x79/0xd0 [<ffffffff8136d37e>] ? delay_tsc+0xae/0x120 [<ffffffff8136d268>] ? __const_udelay+0x28/0x30 [<ffffffff811a4919>] handle_mm_fault+0x289/0x350 [<ffffffff816c538e>] __do_page_fault+0x18e/0x530 [<ffffffff810b6f89>] ? local_clock+0x89/0xa0 [<ffffffff816b77e1>] ? __slab_free+0x32e/0x3a3 [<ffffffff8112d2e9>] ? rcu_user_exit+0xc9/0xf0 [<ffffffffa0000020>] ? 0xffffffffa000001f [<ffffffffa0000020>] ? 
0xffffffffa000001f [<ffffffff816c575b>] do_page_fault+0x2b/0x50 [<ffffffff816c1e38>] page_fault+0x28/0x30 [<ffffffffa0000020>] ? 0xffffffffa000001f [<ffffffff813885ac>] ? strncpy_from_user+0x6c/0x120 [<ffffffffa0000020>] ? 0xffffffffa000001f [<ffffffff8120e94f>] setxattr+0x6f/0x1d0 [<ffffffff816c5ad9>] ? sub_preempt_count+0x79/0xd0 [<ffffffff8137e8b5>] ? __percpu_counter_add+0x75/0xc0 [<ffffffff811e92d1>] ? __sb_start_write+0x101/0x1d0 [<ffffffff81209d94>] ? mnt_want_write+0x24/0x50 [<ffffffff81209d94>] ? mnt_want_write+0x24/0x50 [<ffffffff810b0de1>] ? get_parent_ip+0x11/0x50 [<ffffffff816c5ad9>] ? sub_preempt_count+0x79/0xd0 [<ffffffff81209d30>] ? __mnt_want_write+0x60/0xa0 [<ffffffffa0000020>] ? 0xffffffffa000001f [<ffffffff8120ecd5>] sys_setxattr+0x95/0xb0 [<ffffffff816ca108>] tracesys+0xe1/0xe6 [<ffffffffa0000020>] ? 0xffffffffa000001f ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: MAX_LOCKDEP_ENTRIES too low (called from ioc_release_fn) 2012-10-19 5:21 ` Dave Jones @ 2012-10-19 12:49 ` Peter Zijlstra 2012-10-19 19:29 ` Dave Jones 2012-10-23 19:50 ` MAX_LOCKDEP_ENTRIES too low Dave Jones 0 siblings, 2 replies; 15+ messages in thread From: Peter Zijlstra @ 2012-10-19 12:49 UTC (permalink / raw) To: Dave Jones; +Cc: Jens Axboe, Linux Kernel, Ingo Molnar On Fri, 2012-10-19 at 01:21 -0400, Dave Jones wrote: > > Not sure why you are CC'ing a call site, rather than the maintainers of > > the code. Just looks like lockdep is using too small a static value. > > Though it is pretty darn large... > > You're right, it's a huge chunk of memory. > It looks like I can trigger this from multiple callsites.. > Another different trace below. > > Not sure why this suddenly got a lot worse in 3.7 Did we add a static array of structures with locks in somewhere? Doing that is a great way of blowing up the number of lock classes and the resulting amount of lock dependency chains. From Documentation/lockdep-design.txt; it talks about overflowing MAX_LOCKDEP_KEYS, but I suppose it's a good start for overflowing the dependency entries too; more classes means more dependencies after all. --- Of course, if you do run out of lock classes, the next thing to do is to find the offending lock classes. First, the following command gives you the number of lock classes currently in use along with the maximum: grep "lock-classes" /proc/lockdep_stats This command produces the following output on a modest system: lock-classes: 748 [max: 8191] If the number allocated (748 above) increases continually over time, then there is likely a leak. The following command can be used to identify the leaking lock classes: grep "BD" /proc/lockdep Run the command and save the output, then compare against the output from a later run of this command to identify the leakers. This same output can also help you find situations where runtime lock initialization has been omitted. 
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: MAX_LOCKDEP_ENTRIES too low (called from ioc_release_fn) 2012-10-19 12:49 ` Peter Zijlstra @ 2012-10-19 19:29 ` Dave Jones 2012-10-23 19:50 ` MAX_LOCKDEP_ENTRIES too low Dave Jones 1 sibling, 0 replies; 15+ messages in thread From: Dave Jones @ 2012-10-19 19:29 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Jens Axboe, Linux Kernel, Ingo Molnar On Fri, Oct 19, 2012 at 02:49:32PM +0200, Peter Zijlstra wrote: > Of course, if you do run out of lock classes, the next thing to do is > to find the offending lock classes. First, the following command gives > you the number of lock classes currently in use along with the maximum: > > grep "lock-classes" /proc/lockdep_stats > > This command produces the following output on a modest system: > > lock-classes: 748 [max: 8191] After the BUG gets hit.. lock-classes: 1726 [max: 8191] > If the number allocated (748 above) increases continually over time, > then there is likely a leak. The following command can be used to > identify the leaking lock classes: > > grep "BD" /proc/lockdep > > Run the command and save the output, then compare against the output from > a later run of this command to identify the leakers. This same output > can also help you find situations where runtime lock initialization has > been omitted. I've not had a chance to do this, because after the BUG, lockdep turns itself off, and I've not rebooted. I'm probably not going to get to this until after the weekend. There's just a *lot* of dependencies. Here's the full output: http://codemonkey.org.uk/junk/lockdep The top few backwards deps.. 
ffffffff81c8f218 FD: 1 BD: 1201 -.-.-.: pool_lock ffffffff82ae1210 FD: 2 BD: 1200 -.-.-.: &obj_hash[i].lock ffffffff820677c1 FD: 1 BD: 1131 -.-.-.: &rt_rq->rt_runtime_lock ffffffff82066949 FD: 3 BD: 1131 -.-.-.: &cpu_base->lock ffffffff820221c0 FD: 1 BD: 1130 -.-.-.: &sig->cputimer.lock ffffffff820677b8 FD: 5 BD: 1129 -.-.-.: &rt_b->rt_runtime_lock ffffffff82067675 FD: 3 BD: 1129 ..-.-.: &rq->lock/1 ffffffff82067674 FD: 8 BD: 1128 -.-.-.: &rq->lock ffffffff8298bbd0 FD: 1 BD: 1006 -.-.-.: &(&n->list_lock)->rlock Dave ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: MAX_LOCKDEP_ENTRIES too low 2012-10-19 12:49 ` Peter Zijlstra 2012-10-19 19:29 ` Dave Jones @ 2012-10-23 19:50 ` Dave Jones 2012-10-24 20:24 ` pi futex oops in __lock_acquire Dave Jones 1 sibling, 1 reply; 15+ messages in thread From: Dave Jones @ 2012-10-23 19:50 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Linux Kernel, Ingo Molnar On Fri, Oct 19, 2012 at 02:49:32PM +0200, Peter Zijlstra wrote: > > Not sure why this suddenly got a lot worse in 3.7 > > Did we add a static array of structures with locks in somewhere? Doing > that is a great way of blowing up the number of lock classes and the > resulting amount of lock dependency chains. So I hit it again, but didn't see anything that I hadn't seen already in /proc/lockdep, so I tried doubling MAX_LOCKDEP_ENTRIES again. A while later, I got this oops. I'm unsure now if this is a genuine bug, or a side-effect of lockdep not working with such a large MAX_LOCKDEP_ENTRIES. Opinions? BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 PGD 8e72c067 PUD 34f07067 PMD 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: fuse nfnetlink ipt_ULOG binfmt_misc nfc caif_socket caif phonet can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds af_key decnet rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 lockd sunrpc bluetooth rfkill ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables kvm_intel kvm usb_debug crc32c_intel ghash_clmulni_intel microcode pcspkr i2c_i801 e1000e uinput i915 video i2c_algo_bit drm_kms_helper drm i2c_core CPU 7 Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43 Intel Corporation 2012 Client Platform/Emerald Lake 2 RIP: 0010:[<ffffffff810e185e>] [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 RSP: 0018:ffff8800803f7b28 EFLAGS: 00010046 RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 
RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000 R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002 R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018 FS: 00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000) Stack: ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0 ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001 ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282 Call Trace: [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0 [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0 [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210 [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180 [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0 [<ffffffff810b7039>] ? local_clock+0x89/0xa0 [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220 [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80 [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390 [<ffffffff810a1980>] ? update_rmtp+0x70/0x70 [<ffffffff810e99e4>] do_futex+0x394/0xa50 [<ffffffff8119ec43>] ? might_fault+0x53/0xb0 [<ffffffff810ea12d>] sys_futex+0x8d/0x190 [<ffffffff816ca288>] tracesys+0xe1/0xe6 Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 RIP [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 RSP <ffff8800803f7b28> CR2: 0000000000000018 ^ permalink raw reply [flat|nested] 15+ messages in thread
* pi futex oops in __lock_acquire 2012-10-23 19:50 ` MAX_LOCKDEP_ENTRIES too low Dave Jones @ 2012-10-24 20:24 ` Dave Jones 2012-10-25 4:44 ` Darren Hart 0 siblings, 1 reply; 15+ messages in thread From: Dave Jones @ 2012-10-24 20:24 UTC (permalink / raw) To: Linux Kernel; +Cc: Darren Hart I've been able to trigger this for the last week or so. Unclear whether this is a new bug, or my fuzzer got smarter, but I see the pi-futex code hasn't changed since the last time it found something.. > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > PGD 8e72c067 PUD 34f07067 PMD 0 > Oops: 0000 [#1] PREEMPT SMP > CPU 7 > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43 > RIP: 0010:[<ffffffff810e185e>] [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > RSP: 0018:ffff8800803f7b28 EFLAGS: 00010046 > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000 > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002 > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018 > FS: 00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000) > Stack: > ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0 > ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001 > ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282 > Call Trace: > [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0 > [<ffffffff816c18e3>] ? 
_raw_spin_unlock_irqrestore+0x73/0xa0 > [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210 > [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180 > [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0 > [<ffffffff810b7039>] ? local_clock+0x89/0xa0 > [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220 > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 > [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80 > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 > [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390 > [<ffffffff810a1980>] ? update_rmtp+0x70/0x70 > [<ffffffff810e99e4>] do_futex+0x394/0xa50 > [<ffffffff8119ec43>] ? might_fault+0x53/0xb0 > [<ffffffff810ea12d>] sys_futex+0x8d/0x190 > [<ffffffff816ca288>] tracesys+0xe1/0xe6 > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 > RIP [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > RSP <ffff8800803f7b28> > CR2: 0000000000000018 It looks like we got all the way to lock_acquire with a NULL 'lock' somehow. Darren, any idea how this could happen ? Dave ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: pi futex oops in __lock_acquire 2012-10-24 20:24 ` pi futex oops in __lock_acquire Dave Jones @ 2012-10-25 4:44 ` Darren Hart 2012-10-25 11:09 ` Dave Jones 2012-11-20 16:46 ` Dave Jones 0 siblings, 2 replies; 15+ messages in thread From: Darren Hart @ 2012-10-25 4:44 UTC (permalink / raw) To: Dave Jones, Linux Kernel On 10/24/2012 01:24 PM, Dave Jones wrote: > I've been able to trigger this for the last week or so. > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the > pi-futex code hasn't changed since the last time it found something.. > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 > > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > > PGD 8e72c067 PUD 34f07067 PMD 0 > > Oops: 0000 [#1] PREEMPT SMP > > CPU 7 > > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43 > > RIP: 0010:[<ffffffff810e185e>] [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > > RSP: 0018:ffff8800803f7b28 EFLAGS: 00010046 > > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 > > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000 > > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002 > > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018 > > FS: 00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000) > > Stack: > > ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0 > > ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001 > > ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282 > > 
Call Trace: > > [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0 > > [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0 > > [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210 > > [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180 > > [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0 > > [<ffffffff810b7039>] ? local_clock+0x89/0xa0 > > [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220 > > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 > > [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80 > > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 > > [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390 > > [<ffffffff810a1980>] ? update_rmtp+0x70/0x70 > > [<ffffffff810e99e4>] do_futex+0x394/0xa50 > > [<ffffffff8119ec43>] ? might_fault+0x53/0xb0 > > [<ffffffff810ea12d>] sys_futex+0x8d/0x190 > > [<ffffffff816ca288>] tracesys+0xe1/0xe6 > > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 > > RIP [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > > RSP <ffff8800803f7b28> > > CR2: 0000000000000018 > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow. > > Darren, any idea how this could happen ? I'm digging. Can you get trinity to provide the arguments it used that trigger the crash? That might help hone in on the exact path. -- Darren Hart Intel Open Source Technology Center Yocto Project - Technical Lead - Linux Kernel ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: pi futex oops in __lock_acquire 2012-10-25 4:44 ` Darren Hart @ 2012-10-25 11:09 ` Dave Jones 2012-11-20 16:46 ` Dave Jones 1 sibling, 0 replies; 15+ messages in thread From: Dave Jones @ 2012-10-25 11:09 UTC (permalink / raw) To: Darren Hart; +Cc: Linux Kernel On Wed, Oct 24, 2012 at 09:44:07PM -0700, Darren Hart wrote: > > > On 10/24/2012 01:24 PM, Dave Jones wrote: > > I've been able to trigger this for the last week or so. > > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the > > pi-futex code hasn't changed since the last time it found something.. > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 > > > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow. > > > > Darren, any idea how this could happen ? > > I'm digging. Can you get trinity to provide the arguments it used that > trigger the crash? That might help hone in on the exact path. Happens pretty quickly with -c futex for me. (30 seconds or so) Dave ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: pi futex oops in __lock_acquire 2012-10-25 4:44 ` Darren Hart 2012-10-25 11:09 ` Dave Jones @ 2012-11-20 16:46 ` Dave Jones 2012-11-20 17:27 ` Darren Hart 2012-11-20 23:10 ` Darren Hart 1 sibling, 2 replies; 15+ messages in thread From: Dave Jones @ 2012-11-20 16:46 UTC (permalink / raw) To: Darren Hart; +Cc: Linux Kernel On Wed, Oct 24, 2012 at 09:44:07PM -0700, Darren Hart wrote: > > I've been able to trigger this for the last week or so. > > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the > > pi-futex code hasn't changed since the last time it found something.. > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 > > > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > > > PGD 8e72c067 PUD 34f07067 PMD 0 > > > Oops: 0000 [#1] PREEMPT SMP > > > CPU 7 > > > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43 > > > RIP: 0010:[<ffffffff810e185e>] [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > > > RSP: 0018:ffff8800803f7b28 EFLAGS: 00010046 > > > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 > > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 > > > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000 > > > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002 > > > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018 > > > FS: 00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0 > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000) > > > Stack: > > > ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0 > > > ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001 
> > > ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282 > > > Call Trace: > > > [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0 > > > [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0 > > > [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210 > > > [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180 > > > [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0 > > > [<ffffffff810b7039>] ? local_clock+0x89/0xa0 > > > [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220 > > > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 > > > [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80 > > > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 > > > [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390 > > > [<ffffffff810a1980>] ? update_rmtp+0x70/0x70 > > > [<ffffffff810e99e4>] do_futex+0x394/0xa50 > > > [<ffffffff8119ec43>] ? might_fault+0x53/0xb0 > > > [<ffffffff810ea12d>] sys_futex+0x8d/0x190 > > > [<ffffffff816ca288>] tracesys+0xe1/0xe6 > > > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 > > > RIP [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > > > RSP <ffff8800803f7b28> > > > CR2: 0000000000000018 > > > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow. > > > > Darren, any idea how this could happen ? > > I'm digging. Can you get trinity to provide the arguments it used that > trigger the crash? That might help hone in on the exact path. Still seeing this on rc6. It happens very quickly when I run with "-c futex" I just pushed out all the pending trinity changes I was running with, but I don't think they would have been responsible. Dave ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: pi futex oops in __lock_acquire 2012-11-20 16:46 ` Dave Jones @ 2012-11-20 17:27 ` Darren Hart 2012-11-20 23:10 ` Darren Hart 1 sibling, 0 replies; 15+ messages in thread From: Darren Hart @ 2012-11-20 17:27 UTC (permalink / raw) To: Dave Jones, Linux Kernel On 11/20/2012 08:46 AM, Dave Jones wrote: > On Wed, Oct 24, 2012 at 09:44:07PM -0700, Darren Hart wrote: > > > > I've been able to trigger this for the last week or so. > > > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the > > > pi-futex code hasn't changed since the last time it found something.. > > > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 > > > > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > > > > PGD 8e72c067 PUD 34f07067 PMD 0 > > > > Oops: 0000 [#1] PREEMPT SMP > > > > CPU 7 > > > > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43 > > > > RIP: 0010:[<ffffffff810e185e>] [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > > > > RSP: 0018:ffff8800803f7b28 EFLAGS: 00010046 > > > > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 > > > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 > > > > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000 > > > > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002 > > > > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018 > > > > FS: 00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000 > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0 > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > > > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000) > > > > Stack: > > > > ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0 > > > > ffff8800803f7b68 ffffffff816c18e3 
ffff8800803f7d10 0000000000000001 > > > > ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282 > > > > Call Trace: > > > > [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0 > > > > [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0 > > > > [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210 > > > > [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180 > > > > [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0 > > > > [<ffffffff810b7039>] ? local_clock+0x89/0xa0 > > > > [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220 > > > > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 > > > > [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80 > > > > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 > > > > [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390 > > > > [<ffffffff810a1980>] ? update_rmtp+0x70/0x70 > > > > [<ffffffff810e99e4>] do_futex+0x394/0xa50 > > > > [<ffffffff8119ec43>] ? might_fault+0x53/0xb0 > > > > [<ffffffff810ea12d>] sys_futex+0x8d/0x190 > > > > [<ffffffff816ca288>] tracesys+0xe1/0xe6 > > > > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 > > > > RIP [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > > > > RSP <ffff8800803f7b28> > > > > CR2: 0000000000000018 > > > > > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow. > > > > > > Darren, any idea how this could happen ? > > > > I'm digging. Can you get trinity to provide the arguments it used that > > trigger the crash? That might help hone in on the exact path. > > Still seeing this on rc6. It happens very quickly when I run with "-c futex" > I just pushed out all the pending trinity changes I was running with, but > I don't think they would have been responsible. Thanks for the poke - I've been away on vacations and conferences and tied up with other things. 
I'll do a build and test run today.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Technical Lead - Linux Kernel

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: pi futex oops in __lock_acquire 2012-11-20 16:46 ` Dave Jones 2012-11-20 17:27 ` Darren Hart @ 2012-11-20 23:10 ` Darren Hart 2012-11-21 0:30 ` Darren Hart 1 sibling, 1 reply; 15+ messages in thread From: Darren Hart @ 2012-11-20 23:10 UTC (permalink / raw) To: Dave Jones, Linux Kernel On 11/20/2012 08:46 AM, Dave Jones wrote: > On Wed, Oct 24, 2012 at 09:44:07PM -0700, Darren Hart wrote: > > > > I've been able to trigger this for the last week or so. > > > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the > > > pi-futex code hasn't changed since the last time it found something.. > > > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 > > > > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > > > > PGD 8e72c067 PUD 34f07067 PMD 0 > > > > Oops: 0000 [#1] PREEMPT SMP > > > > CPU 7 > > > > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43 > > > > RIP: 0010:[<ffffffff810e185e>] [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > > > > RSP: 0018:ffff8800803f7b28 EFLAGS: 00010046 > > > > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 > > > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 > > > > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000 > > > > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002 > > > > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018 > > > > FS: 00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000 > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0 > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > > > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000) > > > > Stack: > > > > ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0 > > > > 
ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001 > > > > ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282 > > > > Call Trace: > > > > [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0 > > > > [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0 > > > > [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210 > > > > [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180 > > > > [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0 > > > > [<ffffffff810b7039>] ? local_clock+0x89/0xa0 > > > > [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220 > > > > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 > > > > [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80 > > > > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 > > > > [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390 > > > > [<ffffffff810a1980>] ? update_rmtp+0x70/0x70 > > > > [<ffffffff810e99e4>] do_futex+0x394/0xa50 > > > > [<ffffffff8119ec43>] ? might_fault+0x53/0xb0 > > > > [<ffffffff810ea12d>] sys_futex+0x8d/0x190 > > > > [<ffffffff816ca288>] tracesys+0xe1/0xe6 > > > > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 > > > > RIP [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 > > > > RSP <ffff8800803f7b28> > > > > CR2: 0000000000000018 > > > > > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow. > > > > > > Darren, any idea how this could happen ? > > > > I'm digging. Can you get trinity to provide the arguments it used that > > trigger the crash?u That might help hone in on the exact path. > > Still seeing this on rc6. It happens very quickly when I run with "-c futex" > I just pushed out all the pending trinity changes I was running with, but > I don't think they would have been responsible. 
>
> 	Dave
>

OK, so some instrumentation yields:

[ 1320.762028] futex_lock_pi: timed lock
[ 1320.762488] futex_lock_pi: hb=ffffffff81e89e28, hb->lock=ffffffff81e89e28, &q=ffff880181fa5cd8, q.lock_ptr=ffffffff81e89e28
[ 1320.763647] q.lock_ptr=ffffffff81e89e28
[ 1320.764132] fixup_owner: uaddr=00007f05465ac000, q=ffff880181fa5cd8, locked=0
[ 1323.066371] futex_lock_pi: hb=ffffffff81e89e28, hb->lock=ffffffff81e89e28, &q=ffff880181fc5cd8, q.lock_ptr=          (null)
[ 1323.069032] ------------[ cut here ]------------
[ 1323.069817] kernel BUG at kernel/futex.c:2052!

So somewhere between blocking on the lock and waking, q.lock_ptr is getting set to NULL. The only legitimate place this happens is wake_futex, and I see some wake_futex calls after the futex_lock_pi calls with the same uaddr in the trinity log. We are supposed to be protected by the q lock_ptr here, and wake_futex aborts any wakes of q's with a pi_state or rt_waiter... but... there appears to be a window in there. I say a window because I see a similar failure where the instrumentation doesn't catch the lock_ptr as NULL and fails at a later point on the same lock. Nice nasty parallel locking race. Digging in.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Technical Lead - Linux Kernel

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: pi futex oops in __lock_acquire 2012-11-20 23:10 ` Darren Hart @ 2012-11-21 0:30 ` Darren Hart 2012-11-21 17:37 ` Darren Hart 0 siblings, 1 reply; 15+ messages in thread From: Darren Hart @ 2012-11-21 0:30 UTC (permalink / raw) To: Dave Jones, Linux Kernel, Thomas Gleixner, Ingo Molnar On 11/20/2012 03:10 PM, Darren Hart wrote: > > > On 11/20/2012 08:46 AM, Dave Jones wrote: >> On Wed, Oct 24, 2012 at 09:44:07PM -0700, Darren Hart wrote: >> >> > > I've been able to trigger this for the last week or so. >> > > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the >> > > pi-futex code hasn't changed since the last time it found something.. >> > > >> > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 >> > > > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 >> > > > PGD 8e72c067 PUD 34f07067 PMD 0 >> > > > Oops: 0000 [#1] PREEMPT SMP >> > > > CPU 7 >> > > > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43 >> > > > RIP: 0010:[<ffffffff810e185e>] [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 >> > > > RSP: 0018:ffff8800803f7b28 EFLAGS: 00010046 >> > > > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 >> > > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 >> > > > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000 >> > > > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002 >> > > > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018 >> > > > FS: 00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000 >> > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> > > > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0 >> > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >> > > > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000) >> > > > Stack: >> > > > 
ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0 >> > > > ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001 >> > > > ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282 >> > > > Call Trace: >> > > > [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0 >> > > > [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0 >> > > > [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210 >> > > > [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180 >> > > > [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0 >> > > > [<ffffffff810b7039>] ? local_clock+0x89/0xa0 >> > > > [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220 >> > > > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 >> > > > [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80 >> > > > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 >> > > > [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390 >> > > > [<ffffffff810a1980>] ? update_rmtp+0x70/0x70 >> > > > [<ffffffff810e99e4>] do_futex+0x394/0xa50 >> > > > [<ffffffff8119ec43>] ? might_fault+0x53/0xb0 >> > > > [<ffffffff810ea12d>] sys_futex+0x8d/0x190 >> > > > [<ffffffff816ca288>] tracesys+0xe1/0xe6 >> > > > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 >> > > > RIP [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 >> > > > RSP <ffff8800803f7b28> >> > > > CR2: 0000000000000018 >> > > >> > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow. >> > > >> > > Darren, any idea how this could happen ? >> > >> > I'm digging. Can you get trinity to provide the arguments it used that >> > trigger the crash?u That might help hone in on the exact path. >> >> Still seeing this on rc6. 
It happens very quickly when I run with "-c futex" >> I just pushed out all the pending trinity changes I was running with, but >> I don't think they would have been responsible. >> >> Dave >> > > OK, so some instrumentation yields: > > [ 1320.762028] futex_lock_pi: timed lock > [ 1320.762488] futex_lock_pi: hb=ffffffff81e89e28, > hb->lock=ffffffff81e89e28, &q=ffff880181fa5cd8, q.lock_ptr=ffffffff81e89e28 > [ 1320.763647] q.lock_ptr=ffffffff81e89e28 > [ 1320.764132] fixup_owner: uaddr=00007f05465ac000, q=ffff880181fa5cd8, > locked=0 > [ 1323.066371] futex_lock_pi: hb=ffffffff81e89e28, > hb->lock=ffffffff81e89e28, &q=ffff880181fc5cd8, q.lock_ptr= (null) > [ 1323.069032] ------------[ cut here ]------------ > [ 1323.069817] kernel BUG at kernel/futex.c:2052! > > So somewhere between blocking on the lock and waking, q.lock_ptr is > getting set to NULL. The only legitimate place this happens in is > wake_futex, and I see some wake_futex calls after the futex_lock_pi > calls with the same uaddr in the trinity log. We are supposed to be > protected by the q lock_ptr here and wake_futex aborts any wakes of q's > with a pi_state or rt_waiter.... but .... there appears to be a window > in there. I say a window because I see a similar failure where the > instrumentation doesn't catch the lock_ptr as NULL and fails at a later > point on the same lock. Nice nasty parallel locking race. Digging in. > OK, the problem is a futex_wake_op() calling wake_futex() without checking for a pi_state or rt_waiters. I'm looking at the best way to fix it, considering moving the check out of futex_wake and into wake_futex... some more analysis needed for the best fix, but will have one by tomorrow. -- Darren Hart Intel Open Source Technology Center Yocto Project - Technical Lead - Linux Kernel ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: pi futex oops in __lock_acquire 2012-11-21 0:30 ` Darren Hart @ 2012-11-21 17:37 ` Darren Hart 2012-11-21 17:46 ` Dave Jones 0 siblings, 1 reply; 15+ messages in thread From: Darren Hart @ 2012-11-21 17:37 UTC (permalink / raw) To: Dave Jones, Linux Kernel, Thomas Gleixner, Ingo Molnar On 11/20/2012 04:30 PM, Darren Hart wrote: > > > On 11/20/2012 03:10 PM, Darren Hart wrote: >> >> >> On 11/20/2012 08:46 AM, Dave Jones wrote: >>> On Wed, Oct 24, 2012 at 09:44:07PM -0700, Darren Hart wrote: >>> >>> > > I've been able to trigger this for the last week or so. >>> > > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the >>> > > pi-futex code hasn't changed since the last time it found something.. >>> > > >>> > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 >>> > > > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 >>> > > > PGD 8e72c067 PUD 34f07067 PMD 0 >>> > > > Oops: 0000 [#1] PREEMPT SMP >>> > > > CPU 7 >>> > > > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43 >>> > > > RIP: 0010:[<ffffffff810e185e>] [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 >>> > > > RSP: 0018:ffff8800803f7b28 EFLAGS: 00010046 >>> > > > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 >>> > > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 >>> > > > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000 >>> > > > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002 >>> > > > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018 >>> > > > FS: 00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000 >>> > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> > > > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0 >>> > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >>> > > > Process trinity-child0 (pid: 27513, 
threadinfo ffff8800803f6000, task ffff880051dd8000) >>> > > > Stack: >>> > > > ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0 >>> > > > ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001 >>> > > > ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282 >>> > > > Call Trace: >>> > > > [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0 >>> > > > [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0 >>> > > > [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210 >>> > > > [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180 >>> > > > [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0 >>> > > > [<ffffffff810b7039>] ? local_clock+0x89/0xa0 >>> > > > [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220 >>> > > > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 >>> > > > [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80 >>> > > > [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390 >>> > > > [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390 >>> > > > [<ffffffff810a1980>] ? update_rmtp+0x70/0x70 >>> > > > [<ffffffff810e99e4>] do_futex+0x394/0xa50 >>> > > > [<ffffffff8119ec43>] ? might_fault+0x53/0xb0 >>> > > > [<ffffffff810ea12d>] sys_futex+0x8d/0x190 >>> > > > [<ffffffff816ca288>] tracesys+0xe1/0xe6 >>> > > > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 >>> > > > RIP [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0 >>> > > > RSP <ffff8800803f7b28> >>> > > > CR2: 0000000000000018 >>> > > >>> > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow. >>> > > >>> > > Darren, any idea how this could happen ? >>> > >>> > I'm digging. Can you get trinity to provide the arguments it used that >>> > trigger the crash?u That might help hone in on the exact path. >>> >>> Still seeing this on rc6. 
It happens very quickly when I run with "-c futex" >>> I just pushed out all the pending trinity changes I was running with, but >>> I don't think they would have been responsible. >>> >>> Dave >>> >> >> OK, so some instrumentation yields: >> >> [ 1320.762028] futex_lock_pi: timed lock >> [ 1320.762488] futex_lock_pi: hb=ffffffff81e89e28, >> hb->lock=ffffffff81e89e28, &q=ffff880181fa5cd8, q.lock_ptr=ffffffff81e89e28 >> [ 1320.763647] q.lock_ptr=ffffffff81e89e28 >> [ 1320.764132] fixup_owner: uaddr=00007f05465ac000, q=ffff880181fa5cd8, >> locked=0 >> [ 1323.066371] futex_lock_pi: hb=ffffffff81e89e28, >> hb->lock=ffffffff81e89e28, &q=ffff880181fc5cd8, q.lock_ptr= (null) >> [ 1323.069032] ------------[ cut here ]------------ >> [ 1323.069817] kernel BUG at kernel/futex.c:2052! >> >> So somewhere between blocking on the lock and waking, q.lock_ptr is >> getting set to NULL. The only legitimate place this happens in is >> wake_futex, and I see some wake_futex calls after the futex_lock_pi >> calls with the same uaddr in the trinity log. We are supposed to be >> protected by the q lock_ptr here and wake_futex aborts any wakes of q's >> with a pi_state or rt_waiter.... but .... there appears to be a window >> in there. I say a window because I see a similar failure where the >> instrumentation doesn't catch the lock_ptr as NULL and fails at a later >> point on the same lock. Nice nasty parallel locking race. Digging in. >> > > OK, the problem is a futex_wake_op() calling wake_futex() without checking > for a pi_state or rt_waiters. I'm looking at the best way to fix it, > considering moving the check out of futex_wake and into wake_futex... some > more analysis needed for the best fix, but will have one by tomorrow. > I've sent a patch to address this. It has now been running "trinity -c futex" for about 12 hours. I haven't seen any more futex failures, but I have seen a few OOMs. Is that expected from trinity? 
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Technical Lead - Linux Kernel

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: pi futex oops in __lock_acquire
  2012-11-21 17:37 ` Darren Hart
@ 2012-11-21 17:46 ` Dave Jones
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Jones @ 2012-11-21 17:46 UTC (permalink / raw)
To: Darren Hart; +Cc: Linux Kernel, Thomas Gleixner, Ingo Molnar

On Wed, Nov 21, 2012 at 09:37:40AM -0800, Darren Hart wrote:
 > I've sent a patch to address this. It has now been running "trinity -c
 > futex" for about 12 hours. I haven't seen any more futex failures

Looks good here too.

 > but I
 > have seen a few OOMs. Is that expected from trinity?

Yeah, seems there's some leaks (both in trinity and the kernel).
Oddly enough someone else emailed me about that this morning, which
prompted me to write up this: http://codemonkey.org.uk/2012/11/21/fuzz-testing-oom-killer/

Which is just part of the problem. There are other issues too it seems. Fun.

	Dave

^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2012-11-21 17:47 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-18  1:53 MAX_LOCKDEP_ENTRIES too low (called from ioc_release_fn) Dave Jones
2012-10-18  5:53 ` Jens Axboe
2012-10-19  5:21 ` Dave Jones
2012-10-19 12:49 ` Peter Zijlstra
2012-10-19 19:29 ` Dave Jones
2012-10-23 19:50 ` MAX_LOCKDEP_ENTRIES too low Dave Jones
2012-10-24 20:24 ` pi futex oops in __lock_acquire Dave Jones
2012-10-25  4:44 ` Darren Hart
2012-10-25 11:09 ` Dave Jones
2012-11-20 16:46 ` Dave Jones
2012-11-20 17:27 ` Darren Hart
2012-11-20 23:10 ` Darren Hart
2012-11-21  0:30 ` Darren Hart
2012-11-21 17:37 ` Darren Hart
2012-11-21 17:46 ` Dave Jones