linux-kernel.vger.kernel.org archive mirror
* MAX_LOCKDEP_ENTRIES too low  (called from ioc_release_fn)
@ 2012-10-18  1:53 Dave Jones
  2012-10-18  5:53 ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Jones @ 2012-10-18  1:53 UTC (permalink / raw)
  To: axboe; +Cc: Linux Kernel

Triggered while fuzz testing..


BUG: MAX_LOCKDEP_ENTRIES too low!
turning off the locking correctness validator.
Pid: 22788, comm: kworker/2:1 Not tainted 3.7.0-rc1+ #34
Call Trace:
 [<ffffffff810decdd>] add_lock_to_list.isra.29.constprop.45+0xdd/0xf0
 [<ffffffff810e2871>] __lock_acquire+0x1121/0x1ba0
 [<ffffffff810e3a12>] lock_acquire+0xa2/0x220
 [<ffffffff8117bad2>] ? free_one_page+0x32/0x450
 [<ffffffff816c5a59>] ? sub_preempt_count+0x79/0xd0
 [<ffffffff816c0800>] _raw_spin_lock+0x40/0x80
 [<ffffffff8117bad2>] ? free_one_page+0x32/0x450
 [<ffffffff8117bad2>] free_one_page+0x32/0x450
 [<ffffffff8117bf41>] ? __free_pages_ok.part.58+0x51/0x110
 [<ffffffff8117bf9c>] __free_pages_ok.part.58+0xac/0x110
 [<ffffffff8117cd73>] __free_pages+0x73/0x90
 [<ffffffff811cb4f3>] __free_slab+0xd3/0x1b0
 [<ffffffff811cb609>] discard_slab+0x39/0x50
 [<ffffffff816b77db>] __slab_free+0x378/0x3a3
 [<ffffffff81341289>] ? ioc_release_fn+0x99/0xe0
 [<ffffffff81341289>] ? ioc_release_fn+0x99/0xe0
 [<ffffffff811cd4e2>] kmem_cache_free+0x2f2/0x320
 [<ffffffff816c5a59>] ? sub_preempt_count+0x79/0xd0
 [<ffffffff81341289>] ioc_release_fn+0x99/0xe0
 [<ffffffff81095a37>] process_one_work+0x207/0x780
 [<ffffffff810959c7>] ? process_one_work+0x197/0x780
 [<ffffffff813411f0>] ? get_io_context+0x20/0x20
 [<ffffffff8109638e>] worker_thread+0x15e/0x440
 [<ffffffff81096230>] ? rescuer_thread+0x240/0x240
 [<ffffffff8109d0cd>] kthread+0xed/0x100
 [<ffffffff810de02e>] ? put_lock_stats.isra.25+0xe/0x40
 [<ffffffff8109cfe0>] ? kthread_create_on_node+0x160/0x160
 [<ffffffff816c9dac>] ret_from_fork+0x7c/0xb0
 [<ffffffff8109cfe0>] ? kthread_create_on_node+0x160/0x160



* Re: MAX_LOCKDEP_ENTRIES too low  (called from ioc_release_fn)
  2012-10-18  1:53 MAX_LOCKDEP_ENTRIES too low (called from ioc_release_fn) Dave Jones
@ 2012-10-18  5:53 ` Jens Axboe
  2012-10-19  5:21   ` Dave Jones
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2012-10-18  5:53 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, Peter Zijlstra, Ingo Molnar

On 2012-10-18 03:53, Dave Jones wrote:
> Triggered while fuzz testing..
> 
> 
> BUG: MAX_LOCKDEP_ENTRIES too low!
> turning off the locking correctness validator.
> Pid: 22788, comm: kworker/2:1 Not tainted 3.7.0-rc1+ #34
> Call Trace:
>  [<ffffffff810decdd>] add_lock_to_list.isra.29.constprop.45+0xdd/0xf0
>  [<ffffffff810e2871>] __lock_acquire+0x1121/0x1ba0
>  [<ffffffff810e3a12>] lock_acquire+0xa2/0x220
>  [<ffffffff8117bad2>] ? free_one_page+0x32/0x450
>  [<ffffffff816c5a59>] ? sub_preempt_count+0x79/0xd0
>  [<ffffffff816c0800>] _raw_spin_lock+0x40/0x80
>  [<ffffffff8117bad2>] ? free_one_page+0x32/0x450
>  [<ffffffff8117bad2>] free_one_page+0x32/0x450
>  [<ffffffff8117bf41>] ? __free_pages_ok.part.58+0x51/0x110
>  [<ffffffff8117bf9c>] __free_pages_ok.part.58+0xac/0x110
>  [<ffffffff8117cd73>] __free_pages+0x73/0x90
>  [<ffffffff811cb4f3>] __free_slab+0xd3/0x1b0
>  [<ffffffff811cb609>] discard_slab+0x39/0x50
>  [<ffffffff816b77db>] __slab_free+0x378/0x3a3
>  [<ffffffff81341289>] ? ioc_release_fn+0x99/0xe0
>  [<ffffffff81341289>] ? ioc_release_fn+0x99/0xe0
>  [<ffffffff811cd4e2>] kmem_cache_free+0x2f2/0x320
>  [<ffffffff816c5a59>] ? sub_preempt_count+0x79/0xd0
>  [<ffffffff81341289>] ioc_release_fn+0x99/0xe0
>  [<ffffffff81095a37>] process_one_work+0x207/0x780
>  [<ffffffff810959c7>] ? process_one_work+0x197/0x780
>  [<ffffffff813411f0>] ? get_io_context+0x20/0x20
>  [<ffffffff8109638e>] worker_thread+0x15e/0x440
>  [<ffffffff81096230>] ? rescuer_thread+0x240/0x240
>  [<ffffffff8109d0cd>] kthread+0xed/0x100
>  [<ffffffff810de02e>] ? put_lock_stats.isra.25+0xe/0x40
>  [<ffffffff8109cfe0>] ? kthread_create_on_node+0x160/0x160
>  [<ffffffff816c9dac>] ret_from_fork+0x7c/0xb0
>  [<ffffffff8109cfe0>] ? kthread_create_on_node+0x160/0x160

Not sure why you are CC'ing a call site, rather than the maintainers of
the code. Just looks like lockdep is using too small a static value.
Though it is pretty darn large...
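
For reference, a sketch of where that static value lives and what it sizes;
the figures below are the 3.7-era defaults as I remember them, so
double-check against the actual tree:

/* kernel/lockdep_internals.h (3.7-era defaults, from memory): */
#define MAX_LOCKDEP_ENTRIES     16384UL
#define MAX_LOCKDEP_KEYS_BITS   13
#define MAX_LOCKDEP_KEYS        (1UL << MAX_LOCKDEP_KEYS_BITS)

/* kernel/lockdep.c: the statically sized dependency table; the BUG above
 * fires when allocating a new entry finds no free slot left in it. */
static struct lock_list list_entries[MAX_LOCKDEP_ENTRIES];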

-- 
Jens Axboe



* Re: MAX_LOCKDEP_ENTRIES too low  (called from ioc_release_fn)
  2012-10-18  5:53 ` Jens Axboe
@ 2012-10-19  5:21   ` Dave Jones
  2012-10-19 12:49     ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Jones @ 2012-10-19  5:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel, Peter Zijlstra, Ingo Molnar

On Thu, Oct 18, 2012 at 07:53:08AM +0200, Jens Axboe wrote:
 > On 2012-10-18 03:53, Dave Jones wrote:
 > > Triggered while fuzz testing..
 > > 
 > > 
 > > BUG: MAX_LOCKDEP_ENTRIES too low!
 > > turning off the locking correctness validator.
 > > Pid: 22788, comm: kworker/2:1 Not tainted 3.7.0-rc1+ #34
 > > Call Trace:
 > >  [<ffffffff810decdd>] add_lock_to_list.isra.29.constprop.45+0xdd/0xf0
 > >  [<ffffffff810e2871>] __lock_acquire+0x1121/0x1ba0
 > >  [<ffffffff810e3a12>] lock_acquire+0xa2/0x220
 > >  [<ffffffff8117bad2>] ? free_one_page+0x32/0x450
 > >  [<ffffffff816c5a59>] ? sub_preempt_count+0x79/0xd0
 > >  [<ffffffff816c0800>] _raw_spin_lock+0x40/0x80
 > >  [<ffffffff8117bad2>] ? free_one_page+0x32/0x450
 > >  [<ffffffff8117bad2>] free_one_page+0x32/0x450
 > >  [<ffffffff8117bf41>] ? __free_pages_ok.part.58+0x51/0x110
 > >  [<ffffffff8117bf9c>] __free_pages_ok.part.58+0xac/0x110
 > >  [<ffffffff8117cd73>] __free_pages+0x73/0x90
 > >  [<ffffffff811cb4f3>] __free_slab+0xd3/0x1b0
 > >  [<ffffffff811cb609>] discard_slab+0x39/0x50
 > >  [<ffffffff816b77db>] __slab_free+0x378/0x3a3
 > >  [<ffffffff81341289>] ? ioc_release_fn+0x99/0xe0
 > >  [<ffffffff81341289>] ? ioc_release_fn+0x99/0xe0
 > >  [<ffffffff811cd4e2>] kmem_cache_free+0x2f2/0x320
 > >  [<ffffffff816c5a59>] ? sub_preempt_count+0x79/0xd0
 > >  [<ffffffff81341289>] ioc_release_fn+0x99/0xe0
 > >  [<ffffffff81095a37>] process_one_work+0x207/0x780
 > >  [<ffffffff810959c7>] ? process_one_work+0x197/0x780
 > >  [<ffffffff813411f0>] ? get_io_context+0x20/0x20
 > >  [<ffffffff8109638e>] worker_thread+0x15e/0x440
 > >  [<ffffffff81096230>] ? rescuer_thread+0x240/0x240
 > >  [<ffffffff8109d0cd>] kthread+0xed/0x100
 > >  [<ffffffff810de02e>] ? put_lock_stats.isra.25+0xe/0x40
 > >  [<ffffffff8109cfe0>] ? kthread_create_on_node+0x160/0x160
 > >  [<ffffffff816c9dac>] ret_from_fork+0x7c/0xb0
 > >  [<ffffffff8109cfe0>] ? kthread_create_on_node+0x160/0x160
 > 
 > Not sure why you are CC'ing a call site, rather than the maintainers of
 > the code. Just looks like lockdep is using too small a static value.
 > Though it is pretty darn large...

You're right, it's a huge chunk of memory.
It looks like I can trigger this from multiple callsites..
Another different trace below.

Not sure why this suddenly got a lot worse in 3.7

Peter, Ingo ?

	Dave

BUG: MAX_LOCKDEP_ENTRIES too low!
turning off the locking correctness validator.
Pid: 22350, comm: trinity-child0 Not tainted 3.7.0-rc1+ #36
Call Trace:
 [<ffffffff810decdd>] add_lock_to_list.isra.29.constprop.45+0xdd/0xf0
 [<ffffffff810e2871>] __lock_acquire+0x1121/0x1ba0
 [<ffffffff810b6f89>] ? local_clock+0x89/0xa0
 [<ffffffff811b84a5>] ? __swap_duplicate+0xb5/0x190
 [<ffffffff810dea58>] ? trace_hardirqs_off_caller+0x28/0xd0
 [<ffffffff810e3a12>] lock_acquire+0xa2/0x220
 [<ffffffff811b663d>] ? __add_to_swap_cache+0x6d/0x180
 [<ffffffff816c0f79>] ? _raw_spin_lock_irq+0x29/0x90
 [<ffffffff816c0fa6>] _raw_spin_lock_irq+0x56/0x90
 [<ffffffff811b663d>] ? __add_to_swap_cache+0x6d/0x180
 [<ffffffff811b663d>] __add_to_swap_cache+0x6d/0x180
 [<ffffffff811b6c45>] read_swap_cache_async+0xb5/0x220
 [<ffffffff811b6e4e>] swapin_readahead+0x9e/0xf0
 [<ffffffff811a2c86>] handle_pte_fault+0x6d6/0xae0
 [<ffffffff816c5ad9>] ? sub_preempt_count+0x79/0xd0
 [<ffffffff8136d37e>] ? delay_tsc+0xae/0x120
 [<ffffffff8136d268>] ? __const_udelay+0x28/0x30
 [<ffffffff811a4919>] handle_mm_fault+0x289/0x350
 [<ffffffff816c538e>] __do_page_fault+0x18e/0x530
 [<ffffffff810b6f89>] ? local_clock+0x89/0xa0
 [<ffffffff816b77e1>] ? __slab_free+0x32e/0x3a3
 [<ffffffff8112d2e9>] ? rcu_user_exit+0xc9/0xf0
 [<ffffffffa0000020>] ? 0xffffffffa000001f
 [<ffffffffa0000020>] ? 0xffffffffa000001f
 [<ffffffff816c575b>] do_page_fault+0x2b/0x50
 [<ffffffff816c1e38>] page_fault+0x28/0x30
 [<ffffffffa0000020>] ? 0xffffffffa000001f
 [<ffffffff813885ac>] ? strncpy_from_user+0x6c/0x120
 [<ffffffffa0000020>] ? 0xffffffffa000001f
 [<ffffffff8120e94f>] setxattr+0x6f/0x1d0
 [<ffffffff816c5ad9>] ? sub_preempt_count+0x79/0xd0
 [<ffffffff8137e8b5>] ? __percpu_counter_add+0x75/0xc0
 [<ffffffff811e92d1>] ? __sb_start_write+0x101/0x1d0
 [<ffffffff81209d94>] ? mnt_want_write+0x24/0x50
 [<ffffffff81209d94>] ? mnt_want_write+0x24/0x50
 [<ffffffff810b0de1>] ? get_parent_ip+0x11/0x50
 [<ffffffff816c5ad9>] ? sub_preempt_count+0x79/0xd0
 [<ffffffff81209d30>] ? __mnt_want_write+0x60/0xa0
 [<ffffffffa0000020>] ? 0xffffffffa000001f
 [<ffffffff8120ecd5>] sys_setxattr+0x95/0xb0
 [<ffffffff816ca108>] tracesys+0xe1/0xe6
 [<ffffffffa0000020>] ? 0xffffffffa000001f



* Re: MAX_LOCKDEP_ENTRIES too low  (called from ioc_release_fn)
  2012-10-19  5:21   ` Dave Jones
@ 2012-10-19 12:49     ` Peter Zijlstra
  2012-10-19 19:29       ` Dave Jones
  2012-10-23 19:50       ` MAX_LOCKDEP_ENTRIES too low Dave Jones
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2012-10-19 12:49 UTC (permalink / raw)
  To: Dave Jones; +Cc: Jens Axboe, Linux Kernel, Ingo Molnar

On Fri, 2012-10-19 at 01:21 -0400, Dave Jones wrote:
>  > Not sure why you are CC'ing a call site, rather than the maintainers of
>  > the code. Just looks like lockdep is using too small a static value.
>  > Though it is pretty darn large...
> 
> You're right, it's a huge chunk of memory.
> It looks like I can trigger this from multiple callsites..
> Another different trace below.
> 
> Not sure why this suddenly got a lot worse in 3.7 

Did we add a static array of structures with locks in them somewhere? Doing
that is a great way of blowing up the number of lock classes and the
resulting number of lock dependency chains.
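
To illustrate the pattern (struct foo, foo_table, foo_init and the element
count are made up for this example, not something I know to be in 3.7):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/spinlock.h>

struct foo {
        spinlock_t lock;
        int data;
};

/*
 * Statically allocated locks that only ever get a static initializer (no
 * spin_lock_init()) carry no class key, so lockdep falls back to keying
 * each one off its own address: 512 elements means 512 distinct lock
 * classes, and every dependency chain touching them is multiplied to
 * match, which is what eats the dependency entries.
 */
static struct foo foo_table[512] = {
        [0 ... 511] = { .lock = __SPIN_LOCK_UNLOCKED(foo_table_lock) },
};

/*
 * By contrast, initializing the locks at runtime makes all 512 of them
 * share the single class key embedded at this spin_lock_init() call site,
 * i.e. one class for the whole array.
 */
static void __init foo_init(void)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(foo_table); i++)
                spin_lock_init(&foo_table[i].lock);
}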

From Documentation/lockdep-design.txt; it talks about overflowing
MAX_LOCKDEP_KEYS, but I suppose it's a good start for overflowing the
dependency entries too; more classes means more dependencies, after all.

---
Of course, if you do run out of lock classes, the next thing to do is
to find the offending lock classes.  First, the following command gives
you the number of lock classes currently in use along with the maximum:

        grep "lock-classes" /proc/lockdep_stats

This command produces the following output on a modest system:

         lock-classes:                          748 [max: 8191]

If the number allocated (748 above) increases continually over time,
then there is likely a leak.  The following command can be used to
identify the leaking lock classes:

        grep "BD" /proc/lockdep

Run the command and save the output, then compare against the output from
a later run of this command to identify the leakers.  This same output
can also help you find situations where runtime lock initialization has
been omitted.



* Re: MAX_LOCKDEP_ENTRIES too low  (called from ioc_release_fn)
  2012-10-19 12:49     ` Peter Zijlstra
@ 2012-10-19 19:29       ` Dave Jones
  2012-10-23 19:50       ` MAX_LOCKDEP_ENTRIES too low Dave Jones
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Jones @ 2012-10-19 19:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jens Axboe, Linux Kernel, Ingo Molnar

On Fri, Oct 19, 2012 at 02:49:32PM +0200, Peter Zijlstra wrote:

 > Of course, if you do run out of lock classes, the next thing to do is
 > to find the offending lock classes.  First, the following command gives
 > you the number of lock classes currently in use along with the maximum:
 > 
 >         grep "lock-classes" /proc/lockdep_stats
 > 
 > This command produces the following output on a modest system:
 > 
 >          lock-classes:                          748 [max: 8191]

After the BUG gets hit..

 lock-classes:                         1726 [max: 8191]

 > If the number allocated (748 above) increases continually over time,
 > then there is likely a leak.  The following command can be used to
 > identify the leaking lock classes:
 > 
 >         grep "BD" /proc/lockdep
 > 
 > Run the command and save the output, then compare against the output from
 > a later run of this command to identify the leakers.  This same output
 > can also help you find situations where runtime lock initialization has
 > been omitted.

I've not had a chance to do this, because after the BUG, lockdep turns itself off,
and I've not rebooted. I'm probably not going to get to this until after the weekend.

There are just a *lot* of dependencies.

Here's the full output http://codemonkey.org.uk/junk/lockdep

the top few backwards deps..

ffffffff81c8f218 FD:    1 BD: 1201 -.-.-.: pool_lock
ffffffff82ae1210 FD:    2 BD: 1200 -.-.-.: &obj_hash[i].lock
ffffffff820677c1 FD:    1 BD: 1131 -.-.-.: &rt_rq->rt_runtime_lock
ffffffff82066949 FD:    3 BD: 1131 -.-.-.: &cpu_base->lock
ffffffff820221c0 FD:    1 BD: 1130 -.-.-.: &sig->cputimer.lock
ffffffff820677b8 FD:    5 BD: 1129 -.-.-.: &rt_b->rt_runtime_lock
ffffffff82067675 FD:    3 BD: 1129 ..-.-.: &rq->lock/1
ffffffff82067674 FD:    8 BD: 1128 -.-.-.: &rq->lock
ffffffff8298bbd0 FD:    1 BD: 1006 -.-.-.: &(&n->list_lock)->rlock

	Dave



* Re: MAX_LOCKDEP_ENTRIES too low
  2012-10-19 12:49     ` Peter Zijlstra
  2012-10-19 19:29       ` Dave Jones
@ 2012-10-23 19:50       ` Dave Jones
  2012-10-24 20:24         ` pi futex oops in __lock_acquire Dave Jones
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Jones @ 2012-10-23 19:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Linux Kernel, Ingo Molnar

On Fri, Oct 19, 2012 at 02:49:32PM +0200, Peter Zijlstra wrote:

 > > Not sure why this suddenly got a lot worse in 3.7 
 > 
 > Did we add a static array of structures with locks in somewhere? Doing
 > that is a great way of blowing up the number of lock classes and the
 > resulting amount of lock dependency chains.

So I hit it again, but didn't see anything that I hadn't seen already
in /proc/lockdep.  So I tried doubling MAX_LOCKDEP_ENTRIES again.
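
(Concretely that's just bumping the compile-time constant in
kernel/lockdep_internals.h; the stock value is quoted from memory, so treat
it as approximate:)

/* stock 3.7 value is 16384UL; each doubling just grows the static
 * list_entries[] array that it sizes, e.g.: */
#define MAX_LOCKDEP_ENTRIES     32768UL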

A while later, I got this oops. I'm unsure now if this is a genuine bug,
or a side-effect of lockdep not working with such a large MAX_LOCKDEP_ENTRIES.

opinions ?


BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
PGD 8e72c067 PUD 34f07067 PMD 0 
Oops: 0000 [#1] PREEMPT SMP 
Modules linked in: fuse nfnetlink ipt_ULOG binfmt_misc nfc caif_socket caif phonet can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds af_key decnet rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 lockd sunrpc bluetooth rfkill ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables kvm_intel kvm usb_debug crc32c_intel ghash_clmulni_intel microcode pcspkr i2c_i801 e1000e uinput i915 video i2c_algo_bit drm_kms_helper drm i2c_core
CPU 7 
Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43 Intel Corporation 2012 Client Platform/Emerald Lake 2
RIP: 0010:[<ffffffff810e185e>]  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
RSP: 0018:ffff8800803f7b28  EFLAGS: 00010046
RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002
R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018
FS:  00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000)
Stack:
 ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0
 ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001
 ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282
Call Trace:
 [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0
 [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0
 [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210
 [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180
 [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0
 [<ffffffff810b7039>] ? local_clock+0x89/0xa0
 [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220
 [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
 [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80
 [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
 [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390
 [<ffffffff810a1980>] ? update_rmtp+0x70/0x70
 [<ffffffff810e99e4>] do_futex+0x394/0xa50
 [<ffffffff8119ec43>] ? might_fault+0x53/0xb0
 [<ffffffff810ea12d>] sys_futex+0x8d/0x190
 [<ffffffff816ca288>] tracesys+0xe1/0xe6
Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 
RIP  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
 RSP <ffff8800803f7b28>
CR2: 0000000000000018




* pi futex oops in __lock_acquire
  2012-10-23 19:50       ` MAX_LOCKDEP_ENTRIES too low Dave Jones
@ 2012-10-24 20:24         ` Dave Jones
  2012-10-25  4:44           ` Darren Hart
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Jones @ 2012-10-24 20:24 UTC (permalink / raw)
  To: Linux Kernel; +Cc: Darren Hart

I've been able to trigger this for the last week or so.
Unclear whether this is a new bug, or my fuzzer got smarter, but I see the
pi-futex code hasn't changed since the last time it found something..

 > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
 > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
 > PGD 8e72c067 PUD 34f07067 PMD 0 
 > Oops: 0000 [#1] PREEMPT SMP 
 > CPU 7 
 > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43
 > RIP: 0010:[<ffffffff810e185e>]  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
 > RSP: 0018:ffff8800803f7b28  EFLAGS: 00010046
 > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000
 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
 > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000
 > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002
 > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018
 > FS:  00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000
 > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0
 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000)
 > Stack:
 >  ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0
 >  ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001
 >  ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282
 > Call Trace:
 >  [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0
 >  [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0
 >  [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210
 >  [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180
 >  [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0
 >  [<ffffffff810b7039>] ? local_clock+0x89/0xa0
 >  [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220
 >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
 >  [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80
 >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
 >  [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390
 >  [<ffffffff810a1980>] ? update_rmtp+0x70/0x70
 >  [<ffffffff810e99e4>] do_futex+0x394/0xa50
 >  [<ffffffff8119ec43>] ? might_fault+0x53/0xb0
 >  [<ffffffff810ea12d>] sys_futex+0x8d/0x190
 >  [<ffffffff816ca288>] tracesys+0xe1/0xe6
 > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 
 > RIP  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
 >  RSP <ffff8800803f7b28>
 > CR2: 0000000000000018

It looks like we got all the way to lock_acquire with a NULL 'lock' somehow.

Darren, any idea how this could happen ?

	Dave



* Re: pi futex oops in __lock_acquire
  2012-10-24 20:24         ` pi futex oops in __lock_acquire Dave Jones
@ 2012-10-25  4:44           ` Darren Hart
  2012-10-25 11:09             ` Dave Jones
  2012-11-20 16:46             ` Dave Jones
  0 siblings, 2 replies; 15+ messages in thread
From: Darren Hart @ 2012-10-25  4:44 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel



On 10/24/2012 01:24 PM, Dave Jones wrote:
> I've been able to trigger this for the last week or so.
> Unclear whether this is a new bug, or my fuzzer got smarter, but I see the
> pi-futex code hasn't changed since the last time it found something..
> 
>  > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>  > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>  > PGD 8e72c067 PUD 34f07067 PMD 0 
>  > Oops: 0000 [#1] PREEMPT SMP 
>  > CPU 7 
>  > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43
>  > RIP: 0010:[<ffffffff810e185e>]  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>  > RSP: 0018:ffff8800803f7b28  EFLAGS: 00010046
>  > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000
>  > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
>  > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000
>  > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002
>  > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018
>  > FS:  00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000
>  > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0
>  > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>  > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000)
>  > Stack:
>  >  ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0
>  >  ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001
>  >  ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282
>  > Call Trace:
>  >  [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0
>  >  [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0
>  >  [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210
>  >  [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180
>  >  [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0
>  >  [<ffffffff810b7039>] ? local_clock+0x89/0xa0
>  >  [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220
>  >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
>  >  [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80
>  >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
>  >  [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390
>  >  [<ffffffff810a1980>] ? update_rmtp+0x70/0x70
>  >  [<ffffffff810e99e4>] do_futex+0x394/0xa50
>  >  [<ffffffff8119ec43>] ? might_fault+0x53/0xb0
>  >  [<ffffffff810ea12d>] sys_futex+0x8d/0x190
>  >  [<ffffffff816ca288>] tracesys+0xe1/0xe6
>  > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 
>  > RIP  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>  >  RSP <ffff8800803f7b28>
>  > CR2: 0000000000000018
> 
> It looks like we got all the way to lock_acquire with a NULL 'lock' somehow.
> 
> Darren, any idea how this could happen ?

I'm digging. Can you get trinity to provide the arguments it used that
trigger the crash? That might help hone in on the exact path.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Technical Lead - Linux Kernel


* Re: pi futex oops in __lock_acquire
  2012-10-25  4:44           ` Darren Hart
@ 2012-10-25 11:09             ` Dave Jones
  2012-11-20 16:46             ` Dave Jones
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Jones @ 2012-10-25 11:09 UTC (permalink / raw)
  To: Darren Hart; +Cc: Linux Kernel

On Wed, Oct 24, 2012 at 09:44:07PM -0700, Darren Hart wrote:
 > 
 > 
 > On 10/24/2012 01:24 PM, Dave Jones wrote:
 > > I've been able to trigger this for the last week or so.
 > > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the
 > > pi-futex code hasn't changed since the last time it found something..
 > > 
 > >  > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
 > >
 > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow.
 > > 
 > > Darren, any idea how this could happen ?
 > 
 > I'm digging. Can you get trinity to provide the arguments it used that
 > trigger the crash? That might help hone in on the exact path.
 
Happens pretty quickly with -c futex for me. (30 seconds or so)

	Dave


* Re: pi futex oops in __lock_acquire
  2012-10-25  4:44           ` Darren Hart
  2012-10-25 11:09             ` Dave Jones
@ 2012-11-20 16:46             ` Dave Jones
  2012-11-20 17:27               ` Darren Hart
  2012-11-20 23:10               ` Darren Hart
  1 sibling, 2 replies; 15+ messages in thread
From: Dave Jones @ 2012-11-20 16:46 UTC (permalink / raw)
  To: Darren Hart; +Cc: Linux Kernel

On Wed, Oct 24, 2012 at 09:44:07PM -0700, Darren Hart wrote:
 
 > > I've been able to trigger this for the last week or so.
 > > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the
 > > pi-futex code hasn't changed since the last time it found something..
 > > 
 > >  > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
 > >  > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
 > >  > PGD 8e72c067 PUD 34f07067 PMD 0 
 > >  > Oops: 0000 [#1] PREEMPT SMP 
 > >  > CPU 7 
 > >  > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43
 > >  > RIP: 0010:[<ffffffff810e185e>]  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
 > >  > RSP: 0018:ffff8800803f7b28  EFLAGS: 00010046
 > >  > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000
 > >  > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
 > >  > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000
 > >  > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002
 > >  > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018
 > >  > FS:  00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000
 > >  > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 > >  > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0
 > >  > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 > >  > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 > >  > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000)
 > >  > Stack:
 > >  >  ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0
 > >  >  ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001
 > >  >  ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282
 > >  > Call Trace:
 > >  >  [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0
 > >  >  [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0
 > >  >  [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210
 > >  >  [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180
 > >  >  [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0
 > >  >  [<ffffffff810b7039>] ? local_clock+0x89/0xa0
 > >  >  [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220
 > >  >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
 > >  >  [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80
 > >  >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
 > >  >  [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390
 > >  >  [<ffffffff810a1980>] ? update_rmtp+0x70/0x70
 > >  >  [<ffffffff810e99e4>] do_futex+0x394/0xa50
 > >  >  [<ffffffff8119ec43>] ? might_fault+0x53/0xb0
 > >  >  [<ffffffff810ea12d>] sys_futex+0x8d/0x190
 > >  >  [<ffffffff816ca288>] tracesys+0xe1/0xe6
 > >  > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 
 > >  > RIP  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
 > >  >  RSP <ffff8800803f7b28>
 > >  > CR2: 0000000000000018
 > > 
 > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow.
 > > 
 > > Darren, any idea how this could happen ?
 > 
 > I'm digging. Can you get trinity to provide the arguments it used that
 > trigger the crash? That might help hone in on the exact path.

Still seeing this on rc6. It happens very quickly when I run with "-c futex".
I just pushed out all the pending trinity changes I was running with, but
I don't think they would have been responsible.

	Dave



* Re: pi futex oops in __lock_acquire
  2012-11-20 16:46             ` Dave Jones
@ 2012-11-20 17:27               ` Darren Hart
  2012-11-20 23:10               ` Darren Hart
  1 sibling, 0 replies; 15+ messages in thread
From: Darren Hart @ 2012-11-20 17:27 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel



On 11/20/2012 08:46 AM, Dave Jones wrote:
> On Wed, Oct 24, 2012 at 09:44:07PM -0700, Darren Hart wrote:
>  
>  > > I've been able to trigger this for the last week or so.
>  > > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the
>  > > pi-futex code hasn't changed since the last time it found something..
>  > > 
>  > >  > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>  > >  > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>  > >  > PGD 8e72c067 PUD 34f07067 PMD 0 
>  > >  > Oops: 0000 [#1] PREEMPT SMP 
>  > >  > CPU 7 
>  > >  > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43
>  > >  > RIP: 0010:[<ffffffff810e185e>]  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>  > >  > RSP: 0018:ffff8800803f7b28  EFLAGS: 00010046
>  > >  > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000
>  > >  > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
>  > >  > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000
>  > >  > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002
>  > >  > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018
>  > >  > FS:  00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000
>  > >  > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  > >  > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0
>  > >  > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  > >  > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>  > >  > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000)
>  > >  > Stack:
>  > >  >  ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0
>  > >  >  ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001
>  > >  >  ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282
>  > >  > Call Trace:
>  > >  >  [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0
>  > >  >  [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0
>  > >  >  [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210
>  > >  >  [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180
>  > >  >  [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0
>  > >  >  [<ffffffff810b7039>] ? local_clock+0x89/0xa0
>  > >  >  [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220
>  > >  >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
>  > >  >  [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80
>  > >  >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
>  > >  >  [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390
>  > >  >  [<ffffffff810a1980>] ? update_rmtp+0x70/0x70
>  > >  >  [<ffffffff810e99e4>] do_futex+0x394/0xa50
>  > >  >  [<ffffffff8119ec43>] ? might_fault+0x53/0xb0
>  > >  >  [<ffffffff810ea12d>] sys_futex+0x8d/0x190
>  > >  >  [<ffffffff816ca288>] tracesys+0xe1/0xe6
>  > >  > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 
>  > >  > RIP  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>  > >  >  RSP <ffff8800803f7b28>
>  > >  > CR2: 0000000000000018
>  > > 
>  > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow.
>  > > 
>  > > Darren, any idea how this could happen ?
>  > 
>  > I'm digging. Can you get trinity to provide the arguments it used that
>  > trigger the crash? That might help hone in on the exact path.
> 
> Still seeing this on rc6. It happens very quickly when I run with "-c futex"
> I just pushed out all the pending trinity changes I was running with, but
> I don't think they would have been responsible.

Thanks for the poke - I've been away on vacation, at conferences, and
tied up with other things. I'll do a build and test run today.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Technical Lead - Linux Kernel


* Re: pi futex oops in __lock_acquire
  2012-11-20 16:46             ` Dave Jones
  2012-11-20 17:27               ` Darren Hart
@ 2012-11-20 23:10               ` Darren Hart
  2012-11-21  0:30                 ` Darren Hart
  1 sibling, 1 reply; 15+ messages in thread
From: Darren Hart @ 2012-11-20 23:10 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel



On 11/20/2012 08:46 AM, Dave Jones wrote:
> On Wed, Oct 24, 2012 at 09:44:07PM -0700, Darren Hart wrote:
>  
>  > > I've been able to trigger this for the last week or so.
>  > > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the
>  > > pi-futex code hasn't changed since the last time it found something..
>  > > 
>  > >  > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>  > >  > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>  > >  > PGD 8e72c067 PUD 34f07067 PMD 0 
>  > >  > Oops: 0000 [#1] PREEMPT SMP 
>  > >  > CPU 7 
>  > >  > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43
>  > >  > RIP: 0010:[<ffffffff810e185e>]  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>  > >  > RSP: 0018:ffff8800803f7b28  EFLAGS: 00010046
>  > >  > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000
>  > >  > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
>  > >  > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000
>  > >  > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002
>  > >  > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018
>  > >  > FS:  00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000
>  > >  > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  > >  > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0
>  > >  > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  > >  > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>  > >  > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000)
>  > >  > Stack:
>  > >  >  ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0
>  > >  >  ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001
>  > >  >  ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282
>  > >  > Call Trace:
>  > >  >  [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0
>  > >  >  [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0
>  > >  >  [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210
>  > >  >  [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180
>  > >  >  [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0
>  > >  >  [<ffffffff810b7039>] ? local_clock+0x89/0xa0
>  > >  >  [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220
>  > >  >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
>  > >  >  [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80
>  > >  >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
>  > >  >  [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390
>  > >  >  [<ffffffff810a1980>] ? update_rmtp+0x70/0x70
>  > >  >  [<ffffffff810e99e4>] do_futex+0x394/0xa50
>  > >  >  [<ffffffff8119ec43>] ? might_fault+0x53/0xb0
>  > >  >  [<ffffffff810ea12d>] sys_futex+0x8d/0x190
>  > >  >  [<ffffffff816ca288>] tracesys+0xe1/0xe6
>  > >  > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 
>  > >  > RIP  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>  > >  >  RSP <ffff8800803f7b28>
>  > >  > CR2: 0000000000000018
>  > > 
>  > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow.
>  > > 
>  > > Darren, any idea how this could happen ?
>  > 
>  > I'm digging. Can you get trinity to provide the arguments it used that
>  > trigger the crash? That might help hone in on the exact path.
> 
> Still seeing this on rc6. It happens very quickly when I run with "-c futex"
> I just pushed out all the pending trinity changes I was running with, but
> I don't think they would have been responsible.
> 
> 	Dave
> 

OK, so some instrumentation yields:

[ 1320.762028] futex_lock_pi: timed lock
[ 1320.762488] futex_lock_pi: hb=ffffffff81e89e28,
hb->lock=ffffffff81e89e28, &q=ffff880181fa5cd8, q.lock_ptr=ffffffff81e89e28
[ 1320.763647] q.lock_ptr=ffffffff81e89e28
[ 1320.764132] fixup_owner: uaddr=00007f05465ac000, q=ffff880181fa5cd8,
locked=0
[ 1323.066371] futex_lock_pi: hb=ffffffff81e89e28,
hb->lock=ffffffff81e89e28, &q=ffff880181fc5cd8, q.lock_ptr=          (null)
[ 1323.069032] ------------[ cut here ]------------
[ 1323.069817] kernel BUG at kernel/futex.c:2052!

So somewhere between blocking on the lock and waking, q.lock_ptr is
getting set to NULL. The only legitimate place this happens in is
wake_futex, and I see some wake_futex calls after the futex_lock_pi
calls with the same uaddr in the trinity log. We are supposed to be
protected by the q lock_ptr here and wake_futex aborts any wakes of q's
with a pi_state or rt_waiter.... but .... there appears to be a window
in there. I say a window because I see a similar failure where the
instrumentation doesn't catch the lock_ptr as NULL and fails at a later
point on the same lock. Nice nasty parallel locking race. Digging in.
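
For reference, this is roughly the clearing path in wake_futex() I mean,
paraphrased from memory against the 3.7-era kernel/futex.c, so the details
may be slightly off:

static void wake_futex(struct futex_q *q)
{
        struct task_struct *p = q->task;

        get_task_struct(p);
        __unqueue_futex(q);
        /*
         * The woken waiter may free the futex_q as soon as it observes
         * q->lock_ptr == NULL; the barrier keeps the NULL store from
         * being reordered before the unqueue.
         */
        smp_wmb();
        q->lock_ptr = NULL;

        wake_up_state(p, TASK_NORMAL);
        put_task_struct(p);
}

If that runs against a PI waiter, futex_lock_pi()'s later
spin_lock(q.lock_ptr) dereferences NULL, which is exactly the oops above.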

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Technical Lead - Linux Kernel


* Re: pi futex oops in __lock_acquire
  2012-11-20 23:10               ` Darren Hart
@ 2012-11-21  0:30                 ` Darren Hart
  2012-11-21 17:37                   ` Darren Hart
  0 siblings, 1 reply; 15+ messages in thread
From: Darren Hart @ 2012-11-21  0:30 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, Thomas Gleixner, Ingo Molnar



On 11/20/2012 03:10 PM, Darren Hart wrote:
> 
> 
> On 11/20/2012 08:46 AM, Dave Jones wrote:
>> On Wed, Oct 24, 2012 at 09:44:07PM -0700, Darren Hart wrote:
>>  
>>  > > I've been able to trigger this for the last week or so.
>>  > > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the
>>  > > pi-futex code hasn't changed since the last time it found something..
>>  > > 
>>  > >  > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>>  > >  > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>>  > >  > PGD 8e72c067 PUD 34f07067 PMD 0 
>>  > >  > Oops: 0000 [#1] PREEMPT SMP 
>>  > >  > CPU 7 
>>  > >  > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43
>>  > >  > RIP: 0010:[<ffffffff810e185e>]  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>>  > >  > RSP: 0018:ffff8800803f7b28  EFLAGS: 00010046
>>  > >  > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000
>>  > >  > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
>>  > >  > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000
>>  > >  > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002
>>  > >  > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018
>>  > >  > FS:  00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000
>>  > >  > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  > >  > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0
>>  > >  > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>  > >  > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>  > >  > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000)
>>  > >  > Stack:
>>  > >  >  ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0
>>  > >  >  ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001
>>  > >  >  ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282
>>  > >  > Call Trace:
>>  > >  >  [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0
>>  > >  >  [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0
>>  > >  >  [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210
>>  > >  >  [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180
>>  > >  >  [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0
>>  > >  >  [<ffffffff810b7039>] ? local_clock+0x89/0xa0
>>  > >  >  [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220
>>  > >  >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
>>  > >  >  [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80
>>  > >  >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
>>  > >  >  [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390
>>  > >  >  [<ffffffff810a1980>] ? update_rmtp+0x70/0x70
>>  > >  >  [<ffffffff810e99e4>] do_futex+0x394/0xa50
>>  > >  >  [<ffffffff8119ec43>] ? might_fault+0x53/0xb0
>>  > >  >  [<ffffffff810ea12d>] sys_futex+0x8d/0x190
>>  > >  >  [<ffffffff816ca288>] tracesys+0xe1/0xe6
>>  > >  > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 
>>  > >  > RIP  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>>  > >  >  RSP <ffff8800803f7b28>
>>  > >  > CR2: 0000000000000018
>>  > > 
>>  > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow.
>>  > > 
>>  > > Darren, any idea how this could happen ?
>>  > 
>>  > I'm digging. Can you get trinity to provide the arguments it used that
>>  > trigger the crash? That might help hone in on the exact path.
>>
>> Still seeing this on rc6. It happens very quickly when I run with "-c futex"
>> I just pushed out all the pending trinity changes I was running with, but
>> I don't think they would have been responsible.
>>
>> 	Dave
>>
> 
> OK, so some instrumentation yields:
> 
> [ 1320.762028] futex_lock_pi: timed lock
> [ 1320.762488] futex_lock_pi: hb=ffffffff81e89e28,
> hb->lock=ffffffff81e89e28, &q=ffff880181fa5cd8, q.lock_ptr=ffffffff81e89e28
> [ 1320.763647] q.lock_ptr=ffffffff81e89e28
> [ 1320.764132] fixup_owner: uaddr=00007f05465ac000, q=ffff880181fa5cd8,
> locked=0
> [ 1323.066371] futex_lock_pi: hb=ffffffff81e89e28,
> hb->lock=ffffffff81e89e28, &q=ffff880181fc5cd8, q.lock_ptr=          (null)
> [ 1323.069032] ------------[ cut here ]------------
> [ 1323.069817] kernel BUG at kernel/futex.c:2052!
> 
> So somewhere between blocking on the lock and waking, q.lock_ptr is
> getting set to NULL. The only legitimate place this happens in is
> wake_futex, and I see some wake_futex calls after the futex_lock_pi
> calls with the same uaddr in the trinity log. We are supposed to be
> protected by the q lock_ptr here and wake_futex aborts any wakes of q's
> with a pi_state or rt_waiter.... but .... there appears to be a window
> in there. I say a window because I see a similar failure where the
> instrumentation doesn't catch the lock_ptr as NULL and fails at a later
> point on the same lock. Nice nasty parallel locking race. Digging in.
> 

OK, the problem is a futex_wake_op() calling wake_futex() without checking
for a pi_state or rt_waiters. I'm looking at the best way to fix it,
considering moving the check out of futex_wake and into wake_futex... some
more analysis needed for the best fix, but will have one by tomorrow.
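
Roughly the shape of the check I have in mind, purely as an illustration of
moving the guard into wake_futex() itself rather than the final patch:

static void wake_futex(struct futex_q *q)
{
        struct task_struct *p = q->task;

        /*
         * PI and requeue-PI waiters must never be woken through this path:
         * their wakeup belongs to the rt_mutex/requeue code, and clearing
         * q->lock_ptr under them is what leads to the NULL dereference in
         * futex_lock_pi() seen above.
         */
        if (WARN_ON(q->pi_state || q->rt_waiter))
                return;

        /* ... existing unqueue, lock_ptr = NULL, wake_up_state() path ... */
}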

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Technical Lead - Linux Kernel


* Re: pi futex oops in __lock_acquire
  2012-11-21  0:30                 ` Darren Hart
@ 2012-11-21 17:37                   ` Darren Hart
  2012-11-21 17:46                     ` Dave Jones
  0 siblings, 1 reply; 15+ messages in thread
From: Darren Hart @ 2012-11-21 17:37 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, Thomas Gleixner, Ingo Molnar



On 11/20/2012 04:30 PM, Darren Hart wrote:
> 
> 
> On 11/20/2012 03:10 PM, Darren Hart wrote:
>>
>>
>> On 11/20/2012 08:46 AM, Dave Jones wrote:
>>> On Wed, Oct 24, 2012 at 09:44:07PM -0700, Darren Hart wrote:
>>>  
>>>  > > I've been able to trigger this for the last week or so.
>>>  > > Unclear whether this is a new bug, or my fuzzer got smarter, but I see the
>>>  > > pi-futex code hasn't changed since the last time it found something..
>>>  > > 
>>>  > >  > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>>>  > >  > IP: [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>>>  > >  > PGD 8e72c067 PUD 34f07067 PMD 0 
>>>  > >  > Oops: 0000 [#1] PREEMPT SMP 
>>>  > >  > CPU 7 
>>>  > >  > Pid: 27513, comm: trinity-child0 Not tainted 3.7.0-rc2+ #43
>>>  > >  > RIP: 0010:[<ffffffff810e185e>]  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>>>  > >  > RSP: 0018:ffff8800803f7b28  EFLAGS: 00010046
>>>  > >  > RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000
>>>  > >  > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
>>>  > >  > RBP: ffff8800803f7c18 R08: 0000000000000002 R09: 0000000000000000
>>>  > >  > R10: 0000000000000000 R11: 2222222222222222 R12: 0000000000000002
>>>  > >  > R13: ffff880051dd8000 R14: 0000000000000002 R15: 0000000000000018
>>>  > >  > FS:  00007f9fc6ccb740(0000) GS:ffff880148a00000(0000) knlGS:0000000000000000
>>>  > >  > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>  > >  > CR2: 0000000000000018 CR3: 000000008e6fb000 CR4: 00000000001407e0
>>>  > >  > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>  > >  > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>  > >  > Process trinity-child0 (pid: 27513, threadinfo ffff8800803f6000, task ffff880051dd8000)
>>>  > >  > Stack:
>>>  > >  >  ffff8800803f7b48 ffffffff816c5c59 ffff8800803f7b48 ffff88014840ebc0
>>>  > >  >  ffff8800803f7b68 ffffffff816c18e3 ffff8800803f7d10 0000000000000001
>>>  > >  >  ffff8800803f7ba8 ffffffff810a1e62 ffff8800803f7d10 0000000000000282
>>>  > >  > Call Trace:
>>>  > >  >  [<ffffffff816c5c59>] ? sub_preempt_count+0x79/0xd0
>>>  > >  >  [<ffffffff816c18e3>] ? _raw_spin_unlock_irqrestore+0x73/0xa0
>>>  > >  >  [<ffffffff810a1e62>] ? hrtimer_try_to_cancel+0x52/0x210
>>>  > >  >  [<ffffffff810eb9e5>] ? debug_rt_mutex_free_waiter+0x15/0x180
>>>  > >  >  [<ffffffff816c0107>] ? rt_mutex_slowlock+0x127/0x1b0
>>>  > >  >  [<ffffffff810b7039>] ? local_clock+0x89/0xa0
>>>  > >  >  [<ffffffff810e3ac2>] lock_acquire+0xa2/0x220
>>>  > >  >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
>>>  > >  >  [<ffffffff816c09e0>] _raw_spin_lock+0x40/0x80
>>>  > >  >  [<ffffffff810e812c>] ? futex_lock_pi.isra.18+0x1cc/0x390
>>>  > >  >  [<ffffffff810e812c>] futex_lock_pi.isra.18+0x1cc/0x390
>>>  > >  >  [<ffffffff810a1980>] ? update_rmtp+0x70/0x70
>>>  > >  >  [<ffffffff810e99e4>] do_futex+0x394/0xa50
>>>  > >  >  [<ffffffff8119ec43>] ? might_fault+0x53/0xb0
>>>  > >  >  [<ffffffff810ea12d>] sys_futex+0x8d/0x190
>>>  > >  >  [<ffffffff816ca288>] tracesys+0xe1/0xe6
>>>  > >  > Code: d8 45 0f 45 e0 4c 89 75 f0 4c 89 7d f8 85 c0 0f 84 f8 00 00 00 8b 05 22 fe f3 00 49 89 ff 89 f3 41 89 d2 85 c0 0f 84 02 01 00 00 <49> 8b 07 ba 01 00 00 00 48 3d c0 81 06 82 44 0f 44 e2 83 fb 01 
>>>  > >  > RIP  [<ffffffff810e185e>] __lock_acquire+0x5e/0x1ba0
>>>  > >  >  RSP <ffff8800803f7b28>
>>>  > >  > CR2: 0000000000000018
>>>  > > 
>>>  > > It looks like we got all the way to lock_acquire with a NULL 'lock' somehow.
>>>  > > 
>>>  > > Darren, any idea how this could happen ?
>>>  > 
>>>  > I'm digging. Can you get trinity to provide the arguments it used that
>>>  > trigger the crash? That might help hone in on the exact path.
>>>
>>> Still seeing this on rc6. It happens very quickly when I run with "-c futex"
>>> I just pushed out all the pending trinity changes I was running with, but
>>> I don't think they would have been responsible.
>>>
>>> 	Dave
>>>
>>
>> OK, so some instrumentation yields:
>>
>> [ 1320.762028] futex_lock_pi: timed lock
>> [ 1320.762488] futex_lock_pi: hb=ffffffff81e89e28,
>> hb->lock=ffffffff81e89e28, &q=ffff880181fa5cd8, q.lock_ptr=ffffffff81e89e28
>> [ 1320.763647] q.lock_ptr=ffffffff81e89e28
>> [ 1320.764132] fixup_owner: uaddr=00007f05465ac000, q=ffff880181fa5cd8,
>> locked=0
>> [ 1323.066371] futex_lock_pi: hb=ffffffff81e89e28,
>> hb->lock=ffffffff81e89e28, &q=ffff880181fc5cd8, q.lock_ptr=          (null)
>> [ 1323.069032] ------------[ cut here ]------------
>> [ 1323.069817] kernel BUG at kernel/futex.c:2052!
>>
>> So somewhere between blocking on the lock and waking, q.lock_ptr is
>> getting set to NULL. The only legitimate place this happens in is
>> wake_futex, and I see some wake_futex calls after the futex_lock_pi
>> calls with the same uaddr in the trinity log. We are supposed to be
>> protected by the q lock_ptr here and wake_futex aborts any wakes of q's
>> with a pi_state or rt_waiter.... but .... there appears to be a window
>> in there. I say a window because I see a similar failure where the
>> instrumentation doesn't catch the lock_ptr as NULL and fails at a later
>> point on the same lock. Nice nasty parallel locking race. Digging in.
>>
> 
> OK, the problem is a futex_wake_op() calling wake_futex() without checking
> for a pi_state or rt_waiters. I'm looking at the best way to fix it,
> considering moving the check out of futex_wake and into wake_futex... some
> more analysis needed for the best fix, but will have one by tomorrow.
> 

I've sent a patch to address this. It has now been running "trinity -c
futex" for about 12 hours. I haven't seen any more futex failures, but I
have seen a few OOMs. Is that expected from trinity?

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Technical Lead - Linux Kernel


* Re: pi futex oops in __lock_acquire
  2012-11-21 17:37                   ` Darren Hart
@ 2012-11-21 17:46                     ` Dave Jones
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Jones @ 2012-11-21 17:46 UTC (permalink / raw)
  To: Darren Hart; +Cc: Linux Kernel, Thomas Gleixner, Ingo Molnar

On Wed, Nov 21, 2012 at 09:37:40AM -0800, Darren Hart wrote:
 > I've sent a patch to address this. It has now been running "trinity -c
 > futex" for about 12 hours. I haven't seen any more futex failures

Looks good here too.

 > but I
 > have seen a few OOMs. Is that expected from trinity?

Yeah, it seems there are some leaks (both in trinity and the kernel).
Oddly enough, someone else emailed me about that this morning, which
prompted me to write this up: http://codemonkey.org.uk/2012/11/21/fuzz-testing-oom-killer/
That's just part of the problem; there are other issues too, it seems. Fun.

	Dave



Thread overview: 15 messages
2012-10-18  1:53 MAX_LOCKDEP_ENTRIES too low (called from ioc_release_fn) Dave Jones
2012-10-18  5:53 ` Jens Axboe
2012-10-19  5:21   ` Dave Jones
2012-10-19 12:49     ` Peter Zijlstra
2012-10-19 19:29       ` Dave Jones
2012-10-23 19:50       ` MAX_LOCKDEP_ENTRIES too low Dave Jones
2012-10-24 20:24         ` pi futex oops in __lock_acquire Dave Jones
2012-10-25  4:44           ` Darren Hart
2012-10-25 11:09             ` Dave Jones
2012-11-20 16:46             ` Dave Jones
2012-11-20 17:27               ` Darren Hart
2012-11-20 23:10               ` Darren Hart
2012-11-21  0:30                 ` Darren Hart
2012-11-21 17:37                   ` Darren Hart
2012-11-21 17:46                     ` Dave Jones
