* VFS deadlock ?
@ 2013-03-21 19:06 Dave Jones
  2013-03-21 19:21 ` Al Viro
  2013-03-21 19:29 ` Al Viro
  0 siblings, 2 replies; 45+ messages in thread
From: Dave Jones @ 2013-03-21 19:06 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel, Linus Torvalds

Al, Linus,
 I've seen this happen a few times recently.. I have a bunch of trinity processes
that run for a few hours, and then eventually all those processes end up stuck
waiting on each other.

The stack traces always seem to look like this..

Does this look like a real kernel deadlock, or is it more likely that trinity was
unlucky enough to do something really dumb that I need to protect it against ?

I've only seen it happening in the last week or so (but that may be because all
the crap I ended up finding last week kept me from running long enough
to hit this until now..)

Tree is Linus' latest with Peter Hurley's tty patchset on top (which afair
doesn't touch anything in these paths).

lockdep hasn't picked up anything, which makes me wonder if it's my problem..

	Dave


trinity-child2  D ffff880110b3a7e0  5448  7669      1 0x00000004
 ffff8801272a7d20 0000000000000086 ffff880101874920 ffff8801272a7fd8
 ffff8801272a7fd8 ffff8801272a7fd8 ffff880128ca8000 ffff880101874920
 ffff8801039a0250 0000000000000246 ffff880101874920 ffff8801272a7fd8
Call Trace:
 [<ffffffff816c2fd3>] schedule_preempt_disabled+0x33/0x80
 [<ffffffff816bfcc1>] mutex_lock_nested+0x1a1/0x3d0
 [<ffffffff811c7034>] ? lock_rename+0x114/0x120
 [<ffffffff811c7034>] ? lock_rename+0x114/0x120
 [<ffffffff811c7034>] lock_rename+0x114/0x120
 [<ffffffff811cd7a7>] sys_renameat+0x1f7/0x3b0
 [<ffffffff810b33a2>] ? get_lock_stats+0x22/0x70
 [<ffffffff810b6b95>] ? trace_hardirqs_on_caller+0x115/0x1a0
 [<ffffffff810b6b95>] ? trace_hardirqs_on_caller+0x115/0x1a0
 [<ffffffff8134a24e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff816cd102>] system_call_fastpath+0x16/0x1b

trinity-child1  D ffff880110b3a7e0  5200  7901      1 0x00000004
 ffff8801015ddc50 0000000000000086 ffff880123fca490 ffff8801015ddfd8
 ffff8801015ddfd8 ffff8801015ddfd8 ffff880128cac920 ffff880123fca490
 ffff8801015dde60 0000000000000246 ffff880123fca490 ffff8801015ddfd8
Call Trace:
 [<ffffffff816c2fd3>] schedule_preempt_disabled+0x33/0x80
 [<ffffffff816bfcc1>] mutex_lock_nested+0x1a1/0x3d0
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
 [<ffffffff811c805a>] ? lookup_fast+0x16a/0x2d0
 [<ffffffff811c9d7c>] path_lookupat+0x79c/0x7c0
 [<ffffffff8119fa87>] ? kmem_cache_alloc+0xf7/0x340
 [<ffffffff8100a144>] ? native_sched_clock+0x24/0x80
 [<ffffffff811c826a>] ? getname_flags+0x5a/0x1a0
 [<ffffffff811c9dd4>] filename_lookup+0x34/0xc0
 [<ffffffff811ccd23>] user_path_at_empty+0x63/0xa0
 [<ffffffff8100a144>] ? native_sched_clock+0x24/0x80
 [<ffffffff810b3288>] ? trace_hardirqs_off_caller+0x28/0xc0
 [<ffffffff816cd127>] ? sysret_check+0x1b/0x56
 [<ffffffff811ccd71>] user_path_at+0x11/0x20
 [<ffffffff811bb538>] sys_fchmodat+0x38/0xa0
 [<ffffffff8134a24e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff811bb5b9>] sys_chmod+0x19/0x20
 [<ffffffff816cd102>] system_call_fastpath+0x16/0x1b

trinity-child1  D ffff880110b3a7e0  5528  8651  29705 0x00000004
 ffff8801187cfc60 0000000000000082 ffff880126a28000 ffff8801187cffd8
 ffff8801187cffd8 ffff8801187cffd8 ffffffff81c10440 ffff880126a28000
 ffff8801187cfe70 0000000000000246 ffff880126a28000 ffff8801187cffd8
Call Trace:
 [<ffffffff816c2fd3>] schedule_preempt_disabled+0x33/0x80
 [<ffffffff816bfcc1>] mutex_lock_nested+0x1a1/0x3d0
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
 [<ffffffff811c805a>] ? lookup_fast+0x16a/0x2d0
 [<ffffffff811c9d7c>] path_lookupat+0x79c/0x7c0
 [<ffffffff8119fa87>] ? kmem_cache_alloc+0xf7/0x340
 [<ffffffff810b3288>] ? trace_hardirqs_off_caller+0x28/0xc0
 [<ffffffff811c826a>] ? getname_flags+0x5a/0x1a0
 [<ffffffff811c9dd4>] filename_lookup+0x34/0xc0
 [<ffffffff811ccd23>] user_path_at_empty+0x63/0xa0
 [<ffffffff810b3288>] ? trace_hardirqs_off_caller+0x28/0xc0
 [<ffffffff816cd127>] ? sysret_check+0x1b/0x56
 [<ffffffff811ccd71>] user_path_at+0x11/0x20
 [<ffffffff811e4904>] sys_removexattr+0x34/0xb0
 [<ffffffff816cd102>] system_call_fastpath+0x16/0x1b

trinity-child3  D ffff880110b3a7e0  5312  9067      1 0x00000004
 ffff880100d29c60 0000000000000082 ffff8801271a0000 ffff880100d29fd8
 ffff880100d29fd8 ffff880100d29fd8 ffff880128cac920 ffff8801271a0000
 ffff880100d29e70 0000000000000246 ffff8801271a0000 ffff880100d29fd8
Call Trace:
 [<ffffffff816c2fd3>] schedule_preempt_disabled+0x33/0x80
 [<ffffffff816bfcc1>] mutex_lock_nested+0x1a1/0x3d0
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
 [<ffffffff811c805a>] ? lookup_fast+0x16a/0x2d0
 [<ffffffff811c9d7c>] path_lookupat+0x79c/0x7c0
 [<ffffffff8119fa87>] ? kmem_cache_alloc+0xf7/0x340
 [<ffffffff811c826a>] ? getname_flags+0x5a/0x1a0
 [<ffffffff811c9dd4>] filename_lookup+0x34/0xc0
 [<ffffffff811ccd23>] user_path_at_empty+0x63/0xa0
 [<ffffffff810b3ace>] ? put_lock_stats.isra.27+0xe/0x40
 [<ffffffff810b6c2d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff816c4787>] ? _raw_spin_unlock_irq+0x37/0x60
 [<ffffffff816cd127>] ? sysret_check+0x1b/0x56
 [<ffffffff811ccd71>] user_path_at+0x11/0x20
 [<ffffffff811bb538>] sys_fchmodat+0x38/0xa0
 [<ffffffff816cd102>] system_call_fastpath+0x16/0x1b

trinity-child0  D ffff880110b3a7e0  5320  9112      1 0x00000004
 ffff8801014fbc50 0000000000000086 ffff880123a94920 ffff8801014fbfd8
 ffff8801014fbfd8 ffff8801014fbfd8 ffffffff81c10440 ffff880123a94920
 ffff8801014fbe60 0000000000000246 ffff880123a94920 ffff8801014fbfd8
Call Trace:
 [<ffffffff816c2fd3>] schedule_preempt_disabled+0x33/0x80
 [<ffffffff816bfcc1>] mutex_lock_nested+0x1a1/0x3d0
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
 [<ffffffff811c805a>] ? lookup_fast+0x16a/0x2d0
 [<ffffffff811c9d7c>] path_lookupat+0x79c/0x7c0
 [<ffffffff8119fa87>] ? kmem_cache_alloc+0xf7/0x340
 [<ffffffff8100a144>] ? native_sched_clock+0x24/0x80
 [<ffffffff811c826a>] ? getname_flags+0x5a/0x1a0
 [<ffffffff811c9dd4>] filename_lookup+0x34/0xc0
 [<ffffffff811ccd23>] user_path_at_empty+0x63/0xa0
 [<ffffffff8100a144>] ? native_sched_clock+0x24/0x80
 [<ffffffff810b3288>] ? trace_hardirqs_off_caller+0x28/0xc0
 [<ffffffff816cd127>] ? sysret_check+0x1b/0x56
 [<ffffffff811ccd71>] user_path_at+0x11/0x20
 [<ffffffff811babb5>] do_sys_truncate+0x35/0x90
 [<ffffffff811bad7e>] sys_truncate+0xe/0x10
 [<ffffffff816cd102>] system_call_fastpath+0x16/0x1b

trinity-child3  D ffff880110b3a7e0  5304  9414  29705 0x00000004
 ffff880127b23c40 0000000000000082 ffff880128d38000 ffff880127b23fd8
 ffff880127b23fd8 ffff880127b23fd8 ffffffff81c10440 ffff880128d38000
 ffff880127b23e50 0000000000000246 ffff880128d38000 ffff880127b23fd8
Call Trace:
 [<ffffffff816c2fd3>] schedule_preempt_disabled+0x33/0x80
 [<ffffffff816bfcc1>] mutex_lock_nested+0x1a1/0x3d0
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
 [<ffffffff811c805a>] ? lookup_fast+0x16a/0x2d0
 [<ffffffff811c9d7c>] path_lookupat+0x79c/0x7c0
 [<ffffffff8119fa87>] ? kmem_cache_alloc+0xf7/0x340
 [<ffffffff811c826a>] ? getname_flags+0x5a/0x1a0
 [<ffffffff811c9dd4>] filename_lookup+0x34/0xc0
 [<ffffffff811ccd23>] user_path_at_empty+0x63/0xa0
 [<ffffffff8119fa87>] ? kmem_cache_alloc+0xf7/0x340
 [<ffffffff81078c32>] ? creds_are_invalid.part.8+0x12/0x50
 [<ffffffff81078f6a>] ? prepare_creds+0x3a/0x1f0
 [<ffffffff811ccd71>] user_path_at+0x11/0x20
 [<ffffffff811bb093>] sys_faccessat+0xa3/0x230
 [<ffffffff816cd102>] system_call_fastpath+0x16/0x1b

trinity-child2  D ffff880110b3a7e0  5704 10021  29705 0x00000004
 ffff8801228a7be0 0000000000000086 ffff880128c18000 ffff8801228a7fd8
 ffff8801228a7fd8 ffff8801228a7fd8 ffffffff81c10440 ffff880128c18000
 ffff8801228a7df0 0000000000000246 ffff880128c18000 ffff8801228a7fd8
Call Trace:
 [<ffffffff816c2fd3>] schedule_preempt_disabled+0x33/0x80
 [<ffffffff816bfcc1>] mutex_lock_nested+0x1a1/0x3d0
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
 [<ffffffff811c805a>] ? lookup_fast+0x16a/0x2d0
 [<ffffffff8100a144>] ? native_sched_clock+0x24/0x80
 [<ffffffff811c9d7c>] path_lookupat+0x79c/0x7c0
 [<ffffffff8119fa87>] ? kmem_cache_alloc+0xf7/0x340
 [<ffffffff811c826a>] ? getname_flags+0x5a/0x1a0
 [<ffffffff811c9dd4>] filename_lookup+0x34/0xc0
 [<ffffffff811ccd23>] user_path_at_empty+0x63/0xa0
 [<ffffffff8116fbdc>] ? might_fault+0x9c/0xb0
 [<ffffffff811ccd71>] user_path_at+0x11/0x20
 [<ffffffff811c1379>] vfs_fstatat+0x49/0x90
 [<ffffffff811c1917>] sys_newlstat+0x27/0x40
 [<ffffffff810b6b95>] ? trace_hardirqs_on_caller+0x115/0x1a0
 [<ffffffff8134a24e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff816cd102>] system_call_fastpath+0x16/0x1b

trinity-child0  D ffff880110b3a7e0  5624 10222  29705 0x00000004
 ffff8801005ebc50 0000000000000086 ffff880101872490 ffff8801005ebfd8
 ffff8801005ebfd8 ffff8801005ebfd8 ffffffff81c10440 ffff880101872490
 ffff8801005ebe60 0000000000000246 ffff880101872490 ffff8801005ebfd8
Call Trace:
 [<ffffffff816c2fd3>] schedule_preempt_disabled+0x33/0x80
 [<ffffffff816bfcc1>] mutex_lock_nested+0x1a1/0x3d0
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] ? lookup_slow+0x13f/0x1b8
 [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
 [<ffffffff811c805a>] ? lookup_fast+0x16a/0x2d0
 [<ffffffff811c9d7c>] path_lookupat+0x79c/0x7c0
 [<ffffffff8119fa87>] ? kmem_cache_alloc+0xf7/0x340
 [<ffffffff811c826a>] ? getname_flags+0x5a/0x1a0
 [<ffffffff811c9dd4>] filename_lookup+0x34/0xc0
 [<ffffffff811ccd23>] user_path_at_empty+0x63/0xa0
 [<ffffffff810b33a2>] ? get_lock_stats+0x22/0x70
 [<ffffffff810b6c2d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff816c4787>] ? _raw_spin_unlock_irq+0x37/0x60
 [<ffffffff816cd127>] ? sysret_check+0x1b/0x56
 [<ffffffff811ccd71>] user_path_at+0x11/0x20
 [<ffffffff811bb626>] sys_fchownat+0x66/0x110
 [<ffffffff816cd102>] system_call_fastpath+0x16/0x1b


runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------


Showing all locks held in the system:
4 locks on stack by trinity-child2/7669:
 #0: blocked:  (sb_writers#4){.+.+.+}, instance: ffff8801292d17d8, at: [<ffffffff811df134>] mnt_want_write+0x24/0x50
 #1: held:     (&type->s_vfs_rename_key){+.+.+.}, instance: ffff8801292d1928, at: [<ffffffff811c6f5e>] lock_rename+0x3e/0x120
 #2: held:     (&type->i_mutex_dir_key#2/1){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff811c701e>] lock_rename+0xfe/0x120
 #3: blocked:  (&type->i_mutex_dir_key#2/2){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff811c7034>] lock_rename+0x114/0x120
1 lock on stack by trinity-child1/7901:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
1 lock on stack by trinity-child1/8651:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
1 lock on stack by trinity-child3/9067:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
1 lock on stack by trinity-child0/9112:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
1 lock on stack by trinity-child3/9414:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
1 lock on stack by trinity-child2/10021:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
1 lock on stack by trinity-child0/10222:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8




* Re: VFS deadlock ?
  2013-03-21 19:06 VFS deadlock ? Dave Jones
@ 2013-03-21 19:21 ` Al Viro
  2013-03-21 20:31   ` Dave Jones
  2013-03-21 19:29 ` Al Viro
  1 sibling, 1 reply; 45+ messages in thread
From: Al Viro @ 2013-03-21 19:21 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, Linus Torvalds

On Thu, Mar 21, 2013 at 03:06:53PM -0400, Dave Jones wrote:
> Al, Linus,

> trinity-child2  D ffff880110b3a7e0  5448  7669      1 0x00000004
[sits in rename(), trying to grab ->i_mutex on a parent]

> Showing all locks held in the system:
> 4 locks on stack by trinity-child2/7669:
>  #0: blocked:  (sb_writers#4){.+.+.+}, instance: ffff8801292d17d8, at: [<ffffffff811df134>] mnt_want_write+0x24/0x50
>  #1: held:     (&type->s_vfs_rename_key){+.+.+.}, instance: ffff8801292d1928, at: [<ffffffff811c6f5e>] lock_rename+0x3e/0x120
>  #2: held:     (&type->i_mutex_dir_key#2/1){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff811c701e>] lock_rename+0xfe/0x120
>  #3: blocked:  (&type->i_mutex_dir_key#2/2){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff811c7034>] lock_rename+0x114/0x120

Wait a minute...  How the hell does it manage to sit with *two* blocked locks?
Confused...


* Re: VFS deadlock ?
  2013-03-21 19:06 VFS deadlock ? Dave Jones
  2013-03-21 19:21 ` Al Viro
@ 2013-03-21 19:29 ` Al Viro
  2013-03-21 20:15   ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Al Viro @ 2013-03-21 19:29 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, Linus Torvalds

On Thu, Mar 21, 2013 at 03:06:53PM -0400, Dave Jones wrote:

> Showing all locks held in the system:
> 4 locks on stack by trinity-child2/7669:
>  #0: blocked:  (sb_writers#4){.+.+.+}, instance: ffff8801292d17d8, at: [<ffffffff811df134>] mnt_want_write+0x24/0x50
>  #1: held:     (&type->s_vfs_rename_key){+.+.+.}, instance: ffff8801292d1928, at: [<ffffffff811c6f5e>] lock_rename+0x3e/0x120
>  #2: held:     (&type->i_mutex_dir_key#2/1){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff811c701e>] lock_rename+0xfe/0x120
>  #3: blocked:  (&type->i_mutex_dir_key#2/2){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff811c7034>] lock_rename+0x114/0x120

#0 oddity aside, that looks very much like a directory aliased by two different
dentries.  Try to add
	BUG_ON(p1->d_inode == p2->d_inode);
just before
        mutex_lock(&p1->d_inode->i_sb->s_vfs_rename_mutex);
and see if it triggers.
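
For reference, roughly where that would sit in the lock_rename() of this
era (a sketch, abridged - the d_ancestor() ordering and the two i_mutex
acquisitions follow in the elided part):

	struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
	{
		if (p1 == p2) {
			mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
			return NULL;
		}

		/* two distinct dentries must never share a directory inode */
		BUG_ON(p1->d_inode == p2->d_inode);

		mutex_lock(&p1->d_inode->i_sb->s_vfs_rename_mutex);
		...
	}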


* Re: VFS deadlock ?
  2013-03-21 19:29 ` Al Viro
@ 2013-03-21 20:15   ` Linus Torvalds
  2013-03-21 20:26     ` Dave Jones
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2013-03-21 20:15 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linux Kernel

On Thu, Mar 21, 2013 at 12:29 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> #0 oddity aside, that looks very much like a directory aliased by two different
> dentries.  Try to add
>         BUG_ON(p1->d_inode == p2->d_inode);
> just before
>         mutex_lock(&p1->d_inode->i_sb->s_vfs_rename_mutex);
> and see if it triggers.

Don't do a BUG_ON(), instead do something like

   if (WARN_ON_ONCE(p1->d_inode == p2->d_inode)) {
        printk("pi=%s p2=%s\n", pi->d_name, p2->d_name);
        mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
        return NULL;
   }

so that we actually see where it is. I'm assuming it's some sysfs oddity again..

            Linus
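
As the follow-ups note, that snippet has two slips: "pi" in the printk
should be p1, and d_name is a qstr, so the strings want to be
d_name.name.  A corrected sketch of the same idea:

   if (WARN_ON_ONCE(p1->d_inode == p2->d_inode)) {
        printk("p1=%s p2=%s\n", p1->d_name.name, p2->d_name.name);
        mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
        return NULL;
   }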


* Re: VFS deadlock ?
  2013-03-21 20:15   ` Linus Torvalds
@ 2013-03-21 20:26     ` Dave Jones
  2013-03-21 20:32       ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Jones @ 2013-03-21 20:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Linux Kernel

On Thu, Mar 21, 2013 at 01:15:08PM -0700, Linus Torvalds wrote:
 > On Thu, Mar 21, 2013 at 12:29 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
 > >
 > > #0 oddity aside, that looks very much like a directory aliased by two different
 > > dentries.  Try to add
 > >         BUG_ON(p1->d_inode == p2->d_inode);
 > > just before
 > >         mutex_lock(&p1->d_inode->i_sb->s_vfs_rename_mutex);
 > > and see if it triggers.
 > 
 > Don't do a BUG_ON(), instead do something like
 > 
 >    if (WARN_ON_ONCE(p1->d_inode == p2->d_inode)) {
 >         printk("pi=%s p2=%s\n", pi->d_name, p2->d_name);
 >         mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
 >         return NULL;
 >    }

those are qstr's, so I used d_name.name, right ?

 > so that we actually see where it is. I'm assuming it's some sysfs oddity again..

I'd be surprised actually; I've got sysfs excluded from trinity's list of victim files,
due to unrelated issues still unresolved.  So unless it followed a symlink into
sys from somewhere in /proc or /dev...

It took a few hours to reproduce last time, I'll increase the number of child
processes to see if I can trigger it faster now that I have the debug stuff in there.

	Dave


* Re: VFS deadlock ?
  2013-03-21 19:21 ` Al Viro
@ 2013-03-21 20:31   ` Dave Jones
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Jones @ 2013-03-21 20:31 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel, Linus Torvalds

On Thu, Mar 21, 2013 at 07:21:46PM +0000, Al Viro wrote:
 > On Thu, Mar 21, 2013 at 03:06:53PM -0400, Dave Jones wrote:
 > > Al, Linus,
 > 
 > > trinity-child2  D ffff880110b3a7e0  5448  7669      1 0x00000004
 > [sits in rename(), trying to grab ->i_mutex on a parent]
 > 
 > > Showing all locks held in the system:
 > > 4 locks on stack by trinity-child2/7669:
 > >  #0: blocked:  (sb_writers#4){.+.+.+}, instance: ffff8801292d17d8, at: [<ffffffff811df134>] mnt_want_write+0x24/0x50
 > >  #1: held:     (&type->s_vfs_rename_key){+.+.+.}, instance: ffff8801292d1928, at: [<ffffffff811c6f5e>] lock_rename+0x3e/0x120
 > >  #2: held:     (&type->i_mutex_dir_key#2/1){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff811c701e>] lock_rename+0xfe/0x120
 > >  #3: blocked:  (&type->i_mutex_dir_key#2/2){+.+.+.}, instance: ffff880110b3a858, at: [<ffffffff811c7034>] lock_rename+0x114/0x120
 > 
 > Wait a minute...  How the hell does it manage to sit with *two* blocked locks?
 > Confused...

Same thing happened when I ran into this yesterday..
(a bit more confusing, as there were more parallel runs going on, but similar patterns..)


Showing all locks held in the system:
1 lock on stack by trinity-main/784:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
2 locks on stack by trinity-child3/6674:
 #0: blocked:  (sb_writers#5){.+.+.+}, instance: ffff8801292d17d8, at: [<ffffffff811df134>] mnt_want_write+0x24/0x50
 #1: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff811ba1d9>] chmod_common+0x49/0xa0
1 lock on stack by trinity-child3/9678:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
1 lock on stack by trinity-child0/9845:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
2 locks on stack by trinity-child1/10147:
 #0: blocked:  (sb_writers#5){.+.+.+}, instance: ffff8801292d17d8, at: [<ffffffff811df134>] mnt_want_write+0x24/0x50
 #1: blocked:  (&type->i_mutex_dir_key#2/1){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff811c9ffd>] kern_path_create+0xad/0x190
1 lock on stack by trinity-child2/10793:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff811cb5da>] do_last+0x33a/0xf20
1 lock on stack by trinity-child1/11549:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
4 locks on stack by trinity-child0/11592:
 #0: blocked:  (sb_writers#5){.+.+.+}, instance: ffff8801292d17d8, at: [<ffffffff811df134>] mnt_want_write+0x24/0x50
 #1: held:     (&type->s_vfs_rename_key){+.+.+.}, instance: ffff8801292d1928, at: [<ffffffff811c6f5e>] lock_rename+0x3e/0x120
 #2: held:     (&type->i_mutex_dir_key#2/1){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff811c701e>] lock_rename+0xfe/0x120
 #3: blocked:  (&type->i_mutex_dir_key#2/2){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff811c7034>] lock_rename+0x114/0x120
1 lock on stack by trinity-child0/11645:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
2 locks on stack by trinity-child3/11740:
 #0: blocked:  (sb_writers#5){.+.+.+}, instance: ffff8801292d17d8, at: [<ffffffff811df134>] mnt_want_write+0x24/0x50
 #1: blocked:  (&type->i_mutex_dir_key#2/1){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff811c9ffd>] kern_path_create+0xad/0x190
1 lock on stack by trinity-child2/11953:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
1 lock on stack by trinity-child2/12018:
 #0: blocked:  (&type->i_mutex_dir_key#2){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff816bb512>] lookup_slow+0x13f/0x1b8
2 locks on stack by trinity-child1/12312:
 #0: blocked:  (sb_writers#5){.+.+.+}, instance: ffff8801292d17d8, at: [<ffffffff811df134>] mnt_want_write+0x24/0x50
 #1: blocked:  (&type->i_mutex_dir_key#2/1){+.+.+.}, instance: ffff88010c1717c0, at: [<ffffffff811caaa2>] do_unlinkat+0x112/0x230




* Re: VFS deadlock ?
  2013-03-21 20:26     ` Dave Jones
@ 2013-03-21 20:32       ` Linus Torvalds
  2013-03-21 20:36         ` Dave Jones
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2013-03-21 20:32 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Al Viro, Linux Kernel

On Thu, Mar 21, 2013 at 1:26 PM, Dave Jones <davej@redhat.com> wrote:
>
> those are qstr's, so I used d_name.name, right ?

Yup. And if you want to, you could do p1->d_parent->d_name.name too,
just to make things obvious. It's technically racy, but by the time
the bug happens, who cares?

> I'd be surprised actually; I've got sysfs excluded from trinity's list of victim files,
> due to unrelated issues still unresolved.  So unless it followed a symlink into
> sys from somewhere in /proc or /dev...
>
> It took a few hours to reproduce last time, I'll increase the number of child
> processes to see if I can trigger it faster now that I have the debug stuff in there.

Hmm, ok. Do you have any network mounts or fuse or other "odd"
filesystems etc? The whole "aliased inodes" thing might come from
something like that.

        Linus


* Re: VFS deadlock ?
  2013-03-21 20:32       ` Linus Torvalds
@ 2013-03-21 20:36         ` Dave Jones
  2013-03-21 20:47           ` Al Viro
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Jones @ 2013-03-21 20:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Linux Kernel

On Thu, Mar 21, 2013 at 01:32:45PM -0700, Linus Torvalds wrote:
 > On Thu, Mar 21, 2013 at 1:26 PM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > those are qstr's, so I used d_name.name, right ?
 > 
 > Yup. And if you want to, you could do p1->d_parent->d_name.name too,
 > just to make things obvious. It's technically racy, but by the time
 > the bug happens, who cares?

I'll add that next time around if the current run doesn't turn up anything interesting.
 
 > > I'd be surprised actually; I've got sysfs excluded from trinity's list of victim files,
 > > due to unrelated issues still unresolved.  So unless it followed a symlink into
 > > sys from somewhere in /proc or /dev...
 > >
 > > It took a few hours to reproduce last time, I'll increase the number of child
 > > processes to see if I can trigger it faster now that I have the debug stuff in there.
 > 
 > Hmm, ok. Do you have any network mounts or fuse or other "odd"
 > filesystems etc? The whole "aliased inodes" thing might come from
 > something like that.

at some point during the fuzz run, this happened..

Mar 20 15:20:41 kernel: [ 7578.784674] fuse init (API version 7.21)
Mar 20 15:20:41 systemd[1]: Mounting FUSE Control File System...
Mar 20 15:20:41 systemd[1]: Mounted FUSE Control File System.

I guess something wandered into /dev/fuse and did something. Not sure why 
systemd reacted though...

	Dave



* Re: VFS deadlock ?
  2013-03-21 20:36         ` Dave Jones
@ 2013-03-21 20:47           ` Al Viro
  2013-03-21 21:02             ` Dave Jones
  0 siblings, 1 reply; 45+ messages in thread
From: Al Viro @ 2013-03-21 20:47 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel

On Thu, Mar 21, 2013 at 04:36:39PM -0400, Dave Jones wrote:
> at some point during the fuzz run, this happened..
> 
> Mar 20 15:20:41 kernel: [ 7578.784674] fuse init (API version 7.21)
> Mar 20 15:20:41 systemd[1]: Mounting FUSE Control File System...
> Mar 20 15:20:41 systemd[1]: Mounted FUSE Control File System.
> 
> I guess something wandered into /dev/fuse and did something. Not sure why 
> systemd reacted though...

AFAICS, fuse uses d_materialise_unique() and d_splice_alias(), though, so
it's not too likely a source of that crap; there are no d_instantiate() calls
at all and the sole d_add() is using an inode that has just been allocated,
so it won't be creating aliases either...


* Re: VFS deadlock ?
  2013-03-21 20:47           ` Al Viro
@ 2013-03-21 21:02             ` Dave Jones
  2013-03-21 21:18               ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Jones @ 2013-03-21 21:02 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Linux Kernel

On Thu, Mar 21, 2013 at 08:47:04PM +0000, Al Viro wrote:
 > On Thu, Mar 21, 2013 at 04:36:39PM -0400, Dave Jones wrote:
 > > at some point during the fuzz run, this happened..
 > > 
 > > Mar 20 15:20:41 kernel: [ 7578.784674] fuse init (API version 7.21)
 > > Mar 20 15:20:41 systemd[1]: Mounting FUSE Control File System...
 > > Mar 20 15:20:41 systemd[1]: Mounted FUSE Control File System.
 > > 
 > > I guess something wandered into /dev/fuse and did something. Not sure why 
 > > systemd reacted though...
 > 
 > AFAICS, fuse uses d_materialise_unique() and d_splice_alias(), though, so
 > it's not too likely a source of that crap; there are no d_instantiate() calls
 > at all and the sole d_add() is using an inode that has just been allocated,
 > so it won't be creating aliases either...

here we go...

WARNING: at fs/namei.c:2335 lock_rename+0x156/0x160()
Hardware name: GA-MA78GM-S2H
Modules linked in: vmw_vsock_vmci_transport vmw_vmci vsock hidp cmtp kernelcapi bnep caif_socket caif phonet nfnetlink rfcomm l2tp_ppp l2tp_netlink l2tp_core rose pppoe pppox ppp_generic slhc llc2 netrom af_key can_raw af_rxrpc scsi_transport_iscsi ipt_ULOG appletalk irda crc_ccitt decnet can_bcm can rds atm x25 ipx p8023 psnap p8022 llc ax25 nfc af_802154 lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel snd_hda_codec btusb snd_pcm bluetooth rfkill snd_page_alloc snd_timer microcode serio_raw pcspkr snd edac_core soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm
Pid: 5633, comm: trinity-child3 Not tainted 3.9.0-rc3+ #107
Call Trace:
 [<ffffffff810450f5>] warn_slowpath_common+0x75/0xa0
 [<ffffffff810451da>] warn_slowpath_null+0x1a/0x20
 [<ffffffff811c6df6>] lock_rename+0x156/0x160
 [<ffffffff811cd907>] sys_renameat+0x1f7/0x3b0
 [<ffffffff810b33a2>] ? get_lock_stats+0x22/0x70
 [<ffffffff810b6b95>] ? trace_hardirqs_on_caller+0x115/0x1a0
 [<ffffffff810b6b95>] ? trace_hardirqs_on_caller+0x115/0x1a0
 [<ffffffff8134a3ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff816cd242>] system_call_fastpath+0x16/0x1b
---[ end trace 421592dfa22abfb0 ]---
p1=irda p2=irda



followed by...  


=====================================
[ BUG: bad unlock balance detected! ]
3.9.0-rc3+ #107 Tainted: G        W   
-------------------------------------
trinity-child3/5633 is trying to release lock (&type->i_mutex_dir_key) at:
[<ffffffff816c104e>] mutex_unlock+0xe/0x10
but there are no more locks to release!

other info that might help us debug this:
1 lock on stack by trinity-child3/5633:
 #0: blocked:  (sb_writers#4){.+.+.+}, instance: ffff8801292d17d8, at: [<ffffffff811df294>] mnt_want_write+0x24/0x50

stack backtrace:
Pid: 5633, comm: trinity-child3 Tainted: G        W    3.9.0-rc3+ #107
Call Trace:
 [<ffffffff816c104e>] ? mutex_unlock+0xe/0x10
 [<ffffffff810b4e10>] print_unlock_imbalance_bug+0x100/0x110
 [<ffffffff810b9417>] lock_release_non_nested+0x257/0x320
 [<ffffffff810b3288>] ? trace_hardirqs_off_caller+0x28/0xc0
 [<ffffffff816c104e>] ? mutex_unlock+0xe/0x10
 [<ffffffff816c104e>] ? mutex_unlock+0xe/0x10
 [<ffffffff810b9587>] lock_release+0xa7/0x310
 [<ffffffff816c0f4a>] __mutex_unlock_slowpath+0x8a/0x180
 [<ffffffff816c104e>] mutex_unlock+0xe/0x10
 [<ffffffff811c67f1>] unlock_rename+0x41/0x60
 [<ffffffff811cd996>] sys_renameat+0x286/0x3b0
 [<ffffffff810b33a2>] ? get_lock_stats+0x22/0x70
 [<ffffffff810b6b95>] ? trace_hardirqs_on_caller+0x115/0x1a0
 [<ffffffff810b6b95>] ? trace_hardirqs_on_caller+0x115/0x1a0
 [<ffffffff8134a3ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff816cd242>] system_call_fastpath+0x16/0x1b



* Re: VFS deadlock ?
  2013-03-21 21:02             ` Dave Jones
@ 2013-03-21 21:18               ` Linus Torvalds
  2013-03-21 21:26                 ` Al Viro
  2013-03-21 22:12                 ` Dave Jones
  0 siblings, 2 replies; 45+ messages in thread
From: Linus Torvalds @ 2013-03-21 21:18 UTC (permalink / raw)
  To: Dave Jones, Al Viro, Linus Torvalds, Linux Kernel

On Thu, Mar 21, 2013 at 2:02 PM, Dave Jones <davej@redhat.com> wrote:
>
> here we go...
>
> WARNING: at fs/namei.c:2335 lock_rename+0x156/0x160()
> p1=irda p2=irda

Ok, good. I suspect it's /proc or /sys, we do have that entry there.

But in fact I suspect we do want the parent name after all, because I
think we have multiple "irda" directories. There's the one in
/proc/net/ (added by net/irda/irproc.c), and there's a sysctl  CTL_DIR
"irda" directory (kernel/sysctl_binary.c). And there might even be a
few other ones in /sys too, thanks to the ldisc called "irda" etc.

I don't see where the shared inode comes from, but I suspect that
would be easier to guess if we actually see which particular case it
ends up being..

> followed by...
> =====================================
> [ BUG: bad unlock balance detected! ]

Oh, ok, that's just because the unlock path doesn't have the same
logic for unlocking identical inodes that you just added to the
locking path. You'd need to add a check for "same inode" and only
unlock it once.

So that was my fault in asking for a non-BUG_ON and not doing the
complete thing. See "unlock_rename()" - you'd need to change the "p1
!= p2" test there to "p1->d_inode != p2->d_inode" to match the
logic in lock_rename()
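
Concretely, a sketch of that change (the same adjustment shows up in
the patches further down the thread):

	void unlock_rename(struct dentry *p1, struct dentry *p2)
	{
		mutex_unlock(&p1->d_inode->i_mutex);
		/* mirror lock_rename(): aliased inodes took only one lock */
		if (p1->d_inode != p2->d_inode) {
			mutex_unlock(&p2->d_inode->i_mutex);
			mutex_unlock(&p1->d_inode->i_sb->s_vfs_rename_mutex);
		}
	}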

            Linus


* Re: VFS deadlock ?
  2013-03-21 21:18               ` Linus Torvalds
@ 2013-03-21 21:26                 ` Al Viro
  2013-03-21 21:41                   ` Dave Jones
  2013-03-21 22:12                 ` Dave Jones
  1 sibling, 1 reply; 45+ messages in thread
From: Al Viro @ 2013-03-21 21:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel

On Thu, Mar 21, 2013 at 02:18:15PM -0700, Linus Torvalds wrote:
> On Thu, Mar 21, 2013 at 2:02 PM, Dave Jones <davej@redhat.com> wrote:
> >
> > here we go...
> >
> > WARNING: at fs/namei.c:2335 lock_rename+0x156/0x160()
> > p1=irda p2=irda
> 
> Ok, good. I suspect it's /proc or /sys, we do have that entry there.
> 
> But in fact I suspect we do want the parent name after all, because I
> think we have multiple "irda" directories. There's the one in
> /proc/net/ (added by net/irda/irproc.c), and there's a sysctl  CTL_DIR
> "irda" directory (kernel/sysctl_binary.c). And there might even be a
> few other ones in /sys too, thanks to the ldisc called "irda" etc.
> 
> I don't see where the shared inode comes from, but I suspect that
> would be easier to guess if we actually see which particular case it
> ends up being..

Well, something like
	static char path[4096];
	d_absolute_path(p1, path, 4096);
	printk(KERN_ERR "%s %s %d %d",
		path, p1->d_sb->s_type->name, d_unlinked(p1), d_unlinked(p2));
might be useful - pathname within fs, fs type and which of those suckers are
unlinked...


* Re: VFS deadlock ?
  2013-03-21 21:26                 ` Al Viro
@ 2013-03-21 21:41                   ` Dave Jones
  2013-03-21 21:47                     ` Linus Torvalds
  2013-03-21 21:52                     ` Al Viro
  0 siblings, 2 replies; 45+ messages in thread
From: Dave Jones @ 2013-03-21 21:41 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Linux Kernel

On Thu, Mar 21, 2013 at 09:26:17PM +0000, Al Viro wrote:
 > On Thu, Mar 21, 2013 at 02:18:15PM -0700, Linus Torvalds wrote:
 > > On Thu, Mar 21, 2013 at 2:02 PM, Dave Jones <davej@redhat.com> wrote:
 > > >
 > > > here we go...
 > > >
 > > > WARNING: at fs/namei.c:2335 lock_rename+0x156/0x160()
 > > > p1=irda p2=irda
 > > 
 > > Ok, good. I suspect it's /proc or /sys, we do have that entry there.
 > > 
 > > But in fact I suspect we do want the parent name after all, because I
 > > think we have multiple "irda" directories. There's the one in
 > > /proc/net/ (added by net/irda/irproc.c), and there's a sysctl  CTL_DIR
 > > "irda" directory (kernel/sysctl_binary.c). And there might even be a
 > > few other ones in /sys too, thanks to the ldisc called "irda" etc.
 > > 
 > > I don't see where the shared inode comes from, but I suspect that
 > > would be easier to guess if we actually see which particular case it
 > > ends up being..
 > 
 > Well, something like
 > 	static char path[4096];
 > 	d_absolute_path(p1, path, 4096);
 > 	printk(KERN_ERR "%s %s %d %d",
 > 		path, p1->d_sb->s_type->name, d_unlinked(p1), d_unlinked(p2));
 > might be useful - pathname within fs, fs type and which of those suckers are
 > unlinked...

uh..

fs/namei.c:2342:3: warning: passing argument 1 of ‘d_absolute_path’ from incompatible pointer type [enabled by default]
   d_absolute_path(p1, path, 4096);
   ^
In file included from include/linux/fs.h:8:0,
                 from fs/namei.c:21:
include/linux/dcache.h:337:14: note: expected ‘const struct path *’ but argument is of type ‘struct dentry *’
 extern char *d_absolute_path(const struct path *, char *, int);
              ^


How do I go from the dentry in p1 to the path d_absolute_path expects ?

	Dave



* Re: VFS deadlock ?
  2013-03-21 21:41                   ` Dave Jones
@ 2013-03-21 21:47                     ` Linus Torvalds
  2013-03-21 21:55                       ` Al Viro
  2013-03-21 21:52                     ` Al Viro
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2013-03-21 21:47 UTC (permalink / raw)
  To: Dave Jones, Al Viro, Linus Torvalds, Linux Kernel

On Thu, Mar 21, 2013 at 2:41 PM, Dave Jones <davej@redhat.com> wrote:
>
> How do I go from the dentry in p1 to the path d_absolute_path expects ?

You don't. Use dentry_path() instead. It will also end up giving you
the information about whether the dentry is unlinked or not.
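
A minimal usage sketch - note that dentry_path() builds the string from
the end of the supplied buffer and returns a pointer to where the path
begins, so it's the return value that wants printing, not the buffer:

	static char buf[4096];
	char *name = dentry_path(p1, buf, sizeof(buf));
	if (!IS_ERR(name))
		printk(KERN_ERR "p1: %s\n", name);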

         Linus


* Re: VFS deadlock ?
  2013-03-21 21:41                   ` Dave Jones
  2013-03-21 21:47                     ` Linus Torvalds
@ 2013-03-21 21:52                     ` Al Viro
  1 sibling, 0 replies; 45+ messages in thread
From: Al Viro @ 2013-03-21 21:52 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel

On Thu, Mar 21, 2013 at 05:41:18PM -0400, Dave Jones wrote:
> fs/namei.c:2342:3: warning: passing argument 1 of ‘d_absolute_path’ from incompatible pointer type [enabled by default]
>    d_absolute_path(p1, path, 4096);
>    ^
> In file included from include/linux/fs.h:8:0,
>                  from fs/namei.c:21:
> include/linux/dcache.h:337:14: note: expected ‘const struct path *’ but argument is of type ‘struct dentry *’
>  extern char *d_absolute_path(const struct path *, char *, int);
>               ^
> 
> 
> How do I go from the dentry in p1 to the path d_absolute_path expects ?

Ugh...  s/d_absolute_path/dentry_path/, of course.  Sorry about the braino...


* Re: VFS deadlock ?
  2013-03-21 21:47                     ` Linus Torvalds
@ 2013-03-21 21:55                       ` Al Viro
  2013-03-21 21:57                         ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Al Viro @ 2013-03-21 21:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel

On Thu, Mar 21, 2013 at 02:47:06PM -0700, Linus Torvalds wrote:

> It will also end up giving you
> the information about whether the dentry is unlinked or not.

You also want to know if p2 is unlinked, though (and yes, it's dentry_path(),
of course - apologies for the braino).


* Re: VFS deadlock ?
  2013-03-21 21:55                       ` Al Viro
@ 2013-03-21 21:57                         ` Linus Torvalds
  2013-03-21 22:03                           ` Al Viro
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2013-03-21 21:57 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linux Kernel

On Thu, Mar 21, 2013 at 2:55 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Mar 21, 2013 at 02:47:06PM -0700, Linus Torvalds wrote:
>
>> It will also end up giving you
>> the information about whether the dentry is unlinked or not.
>
> You also want to know if p2 is unlinked, though (and yes, it's dentry_path(),
> of course - apologies for the braino).

Just do dentry_path on both p1 and p2. You'll have to have two
different buffers for them, of course.

          Linus


* Re: VFS deadlock ?
  2013-03-21 21:57                         ` Linus Torvalds
@ 2013-03-21 22:03                           ` Al Viro
  0 siblings, 0 replies; 45+ messages in thread
From: Al Viro @ 2013-03-21 22:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel

On Thu, Mar 21, 2013 at 02:57:29PM -0700, Linus Torvalds wrote:
> On Thu, Mar 21, 2013 at 2:55 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Thu, Mar 21, 2013 at 02:47:06PM -0700, Linus Torvalds wrote:
> >
> >> It will also end up giving you
> >> the information about whether the dentry is unlinked or not.
> >
> > You also want to know if p2 is unlinked, though (and yes, it's dentry_path(),
> > of course - apologies for the braino).
> 
> Just do dentry_path on both p1 and p2. You'll have to have two
> different buffers for them, of course.

I would be rather surprised if the pathnames proper would turn out to be
different, actually...


* Re: VFS deadlock ?
  2013-03-21 21:18               ` Linus Torvalds
  2013-03-21 21:26                 ` Al Viro
@ 2013-03-21 22:12                 ` Dave Jones
  2013-03-21 22:29                   ` Dave Jones
  2013-03-21 22:53                   ` Linus Torvalds
  1 sibling, 2 replies; 45+ messages in thread
From: Dave Jones @ 2013-03-21 22:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Linux Kernel

On Thu, Mar 21, 2013 at 02:18:15PM -0700, Linus Torvalds wrote:
 > On Thu, Mar 21, 2013 at 2:02 PM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > here we go...
 > >
 > > WARNING: at fs/namei.c:2335 lock_rename+0x156/0x160()
 > > p1=irda p2=irda
 > 
 > Ok, good. I suspect it's /proc or /sys, we do have that entry there.
 > 
 > But in fact I suspect we do want the parent name after all, because I
 > think we have multiple "irda" directories. There's the one in
 > /proc/net/ (added by net/irda/irproc.c), and there's a sysctl  CTL_DIR
 > "irda" directory (kernel/sysctl_binary.c). And there might even be a
 > few other ones in /sys too, thanks to the ldisc called "irda" etc.
 > 
 > I don't see where the shared inode comes from, but I suspect that
 > would be easier to guess if we actually see which particular case it
 > ends up being..

it's not just irda fwiw..

p1=rpc p2=rpc p1parent=net p2parent=net

not sure why the printk with Al's last debugging info didn't print out..
Investigating..

	Dave



* Re: VFS deadlock ?
  2013-03-21 22:12                 ` Dave Jones
@ 2013-03-21 22:29                   ` Dave Jones
  2013-03-21 22:53                   ` Linus Torvalds
  1 sibling, 0 replies; 45+ messages in thread
From: Dave Jones @ 2013-03-21 22:29 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Linux Kernel

On Thu, Mar 21, 2013 at 06:12:56PM -0400, Dave Jones wrote:
 > On Thu, Mar 21, 2013 at 02:18:15PM -0700, Linus Torvalds wrote:
 >  > On Thu, Mar 21, 2013 at 2:02 PM, Dave Jones <davej@redhat.com> wrote:
 >  > >
 >  > > here we go...
 >  > >
 >  > > WARNING: at fs/namei.c:2335 lock_rename+0x156/0x160()
 >  > > p1=irda p2=irda
 >  > 
 >  > Ok, good. I suspect it's /proc or /sys, we do have that entry there.
 >  > 
 >  > But in fact I suspect we do want the parent name after all, because I
 >  > think we have multiple "irda" directories. There's the one in
 >  > /proc/net/ (added by net/irda/irproc.c), and there's a sysctl  CTL_DIR
 >  > "irda" directory (kernel/sysctl_binary.c). And there might even be a
 >  > few other ones in /sys too, thanks to the ldisc called "irda" etc.
 >  > 
 >  > I don't see where the shared inode comes from, but I suspect that
 >  > would be easier to guess if we actually see which particular case it
 >  > ends up being..
 > 
 > it's not just irda fwiw..
 > 
 > p1=rpc p2=rpc p1parent=net p2parent=net
 > 
 > not sure why the printk with Al's last debugging info didn't print out..
 > Investigating..

missing \n meant it didn't go over serial..
on screen however was..

path1: path2: proc 0 0

sanity check ?

 
diff --git a/fs/namei.c b/fs/namei.c
index 57ae9c8..5afffe1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2282,6 +2332,26 @@ struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
 		return NULL;
 	}
 
+	if (WARN_ON_ONCE(p1->d_inode == p2->d_inode)) {
+		static char path1[4096];
+		static char path2[4096];
+
+		printk(KERN_ERR "p1=%s p2=%s p1parent=%s p2parent=%s\n", p1->d_name.name, p2->d_name.name,
+				p1->d_parent->d_name.name,
+				p2->d_parent->d_name.name);
+
+		dentry_path(p1, path1, 4096);
+		dentry_path(p2, path2, 4096);
+
+		printk(KERN_ERR "path1:%s path2:%s %s %d %d\n",
+	                path1, path2, p1->d_sb->s_type->name, d_unlinked(p1), d_unlinked(p2));
+
+
+		mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
+		return NULL;
+	}
+
+
 	mutex_lock(&p1->d_inode->i_sb->s_vfs_rename_mutex);
 
 	p = d_ancestor(p2, p1);

@@ -2306,7 +2376,8 @@ struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
 void unlock_rename(struct dentry *p1, struct dentry *p2)
 {
 	mutex_unlock(&p1->d_inode->i_mutex);
-	if (p1 != p2) {
+
+	if (p1->d_inode != p2->d_inode) {
 		mutex_unlock(&p2->d_inode->i_mutex);
 		mutex_unlock(&p1->d_inode->i_sb->s_vfs_rename_mutex);
 	}
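
For what it's worth, the empty path1:/path2: output above is consistent
with dentry_path()'s contract: it fills the path in from the end of the
buffer and returns a pointer to where the string starts, so printing the
(zero-initialized static) buffer from offset zero shows an empty string.
A sketch of the corrected calls:

	char *s1 = dentry_path(p1, path1, 4096);
	char *s2 = dentry_path(p2, path2, 4096);
	if (!IS_ERR(s1) && !IS_ERR(s2))
		printk(KERN_ERR "path1:%s path2:%s %s %d %d\n",
			s1, s2, p1->d_sb->s_type->name,
			d_unlinked(p1), d_unlinked(p2));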



* Re: VFS deadlock ?
  2013-03-21 22:12                 ` Dave Jones
  2013-03-21 22:29                   ` Dave Jones
@ 2013-03-21 22:53                   ` Linus Torvalds
  2013-03-21 23:07                     ` Dave Jones
  2013-03-21 23:36                     ` Al Viro
  1 sibling, 2 replies; 45+ messages in thread
From: Linus Torvalds @ 2013-03-21 22:53 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Al Viro, Linux Kernel

On Thu, Mar 21, 2013 at 3:12 PM, Dave Jones <davej@redhat.com> wrote:
>
> it's not just irda fwiw..
>
> p1=rpc p2=rpc p1parent=net p2parent=net

Ok, good. The only place rpc and irda have something in common is
/proc/net/, and they both use proc_mkdir() to create the directory:

    proc_irda = proc_mkdir("irda", init_net.proc_net);
   ...
    sn->proc_net_rpc = proc_mkdir("rpc", net->proc_net);

so it's almost certainly that case. What I do *not* see is how we got
two different dentries for the same name in /proc. But if that
happens, then yes, they will have aliased inodes (because
proc_get_inode() will look them up by "sb,de->low_ino").

Al, any ideas? There shouldn't be some lookup race, because that's
done under the parent inode lock. And multiple mount-points will have
different superblocks, so proc_get_inode() will give them separate
inodes. And bind mounts should have all the same dentry tree. So what
the heck am I missing?

       Linus


* Re: VFS deadlock ?
  2013-03-21 22:53                   ` Linus Torvalds
@ 2013-03-21 23:07                     ` Dave Jones
  2013-03-21 23:36                     ` Al Viro
  1 sibling, 0 replies; 45+ messages in thread
From: Dave Jones @ 2013-03-21 23:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Linux Kernel

On Thu, Mar 21, 2013 at 03:53:13PM -0700, Linus Torvalds wrote:
 > On Thu, Mar 21, 2013 at 3:12 PM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > it's not just irda fwiw..
 > >
 > > p1=rpc p2=rpc p1parent=net p2parent=net
 > 
 > Ok, good. The only place rpc and irda have something in common is
 > /proc/net/, and they both use proc_mkdir() to create the directory:
 > 
 >     proc_irda = proc_mkdir("irda", init_net.proc_net);
 >    ...
 >     sn->proc_net_rpc = proc_mkdir("rpc", net->proc_net);
 > 
 > so it's almost certainly that case. What I do *not* see is how we got
 > two different dentries for the same name in /proc. But if that
 > happens, then yes, they will have aliased inodes (because
 > proc_get_inode() will look them up by "sb,de->low_ino").
 > 
 > Al, any ideas? There shouldn't be some lookup race, because that's
 > done under the parent inode lock. And multiple mount-points will have
 > different superblocks, so proc_get_inode() will give them separate
 > inodes. And bind mounts should have all the same dentry tree. So what
 > the heck am I missing?

Hmm, these also seem to have appeared around the time I reenabled
all the namespace options after Eric fixed that last proc bug.

Could that be related ?

	Dave


* Re: VFS deadlock ?
  2013-03-21 22:53                   ` Linus Torvalds
  2013-03-21 23:07                     ` Dave Jones
@ 2013-03-21 23:36                     ` Al Viro
  2013-03-21 23:58                       ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Al Viro @ 2013-03-21 23:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 03:53:13PM -0700, Linus Torvalds wrote:
> Ok, good. The only place rpc and irda have something in common is
> /proc/net/, and they both use proc_mkdir() to create the directory:
> 
>     proc_irda = proc_mkdir("irda", init_net.proc_net);
>    ...
>     sn->proc_net_rpc = proc_mkdir("rpc", net->proc_net);
> 
> so it's almost certainly that case. What I do *not* see is how we got
> two different dentries for the same name in /proc. But if that
> happens, then yes, they will have aliased inodes (because
> proc_get_inode() will look them up by "sb,de->low_ino").
> 
> Al, any ideas? There shouldn't be some lookup race, because that's
> done under the parent inode lock. And multiple mount-points will have
> different superblocks, so proc_get_inode() will give them separate
> inodes. And bind mounts should have all the same dentry tree. So what
> the heck am I missing?

Some netns-related idiocy.  Oh, shit...

al@duke:~/linux/trees/vfs$ ls -lid /proc/{1,2}/net/stat
4026531842 dr-xr-xr-x 2 root root 0 Mar 21 19:33 /proc/1/net/stat
4026531842 dr-xr-xr-x 2 root root 0 Mar 21 19:33 /proc/2/net/stat

Eric, would you mind explaining WTF is going on here?  Again, WE CAN NOT
HAVE SEVERAL DENTRIES OVER THE SAME DIRECTORY INODE.  Ever.  We do that,
we are fucked.

Sigh...  Namespace kinds - there should've been only one...


* Re: VFS deadlock ?
  2013-03-21 23:36                     ` Al Viro
@ 2013-03-21 23:58                       ` Linus Torvalds
  2013-03-22  0:01                         ` Linus Torvalds
  2013-03-22  0:08                         ` Al Viro
  0 siblings, 2 replies; 45+ messages in thread
From: Linus Torvalds @ 2013-03-21 23:58 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 4:36 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Some netns-related idiocy.  Oh, shit...
>
> al@duke:~/linux/trees/vfs$ ls -lid /proc/{1,2}/net/stat
> 4026531842 dr-xr-xr-x 2 root root 0 Mar 21 19:33 /proc/1/net/stat
> 4026531842 dr-xr-xr-x 2 root root 0 Mar 21 19:33 /proc/2/net/stat
>
> Eric, would you mind explaining WTF is going on here?  Again, WE CAN NOT
> HAVE SEVERAL DENTRIES OVER THE SAME DIRECTORY INODE.  Ever.  We do that,
> we are fucked.

Hmm. That certainly explains the situation, but it leaves me wondering
whether the simplest solution to this is not to say "ok, let's allow
it in this case".

The locking is already per-inode, so we can literally change the code
that checks "if same dentry" to "if same inode" instead.

And the only other reason we don't want to allow it is to make sure
you can't have directory loops etc, afaik, and again, for this
particular case of /proc, we happen to be ok.

So yes, it's against the rules, and we get that deadlock right now,
but one solution would be to just allow this particular case. The
patch for the deadlock looks dead simple:

    diff --git a/fs/namei.c b/fs/namei.c
    index 57ae9c8c66bf..435002f99bd8 100644
    --- a/fs/namei.c
    +++ b/fs/namei.c
    @@ -2277,7 +2277,7 @@ struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
     {
             struct dentry *p;

    -        if (p1 == p2) {
    +        if (p1->d_inode == p2->d_inode) {
                     mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
                     return NULL;
             }
    @@ -2306,7 +2306,7 @@ struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
     void unlock_rename(struct dentry *p1, struct dentry *p2)
     {
             mutex_unlock(&p1->d_inode->i_mutex);
    -        if (p1 != p2) {
    +        if (p1->d_inode != p2->d_inode) {
                     mutex_unlock(&p2->d_inode->i_mutex);
                     mutex_unlock(&p1->d_inode->i_sb->s_vfs_rename_mutex);
             }

Are there any other reasons why these kinds of "hardlinked
directories" would cause problems?

             Linus


* Re: VFS deadlock ?
  2013-03-21 23:58                       ` Linus Torvalds
@ 2013-03-22  0:01                         ` Linus Torvalds
  2013-03-22  0:12                           ` Al Viro
  2013-03-22  0:08                         ` Al Viro
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2013-03-22  0:01 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 4:58 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So yes, it's against the rules, and we get that deadlock right now,
> but one solution would be to just allow this particular case. The
> patch for the deadlock looks dead simple:

It should go without saying that that whitespace-damaged patch is
entirely untested. But especially since we need to back-port whatever
fix, it would be good if we could make the fix be something as simple
as this. Because I don't believe we really want to backport some big
network-namespace reorganization.

This is, of course, all assuming that hardlinked directories are ok if
we can just guarantee the absence of loops. If there's some other
reason why they'd be problematic, we're screwed.

        Linus


* Re: VFS deadlock ?
  2013-03-21 23:58                       ` Linus Torvalds
  2013-03-22  0:01                         ` Linus Torvalds
@ 2013-03-22  0:08                         ` Al Viro
  2013-03-22  0:15                           ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Al Viro @ 2013-03-22  0:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 04:58:41PM -0700, Linus Torvalds wrote:

> And the only other reason we don't want to allow it is to make sure
> you can't have directory loops etc, afaik, and again, for this
> particular case of /proc, we happen to be ok.

Not really.  Do that and yes, this deadlock goes away.  But the locking
order in general goes to hell - we order directory inodes by "which dentry
is an ancestor of another?"  So we have no guarantee that we won't get
alias1/foo/bar/baz < alias2/foo.  Take lock_rename() on those two and
have it race with rmdir alias2/foo/bar/baz (locks alias2/foo/bar, then
alias2/foo/bar/baz) and rmdir alias2/foo/bar (locks alias2/foo and
alias2/foo/bar).  Oops - we have a cycle now...
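
Spelled out as lock-acquisition order (a sketch, with alias1/ and
alias2/ being two dentry trees over the same directory inodes):

	/*
	 * T1: lock_rename(alias1/foo/bar/baz, alias2/foo)
	 *     - the dentries look unrelated, so it takes baz's
	 *       i_mutex first, then waits on foo's
	 * T2: rmdir alias2/foo/bar/baz - holds bar, waits on baz
	 * T3: rmdir alias2/foo/bar     - holds foo, waits on bar
	 *
	 * T1 holds baz and wants foo; T3 holds foo and wants bar;
	 * T2 holds bar and wants baz.  Because the aliases share
	 * inodes, baz under alias1 is the same lock as baz under
	 * alias2 - a three-way cycle.
	 */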


* Re: VFS deadlock ?
  2013-03-22  0:01                         ` Linus Torvalds
@ 2013-03-22  0:12                           ` Al Viro
  2013-03-22  0:20                             ` Al Viro
  2013-03-22  0:22                             ` Linus Torvalds
  0 siblings, 2 replies; 45+ messages in thread
From: Al Viro @ 2013-03-22  0:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 05:01:49PM -0700, Linus Torvalds wrote:
> On Thu, Mar 21, 2013 at 4:58 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So yes, it's against the rules, and we get that deadlock right now,
> > but one solution would be to just allow this particular case. The
> > patch for the deadlock looks dead simple:
> 
> It should go without saying that that whitespace-damaged patch is
> entirely untested. But especially since we need to back-port whatever
> fix, it would be good if we could make the fix be something as simple
> as this. Because I don't believe we really want to backport some big
> network-namespace reorganization.
> 
> This is, of course, all assuming that hardlinked directories are ok if
> we can just guarantee the absence of loops. If there's some other
> reason why they'd be problematic, we're screwed.

See the posting upthread.  We could try to kludge around that as well
(e.g. have d_ancestor() compare ->d_inode instead of dentries themselves),
but I really think it's a lousy idea only inviting further abuse.

What we should do, IMO, is to turn /proc/<pid>/net into an honest symlink -
to ../nets/<netns ID>/net.  Hell, might even make it a magical symlink
instead...


* Re: VFS deadlock ?
  2013-03-22  0:08                         ` Al Viro
@ 2013-03-22  0:15                           ` Linus Torvalds
  2013-03-22  0:19                             ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2013-03-22  0:15 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 5:08 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Not really.  Do that and yes, this deadlock goes away.  But the locking
> order in general goes to hell - we order directory inodes by "which dentry
> is an ancestor of another?"  So we have no guarantee that we won't get
> alias1/foo/bar/baz < alias2/foo.  Take lock_rename() on those two and
> have it race with rmdir alias2/foo/bar/baz (locks alias2/foo/bar, then
> alias2/foo/bar/baz) and rmdir alias2/foo/bar (locks alias2/foo and
> alias2/foo/bar).  Oops - we have a cycle now...

Hmm. But again, that can't actually happen here. We're in /proc. You
can't move the entries around. Also, we only changed the locking order
for the "inode is identical" case where we take only *one* lock, we
didn't change it for the cases where we take multiple locks (and order
them topologically).

So I agree that we need to avoid aliased directories in the *general*
case. I'm just arguing that for the specific case of /proc, we should
be ok. No?

             Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: VFS deadlock ?
  2013-03-22  0:15                           ` Linus Torvalds
@ 2013-03-22  0:19                             ` Linus Torvalds
  0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2013-03-22  0:19 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

Thinking some more..

On Thu, Mar 21, 2013 at 5:15 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. But again, that can't actually happen here. We're in /proc. You
> can't move the entries around.

.. this wasn't a good argument, because we will take the locks before
we do that.

> Also, we only changed the locking order
> for the "inode is identical" case where we take only *one* lock, we
> didn't change it for the cases where we take multiple locks (and order
> them topologically).

.. and this isn't a good argument either, because your argument was
that you can get the deadlock by always taking two directories, and
never hitting the alias case itself.

Hmm.

             Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: VFS deadlock ?
  2013-03-22  0:12                           ` Al Viro
@ 2013-03-22  0:20                             ` Al Viro
  2013-03-22  0:22                             ` Linus Torvalds
  1 sibling, 0 replies; 45+ messages in thread
From: Al Viro @ 2013-03-22  0:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Fri, Mar 22, 2013 at 12:12:57AM +0000, Al Viro wrote:

> See the posting upthread.  We could try to kludge around that as well
> (e.g. have d_ancestor() compare ->d_inode instead of dentries themselves),
> but I really think it's a lousy idea only inviting further abuse.
> 
> What we should do, IMO, is to turn /proc/<pid>/net into an honest symlink -
> to ../nets/<netns ID>/net.  Hell, might even make it a magical symlink
> instead...

BTW, the root cause is that what used to be /proc/net became per-process.
So Eric (IIRC) had added /proc/<pid>/net.  Only they are not really per-process
- they are per-netns.  And instead of putting those per-ns trees elsewhere and
having /proc/<pid>/net resolve to the right one, we got them as directories,
with each entry hardlinked between all /proc/<pid>/net for processes from
the same netns.  Including the subdirectory ones.  Oops...
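
The alias is easy to see from userland, by the way; a minimal check,
assuming pids 1 and 2 are in the same netns:

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat a, b;

	if (stat("/proc/1/net/stat", &a) || stat("/proc/2/net/stat", &b))
		return 1;
	/* two different directory paths, one inode */
	printf("%lu vs %lu: %s\n",
	       (unsigned long) a.st_ino, (unsigned long) b.st_ino,
	       a.st_ino == b.st_ino && a.st_dev == b.st_dev
			? "aliased" : "distinct");
	return 0;
}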

Another variant is to keep cross-hardlinks for non-directories and duplicate
directory dentries/inodes as we do for /proc/<pid>/net themselves.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: VFS deadlock ?
  2013-03-22  0:12                           ` Al Viro
  2013-03-22  0:20                             ` Al Viro
@ 2013-03-22  0:22                             ` Linus Torvalds
  2013-03-22  1:22                               ` Al Viro
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2013-03-22  0:22 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 5:12 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What we should do, IMO, is to turn /proc/<pid>/net into an honest symlink -
> to ../nets/<netns ID>/net.  Hell, might even make it a magical symlink
> instead...

Ok, having seen the error of my ways, I'm starting to agree with you..
 How painful would that be? Especially since we'd need to backport
it..

              Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: VFS deadlock ?
  2013-03-22  0:22                             ` Linus Torvalds
@ 2013-03-22  1:22                               ` Al Viro
  2013-03-22  1:33                                 ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Al Viro @ 2013-03-22  1:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 05:22:59PM -0700, Linus Torvalds wrote:
> On Thu, Mar 21, 2013 at 5:12 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > What we should do, IMO, is to turn /proc/<pid>/net into an honest symlink -
> > to ../nets/<netns ID>/net.  Hell, might even make it a magical symlink
> > instead...
> 
> Ok, having seen the error of my ways, I'm starting to agree with you..
>  How painful would that be? Especially since we'd need to backport
> it..

Not sure; right now I'm looking through the guts of what procfs had become.
Unfortunately, there are fairly subtle interactions with other shit -
tomoyo, etc.  Sigh...

BTW, the variant with the d_ancestor() modification is also not enough -
/proc/1/net and /proc/2/net have different inodes, so for the pair
(/proc/1/net, /proc/2/net/stat) d_ancestor() won't trigger
even with this change.  And we have /proc/1/net < /proc/1/net/stat,
since the latter is a subdirectory of the former.  With /proc/{1,2}/net/stat
having the same inode...

In theory, we can make vfs_rmdir() and vfs_unlink() check the presence of
the corresponding method before locking the victim; that would suffice to
kludge around that mess on procfs.  Along with ->d_inode comparison in
lock_rename() it *might* suffice.  OTOH, there are places in fs/dcache.c
where we rely on the lack of such aliases; they might or might not trigger
in case of procfs.

We are talking about the violation of a fundamental assertion used in
correctness analysis all over the place, unfortunately.  The right fix
is to restore it; I'll try to come up with something that could be
reasonably easily backported - the kludge above is a fallback in case
no real fix turns out to be easy to backport.  Assuming that this kludge
is sufficient, that is...  For 3.9 and later we *definitely* want to
restore that assertion.

PS: Once more, with feeling, to everyone even thinking of pulling something
like that again:
	Hardlinks to directories do not work.  Don't do that, or we'll be
sorry, and then so will you.
					A Very Peeved BOFH

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: VFS deadlock ?
  2013-03-22  1:22                               ` Al Viro
@ 2013-03-22  1:33                                 ` Linus Torvalds
  2013-03-22  1:40                                   ` Al Viro
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2013-03-22  1:33 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 6:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> In theory, we can make vfs_rmdir() and vfs_unlink() check the presence of
> the corresponding method before locking the victim; that would suffice to
> kludge around that mess on procfs.  Along with ->d_inode comparison in
> lock_rename() it *might* suffice.

Hmm, yes. Maybe we can do that as a stopgap, backport that, and leave
any bigger changes for the development tree. That would make the issue
less urgent, never mind all the other worries about backporting
complicated patches for subtle issues.

I realize you aren't entirely thrilled about it, but we actually
already seem to do that check in both vfs_rmdir() and vfs_unlink()
before getting the child i_mutex.  I wonder if that is because we've
already seen lockdep splats for this case...
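
For concreteness, the vfs_rmdir() side as of this era - trimmed from
memory, so treat it as a sketch rather than a verbatim quote:

int vfs_rmdir(struct inode *dir, struct dentry *dentry)
{
	int error = may_delete(dir, dentry, 1);

	if (error)
		return error;

	/* the method check happens before the victim's i_mutex is taken */
	if (!dir->i_op->rmdir)
		return -EPERM;

	dget(dentry);
	mutex_lock(&dentry->d_inode->i_mutex);

	error = -EBUSY;
	if (!d_mountpoint(dentry))
		error = dir->i_op->rmdir(dir, dentry);

	mutex_unlock(&dentry->d_inode->i_mutex);
	dput(dentry);
	if (!error)
		d_delete(dentry);
	return error;
}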

                 Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: VFS deadlock ?
  2013-03-22  1:33                                 ` Linus Torvalds
@ 2013-03-22  1:40                                   ` Al Viro
  2013-03-22  4:37                                     ` [CFT] " Al Viro
  0 siblings, 1 reply; 45+ messages in thread
From: Al Viro @ 2013-03-22  1:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 06:33:35PM -0700, Linus Torvalds wrote:
> On Thu, Mar 21, 2013 at 6:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > In theory, we can make vfs_rmdir() and vfs_unlink() check the presence of
> > the corresponding method before locking the victim; that would suffice to
> > kludge around that mess on procfs.  Along with ->d_inode comparison in
> > lock_rename() it *might* suffice.
> 
> Hmm, yes. Maybe we can do that as a stopgap, backport that, and leave
> any bigger changes for the development tree. That would make the issue
> less urgent, never mind all the other worries about backporting
> complicated patches for subtle issues.
> 
> I realize you aren't entirely thrilled about it, but we actually
> already seem to do that check in both vfs_rmdir() and vfs_unlink()
> before getting the child i_mutex.  I wonder if that is because we've
> already seen lockdep splats for this case...

Yeah, I went to do such a patch after sending the previous mail and noticed
that we already did it that way.  Simplicity of error recovery was probably
the more important consideration there - I honestly don't remember the
reasoning in such detail; it's been a decade or so...  So lock_rename() doing
->d_inode comparison (with a dire comment re not expecting that to be sufficient
for anything other than this bug in procfs) will probably suffice for the
fs/namei.c part of it; I'm still looking at the dcache.c side of things...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [CFT] Re: VFS deadlock ?
  2013-03-22  1:40                                   ` Al Viro
@ 2013-03-22  4:37                                     ` Al Viro
  2013-03-22  4:55                                       ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Al Viro @ 2013-03-22  4:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Fri, Mar 22, 2013 at 01:40:37AM +0000, Al Viro wrote:

> Yeah, I went to do such a patch after sending the previous mail and noticed
> that we already did it that way.  Simplicity of error recovery was probably
> the more important consideration there - I honestly don't remember the
> reasoning in such detail; it's been a decade or so...  So lock_rename() doing
> ->d_inode comparison (with a dire comment re not expecting that to be sufficient
> for anything other than this bug in procfs) will probably suffice for the
> fs/namei.c part of it; I'm still looking at the dcache.c side of things...

FWIW, a relatively crude solution is this:

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 4b3b3ff..778cbac 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -416,8 +416,7 @@ struct dentry *proc_lookup_de(struct proc_dir_entry *de, struct inode *dir,
 			if (!inode)
 				return ERR_PTR(-ENOMEM);
 			d_set_d_op(dentry, &proc_dentry_operations);
-			d_add(dentry, inode);
-			return NULL;
+			return d_materialise_unique(dentry, inode);
 		}
 	}
 	spin_unlock(&proc_subdir_lock);

It *is* crude, but it restores the assert, killing the deadlock and letting
everything work more or less as it used to.  The case where things start
to look odd is this:

root@kvm-amd64:~# cd /proc/1/net/stat/; ls /proc/2/net/stat; /bin/pwd
arp_cache  ndisc_cache  rt_cache
/proc/2/net/stat

IOW, if we were about to create a directory alias, the old dentry gets moved
to the new place.  OTOH, I think it's the most robust backportable variant we
can do.  And yes, that should apply at least all the way back to 2.6.25 when
Eric acked a patch from Pavel that really should've been nacked...

Folks, could you test that one and see if any real userland breaks on that?
If everything works, I'd propose that one for -stable.

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [CFT] Re: VFS deadlock ?
  2013-03-22  4:37                                     ` [CFT] " Al Viro
@ 2013-03-22  4:55                                       ` Linus Torvalds
  2013-03-22  5:18                                         ` Al Viro
  2013-03-22  5:19                                         ` Linus Torvalds
  0 siblings, 2 replies; 45+ messages in thread
From: Linus Torvalds @ 2013-03-22  4:55 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

[-- Attachment #1: Type: text/plain, Size: 1791 bytes --]

On Thu, Mar 21, 2013 at 9:37 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, a relatively crude solution is this:
>
> -                       d_add(dentry, inode);
> -                       return NULL;
> +                       return d_materialise_unique(dentry, inode);
>
> It *is* crude, but it restores the assert, killing the deadlock and letting
> everything work more or less as it used to.  The case where things start
> to look odd is this:
>
> root@kvm-amd64:~# cd /proc/1/net/stat/; ls /proc/2/net/stat; /bin/pwd
> arp_cache  ndisc_cache  rt_cache
> /proc/2/net/stat
>
> IOW, if we were about to create a directory alias, the old dentry gets moved
> to the new place.

Ugh, no, this is worse than the bug we're trying to fix.

However, I do wonder if we could take another approach... There's
really no reason why we should look up an old inode with iget_locked()
in proc_get_inode() and only fill it in if it is new. We might as well
just always create a new inode. The "iget_locked()" logic really comes
from the bad old days when the inode was the primary data structure,
but it's really the dentry that is the important data structure, and
the inode might as well follow the dentry, instead of the other way
around.

So why not just use "new_inode_pseudo()" instead? IOW, something like
this (totally untested, natch) patch? This way, if you have a new
dentry, you are guaranteed to always have a new inode. None of the
silly inode alias issues..

This seems too simple, but I don't see why iget_locked() would be any
better. It just wastes time hashing the inode that we aren't really
interested in hashing. The inode is always filled by the caller
anyway, so we might as well just get a fresh pseudo-filesystem inode
without any crazy hashing..

                     Linus

[-- Attachment #2: patch.diff --]
[-- Type: application/octet-stream, Size: 672 bytes --]

 fs/proc/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index a86aebc9ba7c..a2395317b3bf 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -446,9 +446,10 @@ static const struct file_operations proc_reg_file_ops_no_compat = {
 
 struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 {
-	struct inode *inode = iget_locked(sb, de->low_ino);
+	struct inode *inode = new_inode_pseudo(sb);
 
-	if (inode && (inode->i_state & I_NEW)) {
+	if (inode) {
+		inode->i_ino = de->low_ino;
 		inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
 		PROC_I(inode)->pde = de;
 

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [CFT] Re: VFS deadlock ?
  2013-03-22  4:55                                       ` Linus Torvalds
@ 2013-03-22  5:18                                         ` Al Viro
  2013-03-22  5:33                                           ` Linus Torvalds
  2013-03-22  5:19                                         ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Al Viro @ 2013-03-22  5:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 09:55:10PM -0700, Linus Torvalds wrote:
> However, I do wonder if we could take another approach... There's
> really no reason why we should look up an old inode with iget_locked()
> in proc_get_inode() and only fill it in if it is new. We might as well
> just always create a new inode. The "iget_locked()" logic really comes
> from the bad old days when the inode was the primary data structure,
> but it's really the dentry that is the important data structure, and
> the inode might as well follow the dentry, instead of the other way
> around.
> 
> So why not just use "new_inode_pseudo()" instead? IOW, something like
> this (totally untested, natch) patch? This way, if you have a new
> dentry, you are guaranteed to always have a new inode. None of the
> silly inode alias issues..
> 
> This seems too simple, but I don't see why iget_locked() would be any
> better. It just wastes time hashing the inode that we aren't really
> interested in hashing. The inode is always filled by the caller
> anyway, so we might as well just get a fresh pseudo-filesystem inode
> without any crazy hashing..

Umm...
static int proc_delete_dentry(const struct dentry * dentry)
{
        return 1;
}

static const struct dentry_operations proc_dentry_operations =
{
        .d_delete       = proc_delete_dentry,
};

IOW, dcache retention in procfs is nonexistent and the damn thing tries
to cut down on the amount of inode allocation/freeing/filling.

I agree that we could get rid of icache retention there and everything
ought to keep working.  Hell knows...  It applies only to the stuff that
isn't per-process, so it's probably not particularly hot anyway, but it
does need profiling...  OTOH, we probably could mark "stable" dentries
in some way and let proc_delete_dentry() check that flag in proc_dir_entry -
I seriously suspect that really hot non-per-process ones are of the
"never become invalid" variety.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [CFT] Re: VFS deadlock ?
  2013-03-22  4:55                                       ` Linus Torvalds
  2013-03-22  5:18                                         ` Al Viro
@ 2013-03-22  5:19                                         ` Linus Torvalds
  1 sibling, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2013-03-22  5:19 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 9:55 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So why not just use "new_inode_pseudo()" instead? IOW, something like
> this (totally untested, natch) patch?

Ok, so I think that patch is still probably the right way to go, but
it's broken for a silly reason: because it's not using iget_locked()
any more, it also needs to remove the now unnecessary (and actively
incorrect) call to unlock_new_inode().

So remove that one line too, and maybe it could even work. Still not
*tested*, though.
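
Concretely, that's one more one-line removal on top of the earlier diff -
the context lines here are reconstructed from memory, so approximate:

--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ (at the end of the fill-in branch in proc_get_inode())
 			inode->i_fop = de->proc_fops;
 		}
-		unlock_new_inode(inode);
 	}
 	return inode;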

              Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [CFT] Re: VFS deadlock ?
  2013-03-22  5:18                                         ` Al Viro
@ 2013-03-22  5:33                                           ` Linus Torvalds
  2013-03-22  6:09                                             ` Al Viro
                                                               ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Linus Torvalds @ 2013-03-22  5:33 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 10:18 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> This seems too simple, but I don't see why iget_locked() would be any
>> better. It just wastes time hashing the inode that we aren't really
>> interested in hashing. The inode is always filled by the caller
>> anyway, so we might as well just get a fresh pseudo-filesystem inode
>> without any crazy hashing..
>
> Umm...
> static int proc_delete_dentry(const struct dentry * dentry)
> {
>         return 1;
> }
>
> static const struct dentry_operations proc_dentry_operations =
> {
>         .d_delete       = proc_delete_dentry,
> };
>
> IOW, dcache retention in procfs is nonexistent and the damn thing tries
> to cut down on the amount of inode allocation/freeing/filling.

Ok, that's kind of ugly, but shouldn't be a correctness issue. It
should still work - just cycle through inodes quite aggressively due to
no longer re-using them - so I suspect Dave could test it (with the
extra line removed I pointed out just a moment ago).

And I wonder how big of a deal the aggressive dentry deletion is.
Maybe it's even ok to allocate/free the inodes all the time. The whole
"get the inode hash lock and look it up there" can't be all that
wonderful either. It takes the inode->i_lock for each entry it finds
on the hash list, which looks horrible. I suspect our slab allocator
isn't much worse than that, although the RCU freeing of the inodes
could end up being problematic.

                    Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [CFT] Re: VFS deadlock ?
  2013-03-22  5:33                                           ` Linus Torvalds
@ 2013-03-22  6:09                                             ` Al Viro
  2013-03-22  6:22                                               ` Al Viro
  2013-03-22 16:23                                             ` Dave Jones
  2013-03-22 19:43                                             ` Linus Torvalds
  2 siblings, 1 reply; 45+ messages in thread
From: Al Viro @ 2013-03-22  6:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 10:33:27PM -0700, Linus Torvalds wrote:

> Ok, that's kind of ugly, but shouldn't be a correctness issue. It
> should still work - just cycle through inodes quite aggressively due to
> no longer re-using them - so I suspect Dave could test it (with the
> extra line removed I pointed out just a moment ago).
> 
> And I wonder how big of a deal the aggressive dentry deletion is.
> Maybe it's even ok to allocate/free the inodes all the time. The whole
> "get the inode hash lock and look it up there" can't be all that
> wonderful either. It takes the inode->i_lock for each entry it finds
> on the hash list, which looks horrible. I suspect our slab allocator
> isn't much worse than that, although the RCU freeing of the inodes
> could end up being problematic.

Hell knows...  At the very least, I'd expect /proc/self to be fairly hot.
During boot - /proc/mounts, /proc/filesystems, /proc/cmdline...
Dunno.  Would be interesting to slap a printk into proc_lookup_de() and
see how much (and what) it hits on a busy system...
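
Something as dumb as one (hypothetical) line at the match site in
proc_lookup_de() would do:

	/* hypothetical instrumentation; ratelimited to keep dmesg usable */
	printk_ratelimited(KERN_DEBUG "proc_lookup_de: hit %s\n", de->name);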

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [CFT] Re: VFS deadlock ?
  2013-03-22  6:09                                             ` Al Viro
@ 2013-03-22  6:22                                               ` Al Viro
  0 siblings, 0 replies; 45+ messages in thread
From: Al Viro @ 2013-03-22  6:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Fri, Mar 22, 2013 at 06:09:53AM +0000, Al Viro wrote:

> Hell knows...  At the very least, I'd expect /proc/self to be fairly hot.
> During boot - /proc/mounts, /proc/filesystems, /proc/cmdline...
> Dunno.  Would be interesting to slap a printk into proc_lookup_de() and
> see how much (and what) it hits on a busy system...

FWIW, I've just done a quick and dirty stats collector - added a hit count
into proc_dir_entry + a /proc/counts file walking the pde tree and dumping the
counts.  Left it running in a practically idle kvm testbox; after boot it
shows
kmsg => 2   
kcore => 4
version => 4   
uptime => 3   
stat => 38 
meminfo => 22 
loadavg => 2
devices => 4   
consoles => 4  
cmdline => 109
filesystems => 118
swaps => 8   
modules => 2
misc => 1  
acpi => 6
fb => 1
counts => 14
sys => 1
bus => 1
fs => 1
net => 18
mounts => 10  
self => 123

IOW, counts on this one are very low so far.  OTOH, that kvm image is
practically idle, doesn't have desktop shite on it, etc.  This really
ought to be checked on something busy...  The first impression is that
the stuff outside of /proc/<pid> and /proc/sys isn't hot enough to care
about any cache retention...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [CFT] Re: VFS deadlock ?
  2013-03-22  5:33                                           ` Linus Torvalds
  2013-03-22  6:09                                             ` Al Viro
@ 2013-03-22 16:23                                             ` Dave Jones
  2013-03-22 19:43                                             ` Linus Torvalds
  2 siblings, 0 replies; 45+ messages in thread
From: Dave Jones @ 2013-03-22 16:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 10:33:27PM -0700, Linus Torvalds wrote:

 > Ok, that's kind of ugly, but shouldn't be a correctness issue. It
 > should still work - just cycle through inodes quite aggressively due to
 > no longer re-using them - so I suspect Dave could test it (with the
 > extra line removed I pointed out just a moment ago).

Been running a few hours now without incident.
Of course, proving absence of a bug is always fun when fuzzing, as it
may have just not run long enough yet.. though I was hitting it pretty
quickly last night.

	Dave

 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [CFT] Re: VFS deadlock ?
  2013-03-22  5:33                                           ` Linus Torvalds
  2013-03-22  6:09                                             ` Al Viro
  2013-03-22 16:23                                             ` Dave Jones
@ 2013-03-22 19:43                                             ` Linus Torvalds
  2013-03-22 21:28                                               ` Al Viro
  2013-03-22 22:57                                               ` Eric W. Biederman
  2 siblings, 2 replies; 45+ messages in thread
From: Linus Torvalds @ 2013-03-22 19:43 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Thu, Mar 21, 2013 at 10:33 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And I wonder how big of a deal the aggressive dentry deletion is.
> Maybe it's even ok to allocate/free the inodes all the time. The whole
> "get the inode hash lock and look it up there" can't be all that
> wonderful either. It takes the inode->i_lock for each entry it finds
> on the hash list, which looks horrible. I suspect our slab allocator
> isn't much worse than that, although the RCU freeing of the inodes
> could end up being problematic.

I tested this out by just having a process that keeps stat'ing the
same /proc inode over and over and over again, which should be pretty
much the worst-case situation (because we will delete the dentry after
each stat, and so we'll repopulate it for each stat).
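
The load generator can be as dumb as this - the file choice here is my
stand-in, the test just needs some proc_lookup_de()-backed entry:

#include <sys/stat.h>

int main(void)
{
	struct stat st;

	/* hammer one procfs entry; every stat() repopulates the dentry */
	for (;;)
		stat("/proc/loadavg", &st);
	return 0;
}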

The allocation overhead seems to be in the noise. The real costs are
all in proc_lookup_de() just finding the entry, with strlen/strcmp
being much hotter. So if we actually care about performance for these
cases, what we would actually want to do is to just cache the dentries
and have some kind of timeout instead of getting rid of them
immediately. That would get rid of the lookup costs, and in the
process also get rid of the constant inode allocation/deallocation
costs.

There was also some locking overhead, but that's also mainly
dentry-related, with the inode side being visible but not dominant.
Again, not immediately expiring the dentries would make all that just
go away.

Regardless, the patch seems to be the right thing to do. It actually
simplifies /proc, seems to fix the problem for Dave, and is small and
should be trivial to backport. I've committed it and marked it for
stable. Let's hope there won't be any nasty surprises, but it does
seem to be the simplest non-hacky patch.

                   Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [CFT] Re: VFS deadlock ?
  2013-03-22 19:43                                             ` Linus Torvalds
@ 2013-03-22 21:28                                               ` Al Viro
  2013-03-22 22:57                                               ` Eric W. Biederman
  1 sibling, 0 replies; 45+ messages in thread
From: Al Viro @ 2013-03-22 21:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, Eric W. Biederman

On Fri, Mar 22, 2013 at 12:43:50PM -0700, Linus Torvalds wrote:
> I tested this out by just having a process that keeps stat'ing the
> same /proc inode over and over and over again, which should be pretty
> much the worst-case situation (because we will delete the dentry after
> each stat, and so we'll repopulate it for each stat).
> 
> The allocation overhead seems to be in the noise. The real costs are
> all in proc_lookup_de() just finding the entry, with strlen/strcmp
> being much hotter. So if we actually care about performance for these
> cases, what we would actually want to do is to just cache the dentries
> and have some kind of timeout instead of getting rid of them
> immediately. That would get rid of the lookup costs, and in the
> process also get rid of the constant inode allocation/deallocation
> costs.

As far as I can see, the whole thing is fairly cold, both the globals
and per-netns stuff...  /proc/<pid> is a different story, and /proc/sys
can also get hit quite a bit, but those... nope.

> There was also some locking overhead, but that's also mainly
> dentry-related, with the inode side being visible but not dominant.
> Again, not immediately expiring the dentries would make all that just
> go away.
> 
> Regardless, the patch seems to be the right thing to do. It actually
> simplifies /proc, seems to fix the problem for Dave, and is small and
> should be trivial to backport. I've committed it and marked it for
> stable. Let's hope there won't be any nasty surprises, but it does
> seem to be the simplest non-hacky patch.

ACK.  It might make sense to look into making procfs retain dentries
better, but I seriously suspect that we would get a much bigger win simply
from
	* refusing to create an entry with a numeric name in the procfs root
	* reordering proc_root_lookup() - try the per-process lookup first.
The thing is, proc_pid_lookup() starts by checking whether the name is a decimal
number; that'll immediately reject the ones that should be handled by
proc_lookup().  A proc_lookup() miss, OTOH, costs more.  Sure, the length
won't match for most of them, but grabbing the spinlock / walking the list /
comparing the length for each entry / dropping the spinlock is going to be
more costly than checking that a short string isn't a decimal number.  And
we look there for /proc/<pid>/something a lot more...

IOW, I suspect that something like (very lightly tested) patch below would
be more useful than anything we can get from playing with dcache retention.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 69078c7..02b07c7 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2727,12 +2727,12 @@ out:
 
 struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, unsigned int flags)
 {
-	struct dentry *result = NULL;
+	struct dentry *result = ERR_PTR(-ENOENT);
 	struct task_struct *task;
 	unsigned tgid;
 	struct pid_namespace *ns;
 
-	tgid = name_to_int(dentry);
+	tgid = name_to_int(&dentry->d_name);
 	if (tgid == ~0U)
 		goto out;
 
@@ -2990,7 +2990,7 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	if (!leader)
 		goto out_no_task;
 
-	tid = name_to_int(dentry);
+	tid = name_to_int(&dentry->d_name);
 	if (tid == ~0U)
 		goto out;
 
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index d7a4a28..5f4286c 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -205,7 +205,7 @@ static struct dentry *proc_lookupfd_common(struct inode *dir,
 {
 	struct task_struct *task = get_proc_task(dir);
 	struct dentry *result = ERR_PTR(-ENOENT);
-	unsigned fd = name_to_int(dentry);
+	unsigned fd = name_to_int(&dentry->d_name);
 
 	if (!task)
 		goto out_no_task;
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 4b3b3ff..e13d2da 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -580,7 +580,7 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent,
 {
 	struct proc_dir_entry *ent = NULL;
 	const char *fn = name;
-	unsigned int len;
+	struct qstr q;
 
 	/* make sure name is valid */
 	if (!name || !strlen(name))
@@ -593,14 +593,17 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent,
 	if (strchr(fn, '/'))
 		goto out;
 
-	len = strlen(fn);
+	q.name = fn;
+	q.len = strlen(fn);
+	if (*parent == &proc_root && name_to_int(&q) != ~0U)
+		return NULL;
 
-	ent = kzalloc(sizeof(struct proc_dir_entry) + len + 1, GFP_KERNEL);
+	ent = kzalloc(sizeof(struct proc_dir_entry) + q.len + 1, GFP_KERNEL);
 	if (!ent)
 		goto out;
 
-	memcpy(ent->name, fn, len + 1);
-	ent->namelen = len;
+	memcpy(ent->name, q.name, q.len + 1);
+	ent->namelen = q.len;
 	ent->mode = mode;
 	ent->nlink = nlink;
 	atomic_set(&ent->count, 1);
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 85ff3a4..fba6da2 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -123,10 +123,10 @@ static inline int pid_delete_dentry(const struct dentry * dentry)
 	return !proc_pid(dentry->d_inode)->tasks[PIDTYPE_PID].first;
 }
 
-static inline unsigned name_to_int(struct dentry *dentry)
+static inline unsigned name_to_int(struct qstr *qstr)
 {
-	const char *name = dentry->d_name.name;
-	int len = dentry->d_name.len;
+	const char *name = qstr->name;
+	int len = qstr->len;
 	unsigned n = 0;
 
 	if (len > 1 && *name == '0')
diff --git a/fs/proc/root.c b/fs/proc/root.c
index c6e9fac..67c9dc4 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -190,10 +190,10 @@ static int proc_root_getattr(struct vfsmount *mnt, struct dentry *dentry, struct
 
 static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentry, unsigned int flags)
 {
-	if (!proc_lookup(dir, dentry, flags))
+	if (!proc_pid_lookup(dir, dentry, flags))
 		return NULL;
 	
-	return proc_pid_lookup(dir, dentry, flags);
+	return proc_lookup(dir, dentry, flags);
 }
 
 static int proc_root_readdir(struct file * filp,


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [CFT] Re: VFS deadlock ?
  2013-03-22 19:43                                             ` Linus Torvalds
  2013-03-22 21:28                                               ` Al Viro
@ 2013-03-22 22:57                                               ` Eric W. Biederman
  1 sibling, 0 replies; 45+ messages in thread
From: Eric W. Biederman @ 2013-03-22 22:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Dave Jones, Linux Kernel

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Regardless, the patch seems to be the right thing to do. It actually
> simplifies /proc, seems to fix the problem for Dave, and is small and
> should be trivial to backport. I've committed it and marked it for
> stable. Let's hope there won't be any nasty surprises, but it does
> seem to be the simplest non-hacky patch.

Thanks for pushing this to completion.

I thought I had fixed this several years ago by not reusing the
directory inodes under /proc/net but apparently that never happened.

Eric

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2013-03-22 22:58 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-21 19:06 VFS deadlock ? Dave Jones
2013-03-21 19:21 ` Al Viro
2013-03-21 20:31   ` Dave Jones
2013-03-21 19:29 ` Al Viro
2013-03-21 20:15   ` Linus Torvalds
2013-03-21 20:26     ` Dave Jones
2013-03-21 20:32       ` Linus Torvalds
2013-03-21 20:36         ` Dave Jones
2013-03-21 20:47           ` Al Viro
2013-03-21 21:02             ` Dave Jones
2013-03-21 21:18               ` Linus Torvalds
2013-03-21 21:26                 ` Al Viro
2013-03-21 21:41                   ` Dave Jones
2013-03-21 21:47                     ` Linus Torvalds
2013-03-21 21:55                       ` Al Viro
2013-03-21 21:57                         ` Linus Torvalds
2013-03-21 22:03                           ` Al Viro
2013-03-21 21:52                     ` Al Viro
2013-03-21 22:12                 ` Dave Jones
2013-03-21 22:29                   ` Dave Jones
2013-03-21 22:53                   ` Linus Torvalds
2013-03-21 23:07                     ` Dave Jones
2013-03-21 23:36                     ` Al Viro
2013-03-21 23:58                       ` Linus Torvalds
2013-03-22  0:01                         ` Linus Torvalds
2013-03-22  0:12                           ` Al Viro
2013-03-22  0:20                             ` Al Viro
2013-03-22  0:22                             ` Linus Torvalds
2013-03-22  1:22                               ` Al Viro
2013-03-22  1:33                                 ` Linus Torvalds
2013-03-22  1:40                                   ` Al Viro
2013-03-22  4:37                                     ` [CFT] " Al Viro
2013-03-22  4:55                                       ` Linus Torvalds
2013-03-22  5:18                                         ` Al Viro
2013-03-22  5:33                                           ` Linus Torvalds
2013-03-22  6:09                                             ` Al Viro
2013-03-22  6:22                                               ` Al Viro
2013-03-22 16:23                                             ` Dave Jones
2013-03-22 19:43                                             ` Linus Torvalds
2013-03-22 21:28                                               ` Al Viro
2013-03-22 22:57                                               ` Eric W. Biederman
2013-03-22  5:19                                         ` Linus Torvalds
2013-03-22  0:08                         ` Al Viro
2013-03-22  0:15                           ` Linus Torvalds
2013-03-22  0:19                             ` Linus Torvalds
