* More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
@ 2016-06-17  4:09 Oleg Drokin
  2016-06-17  4:29 ` Al Viro
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Drokin @ 2016-06-17  4:09 UTC (permalink / raw)
  To: Al Viro, Trond Myklebust
  Cc: linux-nfs, Mailing List, <linux-fsdevel@vger.kernel.org>,
	idryomov, sage, zyan

Hello!

   So I think I finally kind of understand this other d_splice_alias problem
   I've been having.

   Imagine that we have a negative dentry in the cache.

   Now two threads try to open this file in parallel. The fs is NFS, so
   ->atomic_open is defined; it's not a create, so the parent is locked in
   shared mode, and they both first enter lookup_open() and then (because
   the dentry is negative) nfs_atomic_open().

   In nfs_atomic_open() the two threads run alongside each other trying to
   open this file, and both fail (the code includes Al's earlier patch for
   the other d_splice_alias problem).

        inode = NFS_PROTO(dir)->open_context(dir, ctx, open_flags, &attr, opened);
        if (IS_ERR(inode)) {
                err = PTR_ERR(inode);
                trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
                put_nfs_open_context(ctx);
                d_drop(dentry);
. . .
                case -ENOTDIR:
                        goto no_open;
. . .
no_open:
        res = nfs_lookup(dir, dentry, lookup_flags);


    So they both do d_drop(), the dentry is now unhashed, and they both
    dive into nfs_lookup().
    There eventually they both call

	res = d_splice_alias(inode, dentry);

   And so the first lucky one continues on its merry way with a hashed dentry,
   but the other, less lucky one ends up calling into d_splice_alias() with a
   dentry that's already hashed and hits the very familiar assertion.
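
   (For reference, the assertion in question is the unhashed check at the top
   of d_splice_alias(); quoting fs/dcache.c roughly from memory:

	struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
	{
		...
		BUG_ON(!d_unhashed(dentry));
		...

   so any caller that arrives with an already-hashed dentry trips it.)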

   I took a brief look at ceph and it looks like a very similar thing might
   happen there, with handle_reply() for two parallel replies calling into
   ceph_fill_trace() and then splice_alias()->d_splice_alias(). The unhashed
   check it does is not under any lock, so it's unsafe, and the problem might
   be more generic than just NFS.

   So I wonder how best to fix this? Hold some sort of dentry lock across the
   call into ->atomic_open() in the VFS? We cannot just make d_splice_alias()
   callers take inode->i_lock, because the dentry might be negative.

   I know this is really happening because I in fact have 3 threads doing the above
   described race, one luckily passing through and two hitting the assertion:
PID: 2689619  TASK: ffff8800ca740540  CPU: 6   COMMAND: "ls"
 #0 [ffff8800c6da3758] machine_kexec at ffffffff81042462
 #1 [ffff8800c6da37b8] __crash_kexec at ffffffff811365ff
 #2 [ffff8800c6da3880] __crash_kexec at ffffffff811366d5
 #3 [ffff8800c6da3898] crash_kexec at ffffffff8113671b
 #4 [ffff8800c6da38b8] oops_end at ffffffff81018cb4
 #5 [ffff8800c6da38e0] die at ffffffff8101917b
 #6 [ffff8800c6da3910] do_trap at ffffffff81015b92
 #7 [ffff8800c6da3960] do_error_trap at ffffffff81015d8b
 #8 [ffff8800c6da3a20] do_invalid_op at ffffffff810166e0
 #9 [ffff8800c6da3a30] invalid_op at ffffffff8188c91e
    [exception RIP: d_splice_alias+478]
    RIP: ffffffff8128784e  RSP: ffff8800c6da3ae8  RFLAGS: 00010286
    RAX: ffff8800d8661ab8  RBX: ffff8800d3b6b178  RCX: ffff8800ca740540
    RDX: ffffffff81382f09  RSI: ffff8800d3b6b178  RDI: ffff8800d8661ab8
    RBP: ffff8800c6da3b20   R8: 0000000000000000   R9: 0000000000000000
    R10: ffff8800ca740540  R11: 0000000000000000  R12: ffff8800d6491600
    R13: 0000000000000102  R14: ffff8800d39063d0  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8800c6da3b28] nfs_lookup at ffffffff8137cd1c
#11 [ffff8800c6da3b80] nfs_atomic_open at ffffffff8137ed61
#12 [ffff8800c6da3c30] lookup_open at ffffffff8127901a
#13 [ffff8800c6da3d50] path_openat at ffffffff8127bf15
#14 [ffff8800c6da3dd8] do_filp_open at ffffffff8127d9b1
#15 [ffff8800c6da3ee0] do_sys_open at ffffffff81269f90
#16 [ffff8800c6da3f40] sys_open at ffffffff8126a09e
#17 [ffff8800c6da3f50] entry_SYSCALL_64_fastpath at ffffffff8188aebc
    RIP: 00007f92a76d4d20  RSP: 00007fff640c1160  RFLAGS: 00000206
    RAX: ffffffffffffffda  RBX: 0000000000000003  RCX: 00007f92a76d4d20
    RDX: 000055c9f5be9f50  RSI: 0000000000090800  RDI: 000055c9f5bea1d0
    RBP: 0000000000000000   R8: 000055c9f5bea340   R9: 0000000000000000
    R10: 000055c9f5bea300  R11: 0000000000000206  R12: 00007f92a800d6b0
    R13: 000055c9f5bf3000  R14: 000055c9f5bea3e0  R15: 0000000000000000
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

PID: 2689597  TASK: ffff880109b40bc0  CPU: 7   COMMAND: "ls"
 #0 [ffff8800c07178e0] die at ffffffff8101914e
 #1 [ffff8800c0717910] do_trap at ffffffff81015b92
 #2 [ffff8800c0717960] do_error_trap at ffffffff81015d8b
 #3 [ffff8800c0717a20] do_invalid_op at ffffffff810166e0
 #4 [ffff8800c0717a30] invalid_op at ffffffff8188c91e
    [exception RIP: d_splice_alias+478]
    RIP: ffffffff8128784e  RSP: ffff8800c0717ae8  RFLAGS: 00010286
    RAX: ffff8800d8661ab8  RBX: ffff8800d3b6b178  RCX: ffff880109b40bc0
    RDX: ffffffff81382f09  RSI: ffff8800d3b6b178  RDI: ffff8800d8661ab8
    RBP: ffff8800c0717b20   R8: 0000000000000000   R9: 0000000000000000
    R10: ffff880109b40bc0  R11: 0000000000000000  R12: ffff8800d6a29e40
    R13: 0000000000000102  R14: ffff8800d39063d0  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffff8800c0717b28] nfs_lookup at ffffffff8137cd1c
 #6 [ffff8800c0717b80] nfs_atomic_open at ffffffff8137ed61
 #7 [ffff8800c0717c30] lookup_open at ffffffff8127901a
 #8 [ffff8800c0717d50] path_openat at ffffffff8127bf15
 #9 [ffff8800c0717dd8] do_filp_open at ffffffff8127d9b1
#10 [ffff8800c0717ee0] do_sys_open at ffffffff81269f90
#11 [ffff8800c0717f40] sys_open at ffffffff8126a09e
#12 [ffff8800c0717f50] entry_SYSCALL_64_fastpath at ffffffff8188aebc
    RIP: 00007f5cef6f8d20  RSP: 00007ffcc98f87b0  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 0000000000000003  RCX: 00007f5cef6f8d20
    RDX: 000055de64230f50  RSI: 0000000000090800  RDI: 000055de642311d0
    RBP: 0000000000000000   R8: 000055de64231320   R9: 0000000000000000
    R10: 000055de64231300  R11: 0000000000000202  R12: 00007f5cf00316b0
    R13: 000055de6423a000  R14: 000055de642313c0  R15: 0000000000000000
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

The dentry is 0xffff8800d3b6b178 in both cases.


* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-06-17  4:09 More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes Oleg Drokin
@ 2016-06-17  4:29 ` Al Viro
  2016-06-25 16:38   ` Oleg Drokin
  0 siblings, 1 reply; 38+ messages in thread
From: Al Viro @ 2016-06-17  4:29 UTC (permalink / raw)
  To: Oleg Drokin
  Cc: Trond Myklebust, linux-nfs, Mailing List,
	<linux-fsdevel@vger.kernel.org>,
	idryomov, sage, zyan

On Fri, Jun 17, 2016 at 12:09:19AM -0400, Oleg Drokin wrote:

>     So they both do d_drop(), the dentry is now unhashed, and they both
>     dive into nfs_lookup().
>     There eventually they both call
> 
> 	res = d_splice_alias(inode, dentry);
> 
>    And so the first lucky one continues on it's merry way with a hashed dentry,
>    but the other less lucky one ends up calling into d_splice_alias() with
>    dentry that's already hashed and hits the very familiar assertion.
> 
>    I took a brief look into ceph and it looks like a very similar thing
>    might happen there with handle_reply() for two parallel replies calling into
>    ceph_fill_trace() and then splice_alias()->d_splice_alias(), since the
>    unhashed check it does is not under any locks, it's unsafe, so the problem
>    might be more generic than just NFS too.
> 
>    So I wonder how to best fix this? Holding some sort of dentry lock across a call
>    into atomic_open in VFS? We cannot just make d_splice_alias() callers call with
>    inode->i_lock held because dentry might be negative.

Oh, lovely...  So basically the problem is that we violate the "no lookups on
the same name in parallel" rule on those fallbacks from foo_atomic_open() to
foo_lookup().  The thing is, a lot of ->atomic_open() instances have such
fallbacks and I wonder if that's a sign that we need to lift some of that
to fs/namei.c...

Hell knows; alternative is to have that d_drop() followed by d_alloc_parallel()
and feeding that dentry to lookup.  I'll play with that a bit and see what's
better; hopefully I'll have something by tomorrow.


* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-06-17  4:29 ` Al Viro
@ 2016-06-25 16:38   ` Oleg Drokin
  2016-07-03  6:29     ` Al Viro
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Drokin @ 2016-06-25 16:38 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>

Hello!

On Jun 17, 2016, at 12:29 AM, Al Viro wrote:

> On Fri, Jun 17, 2016 at 12:09:19AM -0400, Oleg Drokin wrote:
> 
>>    So they both do d_drop(), the dentry is now unhashed, and they both
>>    dive into nfs_lookup().
>>    There eventually they both call
>> 
>> 	res = d_splice_alias(inode, dentry);
>> 
>>   And so the first lucky one continues on it's merry way with a hashed dentry,
>>   but the other less lucky one ends up calling into d_splice_alias() with
>>   dentry that's already hashed and hits the very familiar assertion.
>> 
>>   I took a brief look into ceph and it looks like a very similar thing
>>   might happen there with handle_reply() for two parallel replies calling into
>>   ceph_fill_trace() and then splice_alias()->d_splice_alias(), since the
>>   unhashed check it does is not under any locks, it's unsafe, so the problem
>>   might be more generic than just NFS too.
>> 
>>   So I wonder how to best fix this? Holding some sort of dentry lock across a call
>>   into atomic_open in VFS? We cannot just make d_splice_alias() callers call with
>>   inode->i_lock held because dentry might be negative.
> 
> Oh, lovely...  So basically the problem is that we violate the "no lookups on
> the same name in parallel" rule on those fallbacks from foo_atomic_open() to
> foo_lookup().  The thing is, a lot of ->atomic_open() instances have such
> fallbacks and I wonder if that's a sign that we need to lift some of that
> to fs/namei.c...
> 
> Hell knows; alternative is to have that d_drop() followed by d_alloc_parallel()
> and feeding that dentry to lookup.  I'll play with that a bit and see what's
> better; hopefully I'll have something by tomorrow.

Sorry to nag you about this, but did any of those pan out?

d_alloc_parallel() sounds a bit too heavy there, esp. considering we came in with
a dentry already (though a potentially shared one, I understand).
Wouldn't it be better to try to establish some dentry locking rule for calling into
d_splice_alias() instead? At least then the callers could make sure the dentry does
not change under them.
Though I guess if there's dentry locking like that, we might as well do all the
checking in d_splice_alias(), but that means unhashed dentries would no
longer be disallowed, which is a change of semantics from what we have now.


* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-06-25 16:38   ` Oleg Drokin
@ 2016-07-03  6:29     ` Al Viro
  2016-07-04  0:08       ` Al Viro
                         ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Al Viro @ 2016-07-03  6:29 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>

On Sat, Jun 25, 2016 at 12:38:40PM -0400, Oleg Drokin wrote:

> Sorry to nag you about this, but did any of those pan out?
> 
> d_alloc_parallel() sounds like a bit too heavy there, esp. considering we came in with
> a dentry already (though a potentially shared one, I understand).
> Would not it be better to try and establish some dentry locking rule for calling into
> d_splice_alias() instead? At least then the callers can make sure the dentry does
> not change under them?
> Though I guess if there's dentry locking like that, we might as well do all the
> checking in d_splice_alias(), but that means the unhashed dentries would no
> longer be disallowed which is a change of semantic from now.--

FWIW, the only interesting case here is this:
	* no O_CREAT in flags (otherwise the parent is held exclusive).
	* dentry is found in hash
	* dentry is negative
	* dentry has passed ->d_revalidate() (i.e. in case of
NFS it had nfs_neg_need_reval() return false).

Only two instances are non-trivial in that respect - NFS and Lustre.
Everything else will simply fail open() with ENOENT in that case.

And at least for NFS we could bloody well do d_drop + d_alloc_parallel +
finish_no_open and bugger off in case it's not in_lookup, otherwise do
pretty much what we do in case we'd got in_lookup from the very beginning.
Some adjustments are needed for that case (basically, we need to make
sure we hit d_lookup_done() matching that d_alloc_parallel() and deal
with refcounting correctly).

Tentative NFS patch follows; I don't understand Lustre well enough, but it
looks like a plausible strategy there as well.

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index d8015a03..5474e39 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1485,11 +1485,13 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 		    struct file *file, unsigned open_flags,
 		    umode_t mode, int *opened)
 {
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	struct nfs_open_context *ctx;
 	struct dentry *res;
 	struct iattr attr = { .ia_valid = ATTR_OPEN };
 	struct inode *inode;
 	unsigned int lookup_flags = 0;
+	bool switched = false;
 	int err;
 
 	/* Expect a negative dentry */
@@ -1528,6 +1530,17 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 		attr.ia_size = 0;
 	}
 
+	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
+		d_drop(dentry);
+		switched = true;
+		dentry = d_alloc_parallel(dentry->d_parent,
+					  &dentry->d_name, &wq);
+		if (IS_ERR(dentry))
+			return PTR_ERR(dentry);
+		if (unlikely(!d_in_lookup(dentry)))
+			return finish_no_open(file, dentry);
+	}
+
 	ctx = create_nfs_open_context(dentry, open_flags);
 	err = PTR_ERR(ctx);
 	if (IS_ERR(ctx))
@@ -1563,14 +1576,23 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 	trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
 	put_nfs_open_context(ctx);
 out:
+	if (unlikely(switched)) {
+		d_lookup_done(dentry);
+		dput(dentry);
+	}
 	return err;
 
 no_open:
 	res = nfs_lookup(dir, dentry, lookup_flags);
-	err = PTR_ERR(res);
+	if (switched) {
+		d_lookup_done(dentry);
+		if (!res)
+			res = dentry;
+		else
+			dput(dentry);
+	}
 	if (IS_ERR(res))
-		goto out;
-
+		return PTR_ERR(res);
 	return finish_no_open(file, res);
 }
 EXPORT_SYMBOL_GPL(nfs_atomic_open);


* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-03  6:29     ` Al Viro
@ 2016-07-04  0:08       ` Al Viro
  2016-07-04  0:37           ` Oleg Drokin
  2016-07-05  2:28       ` Oleg Drokin
  2016-07-05  6:22       ` Oleg Drokin
  2 siblings, 1 reply; 38+ messages in thread
From: Al Viro @ 2016-07-04  0:08 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>

On Sun, Jul 03, 2016 at 07:29:46AM +0100, Al Viro wrote:

> Tentative NFS patch follows; I don't understand Lustre well enough, but it
> looks like a plausible strategy there as well.

Speaking of Lustre: WTF is
                        /* Open dentry. */
                        if (S_ISFIFO(d_inode(dentry)->i_mode)) {
                                /* We cannot call open here as it would
                                 * deadlock.
                                 */
                                if (it_disposition(it, DISP_ENQ_OPEN_REF))
                                        ptlrpc_req_finished(
                                                       (struct ptlrpc_request *)
                                                          it->d.lustre.it_data);
                                rc = finish_no_open(file, de);
                        } else {
about and why do we only do that to FIFOs?  What about symlinks or device
nodes?  Directories, for that matter...  Shouldn't that be if (!S_ISREG(...))
instead?


* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-04  0:08       ` Al Viro
@ 2016-07-04  0:37           ` Oleg Drokin
  0 siblings, 0 replies; 38+ messages in thread
From: Oleg Drokin @ 2016-07-04  0:37 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 3, 2016, at 8:08 PM, Al Viro wrote:

> On Sun, Jul 03, 2016 at 07:29:46AM +0100, Al Viro wrote:
> 
>> Tentative NFS patch follows; I don't understand Lustre well enough, but it
>> looks like a plausible strategy there as well.
> 
> Speaking of Lustre: WTF is
>                        /* Open dentry. */
>                        if (S_ISFIFO(d_inode(dentry)->i_mode)) {
>                                /* We cannot call open here as it would
>                                 * deadlock.
>                                 */
>                                if (it_disposition(it, DISP_ENQ_OPEN_REF))
>                                        ptlrpc_req_finished(
>                                                       (struct ptlrpc_request *)
>                                                          it->d.lustre.it_data);
>                                rc = finish_no_open(file, de);
>                        } else {
> about and why do we only do that to FIFOs?  What about symlinks or device
> nodes?  Directories, for that matter...  Shouldn't that be if (!S_ISREG(...))
> instead?

Hm… This dates to sometime in 2006 and my memory is a bit hazy here.

I think when we called into the open, it went into the fifo open and got stuck
there waiting for the other opener. Something like that. And we cannot really
afford to be stuck here, because we are holding some locks that need to be
released in a predictable time.

This code is actually unreachable now because the server never returns an open
handle for special device nodes anymore (there's a comment about it in the
current staging tree, but I guess you are looking at some prior version).

I imagine device nodes might have represented a similar risk too, but it did not
occur to me to test it separately and the testsuite does not do it either.

Directories do not get stuck when you open them so they are ok and we can
atomically open them too, I guess.
Symlinks are handled specially on the server and the open never returns
the actual open handle for those, so this path is also unreachable with those.


* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-04  0:37           ` Oleg Drokin
  (?)
@ 2016-07-04  3:08           ` Al Viro
  2016-07-04  3:55               ` Oleg Drokin
  -1 siblings, 1 reply; 38+ messages in thread
From: Al Viro @ 2016-07-04  3:08 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>

On Sun, Jul 03, 2016 at 08:37:22PM -0400, Oleg Drokin wrote:

> Hm… This dates to sometime in 2006 and my memory is a bit hazy here.
> 
> I think when we called into the open, it went into fifo open and stuck there
> waiting for the other opener. Something like that. And we cannot really be stuck here
> because we are holding some locks that need to be released in predictable time.
> 
> This code is actually unreachable now because the server never returns an openhandle
> for special device nodes anymore (there's a comment about it in current staging tree,
> but I guess you are looking at some prior version).
> 
> I imagine device nodes might have represented a similar risk too, but it did not
> occur to me to test it separately and the testsuite does not do it either.
> 
> Directories do not get stuck when you open them so they are ok and we can
> atomically open them too, I guess.
> Symlinks are handled specially on the server and the open never returns
> the actual open handle for those, so this path is also unreachable with those.

Hmm...  How much does the safety of the client depend upon the correctness of
the server?

BTW, there's a fun issue in ll_revalidate_dentry(): there's nothing to
promise stability of ->d_parent in there, so uses of d_inode(dentry->d_parent)
are not safe.  That's independent from parallel lookups, and it's hard
to hit, but AFAICS it's not impossible to oops there.

Anyway, for Lustre the analogue of that NFS problem is here:
        } else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
                   !it_disposition(it, DISP_OPEN_CREATE)) {
                /* With DISP_OPEN_CREATE dentry will be
                 * instantiated in ll_create_it.
                 */
                LASSERT(!d_inode(*de));
                d_instantiate(*de, inode);
        }
AFAICS, this (on top of mainline) ought to work:

diff --git a/drivers/staging/lustre/lustre/llite/namei.c b/drivers/staging/lustre/lustre/llite/namei.c
index 5eba0eb..b8da5b4 100644
--- a/drivers/staging/lustre/lustre/llite/namei.c
+++ b/drivers/staging/lustre/lustre/llite/namei.c
@@ -581,9 +581,11 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 			  struct file *file, unsigned open_flags,
 			  umode_t mode, int *opened)
 {
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	struct lookup_intent *it;
 	struct dentry *de;
 	long long lookup_flags = LOOKUP_OPEN;
+	bool switched = false;
 	int rc = 0;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:name=%pd, dir="DFID"(%p),file %p,open_flags %x,mode %x opened %d\n",
@@ -603,11 +605,28 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 	it->it_flags = (open_flags & ~O_ACCMODE) | OPEN_FMODE(open_flags);
 
 	/* Dentry added to dcache tree in ll_lookup_it */
+	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
+		d_drop(dentry);
+		switched = true;
+	        dentry = d_alloc_parallel(dentry->d_parent,
+					  &dentry->d_name, &wq);
+		if (IS_ERR(dentry)) {
+			rc = PTR_ERR(dentry);
+			goto out_release;
+		}
+		if (unlikely(!d_in_lookup(dentry))) {
+			rc = finish_no_open(file, dentry);
+			goto out_release;
+		}
+	}
+
 	de = ll_lookup_it(dir, dentry, it, lookup_flags);
 	if (IS_ERR(de))
 		rc = PTR_ERR(de);
 	else if (de)
 		dentry = de;
+	else if (switched)
+		de = dget(dentry);
 
 	if (!rc) {
 		if (it_disposition(it, DISP_OPEN_CREATE)) {
@@ -648,6 +667,10 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 	}
 
 out_release:
+	if (unlikely(switched)) {
+		d_lookup_done(dentry);
+		dput(dentry);
+	}
 	ll_intent_release(it);
 	kfree(it);
 


* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-04  3:08           ` Al Viro
@ 2016-07-04  3:55               ` Oleg Drokin
  0 siblings, 0 replies; 38+ messages in thread
From: Oleg Drokin @ 2016-07-04  3:55 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 3, 2016, at 11:08 PM, Al Viro wrote:

> On Sun, Jul 03, 2016 at 08:37:22PM -0400, Oleg Drokin wrote:
> 
>> Hm… This dates to sometime in 2006 and my memory is a bit hazy here.
>> 
>> I think when we called into the open, it went into fifo open and stuck there
>> waiting for the other opener. Something like that. And we cannot really be stuck here
>> because we are holding some locks that need to be released in predictable time.
>> 
>> This code is actually unreachable now because the server never returns an openhandle
>> for special device nodes anymore (there's a comment about it in current staging tree,
>> but I guess you are looking at some prior version).
>> 
>> I imagine device nodes might have represented a similar risk too, but it did not
>> occur to me to test it separately and the testsuite does not do it either.
>> 
>> Directories do not get stuck when you open them so they are ok and we can
>> atomically open them too, I guess.
>> Symlinks are handled specially on the server and the open never returns
>> the actual open handle for those, so this path is also unreachable with those.
> 
> Hmm...  How much does the safety of client depend upon the correctness of
> server?

Quite a bit, actually. If you connect to a rogue Lustre server,
there are currently many ways it can crash the client.
I suspect this is true not just of Lustre: if, e.g., an NFS server starts
sending directory inodes with duplicated inode numbers or some such,
the VFS would not be super happy about such "hardlinked" directories either.
That is before we even consider that the server can feed you garbage data
to crash your apps (or substitute binaries that do something else).

> BTW, there's a fun issue in ll_revalidate_dentry(): there's nothing to
> promise stability of ->d_parent in there, so uses of d_inode(dentry->d_parent)

Yes, we actually had a discussion about that in March, we were not the only ones
affected, and I think it was decided that dget_parent() was a better solution
to get to the parent (I see ext4 has already converted).
I believe you cannot hit it in Lustre now due to Lustre locking magic, but
I'll create a patch to cover this anyway. Thanks for reminding me about this.
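
Roughly the pattern I have in mind there (just a sketch of the dget_parent()
approach, not the actual patch):

	struct dentry *parent = dget_parent(dentry);
	struct inode *dir = d_inode(parent);	/* stable while we hold the ref */

	/* ... use dir instead of d_inode(dentry->d_parent) ... */

	dput(parent);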

> are not safe.  That's independent from parallel lookups, and it's hard
> to hit, but AFAICS it's not impossible to oops there.
> 
> Anyway, for Lustre the analogue of that NFS problem is here:
>        } else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
>                   !it_disposition(it, DISP_OPEN_CREATE)) {
>                /* With DISP_OPEN_CREATE dentry will be
>                 * instantiated in ll_create_it.
>                 */
>                LASSERT(!d_inode(*de));
>                d_instantiate(*de, inode);
>        }

Hm… Do you mean the case where we come in hashed here, with a negative dentry
and a positive disposition, and hit the assertion about the inode not being NULL
(the dentry still staying negative, basically)?
This one we cannot hit because negative dentries are protected by a Lustre
DLM lock held by the parent directory. Any create in that parent directory
would invalidate the lock, and once that happens, all negative dentries would
be killed.
Hmm… This probably means this is dead code?
Ah, I guess it's not: if we do a lookup and find this negative dentry (from
2+ threads) and THEN it gets invalidated, and our two threads both race to
instantiate it...
It does look like something that is quite hard to hit, but it still looks
like a race that could happen.

> AFAICS, this (on top of mainline) ought to work:

Thanks, I'll give this a try.
> 
> diff --git a/drivers/staging/lustre/lustre/llite/namei.c b/drivers/staging/lustre/lustre/llite/namei.c
> index 5eba0eb..b8da5b4 100644
> --- a/drivers/staging/lustre/lustre/llite/namei.c
> +++ b/drivers/staging/lustre/lustre/llite/namei.c
> @@ -581,9 +581,11 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
> 			  struct file *file, unsigned open_flags,
> 			  umode_t mode, int *opened)
> {
> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
> 	struct lookup_intent *it;
> 	struct dentry *de;
> 	long long lookup_flags = LOOKUP_OPEN;
> +	bool switched = false;
> 	int rc = 0;
> 
> 	CDEBUG(D_VFSTRACE, "VFS Op:name=%pd, dir="DFID"(%p),file %p,open_flags %x,mode %x opened %d\n",
> @@ -603,11 +605,28 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
> 	it->it_flags = (open_flags & ~O_ACCMODE) | OPEN_FMODE(open_flags);
> 
> 	/* Dentry added to dcache tree in ll_lookup_it */
> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
> +		d_drop(dentry);
> +		switched = true;
> +	        dentry = d_alloc_parallel(dentry->d_parent,
> +					  &dentry->d_name, &wq);
> +		if (IS_ERR(dentry)) {
> +			rc = PTR_ERR(dentry);
> +			goto out_release;
> +		}
> +		if (unlikely(!d_in_lookup(dentry))) {
> +			rc = finish_no_open(file, dentry);
> +			goto out_release;
> +		}
> +	}
> +
> 	de = ll_lookup_it(dir, dentry, it, lookup_flags);
> 	if (IS_ERR(de))
> 		rc = PTR_ERR(de);
> 	else if (de)
> 		dentry = de;
> +	else if (switched)
> +		de = dget(dentry);
> 
> 	if (!rc) {
> 		if (it_disposition(it, DISP_OPEN_CREATE)) {
> @@ -648,6 +667,10 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
> 	}
> 
> out_release:
> +	if (unlikely(switched)) {
> +		d_lookup_done(dentry);
> +		dput(dentry);
> +	}
> 	ll_intent_release(it);
> 	kfree(it);
> 

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-04  3:55               ` Oleg Drokin
  (?)
@ 2016-07-05  2:25               ` Al Viro
  2016-07-10 17:01                 ` Oleg Drokin
  -1 siblings, 1 reply; 38+ messages in thread
From: Al Viro @ 2016-07-05  2:25 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>

On Sun, Jul 03, 2016 at 11:55:09PM -0400, Oleg Drokin wrote:
> Quite a bit, actually. If you connect to an rogue Lustre server,
> currently there are many ways it can crash the client.
> I suspect this is true not just of Lustre, if e.g. NFS server starts to
> send directory inodes with duplicated inode numbers or some such,
> VFS would not be super happy about such "hardlinked" directories either.
> This is before we even consider that it can feed you garbage data
> to crash your apps (or substitute binaries to do something else).

The NFS client is at least supposed to try to be resistant to that.  As in,
"if an 0wn3d NFS server can be escalated to a buggered client, it's a bug in
the client and we are expected to try and fix it".

[snip]
> Thanks, I'll give this a try.

BTW, could you take a look at
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git#sendmsg.lustre?
It's a bunch of simplifications that became possible once sendmsg()/recvmsg()
switched to iov_iter, stopped mangling the iovecs and went for predictable
behaviour re advancing the iterator.


* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-03  6:29     ` Al Viro
  2016-07-04  0:08       ` Al Viro
@ 2016-07-05  2:28       ` Oleg Drokin
  2016-07-05  2:32         ` Oleg Drokin
  2016-07-05  4:43         ` Oleg Drokin
  2016-07-05  6:22       ` Oleg Drokin
  2 siblings, 2 replies; 38+ messages in thread
From: Oleg Drokin @ 2016-07-05  2:28 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 3, 2016, at 2:29 AM, Al Viro wrote:

> On Sat, Jun 25, 2016 at 12:38:40PM -0400, Oleg Drokin wrote:
> 
>> Sorry to nag you about this, but did any of those pan out?
>> 
>> d_alloc_parallel() sounds like a bit too heavy there, esp. considering we came in with
>> a dentry already (though a potentially shared one, I understand).
>> Would not it be better to try and establish some dentry locking rule for calling into
>> d_splice_alias() instead? At least then the callers can make sure the dentry does
>> not change under them?
>> Though I guess if there's dentry locking like that, we might as well do all the
>> checking in d_splice_alias(), but that means the unhashed dentries would no
>> longer be disallowed which is a change of semantic from now.--
> 
> FWIW, the only interesting case here is this:
> 	* no O_CREAT in flags (otherwise the parent is held exclusive).
> 	* dentry is found in hash
> 	* dentry is negative
> 	* dentry has passed ->d_revalidate() (i.e. in case of
> NFS it had nfs_neg_need_reval() return false).
> 
> Only two instances are non-trivial in that respect - NFS and Lustre.
> Everything else will simply fail open() with ENOENT in that case.
> 
> And at least for NFS we could bloody well do d_drop + d_alloc_parallel +
> finish_no_open and bugger off in case it's not in_lookup, otherwise do
> pretty much what we do in case we'd got in_lookup from the very beginning.
> Some adjustments are needed for that case (basically, we need to make
> sure we hit d_lookup_done() matching that d_alloc_parallel() and deal
> with refcounting correctly).
> 
> Tentative NFS patch follows; I don't understand Lustre well enough, but it
> looks like a plausible strategy there as well.

This patch seems to have brought the other crash back in (or something similar),
the one with a negative dentry being hashed when it's already hashed.
It's not as easy to hit as before, but a lot easier than the race we are hitting here.

Also, in all cases it is ls that is crashing now, which seems to suggest it is
taking some of the new paths added here.
I have half a dozen crash dumps if you want me to look something up in them.
This is on top of 4.7.0-rc6 (a99cde438de0c4c0cecc1d1af1a55a75b10bfdef)
with just your patch on top.

I can reproduce it with both the complex workload and a simplified one (attached);
it takes anywhere from 5 to 30 minutes to hit.

> 
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index d8015a03..5474e39 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1485,11 +1485,13 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
> 		    struct file *file, unsigned open_flags,
> 		    umode_t mode, int *opened)
> {
> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
> 	struct nfs_open_context *ctx;
> 	struct dentry *res;
> 	struct iattr attr = { .ia_valid = ATTR_OPEN };
> 	struct inode *inode;
> 	unsigned int lookup_flags = 0;
> +	bool switched = false;
> 	int err;
> 
> 	/* Expect a negative dentry */
> @@ -1528,6 +1530,17 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
> 		attr.ia_size = 0;
> 	}
> 
> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
> +		d_drop(dentry);
> +		switched = true;
> +		dentry = d_alloc_parallel(dentry->d_parent,
> +					  &dentry->d_name, &wq);

Hm, it seems d_alloc_parallel() can return some preexisting dentry that it was
able to look up under RCU, attached to the same parent with the same name?
What's to stop a parallel thread from rehashing the dentry after we dropped it
here, so that d_alloc_parallel() happily finds it?
I am running a test to verify this theory now.

> +		if (IS_ERR(dentry))
> +			return PTR_ERR(dentry);
> +		if (unlikely(!d_in_lookup(dentry)))
> +			return finish_no_open(file, dentry);
> +	}
> +
> 	ctx = create_nfs_open_context(dentry, open_flags);
> 	err = PTR_ERR(ctx);
> 	if (IS_ERR(ctx))
> @@ -1563,14 +1576,23 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
> 	trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
> 	put_nfs_open_context(ctx);
> out:
> +	if (unlikely(switched)) {
> +		d_lookup_done(dentry);
> +		dput(dentry);
> +	}
> 	return err;
> 
> no_open:
> 	res = nfs_lookup(dir, dentry, lookup_flags);
> -	err = PTR_ERR(res);
> +	if (switched) {
> +		d_lookup_done(dentry);
> +		if (!res)
> +			res = dentry;
> +		else
> +			dput(dentry);
> +	}
> 	if (IS_ERR(res))
> -		goto out;
> -
> +		return PTR_ERR(res);
> 	return finish_no_open(file, res);
> }
> EXPORT_SYMBOL_GPL(nfs_atomic_open);


* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05  2:28       ` Oleg Drokin
@ 2016-07-05  2:32         ` Oleg Drokin
  2016-07-05  4:43         ` Oleg Drokin
  1 sibling, 0 replies; 38+ messages in thread
From: Oleg Drokin @ 2016-07-05  2:32 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>

[-- Attachment #1: Type: text/plain, Size: 59 bytes --]

ah,
and of course the testcase that I forgot to attach ;)


[-- Attachment #2: lstest.sh --]
[-- Type: application/octet-stream, Size: 1655 bytes --]

#!/bin/bash
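# Reproducer: build an ext4 image, export it over NFS, mount it back via both
# NFSv3 and NFSv4, then race one mkdir/mv/touch/rm loop on the exported fs
# against nine parallel "ls -lR" readers on the NFSv4 mount.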

cd /home/green/git/lustre-release/lustre/tests/racer
dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
mkfs.ext4 /tmp/loop
service rpcbind start
mount none /var/lib/nfs -t tmpfs
touch /var/lib/nfs/etab
service nfs-mountd start
mount /tmp/loop /mnt/lustre -o loop
service nfs-server start
mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock
mount localhost:/ /mnt/nfs2 -t nfs4
mkdir /mnt/lustre/racer

while :; do mkdir /mnt/lustre/racer/a ; mv /mnt/lustre/racer/a /mnt/lustre/racer/b ; touch /mnt/lustre/racer/a ; mv /mnt/lustre/racer/a /mnt/lustre/racer/b/ ; rm -rf /mnt/lustre/racer/b ; done &

while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &

wait %1

[-- Attachment #3: Type: text/plain, Size: 4773 bytes --]



On Jul 4, 2016, at 10:28 PM, Oleg Drokin wrote:

> 
> On Jul 3, 2016, at 2:29 AM, Al Viro wrote:
> 
>> On Sat, Jun 25, 2016 at 12:38:40PM -0400, Oleg Drokin wrote:
>> 
>>> Sorry to nag you about this, but did any of those pan out?
>>> 
>>> d_alloc_parallel() sounds like a bit too heavy there, esp. considering we came in with
>>> a dentry already (though a potentially shared one, I understand).
>>> Would not it be better to try and establish some dentry locking rule for calling into
>>> d_splice_alias() instead? At least then the callers can make sure the dentry does
>>> not change under them?
>>> Though I guess if there's dentry locking like that, we might as well do all the
>>> checking in d_splice_alias(), but that means the unhashed dentries would no
>>> longer be disallowed which is a change of semantic from now.--
>> 
>> FWIW, the only interesting case here is this:
>> 	* no O_CREAT in flags (otherwise the parent is held exclusive).
>> 	* dentry is found in hash
>> 	* dentry is negative
>> 	* dentry has passed ->d_revalidate() (i.e. in case of
>> NFS it had nfs_neg_need_reval() return false).
>> 
>> Only two instances are non-trivial in that respect - NFS and Lustre.
>> Everything else will simply fail open() with ENOENT in that case.
>> 
>> And at least for NFS we could bloody well do d_drop + d_alloc_parallel +
>> finish_no_open and bugger off in case it's not in_lookup, otherwise do
>> pretty much what we do in case we'd got in_lookup from the very beginning.
>> Some adjustments are needed for that case (basically, we need to make
>> sure we hit d_lookup_done() matching that d_alloc_parallel() and deal
>> with refcounting correctly).
>> 
>> Tentative NFS patch follows; I don't understand Lustre well enough, but it
>> looks like a plausible strategy there as well.
> 
> This patch seems to have brought the other crash back in (or something similar),
> the one with a negative dentry being hashed when it's already hashed.
> it's not as easy to hit as before, but a lot easier than the race we are hitting here.
> 
> Also in all cases it is ls that is crashing now, which seems to highlight it is
> taking some of this new paths added.
> I have half a dosen crashdumps if you want me to look something up there.
> This is on top of 4.7.0-rc6 a99cde438de0c4c0cecc1d1af1a55a75b10bfdef
> with just your patch on top.
> 
> I can reproduce it with both the complex workload, or with a simplified one (attached),
> takes anywhere from 5 to 30 minutes to hit.
> 
>> 
>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> index d8015a03..5474e39 100644
>> --- a/fs/nfs/dir.c
>> +++ b/fs/nfs/dir.c
>> @@ -1485,11 +1485,13 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>> 		    struct file *file, unsigned open_flags,
>> 		    umode_t mode, int *opened)
>> {
>> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>> 	struct nfs_open_context *ctx;
>> 	struct dentry *res;
>> 	struct iattr attr = { .ia_valid = ATTR_OPEN };
>> 	struct inode *inode;
>> 	unsigned int lookup_flags = 0;
>> +	bool switched = false;
>> 	int err;
>> 
>> 	/* Expect a negative dentry */
>> @@ -1528,6 +1530,17 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>> 		attr.ia_size = 0;
>> 	}
>> 
>> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
>> +		d_drop(dentry);
>> +		switched = true;
>> +		dentry = d_alloc_parallel(dentry->d_parent,
>> +					  &dentry->d_name, &wq);
> 
> Hm, d_alloc_parallel can return some preexisting dentry it was able to lookup in
> rcu attached to the same parent with the same name it seems?
> What's to stop a parallel thread to rehash the dentry after we dropped it here,
> and so d_alloc_parallel will happily find it?
> I am running a test to verify this theory now.
> 
>> +		if (IS_ERR(dentry))
>> +			return PTR_ERR(dentry);
>> +		if (unlikely(!d_in_lookup(dentry)))
>> +			return finish_no_open(file, dentry);
>> +	}
>> +
>> 	ctx = create_nfs_open_context(dentry, open_flags);
>> 	err = PTR_ERR(ctx);
>> 	if (IS_ERR(ctx))
>> @@ -1563,14 +1576,23 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>> 	trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
>> 	put_nfs_open_context(ctx);
>> out:
>> +	if (unlikely(switched)) {
>> +		d_lookup_done(dentry);
>> +		dput(dentry);
>> +	}
>> 	return err;
>> 
>> no_open:
>> 	res = nfs_lookup(dir, dentry, lookup_flags);
>> -	err = PTR_ERR(res);
>> +	if (switched) {
>> +		d_lookup_done(dentry);
>> +		if (!res)
>> +			res = dentry;
>> +		else
>> +			dput(dentry);
>> +	}
>> 	if (IS_ERR(res))
>> -		goto out;
>> -
>> +		return PTR_ERR(res);
>> 	return finish_no_open(file, res);
>> }
>> EXPORT_SYMBOL_GPL(nfs_atomic_open);
> 



* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05  2:28       ` Oleg Drokin
  2016-07-05  2:32         ` Oleg Drokin
@ 2016-07-05  4:43         ` Oleg Drokin
  1 sibling, 0 replies; 38+ messages in thread
From: Oleg Drokin @ 2016-07-05  4:43 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 4, 2016, at 10:28 PM, Oleg Drokin wrote:

> 
> On Jul 3, 2016, at 2:29 AM, Al Viro wrote:
> 
>> On Sat, Jun 25, 2016 at 12:38:40PM -0400, Oleg Drokin wrote:
>> 
>>> Sorry to nag you about this, but did any of those pan out?
>>> 
>>> d_alloc_parallel() sounds like a bit too heavy there, esp. considering we came in with
>>> a dentry already (though a potentially shared one, I understand).
>>> Would not it be better to try and establish some dentry locking rule for calling into
>>> d_splice_alias() instead? At least then the callers can make sure the dentry does
>>> not change under them?
>>> Though I guess if there's dentry locking like that, we might as well do all the
>>> checking in d_splice_alias(), but that means the unhashed dentries would no
>>> longer be disallowed which is a change of semantic from now.--
>> 
>> FWIW, the only interesting case here is this:
>> 	* no O_CREAT in flags (otherwise the parent is held exclusive).
>> 	* dentry is found in hash
>> 	* dentry is negative
>> 	* dentry has passed ->d_revalidate() (i.e. in case of
>> NFS it had nfs_neg_need_reval() return false).
>> 
>> Only two instances are non-trivial in that respect - NFS and Lustre.
>> Everything else will simply fail open() with ENOENT in that case.
>> 
>> And at least for NFS we could bloody well do d_drop + d_alloc_parallel +
>> finish_no_open and bugger off in case it's not in_lookup, otherwise do
>> pretty much what we do in case we'd got in_lookup from the very beginning.
>> Some adjustments are needed for that case (basically, we need to make
>> sure we hit d_lookup_done() matching that d_alloc_parallel() and deal
>> with refcounting correctly).
>> 
>> Tentative NFS patch follows; I don't understand Lustre well enough, but it
>> looks like a plausible strategy there as well.
> 
> This patch seems to have brought the other crash back in (or something similar),
> the one with a negative dentry being hashed when it's already hashed.
> it's not as easy to hit as before, but a lot easier than the race we are hitting here.
> 
> Also in all cases it is ls that is crashing now, which seems to highlight it is
> taking some of this new paths added.
> I have half a dosen crashdumps if you want me to look something up there.
> This is on top of 4.7.0-rc6 a99cde438de0c4c0cecc1d1af1a55a75b10bfdef
> with just your patch on top.
> 
> I can reproduce it with both the complex workload, or with a simplified one (attached),
> takes anywhere from 5 to 30 minutes to hit.
> 
>> 
>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> index d8015a03..5474e39 100644
>> --- a/fs/nfs/dir.c
>> +++ b/fs/nfs/dir.c
>> @@ -1485,11 +1485,13 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>> 		    struct file *file, unsigned open_flags,
>> 		    umode_t mode, int *opened)
>> {
>> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>> 	struct nfs_open_context *ctx;
>> 	struct dentry *res;
>> 	struct iattr attr = { .ia_valid = ATTR_OPEN };
>> 	struct inode *inode;
>> 	unsigned int lookup_flags = 0;
>> +	bool switched = false;
>> 	int err;
>> 
>> 	/* Expect a negative dentry */
>> @@ -1528,6 +1530,17 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>> 		attr.ia_size = 0;
>> 	}
>> 
>> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
>> +		d_drop(dentry);
>> +		switched = true;
>> +		dentry = d_alloc_parallel(dentry->d_parent,
>> +					  &dentry->d_name, &wq);
> 
> Hm, d_alloc_parallel can return some preexisting dentry it was able to lookup in
> rcu attached to the same parent with the same name it seems?
> What's to stop a parallel thread to rehash the dentry after we dropped it here,
> and so d_alloc_parallel will happily find it?
> I am running a test to verify this theory now.

OK, so at least in my case we do not come out of d_alloc_parallel() with a hashed
dentry, though I wonder why you are not trying to catch this case here.
If you rely on the !d_in_lookup(dentry) check below, that seems racy:
if it was __d_lookup_rcu() that found the dentry, we do not seem to perform
any additional checks on it and just return it.
But what if it was just added by a parallel caller that has just finished
d_splice_alias() (= a hashed dentry, so it's not rejected by __d_lookup_rcu())
and at the same time has not progressed all the way to the d_lookup_done()
call below?

Interesting that in some of the dumps d_hash is all zeroes, which means
it is unhashed, yet the assertion was triggered, so some parallel caller
performed a d_drop() on us (but there was nothing of interest on the other CPUs).

In all cases d_flags = 140, so we do not have DCACHE_PAR_LOOKUP set at the
point of the crash.

Also, in some of the dumps the dentry is actually positive, not negative.


> 
>> +		if (IS_ERR(dentry))
>> +			return PTR_ERR(dentry);
>> +		if (unlikely(!d_in_lookup(dentry)))
>> +			return finish_no_open(file, dentry);
>> +	}
>> +
>> 	ctx = create_nfs_open_context(dentry, open_flags);
>> 	err = PTR_ERR(ctx);
>> 	if (IS_ERR(ctx))
>> @@ -1563,14 +1576,23 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>> 	trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
>> 	put_nfs_open_context(ctx);
>> out:
>> +	if (unlikely(switched)) {
>> +		d_lookup_done(dentry);
>> +		dput(dentry);
>> +	}
>> 	return err;
>> 
>> no_open:
>> 	res = nfs_lookup(dir, dentry, lookup_flags);
>> -	err = PTR_ERR(res);
>> +	if (switched) {
>> +		d_lookup_done(dentry);
>> +		if (!res)
>> +			res = dentry;
>> +		else
>> +			dput(dentry);
>> +	}
>> 	if (IS_ERR(res))
>> -		goto out;
>> -
>> +		return PTR_ERR(res);
>> 	return finish_no_open(file, res);
>> }
>> EXPORT_SYMBOL_GPL(nfs_atomic_open);
> 


* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-03  6:29     ` Al Viro
  2016-07-04  0:08       ` Al Viro
  2016-07-05  2:28       ` Oleg Drokin
@ 2016-07-05  6:22       ` Oleg Drokin
  2016-07-05 12:31         ` Al Viro
  2 siblings, 1 reply; 38+ messages in thread
From: Oleg Drokin @ 2016-07-05  6:22 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 3, 2016, at 2:29 AM, Al Viro wrote:

> On Sat, Jun 25, 2016 at 12:38:40PM -0400, Oleg Drokin wrote:
> 
>> Sorry to nag you about this, but did any of those pan out?
>> 
>> d_alloc_parallel() sounds like a bit too heavy there, esp. considering we came in with
>> a dentry already (though a potentially shared one, I understand).
>> Would not it be better to try and establish some dentry locking rule for calling into
>> d_splice_alias() instead? At least then the callers can make sure the dentry does
>> not change under them?
>> Though I guess if there's dentry locking like that, we might as well do all the
>> checking in d_splice_alias(), but that means the unhashed dentries would no
>> longer be disallowed which is a change of semantic from now.--
> 
> FWIW, the only interesting case here is this:
> 	* no O_CREAT in flags (otherwise the parent is held exclusive).
> 	* dentry is found in hash
> 	* dentry is negative
> 	* dentry has passed ->d_revalidate() (i.e. in case of
> NFS it had nfs_neg_need_reval() return false).
> 
> Only two instances are non-trivial in that respect - NFS and Lustre.
> Everything else will simply fail open() with ENOENT in that case.
> 
> And at least for NFS we could bloody well do d_drop + d_alloc_parallel +
> finish_no_open and bugger off in case it's not in_lookup, otherwise do
> pretty much what we do in case we'd got in_lookup from the very beginning.
> Some adjustments are needed for that case (basically, we need to make
> sure we hit d_lookup_done() matching that d_alloc_parallel() and deal
> with refcounting correctly).
> 
> Tentative NFS patch follows; I don't understand Lustre well enough, but it
> looks like a plausible strategy there as well.

I think I know why this does not work or at least part of the reason.

> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index d8015a03..5474e39 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1485,11 +1485,13 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
> 		    struct file *file, unsigned open_flags,
> 		    umode_t mode, int *opened)
> {
> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
> 	struct nfs_open_context *ctx;
> 	struct dentry *res;
> 	struct iattr attr = { .ia_valid = ATTR_OPEN };
> 	struct inode *inode;
> 	unsigned int lookup_flags = 0;
> +	bool switched = false;
> 	int err;
> 
> 	/* Expect a negative dentry */
> @@ -1528,6 +1530,17 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
> 		attr.ia_size = 0;
> 	}
> 
> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {

So we come racing here from multiple threads (say 3 or more - we have seen this
in the older crash reports, so totally possible)

> +		d_drop(dentry);

One lucky thread does this first, before the others perform the !d_unhashed check above.
That makes the other ones skip this block entirely.

And we are back to the original problem of multiple threads trying to instantiate
the same dentry as before.

Only this time the race window is somehow even wider because of some other interactions
I don't really understand; at least with this patch removed I seem to have a
harder time hitting the original issue.
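To make that window concrete, here is a minimal userspace analogy (plain C with
pthreads, nothing NFS-specific in it; every name in it is made up for illustration):
the atomic flag stands in for the dentry hash state, and only the one thread that
flips it gets diverted into the serialized d_alloc_parallel()-style path, while
everyone else falls through and keeps racing on the same name, just like the
d_splice_alias() callers above.

	/* race_sketch.c - build with: cc -std=c11 -pthread race_sketch.c */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int hashed = 1;		/* stands in for "dentry is hashed" */
	static atomic_int unserialized;		/* threads that bypassed the guarded path */

	static void *opener(void *arg)
	{
		/* analogue of: if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) */
		if (atomic_exchange(&hashed, 0)) {
			/* the one "lucky" thread: it would go through
			 * d_alloc_parallel() and get properly serialized */
			return NULL;
		}
		/* the others skip the guarded block, keep the shared dentry and
		 * all proceed to "instantiate" the same name concurrently */
		atomic_fetch_add(&unserialized, 1);
		return NULL;
	}

	int main(void)
	{
		pthread_t t[3];

		for (int i = 0; i < 3; i++)
			pthread_create(&t[i], NULL, opener, NULL);
		for (int i = 0; i < 3; i++)
			pthread_join(t[i], NULL);
		printf("threads left racing on the same name: %d\n",
		       atomic_load(&unserialized));
		return 0;
	}

With three threads this always prints 2: exactly one thread wins the guard, and
the other two are back to colliding in d_splice_alias() territory.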


> +		switched = true;
> +		dentry = d_alloc_parallel(dentry->d_parent,
> +					  &dentry->d_name, &wq);
> +		if (IS_ERR(dentry))
> +			return PTR_ERR(dentry);
> +		if (unlikely(!d_in_lookup(dentry)))
> +			return finish_no_open(file, dentry);
> +	}
> +
> 	ctx = create_nfs_open_context(dentry, open_flags);
> 	err = PTR_ERR(ctx);
> 	if (IS_ERR(ctx))
> @@ -1563,14 +1576,23 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
> 	trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
> 	put_nfs_open_context(ctx);
> out:
> +	if (unlikely(switched)) {
> +		d_lookup_done(dentry);
> +		dput(dentry);
> +	}
> 	return err;
> 
> no_open:
> 	res = nfs_lookup(dir, dentry, lookup_flags);
> -	err = PTR_ERR(res);
> +	if (switched) {
> +		d_lookup_done(dentry);
> +		if (!res)
> +			res = dentry;
> +		else
> +			dput(dentry);
> +	}
> 	if (IS_ERR(res))
> -		goto out;
> -
> +		return PTR_ERR(res);
> 	return finish_no_open(file, res);
> }
> EXPORT_SYMBOL_GPL(nfs_atomic_open);

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05  6:22       ` Oleg Drokin
@ 2016-07-05 12:31         ` Al Viro
  2016-07-05 13:51           ` Al Viro
  0 siblings, 1 reply; 38+ messages in thread
From: Al Viro @ 2016-07-05 12:31 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>

On Tue, Jul 05, 2016 at 02:22:48AM -0400, Oleg Drokin wrote:

> > +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {

s/d_unhashed/d_in_lookup/ in that.

> So we come racing here from multiple threads (say 3 or more - we have seen this
> in the older crash reports, so totally possible)
> 
> > +		d_drop(dentry);
> 
> One lucky one does this first before the others perform the !d_unhashed check above.
> This makes the other ones to not enter here.
> 
> And we are back to the original problem of multiple threads trying to instantiate
> same dentry as before.

Yep.  See above - it should've been using d_in_lookup() in the first place,
through the entire nfs_atomic_open().  Same in the Lustre part of fixes,
obviously.
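
Concretely, the guard in the tentative patch would then presumably look something
like this (just a sketch of that substitution, not the final committed hunk):

	/* d_in_lookup() is stable for the dentry we were handed - nobody else
	 * can set or clear DCACHE_PAR_LOOKUP on it behind our back, unlike the
	 * hash state, which any other thread can change with d_drop() */
	if (!(open_flags & O_CREAT) && !d_in_lookup(dentry)) {
		d_drop(dentry);
		switched = true;
		dentry = d_alloc_parallel(dentry->d_parent,
					  &dentry->d_name, &wq);
		if (IS_ERR(dentry))
			return PTR_ERR(dentry);
		if (unlikely(!d_in_lookup(dentry)))
			return finish_no_open(file, dentry);
	}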

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05 12:31         ` Al Viro
@ 2016-07-05 13:51           ` Al Viro
  2016-07-05 15:21             ` Oleg Drokin
                               ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Al Viro @ 2016-07-05 13:51 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>

On Tue, Jul 05, 2016 at 01:31:10PM +0100, Al Viro wrote:
> On Tue, Jul 05, 2016 at 02:22:48AM -0400, Oleg Drokin wrote:
> 
> > > +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
> 
> s/d_unhashed/d_in_lookup/ in that.
> 
> > So we come racing here from multiple threads (say 3 or more - we have seen this
> > in the older crash reports, so totally possible)
> > 
> > > +		d_drop(dentry);
> > 
> > One lucky one does this first before the others perform the !d_unhashed check above.
> > This makes the other ones to not enter here.
> > 
> > And we are back to the original problem of multiple threads trying to instantiate
> > same dentry as before.
> 
> Yep.  See above - it should've been using d_in_lookup() in the first place,
> through the entire nfs_atomic_open().  Same in the Lustre part of fixes,
> obviously.

See current #for-linus for hopefully fixed variants (both lustre and nfs)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05 13:51           ` Al Viro
@ 2016-07-05 15:21             ` Oleg Drokin
  2016-07-05 17:42               ` Al Viro
  2016-07-05 16:33             ` Oleg Drokin
  2016-07-06 16:24             ` Oleg Drokin
  2 siblings, 1 reply; 38+ messages in thread
From: Oleg Drokin @ 2016-07-05 15:21 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 5, 2016, at 9:51 AM, Al Viro wrote:

> On Tue, Jul 05, 2016 at 01:31:10PM +0100, Al Viro wrote:
>> On Tue, Jul 05, 2016 at 02:22:48AM -0400, Oleg Drokin wrote:
>> 
>>>> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
>> 
>> s/d_unhashed/d_in_lookup/ in that.
>> 
>>> So we come racing here from multiple threads (say 3 or more - we have seen this
>>> in the older crash reports, so totally possible)
>>> 
>>>> +		d_drop(dentry);
>>> 
>>> One lucky one does this first before the others perform the !d_unhashed check above.
>>> This makes the other ones to not enter here.
>>> 
>>> And we are back to the original problem of multiple threads trying to instantiate
>>> same dentry as before.
>> 
>> Yep.  See above - it should've been using d_in_lookup() in the first place,
>> through the entire nfs_atomic_open().  Same in the Lustre part of fixes,
>> obviously.
> 
> See current #for-linus for hopefully fixed variants (both lustre and nfs)

The first patch of the series:
> @@ -416,9 +416,9 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
> ...
> -       if (d_unhashed(*de)) {
> +       if (d_in_lookup(*de)) {
>                 struct dentry *alias;
>  
>                 alias = ll_splice_alias(inode, *de);

This breaks Lustre because we might now progress further in this function
without calling into ll_splice_alias(), and that is the only place where we do
the ll_d_init() that later code depends on, so we crash violently the next time
we call e.g. d_lustre_revalidate() further down that path.

Also I still wonder what's to stop d_alloc_parallel() from returning
a hashed dentry with d_in_lookup() still true?
Certainly there's a big gap between hashing the dentry and dropping the PAR
bit in there that I imagine might allow __d_lookup_rcu() to pick it up
in between?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05 13:51           ` Al Viro
  2016-07-05 15:21             ` Oleg Drokin
@ 2016-07-05 16:33             ` Oleg Drokin
  2016-07-05 18:08               ` Al Viro
  2016-07-06 16:24             ` Oleg Drokin
  2 siblings, 1 reply; 38+ messages in thread
From: Oleg Drokin @ 2016-07-05 16:33 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 5, 2016, at 9:51 AM, Al Viro wrote:

> On Tue, Jul 05, 2016 at 01:31:10PM +0100, Al Viro wrote:
>> On Tue, Jul 05, 2016 at 02:22:48AM -0400, Oleg Drokin wrote:
>> 
>>>> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
>> 
>> s/d_unhashed/d_in_lookup/ in that.
>> 
>>> So we come racing here from multiple threads (say 3 or more - we have seen this
>>> in the older crash reports, so totally possible)
>>> 
>>>> +		d_drop(dentry);
>>> 
>>> One lucky one does this first before the others perform the !d_unhashed check above.
>>> This makes the other ones to not enter here.
>>> 
>>> And we are back to the original problem of multiple threads trying to instantiate
>>> same dentry as before.
>> 
>> Yep.  See above - it should've been using d_in_lookup() in the first place,
>> through the entire nfs_atomic_open().  Same in the Lustre part of fixes,
>> obviously.
> 
> See current #for-linus for hopefully fixed variants (both lustre and nfs)

So at first it looked like we just needed another ll_d_init() in the other arm
of that "if (d_in_lookup())" statement, but alas.

Also, the patch that changed the d_unhashed() check to d_in_lookup() now results in
a stale comment:
        /* Only hash *de if it is unhashed (new dentry).
         * Atomic_open may passing hashed dentries for open.
         */
        if (d_in_lookup(*de)) {


Since we no longer check for d_unhashed(), what would be a better choice of words here?
"Only hash *de if it is a new dentry coming from lookup"?

This also makes me question the whole thing some more. We are definitely in a lookup
when this hits, so the dentry is already new, yet it does not register as
d_in_lookup(). That also means that by skipping ll_splice_alias() we are failing
to hash it, and that causes needless lookups later?
Looking back through the commit history, d_in_lookup() is meant to tell us
that we are in the middle of a lookup. How can we be in the middle of the lookup
path then and not have this set on the dentry? We know the dentry was not
substituted with anything here because we did not call into ll_splice_alias().
So what's going on then?

Here's a backtrace:
[  146.045148]  [<ffffffffa017baa6>] lbug_with_loc+0x46/0xb0 [libcfs]
[  146.045158]  [<ffffffffa05baef3>] ll_lookup_it_finish+0x713/0xaa0 [lustre]
[  146.045160]  [<ffffffff810e6dcd>] ? trace_hardirqs_on+0xd/0x10
[  146.045167]  [<ffffffffa05bb51b>] ll_lookup_it+0x29b/0x710 [lustre]
[  146.045173]  [<ffffffffa05b8830>] ? md_set_lock_data.part.25+0x60/0x60 [lustre]
[  146.045179]  [<ffffffffa05bc6a4>] ll_lookup_nd+0x84/0x190 [lustre]
[  146.045180]  [<ffffffff81276e94>] __lookup_hash+0x64/0xa0
[  146.045181]  [<ffffffff810e1b88>] ? down_write_nested+0xa8/0xc0
[  146.045182]  [<ffffffff8127d55f>] do_unlinkat+0x1bf/0x2f0
[  146.045183]  [<ffffffff8127e12b>] SyS_unlinkat+0x1b/0x30
[  146.045185]  [<ffffffff8188b3bc>] entry_SYSCALL_64_fastpath+0x1f/0xbd

__lookup_hash() does d_alloc (not parallel) and falls through into the ->lookup()
of the filesystem. So the dots do not connect.

The more I look at it, the more I suspect it's wrong.
Everywhere else that patch touched was in *atomic_open(), with
a well-known impact. ll_lookup_it_finish(), on the other hand, is a generic lookup path,
not just for atomic opens.

I took out that part of your patch and the problems went away, it seems.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05 15:21             ` Oleg Drokin
@ 2016-07-05 17:42               ` Al Viro
  2016-07-05 18:12                 ` Oleg Drokin
  0 siblings, 1 reply; 38+ messages in thread
From: Al Viro @ 2016-07-05 17:42 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>

On Tue, Jul 05, 2016 at 11:21:32AM -0400, Oleg Drokin wrote:
> > ...
> > -       if (d_unhashed(*de)) {
> > +       if (d_in_lookup(*de)) {
> >                 struct dentry *alias;
> >  
> >                 alias = ll_splice_alias(inode, *de);
> 
> This breaks Lustre because we now might progress further in this function
> without calling into ll_splice_alias and that's the only place that we do
> ll_d_init() that later code depends on so we violently crash next time
> we call e.g. d_lustre_revalidate() further down that code.

Huh?  How the hell do those conditions differ there?

> Also I still wonder what's to stop d_alloc_parallel() from returning
> a hashed dentry with d_in_lookup() still true?

The fact that such dentries do not exist at any point?

> Certainly there's a big gap between hashing the dentry and dropping the PAR
> bit in there that I imagine might allow __d_lookup_rcu() to pick it up
> in between?--

WTF?  Where do you see that gap?  in-lookup dentries get hashed only in one
place - __d_add().  And there (besides holding ->d_lock around both) we
drop that bit in flags *before* _d_rehash().  AFAICS, the situation with
barriers is OK there, due to lockref_get_not_dead() serving as ACQUIRE
operation; I could be missing something subtle, but a wide gap...  Where?
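
For reference, the ordering being referred to is roughly this (paraphrased from
memory of the 4.7-era __d_add() in fs/dcache.c, heavily trimmed):

	spin_lock(&dentry->d_lock);
	if (unlikely(d_in_lookup(dentry)))
		__d_lookup_done(dentry);	/* clears DCACHE_PAR_LOOKUP ... */
	...
	_d_rehash(dentry);			/* ... before it becomes visible in the hash */
	spin_unlock(&dentry->d_lock);

so a dentry should never be observable as hashed while still flagged as in-lookup.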

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05 16:33             ` Oleg Drokin
@ 2016-07-05 18:08               ` Al Viro
  2016-07-05 19:12                 ` Oleg Drokin
  0 siblings, 1 reply; 38+ messages in thread
From: Al Viro @ 2016-07-05 18:08 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>

On Tue, Jul 05, 2016 at 12:33:09PM -0400, Oleg Drokin wrote:

> This also makes me question the whole thing some more. We are definitely in lookup
> when this hits, so the dentry is already new, yet it does not check off as
> d_in_lookup(). That also means that by skipping the ll_splice_alias we are failing
> to hash it and that causing needless lookups later?
> Looking some back into the history of commits, d_in_lookup() is to tell us
> that we are in the middle of lookup. How can we be in the middle of lookup
> path then and not have this set on a dentry? We know dentry was not
> substituted with anything here because we did not call into ll_split_alias().
> So what's going on then?

Lookup in directory locked exclusive, that's what...  In unlink(), in your
testcase.  And yes, this piece of 1/3 is incorrect; what I do not understand
is the logics of what you are doing with dcache in ll_splice_alias() and
in its caller ;-/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05 17:42               ` Al Viro
@ 2016-07-05 18:12                 ` Oleg Drokin
  0 siblings, 0 replies; 38+ messages in thread
From: Oleg Drokin @ 2016-07-05 18:12 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 5, 2016, at 1:42 PM, Al Viro wrote:

> On Tue, Jul 05, 2016 at 11:21:32AM -0400, Oleg Drokin wrote:
>>> ...
>>> -       if (d_unhashed(*de)) {
>>> +       if (d_in_lookup(*de)) {
>>>                struct dentry *alias;
>>> 
>>>                alias = ll_splice_alias(inode, *de);
>> 
>> This breaks Lustre because we now might progress further in this function
>> without calling into ll_splice_alias and that's the only place that we do
>> ll_d_init() that later code depends on so we violently crash next time
>> we call e.g. d_lustre_revalidate() further down that code.
> 
> Huh?  How the hell do those conditions differ there?

As explained in my other email, because this is in a normal
lookup path, we can get here with a new dentry that was allocated in
__lookup_hash() via d_alloc() (not d_alloc_parallel()), so it is not marked with the PAR bit.
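
For reference, that path is roughly (heavily trimmed from the 4.7-era
__lookup_hash() in fs/namei.c):

	/* parent held exclusive, nothing usable found in the dcache: */
	dentry = d_alloc(base, name);		/* no DCACHE_PAR_LOOKUP set here */
	...
	old = dir->i_op->lookup(dir, dentry, flags);	/* dir == base->d_inode */

so d_in_lookup() is false for every dentry that reaches ->lookup() this way.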

>> Also I still wonder what's to stop d_alloc_parallel() from returning
>> a hashed dentry with d_in_lookup() still true?
> 
> The fact that such dentries do not exist at any point?
> 
>> Certainly there's a big gap between hashing the dentry and dropping the PAR
>> bit in there that I imagine might allow __d_lookup_rcu() to pick it up
>> in between?--
> 
> WTF?  Where do you see that gap?  in-lookup dentries get hashed only in one
> place - __d_add().  And there (besides holding ->d_lock around both) we
> drop that bit in flags *before* _d_rehash().  AFAICS, the situation with
> barriers is OK there, due to lockref_get_not_dead() serving as ACQUIRE
> operation; I could be missing something subtle, but a wide gap...  Where?

Oh! I see, I missed that __d_add() drops the PAR bit as well, not just the code
at the end of the caller that did d_alloc_parallel().
Then indeed there is no gap; sorry for the false alarm.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05 18:08               ` Al Viro
@ 2016-07-05 19:12                 ` Oleg Drokin
  2016-07-05 20:08                   ` Al Viro
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Drokin @ 2016-07-05 19:12 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 5, 2016, at 2:08 PM, Al Viro wrote:

> On Tue, Jul 05, 2016 at 12:33:09PM -0400, Oleg Drokin wrote:
> 
>> This also makes me question the whole thing some more. We are definitely in lookup
>> when this hits, so the dentry is already new, yet it does not check off as
>> d_in_lookup(). That also means that by skipping the ll_splice_alias we are failing
>> to hash it and that causing needless lookups later?
>> Looking some back into the history of commits, d_in_lookup() is to tell us
>> that we are in the middle of lookup. How can we be in the middle of lookup
>> path then and not have this set on a dentry? We know dentry was not
>> substituted with anything here because we did not call into ll_split_alias().
>> So what's going on then?
> 
> Lookup in directory locked exclusive, that's what...  In unlink(), in your
> testcase.  And yes, this piece of 1/3 is incorrect; what I do not understand
> is the logics of what you are doing with dcache in ll_splice_alias() and
> in its caller ;-/

When d_in_lookup was introduced, it was described as:
    New primitives: d_in_lookup() (a predicate checking if dentry is in
    the in-lookup state) and d_lookup_done() (tells the system that
    we are done with lookup and if it's still marked as in-lookup, it
    should cease to be such).

I don't see where it mentions anything about exclusive vs. parallel lookups,
which probably led to some confusion.

So with Lustre the dentry can be in three states, really:

1. hashed dentry that's all A-ok to reuse.
2. hashed dentry that's NOT valid (dlm lock went away) - this is distinguished in d_compare by looking at a bit in the fs_data
3. unhashed dentry (I guess it could be either valid or invalid lustre-wise).

So the logic in ll_lookup_it_finish() (part of regular lookup) is this:

If the dentry we have is not hashed, this is a new lookup, so we need to
call into ll_splice_alias() to see if there's a better dentry we should
reuse, one that was already rejected by the VFS before because we did not have the
necessary locks then, but we do have them now.
The comment at the top of ll_dcompare() explains why we don't just unhash the
dentry on lock loss - that apparently leads to a loop around real_lookup() for
really contended dentries.
This is also why we cannot use d_splice_alias here - such cases are possible
for regular files and directories.

Anyway, I guess an additional point of confusion here is why
ll_lookup_it_finish() needs to check whether the dentry is hashed at all, since it is in
a lookup, so we should be unhashed here.
I checked the commit history and this check was added along with atomic_open
support, so I imagine we can just move it up into ll_atomic_open() and then your
change starts to make sense, along with a few other things.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05 19:12                 ` Oleg Drokin
@ 2016-07-05 20:08                   ` Al Viro
  2016-07-05 20:21                     ` Oleg Drokin
  0 siblings, 1 reply; 38+ messages in thread
From: Al Viro @ 2016-07-05 20:08 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>

On Tue, Jul 05, 2016 at 03:12:44PM -0400, Oleg Drokin wrote:
> 
> When d_in_lookup was introduced, it was described as:
>     New primitives: d_in_lookup() (a predicate checking if dentry is in
>     the in-lookup state) and d_lookup_done() (tells the system that
>     we are done with lookup and if it's still marked as in-lookup, it
>     should cease to be such).
> 
> I don't see where it mentions anything about exclusive vs parallel lookup
> that probably led to some confusion.

In the same commit:

#define DCACHE_PAR_LOOKUP               0x10000000 /* being looked up (with parent locked shared) */

static inline int d_in_lookup(struct dentry *dentry)
{
       return dentry->d_flags & DCACHE_PAR_LOOKUP;
}

Sure, we could use d_alloc_parallel() for all lookups, but it wouldn't buy
us anything in terms of exclusion (parent locked exclusive => no other
lookup attempts on that parent/name pair anyway) and it would cost extra
searches both in the primary and in-lookup hashes, as well as insertions and
removals from the latter.

Hell knows - perhaps teaching d_alloc_parallel() that NULL wq => just
allocate and mark in-lookup, without touching either primary or in-lookup
hash (and scream bloody murder if the parent isn't locked exclusive) would
be a good idea.  A few places in fs/dcache.c would need to check for
->d_wait being non-NULL (__d_lookup_done(), __d_add() an __d_move());
We could use that for lookups when parent is locked exclusive; then
d_in_lookup() would be true for *all* dentries passed to ->lookup().  

I'll look into that, but it's obviously the next cycle fodder.
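
Purely as a sketch of that idea (nothing like this exists anywhere), the extra
case could be an early branch in d_alloc_parallel() along these lines:

	/* hypothetical: wq == NULL means "parent is locked exclusive" */
	if (!wq) {
		struct dentry *new = d_alloc(parent, name);

		if (unlikely(!new))
			return ERR_PTR(-ENOMEM);
		/* nobody else can be looking this name up, so skip both the
		 * primary and the in-lookup hash and just mark it in-lookup;
		 * ->d_wait staying NULL is what the checks mentioned above
		 * would key off */
		new->d_flags |= DCACHE_PAR_LOOKUP;
		return new;
	}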

> So with Lustre the dentry can be in three states, really:
> 
> 1. hashed dentry that's all A-ok to reuse.
> 2. hashed dentry that's NOT valid (dlm lock went away) - this is distinguished in d_compare by looking at a bit in the fs_data
> 3. unhashed dentry ( I guess could be both valid and invalid lustre-wise).
> 
> So the logic in ll_lookup_it_finish() (part of regular lookup) is this:
> 
> If the dentry we have is not hashed - this is a new lookup, so we need to
> call into ll_splice_alias() to see if there's a better dentry we need to
> reuse that was already rejected by VFS before since we did not have necessary locks,
> but we do have them now.
> The comment at the top of ll_dcompare() explains why we don't just unhash the
> dentry on lock-loss - that apparently leads to a loop around real_lookup for
> real-contended dentries.
> This is also why we cannot use d_splice_alias here - such cases are possible
> for regular files and directories.
>
> Anyway, I guess additional point of confusion here is then why does
> ll_lookup_it_finish() need to check for hashedness of the dentry since it's in
> lookup, so we should be unhashed here.
> I checked the commit history and this check was added along with atomic_open
> support, so I imagine we can just move it up into ll_atomic_open and then your
> change starts to make sense along with a few other things.

So basically this
        } else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
                   !it_disposition(it, DISP_OPEN_CREATE)) {
                /* With DISP_OPEN_CREATE dentry will be
                 * instantiated in ll_create_it.
                 */
                LASSERT(!d_inode(*de));
                d_instantiate(*de, inode);
        }
is something we should do only for negative hashed fed to it by
->atomic_open(), and that - only if we have no O_CREAT in flags?

Then, since 3/3 eliminates that case completely, we could just rip that
else-if, along with the check for d_unhashed itself, making the call of
ll_splice_alias() unconditional there.  Or am I misreading what you said?
Do you see any problems with what's in #for-linus now (head at 11f0083)?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05 20:08                   ` Al Viro
@ 2016-07-05 20:21                     ` Oleg Drokin
  2016-07-06  0:29                       ` Oleg Drokin
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Drokin @ 2016-07-05 20:21 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 5, 2016, at 4:08 PM, Al Viro wrote:

> On Tue, Jul 05, 2016 at 03:12:44PM -0400, Oleg Drokin wrote:
>> 
>> When d_in_lookup was introduced, it was described as:
>>    New primitives: d_in_lookup() (a predicate checking if dentry is in
>>    the in-lookup state) and d_lookup_done() (tells the system that
>>    we are done with lookup and if it's still marked as in-lookup, it
>>    should cease to be such).
>> 
>> I don't see where it mentions anything about exclusive vs parallel lookup
>> that probably led to some confusion.
> 
> In the same commit:
> 
> #define DCACHE_PAR_LOOKUP               0x10000000 /* being looked up (with parent locked shared) */

Well, I did not really check the commit, just the log message,
since there's no other documentation about it apparently ;)

>> So with Lustre the dentry can be in three states, really:
>> 
>> 1. hashed dentry that's all A-ok to reuse.
>> 2. hashed dentry that's NOT valid (dlm lock went away) - this is distinguished in d_compare by looking at a bit in the fs_data
>> 3. unhashed dentry ( I guess could be both valid and invalid lustre-wise).
>> 
>> So the logic in ll_lookup_it_finish() (part of regular lookup) is this:
>> 
>> If the dentry we have is not hashed - this is a new lookup, so we need to
>> call into ll_splice_alias() to see if there's a better dentry we need to
>> reuse that was already rejected by VFS before since we did not have necessary locks,
>> but we do have them now.
>> The comment at the top of ll_dcompare() explains why we don't just unhash the
>> dentry on lock-loss - that apparently leads to a loop around real_lookup for
>> real-contended dentries.
>> This is also why we cannot use d_splice_alias here - such cases are possible
>> for regular files and directories.
>> 
>> Anyway, I guess additional point of confusion here is then why does
>> ll_lookup_it_finish() need to check for hashedness of the dentry since it's in
>> lookup, so we should be unhashed here.
>> I checked the commit history and this check was added along with atomic_open
>> support, so I imagine we can just move it up into ll_atomic_open and then your
>> change starts to make sense along with a few other things.
> 
> So basically this
>        } else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
>                   !it_disposition(it, DISP_OPEN_CREATE)) {
>                /* With DISP_OPEN_CREATE dentry will be
>                 * instantiated in ll_create_it.
>                 */
>                LASSERT(!d_inode(*de));
>                d_instantiate(*de, inode);
>        }
> is something we should do only for negative hashed fed to it by
> ->atomic_open(), and that - only if we have no O_CREAT in flags?

Yes, and in fact we can totally avoid it I think.

> Then, since 3/3 eliminates that case completely, we could just rip that
> else-if, along with the check for d_unhashed itself, making the call of
> ll_splice_alias() unconditional there.  Or am I misreading what you said?
> Do you see any problems with what's in #for-linus now (head at 11f0083)?

Yes, we can make it unconditional.
I think we can simplify it even more: since, unlike NFS, our negative dentries
are a lot less speculative, we can just return -ENOENT if this is not a create
request and unhash otherwise. This still needs to go through the whole test suite
to make sure it does not break anything unexpected.

At least this is the patch I am playing with right now:
--- a/drivers/staging/lustre/lustre/llite/namei.c
+++ b/drivers/staging/lustre/lustre/llite/namei.c
@@ -391,6 +391,7 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 	struct inode *inode = NULL;
 	__u64 bits = 0;
 	int rc = 0;
+	struct dentry *alias;
 
 	/* NB 1 request reference will be taken away by ll_intent_lock()
 	 * when I return
@@ -415,26 +416,12 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 		 */
 	}
 
-	/* Only hash *de if it is unhashed (new dentry).
-	 * Atoimc_open may passing hashed dentries for open.
-	 */
-	if (d_unhashed(*de)) {
-		struct dentry *alias;
-
-		alias = ll_splice_alias(inode, *de);
-		if (IS_ERR(alias)) {
-			rc = PTR_ERR(alias);
-			goto out;
-		}
-		*de = alias;
-	} else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
-		   !it_disposition(it, DISP_OPEN_CREATE)) {
-		/* With DISP_OPEN_CREATE dentry will be
-		 * instantiated in ll_create_it.
-		 */
-		LASSERT(!d_inode(*de));
-		d_instantiate(*de, inode);
+	alias = ll_splice_alias(inode, *de);
+	if (IS_ERR(alias)) {
+		rc = PTR_ERR(alias);
+		goto out;
 	}
+	*de = alias;
 
 	if (!it_disposition(it, DISP_LOOKUP_NEG)) {
 		/* we have lookup look - unhide dentry */
@@ -590,6 +577,24 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 	       dentry, PFID(ll_inode2fid(dir)), dir, file, open_flags, mode,
 	       *opened);
 
+	/* Only negative dentries enter here */
+	LASSERT(!d_inode(dentry));
+
+	if (!(open_flags & O_CREAT) && !d_in_lookup(dentry)) {
+		/* A valid negative dentry that just passed revalidation,
+		 * there's little point to try and open it server-side,
+		 * even though there's a minuscle chance it might succeed.
+		 * Either way it's a valid race to just return -ENOENT here.
+		 */
+		if (!(open_flags & O_CREAT))
+			return -ENOENT;
+
+		/* Otherwise we just unhash it to be rehashed afresh via
+		 * lookup if necessary
+		 */
+		d_drop(dentry);
+	}
+
 	it = kzalloc(sizeof(*it), GFP_NOFS);
 	if (!it)
 		return -ENOMEM;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05 20:21                     ` Oleg Drokin
@ 2016-07-06  0:29                       ` Oleg Drokin
  2016-07-06  3:20                         ` Al Viro
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Drokin @ 2016-07-06  0:29 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 5, 2016, at 4:21 PM, Oleg Drokin wrote:
> 
>>> So with Lustre the dentry can be in three states, really:
>>> 
>>> 1. hashed dentry that's all A-ok to reuse.
>>> 2. hashed dentry that's NOT valid (dlm lock went away) - this is distinguished in d_compare by looking at a bit in the fs_data
>>> 3. unhashed dentry ( I guess could be both valid and invalid lustre-wise).
>>> 
>>> So the logic in ll_lookup_it_finish() (part of regular lookup) is this:
>>> 
>>> If the dentry we have is not hashed - this is a new lookup, so we need to
>>> call into ll_splice_alias() to see if there's a better dentry we need to
>>> reuse that was already rejected by VFS before since we did not have necessary locks,
>>> but we do have them now.
>>> The comment at the top of ll_dcompare() explains why we don't just unhash the
>>> dentry on lock-loss - that apparently leads to a loop around real_lookup for
>>> real-contended dentries.
>>> This is also why we cannot use d_splice_alias here - such cases are possible
>>> for regular files and directories.
>>> 
>>> Anyway, I guess additional point of confusion here is then why does
>>> ll_lookup_it_finish() need to check for hashedness of the dentry since it's in
>>> lookup, so we should be unhashed here.
>>> I checked the commit history and this check was added along with atomic_open
>>> support, so I imagine we can just move it up into ll_atomic_open and then your
>>> change starts to make sense along with a few other things.
>> 
>> So basically this
>>       } else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
>>                  !it_disposition(it, DISP_OPEN_CREATE)) {
>>               /* With DISP_OPEN_CREATE dentry will be
>>                * instantiated in ll_create_it.
>>                */
>>               LASSERT(!d_inode(*de));
>>               d_instantiate(*de, inode);
>>       }
>> is something we should do only for negative hashed fed to it by
>> ->atomic_open(), and that - only if we have no O_CREAT in flags?
> 
> Yes, and in fact we can totally avoid it I think.
> 
>> Then, since 3/3 eliminates that case completely, we could just rip that
>> else-if, along with the check for d_unhashed itself, making the call of
>> ll_splice_alias() unconditional there.  Or am I misreading what you said?
>> Do you see any problems with what's in #for-linus now (head at 11f0083)?
> 
> Yes, we can make it unconditional
> I think we can simplify it even more and since unlike NFS our negative dentries
> are a lot less speculative, we can just return ENOENT if this is not a create
> request and unhash otherwise. Still needs to go through the whole test suite
> to make sure it does not break anything unexpected.
> 
> At least this is the patch I am playing with right now:
> --- a/drivers/staging/lustre/lustre/llite/namei.c
> +++ b/drivers/staging/lustre/lustre/llite/namei.c
> @@ -391,6 +391,7 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
> 	struct inode *inode = NULL;
> 	__u64 bits = 0;
> 	int rc = 0;
> +	struct dentry *alias;
> 
> 	/* NB 1 request reference will be taken away by ll_intent_lock()
> 	 * when I return
> @@ -415,26 +416,12 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
> 		 */
> 	}
> 
> -	/* Only hash *de if it is unhashed (new dentry).
> -	 * Atoimc_open may passing hashed dentries for open.
> -	 */
> -	if (d_unhashed(*de)) {
> -		struct dentry *alias;
> -
> -		alias = ll_splice_alias(inode, *de);
> -		if (IS_ERR(alias)) {
> -			rc = PTR_ERR(alias);
> -			goto out;
> -		}
> -		*de = alias;
> -	} else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
> -		   !it_disposition(it, DISP_OPEN_CREATE)) {
> -		/* With DISP_OPEN_CREATE dentry will be
> -		 * instantiated in ll_create_it.
> -		 */
> -		LASSERT(!d_inode(*de));
> -		d_instantiate(*de, inode);
> +	alias = ll_splice_alias(inode, *de);
> +	if (IS_ERR(alias)) {
> +		rc = PTR_ERR(alias);
> +		goto out;
> 	}
> +	*de = alias;
> 
> 	if (!it_disposition(it, DISP_LOOKUP_NEG)) {
> 		/* we have lookup look - unhide dentry */
> @@ -590,6 +577,24 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
> 	       dentry, PFID(ll_inode2fid(dir)), dir, file, open_flags, mode,
> 	       *opened);
> 
> +	/* Only negative dentries enter here */
> +	LASSERT(!d_inode(dentry));
> +
> +	if (!(open_flags & O_CREAT) && !d_in_lookup(dentry)) {

Duh, this obviously was supposed to be
 if (!d_in_lookup(dentry)) {

But even in the form above nothing bad happened in the full testing
because we cannot find any aliases without an inode.

> +		/* A valid negative dentry that just passed revalidation,
> +		 * there's little point to try and open it server-side,
> +		 * even though there's a minuscle chance it might succeed.
> +		 * Either way it's a valid race to just return -ENOENT here.
> +		 */
> +		if (!(open_flags & O_CREAT))
> +			return -ENOENT;
> +
> +		/* Otherwise we just unhash it to be rehashed afresh via
> +		 * lookup if necessary
> +		 */
> +		d_drop(dentry);

So we can even drop this part and retain the top condition as it was.
d_add does not care if the dentry we are feeding it was hashed or not,
so do you see any downsides to doing that I wonder?

> +	}
> +
> 	it = kzalloc(sizeof(*it), GFP_NOFS);
> 	if (!it)
> 		return -ENOMEM;
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-06  0:29                       ` Oleg Drokin
@ 2016-07-06  3:20                         ` Al Viro
  2016-07-06  3:25                             ` Oleg Drokin
  0 siblings, 1 reply; 38+ messages in thread
From: Al Viro @ 2016-07-06  3:20 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>

On Tue, Jul 05, 2016 at 08:29:37PM -0400, Oleg Drokin wrote:
> > +		/* Otherwise we just unhash it to be rehashed afresh via
> > +		 * lookup if necessary
> > +		 */
> > +		d_drop(dentry);
> 
> So we can even drop this part and retain the top condition as it was.
> d_add does not care if the dentry we are feeding it was hashed or not,
> so do you see any downsides to doing that I wonder?

d_add() on hashed dentry will end up reaching this:
static void __d_rehash(struct dentry * entry, struct hlist_bl_head *b)
{
        BUG_ON(!d_unhashed(entry));

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-06  3:20                         ` Al Viro
@ 2016-07-06  3:25                             ` Oleg Drokin
  0 siblings, 0 replies; 38+ messages in thread
From: Oleg Drokin @ 2016-07-06  3:25 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 5, 2016, at 11:20 PM, Al Viro wrote:

> On Tue, Jul 05, 2016 at 08:29:37PM -0400, Oleg Drokin wrote:
>>> +		/* Otherwise we just unhash it to be rehashed afresh via
>>> +		 * lookup if necessary
>>> +		 */
>>> +		d_drop(dentry);
>> 
>> So we can even drop this part and retain the top condition as it was.
>> d_add does not care if the dentry we are feeding it was hashed or not,
>> so do you see any downsides to doing that I wonder?
> 
> d_add() on hashed dentry will end up reaching this:
> static void __d_rehash(struct dentry * entry, struct hlist_bl_head *b)
> {
>        BUG_ON(!d_unhashed(entry));

Ah, ok. Yes, I remember it now from the older issue with NFS.

It's still puzzling why I have not hit it yet, but oh well.

I wonder if the handling of negative dentries broke… Time for more investigation.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-06  3:25                             ` Oleg Drokin
@ 2016-07-06  4:35                               ` Oleg Drokin
  -1 siblings, 0 replies; 38+ messages in thread
From: Oleg Drokin @ 2016-07-06  4:35 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 5, 2016, at 11:25 PM, Oleg Drokin wrote:

> 
> On Jul 5, 2016, at 11:20 PM, Al Viro wrote:
> 
>> On Tue, Jul 05, 2016 at 08:29:37PM -0400, Oleg Drokin wrote:
>>>> +		/* Otherwise we just unhash it to be rehashed afresh via
>>>> +		 * lookup if necessary
>>>> +		 */
>>>> +		d_drop(dentry);
>>> 
>>> So we can even drop this part and retain the top condition as it was.
>>> d_add does not care if the dentry we are feeding it was hashed or not,
>>> so do you see any downsides to doing that I wonder?
>> 
>> d_add() on hashed dentry will end up reaching this:
>> static void __d_rehash(struct dentry * entry, struct hlist_bl_head *b)
>> {
>>       BUG_ON(!d_unhashed(entry));
> 
> Ah, ok. Yes, I remember about it now from the older issue with nfs.
> 
> It's still puzzling why I did not hit it yet, but oh well.
> 
> I wonder if handling of negative dentries broke… Time for more investigations.

Ah, the reason was staring me right in the face: it's the first thing we do in
ll_revalidate_dentry():

        if ((lookup_flags & (LOOKUP_OPEN | LOOKUP_CREATE)) ==
            (LOOKUP_OPEN | LOOKUP_CREATE))
                return 0;


Apparently we are trying to protect against the case where the dentry is valid when we
check it but gets pulled out from under us as soon as we are done (no Lustre locking is
held around it).
So if we don't do this, we fall into ll_file_open()/ll_intent_file_open() and form
a request to open the file based on the inode. If that inode is already gone
server-side, we might get ESTALE, since we don't really send the name to the
server in this case, so it does not know what it is we are after anyway.

On the userspace side of things the picture is not pretty then: we do an
open(O_CREAT) and get ESTALE back.

Looking at do_filp_open(), there's actually a retry:
        if (unlikely(filp == ERR_PTR(-ESTALE)))
                filp = path_openat(&nd, op, flags | LOOKUP_REVAL);

But it only retries once, so for a contended file some other thread can still
do a lookup and provide us with a new cached dentry that is similarly pulled out
from under us the next time around.
But I guess the whole idea of this is to let us know the file is contended, and
we can then just fail the revalidation.

Hm, looking at the server it's even worse: we'd get ENOENT from the server if
the desired inode no longer exists, so that's a perfect opportunity for the
client to turn that ENOENT into ESTALE if the create flag was set.
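
(Just to spell that idea out, the client-side mapping would be something along
these lines; "rc" and the place where O_CREAT is visible are assumptions about
whatever the open-by-fid reply handling actually has at hand, so this is no more
than a sketch of the suggestion above:

	if (rc == -ENOENT && (open_flags & O_CREAT))
		rc = -ESTALE;	/* let do_filp_open() retry with LOOKUP_REVAL */

)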

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05 13:51           ` Al Viro
  2016-07-05 15:21             ` Oleg Drokin
  2016-07-05 16:33             ` Oleg Drokin
@ 2016-07-06 16:24             ` Oleg Drokin
  2 siblings, 0 replies; 38+ messages in thread
From: Oleg Drokin @ 2016-07-06 16:24 UTC (permalink / raw)
  To: Al Viro; +Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 5, 2016, at 9:51 AM, Al Viro wrote:

> On Tue, Jul 05, 2016 at 01:31:10PM +0100, Al Viro wrote:
>> On Tue, Jul 05, 2016 at 02:22:48AM -0400, Oleg Drokin wrote:
>> 
>>>> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
>> 
>> s/d_unhashed/d_in_lookup/ in that.
>> 
>>> So we come racing here from multiple threads (say 3 or more - we have seen this
>>> in the older crash reports, so totally possible)
>>> 
>>>> +		d_drop(dentry);
>>> 
>>> One lucky one does this first before the others perform the !d_unhashed check above.
>>> This makes the other ones to not enter here.
>>> 
>>> And we are back to the original problem of multiple threads trying to instantiate
>>> same dentry as before.
>> 
>> Yep.  See above - it should've been using d_in_lookup() in the first place,
>> through the entire nfs_atomic_open().  Same in the Lustre part of fixes,
>> obviously.
> 
> See current #for-linus for hopefully fixed variants (both lustre and nfs)

So, returning to NFS: the patches held up for 24 hours under various loads
with no crashes or other such ill effects,
so I think the NFS patches are good to go.

As for the Lustre patches, I only tried my own, and those seem to be doing well too.
Do you want me to send you my Lustre patches just for this issue, or should I send
everything to Greg as usual?

Thanks.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-05  2:25               ` Al Viro
@ 2016-07-10 17:01                 ` Oleg Drokin
  2016-07-10 18:14                   ` James Simmons
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Drokin @ 2016-07-10 17:01 UTC (permalink / raw)
  To: Al Viro, James Simmons
  Cc: Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 4, 2016, at 10:25 PM, Al Viro wrote:

> BTW, could you take a look at
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git#sendmsg.lustre?
> It's a bunch of simplifications that became possible once sendmsg()/recvmsg()
> switched to iov_iter, stopped mangling the iovecs and went for predictable
> behaviour re advancing the iterator.

Thanks, this looks good to me and passes my testing (on tcp).

+typedef struct bio_vec lnet_kiov_t;

For this one, I guess we'll eventually need to just get rid of all lnet_kiov_t usage,
but that's something we can do ourselves.

Anyway, your patchset is based on an old tree and no longer applies cleanly,
so I rebased it onto the current staging tree to save you time in case
you want to go forward with it.
It's at git@github.com:verygreen/linux.git branch lustre-next-sendmsg

James, can you please give it a try on IB?

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-10 17:01                 ` Oleg Drokin
@ 2016-07-10 18:14                   ` James Simmons
  2016-07-11  1:01                     ` Al Viro
  0 siblings, 1 reply; 38+ messages in thread
From: James Simmons @ 2016-07-10 18:14 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Al Viro, Mailing List, <linux-fsdevel@vger.kernel.org>


> On Jul 4, 2016, at 10:25 PM, Al Viro wrote:
> 
> > BTW, could you take a look at
> > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git#sendmsg.lustre?
> > It's a bunch of simplifications that became possible once sendmsg()/recvmsg()
> > switched to iov_iter, stopped mangling the iovecs and went for predictable
> > behaviour re advancing the iterator.
> 
> Thanks, this looks good to me and passes my testing (on tcp).
> 
> +typedef struct bio_vec lnet_kiov_t;
> 
> This I guess we'll need to just get rid of all lnet_kiov_t usage, but that's
> something we can do ourselves, I guess.
> 
> Anyway, your patchset is based on old tree that no longer applies cleanly,
> I rebased it to current staging tree to save you time in case
> you want to go forward with it.
> It's at git@github.com:verygreen/linux.git branch lustre-next-sendmsg
> 
> James, can you please give it a try on IB?

It's broken for the ko2iblnd driver.

[  110.840583] LNet: Using FMR for registration
[  110.991747] LNet: Added LNI 10.37.248.137@o2ib1 [63/2560/0/180]
[  110.998211] ------------[ cut here ]------------
[  111.003012] kernel BUG at lib/iov_iter.c:513!
[  111.007545] invalid opcode: 0000 [#1] SMP
[  111.011731] Modules linked in: ko2iblnd(C) ptlrpc(C+) obdclass(C) ksocklnd(C) lnet(C) sha512_generic sha256_generic md5 crc32_generic crc32_pclmul libcfs(C) autofs4 ipmi_devintf auth_rpcgss nfsv4 dns_resolver 8021q iptable_filter ip_tables x_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm configfs ib_cm iw_cm mlx4_ib ib_core dm_mirror dm_region_hash dm_log dm_multipath sg sd_mod joydev pcspkr dm_mod mpt3sas raid_class acpi_cpufreq ipmi_ssif ipmi_si ipmi_msghandler isci libsas scsi_transport_sas wmi tpm_tis tpm i2c_i801 ahci libahci libata scsi_mod ehci_pci ehci_hcd button tcp_cubic nfsv3(E) nfs_acl(E) ipv6(E) nfs(E) lockd(E) sunrpc(E) grace(E) mlx4_en(E) mlx4_core(E) igb(E) i2c_algo_bit(E) i2c_core(E) ptp(E) pps_core(E) hwmon(E)
[  111.086669] CPU: 6 PID: 11899 Comm: router_checker Tainted: G         C E   4.7.0-rc6+ #1
[  111.095248] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.0a 02/19/2014
[  111.102040] task: ffff880826d32d80 ti: ffff880811d24000 task.ti: ffff880811d24000
[  111.109818] RIP: 0010:[<ffffffff8128daf2>]  [<ffffffff8128daf2>] iov_iter_kvec+0x22/0x30
[  111.118302] RSP: 0018:ffff880811d27b28  EFLAGS: 00010246
[  111.123806] RAX: 0000000000000000 RBX: ffff88105e037c00 RCX: 0000000000000000
[  111.131111] RDX: 0000000000000000 RSI: 0000000000000005 RDI: ffff880811d27b78
[  111.138426] RBP: ffff880811d27b28 R08: 0000000000000000 R09: 0000000000000000
[  111.145751] R10: 0000000000000000 R11: 00000000fffd19f7 R12: 0000000000000000
[  111.153083] R13: 0000000000000000 R14: 000500010a25ca3b R15: ffff880811d27b78
[  111.160407] FS:  0000000000000000(0000) GS:ffff88107fd00000(0000) knlGS:0000000000000000
[  111.168797] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  111.174717] CR2: 00007fd075c48945 CR3: 000000105a9ec000 CR4: 00000000000406e0
[  111.182031] Stack:
[  111.184224]  ffff880811d27be8 ffffffffa04012bd 0000000000000000 ffff880811d27b40
[  111.192206]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  111.192206]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  111.200196]  0000880800000002 ffff88105e394000 0000000000000000 0000000000000000
[  111.208179] Call Trace:
[  111.210818]  [<ffffffffa04012bd>] kiblnd_send+0x51d/0x9e0 [ko2iblnd]
[  111.217370]  [<ffffffffa06ec6bd>] lnet_ni_send+0x3d/0xe0 [lnet]
[  111.223487]  [<ffffffffa06ee223>] lnet_send+0x6b3/0xc80 [lnet]
[  111.229501]  [<ffffffffa06eeb58>] LNetGet+0x368/0x650 [lnet]
[  111.235346]  [<ffffffffa0692a50>] ? cfs_percpt_lock+0x50/0x110 [libcfs]
[  111.242139]  [<ffffffffa06f4d8f>] lnet_ping_router_locked+0x20f/0x840 [lnet]
[  111.249384]  [<ffffffffa06f5909>] lnet_router_checker+0xd9/0x490 [lnet]
[  111.256192]  [<ffffffff8108347d>] ? default_wake_function+0xd/0x10
[  111.262549]  [<ffffffff810923f1>] ? __wake_up_common+0x51/0x80
[  111.268562]  [<ffffffffa06f5830>] ? lnet_prune_rc_data+0x470/0x470 [lnet]
[  111.275544]  [<ffffffff81508f9b>] ? schedule+0x3b/0xa0
[  111.280871]  [<ffffffffa06f5830>] ? lnet_prune_rc_data+0x470/0x470 [lnet]
[  111.287849]  [<ffffffff810779d7>] kthread+0xc7/0xe0
[  111.292904]  [<ffffffff8150c3cf>] ret_from_fork+0x1f/0x40
[  111.298475]  [<ffffffff81077910>] ? kthread_freezable_should_stop+0x70/0x70
[  111.305631] Code: 2e 0f 1f 84 00 00 00 00 00 55 40 f6 c6 02 48 89 e5 74 18 89 37 48 89 57 18 48 89 4f 20 48 c7 47 08 00 00 00 00 4c 89 47 10 c9 c3 <0f> 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 48 8b 47 10 55 48 89
[  111.329528] RIP  [<ffffffff8128daf2>] iov_iter_kvec+0x22/0x30
[  111.335533]  RSP <ffff880811d27b28>
[  111.339360] ---[ end trace 1ea9288f558e2c8d ]---

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-10 18:14                   ` James Simmons
@ 2016-07-11  1:01                     ` Al Viro
  2016-07-11  1:03                       ` Al Viro
  2016-07-11 17:15                       ` More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes James Simmons
  0 siblings, 2 replies; 38+ messages in thread
From: Al Viro @ 2016-07-11  1:01 UTC (permalink / raw)
  To: James Simmons
  Cc: Oleg Drokin, Mailing List, <linux-fsdevel@vger.kernel.org>

On Sun, Jul 10, 2016 at 07:14:18PM +0100, James Simmons wrote:

> [  111.210818]  [<ffffffffa04012bd>] kiblnd_send+0x51d/0x9e0 [ko2iblnd]

Mea culpa - in kiblnd_send() this
        if (payload_kiov)
                iov_iter_bvec(&from, ITER_BVEC | WRITE,
                                payload_kiov, payload_niov, payload_nob);
        else
                iov_iter_kvec(&from, ITER_BVEC | WRITE,
                                payload_iov, payload_niov, payload_nob);
should have s/BVEC/KVEC/ in the iov_iter_kvec() arguments.  Cut'n'paste
braindamage...

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-11  1:01                     ` Al Viro
@ 2016-07-11  1:03                       ` Al Viro
  2016-07-11 22:54                         ` lustre sendmsg stuff Oleg Drokin
  2016-07-11 17:15                       ` More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes James Simmons
  1 sibling, 1 reply; 38+ messages in thread
From: Al Viro @ 2016-07-11  1:03 UTC (permalink / raw)
  To: James Simmons
  Cc: Oleg Drokin, Mailing List, <linux-fsdevel@vger.kernel.org>

On Mon, Jul 11, 2016 at 02:01:13AM +0100, Al Viro wrote:
> On Sun, Jul 10, 2016 at 07:14:18PM +0100, James Simmons wrote:
> 
> > [  111.210818]  [<ffffffffa04012bd>] kiblnd_send+0x51d/0x9e0 [ko2iblnd]
> 
> Mea culpa - in kiblnd_send() this
>         if (payload_kiov)
>                 iov_iter_bvec(&from, ITER_BVEC | WRITE,
>                                 payload_kiov, payload_niov, payload_nob);
>         else
>                 iov_iter_kvec(&from, ITER_BVEC | WRITE,
>                                 payload_iov, payload_niov, payload_nob);
> should have s/BVEC/KVEC/ in the iov_iter_kvec() arguments.  Cut'n'paste
> braindamage...

PS: That was introduced in the last commit in that pile - "lustre: introduce
lnet_copy_{k,}iov2iter(), kill lnet_copy_{k,}iov2{k,}iov()".

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
  2016-07-11  1:01                     ` Al Viro
  2016-07-11  1:03                       ` Al Viro
@ 2016-07-11 17:15                       ` James Simmons
  1 sibling, 0 replies; 38+ messages in thread
From: James Simmons @ 2016-07-11 17:15 UTC (permalink / raw)
  To: Al Viro; +Cc: Oleg Drokin, Mailing List, <linux-fsdevel@vger.kernel.org>


> On Sun, Jul 10, 2016 at 07:14:18PM +0100, James Simmons wrote:
> 
> > [  111.210818]  [<ffffffffa04012bd>] kiblnd_send+0x51d/0x9e0 [ko2iblnd]
> 
> Mea culpa - in kiblnd_send() this
>         if (payload_kiov)
>                 iov_iter_bvec(&from, ITER_BVEC | WRITE,
>                                 payload_kiov, payload_niov, payload_nob);
>         else
>                 iov_iter_kvec(&from, ITER_BVEC | WRITE,
>                                 payload_iov, payload_niov, payload_nob);
> should have s/BVEC/KVEC/ in the iov_iter_kvec() arguments.  Cut'n'paste
> braindamage...

That is the fix. Also I believe payload_nob should be payload_nob + payload_offset
instead. I will send a patch against Oleg's tree that addresses these issues; a
sketch of the combined change is below.
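
For illustration only, what the combined fix might look like - the KVEC flag plus
the payload_offset folded into the byte count, then skipped with iov_iter_advance().
The advance call is an assumption about how the offset would be consumed, not
something stated in the messages above:

        if (payload_kiov)
                iov_iter_bvec(&from, ITER_BVEC | WRITE,
                                payload_kiov, payload_niov,
                                payload_nob + payload_offset);
        else
                iov_iter_kvec(&from, ITER_KVEC | WRITE,
                                payload_iov, payload_niov,
                                payload_nob + payload_offset);
        /* skip the part of the payload before payload_offset */
        iov_iter_advance(&from, payload_offset);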

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: lustre sendmsg stuff
  2016-07-11  1:03                       ` Al Viro
@ 2016-07-11 22:54                         ` Oleg Drokin
  0 siblings, 0 replies; 38+ messages in thread
From: Oleg Drokin @ 2016-07-11 22:54 UTC (permalink / raw)
  To: Al Viro
  Cc: James Simmons, Mailing List, <linux-fsdevel@vger.kernel.org>


On Jul 10, 2016, at 9:03 PM, Al Viro wrote:

> On Mon, Jul 11, 2016 at 02:01:13AM +0100, Al Viro wrote:
>> On Sun, Jul 10, 2016 at 07:14:18PM +0100, James Simmons wrote:
>> 
>>> [  111.210818]  [<ffffffffa04012bd>] kiblnd_send+0x51d/0x9e0 [ko2iblnd]
>> 
>> Mea culpa - in kiblnd_send() this
>>        if (payload_kiov)
>>                iov_iter_bvec(&from, ITER_BVEC | WRITE,
>>                                payload_kiov, payload_niov, payload_nob);
>>        else
>>                iov_iter_kvec(&from, ITER_BVEC | WRITE,
>>                                payload_iov, payload_niov, payload_nob);
>> should have s/BVEC/KVEC/ in the iov_iter_kvec() arguments.  Cut'n'paste
>> braindamage...
> 
> PS: That was introduced in the last commit in that pile - "lustre: introduce
> lnet_copy_{k,}iov2iter(), kill lnet_copy_{k,}iov2{k,}iov()".

Is this something you plan to submit to Linus or should I just submit this to
Greg along with other changes?

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2016-07-11 22:55 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-17  4:09 More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes Oleg Drokin
2016-06-17  4:29 ` Al Viro
2016-06-25 16:38   ` Oleg Drokin
2016-07-03  6:29     ` Al Viro
2016-07-04  0:08       ` Al Viro
2016-07-04  0:37         ` Oleg Drokin
2016-07-04  0:37           ` Oleg Drokin
2016-07-04  3:08           ` Al Viro
2016-07-04  3:55             ` Oleg Drokin
2016-07-04  3:55               ` Oleg Drokin
2016-07-05  2:25               ` Al Viro
2016-07-10 17:01                 ` Oleg Drokin
2016-07-10 18:14                   ` James Simmons
2016-07-11  1:01                     ` Al Viro
2016-07-11  1:03                       ` Al Viro
2016-07-11 22:54                         ` lustre sendmsg stuff Oleg Drokin
2016-07-11 17:15                       ` More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes James Simmons
2016-07-05  2:28       ` Oleg Drokin
2016-07-05  2:32         ` Oleg Drokin
2016-07-05  4:43         ` Oleg Drokin
2016-07-05  6:22       ` Oleg Drokin
2016-07-05 12:31         ` Al Viro
2016-07-05 13:51           ` Al Viro
2016-07-05 15:21             ` Oleg Drokin
2016-07-05 17:42               ` Al Viro
2016-07-05 18:12                 ` Oleg Drokin
2016-07-05 16:33             ` Oleg Drokin
2016-07-05 18:08               ` Al Viro
2016-07-05 19:12                 ` Oleg Drokin
2016-07-05 20:08                   ` Al Viro
2016-07-05 20:21                     ` Oleg Drokin
2016-07-06  0:29                       ` Oleg Drokin
2016-07-06  3:20                         ` Al Viro
2016-07-06  3:25                           ` Oleg Drokin
2016-07-06  3:25                             ` Oleg Drokin
2016-07-06  4:35                             ` Oleg Drokin
2016-07-06  4:35                               ` Oleg Drokin
2016-07-06 16:24             ` Oleg Drokin
