All of lore.kernel.org
 help / color / mirror / Atom feed
From: NeilBrown <neilb@suse.de>
To: Kinglong Mee <kinglongmee@gmail.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH RFC] NFSD: fix cannot umounting mount points under pseudo root
Date: Wed, 29 Apr 2015 12:57:28 +1000	[thread overview]
Message-ID: <20150429125728.69ddfc6c@notabene.brown> (raw)
In-Reply-To: <553E2784.6020906@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 9458 bytes --]

On Mon, 27 Apr 2015 20:11:48 +0800 Kinglong Mee <kinglongmee@gmail.com> wrote:

> On 4/24/2015 11:00 AM, NeilBrown wrote:
> > On Thu, 23 Apr 2015 20:52:40 +0800 Kinglong Mee <kinglongmee@gmail.com> wrote:
> > 
> >> On 4/23/2015 7:44 AM, NeilBrown wrote:
> >>> On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> >>> wrote:
> >>>>> Reference of dentry/mnt is like a cache, avoids re-finding of them,
> >>>>> it is necessary to store them in svc_export.
> >>>>>
> >>>>> Neil points out another way of 'fs_pin', I will check that.
> >>>>
> >>>> Yes, that'd be interesting.  On a very quick look--I don't understand
> >>>> what it's meant to do at all.  But if it does provide a way to get a
> >>>> callback on umount, that'd certainly be interesting....
> >>>
> >>> Yeah, on a quick look it isn't really obvious at all.
> >>>
> >>> But I didn't read the code.  I read
> >>>  https://lwn.net/Articles/636730/
> >>>
> >>> which says:
> >>>
> >>>     In its place is a mechanism by which an object can be added to a vfsmount
> >>>     structure (which represents a mounted filesystem); that object supports 
> >>>     only one method: kill(). These "pin" objects hang around until the final
> >>>     reference to the vfsmount goes away, at which point each one's kill() function
> >>>     is called. It thus is a clean mechanism for performing cleanup when a filesystem
> >>>     goes away.
> >>>
> >>> This is used to close "BSD process accounting" files on umount, and sound
> >>> like the perfect interface for flushing things from the sunrpc cache on
> >>> umount.
> >>
> >> I have review those codes in fs/fs_pin.c and kernel/acct.c.
> >> 'fs_pin' is used to fix the race between file->f_path.mnt for acct
> >> and umount, file->f_path.mnt just holds the reference of vfsmount
> >> but not keep busy, umount will check the reference and return busy.
> >>
> >> The logical is same as sunrpc cache for exports.
> >>
> >> Before commit 3064c3563b ("death to mnt_pinned"), acct uses a hack
> >> method, acct get a pin to vfsmount and then put the reference,
> >> so umount finds no other reference and callback those pins in vfsmount,
> >> at last umount success.
> >>
> >> After that commit, besides pin to original vfsmount and put the reference
> >> of the original vfsmount, acct clone the vfsmount storing in file->f_path.mnt.
> >>
> >> The different between those two methods is the value of file->f_path.mnt,
> >> the first one, it contains the original vfsmnt without reference,
> >> the second one, contains a cloned vfsmnt with reference.
> >>
> >> I have test the first method, pins to vfsmount for all exports cache,
> >> nfsd can work but sometimes will delay or hang at rpc.mountd's calling
> >> of name_to_handle_at, I can't find the reason.
> > 
> > A kernel stack trace of exactly where it is hanging would help a lot.
> 
> Witch adding fs_pin to exports, it seems there is a mutex race between
> nfsd and rpc.mountd for locking parent inode.
> 
> Apr 27 19:57:03 ntest kernel: rpc.mountd      D ffff88006ac5fc28     0  1152      1 0x00000000
> Apr 27 19:57:03 ntest kernel: ffff88006ac5fc28 ffff880068c38970 ffff880068c3de60 ffff88006ac5fc08
> Apr 27 19:57:03 ntest kernel: ffff88006ac60000 ffff880035ecf694 ffff880068c3de60 00000000ffffffff
> Apr 27 19:57:03 ntest kernel: ffff880035ecf698 ffff88006ac5fc48 ffffffff8177ffd7 0000000000000000
> Apr 27 19:57:03 ntest kernel: Call Trace:
> Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
> Apr 27 19:57:03 ntest kernel: [<ffffffff8178031e>] schedule_preempt_disabled+0xe/0x10
> Apr 27 19:57:03 ntest kernel: [<ffffffff81781e92>] __mutex_lock_slowpath+0xb2/0x120
> Apr 27 19:57:03 ntest kernel: [<ffffffff81781f23>] mutex_lock+0x23/0x40
> Apr 27 19:57:03 ntest kernel: [<ffffffff81230714>] lookup_slow+0x34/0xc0
> Apr 27 19:57:03 ntest kernel: [<ffffffff812358ee>] path_lookupat+0x89e/0xc60
> Apr 27 19:57:03 ntest kernel: [<ffffffff81205192>] ? kmem_cache_alloc+0x1e2/0x260
> Apr 27 19:57:03 ntest kernel: [<ffffffff812367a6>] ? getname_flags+0x56/0x200
> Apr 27 19:57:03 ntest kernel: [<ffffffff81235cd7>] filename_lookup+0x27/0xc0
> Apr 27 19:57:03 ntest kernel: [<ffffffff81237b53>] user_path_at_empty+0x63/0xd0
> Apr 27 19:57:03 ntest kernel: [<ffffffff8121d722>] ? put_object+0x32/0x60
> Apr 27 19:57:03 ntest kernel: [<ffffffff8121d94d>] ? delete_object_full+0x2d/0x40
> Apr 27 19:57:03 ntest kernel: [<ffffffff8123e273>] ? dput+0x33/0x230
> Apr 27 19:57:03 ntest kernel: [<ffffffff81237bd1>] user_path_at+0x11/0x20
> Apr 27 19:57:03 ntest kernel: [<ffffffff81286489>] SyS_name_to_handle_at+0x59/0x200
> Apr 27 19:57:03 ntest kernel: [<ffffffff81783f6e>] system_call_fastpath+0x12/0x71
> Apr 27 19:57:03 ntest kernel: nfsd            S ffff88006e92b708     0  1170      2 0x00000000
> Apr 27 19:57:03 ntest kernel: ffff88006e92b708 ffffffff81c12480 ffff88006acbcb80 ffff88006e92b758
> Apr 27 19:57:03 ntest kernel: ffff88006e92c000 ffffffff82290040 ffff88006e92b758 ffffffff82290040
> Apr 27 19:57:03 ntest kernel: 00000000fffffff5 ffff88006e92b728 ffffffff8177ffd7 0000000100043a09
> Apr 27 19:57:03 ntest kernel: Call Trace:
> Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
> Apr 27 19:57:03 ntest kernel: [<ffffffff81782e97>] schedule_timeout+0x117/0x230
> Apr 27 19:57:03 ntest kernel: [<ffffffff8110b240>] ? internal_add_timer+0xb0/0xb0
> Apr 27 19:57:03 ntest kernel: [<ffffffff81780ff3>] wait_for_completion_interruptible_timeout+0xf3/0x150
> Apr 27 19:57:03 ntest kernel: [<ffffffff810cb090>] ? wake_up_state+0x20/0x20
> Apr 27 19:57:03 ntest kernel: [<ffffffffa01350cc>] cache_wait_req.isra.10+0x9c/0x110 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0134760>] ? sunrpc_init_cache_detail+0xc0/0xc0 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0135c54>] cache_check+0x1d4/0x380 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffff8133248c>] ? inode_doinit_with_dentry+0x48c/0x6a0
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045e132>] exp_get_by_name+0x82/0xb0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffff81776d0a>] ? kmemleak_free+0x3a/0xa0
> Apr 27 19:57:03 ntest kernel: [<ffffffff81332210>] ? inode_doinit_with_dentry+0x210/0x6a0
> Apr 27 19:57:03 ntest kernel: [<ffffffff813326bc>] ? selinux_d_instantiate+0x1c/0x20
> Apr 27 19:57:03 ntest kernel: [<ffffffff8123def7>] ? _d_rehash+0x37/0x40
> Apr 27 19:57:03 ntest kernel: [<ffffffff8123f776>] ? d_splice_alias+0xa6/0x2d0
> Apr 27 19:57:03 ntest kernel: [<ffffffff812be90b>] ? ext4_lookup+0xdb/0x160
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0460114>] rqst_exp_get_by_name+0x64/0x140 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045a8f6>] nfsd_cross_mnt+0x76/0x1b0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0475300>] nfsd4_encode_dirent+0x160/0x3d0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa04751a0>] ? nfsd4_encode_getattr+0x40/0x40 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045c631>] nfsd_readdir+0x1c1/0x2a0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045a1e0>] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa046bd70>] nfsd4_encode_readdir+0x120/0x220 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa047573d>] nfsd4_encode_operation+0x7d/0x190 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa046a51d>] nfsd4_proc_compound+0x24d/0x6f0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0455f83>] nfsd_dispatch+0xc3/0x220 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa012a01b>] svc_process_common+0x43b/0x690 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa012b133>] svc_process+0x103/0x1b0 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045596f>] nfsd+0xff/0x170 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0455870>] ? nfsd_destroy+0x80/0x80 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd378>] kthread+0xd8/0xf0
> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
> Apr 27 19:57:03 ntest kernel: [<ffffffff81784362>] ret_from_fork+0x42/0x70
> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
> 

Hmm... That looks bad.  I wonder why we haven't hit the deadlock before.

nfsd gets a LOOKUP request for $DIR/name which is a mountpoint, or a READDIR
request for $DIR which contains 'name' which is a mountpoint.

nfsd_lookup_dentry of nfsd_buffered_readdir lock i_mutex on $DIR, then do
the mountpoint check.  If the mounted filesystem is not exported, but the $DIR
filesystem has 'crossmnt', an upcall to mountd asks for export information.
When mountd try to do a lookup on the name, it works fine if lookup_fast finds
the child, but it deadlocks on the parent's i_mutex if it has to go to
lookup_slow.

It should only go to lookup_slow() if:
  - the child isn't in cache .. but as it is mounted on, it must be
  - d_revalidate returns 0

This probably means that $DIR is on an NFS filesystem and it failed the
revalidation for some reason....

Kinglong: can you reproduce this easily?  If so can you enable nfs debugging 
and collect the output?
I want to confirm that the dfprintk in nfs_lookup_revalidate is saying that
the name is invalid.

In any case, holding a mutex while waiting for an upcall is probably a bad
idea.  We should try to find a way to work around that...
Any ideas Bruce ?-)

NeilBrown



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

  reply	other threads:[~2015-04-29  2:57 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-21 14:50 [PATCH RFC] NFSD: fix cannot umounting mount points under pseudo root Kinglong Mee
2015-04-21 21:54 ` J. Bruce Fields
2015-04-22  5:07   ` NeilBrown
2015-04-22 11:11   ` Kinglong Mee
2015-04-22 15:07     ` J. Bruce Fields
2015-04-22 23:44       ` NeilBrown
2015-04-23 12:52         ` Kinglong Mee
2015-04-24  3:00           ` NeilBrown
2015-04-27 12:11             ` Kinglong Mee
2015-04-29  2:57               ` NeilBrown [this message]
2015-04-29  8:45                 ` Kinglong Mee
2015-04-29 19:19                 ` J. Bruce Fields
2015-04-29 21:52                   ` NeilBrown
2015-04-30 21:36                     ` J. Bruce Fields
2015-05-01  1:53                       ` NeilBrown
2015-05-01  2:03                         ` Al Viro
2015-05-01  2:23                           ` NeilBrown
2015-05-01  2:29                             ` Al Viro
2015-05-01  3:08                               ` NeilBrown
2015-05-01 13:29                                 ` J. Bruce Fields
2015-05-02 23:16                                   ` NeilBrown
2015-05-03  0:37                                     ` J. Bruce Fields
2015-05-04  4:11                                       ` NeilBrown
2015-05-04 21:48                                     ` J. Bruce Fields
2015-05-05 22:27                                       ` NeilBrown
2015-05-04 22:01                         ` J. Bruce Fields
2015-05-05 13:54                           ` Kinglong Mee
2015-05-05 14:18                             ` J. Bruce Fields
2015-05-05 15:52                               ` J. Bruce Fields
2015-05-05 22:26                                 ` NeilBrown
2015-05-08 16:15                                   ` J. Bruce Fields
2015-05-08 20:01                                     ` [PATCH] nfsd: don't hold i_mutex over userspace upcalls J. Bruce Fields
2015-06-03 15:18                                       ` J. Bruce Fields
     [not found]                                         ` <20150603151819.GA8441-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2015-07-05 11:27                                           ` Kinglong Mee
2015-07-05 11:27                                             ` Kinglong Mee
2015-07-06 18:22                                             ` J. Bruce Fields
2015-08-18 19:10                                           ` J. Bruce Fields
2015-08-18 19:10                                             ` J. Bruce Fields
     [not found]                                             ` <20150818191028.GA3957-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2015-11-12 21:22                                               ` J. Bruce Fields
2015-11-12 21:22                                                 ` J. Bruce Fields
2015-05-07 15:31                                 ` [PATCH RFC] NFSD: fix cannot umounting mount points under pseudo root J. Bruce Fields
2015-05-07 22:42                                   ` NeilBrown
2015-05-08 14:10                                     ` J. Bruce Fields
2015-05-05  3:53                       ` Kinglong Mee
2015-05-05  4:19                         ` NeilBrown
2015-05-05  8:32                           ` Kinglong Mee
2015-05-05 13:52                             ` J. Bruce Fields
2015-06-26 23:14                             ` Kinglong Mee
2015-06-26 23:35                               ` NeilBrown
2015-07-02  9:42                                 ` Kinglong Mee
2015-05-01  1:55                     ` Al Viro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150429125728.69ddfc6c@notabene.brown \
    --to=neilb@suse.de \
    --cc=bfields@fieldses.org \
    --cc=kinglongmee@gmail.com \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.