Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg

From: Stanislav Kinsbursky <skinsbursky@parallels.com>
To: Jeff Layton <jlayton@redhat.com>
Cc: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
Date: Wed, 11 Apr 2012 14:09:40 +0400
Message-ID: <4F855864.8050509@parallels.com>
In-Reply-To: <20120410144505.58bca0b9@corrin.poochiereds.net>

10.04.2012 22:45, Jeff Layton wrote:
>>>> This check is expensive (as you mentioned), but has to be done only once, on NFS
>>>> server start.
>>>
>>> Well, no. The subtree check happens every time nfsd processes a
>>> filehandle -- see nfsd_acceptable().
>>>
>>> Basically we have to turn the filehandle into a dentry and then walk
>>> back up to the directory that's exported to verify that it is within
>>> the correct subtree. If that fails, then we might have to do it more
>>> than once if it's a hardlinked file.
>>>
>>
>> Wait. Looks like I'm missing something.
>> This subtree check has nothing to do with my proposal (if I'm not mistaken).
>> This option and its logic remain the same.
>> My proposal was to check the directories to be exported on NFS server
>> start. And if any of the passed exports intersects with any export already
>> shared by another NFSd, then shut down NFSd and print an error message.
>> Am I missing the point here?
>>
>
> Sorry I got confused with the discussion. You will need to do
> something similar to what subtree checking does in order to handle
> your proposal however.
>

Agreed. But this check should be performed only once, on NFS server start (not
on every fh lookup).
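
To illustrate the kind of one-time intersection check meant here, a minimal
user-space sketch in Python follows. It is not nfsd code. The function names
are hypothetical, and it assumes that all export paths are already canonical
absolute paths as seen from the initial file system environment (no symlinks
or ".." components left to resolve).

```python
from pathlib import PurePosixPath


def exports_intersect(a: str, b: str) -> bool:
    """Return True if one export path equals or contains the other.

    Assumption: both paths are canonical and absolute.
    """
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def find_conflicts(existing, requested):
    """Return (requested, existing) pairs whose subtrees overlap.

    Intended to run once at server start; a non-empty result would mean
    "refuse to start and print an error".
    """
    return [(r, e) for r in requested for e in existing
            if exports_intersect(r, e)]
```

The comparison is done on path components, so "/srv/nfs" and "/srv/nfs2" are
not treated as intersecting, while "/srv/nfs" and "/srv/nfs/a" are.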

>>>> With this solution, the grace period can be simple, and no support from the
>>>> exporting file system is required.
>>>> But the main problem here is that such intersections can be checked only in
>>>> the initial file system environment (containers with their own roots, gained
>>>> via chroot, can't handle this situation).
>>>> So, it means that there has to be some daemon (kernel or user space), which
>>>> will handle such requests from different NFS server instances... Which in turn
>>>> means that some way of communication between this daemon and the NFS servers
>>>> is required. And unix sockets (of any kind) don't suit here, which makes this
>>>> problem more difficult.
>>>>
>>>
>>> This is a truly ugly problem, and unfortunately parts of the nfsd
>>> codebase are very old and crusty. We've got a lot of cleanup work ahead
>>> of us no matter what design we settle on.
>>>
>>> This is really a lot bigger than the grace period. I think we ought to
>>> step back a bit and consider this more "holistically" first. Do you
>>> have a pointer to an overall design document or something?
>>>
>>
>> What exactly are you asking about? The overall design of containerization?
>>
>
> I meant containerization of nfsd in particular.
>

If you are asking about some kind of white paper, then I don't have one.
But here are the main visible targets:
1) Move all network-related resources to per-net data (caches, grace period,
lockd calls, transports, your tracking engine).
2) Make the nfsd filesystem superblock per network namespace.
3) The service itself will be controlled the way Lockd is (one pool for all,
with per-net resources allocated on service start).

>>> One thing that puzzles me at the moment. We have two namespaces to deal
>>> with -- the network and the mount namespace. With nfs client code,
>>> everything is keyed off of the net namespace. That's not really the
>>> case here since we have to deal with a local fs tree as well.
>>>
>>> When an nfsd running in a container receives an RPC, how does it
>>> determine what mount namespace it should do its operations in?
>>>
>>
>> We don't use mount namespaces, so that's why I wasn't thinking about it...
>> But if we have 2 types of namespaces, then we have to tie the mount namespace
>> to the network one. I.e. we can get the desired mount namespace from the
>> per-net NFSd data.
>>
>
> One thing that Bruce mentioned to me privately is that we could plan to
> use whatever mount namespace mountd is using within a particular net
> namespace. That makes some sense since mountd is the final arbiter of
> who gets access to what.
>

Could you please give some examples? I don't get the idea.

>> But please don't ask me what will happen if two or more NFS servers share the
>> same mount namespace... Looks like this case should be forbidden.
>>
>
> I'm not sure we need to forbid sharing the mount namespace. They might
> be exporting completely different filesystems after all, in which case
> we'd be forbidding it for no good reason.
>

Actually, if we make the file system responsible for grace period control, then
yes, there is no reason to forbid a shared mount namespace.

> Note that it is quite easy to get lost in the weeds with this. I've been
> struggling to get a working design for a clustered nfsv4 server for the
> last several months and have had some time to wrestle with these
> issues. It's anything but trivial.
>
> What you may need to do in order to make progress is to start with some
> valid use-cases for this stuff, and get those working while disallowing
> or ignoring other use cases. We'll never get anywhere if we try to solve
> all of these problems at once...
>

Agreed.
So, my current understanding of the situation can be summarized as follows:

1) The idea of making the grace period (and its internals) per network namespace
stays the same. But its implementation affects only the current "generic grace
period" code.

2) Your idea of making the grace period per file system looks reasonable. And
maybe this approach (using the filesystem's export operations, if available)
should be used by default.
But I suggest adding a new option to exports (say, "no_fs_grace"), which will
disable this new functionality. With this option, the system administrator
becomes responsible for any problems with a shared file system.
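
The selection logic described in points 1) and 2) could be sketched roughly as
below. This is an illustrative Python model only, not kernel code. The Export
class, the fs_in_grace hook, and export_in_grace() are hypothetical names, and
"no_fs_grace" is only the option name suggested above, not an existing export
option.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Export:
    path: str
    # Hypothetical export option from point 2): opt out of fs-level grace.
    no_fs_grace: bool = False
    # Hypothetical hook standing in for a filesystem export operation;
    # None means the filesystem provides no grace period support.
    fs_in_grace: Optional[Callable[[], bool]] = None


def export_in_grace(export: Export, net_in_grace: bool) -> bool:
    """Decide whether locking on this export is still in its grace period.

    Per-filesystem grace is used by default when the filesystem supports
    it; otherwise (or with no_fs_grace set) the generic per-net grace
    period state applies.
    """
    if export.fs_in_grace is not None and not export.no_fs_grace:
        return export.fs_in_grace()
    return net_in_grace
```

With no_fs_grace set, the filesystem hook is never consulted, which matches the
idea that the administrator then takes responsibility for a shared file system.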

Any objections?

-- 
Best regards,
Stanislav Kinsbursky