linux-kernel.vger.kernel.org archive mirror
* Re: Grace period
       [not found] ` <20120406234039.GA20940@fieldses.org>
@ 2012-04-09 11:24   ` Stanislav Kinsbursky
  2012-04-09 13:47     ` Jeff Layton
  2012-04-09 23:26     ` bfields
  0 siblings, 2 replies; 23+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-09 11:24 UTC (permalink / raw)
  To: bfields, Trond.Myklebust; +Cc: linux-nfs, linux-kernel

07.04.2012 03:40, bfields@fieldses.org wrote:
> On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote:
>> Hello, Bruce.
>> Could you, please, clarify this reason why grace list is used?
>> I.e. why list is used instead of some atomic variable, for example?
>
> Like just a reference count?  Yeah, that would be OK.
>
> In theory it could provide some sort of debugging help.  (E.g. we could
> print out the list of "lock managers" currently keeping us in grace.)  I
> had some idea we'd make those lock manager objects more complicated, and
> might have more for individual containerized services.

Could you share this idea, please?

Anyway, I have nothing against lists; I was just curious why one was used.
I have added Trond and the lists to this reply.
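
For reference, the list-based scheme being discussed is roughly this (a 
simplified sketch of fs/lockd/grace.c as I read it, with the EXPORT_SYMBOL 
lines trimmed):

/* Simplified sketch of the current global grace-period tracking,
 * as I read fs/lockd/grace.c; not a verbatim copy. */
static LIST_HEAD(grace_list);
static DEFINE_SPINLOCK(grace_lock);

void locks_start_grace(struct lock_manager *lm)
{
	spin_lock(&grace_lock);
	list_add(&lm->list, &grace_list);
	spin_unlock(&grace_lock);
}

void locks_end_grace(struct lock_manager *lm)
{
	spin_lock(&grace_lock);
	list_del_init(&lm->list);
	spin_unlock(&grace_lock);
}

/* The server is in grace as long as any lock manager is on the list,
 * which is what a bare reference count would collapse into. */
int locks_in_grace(void)
{
	return !list_empty(&grace_list);
}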

Let me explain the problem with the grace period that I'm facing right now, and 
what I'm thinking about it.
So, one of the things to be containerized during the "NFSd per net ns" work is 
the grace period, and these are its basic components:
1) Grace period start.
2) Grace period end.
3) Grace period check.
4) Grace period restart.

So, the simplest straightforward way is to make all the internal state - 
"grace_list", "grace_lock", the "grace_period_end" work, and both "lockd_manager" 
and "nfsd4_manager" - per network namespace. "laundromat_work" has to be 
per-net as well.
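
Roughly, I imagine the per-net state would follow the usual 
net_generic/pernet_operations pattern, something like this illustrative sketch 
(the struct and function names here are made up, not existing kernel symbols):

#include <net/net_namespace.h>
#include <net/netns/generic.h>
#include <linux/workqueue.h>

/* Illustrative sketch only: per-net grace-period state. */
struct grace_net {
	struct list_head	grace_list;	/* lock managers keeping this net in grace */
	spinlock_t		grace_lock;
	struct delayed_work	grace_period_end; /* armed when grace is started for this net */
};

static int grace_net_id;

static __net_init int grace_net_init(struct net *net)
{
	struct grace_net *gn = net_generic(net, grace_net_id);

	INIT_LIST_HEAD(&gn->grace_list);
	spin_lock_init(&gn->grace_lock);
	return 0;
}

static struct pernet_operations grace_net_ops = {
	.init = grace_net_init,
	.id   = &grace_net_id,
	.size = sizeof(struct grace_net),
};
/* registered with register_pernet_subsys(&grace_net_ops) on module load */

/* The grace-period check then takes the net (or svc_rqst) as context: */
static int locks_in_grace_net(struct net *net)
{
	struct grace_net *gn = net_generic(net, grace_net_id);

	return !list_empty(&gn->grace_list);
}
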
In this case:
1) Start - the grace period can be started per net ns in "lockd_up_net()" (thus 
it has to be moved there from "lockd()") and in "nfs4_state_start()".
2) End - the grace period can be ended per net ns in "lockd_down_net()" (thus it 
has to be moved there from "lockd()"), in "nfsd4_end_grace()" and in 
"nfs4_state_shutdown()".
3) Check - looks easy. Either an svc_rqst or a net context can be passed to the 
function.
4) Restart - this is the tricky part. It would be great to restart the grace 
period only for the network namespace of the sender of the kill signal. So the 
idea is to check siginfo_t for the pid of the sender, then try to locate the 
task, and if it is found, get the sender's network namespace and restart the 
grace period only for that namespace (of course, only if lockd was started for 
that namespace - see below).

If the task is not found, or if lockd wasn't started for its namespace, then the 
grace period can either be restarted for all namespaces or just silently 
dropped. This is the part I'm not sure how to handle, because restarting the 
grace period for all namespaces would be overkill...

There is also another problem with the "task by pid" search: the task we find 
may actually not be the sender (which has died already), but some other new task 
with the same pid number. In this case, I think, we can just neglect this 
possibility and always assume that we have located the sender (if, of course, 
lockd was started for the sender's network namespace).
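
For the sender lookup itself, something like get_net_ns_by_pid() might be 
usable (a rough sketch only; I'm assuming it is safe to call in this context, 
and lockd_up_in_net()/restart_grace_net() below are hypothetical helpers):

#include <net/net_namespace.h>

/* Rough sketch of restarting grace only for the signal sender's netns. */
static void restart_grace_for_sender(pid_t sender_pid)
{
	struct net *net = get_net_ns_by_pid(sender_pid);

	if (IS_ERR(net))
		return;	/* sender already gone: restart for all nets, or just drop? */

	if (lockd_up_in_net(net))	/* hypothetical: was lockd started for this netns? */
		restart_grace_net(net);	/* hypothetical per-net restart */
	put_net(net);
}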

Trond, Bruce, could you please comment on these ideas?

-- 
Best regards,
Stanislav Kinsbursky


* Re: Grace period
  2012-04-09 11:24   ` Grace period Stanislav Kinsbursky
@ 2012-04-09 13:47     ` Jeff Layton
  2012-04-09 14:25       ` Stanislav Kinsbursky
  2012-04-09 23:26     ` bfields
  1 sibling, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2012-04-09 13:47 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: bfields, Trond.Myklebust, linux-nfs, linux-kernel

On Mon, 09 Apr 2012 15:24:19 +0400
Stanislav Kinsbursky <skinsbursky@parallels.com> wrote:

> 07.04.2012 03:40, bfields@fieldses.org пишет:
> > On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote:
> >> Hello, Bruce.
> >> Could you, please, clarify this reason why grace list is used?
> >> I.e. why list is used instead of some atomic variable, for example?
> >
> > Like just a reference count?  Yeah, that would be OK.
> >
> > In theory it could provide some sort of debugging help.  (E.g. we could
> > print out the list of "lock managers" currently keeping us in grace.)  I
> > had some idea we'd make those lock manager objects more complicated, and
> > might have more for individual containerized services.
> 
> Could you share this idea, please?
> 
> Anyway, I have nothing against lists. Just was curious, why it was used.
> I added Trond and lists to this reply.
> 
> Let me explain, what is the problem with grace period I'm facing right know, and 
> what I'm thinking about it.
> So, one of the things to be containerized during "NFSd per net ns" work is the 
> grace period, and these are the basic components of it:
> 1) Grace period start.
> 2) Grace period end.
> 3) Grace period check.
> 3) Grace period restart.
> 
> So, the simplest straight-forward way is to make all internal stuff: 
> "grace_list", "grace_lock", "grace_period_end" work and both "lockd_manager" and 
> "nfsd4_manager" - per network namespace. Also, "laundromat_work" have to be 
> per-net as well.
> In this case:
> 1) Start - grace period can be started per net ns in "lockd_up_net()" (thus has 
> to be moves there from "lockd()") and "nfs4_state_start()".
> 2) End - grace period can be ended per net ns in "lockd_down_net()" (thus has to 
> be moved there from "lockd()"), "nfsd4_end_grace()" and "fs4_state_shutdown()".
> 3) Check - looks easy. There is either svc_rqst or net context can be passed to 
> function.
> 4) Restart - this is a tricky place. It would be great to restart grace period 
> only for the networks namespace of the sender of the kill signal. So, the idea 
> is to check siginfo_t for the pid of sender, then try to locate the task, and if 
> found, then get sender's networks namespace, and restart grace period only for 
> this namespace (of course, if lockd was started for this namespace - see below).
> 
> If task not found, of it's lockd wasn't started for it's namespace, then grace 
> period can be either restarted for all namespaces, of just silently dropped. 
> This is the place where I'm not sure, how to do. Because calling grace period 
> for all namespaces will be overkill...
> 
> There also another problem with the "task by pid" search, that found task can be 
> actually not sender (which died already), but some other new task with the same 
> pid number. In this case, I think, we can just neglect this probability and 
> always assume, that we located sender (if, of course, lockd was started for 
> sender's network namespace).
> 
> Trond, Bruce, could you, please, comment this ideas?
> 

I can comment, but I'm not sure that will be sufficient.

The grace period has a particular purpose. It keeps nfsd or lockd from
handing out stateful objects (e.g. locks) before clients have an
opportunity to reclaim them. Once the grace period expires, there is no
more reclaim allowed and "normal" lock and open requests can proceed.

Traditionally, there has been one nfsd or lockd "instance" per host.
With that, we were able to get away with a relatively simple-minded
approach of a global grace period that's gated on nfsd or lockd's
startup and shutdown.

Now, you're looking at making multiple nfsd or lockd "instances". Does
it make sense to make this a per-net thing? Here's a particularly
problematic case to illustrate what I mean:

Suppose I have a filesystem that's mounted and exported in two
different containers. You start up one container and then 60s later,
start up the other. The grace period expires in the first container and
that nfsd hands out locks that conflict with some that have not been
reclaimed yet in the other.

Now, we can just try to say "don't export the same fs from more than
one container". But we all know that people will do it anyway, since
there's nothing that really stops you from doing so.

What probably makes more sense is making the grace period a per-sb
property, and coming up with a set of rules for the fs going into and
out of "grace" status.

Perhaps a way for different net namespaces to "subscribe" to a
particular fs, and don't take the fs out of grace until all of the
grace period timers pop? If a fs attempts to subscribe after the fs
comes out of grace, then its subscription would be denied and reclaim
attempts would get NFS4ERR_NOGRACE or the NLM equivalent.
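
Purely to illustrate the shape of such a scheme (nothing like this exists in 
the tree today; all of the names below are invented):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>
#include <net/net_namespace.h>

/* Invented sketch of a per-superblock grace "subscription". */
struct sb_grace_subscriber {
	struct list_head	list;
	struct net		*net;	/* the netns still reclaiming on this sb */
};

struct sb_grace {
	struct list_head	subscribers;
	spinlock_t		lock;
	bool			grace_over;	/* set once the last subscriber's timer pops */
};

/* A netns asks to reclaim against this sb; denied once the sb has left grace,
 * so reclaim attempts would then get NFS4ERR_NOGRACE (or the NLM equivalent). */
static int sb_grace_subscribe(struct sb_grace *sbg, struct sb_grace_subscriber *sub)
{
	int ret = 0;

	spin_lock(&sbg->lock);
	if (sbg->grace_over)
		ret = -EPERM;
	else
		list_add(&sub->list, &sbg->subscribers);
	spin_unlock(&sbg->lock);
	return ret;
}

/* Called when one subscriber's grace timer pops; the sb leaves grace only
 * when the last subscriber is gone. */
static void sb_grace_unsubscribe(struct sb_grace *sbg, struct sb_grace_subscriber *sub)
{
	spin_lock(&sbg->lock);
	list_del_init(&sub->list);
	if (list_empty(&sbg->subscribers))
		sbg->grace_over = true;
	spin_unlock(&sbg->lock);
}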

-- 
Jeff Layton <jlayton@redhat.com>


* Re: Grace period
  2012-04-09 13:47     ` Jeff Layton
@ 2012-04-09 14:25       ` Stanislav Kinsbursky
  2012-04-09 15:27         ` Jeff Layton
  0 siblings, 1 reply; 23+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-09 14:25 UTC (permalink / raw)
  To: Jeff Layton; +Cc: bfields, Trond.Myklebust, linux-nfs, linux-kernel

09.04.2012 17:47, Jeff Layton wrote:
> On Mon, 09 Apr 2012 15:24:19 +0400
> Stanislav Kinsbursky<skinsbursky@parallels.com>  wrote:
>
>> 07.04.2012 03:40, bfields@fieldses.org пишет:
>>> On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote:
>>>> Hello, Bruce.
>>>> Could you, please, clarify this reason why grace list is used?
>>>> I.e. why list is used instead of some atomic variable, for example?
>>>
>>> Like just a reference count?  Yeah, that would be OK.
>>>
>>> In theory it could provide some sort of debugging help.  (E.g. we could
>>> print out the list of "lock managers" currently keeping us in grace.)  I
>>> had some idea we'd make those lock manager objects more complicated, and
>>> might have more for individual containerized services.
>>
>> Could you share this idea, please?
>>
>> Anyway, I have nothing against lists. Just was curious, why it was used.
>> I added Trond and lists to this reply.
>>
>> Let me explain, what is the problem with grace period I'm facing right know, and
>> what I'm thinking about it.
>> So, one of the things to be containerized during "NFSd per net ns" work is the
>> grace period, and these are the basic components of it:
>> 1) Grace period start.
>> 2) Grace period end.
>> 3) Grace period check.
>> 3) Grace period restart.
>>
>> So, the simplest straight-forward way is to make all internal stuff:
>> "grace_list", "grace_lock", "grace_period_end" work and both "lockd_manager" and
>> "nfsd4_manager" - per network namespace. Also, "laundromat_work" have to be
>> per-net as well.
>> In this case:
>> 1) Start - grace period can be started per net ns in "lockd_up_net()" (thus has
>> to be moves there from "lockd()") and "nfs4_state_start()".
>> 2) End - grace period can be ended per net ns in "lockd_down_net()" (thus has to
>> be moved there from "lockd()"), "nfsd4_end_grace()" and "fs4_state_shutdown()".
>> 3) Check - looks easy. There is either svc_rqst or net context can be passed to
>> function.
>> 4) Restart - this is a tricky place. It would be great to restart grace period
>> only for the networks namespace of the sender of the kill signal. So, the idea
>> is to check siginfo_t for the pid of sender, then try to locate the task, and if
>> found, then get sender's networks namespace, and restart grace period only for
>> this namespace (of course, if lockd was started for this namespace - see below).
>>
>> If task not found, of it's lockd wasn't started for it's namespace, then grace
>> period can be either restarted for all namespaces, of just silently dropped.
>> This is the place where I'm not sure, how to do. Because calling grace period
>> for all namespaces will be overkill...
>>
>> There also another problem with the "task by pid" search, that found task can be
>> actually not sender (which died already), but some other new task with the same
>> pid number. In this case, I think, we can just neglect this probability and
>> always assume, that we located sender (if, of course, lockd was started for
>> sender's network namespace).
>>
>> Trond, Bruce, could you, please, comment this ideas?
>>
>
> I can comment and I'm not sure that will be sufficient.
>

Hi, Jeff. Thanks for the comment.

> The grace period has a particular purpose. It keeps nfsd or lockd from
> handing out stateful objects (e.g. locks) before clients have an
> opportunity to reclaim them. Once the grace period expires, there is no
> more reclaim allowed and "normal" lock and open requests can proceed.
>
> Traditionally, there has been one nfsd or lockd "instance" per host.
> With that, we were able to get away with a relatively simple-minded
> approach of a global grace period that's gated on nfsd or lockd's
> startup and shutdown.
>
> Now, you're looking at making multiple nfsd or lockd "instances". Does
> it make sense to make this a per-net thing? Here's a particularly
> problematic case to illustrate what I mean:
>
> Suppose I have a filesystem that's mounted and exported in two
> different containers. You start up one container and then 60s later,
> start up the other. The grace period expires in the first container and
> that nfsd hands out locks that conflict with some that have not been
> reclaimed yet in the other.
>
> Now, we can just try to say "don't export the same fs from more than
> one container". But we all know that people will do it anyway, since
> there's nothing that really stops you from doing so.
>

Yes, I see. But the situation you describe already exists.
I.e. replace the containers sharing the same file system with two nodes sharing 
the same distributed file system (like Lustre or GPFS), and you'll experience 
the same problem.

> What probably makes more sense is making the grace period a per-sb
> property, and coming up with a set of rules for the fs going into and
> out of "grace" status.
>
> Perhaps a way for different net namespaces to "subscribe" to a
> particular fs, and don't take the fs out of grace until all of the
> grace period timers pop? If a fs attempts to subscribe after the fs
> comes out of grace, then its subscription would be denied and reclaim
> attempts would get NFS4ERR_NOGRACE or the NLM equivalent.
>

This raises another problem. Imagine that the grace period has elapsed for some 
container and then you start nfsd in another one. The new grace period will 
affect both of them. And that's even worse from my pov.

-- 
Best regards,
Stanislav Kinsbursky


* Re: Grace period
  2012-04-09 14:25       ` Stanislav Kinsbursky
@ 2012-04-09 15:27         ` Jeff Layton
  2012-04-09 16:08           ` Stanislav Kinsbursky
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2012-04-09 15:27 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: bfields, Trond.Myklebust, linux-nfs, linux-kernel

On Mon, 09 Apr 2012 18:25:48 +0400
Stanislav Kinsbursky <skinsbursky@parallels.com> wrote:

> 09.04.2012 17:47, Jeff Layton пишет:
> > On Mon, 09 Apr 2012 15:24:19 +0400
> > Stanislav Kinsbursky<skinsbursky@parallels.com>  wrote:
> >
> >> 07.04.2012 03:40, bfields@fieldses.org пишет:
> >>> On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote:
> >>>> Hello, Bruce.
> >>>> Could you, please, clarify this reason why grace list is used?
> >>>> I.e. why list is used instead of some atomic variable, for example?
> >>>
> >>> Like just a reference count?  Yeah, that would be OK.
> >>>
> >>> In theory it could provide some sort of debugging help.  (E.g. we could
> >>> print out the list of "lock managers" currently keeping us in grace.)  I
> >>> had some idea we'd make those lock manager objects more complicated, and
> >>> might have more for individual containerized services.
> >>
> >> Could you share this idea, please?
> >>
> >> Anyway, I have nothing against lists. Just was curious, why it was used.
> >> I added Trond and lists to this reply.
> >>
> >> Let me explain, what is the problem with grace period I'm facing right know, and
> >> what I'm thinking about it.
> >> So, one of the things to be containerized during "NFSd per net ns" work is the
> >> grace period, and these are the basic components of it:
> >> 1) Grace period start.
> >> 2) Grace period end.
> >> 3) Grace period check.
> >> 3) Grace period restart.
> >>
> >> So, the simplest straight-forward way is to make all internal stuff:
> >> "grace_list", "grace_lock", "grace_period_end" work and both "lockd_manager" and
> >> "nfsd4_manager" - per network namespace. Also, "laundromat_work" have to be
> >> per-net as well.
> >> In this case:
> >> 1) Start - grace period can be started per net ns in "lockd_up_net()" (thus has
> >> to be moves there from "lockd()") and "nfs4_state_start()".
> >> 2) End - grace period can be ended per net ns in "lockd_down_net()" (thus has to
> >> be moved there from "lockd()"), "nfsd4_end_grace()" and "fs4_state_shutdown()".
> >> 3) Check - looks easy. There is either svc_rqst or net context can be passed to
> >> function.
> >> 4) Restart - this is a tricky place. It would be great to restart grace period
> >> only for the networks namespace of the sender of the kill signal. So, the idea
> >> is to check siginfo_t for the pid of sender, then try to locate the task, and if
> >> found, then get sender's networks namespace, and restart grace period only for
> >> this namespace (of course, if lockd was started for this namespace - see below).
> >>
> >> If task not found, of it's lockd wasn't started for it's namespace, then grace
> >> period can be either restarted for all namespaces, of just silently dropped.
> >> This is the place where I'm not sure, how to do. Because calling grace period
> >> for all namespaces will be overkill...
> >>
> >> There also another problem with the "task by pid" search, that found task can be
> >> actually not sender (which died already), but some other new task with the same
> >> pid number. In this case, I think, we can just neglect this probability and
> >> always assume, that we located sender (if, of course, lockd was started for
> >> sender's network namespace).
> >>
> >> Trond, Bruce, could you, please, comment this ideas?
> >>
> >
> > I can comment and I'm not sure that will be sufficient.
> >
> 
> Hi, Jeff. Thanks for the comment.
> 
> > The grace period has a particular purpose. It keeps nfsd or lockd from
> > handing out stateful objects (e.g. locks) before clients have an
> > opportunity to reclaim them. Once the grace period expires, there is no
> > more reclaim allowed and "normal" lock and open requests can proceed.
> >
> > Traditionally, there has been one nfsd or lockd "instance" per host.
> > With that, we were able to get away with a relatively simple-minded
> > approach of a global grace period that's gated on nfsd or lockd's
> > startup and shutdown.
> >
> > Now, you're looking at making multiple nfsd or lockd "instances". Does
> > it make sense to make this a per-net thing? Here's a particularly
> > problematic case to illustrate what I mean:
> >
> > Suppose I have a filesystem that's mounted and exported in two
> > different containers. You start up one container and then 60s later,
> > start up the other. The grace period expires in the first container and
> > that nfsd hands out locks that conflict with some that have not been
> > reclaimed yet in the other.
> >
> > Now, we can just try to say "don't export the same fs from more than
> > one container". But we all know that people will do it anyway, since
> > there's nothing that really stops you from doing so.
> >
> 
> Yes, I see. But situation you described is existent already.
> I.e. you can replace containers with the same file system by two nodes, sharing 
> the same distributed file system (like Lustre and GPFS), and you'll experience 
> the same problem in such case.
> 

Yep, which is why we don't support active/active serving from clustered
filesystems (yet). Containers are somewhat similar to a clustered
configuration.

The simple-minded grace period handling we have now is really only
suitable for very simple export configurations. The grace period exists
to ensure that filesystem objects are not "oversubscribed", so it makes
some sense to turn it into a per-sb property.

> > What probably makes more sense is making the grace period a per-sb
> > property, and coming up with a set of rules for the fs going into and
> > out of "grace" status.
> >
> > Perhaps a way for different net namespaces to "subscribe" to a
> > particular fs, and don't take the fs out of grace until all of the
> > grace period timers pop? If a fs attempts to subscribe after the fs
> > comes out of grace, then its subscription would be denied and reclaim
> > attempts would get NFS4ERR_NOGRACE or the NLM equivalent.
> >
> 
> This raises another problem. Imagine, that grace period has elapsed for some 
> container and then you start nfsd in another one. New grace period will affect 
> all both of them. And that's even worse from my pow.
> 

If you allow one container to hand out conflicting locks while another
container is allowing reclaims, then you can end up with some very
difficult to debug silent data corruption. That's the worst possible
outcome, IMO. We really need to actively keep people from shooting
themselves in the foot here.

One possibility might be to only allow filesystems to be exported from
a single container at a time (and allow that to be overridable somehow
once we have a working active/active serving solution). With that, you
may be able to limp along with a per-container grace period handling
scheme like you're proposing.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: Grace period
  2012-04-09 15:27         ` Jeff Layton
@ 2012-04-09 16:08           ` Stanislav Kinsbursky
  2012-04-09 16:11             ` bfields
  0 siblings, 1 reply; 23+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-09 16:08 UTC (permalink / raw)
  To: Jeff Layton; +Cc: bfields, Trond.Myklebust, linux-nfs, linux-kernel

09.04.2012 19:27, Jeff Layton wrote:
>
> If you allow one container to hand out conflicting locks while another
> container is allowing reclaims, then you can end up with some very
> difficult to debug silent data corruption. That's the worst possible
> outcome, IMO. We really need to actively keep people from shooting
> themselves in the foot here.
>
> One possibility might be to only allow filesystems to be exported from
> a single container at a time (and allow that to be overridable somehow
> once we have a working active/active serving solution). With that, you
> may be able limp along with a per-container grace period handling
> scheme like you're proposing.
>

Ok then. Keeping people from shooting themselves here sounds reasonable.
And I like the idea of exporting a filesystem only once per network namespace. 
It looks like there should be a list of "exported superblock - network 
namespace" pairs, and if a superblock is already exported in another namespace, 
then the export in the new namespace has to be skipped (replaced?) with an 
appropriate warning (error?) message shown in the log.
Or maybe we should even deny starting the NFS server if one of its exports is 
already shared by another NFS server "instance"?
But any of these ideas would only be easy to implement in RAM, and thus would 
suit only containers...
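
Something along these lines, perhaps (a made-up sketch just to show the check I 
mean; none of these names exist in the kernel):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/fs.h>
#include <net/net_namespace.h>

/* Made-up sketch of the "one netns per exported superblock" check. */
struct sb_export_entry {
	struct list_head	list;
	struct super_block	*sb;
	struct net		*net;
};

static LIST_HEAD(sb_export_list);
static DEFINE_SPINLOCK(sb_export_lock);

/* Return -EBUSY (so the export can be skipped/refused with a warning in
 * the log) if some other netns already exports @sb. */
static int sb_export_check(struct super_block *sb, struct net *net)
{
	struct sb_export_entry *e;
	int ret = 0;

	spin_lock(&sb_export_lock);
	list_for_each_entry(e, &sb_export_list, list) {
		if (e->sb == sb && e->net != net) {
			ret = -EBUSY;
			break;
		}
	}
	spin_unlock(&sb_export_lock);
	return ret;
}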

-- 
Best regards,
Stanislav Kinsbursky


* Re: Grace period
  2012-04-09 16:08           ` Stanislav Kinsbursky
@ 2012-04-09 16:11             ` bfields
  2012-04-09 16:17               ` Myklebust, Trond
  0 siblings, 1 reply; 23+ messages in thread
From: bfields @ 2012-04-09 16:11 UTC (permalink / raw)
  To: Stanislav Kinsbursky
  Cc: Jeff Layton, Trond.Myklebust, linux-nfs, linux-kernel

On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote:
> 09.04.2012 19:27, Jeff Layton пишет:
> >
> >If you allow one container to hand out conflicting locks while another
> >container is allowing reclaims, then you can end up with some very
> >difficult to debug silent data corruption. That's the worst possible
> >outcome, IMO. We really need to actively keep people from shooting
> >themselves in the foot here.
> >
> >One possibility might be to only allow filesystems to be exported from
> >a single container at a time (and allow that to be overridable somehow
> >once we have a working active/active serving solution). With that, you
> >may be able limp along with a per-container grace period handling
> >scheme like you're proposing.
> >
> 
> Ok then. Keeping people from shooting themselves here sounds reasonable.
> And I like the idea of exporting a filesystem only from once per
> network namespace.

Unfortunately that's not going to get us very far, especially not in the
v4 case where we've got the common read-only pseudoroot that everyone
has to share.

--b.

> Looks like there should be a list of pairs
> "exported superblock - network namespace". And if superblock is
> exported already in other namespace, then export in new namespace
> have to be skipped (replaced?) with appropriate warning (error?)
> message shown in log.
> Or maybe we even should deny starting of NFS server if one of it's
> exports is shared already by other NFS server "instance"?
> But any of these ideas would be easy to implement in RAM, and thus
> it suits only for containers...
> 
> -- 
> Best regards,
> Stanislav Kinsbursky


* Re: Grace period
  2012-04-09 16:11             ` bfields
@ 2012-04-09 16:17               ` Myklebust, Trond
  2012-04-09 16:21                 ` bfields
  0 siblings, 1 reply; 23+ messages in thread
From: Myklebust, Trond @ 2012-04-09 16:17 UTC (permalink / raw)
  To: bfields; +Cc: Stanislav Kinsbursky, Jeff Layton, linux-nfs, linux-kernel


On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote:
> On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote:
> > 09.04.2012 19:27, Jeff Layton пишет:
> > >
> > >If you allow one container to hand out conflicting locks while another
> > >container is allowing reclaims, then you can end up with some very
> > >difficult to debug silent data corruption. That's the worst possible
> > >outcome, IMO. We really need to actively keep people from shooting
> > >themselves in the foot here.
> > >
> > >One possibility might be to only allow filesystems to be exported from
> > >a single container at a time (and allow that to be overridable somehow
> > >once we have a working active/active serving solution). With that, you
> > >may be able limp along with a per-container grace period handling
> > >scheme like you're proposing.
> > >
> > 
> > Ok then. Keeping people from shooting themselves here sounds reasonable.
> > And I like the idea of exporting a filesystem only from once per
> > network namespace.
> 
> Unfortunately that's not going to get us very far, especially not in the
> v4 case where we've got the common read-only pseudoroot that everyone
> has to share.

I don't see how that can work in cases where each container has its own
private mount namespace. You're going to have to tie that pseudoroot to
the mount namespace somehow.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com



* Re: Grace period
  2012-04-09 16:17               ` Myklebust, Trond
@ 2012-04-09 16:21                 ` bfields
  2012-04-09 16:33                   ` Myklebust, Trond
  0 siblings, 1 reply; 23+ messages in thread
From: bfields @ 2012-04-09 16:21 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Stanislav Kinsbursky, Jeff Layton, linux-nfs, linux-kernel

On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote:
> On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote:
> > On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote:
> > > 09.04.2012 19:27, Jeff Layton пишет:
> > > >
> > > >If you allow one container to hand out conflicting locks while another
> > > >container is allowing reclaims, then you can end up with some very
> > > >difficult to debug silent data corruption. That's the worst possible
> > > >outcome, IMO. We really need to actively keep people from shooting
> > > >themselves in the foot here.
> > > >
> > > >One possibility might be to only allow filesystems to be exported from
> > > >a single container at a time (and allow that to be overridable somehow
> > > >once we have a working active/active serving solution). With that, you
> > > >may be able limp along with a per-container grace period handling
> > > >scheme like you're proposing.
> > > >
> > > 
> > > Ok then. Keeping people from shooting themselves here sounds reasonable.
> > > And I like the idea of exporting a filesystem only from once per
> > > network namespace.
> > 
> > Unfortunately that's not going to get us very far, especially not in the
> > v4 case where we've got the common read-only pseudoroot that everyone
> > has to share.
> 
> I don't see how that can work in cases where each container has its own
> private mount namespace. You're going to have to tie that pseudoroot to
> the mount namespace somehow.

Sure, but in typical cases it'll still be shared; requiring that they
not be sounds like a severe limitation.

--b.


* Re: Grace period
  2012-04-09 16:21                 ` bfields
@ 2012-04-09 16:33                   ` Myklebust, Trond
  2012-04-09 16:39                     ` bfields
  2012-04-09 16:56                     ` Stanislav Kinsbursky
  0 siblings, 2 replies; 23+ messages in thread
From: Myklebust, Trond @ 2012-04-09 16:33 UTC (permalink / raw)
  To: bfields; +Cc: Stanislav Kinsbursky, Jeff Layton, linux-nfs, linux-kernel


On Mon, 2012-04-09 at 12:21 -0400, bfields@fieldses.org wrote:
> On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote:
> > On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote:
> > > On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote:
> > > > 09.04.2012 19:27, Jeff Layton пишет:
> > > > >
> > > > >If you allow one container to hand out conflicting locks while another
> > > > >container is allowing reclaims, then you can end up with some very
> > > > >difficult to debug silent data corruption. That's the worst possible
> > > > >outcome, IMO. We really need to actively keep people from shooting
> > > > >themselves in the foot here.
> > > > >
> > > > >One possibility might be to only allow filesystems to be exported from
> > > > >a single container at a time (and allow that to be overridable somehow
> > > > >once we have a working active/active serving solution). With that, you
> > > > >may be able limp along with a per-container grace period handling
> > > > >scheme like you're proposing.
> > > > >
> > > > 
> > > > Ok then. Keeping people from shooting themselves here sounds reasonable.
> > > > And I like the idea of exporting a filesystem only from once per
> > > > network namespace.
> > > 
> > > Unfortunately that's not going to get us very far, especially not in the
> > > v4 case where we've got the common read-only pseudoroot that everyone
> > > has to share.
> > 
> > I don't see how that can work in cases where each container has its own
> > private mount namespace. You're going to have to tie that pseudoroot to
> > the mount namespace somehow.
> 
> Sure, but in typical cases it'll still be shared; requiring that they
> not be sounds like a severe limitation.

I'd expect the typical case to be the non-shared namespace: the whole
point of containers is to provide for complete isolation of processes.
Usually that implies that you don't want them to be able to communicate
via a shared filesystem.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com



* Re: Grace period
  2012-04-09 16:33                   ` Myklebust, Trond
@ 2012-04-09 16:39                     ` bfields
  2012-04-09 16:56                     ` Stanislav Kinsbursky
  1 sibling, 0 replies; 23+ messages in thread
From: bfields @ 2012-04-09 16:39 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Stanislav Kinsbursky, Jeff Layton, linux-nfs, linux-kernel

On Mon, Apr 09, 2012 at 04:33:36PM +0000, Myklebust, Trond wrote:
> On Mon, 2012-04-09 at 12:21 -0400, bfields@fieldses.org wrote:
> > On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote:
> > > On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote:
> > > > On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote:
> > > > > 09.04.2012 19:27, Jeff Layton пишет:
> > > > > >
> > > > > >If you allow one container to hand out conflicting locks while another
> > > > > >container is allowing reclaims, then you can end up with some very
> > > > > >difficult to debug silent data corruption. That's the worst possible
> > > > > >outcome, IMO. We really need to actively keep people from shooting
> > > > > >themselves in the foot here.
> > > > > >
> > > > > >One possibility might be to only allow filesystems to be exported from
> > > > > >a single container at a time (and allow that to be overridable somehow
> > > > > >once we have a working active/active serving solution). With that, you
> > > > > >may be able limp along with a per-container grace period handling
> > > > > >scheme like you're proposing.
> > > > > >
> > > > > 
> > > > > Ok then. Keeping people from shooting themselves here sounds reasonable.
> > > > > And I like the idea of exporting a filesystem only from once per
> > > > > network namespace.
> > > > 
> > > > Unfortunately that's not going to get us very far, especially not in the
> > > > v4 case where we've got the common read-only pseudoroot that everyone
> > > > has to share.
> > > 
> > > I don't see how that can work in cases where each container has its own
> > > private mount namespace. You're going to have to tie that pseudoroot to
> > > the mount namespace somehow.
> > 
> > Sure, but in typical cases it'll still be shared; requiring that they
> > not be sounds like a severe limitation.
> 
> I'd expect the typical case to be the non-shared namespace: the whole
> point of containers is to provide for complete isolation of processes.
> Usually that implies that you don't want them to be able to communicate
> via a shared filesystem.

If it's just a file server, then you may want to be able to bring service up
and down on individual server IPs, and possibly advertise different exports;
but requiring complete isolation to do that seems like overkill.

--b.


* Re: Grace period
  2012-04-09 16:33                   ` Myklebust, Trond
  2012-04-09 16:39                     ` bfields
@ 2012-04-09 16:56                     ` Stanislav Kinsbursky
  2012-04-09 18:11                       ` bfields
  1 sibling, 1 reply; 23+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-09 16:56 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: bfields, Jeff Layton, linux-nfs, linux-kernel

09.04.2012 20:33, Myklebust, Trond wrote:
> On Mon, 2012-04-09 at 12:21 -0400, bfields@fieldses.org wrote:
>> On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote:
>>> On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote:
>>>> On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote:
>>>>> 09.04.2012 19:27, Jeff Layton пишет:
>>>>>>
>>>>>> If you allow one container to hand out conflicting locks while another
>>>>>> container is allowing reclaims, then you can end up with some very
>>>>>> difficult to debug silent data corruption. That's the worst possible
>>>>>> outcome, IMO. We really need to actively keep people from shooting
>>>>>> themselves in the foot here.
>>>>>>
>>>>>> One possibility might be to only allow filesystems to be exported from
>>>>>> a single container at a time (and allow that to be overridable somehow
>>>>>> once we have a working active/active serving solution). With that, you
>>>>>> may be able limp along with a per-container grace period handling
>>>>>> scheme like you're proposing.
>>>>>>
>>>>>
>>>>> Ok then. Keeping people from shooting themselves here sounds reasonable.
>>>>> And I like the idea of exporting a filesystem only from once per
>>>>> network namespace.
>>>>
>>>> Unfortunately that's not going to get us very far, especially not in the
>>>> v4 case where we've got the common read-only pseudoroot that everyone
>>>> has to share.
>>>
>>> I don't see how that can work in cases where each container has its own
>>> private mount namespace. You're going to have to tie that pseudoroot to
>>> the mount namespace somehow.
>>
>> Sure, but in typical cases it'll still be shared; requiring that they
>> not be sounds like a severe limitation.
>
> I'd expect the typical case to be the non-shared namespace: the whole
> point of containers is to provide for complete isolation of processes.
> Usually that implies that you don't want them to be able to communicate
> via a shared filesystem.
>

BTW, we DO use one mount namespace for all containers and the host in OpenVZ. 
This allows us to access containers' mount points from the initial environment. 
Isolation between containers is done via chroot and some simple tricks on the 
/proc/mounts read operation.
Moreover, with one mount namespace, we currently support bind-mounting NFS 
from one container into another...

Anyway, I'm sorry, but I'm not familiar with this pseudoroot idea.
Why does it prevent implementing a check for the "superblock - network 
namespace" pair on NFS server start and forbidding (?) the export if this pair 
is already shared in another namespace? I.e. maybe this pseudoroot can be an 
exception to this rule?
Or am I just missing the point entirely?

-- 
Best regards,
Stanislav Kinsbursky


* Re: Grace period
  2012-04-09 16:56                     ` Stanislav Kinsbursky
@ 2012-04-09 18:11                       ` bfields
  2012-04-10 10:56                         ` Stanislav Kinsbursky
  0 siblings, 1 reply; 23+ messages in thread
From: bfields @ 2012-04-09 18:11 UTC (permalink / raw)
  To: Stanislav Kinsbursky
  Cc: Myklebust, Trond, Jeff Layton, linux-nfs, linux-kernel

On Mon, Apr 09, 2012 at 08:56:47PM +0400, Stanislav Kinsbursky wrote:
> 09.04.2012 20:33, Myklebust, Trond пишет:
> >On Mon, 2012-04-09 at 12:21 -0400, bfields@fieldses.org wrote:
> >>On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote:
> >>>On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote:
> >>>>On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote:
> >>>>>09.04.2012 19:27, Jeff Layton пишет:
> >>>>>>
> >>>>>>If you allow one container to hand out conflicting locks while another
> >>>>>>container is allowing reclaims, then you can end up with some very
> >>>>>>difficult to debug silent data corruption. That's the worst possible
> >>>>>>outcome, IMO. We really need to actively keep people from shooting
> >>>>>>themselves in the foot here.
> >>>>>>
> >>>>>>One possibility might be to only allow filesystems to be exported from
> >>>>>>a single container at a time (and allow that to be overridable somehow
> >>>>>>once we have a working active/active serving solution). With that, you
> >>>>>>may be able limp along with a per-container grace period handling
> >>>>>>scheme like you're proposing.
> >>>>>>
> >>>>>
> >>>>>Ok then. Keeping people from shooting themselves here sounds reasonable.
> >>>>>And I like the idea of exporting a filesystem only from once per
> >>>>>network namespace.
> >>>>
> >>>>Unfortunately that's not going to get us very far, especially not in the
> >>>>v4 case where we've got the common read-only pseudoroot that everyone
> >>>>has to share.
> >>>
> >>>I don't see how that can work in cases where each container has its own
> >>>private mount namespace. You're going to have to tie that pseudoroot to
> >>>the mount namespace somehow.
> >>
> >>Sure, but in typical cases it'll still be shared; requiring that they
> >>not be sounds like a severe limitation.
> >
> >I'd expect the typical case to be the non-shared namespace: the whole
> >point of containers is to provide for complete isolation of processes.
> >Usually that implies that you don't want them to be able to communicate
> >via a shared filesystem.
> >
> 
> BTW, we DO use one mount namespace for all containers and host in
> OpenVZ. This allows us to have an access to containers mount points
> from initial environment. Isolation between containers is done via
> chroot and some simple tricks on /proc/mounts read operation.
> Moreover, with one mount namespace, we currently support
> bind-mounting on NFS from one container into another...
> 
> Anyway, I'm sorry, but I'm not familiar with this pseudoroot idea.

Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be
able to do readdir's and lookups to get to exported filesystems.  We
support this in the Linux server by exporting all the filesystems from
"/" on down that must be traversed to reach a given filesystem.  These
exports are very restricted (e.g. only parents of exports are visible).

> Why does it prevents implementing of check for "superblock-network
> namespace" pair on NFS server start and forbid (?) it in case of
> this pair is shared already in other namespace? I.e. maybe this
> pseudoroot can be an exclusion from this rule?

That might work.  It's read-only and consists only of directories, so
the grace period doesn't affect it.

--b.

> Or I'm just missing the point at all?


* Re: Grace period
  2012-04-09 11:24   ` Grace period Stanislav Kinsbursky
  2012-04-09 13:47     ` Jeff Layton
@ 2012-04-09 23:26     ` bfields
  2012-04-10 11:29       ` Stanislav Kinsbursky
  1 sibling, 1 reply; 23+ messages in thread
From: bfields @ 2012-04-09 23:26 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Trond.Myklebust, linux-nfs, linux-kernel

On Mon, Apr 09, 2012 at 03:24:19PM +0400, Stanislav Kinsbursky wrote:
> 07.04.2012 03:40, bfields@fieldses.org пишет:
> >On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote:
> >>Hello, Bruce.
> >>Could you, please, clarify this reason why grace list is used?
> >>I.e. why list is used instead of some atomic variable, for example?
> >
> >Like just a reference count?  Yeah, that would be OK.
> >
> >In theory it could provide some sort of debugging help.  (E.g. we could
> >print out the list of "lock managers" currently keeping us in grace.)  I
> >had some idea we'd make those lock manager objects more complicated, and
> >might have more for individual containerized services.
> 
> Could you share this idea, please?
> 
> Anyway, I have nothing against lists. Just was curious, why it was used.
> I added Trond and lists to this reply.
> 
> Let me explain, what is the problem with grace period I'm facing
> right know, and what I'm thinking about it.
> So, one of the things to be containerized during "NFSd per net ns"
> work is the grace period, and these are the basic components of it:
> 1) Grace period start.
> 2) Grace period end.
> 3) Grace period check.
> 3) Grace period restart.

For restart, you're thinking of the fs/lockd/svc.c:restart_grace()
that's called on a signal in lockd()?

I wonder if there's any way to figure out if that's actually used by
anyone?  (E.g. by any distro init scripts).  It strikes me as possibly
impossible to use correctly.  Perhaps we could deprecate it....
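
For reference, the code in question looks roughly like this (simplified from 
fs/lockd/svc.c, quoting from memory):

/* Signal-driven grace restart, roughly as it stands today: */
static void restart_grace(void)
{
	if (nlmsvc_ops) {
		cancel_delayed_work_sync(&grace_period_end);
		locks_end_grace(&lockd_manager);
		nlmsvc_invalidate_all();
		set_grace_period();
	}
}

/* ...called from the main loop in lockd(): */
	if (signalled()) {
		flush_signals(current);
		restart_grace();
		continue;
	}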

> So, the simplest straight-forward way is to make all internal stuff:
> "grace_list", "grace_lock", "grace_period_end" work and both
> "lockd_manager" and "nfsd4_manager" - per network namespace. Also,
> "laundromat_work" have to be per-net as well.
> In this case:
> 1) Start - grace period can be started per net ns in
> "lockd_up_net()" (thus has to be moves there from "lockd()") and
> "nfs4_state_start()".
> 2) End - grace period can be ended per net ns in "lockd_down_net()"
> (thus has to be moved there from "lockd()"), "nfsd4_end_grace()" and
> "fs4_state_shutdown()".
> 3) Check - looks easy. There is either svc_rqst or net context can
> be passed to function.
> 4) Restart - this is a tricky place. It would be great to restart
> grace period only for the networks namespace of the sender of the
> kill signal. So, the idea is to check siginfo_t for the pid of
> sender, then try to locate the task, and if found, then get sender's
> networks namespace, and restart grace period only for this namespace
> (of course, if lockd was started for this namespace - see below).

If it's really the signalling that's the problem--perhaps we can get
away from the signal-based interface.

At least in the case of lockd I suspect we could.

Or perhaps the decision to share a single lockd thread (or set of nfsd
threads) among multiple network namespaces was a poor one.  But I
realize multithreading lockd doesn't look easy.

--b.

> If task not found, of it's lockd wasn't started for it's namespace,
> then grace period can be either restarted for all namespaces, of
> just silently dropped. This is the place where I'm not sure, how to
> do. Because calling grace period for all namespaces will be
> overkill...
> 
> There also another problem with the "task by pid" search, that found
> task can be actually not sender (which died already), but some other
> new task with the same pid number. In this case, I think, we can
> just neglect this probability and always assume, that we located
> sender (if, of course, lockd was started for sender's network
> namespace).
> 
> Trond, Bruce, could you, please, comment this ideas?
> 
> -- 
> Best regards,
> Stanislav Kinsbursky
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Grace period
  2012-04-09 18:11                       ` bfields
@ 2012-04-10 10:56                         ` Stanislav Kinsbursky
  2012-04-10 13:39                           ` bfields
  0 siblings, 1 reply; 23+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-10 10:56 UTC (permalink / raw)
  To: bfields; +Cc: Myklebust, Trond, Jeff Layton, linux-nfs, linux-kernel

09.04.2012 22:11, bfields@fieldses.org wrote:
> On Mon, Apr 09, 2012 at 08:56:47PM +0400, Stanislav Kinsbursky wrote:
>> 09.04.2012 20:33, Myklebust, Trond пишет:
>>> On Mon, 2012-04-09 at 12:21 -0400, bfields@fieldses.org wrote:
>>>> On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote:
>>>>> On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote:
>>>>>> On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote:
>>>>>>> 09.04.2012 19:27, Jeff Layton пишет:
>>>>>>>>
>>>>>>>> If you allow one container to hand out conflicting locks while another
>>>>>>>> container is allowing reclaims, then you can end up with some very
>>>>>>>> difficult to debug silent data corruption. That's the worst possible
>>>>>>>> outcome, IMO. We really need to actively keep people from shooting
>>>>>>>> themselves in the foot here.
>>>>>>>>
>>>>>>>> One possibility might be to only allow filesystems to be exported from
>>>>>>>> a single container at a time (and allow that to be overridable somehow
>>>>>>>> once we have a working active/active serving solution). With that, you
>>>>>>>> may be able limp along with a per-container grace period handling
>>>>>>>> scheme like you're proposing.
>>>>>>>>
>>>>>>>
>>>>>>> Ok then. Keeping people from shooting themselves here sounds reasonable.
>>>>>>> And I like the idea of exporting a filesystem only from once per
>>>>>>> network namespace.
>>>>>>
>>>>>> Unfortunately that's not going to get us very far, especially not in the
>>>>>> v4 case where we've got the common read-only pseudoroot that everyone
>>>>>> has to share.
>>>>>
>>>>> I don't see how that can work in cases where each container has its own
>>>>> private mount namespace. You're going to have to tie that pseudoroot to
>>>>> the mount namespace somehow.
>>>>
>>>> Sure, but in typical cases it'll still be shared; requiring that they
>>>> not be sounds like a severe limitation.
>>>
>>> I'd expect the typical case to be the non-shared namespace: the whole
>>> point of containers is to provide for complete isolation of processes.
>>> Usually that implies that you don't want them to be able to communicate
>>> via a shared filesystem.
>>>
>>
>> BTW, we DO use one mount namespace for all containers and host in
>> OpenVZ. This allows us to have an access to containers mount points
>> from initial environment. Isolation between containers is done via
>> chroot and some simple tricks on /proc/mounts read operation.
>> Moreover, with one mount namespace, we currently support
>> bind-mounting on NFS from one container into another...
>>
>> Anyway, I'm sorry, but I'm not familiar with this pseudoroot idea.
>
> Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be
> able to do readdir's and lookups to get to exported filesystems.  We
> support this in the Linux server by exporting all the filesystems from
> "/" on down that must be traversed to reach a given filesystem.  These
> exports are very restricted (e.g. only parents of exports are visible).
>

Ok, thanks for the explanation.
So this pseudoroot looks like part of the NFS server's internal implementation, 
not part of the standard. That's good.

>> Why does it prevents implementing of check for "superblock-network
>> namespace" pair on NFS server start and forbid (?) it in case of
>> this pair is shared already in other namespace? I.e. maybe this
>> pseudoroot can be an exclusion from this rule?
>
> That might work.  It's read-only and consists only of directories, so
> the grace period doesn't affect it.
>

I've just realized that this per-sb grace period won't work.
I.e., it's a valid situation for two or more containers to be located on the 
same filesystem but to share different parts of it, and there is no conflict 
there at all.
I don't see any clear and simple way to handle such races, because otherwise we 
would have to tie the network namespace to the filesystem namespace.
I.e. some way would be required to determine whether the export directory being 
passed is already shared somewhere else or not.

Realistic solution - since the export check should be done in the initial file 
system environment (most probably the container will have its own root), we 
need to pass this data to some kernel thread/userspace daemon in the initial 
file system environment somehow (sockets don't suit here... Shared memory?).

Improbable solution - patching the VFS layer...

-- 
Best regards,
Stanislav Kinsbursky


* Re: Grace period
  2012-04-09 23:26     ` bfields
@ 2012-04-10 11:29       ` Stanislav Kinsbursky
  2012-04-10 13:37         ` bfields
  0 siblings, 1 reply; 23+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-10 11:29 UTC (permalink / raw)
  To: bfields; +Cc: Trond.Myklebust, linux-nfs, linux-kernel

10.04.2012 03:26, bfields@fieldses.org wrote:
> On Mon, Apr 09, 2012 at 03:24:19PM +0400, Stanislav Kinsbursky wrote:
>> 07.04.2012 03:40, bfields@fieldses.org пишет:
>>> On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote:
>>>> Hello, Bruce.
>>>> Could you, please, clarify this reason why grace list is used?
>>>> I.e. why list is used instead of some atomic variable, for example?
>>>
>>> Like just a reference count?  Yeah, that would be OK.
>>>
>>> In theory it could provide some sort of debugging help.  (E.g. we could
>>> print out the list of "lock managers" currently keeping us in grace.)  I
>>> had some idea we'd make those lock manager objects more complicated, and
>>> might have more for individual containerized services.
>>
>> Could you share this idea, please?
>>
>> Anyway, I have nothing against lists. Just was curious, why it was used.
>> I added Trond and lists to this reply.
>>
>> Let me explain, what is the problem with grace period I'm facing
>> right know, and what I'm thinking about it.
>> So, one of the things to be containerized during "NFSd per net ns"
>> work is the grace period, and these are the basic components of it:
>> 1) Grace period start.
>> 2) Grace period end.
>> 3) Grace period check.
>> 3) Grace period restart.
>
> For restart, you're thinking of the fs/lockd/svc.c:restart_grace()
> that's called on aisngal in lockd()?
>
> I wonder if there's any way to figure out if that's actually used by
> anyone?  (E.g. by any distro init scripts).  It strikes me as possibly
> impossible to use correctly.  Perhaps we could deprecate it....
>

Or (since the lockd kthread is visible only from the initial pid namespace) we 
can just hardcode "init_net" in this case. But it means that this "kill" logic 
will be broken if two containers share one pid namespace but have separate 
network namespaces.
Anyway, either solution (this one or Bruce's) suits me.

>> So, the simplest straight-forward way is to make all internal stuff:
>> "grace_list", "grace_lock", "grace_period_end" work and both
>> "lockd_manager" and "nfsd4_manager" - per network namespace. Also,
>> "laundromat_work" have to be per-net as well.
>> In this case:
>> 1) Start - grace period can be started per net ns in
>> "lockd_up_net()" (thus has to be moves there from "lockd()") and
>> "nfs4_state_start()".
>> 2) End - grace period can be ended per net ns in "lockd_down_net()"
>> (thus has to be moved there from "lockd()"), "nfsd4_end_grace()" and
>> "fs4_state_shutdown()".
>> 3) Check - looks easy. There is either svc_rqst or net context can
>> be passed to function.
>> 4) Restart - this is a tricky place. It would be great to restart
>> grace period only for the networks namespace of the sender of the
>> kill signal. So, the idea is to check siginfo_t for the pid of
>> sender, then try to locate the task, and if found, then get sender's
>> networks namespace, and restart grace period only for this namespace
>> (of course, if lockd was started for this namespace - see below).
>
> If it's really the signalling that's the problem--perhaps we can get
> away from the signal-based interface.
>
> At least in the case of lockd I suspect we could.
>

I'm ok with that. So, if no objections follow, I'll drop it and send the 
patch. Or do you want to do it?

BTW, I tried this "pid from siginfo" approach yesterday, and it doesn't work, 
because the sender is usually already dead by the time the lookup of the task 
by pid is performed.

> Or perhaps the decision to share a single lockd thread (or set of nsfd
> threads) among multiple network namespaces was a poor one.  But I
> realize multithreading lockd doesn't look easy.
>

This decision was the best one in the current circumstances.
Having a lockd thread (or nfsd threads) per container looks easy to implement 
at first sight. But kernel threads are currently supported only in the initial 
pid namespace. I.e. a per-container kernel thread won't be visible in the 
container if it has its own pid namespace, and there is no way to put a kernel 
thread into a container.
In OpenVZ we have per-container kernel threads, but integrating this feature 
into mainline looks hopeless (or very difficult) to me, at least for now.
So this problem with signals remains unsolved.

So, as it looks to me, this "one service for all" model is the only suitable 
one for now. But there are some corner cases which have to be solved.

Anyway, Jeff's question is still open:
Do we need to prevent people from exporting nested directories from different 
network namespaces?
And if yes, how do we do it?

-- 
Best regards,
Stanislav Kinsbursky


* Re: Grace period
  2012-04-10 11:29       ` Stanislav Kinsbursky
@ 2012-04-10 13:37         ` bfields
  2012-04-10 14:10           ` Stanislav Kinsbursky
  0 siblings, 1 reply; 23+ messages in thread
From: bfields @ 2012-04-10 13:37 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Trond.Myklebust, linux-nfs, linux-kernel

On Tue, Apr 10, 2012 at 03:29:11PM +0400, Stanislav Kinsbursky wrote:
> 10.04.2012 03:26, bfields@fieldses.org пишет:
> >On Mon, Apr 09, 2012 at 03:24:19PM +0400, Stanislav Kinsbursky wrote:
> >>07.04.2012 03:40, bfields@fieldses.org пишет:
> >>>On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote:
> >>>>Hello, Bruce.
> >>>>Could you, please, clarify this reason why grace list is used?
> >>>>I.e. why list is used instead of some atomic variable, for example?
> >>>
> >>>Like just a reference count?  Yeah, that would be OK.
> >>>
> >>>In theory it could provide some sort of debugging help.  (E.g. we could
> >>>print out the list of "lock managers" currently keeping us in grace.)  I
> >>>had some idea we'd make those lock manager objects more complicated, and
> >>>might have more for individual containerized services.
> >>
> >>Could you share this idea, please?
> >>
> >>Anyway, I have nothing against lists. Just was curious, why it was used.
> >>I added Trond and lists to this reply.
> >>
> >>Let me explain, what is the problem with grace period I'm facing
> >>right know, and what I'm thinking about it.
> >>So, one of the things to be containerized during "NFSd per net ns"
> >>work is the grace period, and these are the basic components of it:
> >>1) Grace period start.
> >>2) Grace period end.
> >>3) Grace period check.
> >>3) Grace period restart.
> >
> >For restart, you're thinking of the fs/lockd/svc.c:restart_grace()
> >that's called on a signal in lockd()?
> >
> >I wonder if there's any way to figure out if that's actually used by
> >anyone?  (E.g. by any distro init scripts).  It strikes me as possibly
> >impossible to use correctly.  Perhaps we could deprecate it....
> >
> 
> Or (since lockd kthread is visible only from initial pid namespace)
> we can just hardcode "init_net" in this case. But it means, that
> this "kill" logic will be broken if two containers shares one pid
> namespace, but have separated networks namespaces.
> Anyway, both (this one or Bruce's) solutions suits me.
> 
> >>So, the simplest straight-forward way is to make all internal stuff:
> >>"grace_list", "grace_lock", "grace_period_end" work and both
> >>"lockd_manager" and "nfsd4_manager" - per network namespace. Also,
> >>"laundromat_work" have to be per-net as well.
> >>In this case:
> >>1) Start - grace period can be started per net ns in
> >>"lockd_up_net()" (thus has to be moves there from "lockd()") and
> >>"nfs4_state_start()".
> >>2) End - grace period can be ended per net ns in "lockd_down_net()"
> >>(thus has to be moved there from "lockd()"), "nfsd4_end_grace()" and
> >>"fs4_state_shutdown()".
> >>3) Check - looks easy. There is either svc_rqst or net context can
> >>be passed to function.
> >>4) Restart - this is a tricky place. It would be great to restart
> >>grace period only for the networks namespace of the sender of the
> >>kill signal. So, the idea is to check siginfo_t for the pid of
> >>sender, then try to locate the task, and if found, then get sender's
> >>networks namespace, and restart grace period only for this namespace
> >>(of course, if lockd was started for this namespace - see below).
> >
> >If it's really the signalling that's the problem--perhaps we can get
> >away from the signal-based interface.
> >
> >At least in the case of lockd I suspect we could.
> >
> 
> I'm ok with that. So, if no objections will follow, I'll drop it and
> send the patch. Or you want to do it?

Please do go ahead.

The safest approach might be:
	- leave lockd's signal handling there (just accept that it may
	  behave incorrectly in the container case), assuming that's safe.
	- add a printk ("signalling lockd to restart is deprecated",
	  or something) if it's used.

Then eventually we'll remove it entirely.

(But if that doesn't work, it'd likely also be OK just to remove it
completely now.)

--b.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Grace period
  2012-04-10 10:56                         ` Stanislav Kinsbursky
@ 2012-04-10 13:39                           ` bfields
  2012-04-10 15:36                             ` Stanislav Kinsbursky
  0 siblings, 1 reply; 23+ messages in thread
From: bfields @ 2012-04-10 13:39 UTC (permalink / raw)
  To: Stanislav Kinsbursky
  Cc: Myklebust, Trond, Jeff Layton, linux-nfs, linux-kernel

On Tue, Apr 10, 2012 at 02:56:12PM +0400, Stanislav Kinsbursky wrote:
> 09.04.2012 22:11, bfields@fieldses.org пишет:
> >Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be
> >able to do readdir's and lookups to get to exported filesystems.  We
> >support this in the Linux server by exporting all the filesystems from
> >"/" on down that must be traversed to reach a given filesystem.  These
> >exports are very restricted (e.g. only parents of exports are visible).
> >
> 
> Ok, thanks for explanation.
> So, this pseudoroot looks like a part of NFS server internal
> implementation, but not a part of a standard. That's good.
> 
> >>Why does it prevents implementing of check for "superblock-network
> >>namespace" pair on NFS server start and forbid (?) it in case of
> >>this pair is shared already in other namespace? I.e. maybe this
> >>pseudoroot can be an exclusion from this rule?
> >
> >That might work.  It's read-only and consists only of directories, so
> >the grace period doesn't affect it.
> >
> 
> I've just realized, that this per-sb grace period won't work.
> I.e., it's a valid situation, when two or more containers located on
> the same filesystem, but shares different parts of it. And there is
> not conflict here at all.

Well, there may be some conflict in that a file could be hardlinked into
both subtrees, and that file could be locked from users of either
export.

--b.

> I don't see any clear and simple way how to handle such races,
> because otherwise we have to tie network namespace and filesystem
> namespace.
> I.e. there will be required some way to define, was passed export
> directory shared already somewhere else or not.
> 
> Realistic solution - since export check should be done in initial
> file system environment (most probably container will have it's own
> root), then we to pass this data to some kernel thread/userspace
> daemon in initial file system environment somehow (sockets doesn't
> suits here... Shared memory?).
> 
> Improbable solution - patching VFS layer...
> 
> -- 
> Best regards,
> Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Grace period
  2012-04-10 13:37         ` bfields
@ 2012-04-10 14:10           ` Stanislav Kinsbursky
  2012-04-10 14:18             ` bfields
  0 siblings, 1 reply; 23+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-10 14:10 UTC (permalink / raw)
  To: bfields; +Cc: Trond.Myklebust, linux-nfs, linux-kernel

10.04.2012 17:37, bfields@fieldses.org пишет:
> On Tue, Apr 10, 2012 at 03:29:11PM +0400, Stanislav Kinsbursky wrote:
>> 10.04.2012 03:26, bfields@fieldses.org пишет:
>>> On Mon, Apr 09, 2012 at 03:24:19PM +0400, Stanislav Kinsbursky wrote:
>>>> 07.04.2012 03:40, bfields@fieldses.org пишет:
>>>>> On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote:
>>>>>> Hello, Bruce.
>>>>>> Could you, please, clarify this reason why grace list is used?
>>>>>> I.e. why list is used instead of some atomic variable, for example?
>>>>>
>>>>> Like just a reference count?  Yeah, that would be OK.
>>>>>
>>>>> In theory it could provide some sort of debugging help.  (E.g. we could
>>>>> print out the list of "lock managers" currently keeping us in grace.)  I
>>>>> had some idea we'd make those lock manager objects more complicated, and
>>>>> might have more for individual containerized services.
>>>>
>>>> Could you share this idea, please?
>>>>
>>>> Anyway, I have nothing against lists. Just was curious, why it was used.
>>>> I added Trond and lists to this reply.
>>>>
>>>> Let me explain, what is the problem with grace period I'm facing
>>>> right know, and what I'm thinking about it.
>>>> So, one of the things to be containerized during "NFSd per net ns"
>>>> work is the grace period, and these are the basic components of it:
>>>> 1) Grace period start.
>>>> 2) Grace period end.
>>>> 3) Grace period check.
>>>> 3) Grace period restart.
>>>
>>> For restart, you're thinking of the fs/lockd/svc.c:restart_grace()
>>> that's called on a signal in lockd()?
>>>
>>> I wonder if there's any way to figure out if that's actually used by
>>> anyone?  (E.g. by any distro init scripts).  It strikes me as possibly
>>> impossible to use correctly.  Perhaps we could deprecate it....
>>>
>>
>> Or (since lockd kthread is visible only from initial pid namespace)
>> we can just hardcode "init_net" in this case. But it means, that
>> this "kill" logic will be broken if two containers shares one pid
>> namespace, but have separated networks namespaces.
>> Anyway, both (this one or Bruce's) solutions suits me.
>>
>>>> So, the simplest straight-forward way is to make all internal stuff:
>>>> "grace_list", "grace_lock", "grace_period_end" work and both
>>>> "lockd_manager" and "nfsd4_manager" - per network namespace. Also,
>>>> "laundromat_work" have to be per-net as well.
>>>> In this case:
>>>> 1) Start - grace period can be started per net ns in
>>>> "lockd_up_net()" (thus has to be moves there from "lockd()") and
>>>> "nfs4_state_start()".
>>>> 2) End - grace period can be ended per net ns in "lockd_down_net()"
>>>> (thus has to be moved there from "lockd()"), "nfsd4_end_grace()" and
>>>> "fs4_state_shutdown()".
>>>> 3) Check - looks easy. There is either svc_rqst or net context can
>>>> be passed to function.
>>>> 4) Restart - this is a tricky place. It would be great to restart
>>>> grace period only for the networks namespace of the sender of the
>>>> kill signal. So, the idea is to check siginfo_t for the pid of
>>>> sender, then try to locate the task, and if found, then get sender's
>>>> networks namespace, and restart grace period only for this namespace
>>>> (of course, if lockd was started for this namespace - see below).
>>>
>>> If it's really the signalling that's the problem--perhaps we can get
>>> away from the signal-based interface.
>>>
>>> At least in the case of lockd I suspect we could.
>>>
>>
>> I'm ok with that. So, if no objections will follow, I'll drop it and
>> send the patch. Or you want to do it?
>
> Please do go ahead.
>
> The safest approach might be:
> 	- leave lockd's signal handling there (just accept that it may
> 	  behave incorrectly in container case), assuming that's safe.
> 	- add a printk ("signalling lockd to restart is deprecated",
> 	  or something) if it's used.
>
> Then eventually we'll remove it entirely.
>
> (But if that doesn't work, it'd likely also be OK just to remove it
> completely now.)
>

Well, I can make this restart grace only for "init_net", and add a printk with your 
message plus a note that it affects only init_net.
Does that look good to you?
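A minimal sketch of what that could look like, modelled on the existing
signalled() handling in lockd()'s main loop in fs/lockd/svc.c. A restart_grace()
that takes a struct net * is an assumption here; today the function takes no
argument and restarts the single global grace period:

	if (signalled()) {
		flush_signals(current);
		printk(KERN_WARNING "lockd: restarting the grace period via "
		       "a signal is deprecated and affects only the initial "
		       "network namespace\n");
		restart_grace(&init_net);	/* assumed per-net variant */
		continue;
	}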

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Grace period
  2012-04-10 14:10           ` Stanislav Kinsbursky
@ 2012-04-10 14:18             ` bfields
  0 siblings, 0 replies; 23+ messages in thread
From: bfields @ 2012-04-10 14:18 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Trond.Myklebust, linux-nfs, linux-kernel

On Tue, Apr 10, 2012 at 06:10:27PM +0400, Stanislav Kinsbursky wrote:
> Well, I can do this to restart grace only for "init_net" and a
> printk with your message and information, that it affect only
> init_net.
> Looks good to you?

Yep, thanks!

--b.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Grace period
  2012-04-10 13:39                           ` bfields
@ 2012-04-10 15:36                             ` Stanislav Kinsbursky
  2012-04-10 18:28                               ` Jeff Layton
  0 siblings, 1 reply; 23+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-10 15:36 UTC (permalink / raw)
  To: bfields; +Cc: Myklebust, Trond, Jeff Layton, linux-nfs, linux-kernel

10.04.2012 17:39, bfields@fieldses.org пишет:
> On Tue, Apr 10, 2012 at 02:56:12PM +0400, Stanislav Kinsbursky wrote:
>> 09.04.2012 22:11, bfields@fieldses.org пишет:
>>> Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be
>>> able to do readdir's and lookups to get to exported filesystems.  We
>>> support this in the Linux server by exporting all the filesystems from
>>> "/" on down that must be traversed to reach a given filesystem.  These
>>> exports are very restricted (e.g. only parents of exports are visible).
>>>
>>
>> Ok, thanks for explanation.
>> So, this pseudoroot looks like a part of NFS server internal
>> implementation, but not a part of a standard. That's good.
>>
>>>> Why does it prevents implementing of check for "superblock-network
>>>> namespace" pair on NFS server start and forbid (?) it in case of
>>>> this pair is shared already in other namespace? I.e. maybe this
>>>> pseudoroot can be an exclusion from this rule?
>>>
>>> That might work.  It's read-only and consists only of directories, so
>>> the grace period doesn't affect it.
>>>
>>
>> I've just realized, that this per-sb grace period won't work.
>> I.e., it's a valid situation, when two or more containers located on
>> the same filesystem, but shares different parts of it. And there is
>> not conflict here at all.
>
> Well, there may be some conflict in that a file could be hardlinked into
> both subtrees, and that file could be locked from users of either
> export.
>

Is this case handled if both links are visible in the same export?
But anyway, this is not that bad, i.e. it doesn't make things unpredictable.
Probably there are more issues like this one (bind-mounting, for example),
but I think it's root's responsibility to handle such problems.

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Grace period
  2012-04-10 15:36                             ` Stanislav Kinsbursky
@ 2012-04-10 18:28                               ` Jeff Layton
  2012-04-10 20:46                                 ` bfields
  2012-04-11 10:08                                 ` Stanislav Kinsbursky
  0 siblings, 2 replies; 23+ messages in thread
From: Jeff Layton @ 2012-04-10 18:28 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: bfields, Myklebust, Trond, linux-nfs, linux-kernel

On Tue, 10 Apr 2012 19:36:26 +0400
Stanislav Kinsbursky <skinsbursky@parallels.com> wrote:

> 10.04.2012 17:39, bfields@fieldses.org пишет:
> > On Tue, Apr 10, 2012 at 02:56:12PM +0400, Stanislav Kinsbursky wrote:
> >> 09.04.2012 22:11, bfields@fieldses.org пишет:
> >>> Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be
> >>> able to do readdir's and lookups to get to exported filesystems.  We
> >>> support this in the Linux server by exporting all the filesystems from
> >>> "/" on down that must be traversed to reach a given filesystem.  These
> >>> exports are very restricted (e.g. only parents of exports are visible).
> >>>
> >>
> >> Ok, thanks for explanation.
> >> So, this pseudoroot looks like a part of NFS server internal
> >> implementation, but not a part of a standard. That's good.
> >>
> >>>> Why does it prevents implementing of check for "superblock-network
> >>>> namespace" pair on NFS server start and forbid (?) it in case of
> >>>> this pair is shared already in other namespace? I.e. maybe this
> >>>> pseudoroot can be an exclusion from this rule?
> >>>
> >>> That might work.  It's read-only and consists only of directories, so
> >>> the grace period doesn't affect it.
> >>>
> >>
> >> I've just realized, that this per-sb grace period won't work.
> >> I.e., it's a valid situation, when two or more containers located on
> >> the same filesystem, but shares different parts of it. And there is
> >> not conflict here at all.
> >
> > Well, there may be some conflict in that a file could be hardlinked into
> > both subtrees, and that file could be locked from users of either
> > export.
> >
> 
> Is this case handled if both links or visible in the same export?
> But anyway, this is not that bad. I.e it doesn't make things unpredictable.
> Probably, there are some more issues like this one (bind-mounting, for example).
> But I think, that it's root responsibility to handle such problems.
> 

Well, it's a problem and one that you'll probably have to address to
some degree. In truth, the fact that you're exporting different
subtrees in different containers is immaterial since they're both on
the same fs and filehandles don't carry any info about the path in and
of themselves...

Suppose for instance that we have a hardlinked file that's available
from two different exports in two different containers. The grace
period ends in container #1, so that nfsd starts servicing normal lock
requests. An application takes a lock on that hardlinked file. In the
meantime, a client of container #2 attempts to reclaim the lock that he
previously held on that same inode and gets denied.

That's just one example. The scarier case is that the client of
container #1 takes the lock, alters the file and then drops it again
with the client of container #2 none the wiser. Now the file got
altered while client #2 thought he held a lock on it. That won't be fun
to track down...

This sort of thing is one of the reasons I've been saying that the
grace period is really a property of the underlying filesystem and not
of nfsd itself. Of course, we do have to come up with a way to handle
the grace period that doesn't involve altering every exportable fs.
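Purely as an illustration of that idea (nothing like this exists in the tree), a
per-superblock grace check might eventually replace the global locks_in_grace()
test. Every name below is hypothetical:

#include <linux/fs.h>
#include <linux/jiffies.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/*
 * Hypothetical sketch of "grace as a property of the filesystem".
 * Nothing here exists today; names and layout are made up purely to
 * illustrate the suggestion.
 */
struct sb_grace {
	struct super_block	*sb;
	unsigned long		grace_end;	/* in jiffies */
	struct list_head	list;
};

static LIST_HEAD(sb_grace_list);
static DEFINE_SPINLOCK(sb_grace_lock);

static bool sb_in_grace(struct super_block *sb)
{
	struct sb_grace *gr;
	bool ret = false;

	spin_lock(&sb_grace_lock);
	list_for_each_entry(gr, &sb_grace_list, list) {
		if (gr->sb == sb && time_before(jiffies, gr->grace_end)) {
			ret = true;	/* only reclaims allowed on this fs */
			break;
		}
	}
	spin_unlock(&sb_grace_lock);
	return ret;
}

Reclaim-only enforcement in the NFSv4 OPEN/LOCK and NLM paths would then consult
the superblock of the export being accessed rather than a single global flag.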

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Grace period
  2012-04-10 18:28                               ` Jeff Layton
@ 2012-04-10 20:46                                 ` bfields
  2012-04-11 10:08                                 ` Stanislav Kinsbursky
  1 sibling, 0 replies; 23+ messages in thread
From: bfields @ 2012-04-10 20:46 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Stanislav Kinsbursky, Myklebust, Trond, linux-nfs, linux-kernel

On Tue, Apr 10, 2012 at 02:28:53PM -0400, Jeff Layton wrote:
> This sort of thing is one of the reasons I've been saying that the
> grace period is really a property of the underlying filesystem and not
> of nfsd itself. Of course, we do have to come up with a way to handle
> the grace period that doesn't involve altering every exportable fs.

By the way, the case of multiple containers exporting a single
filesystem does look a lot like an active/active cluster filesystem
export.  It might be an opportunity to prototype the interfaces for
handling that case without having to deal with modifying the DLM.

--b.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Grace period
  2012-04-10 18:28                               ` Jeff Layton
  2012-04-10 20:46                                 ` bfields
@ 2012-04-11 10:08                                 ` Stanislav Kinsbursky
  1 sibling, 0 replies; 23+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-11 10:08 UTC (permalink / raw)
  To: Jeff Layton; +Cc: bfields, Myklebust, Trond, linux-nfs, linux-kernel

10.04.2012 22:28, Jeff Layton пишет:
> On Tue, 10 Apr 2012 19:36:26 +0400
> Stanislav Kinsbursky<skinsbursky@parallels.com>  wrote:
>
>> 10.04.2012 17:39, bfields@fieldses.org пишет:
>>> On Tue, Apr 10, 2012 at 02:56:12PM +0400, Stanislav Kinsbursky wrote:
>>>> 09.04.2012 22:11, bfields@fieldses.org пишет:
>>>>> Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be
>>>>> able to do readdir's and lookups to get to exported filesystems.  We
>>>>> support this in the Linux server by exporting all the filesystems from
>>>>> "/" on down that must be traversed to reach a given filesystem.  These
>>>>> exports are very restricted (e.g. only parents of exports are visible).
>>>>>
>>>>
>>>> Ok, thanks for explanation.
>>>> So, this pseudoroot looks like a part of NFS server internal
>>>> implementation, but not a part of a standard. That's good.
>>>>
>>>>>> Why does it prevents implementing of check for "superblock-network
>>>>>> namespace" pair on NFS server start and forbid (?) it in case of
>>>>>> this pair is shared already in other namespace? I.e. maybe this
>>>>>> pseudoroot can be an exclusion from this rule?
>>>>>
>>>>> That might work.  It's read-only and consists only of directories, so
>>>>> the grace period doesn't affect it.
>>>>>
>>>>
>>>> I've just realized, that this per-sb grace period won't work.
>>>> I.e., it's a valid situation, when two or more containers located on
>>>> the same filesystem, but shares different parts of it. And there is
>>>> not conflict here at all.
>>>
>>> Well, there may be some conflict in that a file could be hardlinked into
>>> both subtrees, and that file could be locked from users of either
>>> export.
>>>
>>
>> Is this case handled if both links or visible in the same export?
>> But anyway, this is not that bad. I.e it doesn't make things unpredictable.
>> Probably, there are some more issues like this one (bind-mounting, for example).
>> But I think, that it's root responsibility to handle such problems.
>>
>
> Well, it's a problem and one that you'll probably have to address to
> some degree. In truth, the fact that you're exporting different
> subtrees in different containers is immaterial since they're both on
> the same fs and filehandles don't carry any info about the path in and
> of themselves...
>
> Suppose for instance that we have a hardlinked file that's available
> from two different exports in two different containers. The grace
> period ends in container #1, so that nfsd starts servicing normal lock
> requests. An application takes a lock on that hardlinked file. In the
> meantime, a client of container #2 attempts to reclaim the lock that he
> previously held on that same inode and gets denied.
>

> That's just one example. The scarier case is that the client of
> container #1 takes the lock, alters the file and then drops it again
> with the client of container #2 none the wiser. Now the file got
> altered while client #2 thought he held a lock on it. That won't be fun
> to track down...
>
> This sort of thing is one of the reasons I've been saying that the
> grace period is really a property of the underlying filesystem and not
> of nfsd itself. Of course, we do have to come up with a way to handle
> the grace period that doesn't involve altering every exportable fs.
>

I see.
But, frankly speaking, it looks like the problem you are talking about is a separate 
task (compared to containerization).
I.e. making NFSd work per network namespace is somewhat different from these 
"shared filesystem" issues (which are actually part of the mount namespace).



-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2012-04-11 10:20 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <4F7F230A.6080506@parallels.com>
     [not found] ` <20120406234039.GA20940@fieldses.org>
2012-04-09 11:24   ` Grace period Stanislav Kinsbursky
2012-04-09 13:47     ` Jeff Layton
2012-04-09 14:25       ` Stanislav Kinsbursky
2012-04-09 15:27         ` Jeff Layton
2012-04-09 16:08           ` Stanislav Kinsbursky
2012-04-09 16:11             ` bfields
2012-04-09 16:17               ` Myklebust, Trond
2012-04-09 16:21                 ` bfields
2012-04-09 16:33                   ` Myklebust, Trond
2012-04-09 16:39                     ` bfields
2012-04-09 16:56                     ` Stanislav Kinsbursky
2012-04-09 18:11                       ` bfields
2012-04-10 10:56                         ` Stanislav Kinsbursky
2012-04-10 13:39                           ` bfields
2012-04-10 15:36                             ` Stanislav Kinsbursky
2012-04-10 18:28                               ` Jeff Layton
2012-04-10 20:46                                 ` bfields
2012-04-11 10:08                                 ` Stanislav Kinsbursky
2012-04-09 23:26     ` bfields
2012-04-10 11:29       ` Stanislav Kinsbursky
2012-04-10 13:37         ` bfields
2012-04-10 14:10           ` Stanislav Kinsbursky
2012-04-10 14:18             ` bfields
