* [PATCH] NFS: Don't let readdirplus revalidate an inode that was marked as stale @ 2016-06-14 21:25 Trond Myklebust 2016-06-30 21:46 ` grace period Marc Eshel 0 siblings, 1 reply; 44+ messages in thread From: Trond Myklebust @ 2016-06-14 21:25 UTC (permalink / raw) To: linux-nfs Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> --- fs/nfs/dir.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index aaf7bd0cbae2..a924d66b5608 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -424,12 +424,17 @@ static int xdr_decode(nfs_readdir_descriptor_t *desc, static int nfs_same_file(struct dentry *dentry, struct nfs_entry *entry) { + struct inode *inode; struct nfs_inode *nfsi; if (d_really_is_negative(dentry)) return 0; - nfsi = NFS_I(d_inode(dentry)); + inode = d_inode(dentry); + if (is_bad_inode(inode) || NFS_STALE(inode)) + return 0; + + nfsi = NFS_I(inode); if (entry->fattr->fileid == nfsi->fileid) return 1; if (nfs_compare_fh(entry->fh, &nfsi->fh) == 0) -- 2.5.5 ^ permalink raw reply related [flat|nested] 44+ messages in thread
* grace period 2016-06-14 21:25 [PATCH] NFS: Don't let readdirplus revalidate an inode that was marked as stale Trond Myklebust @ 2016-06-30 21:46 ` Marc Eshel 2016-07-01 16:08 ` Bruce Fields 0 siblings, 1 reply; 44+ messages in thread From: Marc Eshel @ 2016-06-30 21:46 UTC (permalink / raw) To: Bruce Fields; +Cc: linux-nfs Hi Bruce, I see that setting the number of nfsd threads to 0 (echo 0 > /proc/fs/nfsd/threads) is not releasing the locks or putting the server into grace mode. What is the best way to go into a grace period, on newer kernels, without restarting the NFS server? Thanks, Marc. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: grace period 2016-06-30 21:46 ` grace period Marc Eshel @ 2016-07-01 16:08 ` Bruce Fields 2016-07-01 17:31 ` Marc Eshel 0 siblings, 1 reply; 44+ messages in thread From: Bruce Fields @ 2016-07-01 16:08 UTC (permalink / raw) To: Marc Eshel; +Cc: linux-nfs On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > I see that setting the number of nfsd threads to 0 (echo 0 > > /proc/fs/nfsd/threads) is not releasing the locks and putting the server > in grace mode. Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should certainly drop locks. If that's not happening, there's a bug, but we'd need to know more details (version numbers, etc.) to help. That alone has never been enough to start a grace period--you'd have to start knfsd again to do that. > What is the best way to go into grace period, in new version of the > kernel, without restarting the nfs server? Restarting the nfs server is the only way. That's true on older kernels too, as far as I know. (OK, you can apparently make lockd do something like this with a signal, I don't know if that's used much, and I doubt it works outside an NFSv3-only environment.) So if you want locks dropped and a new grace period, then you should run "systemctl restart nfs-server", or your distro's equivalent. But you're probably doing something more complicated than that. I'm not sure I understand the question.... --b. ^ permalink raw reply [flat|nested] 44+ messages in thread
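[Editor's note: the restart Bruce suggests can be sketched as a tiny script. This helper is illustrative, not from the thread; `RUN=echo` makes it a dry run so the sequence can be inspected without root or a running server, and on a real systemd-based server you would set `RUN` to the empty string.]

```shell
# Dry-run sketch of dropping all NFS lock state and starting a new
# grace period by restarting the whole server, as suggested above.
# RUN=echo prints the command instead of executing it; on a real
# server, run as root with RUN set to the empty string.
RUN=${RUN:-echo}

restart_with_grace() {
    # Restarting knfsd drops all locks and begins a fresh grace period.
    $RUN systemctl restart nfs-server
}

restart_with_grace
```

The service name (`nfs-server`) is the Fedora/RHEL unit; as Bruce notes, other distros may use an equivalent.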
* Re: grace period 2016-07-01 16:08 ` Bruce Fields @ 2016-07-01 17:31 ` Marc Eshel 2016-07-01 20:07 ` Bruce Fields 0 siblings, 1 reply; 44+ messages in thread From: Marc Eshel @ 2016-07-01 17:31 UTC (permalink / raw) To: Bruce Fields; +Cc: linux-nfs It used to be that sending a KILL signal to lockd would free locks and start a grace period. And when setting nfsd threads to zero, nfsd_last_thread() calls nfsd_shutdown(), which called lockd_down(), which I believe was causing both the freeing of locks and the start of a grace period; or maybe it was setting the thread count back to a value > 0 that started the grace period. Anyway, starting with the kernels in RHEL7.1 and up, echo 0 > /proc/fs/nfsd/threads doesn't do it anymore; I assume moving to a common grace period for NLM and NFSv4 changed things. The question is how to do IP fail-over: when a node fails and its IP moves to another node, we need to go into a grace period on all the nodes in the cluster so the locks of the failed node are not given to anyone other than the client that is reclaiming them. Restarting the NFS server is too disruptive. For NFSv3 the KILL signal to lockd still works, but for NFSv4 we have no way to do it. Marc. From: Bruce Fields <bfields@fieldses.org> To: Marc Eshel/Almaden/IBM@IBMUS Cc: linux-nfs@vger.kernel.org Date: 07/01/2016 09:09 AM Subject: Re: grace period On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > I see that setting the number of nfsd threads to 0 (echo 0 > > /proc/fs/nfsd/threads) is not releasing the locks and putting the server > in grace mode. Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should certainly drop locks. If that's not happening, there's a bug, but we'd need to know more details (version numbers, etc.) to help. That alone has never been enough to start a grace period--you'd have to start knfsd again to do that. > What is the best way to go into grace period, in new version of the > kernel, without restarting the nfs server? 
Restarting the nfs server is the only way. That's true on older kernels true, as far as I know. (OK, you can apparently make lockd do something like this with a signal, I don't know if that's used much, and I doubt it works outside an NFSv3-only environment.) So if you want locks dropped and a new grace period, then you should run "systemctl restart nfs-server", or your distro's equivalent. But you're probably doing something more complicated than that. I'm not sure I understand the question.... --b. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: grace period 2016-07-01 17:31 ` Marc Eshel @ 2016-07-01 20:07 ` Bruce Fields 2016-07-01 20:24 ` Marc Eshel ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Bruce Fields @ 2016-07-01 20:07 UTC (permalink / raw) To: Marc Eshel; +Cc: linux-nfs On Fri, Jul 01, 2016 at 10:31:55AM -0700, Marc Eshel wrote: > It used to be that sending KILL signal to lockd would free locks and start > Grace period, and when setting nfsd threads to zero, nfsd_last_thread() > calls nfsd_shutdown that called lockd_down that I believe was causing both > freeing of locks and starting grace period or maybe it was setting it back > to a value > 0 that started the grace period. OK, apologies, I didn't know (or forgot) that. > Any way starting with the kernels that are in RHEL7.1 and up echo 0 > > /proc/fs/nfsd/threads doesn't do it anymore, I assume going to common > grace period for NLM and NFSv4 changed things. > The question is how to do IP fail-over, so when a node fails and the IP is > moving to another node, we need to go into grace period on all the nodes > in the cluster so the locks of the failed node are not given to anyone > other than the client that is reclaiming his locks. Restarting NFS server > is to distractive. What's the difference? Just that clients don't have to reestablish tcp connections? --b. > For NFSv3 KILL signal to lockd still works but for > NFSv4 have no way to do it for v4. > Marc. > > > > From: Bruce Fields <bfields@fieldses.org> > To: Marc Eshel/Almaden/IBM@IBMUS > Cc: linux-nfs@vger.kernel.org > Date: 07/01/2016 09:09 AM > Subject: Re: grace period > > > > On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > > I see that setting the number of nfsd threads to 0 (echo 0 > > > /proc/fs/nfsd/threads) is not releasing the locks and putting the server > > > in grace mode. > > Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should > certainly drop locks. 
If that's not happening, there's a bug, but we'd > need to know more details (version numbers, etc.) to help. > > That alone has never been enough to start a grace period--you'd have to > start knfsd again to do that. > > > What is the best way to go into grace period, in new version of the > > kernel, without restarting the nfs server? > > Restarting the nfs server is the only way. That's true on older kernels > true, as far as I know. (OK, you can apparently make lockd do something > like this with a signal, I don't know if that's used much, and I doubt > it works outside an NFSv3-only environment.) > > So if you want locks dropped and a new grace period, then you should run > "systemctl restart nfs-server", or your distro's equivalent. > > But you're probably doing something more complicated than that. I'm not > sure I understand the question.... > > --b. > > > > ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: grace period 2016-07-01 20:07 ` Bruce Fields @ 2016-07-01 20:24 ` Marc Eshel 2016-07-01 20:47 ` Bruce Fields 2016-07-01 20:46 ` Marc Eshel [not found] ` <OF5D486F02.62CECB7B-ON88257FE3.0071DBE5-88257FE3.00722318@LocalDomain> 2 siblings, 1 reply; 44+ messages in thread From: Marc Eshel @ 2016-07-01 20:24 UTC (permalink / raw) To: Bruce Fields; +Cc: linux-nfs, Tomer Perry linux-nfs-owner@vger.kernel.org wrote on 07/01/2016 01:07:42 PM: > From: Bruce Fields <bfields@fieldses.org> > To: Marc Eshel/Almaden/IBM@IBMUS > Cc: linux-nfs@vger.kernel.org > Date: 07/01/2016 01:07 PM > Subject: Re: grace period > Sent by: linux-nfs-owner@vger.kernel.org > > On Fri, Jul 01, 2016 at 10:31:55AM -0700, Marc Eshel wrote: > > It used to be that sending KILL signal to lockd would free locks and start > > Grace period, and when setting nfsd threads to zero, nfsd_last_thread() > > calls nfsd_shutdown that called lockd_down that I believe was causing both > > freeing of locks and starting grace period or maybe it was setting it back > > to a value > 0 that started the grace period. > > OK, apologies, I didn't know (or forgot) that. > > > Any way starting with the kernels that are in RHEL7.1 and up echo 0 > > > /proc/fs/nfsd/threads doesn't do it anymore, I assume going to common > > grace period for NLM and NFSv4 changed things. > > The question is how to do IP fail-over, so when a node fails and the IP is > > moving to another node, we need to go into grace period on all the nodes > > in the cluster so the locks of the failed node are not given to anyone > > other than the client that is reclaiming his locks. Restarting NFS server > > is to distractive. > > What's the difference? Just that clients don't have to reestablish tcp > connections? I am not sure what else systemctl will do but I need to control the order of the restart so the client will not see any errors. 
I don't think that echo 0 > /proc/fs/nfsd/threads is freeing the lock, at least not the v3 locks, I will try again with v4. The question is what is the most basic operation that can be done to start grace, will echo 8 > /proc/fs/nfsd/threads following echo 0 do it? or is there any other primitive that will do it? Marc. > > --b. > > > For NFSv3 KILL signal to lockd still works but for > > NFSv4 have no way to do it for v4. > > Marc. > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > To: Marc Eshel/Almaden/IBM@IBMUS > > Cc: linux-nfs@vger.kernel.org > > Date: 07/01/2016 09:09 AM > > Subject: Re: grace period > > > > > > > > On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > > > I see that setting the number of nfsd threads to 0 (echo 0 > > > > /proc/fs/nfsd/threads) is not releasing the locks and putting the server > > > > > in grace mode. > > > > Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should > > certainly drop locks. If that's not happening, there's a bug, but we'd > > need to know more details (version numbers, etc.) to help. > > > > That alone has never been enough to start a grace period--you'd have to > > start knfsd again to do that. > > > > > What is the best way to go into grace period, in new version of the > > > kernel, without restarting the nfs server? > > > > Restarting the nfs server is the only way. That's true on older kernels > > true, as far as I know. (OK, you can apparently make lockd do something > > like this with a signal, I don't know if that's used much, and I doubt > > it works outside an NFSv3-only environment.) > > > > So if you want locks dropped and a new grace period, then you should run > > "systemctl restart nfs-server", or your distro's equivalent. > > > > But you're probably doing something more complicated than that. I'm not > > sure I understand the question.... > > > > --b. 
> > > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: grace period 2016-07-01 20:24 ` Marc Eshel @ 2016-07-01 20:47 ` Bruce Fields 0 siblings, 0 replies; 44+ messages in thread From: Bruce Fields @ 2016-07-01 20:47 UTC (permalink / raw) To: Marc Eshel; +Cc: linux-nfs, Tomer Perry On Fri, Jul 01, 2016 at 01:24:48PM -0700, Marc Eshel wrote: > linux-nfs-owner@vger.kernel.org wrote on 07/01/2016 01:07:42 PM: > > > From: Bruce Fields <bfields@fieldses.org> > > To: Marc Eshel/Almaden/IBM@IBMUS > > Cc: linux-nfs@vger.kernel.org > > Date: 07/01/2016 01:07 PM > > Subject: Re: grace period > > Sent by: linux-nfs-owner@vger.kernel.org > > > > On Fri, Jul 01, 2016 at 10:31:55AM -0700, Marc Eshel wrote: > > > It used to be that sending KILL signal to lockd would free locks and > start > > > Grace period, and when setting nfsd threads to zero, > nfsd_last_thread() > > > calls nfsd_shutdown that called lockd_down that I believe was causing > both > > > freeing of locks and starting grace period or maybe it was setting it > back > > > to a value > 0 that started the grace period. > > > > OK, apologies, I didn't know (or forgot) that. > > > > > Any way starting with the kernels that are in RHEL7.1 and up echo 0 > > > > /proc/fs/nfsd/threads doesn't do it anymore, I assume going to common > > > grace period for NLM and NFSv4 changed things. > > > The question is how to do IP fail-over, so when a node fails and the > IP is > > > moving to another node, we need to go into grace period on all the > nodes > > > in the cluster so the locks of the failed node are not given to anyone > > > > other than the client that is reclaiming his locks. Restarting NFS > server > > > is to distractive. > > > > What's the difference? Just that clients don't have to reestablish tcp > > connections? > > I am not sure what else systemctl will do but I need to control the order > of the restart so the client will not see any errors. 
> I don't think that echo 0 > /proc/fs/nfsd/threads is freeing the lock, at > least not the v3 locks, I will try again with v4. > The question is what is the most basic operation that can be done to start > grace, will echo 8 > /proc/fs/nfsd/threads following echo 0 do it? > or is there any other primitive that will do it? That should do it, though really so should just "systemctl restart nfs-server"--if that causes errors then there's a bug somewhere. --b. > Marc. > > > > > --b. > > > > > For NFSv3 KILL signal to lockd still works but for > > > NFSv4 have no way to do it for v4. > > > Marc. > > > > > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > > To: Marc Eshel/Almaden/IBM@IBMUS > > > Cc: linux-nfs@vger.kernel.org > > > Date: 07/01/2016 09:09 AM > > > Subject: Re: grace period > > > > > > > > > > > > On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > > > > I see that setting the number of nfsd threads to 0 (echo 0 > > > > > /proc/fs/nfsd/threads) is not releasing the locks and putting the > server > > > > > > > in grace mode. > > > > > > Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should > > > certainly drop locks. If that's not happening, there's a bug, but > we'd > > > need to know more details (version numbers, etc.) to help. > > > > > > That alone has never been enough to start a grace period--you'd have > to > > > start knfsd again to do that. > > > > > > > What is the best way to go into grace period, in new version of the > > > > kernel, without restarting the nfs server? > > > > > > Restarting the nfs server is the only way. That's true on older > kernels > > > true, as far as I know. (OK, you can apparently make lockd do > something > > > like this with a signal, I don't know if that's used much, and I doubt > > > it works outside an NFSv3-only environment.) > > > > > > So if you want locks dropped and a new grace period, then you should > run > > > "systemctl restart nfs-server", or your distro's equivalent. 
> > > > > > But you're probably doing something more complicated than that. I'm > not > > > sure I understand the question.... > > > > > > --b. > > > > > > > > > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > ^ permalink raw reply [flat|nested] 44+ messages in thread
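[Editor's note: the thread-count bounce Marc proposes and Bruce confirms (echo 0 followed by a nonzero count) can be sketched as follows. `THREADS` defaults to a scratch file so the sketch runs anywhere; on a real server it would be /proc/fs/nfsd/threads and you would need root.]

```shell
# Sketch of bouncing the nfsd thread count to force a new grace period:
# writing 0 shuts knfsd down (which should drop all locks), and writing
# a nonzero count starts it again, which begins the grace period.
# THREADS defaults to a scratch file so this can be run anywhere;
# point it at /proc/fs/nfsd/threads (as root) on a real server.
THREADS=${THREADS:-$(mktemp)}

bounce_nfsd() {
    echo 0 > "$THREADS"   # shut down knfsd: all lock state is dropped
    echo 8 > "$THREADS"   # restart with 8 threads: grace period begins
}

bounce_nfsd
cat "$THREADS"
```

The thread count 8 is arbitrary; any value > 0 restarts knfsd.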
* Re: grace period 2016-07-01 20:07 ` Bruce Fields 2016-07-01 20:24 ` Marc Eshel @ 2016-07-01 20:46 ` Marc Eshel 2016-07-01 21:01 ` Bruce Fields [not found] ` <OF5D486F02.62CECB7B-ON88257FE3.0071DBE5-88257FE3.00722318@LocalDomain> 2 siblings, 1 reply; 44+ messages in thread From: Marc Eshel @ 2016-07-01 20:46 UTC (permalink / raw) To: Bruce Fields; +Cc: linux-nfs, Tomer Perry This is my v3 test that show the lock still there after echo 0 > /proc/fs/nfsd/threads [root@sonascl21 ~]# cat /etc/redhat-release Red Hat Enterprise Linux Server release 7.2 (Maipo) [root@sonascl21 ~]# uname -a Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux [root@sonascl21 ~]# cat /proc/locks | grep 999 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 [root@sonascl21 ~]# echo 0 > /proc/fs/nfsd/threads [root@sonascl21 ~]# cat /proc/fs/nfsd/threads 0 [root@sonascl21 ~]# cat /proc/locks | grep 999 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 From: Bruce Fields <bfields@fieldses.org> To: Marc Eshel/Almaden/IBM@IBMUS Cc: linux-nfs@vger.kernel.org Date: 07/01/2016 01:07 PM Subject: Re: grace period On Fri, Jul 01, 2016 at 10:31:55AM -0700, Marc Eshel wrote: > It used to be that sending KILL signal to lockd would free locks and start > Grace period, and when setting nfsd threads to zero, nfsd_last_thread() > calls nfsd_shutdown that called lockd_down that I believe was causing both > freeing of locks and starting grace period or maybe it was setting it back > to a value > 0 that started the grace period. OK, apologies, I didn't know (or forgot) that. > Any way starting with the kernels that are in RHEL7.1 and up echo 0 > > /proc/fs/nfsd/threads doesn't do it anymore, I assume going to common > grace period for NLM and NFSv4 changed things. 
> The question is how to do IP fail-over, so when a node fails and the IP is > moving to another node, we need to go into grace period on all the nodes > in the cluster so the locks of the failed node are not given to anyone > other than the client that is reclaiming his locks. Restarting NFS server > is to distractive. What's the difference? Just that clients don't have to reestablish tcp connections? --b. > For NFSv3 KILL signal to lockd still works but for > NFSv4 have no way to do it for v4. > Marc. > > > > From: Bruce Fields <bfields@fieldses.org> > To: Marc Eshel/Almaden/IBM@IBMUS > Cc: linux-nfs@vger.kernel.org > Date: 07/01/2016 09:09 AM > Subject: Re: grace period > > > > On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > > I see that setting the number of nfsd threads to 0 (echo 0 > > > /proc/fs/nfsd/threads) is not releasing the locks and putting the server > > > in grace mode. > > Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should > certainly drop locks. If that's not happening, there's a bug, but we'd > need to know more details (version numbers, etc.) to help. > > That alone has never been enough to start a grace period--you'd have to > start knfsd again to do that. > > > What is the best way to go into grace period, in new version of the > > kernel, without restarting the nfs server? > > Restarting the nfs server is the only way. That's true on older kernels > true, as far as I know. (OK, you can apparently make lockd do something > like this with a signal, I don't know if that's used much, and I doubt > it works outside an NFSv3-only environment.) > > So if you want locks dropped and a new grace period, then you should run > "systemctl restart nfs-server", or your distro's equivalent. > > But you're probably doing something more complicated than that. I'm not > sure I understand the question.... > > --b. > > > > ^ permalink raw reply [flat|nested] 44+ messages in thread
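[Editor's note: Marc's before/after check can be wrapped in a small filter over a /proc/locks snapshot. The sample line is the one from his transcript above; the helper itself is illustrative, not part of the thread, and operates on captured text so it can be run anywhere.]

```shell
# Check whether a POSIX byte-range lock with a given end offset shows
# up in a /proc/locks snapshot. The sample line is copied from the
# transcript above; on a live server you would feed the function
# "$(cat /proc/locks)" instead.
snapshot='3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999'

lock_held() {
    # $1 = snapshot text, $2 = lock end offset to look for
    printf '%s\n' "$1" |
        awk -v end="$2" '$2 == "POSIX" && $NF == end { found = 1 } END { exit !found }'
}

if lock_held "$snapshot" 999; then
    echo "lock still held"
else
    echo "lock gone"
fi
```

Run before and after `echo 0 > /proc/fs/nfsd/threads`, the output should flip from "lock still held" to "lock gone" on a server that drops locks correctly.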
* Re: grace period 2016-07-01 20:46 ` Marc Eshel @ 2016-07-01 21:01 ` Bruce Fields 2016-07-01 22:42 ` Marc Eshel 0 siblings, 1 reply; 44+ messages in thread From: Bruce Fields @ 2016-07-01 21:01 UTC (permalink / raw) To: Marc Eshel; +Cc: linux-nfs, Tomer Perry On Fri, Jul 01, 2016 at 01:46:42PM -0700, Marc Eshel wrote: > This is my v3 test that show the lock still there after echo 0 > > /proc/fs/nfsd/threads > > [root@sonascl21 ~]# cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) > > [root@sonascl21 ~]# uname -a > Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP Thu > Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 > > [root@sonascl21 ~]# echo 0 > /proc/fs/nfsd/threads > [root@sonascl21 ~]# cat /proc/fs/nfsd/threads > 0 > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 Huh, that's not what I see. Are you positive that's the lock on the backend filesystem and not the client-side lock (in case you're doing a loopback mount?) --b. > > > > > From: Bruce Fields <bfields@fieldses.org> > To: Marc Eshel/Almaden/IBM@IBMUS > Cc: linux-nfs@vger.kernel.org > Date: 07/01/2016 01:07 PM > Subject: Re: grace period > > > > On Fri, Jul 01, 2016 at 10:31:55AM -0700, Marc Eshel wrote: > > It used to be that sending KILL signal to lockd would free locks and > start > > Grace period, and when setting nfsd threads to zero, nfsd_last_thread() > > calls nfsd_shutdown that called lockd_down that I believe was causing > both > > freeing of locks and starting grace period or maybe it was setting it > back > > to a value > 0 that started the grace period. > > OK, apologies, I didn't know (or forgot) that. 
> > > Any way starting with the kernels that are in RHEL7.1 and up echo 0 > > > /proc/fs/nfsd/threads doesn't do it anymore, I assume going to common > > grace period for NLM and NFSv4 changed things. > > The question is how to do IP fail-over, so when a node fails and the IP > is > > moving to another node, we need to go into grace period on all the nodes > > > in the cluster so the locks of the failed node are not given to anyone > > other than the client that is reclaiming his locks. Restarting NFS > server > > is to distractive. > > What's the difference? Just that clients don't have to reestablish tcp > connections? > > --b. > > > For NFSv3 KILL signal to lockd still works but for > > NFSv4 have no way to do it for v4. > > Marc. > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > To: Marc Eshel/Almaden/IBM@IBMUS > > Cc: linux-nfs@vger.kernel.org > > Date: 07/01/2016 09:09 AM > > Subject: Re: grace period > > > > > > > > On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > > > I see that setting the number of nfsd threads to 0 (echo 0 > > > > /proc/fs/nfsd/threads) is not releasing the locks and putting the > server > > > > > in grace mode. > > > > Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should > > certainly drop locks. If that's not happening, there's a bug, but we'd > > need to know more details (version numbers, etc.) to help. > > > > That alone has never been enough to start a grace period--you'd have to > > start knfsd again to do that. > > > > > What is the best way to go into grace period, in new version of the > > > kernel, without restarting the nfs server? > > > > Restarting the nfs server is the only way. That's true on older kernels > > true, as far as I know. (OK, you can apparently make lockd do something > > like this with a signal, I don't know if that's used much, and I doubt > > it works outside an NFSv3-only environment.) 
> > > > So if you want locks dropped and a new grace period, then you should run > > "systemctl restart nfs-server", or your distro's equivalent. > > > > But you're probably doing something more complicated than that. I'm not > > sure I understand the question.... > > > > --b. > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 44+ messages in thread
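[Editor's note: Bruce's loopback-mount question can be approached with a rough heuristic over the device field of /proc/locks. The assumption here, which varies by setup, is that an NFS client mount appears on an unnamed device with major 00 (like the 00:2a in Marc's first transcript), while the backend filesystem sits on a real block device (like the device-mapper fd:00 seen later in the thread). Both sample lines are taken from the thread; the helper is hypothetical.]

```shell
# Rough heuristic for telling a client-side lock from a backend-fs
# lock: the sixth field of a /proc/locks line is MAJOR:MINOR:INODE of
# the locked file. Major 00 is an unnamed/virtual device (what an NFS
# client mount shows), while anything else is a real block device and
# so the backend filesystem. Illustrative only; majors vary by setup.
where_is_lock() {
    # $1 = one /proc/locks line
    case "$(printf '%s\n' "$1" | awk '{ print $6 }')" in
        00:*) echo "virtual fs (possibly an NFS client-side lock)" ;;
        *)    echo "local block device (backend filesystem lock)" ;;
    esac
}

where_is_lock '3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999'
where_is_lock '1: POSIX ADVISORY WRITE 2346 fd:00:1612092569 0 9999'
```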
* Re: grace period 2016-07-01 21:01 ` Bruce Fields @ 2016-07-01 22:42 ` Marc Eshel 2016-07-02 0:58 ` Bruce Fields 0 siblings, 1 reply; 44+ messages in thread From: Marc Eshel @ 2016-07-01 22:42 UTC (permalink / raw) To: Bruce Fields; +Cc: linux-nfs, Tomer Perry Yes, the locks are requested from another node, what fs are you using, I don't think it should make any difference, but I can try it with the same fs. Make sure you are using v3, it does work for v4. Marc. From: Bruce Fields <bfields@fieldses.org> To: Marc Eshel/Almaden/IBM@IBMUS Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> Date: 07/01/2016 02:01 PM Subject: Re: grace period On Fri, Jul 01, 2016 at 01:46:42PM -0700, Marc Eshel wrote: > This is my v3 test that show the lock still there after echo 0 > > /proc/fs/nfsd/threads > > [root@sonascl21 ~]# cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) > > [root@sonascl21 ~]# uname -a > Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP Thu > Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 > > [root@sonascl21 ~]# echo 0 > /proc/fs/nfsd/threads > [root@sonascl21 ~]# cat /proc/fs/nfsd/threads > 0 > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 Huh, that's not what I see. Are you positive that's the lock on the backend filesystem and not the client-side lock (in case you're doing a loopback mount?) --b. 
> > > > > From: Bruce Fields <bfields@fieldses.org> > To: Marc Eshel/Almaden/IBM@IBMUS > Cc: linux-nfs@vger.kernel.org > Date: 07/01/2016 01:07 PM > Subject: Re: grace period > > > > On Fri, Jul 01, 2016 at 10:31:55AM -0700, Marc Eshel wrote: > > It used to be that sending KILL signal to lockd would free locks and > start > > Grace period, and when setting nfsd threads to zero, nfsd_last_thread() > > calls nfsd_shutdown that called lockd_down that I believe was causing > both > > freeing of locks and starting grace period or maybe it was setting it > back > > to a value > 0 that started the grace period. > > OK, apologies, I didn't know (or forgot) that. > > > Any way starting with the kernels that are in RHEL7.1 and up echo 0 > > > /proc/fs/nfsd/threads doesn't do it anymore, I assume going to common > > grace period for NLM and NFSv4 changed things. > > The question is how to do IP fail-over, so when a node fails and the IP > is > > moving to another node, we need to go into grace period on all the nodes > > > in the cluster so the locks of the failed node are not given to anyone > > other than the client that is reclaiming his locks. Restarting NFS > server > > is to distractive. > > What's the difference? Just that clients don't have to reestablish tcp > connections? > > --b. > > > For NFSv3 KILL signal to lockd still works but for > > NFSv4 have no way to do it for v4. > > Marc. > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > To: Marc Eshel/Almaden/IBM@IBMUS > > Cc: linux-nfs@vger.kernel.org > > Date: 07/01/2016 09:09 AM > > Subject: Re: grace period > > > > > > > > On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > > > I see that setting the number of nfsd threads to 0 (echo 0 > > > > /proc/fs/nfsd/threads) is not releasing the locks and putting the > server > > > > > in grace mode. > > > > Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should > > certainly drop locks. 
If that's not happening, there's a bug, but we'd > > need to know more details (version numbers, etc.) to help. > > > > That alone has never been enough to start a grace period--you'd have to > > start knfsd again to do that. > > > > > What is the best way to go into grace period, in new version of the > > > kernel, without restarting the nfs server? > > > > Restarting the nfs server is the only way. That's true on older kernels > > true, as far as I know. (OK, you can apparently make lockd do something > > like this with a signal, I don't know if that's used much, and I doubt > > it works outside an NFSv3-only environment.) > > > > So if you want locks dropped and a new grace period, then you should run > > "systemctl restart nfs-server", or your distro's equivalent. > > > > But you're probably doing something more complicated than that. I'm not > > sure I understand the question.... > > > > --b. > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: grace period 2016-07-01 22:42 ` Marc Eshel @ 2016-07-02 0:58 ` Bruce Fields 2016-07-03 5:30 ` Marc Eshel [not found] ` <OFC1237E53.3CFCA8E8-ON88257FE5.001D3182-88257FE5.001E3A5B@LocalDomain> 0 siblings, 2 replies; 44+ messages in thread From: Bruce Fields @ 2016-07-02 0:58 UTC (permalink / raw) To: Marc Eshel; +Cc: linux-nfs, Tomer Perry On Fri, Jul 01, 2016 at 03:42:43PM -0700, Marc Eshel wrote: > Yes, the locks are requested from another node, what fs are you using, I > don't think it should make any difference, but I can try it with the same > fs. > Make sure you are using v3, it does work for v4. I tested v3 on upstream.--b. > Marc. > > > > From: Bruce Fields <bfields@fieldses.org> > To: Marc Eshel/Almaden/IBM@IBMUS > Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> > Date: 07/01/2016 02:01 PM > Subject: Re: grace period > > > > On Fri, Jul 01, 2016 at 01:46:42PM -0700, Marc Eshel wrote: > > This is my v3 test that show the lock still there after echo 0 > > > /proc/fs/nfsd/threads > > > > [root@sonascl21 ~]# cat /etc/redhat-release > > Red Hat Enterprise Linux Server release 7.2 (Maipo) > > > > [root@sonascl21 ~]# uname -a > > Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP Thu > > > Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux > > > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 > > > > [root@sonascl21 ~]# echo 0 > /proc/fs/nfsd/threads > > [root@sonascl21 ~]# cat /proc/fs/nfsd/threads > > 0 > > > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 > > Huh, that's not what I see. Are you positive that's the lock on the > backend filesystem and not the client-side lock (in case you're doing a > loopback mount?) > > --b. 
> > > > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > To: Marc Eshel/Almaden/IBM@IBMUS > > Cc: linux-nfs@vger.kernel.org > > Date: 07/01/2016 01:07 PM > > Subject: Re: grace period > > > > > > > > On Fri, Jul 01, 2016 at 10:31:55AM -0700, Marc Eshel wrote: > > > It used to be that sending KILL signal to lockd would free locks and > > start > > > Grace period, and when setting nfsd threads to zero, > nfsd_last_thread() > > > calls nfsd_shutdown that called lockd_down that I believe was causing > > both > > > freeing of locks and starting grace period or maybe it was setting it > > back > > > to a value > 0 that started the grace period. > > > > OK, apologies, I didn't know (or forgot) that. > > > > > Any way starting with the kernels that are in RHEL7.1 and up echo 0 > > > > /proc/fs/nfsd/threads doesn't do it anymore, I assume going to common > > > grace period for NLM and NFSv4 changed things. > > > The question is how to do IP fail-over, so when a node fails and the > IP > > is > > > moving to another node, we need to go into grace period on all the > nodes > > > > > in the cluster so the locks of the failed node are not given to anyone > > > > other than the client that is reclaiming his locks. Restarting NFS > > server > > > is to distractive. > > > > What's the difference? Just that clients don't have to reestablish tcp > > connections? > > > > --b. > > > > > For NFSv3 KILL signal to lockd still works but for > > > NFSv4 have no way to do it for v4. > > > Marc. > > > > > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > > To: Marc Eshel/Almaden/IBM@IBMUS > > > Cc: linux-nfs@vger.kernel.org > > > Date: 07/01/2016 09:09 AM > > > Subject: Re: grace period > > > > > > > > > > > > On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > > > > I see that setting the number of nfsd threads to 0 (echo 0 > > > > > /proc/fs/nfsd/threads) is not releasing the locks and putting the > > server > > > > > > > in grace mode. 
> > > > > > Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should > > > certainly drop locks. If that's not happening, there's a bug, but > we'd > > > need to know more details (version numbers, etc.) to help. > > > > > > That alone has never been enough to start a grace period--you'd have > to > > > start knfsd again to do that. > > > > > > > What is the best way to go into grace period, in new version of the > > > > kernel, without restarting the nfs server? > > > > > > Restarting the nfs server is the only way. That's true on older > kernels > > > true, as far as I know. (OK, you can apparently make lockd do > something > > > like this with a signal, I don't know if that's used much, and I doubt > > > it works outside an NFSv3-only environment.) > > > > > > So if you want locks dropped and a new grace period, then you should > run > > > "systemctl restart nfs-server", or your distro's equivalent. > > > > > > But you're probably doing something more complicated than that. I'm > not > > > sure I understand the question.... > > > > > > --b. > > > > > > > > > > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 44+ messages in thread
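[Editor's note: the messages above repeatedly run `cat /proc/locks | grep 999` to see whether the server-side lock survives shutting down knfsd. A minimal sketch of that check is below; the lock lines are copied from the thread, and a captured sample stands in for the live `/proc/locks` so the snippet runs anywhere without root or a running nfsd.]

```shell
# Sample /proc/locks contents (lines taken from the thread's transcripts).
# In /proc/locks the second field is the lock class: POSIX, FLOCK, etc.
sample='1: POSIX  ADVISORY  WRITE 2346 fd:00:1612092569 0 9999
2: FLOCK  ADVISORY  WRITE 1024 00:2a:123456 0 EOF
3: POSIX  ADVISORY  WRITE 2349 00:2a:489486 0 999'

# Count POSIX advisory locks still held; a non-zero count after
# `echo 0 > /proc/fs/nfsd/threads` is exactly the bug being reported.
printf '%s\n' "$sample" | awk '$2 == "POSIX" { n++ } END { print n+0 }'
```

On a live server the pipeline would read `/proc/locks` directly; the grep for a distinctive byte range (`999` in the thread) is just a quick way to pick out the test lock.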
* Re: grace period 2016-07-02 0:58 ` Bruce Fields @ 2016-07-03 5:30 ` Marc Eshel 2016-07-05 20:51 ` Bruce Fields [not found] ` <OFC1237E53.3CFCA8E8-ON88257FE5.001D3182-88257FE5.001E3A5B@LocalDomain> 1 sibling, 1 reply; 44+ messages in thread From: Marc Eshel @ 2016-07-03 5:30 UTC (permalink / raw) To: Bruce Fields; +Cc: linux-nfs, Tomer Perry I tried again NFSv3 locks with xfs export. "echo 0 > /proc/fs/nfsd/threads" releases locks on rhel7.0 but not on rhel7.2 What else can I show you to find the problem? Marc. works: [root@boar11 ~]# uname -a Linux boar11 3.10.0-123.el7.x86_64 #1 SMP Mon May 5 11:16:57 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux [root@boar11 ~]# cat /etc/redhat-release Red Hat Enterprise Linux Server release 7.0 (Maipo) not working: [root@sonascl21 ~]# uname -a Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux [root@sonascl21 ~]# cat /etc/redhat-release Red Hat Enterprise Linux Server release 7.2 (Maipo) [root@sonascl21 ~]# cat /proc/fs/nfsd/threads 0 [root@sonascl21 ~]# cat /proc/locks 1: POSIX ADVISORY WRITE 2346 fd:00:1612092569 0 9999 From: Bruce Fields <bfields@fieldses.org> To: Marc Eshel/Almaden/IBM@IBMUS Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> Date: 07/01/2016 05:58 PM Subject: Re: grace period On Fri, Jul 01, 2016 at 03:42:43PM -0700, Marc Eshel wrote: > Yes, the locks are requested from another node, what fs are you using, I > don't think it should make any difference, but I can try it with the same > fs. > Make sure you are using v3, it does work for v4. I tested v3 on upstream.--b. > Marc. 
> > > > From: Bruce Fields <bfields@fieldses.org> > To: Marc Eshel/Almaden/IBM@IBMUS > Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> > Date: 07/01/2016 02:01 PM > Subject: Re: grace period > > > > On Fri, Jul 01, 2016 at 01:46:42PM -0700, Marc Eshel wrote: > > This is my v3 test that show the lock still there after echo 0 > > > /proc/fs/nfsd/threads > > > > [root@sonascl21 ~]# cat /etc/redhat-release > > Red Hat Enterprise Linux Server release 7.2 (Maipo) > > > > [root@sonascl21 ~]# uname -a > > Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP Thu > > > Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux > > > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 > > > > [root@sonascl21 ~]# echo 0 > /proc/fs/nfsd/threads > > [root@sonascl21 ~]# cat /proc/fs/nfsd/threads > > 0 > > > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 > > Huh, that's not what I see. Are you positive that's the lock on the > backend filesystem and not the client-side lock (in case you're doing a > loopback mount?) > > --b. > > > > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > To: Marc Eshel/Almaden/IBM@IBMUS > > Cc: linux-nfs@vger.kernel.org > > Date: 07/01/2016 01:07 PM > > Subject: Re: grace period > > > > > > > > On Fri, Jul 01, 2016 at 10:31:55AM -0700, Marc Eshel wrote: > > > It used to be that sending KILL signal to lockd would free locks and > > start > > > Grace period, and when setting nfsd threads to zero, > nfsd_last_thread() > > > calls nfsd_shutdown that called lockd_down that I believe was causing > > both > > > freeing of locks and starting grace period or maybe it was setting it > > back > > > to a value > 0 that started the grace period. > > > > OK, apologies, I didn't know (or forgot) that. 
> > > > > Any way starting with the kernels that are in RHEL7.1 and up echo 0 > > > > /proc/fs/nfsd/threads doesn't do it anymore, I assume going to common > > > grace period for NLM and NFSv4 changed things. > > > The question is how to do IP fail-over, so when a node fails and the > IP > > is > > > moving to another node, we need to go into grace period on all the > nodes > > > > > in the cluster so the locks of the failed node are not given to anyone > > > > other than the client that is reclaiming his locks. Restarting NFS > > server > > > is to distractive. > > > > What's the difference? Just that clients don't have to reestablish tcp > > connections? > > > > --b. > > > > > For NFSv3 KILL signal to lockd still works but for > > > NFSv4 have no way to do it for v4. > > > Marc. > > > > > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > > To: Marc Eshel/Almaden/IBM@IBMUS > > > Cc: linux-nfs@vger.kernel.org > > > Date: 07/01/2016 09:09 AM > > > Subject: Re: grace period > > > > > > > > > > > > On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > > > > I see that setting the number of nfsd threads to 0 (echo 0 > > > > > /proc/fs/nfsd/threads) is not releasing the locks and putting the > > server > > > > > > > in grace mode. > > > > > > Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should > > > certainly drop locks. If that's not happening, there's a bug, but > we'd > > > need to know more details (version numbers, etc.) to help. > > > > > > That alone has never been enough to start a grace period--you'd have > to > > > start knfsd again to do that. > > > > > > > What is the best way to go into grace period, in new version of the > > > > kernel, without restarting the nfs server? > > > > > > Restarting the nfs server is the only way. That's true on older > kernels > > > true, as far as I know. 
(OK, you can apparently make lockd do > something > > > like this with a signal, I don't know if that's used much, and I doubt > > > it works outside an NFSv3-only environment.) > > > > > > So if you want locks dropped and a new grace period, then you should > run > > > "systemctl restart nfs-server", or your distro's equivalent. > > > > > > But you're probably doing something more complicated than that. I'm > not > > > sure I understand the question.... > > > > > > --b. > > > > > > > > > > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: grace period 2016-07-03 5:30 ` Marc Eshel @ 2016-07-05 20:51 ` Bruce Fields 2016-07-05 23:05 ` Marc Eshel 0 siblings, 1 reply; 44+ messages in thread From: Bruce Fields @ 2016-07-05 20:51 UTC (permalink / raw) To: Marc Eshel; +Cc: linux-nfs, Tomer Perry On Sat, Jul 02, 2016 at 10:30:11PM -0700, Marc Eshel wrote: > I tried again NFSv3 locks with xfs export. "echo 0 > > /proc/fs/nfsd/threads" releases locks on rhel7.0 but not on rhel7.2 > What else can I show you to find the problem? Sorry, I can't reproduce, though I've only tried a slightly later kernel than that. Could you submit a RHEL bug? --b. > Marc. > > works: > [root@boar11 ~]# uname -a > Linux boar11 3.10.0-123.el7.x86_64 #1 SMP Mon May 5 11:16:57 EDT 2014 > x86_64 x86_64 x86_64 GNU/Linux > [root@boar11 ~]# cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.0 (Maipo) > > not working: > [root@sonascl21 ~]# uname -a > Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP Thu > Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux > [root@sonascl21 ~]# cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) > [root@sonascl21 ~]# cat /proc/fs/nfsd/threads > 0 > [root@sonascl21 ~]# cat /proc/locks > 1: POSIX ADVISORY WRITE 2346 fd:00:1612092569 0 9999 > > > > From: Bruce Fields <bfields@fieldses.org> > To: Marc Eshel/Almaden/IBM@IBMUS > Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> > Date: 07/01/2016 05:58 PM > Subject: Re: grace period > > > > On Fri, Jul 01, 2016 at 03:42:43PM -0700, Marc Eshel wrote: > > Yes, the locks are requested from another node, what fs are you using, I > > > don't think it should make any difference, but I can try it with the > same > > fs. > > Make sure you are using v3, it does work for v4. > > I tested v3 on upstream.--b. > > > Marc. 
> > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > To: Marc Eshel/Almaden/IBM@IBMUS > > Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> > > Date: 07/01/2016 02:01 PM > > Subject: Re: grace period > > > > > > > > On Fri, Jul 01, 2016 at 01:46:42PM -0700, Marc Eshel wrote: > > > This is my v3 test that show the lock still there after echo 0 > > > > /proc/fs/nfsd/threads > > > > > > [root@sonascl21 ~]# cat /etc/redhat-release > > > Red Hat Enterprise Linux Server release 7.2 (Maipo) > > > > > > [root@sonascl21 ~]# uname -a > > > Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP > Thu > > > > > Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux > > > > > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > > > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 > > > > > > [root@sonascl21 ~]# echo 0 > /proc/fs/nfsd/threads > > > [root@sonascl21 ~]# cat /proc/fs/nfsd/threads > > > 0 > > > > > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > > > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 > > > > Huh, that's not what I see. Are you positive that's the lock on the > > backend filesystem and not the client-side lock (in case you're doing a > > loopback mount?) > > > > --b. > > > > > > > > > > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > > To: Marc Eshel/Almaden/IBM@IBMUS > > > Cc: linux-nfs@vger.kernel.org > > > Date: 07/01/2016 01:07 PM > > > Subject: Re: grace period > > > > > > > > > > > > On Fri, Jul 01, 2016 at 10:31:55AM -0700, Marc Eshel wrote: > > > > It used to be that sending KILL signal to lockd would free locks and > > > > start > > > > Grace period, and when setting nfsd threads to zero, > > nfsd_last_thread() > > > > calls nfsd_shutdown that called lockd_down that I believe was > causing > > > both > > > > freeing of locks and starting grace period or maybe it was setting > it > > > back > > > > to a value > 0 that started the grace period. 
> > > > > > OK, apologies, I didn't know (or forgot) that. > > > > > > > Any way starting with the kernels that are in RHEL7.1 and up echo 0 > > > > > > /proc/fs/nfsd/threads doesn't do it anymore, I assume going to > common > > > > grace period for NLM and NFSv4 changed things. > > > > The question is how to do IP fail-over, so when a node fails and the > > > IP > > > is > > > > moving to another node, we need to go into grace period on all the > > nodes > > > > > > > in the cluster so the locks of the failed node are not given to > anyone > > > > > > other than the client that is reclaiming his locks. Restarting NFS > > > server > > > > is to distractive. > > > > > > What's the difference? Just that clients don't have to reestablish > tcp > > > connections? > > > > > > --b. > > > > > > > For NFSv3 KILL signal to lockd still works but for > > > > NFSv4 have no way to do it for v4. > > > > Marc. > > > > > > > > > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > > > To: Marc Eshel/Almaden/IBM@IBMUS > > > > Cc: linux-nfs@vger.kernel.org > > > > Date: 07/01/2016 09:09 AM > > > > Subject: Re: grace period > > > > > > > > > > > > > > > > On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > > > > > I see that setting the number of nfsd threads to 0 (echo 0 > > > > > > /proc/fs/nfsd/threads) is not releasing the locks and putting the > > > server > > > > > > > > > in grace mode. > > > > > > > > Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should > > > > certainly drop locks. If that's not happening, there's a bug, but > > we'd > > > > need to know more details (version numbers, etc.) to help. > > > > > > > > That alone has never been enough to start a grace period--you'd have > > > to > > > > start knfsd again to do that. > > > > > > > > > What is the best way to go into grace period, in new version of > the > > > > > kernel, without restarting the nfs server? > > > > > > > > Restarting the nfs server is the only way. 
That's true on older > > kernels > > > > true, as far as I know. (OK, you can apparently make lockd do > > something > > > > like this with a signal, I don't know if that's used much, and I > doubt > > > > it works outside an NFSv3-only environment.) > > > > > > > > So if you want locks dropped and a new grace period, then you should > > > run > > > > "systemctl restart nfs-server", or your distro's equivalent. > > > > > > > > But you're probably doing something more complicated than that. I'm > > > not > > > > sure I understand the question.... > > > > > > > > --b. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 44+ messages in thread
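[Editor's note: the two data points in this exchange are kernel build strings: `3.10.0-123.el7` (RHEL 7.0, where `echo 0 > /proc/fs/nfsd/threads` releases the locks) and `3.10.0-327.el7` (RHEL 7.2, where it does not). A hypothetical helper for classifying a build against those two points is sketched below; the exact build where behavior changed is not identified anywhere in the thread, so the `-le 123` cutoff is purely an assumption for illustration.]

```shell
# Hypothetical classifier: strip an el7 release string like
# 3.10.0-327.13.1.el7.x86_64 down to its build number and compare it
# against the 7.0-era build (123) reported to still release locks.
classify() {
    rel=${1#3.10.0-}      # drop the base version prefix
    rel=${rel%%.*}        # keep only the build number, e.g. 123 or 327
    if [ "$rel" -le 123 ]; then
        echo "7.0-era build: locks released on shutdown"
    else
        echo "later build: in the reported regression range"
    fi
}

classify 3.10.0-123.el7.x86_64
classify 3.10.0-327.el7.x86_64
```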
* Re: grace period 2016-07-05 20:51 ` Bruce Fields @ 2016-07-05 23:05 ` Marc Eshel 2016-07-06 0:38 ` Bruce Fields 0 siblings, 1 reply; 44+ messages in thread From: Marc Eshel @ 2016-07-05 23:05 UTC (permalink / raw) To: Bruce Fields; +Cc: linux-nfs, Tomer Perry Can you please point me to the kernel that you are using so I can check if it is an obvious problem before I open an RHEL bug? Thanks, Marc. From: Bruce Fields <bfields@fieldses.org> To: Marc Eshel/Almaden/IBM@IBMUS Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> Date: 07/05/2016 01:52 PM Subject: Re: grace period Sent by: linux-nfs-owner@vger.kernel.org On Sat, Jul 02, 2016 at 10:30:11PM -0700, Marc Eshel wrote: > I tried again NFSv3 locks with xfs export. "echo 0 > > /proc/fs/nfsd/threads" releases locks on rhel7.0 but not on rhel7.2 > What else can I show you to find the problem? Sorry, I can't reproduce, though I've only tried a slightly later kernel than that. Could you submit a RHEL bug? --b. > Marc. > > works: > [root@boar11 ~]# uname -a > Linux boar11 3.10.0-123.el7.x86_64 #1 SMP Mon May 5 11:16:57 EDT 2014 > x86_64 x86_64 x86_64 GNU/Linux > [root@boar11 ~]# cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.0 (Maipo) > > not working: > [root@sonascl21 ~]# uname -a > Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP Thu > Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux > [root@sonascl21 ~]# cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) > [root@sonascl21 ~]# cat /proc/fs/nfsd/threads > 0 > [root@sonascl21 ~]# cat /proc/locks > 1: POSIX ADVISORY WRITE 2346 fd:00:1612092569 0 9999 > > > > From: Bruce Fields <bfields@fieldses.org> > To: Marc Eshel/Almaden/IBM@IBMUS > Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> > Date: 07/01/2016 05:58 PM > Subject: Re: grace period > > > > On Fri, Jul 01, 2016 at 03:42:43PM -0700, Marc Eshel wrote: > > Yes, the locks are requested from another node, what fs are you 
using, I > > > don't think it should make any difference, but I can try it with the > same > > fs. > > Make sure you are using v3, it does work for v4. > > I tested v3 on upstream.--b. > > > Marc. > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > To: Marc Eshel/Almaden/IBM@IBMUS > > Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> > > Date: 07/01/2016 02:01 PM > > Subject: Re: grace period > > > > > > > > On Fri, Jul 01, 2016 at 01:46:42PM -0700, Marc Eshel wrote: > > > This is my v3 test that show the lock still there after echo 0 > > > > /proc/fs/nfsd/threads > > > > > > [root@sonascl21 ~]# cat /etc/redhat-release > > > Red Hat Enterprise Linux Server release 7.2 (Maipo) > > > > > > [root@sonascl21 ~]# uname -a > > > Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP > Thu > > > > > Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux > > > > > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > > > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 > > > > > > [root@sonascl21 ~]# echo 0 > /proc/fs/nfsd/threads > > > [root@sonascl21 ~]# cat /proc/fs/nfsd/threads > > > 0 > > > > > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > > > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 > > > > Huh, that's not what I see. Are you positive that's the lock on the > > backend filesystem and not the client-side lock (in case you're doing a > > loopback mount?) > > > > --b. 
> > > > > > > > > > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > > To: Marc Eshel/Almaden/IBM@IBMUS > > > Cc: linux-nfs@vger.kernel.org > > > Date: 07/01/2016 01:07 PM > > > Subject: Re: grace period > > > > > > > > > > > > On Fri, Jul 01, 2016 at 10:31:55AM -0700, Marc Eshel wrote: > > > > It used to be that sending KILL signal to lockd would free locks and > > > > start > > > > Grace period, and when setting nfsd threads to zero, > > nfsd_last_thread() > > > > calls nfsd_shutdown that called lockd_down that I believe was > causing > > > both > > > > freeing of locks and starting grace period or maybe it was setting > it > > > back > > > > to a value > 0 that started the grace period. > > > > > > OK, apologies, I didn't know (or forgot) that. > > > > > > > Any way starting with the kernels that are in RHEL7.1 and up echo 0 > > > > > > /proc/fs/nfsd/threads doesn't do it anymore, I assume going to > common > > > > grace period for NLM and NFSv4 changed things. > > > > The question is how to do IP fail-over, so when a node fails and the > > > IP > > > is > > > > moving to another node, we need to go into grace period on all the > > nodes > > > > > > > in the cluster so the locks of the failed node are not given to > anyone > > > > > > other than the client that is reclaiming his locks. Restarting NFS > > > server > > > > is to distractive. > > > > > > What's the difference? Just that clients don't have to reestablish > tcp > > > connections? > > > > > > --b. > > > > > > > For NFSv3 KILL signal to lockd still works but for > > > > NFSv4 have no way to do it for v4. > > > > Marc. 
> > > > > > > > > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > > > To: Marc Eshel/Almaden/IBM@IBMUS > > > > Cc: linux-nfs@vger.kernel.org > > > > Date: 07/01/2016 09:09 AM > > > > Subject: Re: grace period > > > > > > > > > > > > > > > > On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > > > > > I see that setting the number of nfsd threads to 0 (echo 0 > > > > > > /proc/fs/nfsd/threads) is not releasing the locks and putting the > > > server > > > > > > > > > in grace mode. > > > > > > > > Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should > > > > certainly drop locks. If that's not happening, there's a bug, but > > we'd > > > > need to know more details (version numbers, etc.) to help. > > > > > > > > That alone has never been enough to start a grace period--you'd have > > > to > > > > start knfsd again to do that. > > > > > > > > > What is the best way to go into grace period, in new version of > the > > > > > kernel, without restarting the nfs server? > > > > > > > > Restarting the nfs server is the only way. That's true on older > > kernels > > > > true, as far as I know. (OK, you can apparently make lockd do > > something > > > > like this with a signal, I don't know if that's used much, and I > doubt > > > > it works outside an NFSv3-only environment.) > > > > > > > > So if you want locks dropped and a new grace period, then you should > > > run > > > > "systemctl restart nfs-server", or your distro's equivalent. > > > > > > > > But you're probably doing something more complicated than that. I'm > > > not > > > > sure I understand the question.... > > > > > > > > --b. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: grace period 2016-07-05 23:05 ` Marc Eshel @ 2016-07-06 0:38 ` Bruce Fields 0 siblings, 0 replies; 44+ messages in thread From: Bruce Fields @ 2016-07-06 0:38 UTC (permalink / raw) To: Marc Eshel; +Cc: linux-nfs, Tomer Perry On Tue, Jul 05, 2016 at 04:05:56PM -0700, Marc Eshel wrote: > Can you please point me to the kernel that you are using so I can check if > it is an obvious problem before I open an RHEL bug? I've tried it on the latest upstream and on rhel 3.10.0-327.13.1.el7. --b. > Thanks, Marc. > > > > From: Bruce Fields <bfields@fieldses.org> > To: Marc Eshel/Almaden/IBM@IBMUS > Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> > Date: 07/05/2016 01:52 PM > Subject: Re: grace period > Sent by: linux-nfs-owner@vger.kernel.org > > > > On Sat, Jul 02, 2016 at 10:30:11PM -0700, Marc Eshel wrote: > > I tried again NFSv3 locks with xfs export. "echo 0 > > > /proc/fs/nfsd/threads" releases locks on rhel7.0 but not on rhel7.2 > > What else can I show you to find the problem? > > Sorry, I can't reproduce, though I've only tried a slightly later kernel > than that. Could you submit a RHEL bug? > > --b. > > > Marc. 
> > > > works: > > [root@boar11 ~]# uname -a > > Linux boar11 3.10.0-123.el7.x86_64 #1 SMP Mon May 5 11:16:57 EDT 2014 > > x86_64 x86_64 x86_64 GNU/Linux > > [root@boar11 ~]# cat /etc/redhat-release > > Red Hat Enterprise Linux Server release 7.0 (Maipo) > > > > not working: > > [root@sonascl21 ~]# uname -a > > Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP Thu > > > Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux > > [root@sonascl21 ~]# cat /etc/redhat-release > > Red Hat Enterprise Linux Server release 7.2 (Maipo) > > [root@sonascl21 ~]# cat /proc/fs/nfsd/threads > > 0 > > [root@sonascl21 ~]# cat /proc/locks > > 1: POSIX ADVISORY WRITE 2346 fd:00:1612092569 0 9999 > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > To: Marc Eshel/Almaden/IBM@IBMUS > > Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> > > Date: 07/01/2016 05:58 PM > > Subject: Re: grace period > > > > > > > > On Fri, Jul 01, 2016 at 03:42:43PM -0700, Marc Eshel wrote: > > > Yes, the locks are requested from another node, what fs are you using, > I > > > > > don't think it should make any difference, but I can try it with the > > same > > > fs. > > > Make sure you are using v3, it does work for v4. > > > > I tested v3 on upstream.--b. > > > > > Marc. 
> > > > > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > > To: Marc Eshel/Almaden/IBM@IBMUS > > > Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> > > > Date: 07/01/2016 02:01 PM > > > Subject: Re: grace period > > > > > > > > > > > > On Fri, Jul 01, 2016 at 01:46:42PM -0700, Marc Eshel wrote: > > > > This is my v3 test that show the lock still there after echo 0 > > > > > /proc/fs/nfsd/threads > > > > > > > > [root@sonascl21 ~]# cat /etc/redhat-release > > > > Red Hat Enterprise Linux Server release 7.2 (Maipo) > > > > > > > > [root@sonascl21 ~]# uname -a > > > > Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP > > > Thu > > > > > > > Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux > > > > > > > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > > > > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 > > > > > > > > [root@sonascl21 ~]# echo 0 > /proc/fs/nfsd/threads > > > > [root@sonascl21 ~]# cat /proc/fs/nfsd/threads > > > > 0 > > > > > > > > [root@sonascl21 ~]# cat /proc/locks | grep 999 > > > > 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 > > > > > > Huh, that's not what I see. Are you positive that's the lock on the > > > backend filesystem and not the client-side lock (in case you're doing > a > > > loopback mount?) > > > > > > --b. 
> > > > > > > > > > > > > > > > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > > > To: Marc Eshel/Almaden/IBM@IBMUS > > > > Cc: linux-nfs@vger.kernel.org > > > > Date: 07/01/2016 01:07 PM > > > > Subject: Re: grace period > > > > > > > > > > > > > > > > On Fri, Jul 01, 2016 at 10:31:55AM -0700, Marc Eshel wrote: > > > > > It used to be that sending KILL signal to lockd would free locks > and > > > > > > start > > > > > Grace period, and when setting nfsd threads to zero, > > > nfsd_last_thread() > > > > > calls nfsd_shutdown that called lockd_down that I believe was > > causing > > > > both > > > > > freeing of locks and starting grace period or maybe it was setting > > > it > > > > back > > > > > to a value > 0 that started the grace period. > > > > > > > > OK, apologies, I didn't know (or forgot) that. > > > > > > > > > Any way starting with the kernels that are in RHEL7.1 and up echo > 0 > > > > > > > > /proc/fs/nfsd/threads doesn't do it anymore, I assume going to > > common > > > > > grace period for NLM and NFSv4 changed things. > > > > > The question is how to do IP fail-over, so when a node fails and > the > > > > > IP > > > > is > > > > > moving to another node, we need to go into grace period on all the > > > > nodes > > > > > > > > > in the cluster so the locks of the failed node are not given to > > anyone > > > > > > > > other than the client that is reclaiming his locks. Restarting NFS > > > > > server > > > > > is to distractive. > > > > > > > > What's the difference? Just that clients don't have to reestablish > > tcp > > > > connections? > > > > > > > > --b. > > > > > > > > > For NFSv3 KILL signal to lockd still works but for > > > > > NFSv4 have no way to do it for v4. > > > > > Marc. 
> > > > > > > > > > > > > > > > > > > > From: Bruce Fields <bfields@fieldses.org> > > > > > To: Marc Eshel/Almaden/IBM@IBMUS > > > > > Cc: linux-nfs@vger.kernel.org > > > > > Date: 07/01/2016 09:09 AM > > > > > Subject: Re: grace period > > > > > > > > > > > > > > > > > > > > On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > > > > > > I see that setting the number of nfsd threads to 0 (echo 0 > > > > > > > /proc/fs/nfsd/threads) is not releasing the locks and putting > the > > > > server > > > > > > > > > > > in grace mode. > > > > > > > > > > Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should > > > > > certainly drop locks. If that's not happening, there's a bug, but > > > > we'd > > > > > need to know more details (version numbers, etc.) to help. > > > > > > > > > > That alone has never been enough to start a grace period--you'd > have > > > > > to > > > > > start knfsd again to do that. > > > > > > > > > > > What is the best way to go into grace period, in new version of > > the > > > > > > kernel, without restarting the nfs server? > > > > > > > > > > Restarting the nfs server is the only way. That's true on older > > > kernels > > > > > true, as far as I know. (OK, you can apparently make lockd do > > > something > > > > > like this with a signal, I don't know if that's used much, and I > > doubt > > > > > it works outside an NFSv3-only environment.) > > > > > > > > > > So if you want locks dropped and a new grace period, then you > should > > > > > run > > > > > "systemctl restart nfs-server", or your distro's equivalent. > > > > > > > > > > But you're probably doing something more complicated than that. > I'm > > > > > not > > > > > sure I understand the question.... > > > > > > > > > > --b. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > ^ permalink raw reply [flat|nested] 44+ messages in thread
[parent not found: <OFC1237E53.3CFCA8E8-ON88257FE5.001D3182-88257FE5.001E3A5B@LocalDomain>]
* HA NFS [not found] ` <OFC1237E53.3CFCA8E8-ON88257FE5.001D3182-88257FE5.001E3A5B@LocalDomain> @ 2016-07-04 23:53 ` Marc Eshel 2016-07-05 15:08 ` Steve Dickson 0 siblings, 1 reply; 44+ messages in thread From: Marc Eshel @ 2016-07-04 23:53 UTC (permalink / raw) To: Steve Dickson; +Cc: linux-nfs, Tomer Perry Hi Steve, I did not pay attention for a while and now I see that since RHEL7.0 there have been major changes in NFSv4 recovery, both for a single machine and for cluster file systems. Is there any write-up on the changes, like the use of /var/lib/nfs/nfsdcltrack/main.sqlite? I see it being used in 7.0 but not in 7.2. Any information would be appreciated. Thanks, Marc. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: HA NFS 2016-07-04 23:53 ` HA NFS Marc Eshel @ 2016-07-05 15:08 ` Steve Dickson 2016-07-05 20:56 ` Marc Eshel 0 siblings, 1 reply; 44+ messages in thread From: Steve Dickson @ 2016-07-05 15:08 UTC (permalink / raw) To: Marc Eshel; +Cc: linux-nfs, Tomer Perry On 07/04/2016 07:53 PM, Marc Eshel wrote: > Hi Steve, > I did not pay attention for a while and now I see that since RHEL7.0 there > a major changes in NFSv4 recovery for a signal machine and for cluster > file system. Is there any write up on the changes like the use of > /var/lib/nfs/nfsdcltrack/main.sqlite, I see it being used in 7.0 but not > in 7.2. Any information would be appreciated. That file is still being used... but there were some changes. In RHEL 7.2 this was added for bz 1234598 commit c41a3d0a17baa61a07d48d8536e99908d765de9b Author: Jeff Layton <jlayton@primarydata.com> Date: Fri Sep 19 11:07:31 2014 -0400 nfsdcltrack: fetch NFSDCLTRACK_GRACE_START out of environment In RHEL 7.3 there will be this for bz 1285097 commit d479ad3adb0671c48d6fbf3e36bd52a31159c413 Author: Jeff Layton <jlayton@primarydata.com> Date: Fri Sep 19 11:03:45 2014 -0400 nfsdcltrack: update schema to v2 I hope this helps... steved. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: HA NFS 2016-07-05 15:08 ` Steve Dickson @ 2016-07-05 20:56 ` Marc Eshel 0 siblings, 0 replies; 44+ messages in thread From: Marc Eshel @ 2016-07-05 20:56 UTC (permalink / raw) To: Steve Dickson; +Cc: linux-nfs, Tomer Perry, Jeff Layton Thanks for the pointer, Steve. I am not sure how many of the changes are Red Hat-specific and how many are upstream Linux kernel changes, so if another mailing list is more appropriate please let me know. I now see on RHEL7.0 that /var/lib/nfs/nfsdcltrack/main.sqlite is updated as I open a file from an NFS client, but on RHEL7.2 that file is created but not updated on a new client open. Did something change in this area between 7.0 and 7.2? Marc. From: Steve Dickson <SteveD@redhat.com> To: Marc Eshel/Almaden/IBM@IBMUS Cc: linux-nfs@vger.kernel.org, Tomer Perry <TOMP@il.ibm.com> Date: 07/05/2016 08:08 AM Subject: Re: HA NFS On 07/04/2016 07:53 PM, Marc Eshel wrote: > Hi Steve, > I did not pay attention for a while and now I see that since RHEL7.0 there > a major changes in NFSv4 recovery for a signal machine and for cluster > file system. Is there any write up on the changes like the use of > /var/lib/nfs/nfsdcltrack/main.sqlite, I see it being used in 7.0 but not > in 7.2. Any information would be appreciated. That file is still being used... but there were some changes. In RHEL 7.2 this was added for bz 1234598 commit c41a3d0a17baa61a07d48d8536e99908d765de9b Author: Jeff Layton <jlayton@primarydata.com> Date: Fri Sep 19 11:07:31 2014 -0400 nfsdcltrack: fetch NFSDCLTRACK_GRACE_START out of environment In RHEL 7.3 there will be this for bz 1285097 commit d479ad3adb0671c48d6fbf3e36bd52a31159c413 Author: Jeff Layton <jlayton@primarydata.com> Date: Fri Sep 19 11:03:45 2014 -0400 nfsdcltrack: update schema to v2 I hope this helps... steved. ^ permalink raw reply [flat|nested] 44+ messages in thread
[parent not found: <OF5D486F02.62CECB7B-ON88257FE3.0071DBE5-88257FE3.00722318@LocalDomain>]
* Re: grace period [not found] ` <OF5D486F02.62CECB7B-ON88257FE3.0071DBE5-88257FE3.00722318@LocalDomain> @ 2016-07-01 20:51 ` Marc Eshel 0 siblings, 0 replies; 44+ messages in thread From: Marc Eshel @ 2016-07-01 20:51 UTC (permalink / raw) To: Bruce Fields; +Cc: linux-nfs, Tomer Perry echo 0 > /proc/fs/nfsd/threads does delete the locks for v4 but not for v3 Marc. From: Marc Eshel/Almaden/IBM To: Bruce Fields <bfields@fieldses.org> Cc: linux-nfs@vger.kernel.org, Tomer Perry/Israel/IBM@IBMIL Date: 07/01/2016 01:46 PM Subject: Re: grace period This is my v3 test that shows the lock is still there after echo 0 > /proc/fs/nfsd/threads [root@sonascl21 ~]# cat /etc/redhat-release Red Hat Enterprise Linux Server release 7.2 (Maipo) [root@sonascl21 ~]# uname -a Linux sonascl21.sonasad.almaden.ibm.com 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux [root@sonascl21 ~]# cat /proc/locks | grep 999 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 [root@sonascl21 ~]# echo 0 > /proc/fs/nfsd/threads [root@sonascl21 ~]# cat /proc/fs/nfsd/threads 0 [root@sonascl21 ~]# cat /proc/locks | grep 999 3: POSIX ADVISORY WRITE 2349 00:2a:489486 0 999 From: Bruce Fields <bfields@fieldses.org> To: Marc Eshel/Almaden/IBM@IBMUS Cc: linux-nfs@vger.kernel.org Date: 07/01/2016 01:07 PM Subject: Re: grace period On Fri, Jul 01, 2016 at 10:31:55AM -0700, Marc Eshel wrote: > It used to be that sending KILL signal to lockd would free locks and start > Grace period, and when setting nfsd threads to zero, nfsd_last_thread() > calls nfsd_shutdown that called lockd_down that I believe was causing both > freeing of locks and starting grace period or maybe it was setting it back > to a value > 0 that started the grace period. OK, apologies, I didn't know (or forgot) that. > Any way starting with the kernels that are in RHEL7.1 and up echo 0 > > /proc/fs/nfsd/threads doesn't do it anymore, I assume going to common > grace period for NLM and NFSv4 changed things. 
> The question is how to do IP fail-over, so when a node fails and the IP is > moving to another node, we need to go into grace period on all the nodes > in the cluster so the locks of the failed node are not given to anyone > other than the client that is reclaiming his locks. Restarting the NFS server > is too disruptive. What's the difference? Just that clients don't have to reestablish tcp connections? --b. > For NFSv3 the KILL signal to lockd still works, but for > NFSv4 there is no way to do it. > Marc. > > > > From: Bruce Fields <bfields@fieldses.org> > To: Marc Eshel/Almaden/IBM@IBMUS > Cc: linux-nfs@vger.kernel.org > Date: 07/01/2016 09:09 AM > Subject: Re: grace period > > > > On Thu, Jun 30, 2016 at 02:46:19PM -0700, Marc Eshel wrote: > > I see that setting the number of nfsd threads to 0 (echo 0 > > > /proc/fs/nfsd/threads) is not releasing the locks and putting the server > > > in grace mode. > > Writing 0 to /proc/fs/nfsd/threads shuts down knfsd. So it should > certainly drop locks. If that's not happening, there's a bug, but we'd > need to know more details (version numbers, etc.) to help. > > That alone has never been enough to start a grace period--you'd have to > start knfsd again to do that. > > > What is the best way to go into grace period, in new version of the > > kernel, without restarting the nfs server? > > Restarting the nfs server is the only way. That's true on older kernels > too, as far as I know. (OK, you can apparently make lockd do something > like this with a signal, I don't know if that's used much, and I doubt > it works outside an NFSv3-only environment.) > > So if you want locks dropped and a new grace period, then you should run > "systemctl restart nfs-server", or your distro's equivalent. > > But you're probably doing something more complicated than that. I'm not > sure I understand the question.... > > --b. > > > > ^ permalink raw reply [flat|nested] 44+ messages in thread
[parent not found: <4F7F230A.6080506@parallels.com>]
[parent not found: <20120406234039.GA20940@fieldses.org>]
* Re: Grace period [not found] ` <20120406234039.GA20940@fieldses.org> @ 2012-04-09 11:24 ` Stanislav Kinsbursky 2012-04-09 13:47 ` Jeff Layton 2012-04-09 23:26 ` bfields 0 siblings, 2 replies; 44+ messages in thread From: Stanislav Kinsbursky @ 2012-04-09 11:24 UTC (permalink / raw) To: bfields, Trond.Myklebust; +Cc: linux-nfs, linux-kernel 07.04.2012 03:40, bfields@fieldses.org wrote: > On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote: >> Hello, Bruce. >> Could you, please, clarify this reason why grace list is used? >> I.e. why list is used instead of some atomic variable, for example? > > Like just a reference count? Yeah, that would be OK. > > In theory it could provide some sort of debugging help. (E.g. we could > print out the list of "lock managers" currently keeping us in grace.) I > had some idea we'd make those lock manager objects more complicated, and > might have more for individual containerized services. Could you share this idea, please? Anyway, I have nothing against lists; I was just curious why a list was used. I added Trond and the lists to this reply. Let me explain the problem with the grace period that I'm facing right now, and what I'm thinking about it. So, one of the things to be containerized during the "NFSd per net ns" work is the grace period, and these are its basic components: 1) Grace period start. 2) Grace period end. 3) Grace period check. 4) Grace period restart. So, the simplest straightforward way is to make all the internal stuff - "grace_list", "grace_lock", "grace_period_end", and both "lockd_manager" and "nfsd4_manager" - per network namespace. Also, "laundromat_work" has to be per-net as well. In this case: 1) Start - the grace period can be started per net ns in "lockd_up_net()" (thus it has to be moved there from "lockd()") and "nfs4_state_start()". 2) End - the grace period can be ended per net ns in "lockd_down_net()" (thus it has to be moved there from "lockd()"), "nfsd4_end_grace()" and "nfs4_state_shutdown()". 
3) Check - looks easy. Either an svc_rqst or a net context can be passed to the function. 4) Restart - this is the tricky place. It would be great to restart the grace period only for the network namespace of the sender of the kill signal. So, the idea is to check siginfo_t for the pid of the sender, then try to locate the task, and if found, get the sender's network namespace and restart the grace period only for that namespace (of course, only if lockd was started for that namespace - see below). If the task is not found, or lockd wasn't started for its namespace, then the grace period can either be restarted for all namespaces or just silently dropped. This is the place where I'm not sure what to do, because restarting the grace period for all namespaces would be overkill... There is also another problem with the "task by pid" search: the task found may actually not be the sender (which has already died), but some other new task with the same pid number. In this case, I think, we can just neglect this probability and always assume that we located the sender (if, of course, lockd was started for the sender's network namespace). Trond, Bruce, could you please comment on these ideas? -- Best regards, Stanislav Kinsbursky ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 11:24 ` Grace period Stanislav Kinsbursky @ 2012-04-09 13:47 ` Jeff Layton 2012-04-09 14:25 ` Stanislav Kinsbursky 2012-04-09 23:26 ` bfields 1 sibling, 1 reply; 44+ messages in thread From: Jeff Layton @ 2012-04-09 13:47 UTC (permalink / raw) To: Stanislav Kinsbursky; +Cc: bfields, Trond.Myklebust, linux-nfs, linux-kernel On Mon, 09 Apr 2012 15:24:19 +0400 Stanislav Kinsbursky <skinsbursky@parallels.com> wrote: > 07.04.2012 03:40, bfields@fieldses.org пишет: > > On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote: > >> Hello, Bruce. > >> Could you, please, clarify this reason why grace list is used? > >> I.e. why list is used instead of some atomic variable, for example? > > > > Like just a reference count? Yeah, that would be OK. > > > > In theory it could provide some sort of debugging help. (E.g. we could > > print out the list of "lock managers" currently keeping us in grace.) I > > had some idea we'd make those lock manager objects more complicated, and > > might have more for individual containerized services. > > Could you share this idea, please? > > Anyway, I have nothing against lists. Just was curious, why it was used. > I added Trond and lists to this reply. > > Let me explain, what is the problem with grace period I'm facing right know, and > what I'm thinking about it. > So, one of the things to be containerized during "NFSd per net ns" work is the > grace period, and these are the basic components of it: > 1) Grace period start. > 2) Grace period end. > 3) Grace period check. > 3) Grace period restart. > > So, the simplest straight-forward way is to make all internal stuff: > "grace_list", "grace_lock", "grace_period_end" work and both "lockd_manager" and > "nfsd4_manager" - per network namespace. Also, "laundromat_work" have to be > per-net as well. 
> In this case: > 1) Start - grace period can be started per net ns in "lockd_up_net()" (thus has > to be moves there from "lockd()") and "nfs4_state_start()". > 2) End - grace period can be ended per net ns in "lockd_down_net()" (thus has to > be moved there from "lockd()"), "nfsd4_end_grace()" and "fs4_state_shutdown()". > 3) Check - looks easy. There is either svc_rqst or net context can be passed to > function. > 4) Restart - this is a tricky place. It would be great to restart grace period > only for the networks namespace of the sender of the kill signal. So, the idea > is to check siginfo_t for the pid of sender, then try to locate the task, and if > found, then get sender's networks namespace, and restart grace period only for > this namespace (of course, if lockd was started for this namespace - see below). > > If task not found, of it's lockd wasn't started for it's namespace, then grace > period can be either restarted for all namespaces, of just silently dropped. > This is the place where I'm not sure, how to do. Because calling grace period > for all namespaces will be overkill... > > There also another problem with the "task by pid" search, that found task can be > actually not sender (which died already), but some other new task with the same > pid number. In this case, I think, we can just neglect this probability and > always assume, that we located sender (if, of course, lockd was started for > sender's network namespace). > > Trond, Bruce, could you, please, comment this ideas? > I can comment and I'm not sure that will be sufficient. The grace period has a particular purpose. It keeps nfsd or lockd from handing out stateful objects (e.g. locks) before clients have an opportunity to reclaim them. Once the grace period expires, there is no more reclaim allowed and "normal" lock and open requests can proceed. Traditionally, there has been one nfsd or lockd "instance" per host. 
With that, we were able to get away with a relatively simple-minded approach of a global grace period that's gated on nfsd or lockd's startup and shutdown. Now, you're looking at making multiple nfsd or lockd "instances". Does it make sense to make this a per-net thing? Here's a particularly problematic case to illustrate what I mean: Suppose I have a filesystem that's mounted and exported in two different containers. You start up one container and then 60s later, start up the other. The grace period expires in the first container and that nfsd hands out locks that conflict with some that have not been reclaimed yet in the other. Now, we can just try to say "don't export the same fs from more than one container". But we all know that people will do it anyway, since there's nothing that really stops you from doing so. What probably makes more sense is making the grace period a per-sb property, and coming up with a set of rules for the fs going into and out of "grace" status. Perhaps a way for different net namespaces to "subscribe" to a particular fs, and don't take the fs out of grace until all of the grace period timers pop? If a fs attempts to subscribe after the fs comes out of grace, then its subscription would be denied and reclaim attempts would get NFS4ERR_NOGRACE or the NLM equivalent. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 13:47 ` Jeff Layton @ 2012-04-09 14:25 ` Stanislav Kinsbursky 2012-04-09 15:27 ` Jeff Layton 0 siblings, 1 reply; 44+ messages in thread From: Stanislav Kinsbursky @ 2012-04-09 14:25 UTC (permalink / raw) To: Jeff Layton; +Cc: bfields, Trond.Myklebust, linux-nfs, linux-kernel 09.04.2012 17:47, Jeff Layton пишет: > On Mon, 09 Apr 2012 15:24:19 +0400 > Stanislav Kinsbursky<skinsbursky@parallels.com> wrote: > >> 07.04.2012 03:40, bfields@fieldses.org пишет: >>> On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote: >>>> Hello, Bruce. >>>> Could you, please, clarify this reason why grace list is used? >>>> I.e. why list is used instead of some atomic variable, for example? >>> >>> Like just a reference count? Yeah, that would be OK. >>> >>> In theory it could provide some sort of debugging help. (E.g. we could >>> print out the list of "lock managers" currently keeping us in grace.) I >>> had some idea we'd make those lock manager objects more complicated, and >>> might have more for individual containerized services. >> >> Could you share this idea, please? >> >> Anyway, I have nothing against lists. Just was curious, why it was used. >> I added Trond and lists to this reply. >> >> Let me explain, what is the problem with grace period I'm facing right know, and >> what I'm thinking about it. >> So, one of the things to be containerized during "NFSd per net ns" work is the >> grace period, and these are the basic components of it: >> 1) Grace period start. >> 2) Grace period end. >> 3) Grace period check. >> 3) Grace period restart. >> >> So, the simplest straight-forward way is to make all internal stuff: >> "grace_list", "grace_lock", "grace_period_end" work and both "lockd_manager" and >> "nfsd4_manager" - per network namespace. Also, "laundromat_work" have to be >> per-net as well. 
>> In this case: >> 1) Start - grace period can be started per net ns in "lockd_up_net()" (thus has >> to be moves there from "lockd()") and "nfs4_state_start()". >> 2) End - grace period can be ended per net ns in "lockd_down_net()" (thus has to >> be moved there from "lockd()"), "nfsd4_end_grace()" and "fs4_state_shutdown()". >> 3) Check - looks easy. There is either svc_rqst or net context can be passed to >> function. >> 4) Restart - this is a tricky place. It would be great to restart grace period >> only for the networks namespace of the sender of the kill signal. So, the idea >> is to check siginfo_t for the pid of sender, then try to locate the task, and if >> found, then get sender's networks namespace, and restart grace period only for >> this namespace (of course, if lockd was started for this namespace - see below). >> >> If task not found, of it's lockd wasn't started for it's namespace, then grace >> period can be either restarted for all namespaces, of just silently dropped. >> This is the place where I'm not sure, how to do. Because calling grace period >> for all namespaces will be overkill... >> >> There also another problem with the "task by pid" search, that found task can be >> actually not sender (which died already), but some other new task with the same >> pid number. In this case, I think, we can just neglect this probability and >> always assume, that we located sender (if, of course, lockd was started for >> sender's network namespace). >> >> Trond, Bruce, could you, please, comment this ideas? >> > > I can comment and I'm not sure that will be sufficient. > Hi, Jeff. Thanks for the comment. > The grace period has a particular purpose. It keeps nfsd or lockd from > handing out stateful objects (e.g. locks) before clients have an > opportunity to reclaim them. Once the grace period expires, there is no > more reclaim allowed and "normal" lock and open requests can proceed. 
> > Traditionally, there has been one nfsd or lockd "instance" per host. > With that, we were able to get away with a relatively simple-minded > approach of a global grace period that's gated on nfsd or lockd's > startup and shutdown. > > Now, you're looking at making multiple nfsd or lockd "instances". Does > it make sense to make this a per-net thing? Here's a particularly > problematic case to illustrate what I mean: > > Suppose I have a filesystem that's mounted and exported in two > different containers. You start up one container and then 60s later, > start up the other. The grace period expires in the first container and > that nfsd hands out locks that conflict with some that have not been > reclaimed yet in the other. > > Now, we can just try to say "don't export the same fs from more than > one container". But we all know that people will do it anyway, since > there's nothing that really stops you from doing so. > Yes, I see. But situation you described is existent already. I.e. you can replace containers with the same file system by two nodes, sharing the same distributed file system (like Lustre and GPFS), and you'll experience the same problem in such case. > What probably makes more sense is making the grace period a per-sb > property, and coming up with a set of rules for the fs going into and > out of "grace" status. > > Perhaps a way for different net namespaces to "subscribe" to a > particular fs, and don't take the fs out of grace until all of the > grace period timers pop? If a fs attempts to subscribe after the fs > comes out of grace, then its subscription would be denied and reclaim > attempts would get NFS4ERR_NOGRACE or the NLM equivalent. > This raises another problem. Imagine, that grace period has elapsed for some container and then you start nfsd in another one. New grace period will affect all both of them. And that's even worse from my pow. -- Best regards, Stanislav Kinsbursky ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 14:25 ` Stanislav Kinsbursky @ 2012-04-09 15:27 ` Jeff Layton 2012-04-09 16:08 ` Stanislav Kinsbursky 0 siblings, 1 reply; 44+ messages in thread From: Jeff Layton @ 2012-04-09 15:27 UTC (permalink / raw) To: Stanislav Kinsbursky; +Cc: bfields, Trond.Myklebust, linux-nfs, linux-kernel On Mon, 09 Apr 2012 18:25:48 +0400 Stanislav Kinsbursky <skinsbursky@parallels.com> wrote: > 09.04.2012 17:47, Jeff Layton пишет: > > On Mon, 09 Apr 2012 15:24:19 +0400 > > Stanislav Kinsbursky<skinsbursky@parallels.com> wrote: > > > >> 07.04.2012 03:40, bfields@fieldses.org пишет: > >>> On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote: > >>>> Hello, Bruce. > >>>> Could you, please, clarify this reason why grace list is used? > >>>> I.e. why list is used instead of some atomic variable, for example? > >>> > >>> Like just a reference count? Yeah, that would be OK. > >>> > >>> In theory it could provide some sort of debugging help. (E.g. we could > >>> print out the list of "lock managers" currently keeping us in grace.) I > >>> had some idea we'd make those lock manager objects more complicated, and > >>> might have more for individual containerized services. > >> > >> Could you share this idea, please? > >> > >> Anyway, I have nothing against lists. Just was curious, why it was used. > >> I added Trond and lists to this reply. > >> > >> Let me explain, what is the problem with grace period I'm facing right know, and > >> what I'm thinking about it. > >> So, one of the things to be containerized during "NFSd per net ns" work is the > >> grace period, and these are the basic components of it: > >> 1) Grace period start. > >> 2) Grace period end. > >> 3) Grace period check. > >> 3) Grace period restart. > >> > >> So, the simplest straight-forward way is to make all internal stuff: > >> "grace_list", "grace_lock", "grace_period_end" work and both "lockd_manager" and > >> "nfsd4_manager" - per network namespace. 
Also, "laundromat_work" have to be > >> per-net as well. > >> In this case: > >> 1) Start - grace period can be started per net ns in "lockd_up_net()" (thus has > >> to be moves there from "lockd()") and "nfs4_state_start()". > >> 2) End - grace period can be ended per net ns in "lockd_down_net()" (thus has to > >> be moved there from "lockd()"), "nfsd4_end_grace()" and "fs4_state_shutdown()". > >> 3) Check - looks easy. There is either svc_rqst or net context can be passed to > >> function. > >> 4) Restart - this is a tricky place. It would be great to restart grace period > >> only for the networks namespace of the sender of the kill signal. So, the idea > >> is to check siginfo_t for the pid of sender, then try to locate the task, and if > >> found, then get sender's networks namespace, and restart grace period only for > >> this namespace (of course, if lockd was started for this namespace - see below). > >> > >> If task not found, of it's lockd wasn't started for it's namespace, then grace > >> period can be either restarted for all namespaces, of just silently dropped. > >> This is the place where I'm not sure, how to do. Because calling grace period > >> for all namespaces will be overkill... > >> > >> There also another problem with the "task by pid" search, that found task can be > >> actually not sender (which died already), but some other new task with the same > >> pid number. In this case, I think, we can just neglect this probability and > >> always assume, that we located sender (if, of course, lockd was started for > >> sender's network namespace). > >> > >> Trond, Bruce, could you, please, comment this ideas? > >> > > > > I can comment and I'm not sure that will be sufficient. > > > > Hi, Jeff. Thanks for the comment. > > > The grace period has a particular purpose. It keeps nfsd or lockd from > > handing out stateful objects (e.g. locks) before clients have an > > opportunity to reclaim them. 
Once the grace period expires, there is no > > more reclaim allowed and "normal" lock and open requests can proceed. > > > > Traditionally, there has been one nfsd or lockd "instance" per host. > > With that, we were able to get away with a relatively simple-minded > > approach of a global grace period that's gated on nfsd or lockd's > > startup and shutdown. > > > > Now, you're looking at making multiple nfsd or lockd "instances". Does > > it make sense to make this a per-net thing? Here's a particularly > > problematic case to illustrate what I mean: > > > > Suppose I have a filesystem that's mounted and exported in two > > different containers. You start up one container and then 60s later, > > start up the other. The grace period expires in the first container and > > that nfsd hands out locks that conflict with some that have not been > > reclaimed yet in the other. > > > > Now, we can just try to say "don't export the same fs from more than > > one container". But we all know that people will do it anyway, since > > there's nothing that really stops you from doing so. > > > > Yes, I see. But situation you described is existent already. > I.e. you can replace containers with the same file system by two nodes, sharing > the same distributed file system (like Lustre and GPFS), and you'll experience > the same problem in such case. > Yep, which is why we don't support active/active serving from clustered filesystems (yet). Containers are somewhat similar to a clustered configuration. The simple minded grace period handling we have now is really only suitable for very simple export configurations. The grace period exists to ensure that filesystem objects are not "oversubscribed" so it makes some sense to turn it into a per-sb property. > > What probably makes more sense is making the grace period a per-sb > > property, and coming up with a set of rules for the fs going into and > > out of "grace" status. 
> > > > Perhaps a way for different net namespaces to "subscribe" to a > > particular fs, and don't take the fs out of grace until all of the > > grace period timers pop? If a fs attempts to subscribe after the fs > > comes out of grace, then its subscription would be denied and reclaim > > attempts would get NFS4ERR_NOGRACE or the NLM equivalent. > > > > This raises another problem. Imagine, that grace period has elapsed for some > container and then you start nfsd in another one. New grace period will affect > all both of them. And that's even worse from my pow. > If you allow one container to hand out conflicting locks while another container is allowing reclaims, then you can end up with some very difficult to debug silent data corruption. That's the worst possible outcome, IMO. We really need to actively keep people from shooting themselves in the foot here. One possibility might be to only allow filesystems to be exported from a single container at a time (and allow that to be overridable somehow once we have a working active/active serving solution). With that, you may be able limp along with a per-container grace period handling scheme like you're proposing. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 15:27 ` Jeff Layton @ 2012-04-09 16:08 ` Stanislav Kinsbursky 2012-04-09 16:11 ` bfields 0 siblings, 1 reply; 44+ messages in thread From: Stanislav Kinsbursky @ 2012-04-09 16:08 UTC (permalink / raw) To: Jeff Layton; +Cc: bfields, Trond.Myklebust, linux-nfs, linux-kernel 09.04.2012 19:27, Jeff Layton wrote: > > If you allow one container to hand out conflicting locks while another > container is allowing reclaims, then you can end up with some very > difficult to debug silent data corruption. That's the worst possible > outcome, IMO. We really need to actively keep people from shooting > themselves in the foot here. > > One possibility might be to only allow filesystems to be exported from > a single container at a time (and allow that to be overridable somehow > once we have a working active/active serving solution). With that, you > may be able limp along with a per-container grace period handling > scheme like you're proposing. > Ok then. Keeping people from shooting themselves here sounds reasonable. And I like the idea of exporting a filesystem only once per network namespace. Looks like there should be a list of pairs "exported superblock - network namespace". And if a superblock is already exported in another namespace, then the export in the new namespace has to be skipped (replaced?) with an appropriate warning (error?) message shown in the log. Or maybe we should even deny starting the NFS server if one of its exports is already shared by another NFS server "instance"? But any of these ideas would be easy to implement only in RAM, and thus would suit only containers... -- Best regards, Stanislav Kinsbursky ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 16:08 ` Stanislav Kinsbursky @ 2012-04-09 16:11 ` bfields 2012-04-09 16:17 ` Myklebust, Trond 0 siblings, 1 reply; 44+ messages in thread From: bfields @ 2012-04-09 16:11 UTC (permalink / raw) To: Stanislav Kinsbursky Cc: Jeff Layton, Trond.Myklebust, linux-nfs, linux-kernel On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote: > 09.04.2012 19:27, Jeff Layton пишет: > > > >If you allow one container to hand out conflicting locks while another > >container is allowing reclaims, then you can end up with some very > >difficult to debug silent data corruption. That's the worst possible > >outcome, IMO. We really need to actively keep people from shooting > >themselves in the foot here. > > > >One possibility might be to only allow filesystems to be exported from > >a single container at a time (and allow that to be overridable somehow > >once we have a working active/active serving solution). With that, you > >may be able limp along with a per-container grace period handling > >scheme like you're proposing. > > > > Ok then. Keeping people from shooting themselves here sounds reasonable. > And I like the idea of exporting a filesystem only from once per > network namespace. Unfortunately that's not going to get us very far, especially not in the v4 case where we've got the common read-only pseudoroot that everyone has to share. --b. > Looks like there should be a list of pairs > "exported superblock - network namespace". And if superblock is > exported already in other namespace, then export in new namespace > have to be skipped (replaced?) with appropriate warning (error?) > message shown in log. > Or maybe we even should deny starting of NFS server if one of it's > exports is shared already by other NFS server "instance"? > But any of these ideas would be easy to implement in RAM, and thus > it suits only for containers... 
> > -- > Best regards, > Stanislav Kinsbursky ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 16:11 ` bfields @ 2012-04-09 16:17 ` Myklebust, Trond 0 siblings, 0 replies; 44+ messages in thread From: Myklebust, Trond @ 2012-04-09 16:17 UTC (permalink / raw) To: bfields; +Cc: Stanislav Kinsbursky, Jeff Layton, linux-nfs, linux-kernel On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote: > On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote: > > 09.04.2012 19:27, Jeff Layton wrote: > > > > > >If you allow one container to hand out conflicting locks while another > > >container is allowing reclaims, then you can end up with some very > > >difficult to debug silent data corruption. That's the worst possible > > >outcome, IMO. We really need to actively keep people from shooting > > >themselves in the foot here. > > > > > >One possibility might be to only allow filesystems to be exported from > > >a single container at a time (and allow that to be overridable somehow > > >once we have a working active/active serving solution). With that, you > > >may be able limp along with a per-container grace period handling > > >scheme like you're proposing. > > > > > > > Ok then. Keeping people from shooting themselves here sounds reasonable. > > And I like the idea of exporting a filesystem only from once per > > network namespace. > > Unfortunately that's not going to get us very far, especially not in the > v4 case where we've got the common read-only pseudoroot that everyone > has to share. I don't see how that can work in cases where each container has its own private mount namespace. You're going to have to tie that pseudoroot to the mount namespace somehow. 
-- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 16:17 ` Myklebust, Trond (?) @ 2012-04-09 16:21 ` bfields 2012-04-09 16:33 ` Myklebust, Trond -1 siblings, 1 reply; 44+ messages in thread From: bfields @ 2012-04-09 16:21 UTC (permalink / raw) To: Myklebust, Trond Cc: Stanislav Kinsbursky, Jeff Layton, linux-nfs, linux-kernel On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote: > On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote: > > On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote: > > > 09.04.2012 19:27, Jeff Layton пишет: > > > > > > > >If you allow one container to hand out conflicting locks while another > > > >container is allowing reclaims, then you can end up with some very > > > >difficult to debug silent data corruption. That's the worst possible > > > >outcome, IMO. We really need to actively keep people from shooting > > > >themselves in the foot here. > > > > > > > >One possibility might be to only allow filesystems to be exported from > > > >a single container at a time (and allow that to be overridable somehow > > > >once we have a working active/active serving solution). With that, you > > > >may be able limp along with a per-container grace period handling > > > >scheme like you're proposing. > > > > > > > > > > Ok then. Keeping people from shooting themselves here sounds reasonable. > > > And I like the idea of exporting a filesystem only from once per > > > network namespace. > > > > Unfortunately that's not going to get us very far, especially not in the > > v4 case where we've got the common read-only pseudoroot that everyone > > has to share. > > I don't see how that can work in cases where each container has its own > private mount namespace. You're going to have to tie that pseudoroot to > the mount namespace somehow. Sure, but in typical cases it'll still be shared; requiring that they not be sounds like a severe limitation. --b. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 16:21 ` bfields @ 2012-04-09 16:33 ` Myklebust, Trond 0 siblings, 0 replies; 44+ messages in thread From: Myklebust, Trond @ 2012-04-09 16:33 UTC (permalink / raw) To: bfields; +Cc: Stanislav Kinsbursky, Jeff Layton, linux-nfs, linux-kernel On Mon, 2012-04-09 at 12:21 -0400, bfields@fieldses.org wrote: > On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote: > > On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote: > > > On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote: > > > > 09.04.2012 19:27, Jeff Layton пишет: > > > > > > > > > >If you allow one container to hand out conflicting locks while another > > > > >container is allowing reclaims, then you can end up with some very > > > > >difficult to debug silent data corruption. That's the worst possible > > > > >outcome, IMO. We really need to actively keep people from shooting > > > > >themselves in the foot here. > > > > > > > > > >One possibility might be to only allow filesystems to be exported from > > > > >a single container at a time (and allow that to be overridable somehow > > > > >once we have a working active/active serving solution). With that, you > > > > >may be able limp along with a per-container grace period handling > > > > >scheme like you're proposing. > > > > > > > > > > > > > Ok then. Keeping people from shooting themselves here sounds reasonable. > > > > And I like the idea of exporting a filesystem only from once per > > > > network namespace. > > > > > > Unfortunately that's not going to get us very far, especially not in the > > > v4 case where we've got the common read-only pseudoroot that everyone > > > has to share. > > > > I don't see how that can work in cases where each container has its own > > private mount namespace. You're going to have to tie that pseudoroot to > > the mount namespace somehow. > > Sure, but in typical cases it'll still be shared; requiring that they > not be sounds like a severe limitation. I'd expect the typical case to be the non-shared namespace: the whole point of containers is to provide for complete isolation of processes. Usually that implies that you don't want them to be able to communicate via a shared filesystem. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 16:33 ` Myklebust, Trond (?) @ 2012-04-09 16:39 ` bfields -1 siblings, 0 replies; 44+ messages in thread From: bfields @ 2012-04-09 16:39 UTC (permalink / raw) To: Myklebust, Trond Cc: Stanislav Kinsbursky, Jeff Layton, linux-nfs, linux-kernel On Mon, Apr 09, 2012 at 04:33:36PM +0000, Myklebust, Trond wrote: > On Mon, 2012-04-09 at 12:21 -0400, bfields@fieldses.org wrote: > > On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote: > > > On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote: > > > > On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote: > > > > > 09.04.2012 19:27, Jeff Layton пишет: > > > > > > > > > > > >If you allow one container to hand out conflicting locks while another > > > > > >container is allowing reclaims, then you can end up with some very > > > > > >difficult to debug silent data corruption. That's the worst possible > > > > > >outcome, IMO. We really need to actively keep people from shooting > > > > > >themselves in the foot here. > > > > > > > > > > > >One possibility might be to only allow filesystems to be exported from > > > > > >a single container at a time (and allow that to be overridable somehow > > > > > >once we have a working active/active serving solution). With that, you > > > > > >may be able limp along with a per-container grace period handling > > > > > >scheme like you're proposing. > > > > > > > > > > > > > > > > Ok then. Keeping people from shooting themselves here sounds reasonable. > > > > > And I like the idea of exporting a filesystem only from once per > > > > > network namespace. > > > > > > > > Unfortunately that's not going to get us very far, especially not in the > > > > v4 case where we've got the common read-only pseudoroot that everyone > > > > has to share. > > > > > > I don't see how that can work in cases where each container has its own > > > private mount namespace. 
You're going to have to tie that pseudoroot to > > > the mount namespace somehow. > > > > Sure, but in typical cases it'll still be shared; requiring that they > > not be sounds like a severe limitation. > > I'd expect the typical case to be the non-shared namespace: the whole > point of containers is to provide for complete isolation of processes. > Usually that implies that you don't want them to be able to communicate > via a shared filesystem. If it's just a file server, then you may want to be able to bring up and down service on individual server ip's individually, and possibly advertise different exports; but requiring complete isolation to do that seems like overkill. --b. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 16:33 ` Myklebust, Trond (?) (?) @ 2012-04-09 16:56 ` Stanislav Kinsbursky 2012-04-09 18:11 ` bfields -1 siblings, 1 reply; 44+ messages in thread From: Stanislav Kinsbursky @ 2012-04-09 16:56 UTC (permalink / raw) To: Myklebust, Trond; +Cc: bfields, Jeff Layton, linux-nfs, linux-kernel 09.04.2012 20:33, Myklebust, Trond пишет: > On Mon, 2012-04-09 at 12:21 -0400, bfields@fieldses.org wrote: >> On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote: >>> On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote: >>>> On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote: >>>>> 09.04.2012 19:27, Jeff Layton пишет: >>>>>> >>>>>> If you allow one container to hand out conflicting locks while another >>>>>> container is allowing reclaims, then you can end up with some very >>>>>> difficult to debug silent data corruption. That's the worst possible >>>>>> outcome, IMO. We really need to actively keep people from shooting >>>>>> themselves in the foot here. >>>>>> >>>>>> One possibility might be to only allow filesystems to be exported from >>>>>> a single container at a time (and allow that to be overridable somehow >>>>>> once we have a working active/active serving solution). With that, you >>>>>> may be able limp along with a per-container grace period handling >>>>>> scheme like you're proposing. >>>>>> >>>>> >>>>> Ok then. Keeping people from shooting themselves here sounds reasonable. >>>>> And I like the idea of exporting a filesystem only from once per >>>>> network namespace. >>>> >>>> Unfortunately that's not going to get us very far, especially not in the >>>> v4 case where we've got the common read-only pseudoroot that everyone >>>> has to share. >>> >>> I don't see how that can work in cases where each container has its own >>> private mount namespace. You're going to have to tie that pseudoroot to >>> the mount namespace somehow. 
>> Sure, but in typical cases it'll still be shared; requiring that they >> not be sounds like a severe limitation. > > I'd expect the typical case to be the non-shared namespace: the whole > point of containers is to provide for complete isolation of processes. > Usually that implies that you don't want them to be able to communicate > via a shared filesystem. > BTW, we DO use one mount namespace for all containers and host in OpenVZ. This allows us to have access to container mount points from the initial environment. Isolation between containers is done via chroot and some simple tricks on /proc/mounts read operations. Moreover, with one mount namespace, we currently support bind-mounting on NFS from one container into another... Anyway, I'm sorry, but I'm not familiar with this pseudoroot idea. Why does it prevent implementing a check for the "superblock-network namespace" pair on NFS server start, forbidding (?) the start when this pair is already shared in another namespace? I.e., maybe this pseudoroot can be an exclusion from that rule? Or am I just missing the point entirely? -- Best regards, Stanislav Kinsbursky ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 16:56 ` Stanislav Kinsbursky @ 2012-04-09 18:11 ` bfields 2012-04-10 10:56 ` Stanislav Kinsbursky 0 siblings, 1 reply; 44+ messages in thread From: bfields @ 2012-04-09 18:11 UTC (permalink / raw) To: Stanislav Kinsbursky Cc: Myklebust, Trond, Jeff Layton, linux-nfs, linux-kernel On Mon, Apr 09, 2012 at 08:56:47PM +0400, Stanislav Kinsbursky wrote: > 09.04.2012 20:33, Myklebust, Trond пишет: > >On Mon, 2012-04-09 at 12:21 -0400, bfields@fieldses.org wrote: > >>On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote: > >>>On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote: > >>>>On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote: > >>>>>09.04.2012 19:27, Jeff Layton пишет: > >>>>>> > >>>>>>If you allow one container to hand out conflicting locks while another > >>>>>>container is allowing reclaims, then you can end up with some very > >>>>>>difficult to debug silent data corruption. That's the worst possible > >>>>>>outcome, IMO. We really need to actively keep people from shooting > >>>>>>themselves in the foot here. > >>>>>> > >>>>>>One possibility might be to only allow filesystems to be exported from > >>>>>>a single container at a time (and allow that to be overridable somehow > >>>>>>once we have a working active/active serving solution). With that, you > >>>>>>may be able limp along with a per-container grace period handling > >>>>>>scheme like you're proposing. > >>>>>> > >>>>> > >>>>>Ok then. Keeping people from shooting themselves here sounds reasonable. > >>>>>And I like the idea of exporting a filesystem only from once per > >>>>>network namespace. > >>>> > >>>>Unfortunately that's not going to get us very far, especially not in the > >>>>v4 case where we've got the common read-only pseudoroot that everyone > >>>>has to share. > >>> > >>>I don't see how that can work in cases where each container has its own > >>>private mount namespace. 
You're going to have to tie that pseudoroot to > >>>the mount namespace somehow. > >> > >>Sure, but in typical cases it'll still be shared; requiring that they > >>not be sounds like a severe limitation. > > > >I'd expect the typical case to be the non-shared namespace: the whole > >point of containers is to provide for complete isolation of processes. > >Usually that implies that you don't want them to be able to communicate > >via a shared filesystem. > > > > BTW, we DO use one mount namespace for all containers and host in > OpenVZ. This allows us to have an access to containers mount points > from initial environment. Isolation between containers is done via > chroot and some simple tricks on /proc/mounts read operation. > Moreover, with one mount namespace, we currently support > bind-mounting on NFS from one container into another... > > Anyway, I'm sorry, but I'm not familiar with this pseudoroot idea. Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be able to do readdir's and lookups to get to exported filesystems. We support this in the Linux server by exporting all the filesystems from "/" on down that must be traversed to reach a given filesystem. These exports are very restricted (e.g. only parents of exports are visible). > Why does it prevents implementing of check for "superblock-network > namespace" pair on NFS server start and forbid (?) it in case of > this pair is shared already in other namespace? I.e. maybe this > pseudoroot can be an exclusion from this rule? That might work. It's read-only and consists only of directories, so the grace period doesn't affect it. --b. > Or I'm just missing the point at all? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 18:11 ` bfields @ 2012-04-10 10:56 ` Stanislav Kinsbursky 2012-04-10 13:39 ` bfields 0 siblings, 1 reply; 44+ messages in thread From: Stanislav Kinsbursky @ 2012-04-10 10:56 UTC (permalink / raw) To: bfields; +Cc: Myklebust, Trond, Jeff Layton, linux-nfs, linux-kernel 09.04.2012 22:11, bfields@fieldses.org пишет: > On Mon, Apr 09, 2012 at 08:56:47PM +0400, Stanislav Kinsbursky wrote: >> 09.04.2012 20:33, Myklebust, Trond пишет: >>> On Mon, 2012-04-09 at 12:21 -0400, bfields@fieldses.org wrote: >>>> On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote: >>>>> On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote: >>>>>> On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote: >>>>>>> 09.04.2012 19:27, Jeff Layton пишет: >>>>>>>> >>>>>>>> If you allow one container to hand out conflicting locks while another >>>>>>>> container is allowing reclaims, then you can end up with some very >>>>>>>> difficult to debug silent data corruption. That's the worst possible >>>>>>>> outcome, IMO. We really need to actively keep people from shooting >>>>>>>> themselves in the foot here. >>>>>>>> >>>>>>>> One possibility might be to only allow filesystems to be exported from >>>>>>>> a single container at a time (and allow that to be overridable somehow >>>>>>>> once we have a working active/active serving solution). With that, you >>>>>>>> may be able limp along with a per-container grace period handling >>>>>>>> scheme like you're proposing. >>>>>>>> >>>>>>> >>>>>>> Ok then. Keeping people from shooting themselves here sounds reasonable. >>>>>>> And I like the idea of exporting a filesystem only from once per >>>>>>> network namespace. >>>>>> >>>>>> Unfortunately that's not going to get us very far, especially not in the >>>>>> v4 case where we've got the common read-only pseudoroot that everyone >>>>>> has to share. 
>>>>> >>>>> I don't see how that can work in cases where each container has its own >>>>> private mount namespace. You're going to have to tie that pseudoroot to >>>>> the mount namespace somehow. >>>> >>>> Sure, but in typical cases it'll still be shared; requiring that they >>>> not be sounds like a severe limitation. >>> >>> I'd expect the typical case to be the non-shared namespace: the whole >>> point of containers is to provide for complete isolation of processes. >>> Usually that implies that you don't want them to be able to communicate >>> via a shared filesystem. >>> >> >> BTW, we DO use one mount namespace for all containers and host in >> OpenVZ. This allows us to have an access to containers mount points >> from initial environment. Isolation between containers is done via >> chroot and some simple tricks on /proc/mounts read operation. >> Moreover, with one mount namespace, we currently support >> bind-mounting on NFS from one container into another... >> >> Anyway, I'm sorry, but I'm not familiar with this pseudoroot idea. > > Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be > able to do readdir's and lookups to get to exported filesystems. We > support this in the Linux server by exporting all the filesystems from > "/" on down that must be traversed to reach a given filesystem. These > exports are very restricted (e.g. only parents of exports are visible). > Ok, thanks for explanation. So, this pseudoroot looks like a part of NFS server internal implementation, but not a part of a standard. That's good. >> Why does it prevents implementing of check for "superblock-network >> namespace" pair on NFS server start and forbid (?) it in case of >> this pair is shared already in other namespace? I.e. maybe this >> pseudoroot can be an exclusion from this rule? > > That might work. It's read-only and consists only of directories, so > the grace period doesn't affect it. > I've just realized, that this per-sb grace period won't work. 
I.e., it's a valid situation when two or more containers are located on the same filesystem but share different parts of it, and there is no conflict there at all. I don't see any clear and simple way to handle such races, because otherwise we have to tie the network namespace to the filesystem namespace. I.e., some way will be required to determine whether the export directory being passed has already been shared somewhere else. Realistic solution: since the export check has to be done in the initial filesystem environment (most probably the container will have its own root), we have to pass this data to some kernel thread/userspace daemon in the initial filesystem environment somehow (sockets don't suit here... shared memory?). Improbable solution: patching the VFS layer... -- Best regards, Stanislav Kinsbursky ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-10 10:56 ` Stanislav Kinsbursky @ 2012-04-10 13:39 ` bfields 2012-04-10 15:36 ` Stanislav Kinsbursky 0 siblings, 1 reply; 44+ messages in thread From: bfields @ 2012-04-10 13:39 UTC (permalink / raw) To: Stanislav Kinsbursky Cc: Myklebust, Trond, Jeff Layton, linux-nfs, linux-kernel On Tue, Apr 10, 2012 at 02:56:12PM +0400, Stanislav Kinsbursky wrote: > 09.04.2012 22:11, bfields@fieldses.org пишет: > >Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be > >able to do readdir's and lookups to get to exported filesystems. We > >support this in the Linux server by exporting all the filesystems from > >"/" on down that must be traversed to reach a given filesystem. These > >exports are very restricted (e.g. only parents of exports are visible). > > > > Ok, thanks for explanation. > So, this pseudoroot looks like a part of NFS server internal > implementation, but not a part of a standard. That's good. > > >>Why does it prevents implementing of check for "superblock-network > >>namespace" pair on NFS server start and forbid (?) it in case of > >>this pair is shared already in other namespace? I.e. maybe this > >>pseudoroot can be an exclusion from this rule? > > > >That might work. It's read-only and consists only of directories, so > >the grace period doesn't affect it. > > > > I've just realized, that this per-sb grace period won't work. > I.e., it's a valid situation, when two or more containers located on > the same filesystem, but shares different parts of it. And there is > not conflict here at all. Well, there may be some conflict in that a file could be hardlinked into both subtrees, and that file could be locked from users of either export. --b. > I don't see any clear and simple way how to handle such races, > because otherwise we have to tie network namespace and filesystem > namespace. > I.e. there will be required some way to define, was passed export > directory shared already somewhere else or not. 
> > Realistic solution - since export check should be done in initial > file system environment (most probably container will have it's own > root), then we to pass this data to some kernel thread/userspace > daemon in initial file system environment somehow (sockets doesn't > suits here... Shared memory?). > > Improbable solution - patching VFS layer... > > -- > Best regards, > Stanislav Kinsbursky ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-10 13:39 ` bfields @ 2012-04-10 15:36 ` Stanislav Kinsbursky 2012-04-10 18:28 ` Jeff Layton 0 siblings, 1 reply; 44+ messages in thread From: Stanislav Kinsbursky @ 2012-04-10 15:36 UTC (permalink / raw) To: bfields; +Cc: Myklebust, Trond, Jeff Layton, linux-nfs, linux-kernel 10.04.2012 17:39, bfields@fieldses.org пишет: > On Tue, Apr 10, 2012 at 02:56:12PM +0400, Stanislav Kinsbursky wrote: >> 09.04.2012 22:11, bfields@fieldses.org пишет: >>> Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be >>> able to do readdir's and lookups to get to exported filesystems. We >>> support this in the Linux server by exporting all the filesystems from >>> "/" on down that must be traversed to reach a given filesystem. These >>> exports are very restricted (e.g. only parents of exports are visible). >>> >> >> Ok, thanks for explanation. >> So, this pseudoroot looks like a part of NFS server internal >> implementation, but not a part of a standard. That's good. >> >>>> Why does it prevents implementing of check for "superblock-network >>>> namespace" pair on NFS server start and forbid (?) it in case of >>>> this pair is shared already in other namespace? I.e. maybe this >>>> pseudoroot can be an exclusion from this rule? >>> >>> That might work. It's read-only and consists only of directories, so >>> the grace period doesn't affect it. >>> >> >> I've just realized, that this per-sb grace period won't work. >> I.e., it's a valid situation, when two or more containers located on >> the same filesystem, but shares different parts of it. And there is >> not conflict here at all. > > Well, there may be some conflict in that a file could be hardlinked into > both subtrees, and that file could be locked from users of either > export. > Is this case handled if both links are visible in the same export? But anyway, this is not that bad. I.e. it doesn't make things unpredictable. 
Probably there are some more issues like this one (bind-mounting, for example). But I think that it's the root user's responsibility to handle such problems. -- Best regards, Stanislav Kinsbursky ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-10 15:36 ` Stanislav Kinsbursky @ 2012-04-10 18:28 ` Jeff Layton 2012-04-10 20:46 ` bfields 2012-04-11 10:08 ` Stanislav Kinsbursky 0 siblings, 2 replies; 44+ messages in thread From: Jeff Layton @ 2012-04-10 18:28 UTC (permalink / raw) To: Stanislav Kinsbursky; +Cc: bfields, Myklebust, Trond, linux-nfs, linux-kernel On Tue, 10 Apr 2012 19:36:26 +0400 Stanislav Kinsbursky <skinsbursky@parallels.com> wrote: > 10.04.2012 17:39, bfields@fieldses.org пишет: > > On Tue, Apr 10, 2012 at 02:56:12PM +0400, Stanislav Kinsbursky wrote: > >> 09.04.2012 22:11, bfields@fieldses.org пишет: > >>> Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be > >>> able to do readdir's and lookups to get to exported filesystems. We > >>> support this in the Linux server by exporting all the filesystems from > >>> "/" on down that must be traversed to reach a given filesystem. These > >>> exports are very restricted (e.g. only parents of exports are visible). > >>> > >> > >> Ok, thanks for explanation. > >> So, this pseudoroot looks like a part of NFS server internal > >> implementation, but not a part of a standard. That's good. > >> > >>>> Why does it prevents implementing of check for "superblock-network > >>>> namespace" pair on NFS server start and forbid (?) it in case of > >>>> this pair is shared already in other namespace? I.e. maybe this > >>>> pseudoroot can be an exclusion from this rule? > >>> > >>> That might work. It's read-only and consists only of directories, so > >>> the grace period doesn't affect it. > >>> > >> > >> I've just realized, that this per-sb grace period won't work. > >> I.e., it's a valid situation, when two or more containers located on > >> the same filesystem, but shares different parts of it. And there is > >> not conflict here at all. > > > > Well, there may be some conflict in that a file could be hardlinked into > > both subtrees, and that file could be locked from users of either > > export. 
> > > > Is this case handled if both links or visible in the same export? > But anyway, this is not that bad. I.e it doesn't make things unpredictable. > Probably, there are some more issues like this one (bind-mounting, for example). > But I think, that it's root responsibility to handle such problems. > Well, it's a problem and one that you'll probably have to address to some degree. In truth, the fact that you're exporting different subtrees in different containers is immaterial since they're both on the same fs and filehandles don't carry any info about the path in and of themselves... Suppose for instance that we have a hardlinked file that's available from two different exports in two different containers. The grace period ends in container #1, so that nfsd starts servicing normal lock requests. An application takes a lock on that hardlinked file. In the meantime, a client of container #2 attempts to reclaim the lock that he previously held on that same inode and gets denied. That's just one example. The scarier case is that the client of container #1 takes the lock, alters the file and then drops it again with the client of container #2 none the wiser. Now the file got altered while client #2 thought he held a lock on it. That won't be fun to track down... This sort of thing is one of the reasons I've been saying that the grace period is really a property of the underlying filesystem and not of nfsd itself. Of course, we do have to come up with a way to handle the grace period that doesn't involve altering every exportable fs. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-10 18:28 ` Jeff Layton @ 2012-04-10 20:46 ` bfields 2012-04-11 10:08 ` Stanislav Kinsbursky 1 sibling, 0 replies; 44+ messages in thread From: bfields @ 2012-04-10 20:46 UTC (permalink / raw) To: Jeff Layton Cc: Stanislav Kinsbursky, Myklebust, Trond, linux-nfs, linux-kernel On Tue, Apr 10, 2012 at 02:28:53PM -0400, Jeff Layton wrote: > This sort of thing is one of the reasons I've been saying that the > grace period is really a property of the underlying filesystem and not > of nfsd itself. Of course, we do have to come up with a way to handle > the grace period that doesn't involve altering every exportable fs. By the way, the case of multiple containers exporting a single filesystem does look a lot like an active/active cluster filesystem export. It might be an opportunity to prototype the interfaces for handling that case without having to deal with modifying the DLM. --b. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-10 18:28 ` Jeff Layton 2012-04-10 20:46 ` bfields @ 2012-04-11 10:08 ` Stanislav Kinsbursky 1 sibling, 0 replies; 44+ messages in thread From: Stanislav Kinsbursky @ 2012-04-11 10:08 UTC (permalink / raw) To: Jeff Layton; +Cc: bfields, Myklebust, Trond, linux-nfs, linux-kernel 10.04.2012 22:28, Jeff Layton wrote: > On Tue, 10 Apr 2012 19:36:26 +0400 > Stanislav Kinsbursky <skinsbursky@parallels.com> wrote: > >> 10.04.2012 17:39, bfields@fieldses.org wrote: >>> On Tue, Apr 10, 2012 at 02:56:12PM +0400, Stanislav Kinsbursky wrote: >>>> 09.04.2012 22:11, bfields@fieldses.org wrote: >>>>> Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be >>>>> able to do readdir's and lookups to get to exported filesystems. We >>>>> support this in the Linux server by exporting all the filesystems from >>>>> "/" on down that must be traversed to reach a given filesystem. These >>>>> exports are very restricted (e.g. only parents of exports are visible). >>>>> >>>> >>>> Ok, thanks for the explanation. >>>> So, this pseudoroot looks like a part of the NFS server's internal >>>> implementation, but not a part of a standard. That's good. >>>> >>>>>> Why does it prevent implementing a check for the "superblock-network >>>>>> namespace" pair on NFS server start, and forbidding (?) it in case this >>>>>> pair is already shared in another namespace? I.e. maybe this >>>>>> pseudoroot can be an exclusion from this rule? >>>>> >>>>> That might work. It's read-only and consists only of directories, so >>>>> the grace period doesn't affect it. >>>>> >>>> >>>> I've just realized that this per-sb grace period won't work. >>>> I.e., it's a valid situation when two or more containers are located on >>>> the same filesystem, but share different parts of it. And there is >>>> no conflict here at all. >>> >>> Well, there may be some conflict in that a file could be hardlinked into >>> both subtrees, and that file could be locked from users of either >>> export. >>> >> >> Is this case handled if both links are visible in the same export? >> But anyway, this is not that bad. I.e. it doesn't make things unpredictable. >> Probably, there are some more issues like this one (bind-mounting, for example). >> But I think that it's root's responsibility to handle such problems. >> > > Well, it's a problem and one that you'll probably have to address to > some degree. In truth, the fact that you're exporting different > subtrees in different containers is immaterial, since they're both on > the same fs and filehandles don't carry any info about the path in and > of themselves... > > Suppose for instance that we have a hardlinked file that's available > from two different exports in two different containers. The grace > period ends in container #1, so that nfsd starts servicing normal lock > requests. An application takes a lock on that hardlinked file. In the > meantime, a client of container #2 attempts to reclaim the lock that he > previously held on that same inode and gets denied. > > That's just one example. The scarier case is that the client of > container #1 takes the lock, alters the file and then drops it again > with the client of container #2 none the wiser. Now the file got > altered while client #2 thought he held a lock on it. That won't be fun > to track down... > > This sort of thing is one of the reasons I've been saying that the > grace period is really a property of the underlying filesystem and not > of nfsd itself. Of course, we do have to come up with a way to handle > the grace period that doesn't involve altering every exportable fs. > I see. But, frankly speaking, it looks like the problem you are talking about is a separate task (compared to containerization). I.e. making NFSd work per network namespace is somewhat different from these "shared file system" issues (which are actually a part of the mount namespace). -- Best regards, Stanislav Kinsbursky ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 11:24 ` Grace period Stanislav Kinsbursky 2012-04-09 13:47 ` Jeff Layton @ 2012-04-09 23:26 ` bfields 2012-04-10 11:29 ` Stanislav Kinsbursky 1 sibling, 1 reply; 44+ messages in thread From: bfields @ 2012-04-09 23:26 UTC (permalink / raw) To: Stanislav Kinsbursky; +Cc: Trond.Myklebust, linux-nfs, linux-kernel On Mon, Apr 09, 2012 at 03:24:19PM +0400, Stanislav Kinsbursky wrote: > 07.04.2012 03:40, bfields@fieldses.org wrote: > >On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote: > >>Hello, Bruce. > >>Could you, please, clarify the reason why the grace list is used? > >>I.e. why is a list used instead of some atomic variable, for example? > > > >Like just a reference count? Yeah, that would be OK. > > > >In theory it could provide some sort of debugging help. (E.g. we could > >print out the list of "lock managers" currently keeping us in grace.) I > >had some idea we'd make those lock manager objects more complicated, and > >might have more for individual containerized services. > > Could you share this idea, please? > > Anyway, I have nothing against lists. Just was curious why it was used. > I added Trond and lists to this reply. > > Let me explain the problem with the grace period I'm facing > right now, and what I'm thinking about it. > So, one of the things to be containerized during "NFSd per net ns" > work is the grace period, and these are the basic components of it: > 1) Grace period start. > 2) Grace period end. > 3) Grace period check. > 4) Grace period restart. For restart, you're thinking of the fs/lockd/svc.c:restart_grace() that's called on a signal in lockd()? I wonder if there's any way to figure out if that's actually used by anyone? (E.g. by any distro init scripts). It strikes me as possibly impossible to use correctly. Perhaps we could deprecate it....
> So, the simplest straight-forward way is to make all internal stuff: > "grace_list", "grace_lock", "grace_period_end" work and both > "lockd_manager" and "nfsd4_manager" - per network namespace. Also, > "laundromat_work" has to be per-net as well. > In this case: > 1) Start - grace period can be started per net ns in > "lockd_up_net()" (thus has to be moved there from "lockd()") and > "nfs4_state_start()". > 2) End - grace period can be ended per net ns in "lockd_down_net()" > (thus has to be moved there from "lockd()"), "nfsd4_end_grace()" and > "nfs4_state_shutdown()". > 3) Check - looks easy. Either svc_rqst or net context can > be passed to the function. > 4) Restart - this is a tricky place. It would be great to restart > grace period only for the network namespace of the sender of the > kill signal. So, the idea is to check siginfo_t for the pid of > sender, then try to locate the task, and if found, then get sender's > network namespace, and restart grace period only for this namespace > (of course, if lockd was started for this namespace - see below). If it's really the signalling that's the problem--perhaps we can get away from the signal-based interface. At least in the case of lockd I suspect we could. Or perhaps the decision to share a single lockd thread (or set of nfsd threads) among multiple network namespaces was a poor one. But I realize multithreading lockd doesn't look easy. --b. > If the task is not found, or if lockd wasn't started for its namespace, > then the grace period can be either restarted for all namespaces, or > just silently dropped. This is the place where I'm not sure what to do, > because restarting the grace period for all namespaces would be > overkill... > > There is also another problem with the "task by pid" search: the found > task can actually be not the sender (which died already), but some other > new task with the same pid number.
In this case, I think, we can > just neglect this probability and always assume that we have located the > sender (if, of course, lockd was started for the sender's network > namespace). > > Trond, Bruce, could you, please, comment on these ideas? > > -- > Best regards, > Stanislav Kinsbursky > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-09 23:26 ` bfields @ 2012-04-10 11:29 ` Stanislav Kinsbursky 2012-04-10 13:37 ` bfields 0 siblings, 1 reply; 44+ messages in thread From: Stanislav Kinsbursky @ 2012-04-10 11:29 UTC (permalink / raw) To: bfields; +Cc: Trond.Myklebust, linux-nfs, linux-kernel 10.04.2012 03:26, bfields@fieldses.org wrote: > On Mon, Apr 09, 2012 at 03:24:19PM +0400, Stanislav Kinsbursky wrote: >> 07.04.2012 03:40, bfields@fieldses.org wrote: >>> On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote: >>>> Hello, Bruce. >>>> Could you, please, clarify the reason why the grace list is used? >>>> I.e. why is a list used instead of some atomic variable, for example? >>> >>> Like just a reference count? Yeah, that would be OK. >>> >>> In theory it could provide some sort of debugging help. (E.g. we could >>> print out the list of "lock managers" currently keeping us in grace.) I >>> had some idea we'd make those lock manager objects more complicated, and >>> might have more for individual containerized services. >> >> Could you share this idea, please? >> >> Anyway, I have nothing against lists. Just was curious why it was used. >> I added Trond and lists to this reply. >> >> Let me explain the problem with the grace period I'm facing >> right now, and what I'm thinking about it. >> So, one of the things to be containerized during "NFSd per net ns" >> work is the grace period, and these are the basic components of it: >> 1) Grace period start. >> 2) Grace period end. >> 3) Grace period check. >> 4) Grace period restart. > > For restart, you're thinking of the fs/lockd/svc.c:restart_grace() > that's called on a signal in lockd()? > > I wonder if there's any way to figure out if that's actually used by > anyone? (E.g. by any distro init scripts). It strikes me as possibly > impossible to use correctly. Perhaps we could deprecate it.... > Or (since the lockd kthread is visible only from the initial pid namespace) we can just hardcode "init_net" in this case.
But it means that this "kill" logic will be broken if two containers share one pid namespace, but have separate network namespaces. Anyway, both (this one or Bruce's) solutions suit me. >> So, the simplest straight-forward way is to make all internal stuff: >> "grace_list", "grace_lock", "grace_period_end" work and both >> "lockd_manager" and "nfsd4_manager" - per network namespace. Also, >> "laundromat_work" has to be per-net as well. >> In this case: >> 1) Start - grace period can be started per net ns in >> "lockd_up_net()" (thus has to be moved there from "lockd()") and >> "nfs4_state_start()". >> 2) End - grace period can be ended per net ns in "lockd_down_net()" >> (thus has to be moved there from "lockd()"), "nfsd4_end_grace()" and >> "nfs4_state_shutdown()". >> 3) Check - looks easy. Either svc_rqst or net context can >> be passed to the function. >> 4) Restart - this is a tricky place. It would be great to restart >> grace period only for the network namespace of the sender of the >> kill signal. So, the idea is to check siginfo_t for the pid of >> sender, then try to locate the task, and if found, then get sender's >> network namespace, and restart grace period only for this namespace >> (of course, if lockd was started for this namespace - see below). > > If it's really the signalling that's the problem--perhaps we can get > away from the signal-based interface. > > At least in the case of lockd I suspect we could. > I'm ok with that. So, if no objections follow, I'll drop it and send a patch. Or do you want to do it? BTW, I tried this "pid from siginfo" approach yesterday, and it doesn't work, because the sender is usually dead already by the time the lookup for the task by pid is performed. > Or perhaps the decision to share a single lockd thread (or set of nfsd > threads) among multiple network namespaces was a poor one. But I > realize multithreading lockd doesn't look easy. > This decision was the best one in the current circumstances.
Having a lockd thread (or NFSd threads) per container looks easy to implement at first sight. But kernel threads are currently supported only in the initial pid namespace. I.e. it means that a per-container kernel thread won't be visible in the container, if it has its own pid namespace. And there is no way to put a kernel thread into a container. In OpenVZ we have per-container kernel threads. But integrating this feature into mainline looks hopeless (or very difficult) to me. At least for now. So this problem with signals remains unsolved. So, as it looks to me, this "one service per all" is the only one suitable for now. But there are some corner cases which have to be solved. Anyway, Jeff's question is still open. Do we need to prevent people from exporting nested directories from different network namespaces? And if yes, how do we do this? -- Best regards, Stanislav Kinsbursky ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-10 11:29 ` Stanislav Kinsbursky @ 2012-04-10 13:37 ` bfields 2012-04-10 14:10 ` Stanislav Kinsbursky 0 siblings, 1 reply; 44+ messages in thread From: bfields @ 2012-04-10 13:37 UTC (permalink / raw) To: Stanislav Kinsbursky; +Cc: Trond.Myklebust, linux-nfs, linux-kernel On Tue, Apr 10, 2012 at 03:29:11PM +0400, Stanislav Kinsbursky wrote: > 10.04.2012 03:26, bfields@fieldses.org wrote: > >On Mon, Apr 09, 2012 at 03:24:19PM +0400, Stanislav Kinsbursky wrote: > >>07.04.2012 03:40, bfields@fieldses.org wrote: > >>>On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote: > >>>>Hello, Bruce. > >>>>Could you, please, clarify the reason why the grace list is used? > >>>>I.e. why is a list used instead of some atomic variable, for example? > >>> > >>>Like just a reference count? Yeah, that would be OK. > >>> > >>>In theory it could provide some sort of debugging help. (E.g. we could > >>>print out the list of "lock managers" currently keeping us in grace.) I > >>>had some idea we'd make those lock manager objects more complicated, and > >>>might have more for individual containerized services. > >> > >>Could you share this idea, please? > >> > >>Anyway, I have nothing against lists. Just was curious why it was used. > >>I added Trond and lists to this reply. > >> > >>Let me explain the problem with the grace period I'm facing > >>right now, and what I'm thinking about it. > >>So, one of the things to be containerized during "NFSd per net ns" > >>work is the grace period, and these are the basic components of it: > >>1) Grace period start. > >>2) Grace period end. > >>3) Grace period check. > >>4) Grace period restart. > > > >For restart, you're thinking of the fs/lockd/svc.c:restart_grace() > >that's called on a signal in lockd()? > > > >I wonder if there's any way to figure out if that's actually used by > >anyone? (E.g. by any distro init scripts). It strikes me as possibly > >impossible to use correctly. Perhaps we could deprecate it.... > > > > Or (since the lockd kthread is visible only from the initial pid namespace) > we can just hardcode "init_net" in this case. But it means that > this "kill" logic will be broken if two containers share one pid > namespace, but have separate network namespaces. > Anyway, both (this one or Bruce's) solutions suit me. > > >>So, the simplest straight-forward way is to make all internal stuff: > >>"grace_list", "grace_lock", "grace_period_end" work and both > >>"lockd_manager" and "nfsd4_manager" - per network namespace. Also, > >>"laundromat_work" has to be per-net as well. > >>In this case: > >>1) Start - grace period can be started per net ns in > >>"lockd_up_net()" (thus has to be moved there from "lockd()") and > >>"nfs4_state_start()". > >>2) End - grace period can be ended per net ns in "lockd_down_net()" > >>(thus has to be moved there from "lockd()"), "nfsd4_end_grace()" and > >>"nfs4_state_shutdown()". > >>3) Check - looks easy. Either svc_rqst or net context can > >>be passed to the function. > >>4) Restart - this is a tricky place. It would be great to restart > >>grace period only for the network namespace of the sender of the > >>kill signal. So, the idea is to check siginfo_t for the pid of > >>sender, then try to locate the task, and if found, then get sender's > >>network namespace, and restart grace period only for this namespace > >>(of course, if lockd was started for this namespace - see below). > > > >If it's really the signalling that's the problem--perhaps we can get > >away from the signal-based interface. > > > >At least in the case of lockd I suspect we could. > > > > I'm ok with that. So, if no objections follow, I'll drop it and > send a patch. Or do you want to do it? Please do go ahead. The safest approach might be: - leave lockd's signal handling there (just accept that it may behave incorrectly in container case), assuming that's safe.
- add a printk ("signalling lockd to restart is deprecated", or something) if it's used. Then eventually we'll remove it entirely. (But if that doesn't work, it'd likely also be OK just to remove it completely now.) --b. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-10 13:37 ` bfields @ 2012-04-10 14:10 ` Stanislav Kinsbursky 2012-04-10 14:18 ` bfields 0 siblings, 1 reply; 44+ messages in thread From: Stanislav Kinsbursky @ 2012-04-10 14:10 UTC (permalink / raw) To: bfields; +Cc: Trond.Myklebust, linux-nfs, linux-kernel 10.04.2012 17:37, bfields@fieldses.org wrote: > On Tue, Apr 10, 2012 at 03:29:11PM +0400, Stanislav Kinsbursky wrote: >> 10.04.2012 03:26, bfields@fieldses.org wrote: >>> On Mon, Apr 09, 2012 at 03:24:19PM +0400, Stanislav Kinsbursky wrote: >>>> 07.04.2012 03:40, bfields@fieldses.org wrote: >>>>> On Fri, Apr 06, 2012 at 09:08:26PM +0400, Stanislav Kinsbursky wrote: >>>>>> Hello, Bruce. >>>>>> Could you, please, clarify the reason why the grace list is used? >>>>>> I.e. why is a list used instead of some atomic variable, for example? >>>>> >>>>> Like just a reference count? Yeah, that would be OK. >>>>> >>>>> In theory it could provide some sort of debugging help. (E.g. we could >>>>> print out the list of "lock managers" currently keeping us in grace.) I >>>>> had some idea we'd make those lock manager objects more complicated, and >>>>> might have more for individual containerized services. >>>> >>>> Could you share this idea, please? >>>> >>>> Anyway, I have nothing against lists. Just was curious why it was used. >>>> I added Trond and lists to this reply. >>>> >>>> Let me explain the problem with the grace period I'm facing >>>> right now, and what I'm thinking about it. >>>> So, one of the things to be containerized during "NFSd per net ns" >>>> work is the grace period, and these are the basic components of it: >>>> 1) Grace period start. >>>> 2) Grace period end. >>>> 3) Grace period check. >>>> 4) Grace period restart. >>> >>> For restart, you're thinking of the fs/lockd/svc.c:restart_grace() >>> that's called on a signal in lockd()? >>> >>> I wonder if there's any way to figure out if that's actually used by >>> anyone? (E.g. by any distro init scripts). It strikes me as possibly impossible to use correctly. Perhaps we could deprecate it.... >>> >> >> Or (since the lockd kthread is visible only from the initial pid namespace) >> we can just hardcode "init_net" in this case. But it means that >> this "kill" logic will be broken if two containers share one pid >> namespace, but have separate network namespaces. >> Anyway, both (this one or Bruce's) solutions suit me. >> >>>> So, the simplest straight-forward way is to make all internal stuff: >>>> "grace_list", "grace_lock", "grace_period_end" work and both >>>> "lockd_manager" and "nfsd4_manager" - per network namespace. Also, >>>> "laundromat_work" has to be per-net as well. >>>> In this case: >>>> 1) Start - grace period can be started per net ns in >>>> "lockd_up_net()" (thus has to be moved there from "lockd()") and >>>> "nfs4_state_start()". >>>> 2) End - grace period can be ended per net ns in "lockd_down_net()" >>>> (thus has to be moved there from "lockd()"), "nfsd4_end_grace()" and >>>> "nfs4_state_shutdown()". >>>> 3) Check - looks easy. Either svc_rqst or net context can >>>> be passed to the function. >>>> 4) Restart - this is a tricky place. It would be great to restart >>>> grace period only for the network namespace of the sender of the >>>> kill signal. So, the idea is to check siginfo_t for the pid of >>>> sender, then try to locate the task, and if found, then get sender's >>>> network namespace, and restart grace period only for this namespace >>>> (of course, if lockd was started for this namespace - see below). >>> >>> If it's really the signalling that's the problem--perhaps we can get >>> away from the signal-based interface. >>> >>> At least in the case of lockd I suspect we could. >>> >> >> I'm ok with that. So, if no objections follow, I'll drop it and >> send a patch. Or do you want to do it? > > Please do go ahead.
> > The safest approach might be: > - leave lockd's signal handling there (just accept that it may > behave incorrectly in container case), assuming that's safe. > - add a printk ("signalling lockd to restart is deprecated", > or something) if it's used. > > Then eventually we'll remove it entirely. > > (But if that doesn't work, it'd likely also be OK just to remove it > completely now.) > Well, I can do this: restart grace only for "init_net", with a printk carrying your message and a note that it affects only init_net. Looks good to you? -- Best regards, Stanislav Kinsbursky ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Grace period 2012-04-10 14:10 ` Stanislav Kinsbursky @ 2012-04-10 14:18 ` bfields 0 siblings, 0 replies; 44+ messages in thread From: bfields @ 2012-04-10 14:18 UTC (permalink / raw) To: Stanislav Kinsbursky; +Cc: Trond.Myklebust, linux-nfs, linux-kernel On Tue, Apr 10, 2012 at 06:10:27PM +0400, Stanislav Kinsbursky wrote: > Well, I can do this: restart grace only for "init_net", with a > printk carrying your message and a note that it affects only > init_net. > Looks good to you? Yep, thanks! --b. ^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2016-07-06 0:38 UTC | newest] Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-06-14 21:25 [PATCH] NFS: Don't let readdirplus revalidate an inode that was marked as stale Trond Myklebust 2016-06-30 21:46 ` grace period Marc Eshel 2016-07-01 16:08 ` Bruce Fields 2016-07-01 17:31 ` Marc Eshel 2016-07-01 20:07 ` Bruce Fields 2016-07-01 20:24 ` Marc Eshel 2016-07-01 20:47 ` Bruce Fields 2016-07-01 20:46 ` Marc Eshel 2016-07-01 21:01 ` Bruce Fields 2016-07-01 22:42 ` Marc Eshel 2016-07-02 0:58 ` Bruce Fields 2016-07-03 5:30 ` Marc Eshel 2016-07-05 20:51 ` Bruce Fields 2016-07-05 23:05 ` Marc Eshel 2016-07-06 0:38 ` Bruce Fields [not found] ` <OFC1237E53.3CFCA8E8-ON88257FE5.001D3182-88257FE5.001E3A5B@LocalDomain> 2016-07-04 23:53 ` HA NFS Marc Eshel 2016-07-05 15:08 ` Steve Dickson 2016-07-05 20:56 ` Marc Eshel [not found] ` <OF5D486F02.62CECB7B-ON88257FE3.0071DBE5-88257FE3.00722318@LocalDomain> 2016-07-01 20:51 ` grace period Marc Eshel [not found] <4F7F230A.6080506@parallels.com> [not found] ` <20120406234039.GA20940@fieldses.org> 2012-04-09 11:24 ` Grace period Stanislav Kinsbursky 2012-04-09 13:47 ` Jeff Layton 2012-04-09 14:25 ` Stanislav Kinsbursky 2012-04-09 15:27 ` Jeff Layton 2012-04-09 16:08 ` Stanislav Kinsbursky 2012-04-09 16:11 ` bfields 2012-04-09 16:17 ` Myklebust, Trond 2012-04-09 16:17 ` Myklebust, Trond 2012-04-09 16:21 ` bfields 2012-04-09 16:33 ` Myklebust, Trond 2012-04-09 16:33 ` Myklebust, Trond 2012-04-09 16:39 ` bfields 2012-04-09 16:56 ` Stanislav Kinsbursky 2012-04-09 18:11 ` bfields 2012-04-10 10:56 ` Stanislav Kinsbursky 2012-04-10 13:39 ` bfields 2012-04-10 15:36 ` Stanislav Kinsbursky 2012-04-10 18:28 ` Jeff Layton 2012-04-10 20:46 ` bfields 2012-04-11 10:08 ` Stanislav Kinsbursky 2012-04-09 23:26 ` bfields 2012-04-10 11:29 ` Stanislav Kinsbursky 2012-04-10 13:37 ` bfields 2012-04-10 14:10 ` Stanislav Kinsbursky 2012-04-10 14:18 ` 
bfields