* [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
@ 2017-01-15 23:38 Oleg Drokin
  2017-01-16 17:17 ` J. Bruce Fields
  2017-01-16 17:32 ` [Lsf-pc] " James Bottomley
  0 siblings, 2 replies; 19+ messages in thread

From: Oleg Drokin @ 2017-01-15 23:38 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-fsdevel

Hello!

I would like to attend the filesystem track at LSF/MM this year.

Other than the obvious Lustre-related stuff (i.e. hearing from Christoph how
bad Lustre is and which other parts of it we need to remove), I can share
some hopefully useful testing methods we came up with in our group. More
people can apparently benefit from them, as evidenced by the interest from
NFS people after I was able to uncover a bunch of problems. I suspect other
network filesystems would benefit here too.

I also see there is potentially going to be a caching discussion that sounds
pretty relevant to Lustre as well. This would probably go hand in hand with
a somewhat recent discussion with Al Viro about potentially redoing the
"unmount the subtrees on dentry invalidation" behavior, which appears to be
overly aggressive now.

Container support in filesystems is also very relevant to us, since Lustre
is used more and more in such settings.

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-15 23:38 [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems Oleg Drokin
@ 2017-01-16 17:17 ` J. Bruce Fields
  2017-01-16 17:23   ` Jeffrey Altman
  2017-01-16 17:32 ` [Lsf-pc] " James Bottomley
  1 sibling, 1 reply; 19+ messages in thread

From: J. Bruce Fields @ 2017-01-16 17:17 UTC (permalink / raw)
To: Oleg Drokin; +Cc: lsf-pc, linux-fsdevel

On Sun, Jan 15, 2017 at 06:38:43PM -0500, Oleg Drokin wrote:
> Hello!
>
> I would like to attend the filesystem track at LSF/MM this year.
>
> Other than the obvious Lustre-related stuff (i.e. hearing from Christoph
> how bad Lustre is and which other parts of it we need to remove), I can
> share some hopefully useful testing methods we came up with in our group
> that more people can benefit from, as evidenced by the interest from NFS
> people after I was able to uncover a bunch of problems.

Yes, I remember this found at least some races after the server's NFSv4
state locking was rewritten.

--b.

> I suspect other network filesystems would benefit here too.
>
> I also see there is potentially going to be a caching discussion that
> sounds pretty relevant to Lustre as well. This would probably go hand in
> hand with a somewhat recent discussion with Al Viro about potentially
> redoing the "unmount the subtrees on dentry invalidation" behavior, which
> appears to be overly aggressive now.
>
> Container support in filesystems is also very relevant to us, since
> Lustre is used more and more in such settings.
>
> Bye,
>     Oleg
* Re: [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-16 17:17 ` J. Bruce Fields
@ 2017-01-16 17:23   ` Jeffrey Altman
  2017-01-16 17:42     ` Chuck Lever
  0 siblings, 1 reply; 19+ messages in thread

From: Jeffrey Altman @ 2017-01-16 17:23 UTC (permalink / raw)
To: Oleg Drokin; +Cc: lsf-pc, linux-fsdevel

> On Sun, Jan 15, 2017 at 06:38:43PM -0500, Oleg Drokin wrote:
>> Container support in filesystems is also very relevant to us, since
>> Lustre is used more and more in such settings.

I too would be interested in participating in a discussion of filesystem
support for containers. In particular, how to manage container identity
for network filesystems so that network filesystem modules such as kafs
can be used to provide persistent, location-independent storage to
containerized processes.

Jeffrey Altman
AuriStor, Inc.

[-- Attachment: jaltman.vcf (text/x-vcard) --]

begin:vcard
fn:Jeffrey Altman
n:Altman;Jeffrey
org:AuriStor, Inc.
adr:Suite 6B;;255 West 94Th Street;New York;New York;10025-6985;United States
email;internet:jaltman@auristor.com
title:Founder and CEO
tel;work:+1-212-769-9018
note:LinkedIn: https://www.linkedin.com/in/jeffreyaltman Skype: jeffrey.e.altman
url:https://www.auristor.com/
version:2.1
end:vcard
* Re: [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-16 17:23 ` Jeffrey Altman
@ 2017-01-16 17:42   ` Chuck Lever
  2017-01-16 17:46     ` James Bottomley
  0 siblings, 1 reply; 19+ messages in thread

From: Chuck Lever @ 2017-01-16 17:42 UTC (permalink / raw)
To: Jeffrey Altman; +Cc: Oleg Drokin, lsf-pc, linux-fsdevel

> On Jan 16, 2017, at 12:23 PM, Jeffrey Altman <jaltman@auristor.com> wrote:
>
>> On Sun, Jan 15, 2017 at 06:38:43PM -0500, Oleg Drokin wrote:
>>> Container support in filesystems is also very relevant to us, since
>>> Lustre is used more and more in such settings.
>
> I too would be interested in participating in a discussion of filesystem
> support for containers. In particular, how to manage container identity
> for network filesystems so that network filesystem modules such as kafs
> can be used to provide persistent, location-independent storage to
> containerized processes.

I'm also interested in that discussion.

> Jeffrey Altman
> AuriStor, Inc.

--
Chuck Lever
* Re: [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-16 17:42 ` Chuck Lever
@ 2017-01-16 17:46   ` James Bottomley
  2017-01-16 20:39     ` Authentication Contexts for network file systems and Containers was " Jeffrey Altman
  0 siblings, 1 reply; 19+ messages in thread

From: James Bottomley @ 2017-01-16 17:46 UTC (permalink / raw)
To: Chuck Lever, Jeffrey Altman; +Cc: Oleg Drokin, lsf-pc, linux-fsdevel

On Mon, 2017-01-16 at 12:42 -0500, Chuck Lever wrote:
> > On Jan 16, 2017, at 12:23 PM, Jeffrey Altman <jaltman@auristor.com>
> > wrote:
> >
> > > On Sun, Jan 15, 2017 at 06:38:43PM -0500, Oleg Drokin wrote:
> > > > Container support in filesystems is also very relevant to us,
> > > > since Lustre is used more and more in such settings.
> >
> > I too would be interested in participating in a discussion of
> > filesystem support for containers. In particular, how to manage
> > container identity for network filesystems so that network
> > filesystem modules such as kafs can be used to provide persistent,
> > location-independent storage to containerized processes.
>
> I'm also interested in that discussion.

For identity, doesn't the UTS namespace do this?  If not, what is
missing?

James
* Authentication Contexts for network file systems and Containers was Re: [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-16 17:46 ` James Bottomley
@ 2017-01-16 20:39   ` Jeffrey Altman
  2017-01-16 21:03     ` [Lsf-pc] " James Bottomley
  0 siblings, 1 reply; 19+ messages in thread

From: Jeffrey Altman @ 2017-01-16 20:39 UTC (permalink / raw)
To: James Bottomley; +Cc: containers, lsf-pc, linux-fsdevel

On 1/16/2017 12:46 PM, James Bottomley wrote:
>
> For identity, doesn't the UTS namespace do this?  If not, what is
> missing?
>
> James

James,

Thanks for posing the question.

Unless I'm missing something, the UTS namespace permits an alternate
'hostname' and NIS 'domainname' to be specified for local visibility to
the processes running in the container.

For an /afs network filesystem client (kafs, OpenAFS or AuriStorFS), the
kernel module must be able to associate each process with an
authentication context. The AFS family of filesystems has implemented
this binding as part of its Process Authentication Group (PAG) concept.
A PAG is a set of processes that share an authentication context. The
authentication context includes:

 * network credentials necessary to establish new connections to the
   requisite network-based services. These include not only the backing
   store for files and directories but also any distributed database
   services managing location independence, replication, failover, etc.

 * established connections to individual servers. These connections are
   re-used for all requests from any process that shares the
   authentication context.

The network credentials might be a Kerberos ticket, a public key, the
result of a GSS-API exchange, or something else; it depends on the
requirements of the security class.

The security properties of a PAG are:

 * A new PAG may be created by any process. When a new PAG is created,
   its membership is only the process that created it.

 * A process may remove itself from a PAG that it is a member of.

 * When a child process is created, it inherits a single PAG membership
   from the parent process.

 * It should not be possible to join a process to a PAG after process
   creation, although due to implementation limitations on some
   platforms you will find references to a child process being able to
   set the PAG of its parent process.

In the traditional PAG implementations used by AFS Unix clients, there
has been a restriction of one PAG membership per process. The Windows
client implements an extended model that is better suited to
multi-threaded processes:

 * a process can be a member of more than one PAG at a time;

 * a process can select one of its PAGs as the default PAG;

 * a thread can select one of the process's PAGs as its active PAG, and
   if there is no active PAG, the process default PAG is used.

This extended Authentication Group model works well for processes such
as web servers that need to execute requests in the authentication
context of a delegated identity and be able to rapidly switch contexts
for each request.

It is important to note that the network credentials stored in an
authentication context do not necessarily have any relationship to the
local machine. It is also important to remember that network credentials
often have a relatively short lifetime and must be renewed or replaced
on a regular basis.

For containers I envision PAGs being used in the following manner:

 * A process running in the context of the host OS, or one that has
   access to keys stored in a TPM or other secure keystore, creates a
   new PAG for each container it is going to launch.

 * This process then obtains the initial network credentials required by
   the container's processes and stores them into the PAG.

 * The initial container process is then created as a child process and
   inherits the PAG membership. Each subsequent child process in the
   container in turn inherits the same PAG.

 * Periodically the host OS process renews the network credentials for
   the PAG. This avoids the need for the processes in the container to
   have any access to, or knowledge of, the network identity under which
   they are executing.

 * A process in the container could decide to resign from the inherited
   PAG and create its own PAG using credentials available to that
   process; for example, a web server running in a container.

The end result is a PAG which spans both the host OS and the container
processes. The container processes might not even know what credentials
they are running with.

Keyrings were created as a storage facility for the network credentials
(https://www.infradead.org/~dhowells/kafs/#keyrings), but keyrings are
not an authentication context.

While a filesystem can internally create an association between an
authentication context and a file descriptor once it is created, and
with pages for write-back, I believe there would be benefit from a more
generic method of tracking authentication contexts in file descriptors
and pages. In particular, there would be better-defined behavior when a
file has been opened for "write" from processes associated with more
than one authentication context.

PAG creation and PAG token set manipulation in the AFS family of
filesystems traditionally took place via path-based ioctls. Providing
equivalent functionality to userland is an open topic that David Howells
submitted for LSF/MM. See afs(setpag), VIOC_GETPAG, VIOCUNPAG, VIC*TOK*
and VIOCUNLOG: https://www.infradead.org/~dhowells/kafs/user_interface.html

While the PAG model has worked well for many decades, it does
periodically run into problems with system designs that assume that
local system identities have the same meaning to network resources; for
example, the problems that AFS is currently experiencing with systemd.
A good description of the problem by Jonathan Billings can be found at
https://docs.google.com/document/d/1P27fP1uj-C8QdxDKMKtI-Qh00c5_9zJa4YHjnpB6ODM/pub

I hope this letter is helpful in describing the issues that the AFS
community has experienced and how we believe that authentication context
management can be used to enhance the usability of containers.

Jeffrey Altman
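[Editor's note: the PAG lifecycle described above (a new PAG contains only its creator, children inherit a single PAG membership, a process may resign, and credentials are stored in the PAG and renewed from outside the container) can be sketched as a toy model. All names here are hypothetical and illustrate only the semantics; this is not the kafs, OpenAFS, or AuriStorFS API.]

```python
import itertools


class PAG:
    """Toy Process Authentication Group: a shared authentication context."""
    _ids = itertools.count(1)

    def __init__(self, creator):
        self.id = next(self._ids)
        self.members = {creator}     # a new PAG contains only its creator
        self.credentials = None      # e.g. a Kerberos ticket, renewed externally


class Process:
    def __init__(self, pag=None):
        self.pag = pag               # traditional model: at most one PAG
        if pag is not None:
            pag.members.add(self)

    def newpag(self):
        """Create a fresh PAG whose only member is this process."""
        if self.pag is not None:
            self.pag.members.discard(self)
        self.pag = PAG(self)
        return self.pag

    def fork(self):
        """A child process inherits the parent's PAG membership."""
        return Process(self.pag)

    def resign(self):
        """A process may remove itself from a PAG it is a member of."""
        if self.pag is not None:
            self.pag.members.discard(self)
            self.pag = None


# Host-side orchestration as described in the letter: create a PAG,
# store credentials into it, then launch the container init as a child.
host = Process()
pag = host.newpag()
pag.credentials = "krb5 ticket (renewed periodically by the host process)"
init = host.fork()       # container init inherits the PAG
worker = init.fork()     # ...and so does every descendant
assert worker.pag is pag and worker.pag.credentials is not None
worker.resign()          # a container process may opt out of the inherited PAG
assert worker.pag is None
```

The key property the sketch captures is that credential renewal happens on the `PAG` object held by the host, so container processes see fresh credentials without ever handling them.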
* Re: [Lsf-pc] Authentication Contexts for network file systems and Containers was Re: [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-16 20:39 ` Authentication Contexts for network file systems and Containers was " Jeffrey Altman
@ 2017-01-16 21:03   ` James Bottomley
  2017-01-17 16:29     ` Jeffrey Altman
  0 siblings, 1 reply; 19+ messages in thread

From: James Bottomley @ 2017-01-16 21:03 UTC (permalink / raw)
To: Jeffrey Altman; +Cc: linux-fsdevel, containers, lsf-pc

On Mon, 2017-01-16 at 15:39 -0500, Jeffrey Altman wrote:
> On 1/16/2017 12:46 PM, James Bottomley wrote:
> >
> > For identity, doesn't the UTS namespace do this?  If not, what is
> > missing?
> >
> > James
>
> James,
>
> Thanks for posing the question.
>
> Unless I'm missing something, the UTS namespace permits an alternate
> 'hostname' and NIS 'domainname' to be specified for local visibility
> to the processes running in the container.
>
> For an /afs network file system client (kafs, OpenAFS or AuriStorFS)
> the kernel module must be able to associate each process with an
> authentication context.  The AFS family of file systems have
> implemented this binding as part of its Process Authentication Group
> (PAG) concept.  A PAG is a set of processes that share an
> authentication context.  The authentication context includes:
[...]

OK, so snipping all the details: it's a per-process property and
inherited; I don't even see that it needs anything container specific.
The pid namespace should be sufficient to keep any potential security
leaks contained, and the inheritance model should just work with
containers.

> While a file system can internally create an association between an
> authentication context and a file descriptor once it is created, and
> with pages for write-back, I believe there would be benefit from a
> more generic method of tracking authentication contexts in file
> descriptors and pages.  In particular, there would be better-defined
> behavior when a file has been opened for "write" from processes
> associated with more than one authentication context.

As long as an "authentication" becomes a property of a file descriptor
(like a token), then I don't see any container problems: fds are
namespace blind, so they can be passed between containers and your
authorizations would go with them.  If you need to go back to a process
as part of the authorization, then there would be problems, because
processes are namespaced.

> For example, the problems that AFS is currently experiencing with
> systemd.  A good description of the problem by Jonathan Billings can
> be found at
>
> https://docs.google.com/document/d/1P27fP1uj-C8QdxDKMKtI-Qh00c5_9zJa4
> YHjn=pB6ODM/pub

This is giving me "Sorry, the file you have requested does not exist."

James
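[Editor's note: the fd-passing James refers to is the generic Linux SCM_RIGHTS mechanism over a Unix-domain socket, exposed in Python 3.9+ as socket.send_fds/recv_fds. The sketch below illustrates his point that a file descriptor, and whatever open-file state rides on it, crosses a process boundary intact; nothing here is AFS-specific.]

```python
import os
import socket
import tempfile

# A file opened by the "sender"; the duplicated fd delivered via
# SCM_RIGHTS shares the same open file description (offset, mode, and
# any per-fd state a filesystem attaches to it).
with tempfile.TemporaryFile() as f:
    f.write(b"authorized data")
    f.flush()

    sender, receiver = socket.socketpair()  # AF_UNIX stream pair
    # SCM_RIGHTS ancillary data: the kernel installs a duplicate of the
    # fd into the receiving process's fd table.
    socket.send_fds(sender, [b"x"], [f.fileno()])
    msg, fds, flags, addr = socket.recv_fds(receiver, 1, 1)

    received = os.fdopen(fds[0], "rb")
    received.seek(0)
    data = received.read()

    received.close()
    sender.close()
    receiver.close()

print(data)  # b'authorized data'
```

In a real cross-container scenario the two ends would be separate processes in different namespaces rather than one process with a socketpair, but the kernel-side mechanism is identical, which is why the authorization attached to the fd travels with it.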
* Re: [Lsf-pc] Authentication Contexts for network file systems and Containers was Re: [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-16 21:03 ` [Lsf-pc] " James Bottomley
@ 2017-01-17 16:29   ` Jeffrey Altman
  2017-01-17 16:34     ` Trond Myklebust
  0 siblings, 1 reply; 19+ messages in thread

From: Jeffrey Altman @ 2017-01-17 16:29 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-fsdevel, containers, lsf-pc

On 1/16/2017 4:03 PM, James Bottomley wrote:
> [...]
>
> OK, so snipping all the details: it's a per-process property and
> inherited; I don't even see that it needs anything container specific.
> The pid namespace should be sufficient to keep any potential security
> leaks contained, and the inheritance model should just work with
> containers.

Agreed.

> > While a file system can internally create an association between an
> > authentication context and a file descriptor once it is created, and
> > with pages for write-back, I believe there would be benefit from a
> > more generic method of tracking authentication contexts in file
> > descriptors and pages.  In particular, there would be better-defined
> > behavior when a file has been opened for "write" from processes
> > associated with more than one authentication context.
>
> As long as an "authentication" becomes a property of a file descriptor
> (like a token), then I don't see any container problems: fds are
> namespace blind, so they can be passed between containers and your
> authorizations would go with them.  If you need to go back to a process
> as part of the authorization, then there would be problems because
> processes are namespaced.
>
> > For example, the problems that AFS is currently experiencing with
> > systemd.  A good description of the problem by Jonathan Billings can
> > be found at
> >
> > https://docs.google.com/document/d/1P27fP1uj-C8QdxDKMKtI-Qh00c5_9zJa4
> > YHjn=pB6ODM/pub
>
> This is giving me "Sorry, the file you have requested does not exist."

Not sure how an extra '=' got in there.

https://docs.google.com/document/d/1P27fP1uj-C8QdxDKMKtI-Qh00c5_9zJa4YHjnpB6ODM/pub

Jeffrey Altman
* Re: [Lsf-pc] Authentication Contexts for network file systems and Containers was Re: [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-17 16:29 ` Jeffrey Altman
@ 2017-01-17 16:34   ` Trond Myklebust
  2017-01-17 17:10     ` Jeffrey Altman
  0 siblings, 1 reply; 19+ messages in thread

From: Trond Myklebust @ 2017-01-17 16:34 UTC (permalink / raw)
To: jaltman, James.Bottomley; +Cc: containers, lsf-pc, linux-fsdevel

On Tue, 2017-01-17 at 11:29 -0500, Jeffrey Altman wrote:
> On 1/16/2017 4:03 PM, James Bottomley wrote:
> > [...]
> >
> > OK, so snipping all the details: it's a per-process property and
> > inherited; I don't even see that it needs anything container
> > specific.  The pid namespace should be sufficient to keep any
> > potential security leaks contained, and the inheritance model
> > should just work with containers.
>
> Agreed.
>
> > > While a file system can internally create an association between
> > > an authentication context and a file descriptor once it is
> > > created, and with pages for write-back, I believe there would be
> > > benefit from a more generic method of tracking authentication
> > > contexts in file descriptors and pages.  In particular, there
> > > would be better-defined behavior when a file has been opened for
> > > "write" from processes associated with more than one
> > > authentication context.
> >
> > As long as an "authentication" becomes a property of a file
> > descriptor (like a token), then I don't see any container problems:
> > fds are namespace blind, so they can be passed between containers
> > and your authorizations would go with them.  If you need to go back
> > to a process as part of the authorization, then there would be
> > problems because processes are namespaced.
> >
> > > For example, the problems that AFS is currently experiencing with
> > > systemd.  A good description of the problem by Jonathan Billings
> > > can be found at
> > >
> > > https://docs.google.com/document/d/1P27fP1uj-C8QdxDKMKtI-Qh00c5_9
> > > zJa4
> > > YHjn=pB6ODM/pub
> >
> > This is giving me "Sorry, the file you have requested does not
> > exist."
>
> Not sure how an extra '=' got in there.
>
> https://docs.google.com/document/d/1P27fP1uj-C8QdxDKMKtI-Qh00c5_9zJa4
> YHjnpB6ODM/pub
>
> Jeffrey Altman

There is the usual problem when you have to do an upcall in order to
set up the authentication context for session-based protocols, such as
RPCSEC_GSS.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com
* Re: [Lsf-pc] Authentication Contexts for network file systems and Containers was Re: [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-17 16:34 ` Trond Myklebust
@ 2017-01-17 17:10   ` Jeffrey Altman
  0 siblings, 0 replies; 19+ messages in thread

From: Jeffrey Altman @ 2017-01-17 17:10 UTC (permalink / raw)
To: Trond Myklebust, James.Bottomley; +Cc: containers, lsf-pc, linux-fsdevel

On 1/17/2017 11:34 AM, Trond Myklebust wrote:
> >
> > https://docs.google.com/document/d/1P27fP1uj-C8QdxDKMKtI-Qh00c5_9zJa4
> > YHjnpB6ODM/pub
> >
> > Jeffrey Altman
>
> There is the usual problem when you have to do an upcall in order to
> set up the authentication context for session-based protocols, such as
> RPCSEC_GSS.

Trond,

Thanks for the thought, but that is not the issue here.  systemd --user
launches processes as the user, but those processes do not share the
same keyring as the processes started from the PAM stack at logon.
Since the keyring doesn't match, the processes started by systemd --user
are in a different authentication context.  Setting the effective 'uid'
is insufficient to gain access to the proper authentication context.

I agree that upcalls are often a problem, which is why the AFS family of
protocols does not use them.  Typically a process is created in userland
for each PAG to push refreshed credentials to the kernel module.

Jeffrey Altman
* Re: [Lsf-pc] [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-15 23:38 [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems Oleg Drokin
  2017-01-16 17:17 ` J. Bruce Fields
@ 2017-01-16 17:32 ` James Bottomley
  2017-01-16 18:02   ` Oleg Drokin
  1 sibling, 1 reply; 19+ messages in thread

From: James Bottomley @ 2017-01-16 17:32 UTC (permalink / raw)
To: Oleg Drokin, lsf-pc; +Cc: linux-fsdevel, containers

On Sun, 2017-01-15 at 18:38 -0500, Oleg Drokin wrote:
> Container support in filesystems is also very relevant to us, since
> Lustre is used more and more in such settings.

I've added the containers ML to the cc just in case.  Can you add more
colour to this, please?  What container support for filesystems do you
think we need beyond the user namespace in the superblock?

James
* Re: [Lsf-pc] [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-16 17:32 ` [Lsf-pc] " James Bottomley
@ 2017-01-16 18:02   ` Oleg Drokin
  2017-01-16 18:21     ` James Bottomley
  0 siblings, 1 reply; 19+ messages in thread

From: Oleg Drokin @ 2017-01-16 18:02 UTC (permalink / raw)
To: James Bottomley; +Cc: lsf-pc, linux-fsdevel, containers

On Jan 16, 2017, at 12:32 PM, James Bottomley wrote:

> On Sun, 2017-01-15 at 18:38 -0500, Oleg Drokin wrote:
>> Container support in filesystems is also very relevant to us, since
>> Lustre is used more and more in such settings.
>
> I've added the containers ML to the cc just in case.  Can you add more
> colour to this, please?  What container support for filesystems do you
> think we need beyond the user namespace in the superblock?

Namespace access is necessary, and we might need it before the
superblock exists too (say, during mount we might need Kerberos
credentials fetched to properly authenticate this mount instance to the
server).

Separately, I know that e.g. NFS tries to match underlying mounts to
share them "under the hood", so I imagine a single superblock might
potentially be used with several namespaces.  In Lustre it might be
beneficial to do something like this too, in order to conserve resources
and potentially get better fs cache sharing.

In fact the whole caching thing is somewhat complicated by memory
cgroups too, and if we allow shared caching between several containers,
it would become even more complicated.

I am sure there are also a bunch of pitfalls in this area that we do not
realize yet but that other people have already encountered, and it would
be useful to find out about them.
* Re: [Lsf-pc] [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-16 18:02 ` Oleg Drokin
@ 2017-01-16 18:21   ` James Bottomley
  2017-01-16 18:39     ` Oleg Drokin
  0 siblings, 1 reply; 19+ messages in thread

From: James Bottomley @ 2017-01-16 18:21 UTC (permalink / raw)
To: Oleg Drokin; +Cc: linux-fsdevel, containers, lsf-pc

On Mon, 2017-01-16 at 13:02 -0500, Oleg Drokin wrote:
> On Jan 16, 2017, at 12:32 PM, James Bottomley wrote:
>
> > On Sun, 2017-01-15 at 18:38 -0500, Oleg Drokin wrote:
> > > Container support in filesystems is also very relevant to us,
> > > since Lustre is used more and more in such settings.
> >
> > I've added the containers ML to the cc just in case.  Can you add
> > more colour to this, please?  What container support for
> > filesystems do you think we need beyond the user namespace in the
> > superblock?
>
> Namespace access is necessary, and we might need it before the
> superblock exists too (say, during mount we might need Kerberos
> credentials fetched to properly authenticate this mount instance to
> the server).

The superblock namespace is mostly for uid/gid changes across the
kernel <-> filesystem boundary.

The actual container namespaces will already be set up by the time the
mount is done (assuming mount within a container), so you have them all
present.  Usually you get the information for credentials from a
combination of the UTS namespace (host/domain name) and the mount
namespace (credentials provisioned to the container filesystem).

Perhaps if you described the actual problem you're seeing rather than
trying to relate it to what I said about the superblock namespace (which
is probably irrelevant), we could figure out what the issue is.

James
* Re: [Lsf-pc] [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems.
  2017-01-16 18:21 ` James Bottomley
@ 2017-01-16 18:39   ` Oleg Drokin
  2017-01-16 20:58     ` James Bottomley
  0 siblings, 1 reply; 19+ messages in thread

From: Oleg Drokin @ 2017-01-16 18:39 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-fsdevel, containers, lsf-pc

On Jan 16, 2017, at 1:21 PM, James Bottomley wrote:

> On Mon, 2017-01-16 at 13:02 -0500, Oleg Drokin wrote:
>> Namespace access is necessary, and we might need it before the
>> superblock exists too (say, during mount we might need Kerberos
>> credentials fetched to properly authenticate this mount instance to
>> the server).
>
> The superblock namespace is mostly for uid/gid changes across the
> kernel <-> filesystem boundary.

That's at the kernel<->filesystem boundary, but inside the FS there
might be other considerations that you want to attach there.  Say, when
you are encrypting the traffic to the server, you want to use the right
keys.  It's all relatively easy when you have a separate mount, so you
can store the credentials in the superblock, but then we lose on cache
sharing, for example (I don't know how important that is).

> The actual container namespaces will already be set up by the time the
> mount is done (assuming mount within a container), so you have them
> all present.  Usually you get the information for credentials from a
> combination of the UTS namespace (host/domain name) and the mount
> namespace (credentials provisioned to the container filesystem).

Yes, when mounting from a container it's possible to fetch this info and
pass it around.  Is mounting from outside of the container important
too?

> Perhaps if you described the actual problem you're seeing rather than
> trying to relate it to what I said about the superblock namespace
> (which is probably irrelevant), we could figure out what the issue is.

Right now the deployments are simple and we do not have any major issues
(other than certain caching overzealousness that throws cgroup memory
accounting off), but we want to learn what other problems exist in this
space and what we should be looking for.
* Re: [Lsf-pc] [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems. 2017-01-16 18:39 ` Oleg Drokin @ 2017-01-16 20:58 ` James Bottomley 2017-01-17 7:00 ` Oleg Drokin 0 siblings, 1 reply; 19+ messages in thread From: James Bottomley @ 2017-01-16 20:58 UTC (permalink / raw) To: Oleg Drokin; +Cc: linux-fsdevel, containers, lsf-pc On Mon, 2017-01-16 at 13:39 -0500, Oleg Drokin wrote: > On Jan 16, 2017, at 1:21 PM, James Bottomley wrote: > > > On Mon, 2017-01-16 at 13:02 -0500, Oleg Drokin wrote: > > > On Jan 16, 2017, at 12:32 PM, James Bottomley wrote: > > > > > > > On Sun, 2017-01-15 at 18:38 -0500, Oleg Drokin wrote: > > > > > A container support from filesystems is also very relevant > > > > > to us since Lustre is used more and more in such settings. > > > > > > > > I've added the containers ML to the cc just in case. Can you > > > > add more colour to this, please? What container support for > > > > filesystems do you think we need beyond the user namespace in > > > > the superblock? > > > > > > Namespace access is necessary, we might need it before the > > > superblock is there too (say during mount we might need kerberos > > > credentials fetched to properly authenticate this mount instance > > > to the server). > > > > The superblock namespace is mostly for uid/gid changes across the > > kernel <-> filesystem boundary. > > That's on the kernel<->filesystem, but inside of the FS there might > be other considerations that you might want to attach there. > Say when you are encrypting the traffic to the server you want > to use the right keys. So this is the keyring namespace? It was mentioned at KS, but, as far as I can tell, not discussed in the Containers MC that followed, so I've no idea what the status is. > It's all relatively easy when you have a separate mount there, so > you can store the credentials in the superblock, but we lose on the > cache sharing, for example (I don't know how important that is). 
It depends what you mean by "cache sharing". If you're thinking of the page cache, then it all just works, provided the underlying inode doesn't change. If you're in the situation where the container orchestration system knows that two files are the same but there's been a change of underlying device (fuse passthrough, say) so the inode is different (the docker double caching problem) and you need some way of forcibly combining them in the page cache, that was discussed a couple of years ago, and Virtuozzo people have patches, but I haven't seen much upstream agreement. > > The actual container namespaces will already be set up by the time > > the mount is done (assuming mount within a container), so you have > > them all present. Usually you get the information for credentials > > from a combination of the UTS namespace (host/domain name) and the > > mount namespace (credentials provisioned to container filesystem). > > Yes, when mounting from a container it's possible to fetch this info > and pass it around; is mounting from outside of the container > important too? Mounting from outside the container usually involves entering the container and performing the mount. However, the way you enter the container can pull stuff in from outside (like file descriptors). > > Perhaps if you described the actual problem you're seeing rather > > than try to relate it to what I said about superblock namespace > > (which is probably irrelevant), we could figure out what the issue > > is. > > Right now the deployments are simple and we do not have any major > issues (other than certain caching overzealousness that throws cgroup > memory accounting off), but we are interested in learning what other > problems exist in this space and what we should be looking for. You might need to canvass the other users to see if there is anything viable to discuss. James ^ permalink raw reply [flat|nested] 19+ messages in thread
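James's point that the page cache "just works, provided the underlying inode doesn't change" can be sketched from user space: two directory entries that resolve to the same inode share one struct inode, and therefore one address_space and one set of cached pages. A minimal sketch (Python; the file names are hypothetical, and the hard link stands in for two paths reaching the same inode):

```python
import os
import tempfile

# Two names for the same inode: data cached via one name is the same
# cached data via the other, because the page cache hangs off the inode.
d = tempfile.mkdtemp()
a = os.path.join(d, "a")
b = os.path.join(d, "b")
with open(a, "w") as f:
    f.write("cached once\n")
os.link(a, b)  # hard link: new dentry, same inode

sa, sb = os.stat(a), os.stat(b)
# An identical (st_dev, st_ino) pair means the kernel sees one inode,
# hence one address_space; a different st_dev (e.g. two superblocks for
# the same network file) means two inodes and two independent caches.
print(sa.st_dev == sb.st_dev and sa.st_ino == sb.st_ino)  # -> True
```

The converse is the double-caching case from the message above: when the same remote file shows up under two superblocks, the (st_dev, st_ino) pair differs and nothing in the VFS combines the cached pages.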
* Re: [Lsf-pc] [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems. 2017-01-16 20:58 ` James Bottomley @ 2017-01-17 7:00 ` Oleg Drokin 2017-01-17 14:26 ` James Bottomley 2017-01-17 14:56 ` James Bottomley 0 siblings, 2 replies; 19+ messages in thread From: Oleg Drokin @ 2017-01-17 7:00 UTC (permalink / raw) To: James Bottomley; +Cc: linux-fsdevel, containers, lsf-pc On Jan 16, 2017, at 3:58 PM, James Bottomley wrote: > On Mon, 2017-01-16 at 13:39 -0500, Oleg Drokin wrote: >> On Jan 16, 2017, at 1:21 PM, James Bottomley wrote: >> >>> On Mon, 2017-01-16 at 13:02 -0500, Oleg Drokin wrote: >>>> On Jan 16, 2017, at 12:32 PM, James Bottomley wrote: >>>> >>>>> On Sun, 2017-01-15 at 18:38 -0500, Oleg Drokin wrote: >>>>>> A container support from filesystems is also very relevant >>>>>> to us since Lustre is used more and more in such settings. >>>>> >>>>> I've added the containers ML to the cc just in case. Can you >>>>> add more colour to this, please? What container support for >>>>> filesystems do you think we need beyond the user namespace in >>>>> the superblock? >>>> >>>> Namespace access is necessary, we might need it before the >>>> superblock is there too (say during mount we might need kerberos >>>> credentials fetched to properly authenticate this mount instance >>>> to the server). >>> >>> The superblock namespace is mostly for uid/gid changes across the >>> kernel <-> filesystem boundary. >> >> That's on the kernel<->filesystem, but inside of the FS there might >> be other considerations that you might want to attach there. >> Say when you are encrypting the traffic to the server you want >> to use the right keys. > > So this is the keyring namespace? It was mentioned at KS, but, as far > as I can tell, not discussed in the Containers MC that followed, so > I've no idea what the status is. Could be keyring or other mechanisms. 
>> It's all relatively easy when you have a separate mount there, so >> you can store the credentials in the superblock, but we lose on the >> cache sharing, for example (I don't know how important that is). > > It depends what you mean by "cache sharing". If you're thinking of the > page cache, then it all just works, provided the underlying inode > doesn't change. If you're in the situation where the container It only "just works" if the superblock is the same; if there's a separate mount per container with a separate superblock, then there's no sharing at all. Accounting of said "shared" cache might be interesting too: which of the containers would you account it against? All of them? >>> Perhaps if you described the actual problem you're seeing rather >>> than try to relate it to what I said about superblock namespace >>> (which is probably irrelevant), we could figure out what the issue >>> is. >> >> Right now the deployments are simple and we do not have any major >> issues (other than certain caching overzealousness that throws cgroup >> memory accounting off), but we are interested in learning what other >> problems exist in this space and what we should be looking for. > > You might need to canvass the other users to see if there is anything > viable to discuss. This is what I am trying to do with this email in part, I guess. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lsf-pc] [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems. 2017-01-17 7:00 ` Oleg Drokin @ 2017-01-17 14:26 ` James Bottomley 2017-01-17 17:41 ` Oleg Drokin 2017-01-17 14:56 ` James Bottomley 1 sibling, 1 reply; 19+ messages in thread From: James Bottomley @ 2017-01-17 14:26 UTC (permalink / raw) To: Oleg Drokin; +Cc: linux-fsdevel, containers, lsf-pc On Tue, 2017-01-17 at 02:00 -0500, Oleg Drokin wrote: > On Jan 16, 2017, at 3:58 PM, James Bottomley wrote: > > > On Mon, 2017-01-16 at 13:39 -0500, Oleg Drokin wrote: > > > On Jan 16, 2017, at 1:21 PM, James Bottomley wrote: > > > > > > > On Mon, 2017-01-16 at 13:02 -0500, Oleg Drokin wrote: > > > > > On Jan 16, 2017, at 12:32 PM, James Bottomley wrote: > > > > > > > > > > > On Sun, 2017-01-15 at 18:38 -0500, Oleg Drokin wrote: > > > > > > > A container support from filesystems is also very > > > > > > > relevant to us since Lustre is used more and more in > > > > > > > such settings. > > > > > > > > > > > > I've added the containers ML to the cc just in case. Can > > > > > > you add more colour to this, please? What container > > > > > > support for filesystems do you think we need beyond the > > > > > > user namespace in the superblock? > > > > > > > > > > Namespace access is necessary, we might need it before the > > > > > superblock is there too (say during mount we might need > > > > > kerberos credentials fetched to properly authenticate this > > > > > mount instance to the server). > > > > > > > > The superblock namespace is mostly for uid/gid changes across > > > > the kernel <-> filesystem boundary. > > > > > > That's on the kernel<->filesystem, but inside of the FS there > > > might be other considerations that you might want to attach > > > there. Say when you are encrypting the traffic to the server you > > > want to use the right keys. > > > > So this is the keyring namespace? 
It was mentioned at KS, but, as > > far as I can tell, not discussed in the Containers MC that > > followed, so I've no idea what the status is. > > Could be keyring or other mechanisms. OK, you need to agree on the mechanism first, then we can discuss if it needs OS virtualization. A large number of mechanisms in the kernel actually don't (because the current OS protections are strong enough), like file descriptors. After you understand the mechanism there are usually four main ways to do OS virtualization: 1. Do nothing because the object doesn't need it (fd) 2. Label namespace because it needs isolation (network) 3. Add to user namespace because you need privileged access (setns call) 4. Add to cgroup because the resource needs to be accounted (mem) But before we get into that we need to know the properties of the mechanism. James ^ permalink raw reply [flat|nested] 19+ messages in thread
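The four options above all hinge on whether a kernel object is namespaced. On Linux, the namespaces a task belongs to are visible as symlinks under /proc/self/ns, and the inode number in each link target identifies the namespace object that setns(2) attaches to. A small Linux-only sketch (which entries appear depends on kernel version and config):

```python
import os

NS_DIR = "/proc/self/ns"

# Each symlink here names a namespace type (mnt, uts, net, user, ...)
# and points at "type:[inode]"; the inode number identifies the kernel
# namespace object. setns(2) takes an fd opened on one of these links.
for name in sorted(os.listdir(NS_DIR)):
    target = os.readlink(os.path.join(NS_DIR, name))
    print(f"{name:8s} {target}")
```

For example, mounting "from outside" a container in the sense discussed above typically means opening the container's /proc/&lt;pid&gt;/ns/mnt and calling setns() on it before performing the mount.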
* Re: [Lsf-pc] [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems. 2017-01-17 14:26 ` James Bottomley @ 2017-01-17 17:41 ` Oleg Drokin 0 siblings, 0 replies; 19+ messages in thread From: Oleg Drokin @ 2017-01-17 17:41 UTC (permalink / raw) To: James Bottomley; +Cc: linux-fsdevel, containers, lsf-pc On Jan 17, 2017, at 9:26 AM, James Bottomley wrote: > On Tue, 2017-01-17 at 02:00 -0500, Oleg Drokin wrote: >> On Jan 16, 2017, at 3:58 PM, James Bottomley wrote: >> >>> On Mon, 2017-01-16 at 13:39 -0500, Oleg Drokin wrote: >>>> On Jan 16, 2017, at 1:21 PM, James Bottomley wrote: >>>> >>>>> On Mon, 2017-01-16 at 13:02 -0500, Oleg Drokin wrote: >>>>>> On Jan 16, 2017, at 12:32 PM, James Bottomley wrote: >>>>>> >>>>>>> On Sun, 2017-01-15 at 18:38 -0500, Oleg Drokin wrote: >>>>>>>> A container support from filesystems is also very >>>>>>>> relevant to us since Lustre is used more and more in >>>>>>>> such settings. >>>>>>> >>>>>>> I've added the containers ML to the cc just in case. Can >>>>>>> you add more colour to this, please? What container >>>>>>> support for filesystems do you think we need beyond the >>>>>>> user namespace in the superblock? >>>>>> >>>>>> Namespace access is necessary, we might need it before the >>>>>> superblock is there too (say during mount we might need >>>>>> kerberos credentials fetched to properly authenticate this >>>>>> mount instance to the server). >>>>> >>>>> The superblock namespace is mostly for uid/gid changes across >>>>> the kernel <-> filesystem boundary. >>>> >>>> That's on the kernel<->filesystem, but inside of the FS there >>>> might be other considerations that you might want to attach >>>> there. Say when you are encrypting the traffic to the server you >>>> want to use the right keys. >>> >>> So this is the keyring namespace? It was mentioned at KS, but, as >>> far as I can tell, not discussed in the Containers MC that >>> followed, so I've no idea what the status is. 
>> >> Could be keyring or other mechanisms. > > OK, you need to agree on the mechanism first, then we can discuss if it > needs OS virtualization. A large number of mechanisms in the kernel > actually don't (because the current OS protections are strong enough), > like file descriptors. After you understand the mechanism there are > usually four main ways to do OS virtualization: > > 1. Do nothing because the object doesn't need it (fd) > 2. Label namespace because it needs isolation (network) > 3. Add to user namespace because you need privileged access (setns > call) > 4. Add to cgroup because the resource needs to be accounted (mem) > > But before we get into that we need to know the properties of the > mechanism. Right, I just checked, and we are actually using a keyring that is per-namespace even for kerberos, so that's enough for us there so far, as long as we can attach to it (and we can when we know where the request originated from). ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lsf-pc] [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems. 2017-01-17 7:00 ` Oleg Drokin 2017-01-17 14:26 ` James Bottomley @ 2017-01-17 14:56 ` James Bottomley 1 sibling, 0 replies; 19+ messages in thread From: James Bottomley @ 2017-01-17 14:56 UTC (permalink / raw) To: Oleg Drokin; +Cc: linux-fsdevel, containers, lsf-pc On Tue, 2017-01-17 at 02:00 -0500, Oleg Drokin wrote: > On Jan 16, 2017, at 3:58 PM, James Bottomley wrote: > > > On Mon, 2017-01-16 at 13:39 -0500, Oleg Drokin wrote: > > > It's all relatively easy when you have a separate mount there, so > > > you can store the credentials in the superblock, but we lose on > > > the cache sharing, for example (I don't know how important that > > > is). > > > > It depends what you mean by "cache sharing". If you're thinking of > > the page cache, then it all just works, provided the underlying > > inode doesn't change. If you're in the situation where the > > container > > It only "just works" if the superblock is the same; if there's a > separate mount per container with a separate superblock, then there's > no sharing at all. Accounting of said "shared" cache might be > interesting too: which of the containers would you account it against? > All of them? Well, caching is done per address_space, which can be per inode, and as you found, inodes are usually per superblock. There are (dirty) tricks you can do to force sharing at the address_space level if you know it's the same file. There was also mention of a ksm-like mechanism to force the sharing. Like I said, it was the VZ people who had patches. James ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2017-01-17 17:42 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-01-15 23:38 [LSF/MM ATTEND] FS jitter testing, network caching, Lustre, cluster filesystems Oleg Drokin 2017-01-16 17:17 ` J. Bruce Fields 2017-01-16 17:23 ` Jeffrey Altman 2017-01-16 17:42 ` Chuck Lever 2017-01-16 17:46 ` James Bottomley 2017-01-16 20:39 ` Authentication Contexts for network file systems and Containers was " Jeffrey Altman 2017-01-16 21:03 ` [Lsf-pc] " James Bottomley 2017-01-17 16:29 ` Jeffrey Altman 2017-01-17 16:34 ` Trond Myklebust 2017-01-17 17:10 ` Jeffrey Altman 2017-01-16 17:32 ` [Lsf-pc] " James Bottomley 2017-01-16 18:02 ` Oleg Drokin 2017-01-16 18:21 ` James Bottomley 2017-01-16 18:39 ` Oleg Drokin 2017-01-16 20:58 ` James Bottomley 2017-01-17 7:00 ` Oleg Drokin 2017-01-17 14:26 ` James Bottomley 2017-01-17 17:41 ` Oleg Drokin 2017-01-17 14:56 ` James Bottomley