From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id ; Thu, 3 Oct 2002 17:40:41 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id ; Thu, 3 Oct 2002 17:40:40 -0400
Received: from dell-paw-3.cambridge.redhat.com ([195.224.55.237]:17401
	"EHLO executor.cambridge.redhat.com") by vger.kernel.org with ESMTP
	id ; Thu, 3 Oct 2002 17:40:35 -0400
To: David Howells , Linus Torvalds , linux-kernel@vger.kernel.org
Subject: Re: [PATCH] AFS filesystem for Linux (2/2)
In-Reply-To: Message from Jan Harkes of "Thu, 03 Oct 2002 12:53:04 EDT."
	<20021003165304.GA25718@ravel.coda.cs.cmu.edu>
User-Agent: EMH/1.14.1 SEMI/1.14.3 (Ushinoya) FLIM/1.14.3
	(=?ISO-8859-4?Q?Unebigory=F2mae?=) APEL/10.3 Emacs/21.2
	(i686-pc-linux-gnu) MULE/5.0 (SAKAKI)
MIME-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=US-ASCII
Date: Thu, 03 Oct 2002 22:46:05 +0100
Message-ID: <15361.1033681565@warthog.cambridge.redhat.com>
From: David Howells
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Jan,

Do I take it you were (partially) responsible for Coda development? I have
to admit I don't know much about Coda.

> So you want to eventually link kerberos into the kernel to get the
> security right?

That's unnecessary, judging by OpenAFS. AFAICT only the ticket needs to be
cached in the kernel (it is obtained by means of a userspace program), and
the ticket is then passed through the security challenge/response mechanism
provided by RxRPC. Otherwise, I suspect, the entire network side of OpenAFS
would have to be in userspace too.

It may be possible to offload the security aspects to userspace. I'll have
to think about that. Besides, I get the impression that NFSv4 may require
some level of Kerberos support in the kernel.

> Coda 'solves' the page-aliasing issues by passing the kernel the same file
> descriptor as it is using itself to put the data into the container (cache)
> file. You could do the same and tell the kernel what the 'expected size' is,
> it can then block or trigger further fetches when that part of the file
> isn't available yet.

I presume Coda uses a 1:1 mapping between Coda files and cache files stored
on a local filesystem (such as EXT3). If so, how do you detect holes in the
file, given that the underlying fs doesn't permit you to differentiate
between a hole and a block of zeros?
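One way to spot a hole from userspace on a stock filesystem would be to ask
it for its block mapping, for instance with the FIBMAP ioctl (assuming the
fs implements bmap at all). A quick illustration of the idea only - it needs
root and does one ioctl per block, so it's not something I'd want a cache to
depend on:

/* Illustration only: walk a file and report which blocks are mapped.
 * FIBMAP returns a physical block number of 0 for an unmapped (hole)
 * block, which is how it can be told apart from a written block of
 * zeros.  Needs root (CAP_SYS_RAWIO). */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>	/* FIBMAP, FIGETBSZ */

int main(int argc, char *argv[])
{
	struct stat st;
	int fd, bsz, nblocks, n;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0 || ioctl(fd, FIGETBSZ, &bsz) < 0)
		return 1;

	nblocks = (st.st_size + bsz - 1) / bsz;
	for (n = 0; n < nblocks; n++) {
		int block = n;	/* in: logical block; out: physical block */

		if (ioctl(fd, FIBMAP, &block) < 0)
			return 1;
		printf("block %d: %s\n", n, block ? "mapped" : "hole");
	}
	close(fd);
	return 0;
}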
> We don't need to do it at such a granualarity because of the disconnected
> operation. It is more reliable as we can return a stale copy when we lose
> the network halfway during the fetch.

OTOH, if you have a copy that you know is out of date, one could argue that
you shouldn't let the user or application see it, as anything they then do
is based on known "bad" data.

Should I also take it that Coda keeps the old file around until it has
fetched a revised copy? If so, then surely you can't update a file unless
your cache can find room for the entire revised copy. Another consequence,
surely, is that the practical maximum file size you can deal with is half
the size of your cache.

> Hmm, a version of AFS that doesn't adhere to AFS semantics, interesting.
> Are you going to emulate the same broken behaviour as transarc AFS on
> O_RDWR? Basically when you open a file O_RDWR and write some data, and
> anyone else 'commits' an update to the file before you close the
> filehandle. Your client writes back the previously committed data, which it
> has proactively fetched, but with the local metadata (i.e. i_size). So you
> end up with something that closely resembles neither of the actual versions
> that were written.

What I'm intending to do is have the write VFS method attempt to write the
new data directly to the server and to the cache simultaneously where
possible. If the volume is not available for some reason, I have a number of
choices:

 (1) Make the write block until the volume becomes available again.

 (2) Immediately(-ish) fail with an error.

 (3) Store the write in the cache and try to sync up with the volume when it
     becomes available again.

However, with shared writable mappings this isn't necessarily possible, as
we can only really get hold of the data when the VM prods our writepage(s)
method. In this case, we have another choice:

 (4) "Diff" the page in the pagecache against a copy stored in the cache and
     try to send the changes to the server (there's a small demonstration of
     the diff part below).

Using disconnected operation doesn't actually make this any easier: the
problem of how and when write conflicts are resolved still arises.

There is a fifth option, and that is to try to lock the target file against
other accessors whilst we are trying to write to it (prepare/commit write,
maybe).
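To show what I mean by the diff in (4), here's the idea in miniature as a
plain userspace demo (nothing AFS-specific about it): compare the page from
the pagecache with the copy held in the cache and pick out the byte ranges
that actually need sending.

/* Demo of the "diff the dirty page against the cached copy" idea from
 * option (4): find the byte ranges that differ so that only those need
 * to be pushed to the server. */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Report each modified [start, end) byte range in new[] vs old[]. */
static void diff_page(const unsigned char *old, const unsigned char *new,
		      size_t size)
{
	size_t i = 0, start;

	while (i < size) {
		if (old[i] == new[i]) {
			i++;
			continue;
		}
		start = i;
		while (i < size && old[i] != new[i])
			i++;
		printf("dirty range: %lu..%lu (%lu bytes)\n",
		       (unsigned long)start, (unsigned long)i,
		       (unsigned long)(i - start));
	}
}

int main(void)
{
	static unsigned char cached[PAGE_SIZE], current[PAGE_SIZE];

	memcpy(current, cached, PAGE_SIZE);	/* start from the cached copy */
	memcpy(current + 100, "hello", 5);	/* simulate some local writes */
	memcpy(current + 2000, "world", 5);

	diff_page(cached, current, PAGE_SIZE);
	return 0;
}

The hard part isn't the diff itself, of course; it's deciding what to do
when the server's copy has moved on in the meantime, which is the conflict
resolution problem mentioned above.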
> Different underlying filesystems will lay out their data differently, who
> says that ext3 with the dirindex hashes or reiserfs, or foofs will not
> suddenly break your solution and still work reliable (and faster) from
> userspace.

Because (and I may not have made this clear) you nominate a block device as
the cache, not an already existing filesystem, and mount it as the afscache
filesystem type. _This_ specifies the layout of the cache, and so whatever
other filesystems do is irrelevant.

> Can you say hack.

No need to. I can go direct to the block device through the BIO system, and
so can throw a heap of requests at the blockdev and deal with them as they
complete, in the order they are read off the disc when scanning catalogues.

> When you can a file from userspace the kernel will give you readahead, and
> with a well working elevator any 'improvements' you obtain really should end
> up in the noise.

Since I can fire off several requests simultaneously, I effectively obtain a
readahead-type effect, and since I don't have to follow any ordering
constraints (my catalogues are unordered), I can deal with the blocks in
whatever order the elevator delivers them to me.

> Intermezzo does the same thing, they even proposed a 'punch hole' syscall to
> allow a userspace daemon to 'invalidate' parts of a file so that the kernel
> will send the upcall to refetch the data from the server.

I don't need a hole-punching syscall or ioctl. Apart from the fact that the
filesystem is already in the kernel and doesn't require a syscall, the cache
filesystem has to discard an entire file when it notices or is told of a
change.

> VM/VFS will handle appropriate readahead for you, you might just want to
> join the separate requests into one bigger request.

Agreed. That would be a reasonable way of doing it. The reason I thought of
doing it the way I suggested is that I could make the block size bigger in
the cache, and thus reduce index-walking latency for adjacent pages.

> And one definite advantage, you actually provide AFS session semantics.

According to the AFS-3 Architectural Overview, "AFS does _not_ provide for
completely disconnected operation of file system clients" [their emphasis].

Furthermore, the overview also talks about "Chunked Access", whereby files
are pulled over to the client and pushed back to the server in chunks of
64Kb, thus allowing "AFS files of any size to be accessed from a client".
Note that the 64Kb figure is a default that can be configured. It also
mentions that the read-entire-file notion was dropped, citing some of the
reasons I've already given.

> And my current development version of Coda has {cell,volume,vnode,unique}
> (128 bits), which is the same size as a UUID which was designed to have a
> very high probability of uniqueness. So if I ever consider adding another
> 'ident', I'll just switch to identifying each object with a UUID.

Does this mean that every Coda cell is issued with a 4-byte ID number? Or
does there need to be an additional index in the cache?

> How about IPv6?

These were just examples I know fairly well to illustrate the problems.

> Or you could use a hash or a userspace daemon that can map a fs-specific
> handle to a local cache file.

You still have to store a hash somewhere, and if it's stored in a userspace
daemon's VM, then it'll probably end up being swapped out to disc, and it
may have to be regenerated from indices every time the daemon is restarted
(or else the cache has to be started afresh).

Thanks for your insights, though.

Cheers,
David
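P.S. For concreteness, the sort of thing I mean by keeping the index on disc
rather than in a daemon's VM is a fixed-size catalogue record keyed by the
{cell,volume,vnode,unique} fid. Purely as an illustration of the shape of it
(the field names and layout below are made up for the example, not an actual
format):

#include <stdint.h>

/* Illustration only: a fixed-size on-disc catalogue record keyed by the
 * fid, so the cache can be reopened after a reboot without rebuilding any
 * state held in a userspace daemon. */
struct afscache_vnode_record {
	uint32_t	cell;		/* index into the cache's cell catalogue */
	uint32_t	volume;		/* AFS volume ID */
	uint32_t	vnode;		/* vnode number within the volume */
	uint32_t	unique;		/* vnode uniquifier */
	uint32_t	data_version;	/* server data version when fetched */
	uint32_t	first_block;	/* start of the cached data on disc */
	uint32_t	nblocks;	/* amount of cached data held */
	uint32_t	flags;		/* valid, being-fetched, etc. */
};

A record like that could be found again just by hashing the fid into a
bucket of the on-disc catalogue, so nothing has to be regenerated when the
machine or a daemon is restarted.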