* Correctly understanding Linux's close-to-open consistency
@ 2018-09-13  1:24 Chris Siebenmann
  2018-09-15 16:20 ` Jeff Layton
  0 siblings, 1 reply; 7+ messages in thread

From: Chris Siebenmann @ 2018-09-13  1:24 UTC (permalink / raw)
To: linux-nfs; +Cc: cks

 I'm trying to get my head around the officially proper way of writing
to NFS files (not just what works today, and what I think is supposed
to work, since I was misunderstanding things about that recently).

 Is it correct to say that when writing data to NFS files, the only
sequence of operations that Linux NFS clients officially support is the
following:

- all processes on all client machines close() the file
- one machine (a client or the fileserver) open()s the file, writes
  to it, and close()s again
- processes on client machines can now open() the file again for
  reading

Other sequences of operations may work in some particular kernel
version or under some circumstances, but are not guaranteed to work
over kernel version changes or in general.

 In an official 'we guarantee that if you do this, things will work'
sense, how does taking NFS locks interact with this required sequence?
Do NFS locks make some part of it unnecessary, or does it remain
necessary, with NFS locks just there to let you coordinate who holds a
magic 'you can write' token while you still officially need to close
and open and so on?

 Thanks in advance.

	- cks

^ permalink raw reply	[flat|nested] 7+ messages in thread
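The sequence outlined above maps directly onto syscalls. Below is a minimal single-process Python sketch of it; the path is hypothetical, and since everything here runs on one machine against a local filesystem, it only illustrates the shape of the sequence, not NFS's actual cross-client cache behavior (where the writer and readers would be on different machines):

```python
import os

# Hypothetical path; on a real deployment this would live on the NFS
# mount, with the writer and readers on different client machines.
path = "/tmp/cto-demo.txt"

# Steps 1-2 of the outline: everyone has the file closed; one writer
# open()s, writes, and close()s.  On NFS, close() flushes dirty pages
# back to the server.
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, b"hello")
os.close(fd)

# Step 3: a reader open()s afterward.  On NFS, the open revalidates the
# file's attributes, so the read below sees the newly written data.
fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 16)
os.close(fd)
os.remove(path)
print(data)  # b'hello'
```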
* Re: Correctly understanding Linux's close-to-open consistency
  2018-09-13  1:24 Correctly understanding Linux's close-to-open consistency Chris Siebenmann
@ 2018-09-15 16:20 ` Jeff Layton
  2018-09-15 19:11   ` Chris Siebenmann
  0 siblings, 1 reply; 7+ messages in thread

From: Jeff Layton @ 2018-09-15 16:20 UTC (permalink / raw)
To: Chris Siebenmann, linux-nfs

On Wed, 2018-09-12 at 21:24 -0400, Chris Siebenmann wrote:
>  I'm trying to get my head around the officially proper way of
> writing to NFS files (not just what works today, and what I think
> is supposed to work, since I was misunderstanding things about that
> recently).
> 
>  Is it correct to say that when writing data to NFS files, the only
> sequence of operations that Linux NFS clients officially support is
> the following:
> 
> - all processes on all client machines close() the file
> - one machine (a client or the fileserver) open()s the file, writes
>   to it, and close()s again
> - processes on client machines can now open() the file again for
>   reading

No.

One can always call fsync() to force data to be flushed to avoid the
close of the write fd in this situation. That's really a more portable
solution anyway. A local filesystem may not flush data to disk on close
(for instance), so calling fsync ensures you rely less on filesystem
implementation details.

The separate open by the reader just helps ensure that the file's
attributes are revalidated (so you can tell whether cached data you
hold is still valid).

> Other sequences of operations may work in some particular kernel
> version or under some circumstances, but are not guaranteed to work
> over kernel version changes or in general.

The NFS client (and the Linux kernel in general) will try to preserve
as much cached data as it can, but eventually it will end up being
freed, depending on the kernel's memory requirements. This is not
behavior you want to depend on as an application developer.
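The fsync() alternative described above can be sketched as follows (hypothetical path, single machine, so this shows only the call pattern, not real cross-client visibility):

```python
import os

path = "/tmp/fsync-demo.txt"  # hypothetical; imagine it on an NFS mount

fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, b"record 1\n")

# fsync() forces the dirty data out to the server; the writer can keep
# its descriptor open for further writes instead of close()ing.
os.fsync(fd)

# A reader that open()s now revalidates the file's attributes, so its
# cached view is checked against the server before the read below.
rfd = os.open(path, os.O_RDONLY)
seen = os.read(rfd, 64)
os.close(rfd)

os.close(fd)
os.remove(path)
print(seen)
```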
>  In an official 'we guarantee that if you do this, things will work'
> sense, how does taking NFS locks interact with this required
> sequence? Do NFS locks make some part of it unnecessary, or does it
> remain necessary, with NFS locks just there to let you coordinate who
> holds a magic 'you can write' token while you still officially need
> to close and open and so on?

If you use file locking (flock() or POSIX locks), then we treat those
as cache coherency points as well. The client will write back cached
data to the server prior to releasing a lock, and revalidate attributes
(and thus the local cache) after acquiring one.

If you have an application that does concurrent access via NFS over
multiple machines, then you probably want to be using file locking to
serialize things across machines.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread
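The locking pattern described above, sketched with flock() on a hypothetical local path (on NFS, the acquire and release would be the cache coherency points):

```python
import fcntl
import os

path = "/tmp/flock-demo.txt"  # hypothetical; imagine it on an NFS mount

fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)

# Acquiring the lock is a cache coherency point: the NFS client
# revalidates attributes (and thus its cached data) on acquire.
fcntl.flock(fd, fcntl.LOCK_EX)
os.write(fd, b"update\n")
# Releasing the lock writes cached dirty data back to the server first.
fcntl.flock(fd, fcntl.LOCK_UN)

os.lseek(fd, 0, os.SEEK_SET)
seen = os.read(fd, 64)
os.close(fd)
os.remove(path)
print(seen)
```

POSIX byte-range locks via fcntl.lockf() would bracket the I/O the same way.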
* Re: Correctly understanding Linux's close-to-open consistency
  2018-09-15 16:20 ` Jeff Layton
@ 2018-09-15 19:11   ` Chris Siebenmann
  2018-09-16 11:01     ` Jeff Layton
  0 siblings, 1 reply; 7+ messages in thread

From: Chris Siebenmann @ 2018-09-15 19:11 UTC (permalink / raw)
To: Jeff Layton; +Cc: Chris Siebenmann, linux-nfs

> On Wed, 2018-09-12 at 21:24 -0400, Chris Siebenmann wrote:
> >  Is it correct to say that when writing data to NFS files, the only
> > sequence of operations that Linux NFS clients officially support is
> > the following:
> > 
> > - all processes on all client machines close() the file
> > - one machine (a client or the fileserver) open()s the file, writes
> >   to it, and close()s again
> > - processes on client machines can now open() the file again for
> >   reading
> 
> No.
> 
> One can always call fsync() to force data to be flushed to avoid the
> close of the write fd in this situation. That's really a more
> portable solution anyway. A local filesystem may not flush data to
> disk on close (for instance), so calling fsync ensures you rely less
> on filesystem implementation details.
> 
> The separate open by the reader just helps ensure that the file's
> attributes are revalidated (so you can tell whether cached data you
> hold is still valid).

 This bit about the separate open doesn't seem to be the case
currently, and people here have asserted that it's not true in general.
Specifically, under some conditions *not involving you writing*, if you
do not close() the file before another machine writes to it and then
open() it afterward, the kernel may retain cached data that it is in a
position to know (for sure) is invalid, because that data didn't exist
in the previous version of the file (it was past the end-of-file
position).

 Since failing to close() before another machine open()s puts you
outside this outline of close-to-open, this kernel behavior is not a
bug as such (or so it's been explained to me here).
If you go outside c-t-o, the kernel is free to do whatever it finds
most convenient, and what it found most convenient was to not bother
invalidating some cached page data even though it saw a GETATTR change.

 It may be that I'm not fully understanding how you mean 'revalidated'
here. Is it that the kernel does not necessarily bother (re)checking
some internal things (such as cached pages), even when it has new
GETATTR results, until you do certain operations?

 As for the writer using fsync() instead of close(): under this model,
the writer must close() if there are ever going to be writers on
another machine and readers on its machine (including itself), because
otherwise it (and they) will be in the 'reader' position here, in
violation of the outline, and so their client kernel is free to do odd
things. (This is a basic model that ignores how NFS locks might
interact with things.)

> If you use file locking (flock() or POSIX locks), then we treat
> those as cache coherency points as well. The client will write back
> cached data to the server prior to releasing a lock, and revalidate
> attributes (and thus the local cache) after acquiring one.

 The client currently appears to do more than re-check attributes, at
least in one sense of 'revalidate'. In some cases, flock() will cause
the client to flush cached data that it would otherwise return and
apparently considered valid, even though the GETATTR results from the
server didn't change. I'm curious whether this is guaranteed behavior
or simply 'it works today'.

 (If by 'revalidate attributes' you mean that the kernel internally
revalidates some cached data that it didn't bother revalidating before,
then that would match observed behavior. As an outside user of NFS, I
find this confusing terminology, though, as the kernel clearly has new
GETATTR results.)
 Specifically, consider the sequence:

	client A				fileserver
	open file read-write
	read through end of file
1	go idle, but don't close file
2						open file, append data,
						close, sync
3	remain idle until fstat() shows
	st_size has grown
4	optional: close and re-open file
5	optional: flock()
6	read from old EOF to new EOF

 Today, if you leave out #5, at #6 client A will read some zero bytes
instead of actual file content (whether or not you did #4). If you
include #5, it will not (again, whether or not you did #4).

 Under my outline in my original email, client A is behaving outside
of close-to-open consistency because it has not closed the file before
the fileserver wrote to it and opened it afterward. At point #3, in
some sense the client clearly knows that file attributes have changed,
because fstat() results have changed (showing a new, larger file size
among other things), but because we went outside the guaranteed
behavior the kernel doesn't have to care completely; it retains a
cached partial page at the old end of file and returns this data to us
at step #6 (if we skip #5).

 The file attributes obtained from the NFS server don't change between
#3, #4, and #5, but if we do #5, today the kernel does something with
the cached partial page that causes it to return real data at #6. This
doesn't happen with just #4, but under my outlined rules that's
acceptable, because we violated c-t-o by closing the file only after it
had been changed elsewhere, and so the kernel isn't obliged to do the
magic that it does for #5.

 (In fact it is possible to read zero bytes before #5 and read good
data afterward, including in a different program.)

	- cks

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Correctly understanding Linux's close-to-open consistency
  2018-09-15 19:11 ` Chris Siebenmann
@ 2018-09-16 11:01   ` Jeff Layton
  2018-09-16 16:12     ` Trond Myklebust
  0 siblings, 1 reply; 7+ messages in thread

From: Jeff Layton @ 2018-09-16 11:01 UTC (permalink / raw)
To: Chris Siebenmann; +Cc: linux-nfs

On Sat, 2018-09-15 at 15:11 -0400, Chris Siebenmann wrote:
> > On Wed, 2018-09-12 at 21:24 -0400, Chris Siebenmann wrote:
> > >  Is it correct to say that when writing data to NFS files, the
> > > only sequence of operations that Linux NFS clients officially
> > > support is the following:
> > > 
> > > - all processes on all client machines close() the file
> > > - one machine (a client or the fileserver) open()s the file,
> > >   writes to it, and close()s again
> > > - processes on client machines can now open() the file again for
> > >   reading
> > 
> > No.
> > 
> > One can always call fsync() to force data to be flushed to avoid
> > the close of the write fd in this situation. That's really a more
> > portable solution anyway. A local filesystem may not flush data to
> > disk on close (for instance), so calling fsync ensures you rely
> > less on filesystem implementation details.
> > 
> > The separate open by the reader just helps ensure that the file's
> > attributes are revalidated (so you can tell whether cached data you
> > hold is still valid).
> 
>  This bit about the separate open doesn't seem to be the case
> currently, and people here have asserted that it's not true in
> general. Specifically, under some conditions *not involving you
> writing*, if you do not close() the file before another machine
> writes to it and then open() it afterward, the kernel may retain
> cached data that it is in a position to know (for sure) is invalid
> because it didn't exist in the previous version of the file (as it
> was past the end of file position).
>  Since failing to close() before another machine open()s puts you
> outside this outline of close-to-open, this kernel behavior is not a
> bug as such (or so it's been explained to me here). If you go outside
> c-t-o, the kernel is free to do whatever it finds most convenient,
> and what it found most convenient was to not bother invalidating some
> cached page data even though it saw a GETATTR change.

That would be a bug. If we have reason to believe the file has changed,
then we must invalidate the cache on the file prior to allowing a read
to proceed.

>  It may be that I'm not fully understanding how you mean
> 'revalidated' here. Is it that the kernel does not necessarily bother
> (re)checking some internal things (such as cached pages) even when it
> has new GETATTR results, until you do certain operations?

Well, it'll generally mark the cache as being invalid (e.g. by setting
the NFS_INO_INVALID_DATA flag). Whether it purges the cache at that
point is a different matter. If we have writes cached, then we can't
just drop pages that have dirty data; they must be written back to the
server first.

Basically, if you don't take steps to serialize your I/O between hosts,
then your results may not be what you expect.

>  As far as the writer using fsync() instead of close(): under this
> model, the writer must close() if there are ever going to be writers
> on another machine and readers on its machine (including itself),
> because otherwise it (and they) will be in the 'reader' position
> here, and in violation of the outline, and so their client kernel is
> free to do odd things. (This is a basic model that ignores how NFS
> locks might interact with things.)

A close() on NFS is basically doing fsync() and then close(), unless
you hold a write delegation, in which case it may not do the fsync
since it's not required.

> > If you use file locking (flock() or POSIX locks), then we treat
> > those as cache coherency points as well.
> > The client will write back cached data to the server prior to
> > releasing a lock, and revalidate attributes (and thus the local
> > cache) after acquiring one.
> 
>  The client currently appears to do more than re-check attributes,
> at least in one sense of 'revalidate'. In some cases, flock() will
> cause the client to flush cached data that it would otherwise return
> and apparently considered valid, even though GETATTR results from the
> server didn't change. I'm curious if this is guaranteed behavior, or
> simply 'it works today'.

You need to distinguish between two different cases in the cache here:
pages can be dirty or clean. When I say flush here, I mean that it's
writing back dirty data. The client can decide to drop clean pages at
any time. It doesn't need a reason -- being low on memory is good
enough.

> (If by 'revalidate attributes' you mean that the kernel internally
> revalidates some cached data that it didn't bother revalidating
> before, then that would match observed behavior. As an outside user
> of NFS, I find this confusing terminology, though, as the kernel
> clearly has new GETATTR results.)
> 
>  Specifically, consider the sequence:
> 
> 	client A			fileserver
> 	open file read-write
> 	read through end of file
> 1	go idle, but don't close file
> 2					open file, append data,
> 					close, sync
> 3	remain idle until fstat()
> 	shows st_size has grown
> 4	optional: close and re-open file
> 5	optional: flock()
> 6	read from old EOF to new EOF
> 
> Today, if you leave out #5, at #6 client A will read some zero bytes
> instead of actual file content (whether or not you did #4). If you
> include #5, it will not (again whether or not you did #4).
> 
>  Under my outline in my original email, client A is behaving outside
> of close-to-open consistency because it has not closed the file
> before the fileserver wrote to it and opened it afterward.
> At point #3, in some sense the client clearly knows that file
> attributes have changed, because fstat() results have changed
> (showing a new, larger file size among other things), but because we
> went outside the guaranteed behavior the kernel doesn't have to care
> completely; it retains a cached partial page at the old end of file
> and returns this data to us at step #6 (if we skip #5).
> 
>  The file attributes obtained from the NFS server don't change
> between #3, #4, and #5, but if we do #5, today the kernel does
> something with the cached partial page that causes it to return real
> data at #6. This doesn't happen with just #4, but under my outlined
> rules that's acceptable because we violated c-t-o by closing the file
> only after it had been changed elsewhere and so the kernel isn't
> obliged to do the magic that it does for #5.
> 
> (In fact it is possible to read zero bytes before #5 and read good
> data afterward, including in a different program.)

Sure. As I said before, locking acts as a cache coherency point. On
flock, we revalidate the attributes, so the client would see the new
size and do reads like you'd expect.

As complicated as CTO sounds, it's actually relatively simple. When we
close a file, we flush any cached write data back to the server
(basically doing an fsync). When we open a file, we revalidate the
attributes to ensure that we know whether the cache is valid. We do
similar things with locking (releasing a lock flushes cached data, and
acquiring one revalidates attributes). The client, however, is free to
flush data at any time and fetch attributes at any time.

YMMV if changes happened to the file after you locked or opened it, or
if someone performs reads prior to your unlock or close. If you want
consistent reads and writes then you _must_ ensure that the accesses
are serialized. Usually that's done with locking, but it doesn't have
to be if you can serialize open/close/fsync via other mechanisms.
Basically, your assertion was that you _must_ open and close files in
order to get proper cache coherency between clients doing reads and
writes. That's simply not true if you use file locking. If you've found
cases where file locks are not protecting things as they should, then
please do raise a bug report.

It's also not required to close the file that was open for write if you
do an fsync prior to the reader reopening the file. The close is
completely extraneous at that point, since you know that writeback is
complete. The reopen for read in that case is only required to ensure
that the attrs are re-fetched prior to trusting the reader's cache.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread
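The two serialization options in Jeff's summary — bracket access with locks, or fsync() before readers re-open — can be sketched side by side (hypothetical path, single machine, so this shows the call patterns rather than real NFS behavior):

```python
import fcntl
import os

path = "/tmp/serialize-demo.txt"  # hypothetical shared file
open(path, "wb").close()  # start with an empty file

def write_with_lock(payload: bytes) -> None:
    """Bracket access with flock(): on NFS, acquiring revalidates the
    cache, and releasing flushes cached writes back to the server."""
    fd = os.open(path, os.O_RDWR | os.O_APPEND)
    fcntl.flock(fd, fcntl.LOCK_EX)
    os.write(fd, payload)
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)

def write_with_fsync(payload: bytes) -> None:
    """fsync() before any reader re-open()s; the close() afterwards is
    then extraneous, since writeback is already complete."""
    fd = os.open(path, os.O_RDWR | os.O_APPEND)
    os.write(fd, payload)
    os.fsync(fd)
    os.close(fd)

write_with_lock(b"a")
write_with_fsync(b"b")

with open(path, "rb") as f:
    content = f.read()
os.remove(path)
print(content)  # b'ab'
```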
* Re: Correctly understanding Linux's close-to-open consistency
  2018-09-16 11:01 ` Jeff Layton
@ 2018-09-16 16:12   ` Trond Myklebust
  2018-09-17  0:18     ` Chris Siebenmann
  0 siblings, 1 reply; 7+ messages in thread

From: Trond Myklebust @ 2018-09-16 16:12 UTC (permalink / raw)
To: cks, jlayton; +Cc: linux-nfs

On Sun, 2018-09-16 at 07:01 -0400, Jeff Layton wrote:
> On Sat, 2018-09-15 at 15:11 -0400, Chris Siebenmann wrote:
> > > On Wed, 2018-09-12 at 21:24 -0400, Chris Siebenmann wrote:
> > > >  Is it correct to say that when writing data to NFS files, the
> > > > only sequence of operations that Linux NFS clients officially
> > > > support is the following:
> > > > 
> > > > - all processes on all client machines close() the file
> > > > - one machine (a client or the fileserver) open()s the file,
> > > >   writes to it, and close()s again
> > > > - processes on client machines can now open() the file again
> > > >   for reading
> > > 
> > > No.
> > > 
> > > One can always call fsync() to force data to be flushed to avoid
> > > the close of the write fd in this situation. That's really a more
> > > portable solution anyway. A local filesystem may not flush data
> > > to disk on close (for instance), so calling fsync will ensure you
> > > rely less on filesystem implementation details.
> > > 
> > > The separate open by the reader just helps ensure that the file's
> > > attributes are revalidated (so you can tell whether cached data
> > > you hold is still valid).
> > 
> >  This bit about the separate open doesn't seem to be the case
> > currently, and people here have asserted that it's not true in
> > general. Specifically, under some conditions *not involving you
> > writing*, if you do not close() the file before another machine
> > writes to it and then open() it afterward, the kernel may retain
> > cached data that it is in a position to know (for sure) is invalid
> > because it didn't exist in the previous version of the file (as it
> > was past the end of file position).
> > 
> >  Since failing to close() before another machine open()s puts you
> > outside this outline of close-to-open, this kernel behavior is not
> > a bug as such (or so it's been explained to me here). If you go
> > outside c-t-o, the kernel is free to do whatever it finds most
> > convenient, and what it found most convenient was to not bother
> > invalidating some cached page data even though it saw a GETATTR
> > change.
> 
> That would be a bug. If we have reason to believe the file has
> changed, then we must invalidate the cache on the file prior to
> allowing a read to proceed.

The point here is that when the file is open for writing (or for
read+write), and your applications are not using locking, then we have
no reason to believe the file is being changed on the server, and we
deliberately optimise for the case where the cache consistency rules
are being observed.

If the file is open for reading only, then we may detect changes on the
server. However we certainly cannot guarantee that the data is
consistent due to the potential for write reordering as discussed
earlier in this thread, and due to the fact that attribute revalidation
is not atomic with reads.

Again, these are the cases where you are _not_ using locking to
mediate. If you are using locking, then I agree that changes need to be
seen by the client.
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Correctly understanding Linux's close-to-open consistency
  2018-09-16 16:12 ` Trond Myklebust
@ 2018-09-17  0:18   ` Chris Siebenmann
  2018-09-17  2:19     ` Trond Myklebust
  0 siblings, 1 reply; 7+ messages in thread

From: Chris Siebenmann @ 2018-09-17  0:18 UTC (permalink / raw)
To: Trond Myklebust; +Cc: cks, jlayton, linux-nfs

> > >  Since failing to close() before another machine open()s puts you
> > > outside this outline of close-to-open, this kernel behavior is
> > > not a bug as such (or so it's been explained to me here). If you
> > > go outside c-t-o, the kernel is free to do whatever it finds most
> > > convenient, and what it found most convenient was to not bother
> > > invalidating some cached page data even though it saw a GETATTR
> > > change.
> > 
> > That would be a bug. If we have reason to believe the file has
> > changed, then we must invalidate the cache on the file prior to
> > allowing a read to proceed.
> 
> The point here is that when the file is open for writing (or for
> read+write), and your applications are not using locking, then we
> have no reason to believe the file is being changed on the server,
> and we deliberately optimise for the case where the cache consistency
> rules are being observed.

 In this case the user level can be completely sure that the client
kernel has issued a GETATTR and received a different answer from the
NFS server, because the fstat() results it sees have changed from the
values it has seen before (and remembered). This may not count as the
NFS client kernel code '[having] reason to believe' that the file has
changed on the server from its perspective, but if so, it's not because
the information is unavailable and a GETATTR would have to be
explicitly issued to find it out. The client code has made the GETATTR
and received different results, which it has passed to user level; it
has just not used those results to do things to its cached data.
 Today, if you do a flock(), the NFS client code in the kernel will do
things that invalidate the cached data, despite the GETATTR result from
the fileserver not changing. From my outside perspective, as someone
writing code or dealing with programs that must work over NFS, this is
a little bit magical, and as a result I would like to understand
whether the magic is guaranteed to work, or whether it is not
officially supported magic, merely 'it happens to work' magic in the
way that having the file open read-write without the flock() used to
work in kernel 4.4.x but doesn't now (and this is simply considered to
be the kernel applying CTO more strongly, not a bug).

 (Looking at a tcpdump trace, the flock() call appears to cause the
kernel to issue another GETATTR to the fileserver. The results are the
same as the GETATTR results that were passed to the client program.)

> Again, these are the cases where you are _not_ using locking to
> mediate. If you are using locking, then I agree that changes need to
> be seen by the client.

 The original code (Alpine) *is* using locking in the broad sense, but
it is not flock() locking; instead it is locking (in this case) through
.lock files. The current kernel behavior, and what I've been told about
it, implies that it is not sufficient for your application to perfectly
coordinate locking, writes, fsync(), and fstat() visibility of the
resulting changes through its own mechanism; you must do your locking
through the officially approved kernel channels (and it is not clear
what they are) or see potentially incorrect results.

 Consider a system where reads and writes to a shared file are
coordinated by a central process that everyone communicates with
through TCP connections. The central process pauses readers before it
allows a writer to start, the writer always fsync()s before it releases
its write permissions, and then no reader is permitted to proceed until
the entire cluster sees the same updated fstat() result.
This is perfectly coordinated but currently could see incorrect read()
results, and I've been told that this is allowed under Linux's CTO
rules because all of the processes hold the file open read-write
through this entire process (and no one flock()s).

	- cks

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Correctly understanding Linux's close-to-open consistency
  2018-09-17  0:18 ` Chris Siebenmann
@ 2018-09-17  2:19   ` Trond Myklebust
  0 siblings, 0 replies; 7+ messages in thread

From: Trond Myklebust @ 2018-09-17  2:19 UTC (permalink / raw)
To: cks; +Cc: jlayton, linux-nfs

On Sun, 2018-09-16 at 20:18 -0400, Chris Siebenmann wrote:
> > > >  Since failing to close() before another machine open()s puts
> > > > you outside this outline of close-to-open, this kernel behavior
> > > > is not a bug as such (or so it's been explained to me here). If
> > > > you go outside c-t-o, the kernel is free to do whatever it
> > > > finds most convenient, and what it found most convenient was to
> > > > not bother invalidating some cached page data even though it
> > > > saw a GETATTR change.
> > > 
> > > That would be a bug. If we have reason to believe the file has
> > > changed, then we must invalidate the cache on the file prior to
> > > allowing a read to proceed.
> > 
> > The point here is that when the file is open for writing (or for
> > read+write), and your applications are not using locking, then we
> > have no reason to believe the file is being changed on the server,
> > and we deliberately optimise for the case where the cache
> > consistency rules are being observed.
> 
>  In this case the user level can be completely sure that the client
> kernel has issued a GETATTR and received a different answer from the
> NFS server, because the fstat() results it sees have changed from the
> values it has seen before (and remembered). This may not count as the
> NFS client kernel code '[having] reason to believe' that the file has
> changed on the server from its perspective, but if so it's not
> because the information is not available and a GETATTR would have to
> be explicitly issued to find it out. The client code has made the
> GETATTR and received different results, which it has passed to user
> level; it has just not used those results to do things to its cached
> data.
> 
>  Today, if you do a flock(), the NFS client code in the kernel will
> do things that invalidate the cached data, despite the GETATTR result
> from the fileserver not changing. From my outside perspective, as
> someone writing code or dealing with programs that must work over
> NFS, this is a little bit magical, and as a result I would like to
> understand if it is guaranteed that the magic works or if this is not
> officially supported magic, merely 'it happens to work' magic in the
> way that having the file open read-write without the flock() used to
> work in kernel 4.4.x but doesn't now (and this is simply considered
> to be the kernel using CTO more strongly, not a bug).
> 
> (Looking at a tcpdump trace, the flock() call appears to cause the
> kernel to issue another GETATTR to the fileserver. The results are
> the same as the GETATTR results that were passed to the client
> program.)

This is also documented in the NFS FAQ to which I pointed you earlier.

> > Again, these are the cases where you are _not_ using locking to
> > mediate. If you are using locking, then I agree that changes need
> > to be seen by the client.
> 
>  The original code (Alpine) *is* using locking in the broad sense,
> but it is not flock() locking; instead it is locking (in this case)
> through .lock files. The current kernel behavior and what I've been
> told about it implies that it is not sufficient for your application
> to perfectly coordinate locking, writes, fsync(), and fstat()
> visibility of the resulting changes through its own mechanism; you
> must do your locking through the officially approved kernel channels
> (and it is not clear what they are) or see potentially incorrect
> results.
> 
>  Consider a system where reads and writes to a shared file are
> coordinated by a central process that everyone communicates with
> through TCP connections. The central process pauses readers before it
> allows a writer to start, the writer always fsync()s before it
> releases its write permissions, and then no reader is permitted to
> proceed until the entire cluster sees the same updated fstat()
> result. This is perfectly coordinated but currently could see
> incorrect read() results, and I've been told that this is allowed
> under Linux's CTO rules because all of the processes hold the file
> open read-write through this entire process (and no one flock()s).

Why would such a system need to use buffered I/O instead of uncached
I/O (i.e. O_DIRECT)? What would be the point of optimising the buffered
I/O client for this use case rather than the close to open cache
consistent case?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com

^ permalink raw reply	[flat|nested] 7+ messages in thread
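The uncached-I/O alternative raised at the end of the thread can be sketched as follows. O_DIRECT requires aligned buffers, offsets, and sizes; the 4096-byte alignment here is an assumption (real code should match the filesystem's requirements), the path is hypothetical, and the fallback handles filesystems such as tmpfs that reject O_DIRECT:

```python
import mmap
import os

path = "/tmp/odirect-demo.bin"  # hypothetical; picture it on an NFS mount
BLOCK = 4096  # assumed alignment; O_DIRECT needs block-aligned I/O

flags = os.O_RDWR | os.O_CREAT | os.O_TRUNC
try:
    # O_DIRECT bypasses the client page cache, so reads and writes go
    # to the server instead of through possibly stale cached pages.
    fd = os.open(path, flags | getattr(os, "O_DIRECT", 0))
except OSError:
    fd = os.open(path, flags)  # filesystem without O_DIRECT support

buf = mmap.mmap(-1, BLOCK)  # anonymous mmap gives page-aligned memory
buf[:5] = b"hello"
os.write(fd, buf)  # write one full aligned block

out = mmap.mmap(-1, BLOCK)
os.lseek(fd, 0, os.SEEK_SET)
os.readv(fd, [out])  # read back into an aligned buffer
readback = bytes(out[:5])

os.close(fd)
os.remove(path)
print(readback)
```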
end of thread, other threads:[~2018-09-17  7:44 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-13  1:24 Correctly understanding Linux's close-to-open consistency Chris Siebenmann
2018-09-15 16:20 ` Jeff Layton
2018-09-15 19:11   ` Chris Siebenmann
2018-09-16 11:01     ` Jeff Layton
2018-09-16 16:12       ` Trond Myklebust
2018-09-17  0:18         ` Chris Siebenmann
2018-09-17  2:19           ` Trond Myklebust