* Provision for filesystem specific open flags @ 2017-11-10 16:49 Fu, Rodney 2017-11-10 17:23 ` hch 0 siblings, 1 reply; 20+ messages in thread From: Fu, Rodney @ 2017-11-10 16:49 UTC (permalink / raw) To: hch, viro; +Cc: linux-fsdevel

Hello,

With commit

    commit 59724793983177d1b51c8cdd0134326977a1cabc
    Author: Christoph Hellwig <hch@lst.de>
    Date:   Thu Apr 27 09:42:25 2017 +0200

        fs: completely ignore unknown open flags

        [ Upstream commit 629e014bb8349fcf7c1e4df19a842652ece1c945 ]

        Currently we just stash anything we got into file->f_flags, and the
        report it in fcntl(F_GETFD). This patch just clears out all unknown
        flags so that we don't pass them to the fs or report them.

        Signed-off-by: Christoph Hellwig <hch@lst.de>
        Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
        Signed-off-by: Sasha Levin <alexander.levin@verizon.com>

and commit

    commit 666d1fc2023eee3bf723c764eeeabb21d71a11f2
    Author: Christoph Hellwig <hch@lst.de>
    Date:   Thu Apr 27 09:42:24 2017 +0200

        fs: add a VALID_OPEN_FLAGS

        [ Upstream commit 80f18379a7c350c011d30332658aa15fe49a8fa5 ]

        Add a central define for all valid open flags, and use it in the
        uniqueness check.

        Signed-off-by: Christoph Hellwig <hch@lst.de>
        Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
        Signed-off-by: Sasha Levin <alexander.levin@verizon.com>

the kernel prevents unknown open flags from being passed through to the underlying filesystem. I am wondering if people would be for or against the idea of provisioning some number of bits in the open flags that are opaque to the VFS layer but get passed down to the underlying filesystem? The motivation would be to allow filesystem specific semantics to be controllable via open, much like the more generic and pre-existing open flags.

Thanks,
Rodney Fu

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Provision for filesystem specific open flags 2017-11-10 16:49 Provision for filesystem specific open flags Fu, Rodney @ 2017-11-10 17:23 ` hch 2017-11-10 17:39 ` Fu, Rodney 0 siblings, 1 reply; 20+ messages in thread From: hch @ 2017-11-10 17:23 UTC (permalink / raw) To: Fu, Rodney; +Cc: hch, viro, linux-fsdevel On Fri, Nov 10, 2017 at 04:49:33PM +0000, Fu, Rodney wrote: > The kernel prevents unknown open flags from being passed through to the > underlying filesystem. I am wondering if people would be for or against the > idea of provisioning some number of bits in the open flags that are opaque to > the VFS layer but get passed down to the underlying filesystem? The motivation > would be to allow filesystem specific semantics to be controllable via open, > much like the more generic and pre-existing open flags. Absolutely against. Open flags need to be defined in common code or you are in a massive world of trouble. ^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: Provision for filesystem specific open flags 2017-11-10 17:23 ` hch @ 2017-11-10 17:39 ` Fu, Rodney 2017-11-10 19:29 ` Matthew Wilcox 0 siblings, 1 reply; 20+ messages in thread From: Fu, Rodney @ 2017-11-10 17:39 UTC (permalink / raw) To: hch; +Cc: viro, linux-fsdevel > Absolutely against. Open flags need to be defined in common code or you are in a massive world of trouble. I'm suggesting this can be done with definitions in common code via generically named open flags. Say for example if there was defined an O_FS1, O_FS2 (someone pick a better name) exposed for an application to use that has no meaning to the VFS layer, but could be interpreted by the filesystem. Could that work? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Provision for filesystem specific open flags 2017-11-10 17:39 ` Fu, Rodney @ 2017-11-10 19:29 ` Matthew Wilcox 2017-11-10 21:04 ` Fu, Rodney 0 siblings, 1 reply; 20+ messages in thread From: Matthew Wilcox @ 2017-11-10 19:29 UTC (permalink / raw) To: Fu, Rodney; +Cc: hch, viro, linux-fsdevel On Fri, Nov 10, 2017 at 05:39:21PM +0000, Fu, Rodney wrote: > > Absolutely against. Open flags need to be defined in common code or you are in a massive world of trouble. > > I'm suggesting this can be done with definitions in common code via generically named open flags. Say for example if there was defined an O_FS1, O_FS2 (someone pick a better name) exposed for an application to use that has no meaning to the VFS layer, but could be interpreted by the filesystem. Could that work? No. If you want new flags bits, make a public proposal. Maybe some other filesystem would also benefit from them. ^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: Provision for filesystem specific open flags 2017-11-10 19:29 ` Matthew Wilcox @ 2017-11-10 21:04 ` Fu, Rodney 2017-11-11 0:37 ` Matthew Wilcox ` (2 more replies) 0 siblings, 3 replies; 20+ messages in thread From: Fu, Rodney @ 2017-11-10 21:04 UTC (permalink / raw) To: Matthew Wilcox; +Cc: hch, viro, linux-fsdevel > No. If you want new flags bits, make a public proposal. Maybe some other > filesystem would also benefit from them. Ah, I see what you mean now, thanks. I would like to propose O_CONCURRENT_WRITE as a new open flag. It is currently used in the Panasas filesystem (panfs) and defined with value: #define O_CONCURRENT_WRITE 020000000000 This flag has been provided by panfs to HPC users via the mpich package for well over a decade. See: https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344 O_CONCURRENT_WRITE indicates to the filesystem that the application doing the open is participating in a coordinated distributed manner with other such applications, possibly running on different hosts. This allows the panfs filesystem to delegate some of the cache coherency responsibilities to the application, improving performance. The reason this flag is used on open as opposed to having a post-open ioctl or fcntl SETFL is to allow panfs to catch and reject opens by applications that attempt to access files that have already been opened by applications that have set O_CONCURRENT_WRITE. Recent changes to reject non-VALID_OPEN_FLAGS now prevent this facility from working, so getting our flag bit back would help tremendously. Thank you! Regards, Rodney Fu ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Provision for filesystem specific open flags 2017-11-10 21:04 ` Fu, Rodney @ 2017-11-11 0:37 ` Matthew Wilcox 2017-11-13 15:16 ` Fu, Rodney 2017-11-13 0:48 ` Dave Chinner 2017-11-13 17:45 ` Bernd Schubert 2 siblings, 1 reply; 20+ messages in thread From: Matthew Wilcox @ 2017-11-11 0:37 UTC (permalink / raw) To: Fu, Rodney; +Cc: hch, viro, linux-fsdevel On Fri, Nov 10, 2017 at 09:04:31PM +0000, Fu, Rodney wrote: > > No. If you want new flags bits, make a public proposal. Maybe some other > > filesystem would also benefit from them. > > Ah, I see what you mean now, thanks. > > I would like to propose O_CONCURRENT_WRITE as a new open flag. It is > currently used in the Panasas filesystem (panfs) and defined with value: > > #define O_CONCURRENT_WRITE 020000000000 > > This flag has been provided by panfs to HPC users via the mpich package for > well over a decade. See: > > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344 > > O_CONCURRENT_WRITE indicates to the filesystem that the application doing the > open is participating in a coordinated distributed manner with other such > applications, possibly running on different hosts. This allows the panfs > filesystem to delegate some of the cache coherency responsibilities to the > application, improving performance. > > The reason this flag is used on open as opposed to having a post-open ioctl or > fcntl SETFL is to allow panfs to catch and reject opens by applications that > attempt to access files that have already been opened by applications that have > set O_CONCURRENT_WRITE. OK, let me just check I understand. Once any application has opened the inode with O_CONCURRENT_WRITE, all subsequent attempts to open the same inode without O_CONCURRENT_WRITE will fail. Presumably also if somebody already has the inode open without O_CONCURRENT_WRITE set, the first open with O_CONCURRENT_WRITE will fail? Are opens with O_RDONLY also blocked? This feels a lot like leases ... 
maybe there's an opportunity to give better semantics here -- rather than rejecting opens without O_CONCURRENT_WRITE, all existing users could be forced to use the stricter coherency model? ^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: Provision for filesystem specific open flags 2017-11-11 0:37 ` Matthew Wilcox @ 2017-11-13 15:16 ` Fu, Rodney 2017-11-20 13:38 ` Jeff Layton 0 siblings, 1 reply; 20+ messages in thread From: Fu, Rodney @ 2017-11-13 15:16 UTC (permalink / raw) To: Matthew Wilcox; +Cc: hch, viro, linux-fsdevel > > > No. If you want new flags bits, make a public proposal. Maybe some > > > other filesystem would also benefit from them. > > > > Ah, I see what you mean now, thanks. > > > > I would like to propose O_CONCURRENT_WRITE as a new open flag. It is > > currently used in the Panasas filesystem (panfs) and defined with value: > > > > #define O_CONCURRENT_WRITE 020000000000 > > > > This flag has been provided by panfs to HPC users via the mpich > > package for well over a decade. See: > > > > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_pan > > fs/ad_panfs_open6.c#L344 > > > > O_CONCURRENT_WRITE indicates to the filesystem that the application > > doing the open is participating in a coordinated distributed manner > > with other such applications, possibly running on different hosts. > > This allows the panfs filesystem to delegate some of the cache > > coherency responsibilities to the application, improving performance. > > > > The reason this flag is used on open as opposed to having a post-open > > ioctl or fcntl SETFL is to allow panfs to catch and reject opens by > > applications that attempt to access files that have already been > > opened by applications that have set O_CONCURRENT_WRITE. > OK, let me just check I understand. Once any application has opened the inode > with O_CONCURRENT_WRITE, all subsequent attempts to open the same inode without > O_CONCURRENT_WRITE will fail. Presumably also if somebody already has the inode > open without O_CONCURRENT_WRITE set, the first open with O_CONCURRENT_WRITE will > fail? Yes on both counts. Opening with O_CONCURRENT_WRITE, followed by an open without will fail. 
Opening without O_CONCURRENT_WRITE followed by one with it will also fail. > Are opens with O_RDONLY also blocked? No they are not. The decision to grant access is based solely on the O_CONCURRENT_WRITE flag. > This feels a lot like leases ... maybe there's an opportunity to give better > semantics here -- rather than rejecting opens without O_CONCURRENT_WRITE, all > existing users could be forced to use the stricter coherency model? I don't think that will work, at least not from the perspective of trying to maintain good performance. A user that does not open with O_CONCURRENT_WRITE does not know how to adhere to the proper access patterns that maintain coherency. To continue to allow all users access after that point, the filesystem will have to force all users into a non-cacheable mode. Instead, we reject stray opens to allow any existing CONCURRENT_WRITE application to complete in a higher performance mode. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Provision for filesystem specific open flags 2017-11-13 15:16 ` Fu, Rodney @ 2017-11-20 13:38 ` Jeff Layton 0 siblings, 0 replies; 20+ messages in thread From: Jeff Layton @ 2017-11-20 13:38 UTC (permalink / raw) To: Fu, Rodney, Matthew Wilcox; +Cc: hch, viro, linux-fsdevel, linux-api On Mon, 2017-11-13 at 15:16 +0000, Fu, Rodney wrote: > > > > No. If you want new flags bits, make a public proposal. Maybe some > > > > other filesystem would also benefit from them. > > > > > > Ah, I see what you mean now, thanks. > > > > > > I would like to propose O_CONCURRENT_WRITE as a new open flag. It is > > > currently used in the Panasas filesystem (panfs) and defined with value: > > > > > > #define O_CONCURRENT_WRITE 020000000000 > > > > > > This flag has been provided by panfs to HPC users via the mpich > > > package for well over a decade. See: > > > > > > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_pan > > > fs/ad_panfs_open6.c#L344 > > > > > > O_CONCURRENT_WRITE indicates to the filesystem that the application > > > doing the open is participating in a coordinated distributed manner > > > with other such applications, possibly running on different hosts. > > > This allows the panfs filesystem to delegate some of the cache > > > coherency responsibilities to the application, improving performance. > > > > > > The reason this flag is used on open as opposed to having a post-open > > > ioctl or fcntl SETFL is to allow panfs to catch and reject opens by > > > applications that attempt to access files that have already been > > > opened by applications that have set O_CONCURRENT_WRITE. > > OK, let me just check I understand. Once any application has opened the inode > > with O_CONCURRENT_WRITE, all subsequent attempts to open the same inode without > > O_CONCURRENT_WRITE will fail. Presumably also if somebody already has the inode > > open without O_CONCURRENT_WRITE set, the first open with O_CONCURRENT_WRITE will > > fail? > > Yes on both counts. 
Opening with O_CONCURRENT_WRITE, followed by an open > without will fail. Opening without O_CONCURRENT_WRITE followed by one with it > will also fail. > > > Are opens with O_RDONLY also blocked? > > No they are not. The decision to grant access is based solely on the > O_CONCURRENT_WRITE flag. > > > This feels a lot like leases ... maybe there's an opportunity to give better > > semantics here -- rather than rejecting opens without O_CONCURRENT_WRITE, all > > existing users could be forced to use the stricter coherency model? > > I don't think that will work, at least not from the perspective of trying to > maintain good performance. A user that does not open with O_CONCURRENT_WRITE > does not know how to adhere to the proper access patterns that maintain > coherency. To continue to allow all users access after that point, the > filesystem will have to force all users into a non-cacheable mode. Instead, we > reject stray opens to allow any existing CONCURRENT_WRITE application to > complete in a higher performance mode. > (added linux-api@vger.kernel.org to the cc list...) Actually, it feels more like O_EXLOCK / O_SHLOCK to me: https://www.gnu.org/software/libc/manual/html_node/Open_002dtime-Flags.html Those are not quite the same semantics as what you're describing for O_CONCURRENT_WRITE, but the handling of conflicts would be similar. Maybe it's possible to dovetail your new flag on top of a credible O_EXLOCK/O_SHLOCK implementation? It'd be nice to have those to implement VFS-level share/deny locking. Most NFS and SMB servers could make good use of it. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Provision for filesystem specific open flags 2017-11-10 21:04 ` Fu, Rodney 2017-11-11 0:37 ` Matthew Wilcox @ 2017-11-13 0:48 ` Dave Chinner 2017-11-13 17:02 ` Fu, Rodney 2017-11-13 17:45 ` Bernd Schubert 2 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2017-11-13 0:48 UTC (permalink / raw) To: Fu, Rodney; +Cc: Matthew Wilcox, hch, viro, linux-fsdevel On Fri, Nov 10, 2017 at 09:04:31PM +0000, Fu, Rodney wrote: > > No. If you want new flags bits, make a public proposal. Maybe some other > > filesystem would also benefit from them. > > Ah, I see what you mean now, thanks. > > I would like to propose O_CONCURRENT_WRITE as a new open flag. It is > currently used in the Panasas filesystem (panfs) and defined with value: > > #define O_CONCURRENT_WRITE 020000000000 > > This flag has been provided by panfs to HPC users via the mpich package for > well over a decade. See: > > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344 > > O_CONCURRENT_WRITE indicates to the filesystem that the application doing the > open is participating in a coordinated distributed manner with other such > applications, possibly running on different hosts. This allows the panfs > filesystem to delegate some of the cache coherency responsibilities to the > application, improving performance. O_DIRECT already delegates responsibility for cache coherency to userspace applications and it allows for concurrent writes to a single file. Why do we need a new flag for this? > The reason this flag is used on open as opposed to having a post-open ioctl or > fcntl SETFL is to allow panfs to catch and reject opens by applications that > attempt to access files that have already been opened by applications that have > set O_CONCURRENT_WRITE. Sounds kinda like how we already use O_EXCL on block devices. Perhaps something like: #define O_CONCURRENT_WRITE (O_DIRECT | O_EXCL) To tell open to reject mixed mode access to the file on open? -Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: Provision for filesystem specific open flags 2017-11-13 0:48 ` Dave Chinner @ 2017-11-13 17:02 ` Fu, Rodney 2017-11-13 21:58 ` Dave Chinner 0 siblings, 1 reply; 20+ messages in thread From: Fu, Rodney @ 2017-11-13 17:02 UTC (permalink / raw) To: Dave Chinner; +Cc: Matthew Wilcox, hch, viro, linux-fsdevel > > > No. If you want new flags bits, make a public proposal. Maybe some > > > other filesystem would also benefit from them. > > > > Ah, I see what you mean now, thanks. > > > > I would like to propose O_CONCURRENT_WRITE as a new open flag. It is > > currently used in the Panasas filesystem (panfs) and defined with value: > > > > #define O_CONCURRENT_WRITE 020000000000 > > > > This flag has been provided by panfs to HPC users via the mpich > > package for well over a decade. See: > > > > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_pan > > fs/ad_panfs_open6.c#L344 > > > > O_CONCURRENT_WRITE indicates to the filesystem that the application > > doing the open is participating in a coordinated distributed manner > > with other such applications, possibly running on different hosts. > > This allows the panfs filesystem to delegate some of the cache > > coherency responsibilities to the application, improving performance. > O_DIRECT already delegates responsibility for cache coherency to userspace > applications and it allows for concurrent writes to a single file. Why do we > need a new flag for this? > > The reason this flag is used on open as opposed to having a post-open > > ioctl or fcntl SETFL is to allow panfs to catch and reject opens by > > applications that attempt to access files that have already been > > opened by applications that have set O_CONCURRENT_WRITE. > Sounds kinda like how we already use O_EXCL on block devices. > Perhaps something like: > #define O_CONCURRENT_WRITE (O_DIRECT | O_EXCL) > To tell open to reject mixed mode access to the file on open? > -Dave. 
> -- > Dave Chinner > david@fromorbit.com Thanks for this suggestion, but O_DIRECT has a significantly different meaning to O_CONCURRENT_WRITE. O_DIRECT forces the filesystem to not cache read or write data, while O_CONCURRENT_WRITE allows caching and concurrent distributed access. I was not clear in my initial description of CONCURRENT_WRITE, so let me add more details here. When O_CONCURRENT_WRITE is used, portions of read and write data are still cachable in the filesystem. The filesystem continues to be responsible for maintaining distributed coherency. The user application is expected to provide an access pattern that will allow the filesystem to cache data, thereby improving performance. If the application misbehaves, the filesystem will still guarantee coherency but at a performance cost, as portions of the file will have to be treated as non-cacheable. In panfs, a well behaved CONCURRENT_WRITE application will consider the file's layout on storage. Access from different machines will not overlap within the same RAID stripe so as not to cause distributed stripe lock contention. Writes to the file that are page aligned can be cached and the filesystem can aggregate multiple such writes before writing out to storage. Conversely, a CONCURRENT_WRITE application that ends up colliding on the same stripe will see worse performance. Non page aligned writes are treated by panfs as write-through and non-cachable, as the filesystem will have to assume that the region of the page that is untouched by this machine might in fact be written to on another machine. Caching such a page and writing it out later might lead to data corruption. The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the application does not have to implement any caching to see good performance. The intricacies of maintaining distributed coherency are left to the filesystem instead of to the application developer. 
Caching at the filesystem layer allows multiple CONCURRENT_WRITE processes on the same machine to enjoy the performance benefits of the page cache. Think of this as a hybrid between exclusive access to a file, where the filesystem can cache everything and a simplistic shared mode where the filesystem caches nothing. So we really do need a separate flag defined. Thanks! ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Provision for filesystem specific open flags 2017-11-13 17:02 ` Fu, Rodney @ 2017-11-13 21:58 ` Dave Chinner 2017-11-14 17:35 ` Fu, Rodney 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2017-11-13 21:58 UTC (permalink / raw) To: Fu, Rodney; +Cc: Matthew Wilcox, hch, viro, linux-fsdevel On Mon, Nov 13, 2017 at 05:02:20PM +0000, Fu, Rodney wrote: > > > > No. If you want new flags bits, make a public proposal. Maybe some > > > > other filesystem would also benefit from them. > > > > > > Ah, I see what you mean now, thanks. > > > > > > I would like to propose O_CONCURRENT_WRITE as a new open flag. It is > > > currently used in the Panasas filesystem (panfs) and defined with value: > > > > > > #define O_CONCURRENT_WRITE 020000000000 > > > > > > This flag has been provided by panfs to HPC users via the mpich > > > package for well over a decade. See: > > > > > > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_pan > > > fs/ad_panfs_open6.c#L344 > > > > > > O_CONCURRENT_WRITE indicates to the filesystem that the application > > > doing the open is participating in a coordinated distributed manner > > > with other such applications, possibly running on different hosts. > > > This allows the panfs filesystem to delegate some of the cache > > > coherency responsibilities to the application, improving performance. > > > O_DIRECT already delegates responsibility for cache coherency to userspace > > applications and it allows for concurrent writes to a single file. Why do we > > need a new flag for this? > > > > The reason this flag is used on open as opposed to having a post-open > > > ioctl or fcntl SETFL is to allow panfs to catch and reject opens by > > > applications that attempt to access files that have already been > > > opened by applications that have set O_CONCURRENT_WRITE. > > > Sounds kinda like how we already use O_EXCL on block devices. 
> > Perhaps something like: > > > #define O_CONCURRENT_WRITE (O_DIRECT | O_EXCL) > > > To tell open to reject mixed mode access to the file on open? > > > -Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > Thanks for this suggestion, but O_DIRECT has a significantly different meaning > to O_CONCURRENT_WRITE. O_DIRECT forces the filesystem to not cache read or > write data, while O_CONCURRENT_WRITE allows caching and concurrent distributed > access. I was not clear in my initial description of CONCURRENT_WRITE, so let > me add more details here. > > When O_CONCURRENT_WRITE is used, portions of read and write data are still > cachable in the filesystem. The filesystem can still choose to do that for O_DIRECT if it wants - look at all the filesystems that have a "fall back to buffered IO because this is too hard to implement in the direct Io path". > The filesystem continues to be responsible for > maintaining distributed coherency. Just like gfs2 and ocfs2 maintain distributed coherency when doing direct IO... > The user application is expected to provide > an access pattern that will allow the filesystem to cache data, thereby > improving performance. If the application misbehaves, the filesystem will still > guarantee coherency but at a performance cost, as portions of the file will have > to be treated as non-cacheable. IOWs, you've got another set of custom userspace APIs that are needed to make proper use of this open flag? > In panfs, a well behaved CONCURRENT_WRITE application will consider the file's > layout on storage. Access from different machines will not overlap within the > same RAID stripe so as not to cause distributed stripe lock contention. Writes > to the file that are page aligned can be cached and the filesystem can aggregate > multiple such writes before writing out to storage. Conversely, a > CONCURRENT_WRITE application that ends up colliding on the same stripe will see > worse performance. 
Non page aligned writes are treated by panfs as > write-through and non-cachable, as the filesystem will have to assume that the > region of the page that is untouched by this machine might in fact be written to > on another machine. Caching such a page and writing it out later might lead to > data corruption. That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if the app doesn't do correctly aligned and sized IO then performance is going to suck, and if the apps doesn't serialise access to the file correctly it can and will corrupt data in the file.... > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the application does > not have to implement any caching to see good performance. Sure, but it has to be aware of layout and where/how it can write, which is exactly the same constraints that local filesystems place on O_DIRECT access. > The intricacies of > maintaining distributed coherency are left to the filesystem instead of to > the application developer. Caching at the filesystem layer allows multiple > CONCURRENT_WRITE processes on the same machine to enjoy the performance benefits > of the page cache. > > Think of this as a hybrid between exclusive access to a file, where the > filesystem can cache everything and a simplistic shared mode where the > filesystem caches nothing. > > So we really do need a separate flag defined. Thanks! Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics and behaviour, IMO. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: Provision for filesystem specific open flags 2017-11-13 21:58 ` Dave Chinner @ 2017-11-14 17:35 ` Fu, Rodney 2017-11-20 13:53 ` Jeff Layton 2017-12-04 5:29 ` NeilBrown 0 siblings, 2 replies; 20+ messages in thread From: Fu, Rodney @ 2017-11-14 17:35 UTC (permalink / raw) To: Dave Chinner; +Cc: Matthew Wilcox, hch, viro, linux-fsdevel > The filesystem can still choose to do that for O_DIRECT if it wants - look at > all the filesystems that have a "fall back to buffered IO because this is too > hard to implement in the direct Io path". Yes, I agree that the filesystem can still decide to buffer IO even with O_DIRECT, but the application's intent is that the effects of caching are minimized. Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching. > IOWs, you've got another set of custom userspace APIs that are needed to make > proper use of this open flag? Yes and no. Applications can make ioctls to the filesystem to query or set layout details but don't have to. Directory level default layout attributes can be set up by an admin to meet the requirements of the application. > > In panfs, a well behaved CONCURRENT_WRITE application will consider > > the file's layout on storage. Access from different machines will not > > overlap within the same RAID stripe so as not to cause distributed > > stripe lock contention. Writes to the file that are page aligned can > > be cached and the filesystem can aggregate multiple such writes before > > writing out to storage. Conversely, a CONCURRENT_WRITE application > > that ends up colliding on the same stripe will see worse performance. > > Non page aligned writes are treated by panfs as write-through and > > non-cachable, as the filesystem will have to assume that the region of > > the page that is untouched by this machine might in fact be written to > > on another machine. Caching such a page and writing it out later might lead to data corruption. 
> That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if > the app doesn't do correctly aligned and sized IO then performance is going to > suck, and if the apps doesn't serialize access to the file correctly it can and > will corrupt data in the file.... I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have opposite intents with respect to caching. Our filesystem handles them differently, so we need to distinguish between the two. > > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the > > application does not have to implement any caching to see good performance. > Sure, but it has to be aware of layout and where/how it can write, which is > exactly the same constraints that local filesystems place on O_DIRECT access. > Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics > and behaviour, IMO. I'd like to make a slight adjustment to my proposal. The HPC community had talked about extensions to POSIX to include O_LAZY as a way for filesystems to relax data coherency requirements. There is code in the ceph filesystem that uses that flag if defined. Can we get O_LAZY defined? HEC POSIX extension: http://www.pdsw.org/pdsw06/resources/hec-posix-extensions-sc2006-workshop.pdf Ceph usage of O_LAZY: https://github.com/ceph/ceph-client/blob/1e37f2f84680fa7f8394fd444b6928e334495ccc/net/ceph/ceph_fs.c#L78 Regards, Rodney ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Provision for filesystem specific open flags @ 2017-11-20 13:53 ` Jeff Layton 0 siblings, 0 replies; 20+ messages in thread From: Jeff Layton @ 2017-11-20 13:53 UTC (permalink / raw) To: Fu, Rodney, Dave Chinner Cc: Matthew Wilcox, hch, viro, linux-fsdevel, linux-api On Tue, 2017-11-14 at 17:35 +0000, Fu, Rodney wrote: > > The filesystem can still choose to do that for O_DIRECT if it wants - look at > > all the filesystems that have a "fall back to buffered IO because this is too > > hard to implement in the direct Io path". > > Yes, I agree that the filesystem can still decide to buffer IO even with > O_DIRECT, but the application's intent is that the effects of caching are > minimized. Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching. > > > IOWs, you've got another set of custom userspace APIs that are needed to make > > proper use of this open flag? > > Yes and no. Applications can make ioctls to the filesystem to query or set > layout details but don't have to. Directory level default layout attributes can > be set up by an admin to meet the requirements of the application. > > > > In panfs, a well behaved CONCURRENT_WRITE application will consider > > > the file's layout on storage. Access from different machines will not > > > overlap within the same RAID stripe so as not to cause distributed > > > stripe lock contention. Writes to the file that are page aligned can > > > be cached and the filesystem can aggregate multiple such writes before > > > writing out to storage. Conversely, a CONCURRENT_WRITE application > > > that ends up colliding on the same stripe will see worse performance. > > > Non page aligned writes are treated by panfs as write-through and > > > non-cachable, as the filesystem will have to assume that the region of > > > the page that is untouched by this machine might in fact be written to > > > on another machine. Caching such a page and writing it out later might lead to data corruption. 
> > That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if > > the app doesn't do correctly aligned and sized IO then performance is going to > > suck, and if the apps doesn't serialize access to the file correctly it can and > > will corrupt data in the file.... > > I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have > opposite intents with respect to caching. Our filesystem handles them > differently, so we need to distinguish between the two. > > > > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the > > > application does not have to implement any caching to see good performance. > > Sure, but it has to be aware of layout and where/how it can write, which is > > exactly the same constraints that local filesystems place on O_DIRECT access. > > Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics > > and behaviour, IMO. > > I'd like to make a slight adjustment to my proposal. The HPC community had > talked about extensions to POSIX to include O_LAZY as a way for filesystems to > relax data coherency requirements. There is code in the ceph filesystem that > uses that flag if defined. Can we get O_LAZY defined? > > HEC POSIX extension: > http://www.pdsw.org/pdsw06/resources/hec-posix-extensions-sc2006-workshop.pdf > > Ceph usage of O_LAZY: > https://github.com/ceph/ceph-client/blob/1e37f2f84680fa7f8394fd444b6928e334495ccc/net/ceph/ceph_fs.c#L78 O_LAZY support was removed from cephfs userland client in 2013: commit 94afedf02d07ad4678222aa66289a74b87768810 Author: Sage Weil <sage@inktank.com> Date: Mon Jul 8 11:24:48 2013 -0700 client: remove O_LAZY ...part of the problem (and this may just be my lack of understanding) is that it's not clear what O_LAZY semantics actually are. The ceph sources have a textfile with this in it: "-- lazy i/o integrity FIXME: currently missing call to flag an Fd/file has lazy. used to be O_LAZY on open, but no more. 
* relax data coherency
* writes may not be visible until lazyio_propagate, fsync, close

lazyio_propagate(int fd, off_t offset, size_t count);
  * my writes are safe

lazyio_synchronize(int fd, off_t offset, size_t count);
  * i will see everyone else's propagated writes"

lazyio_propagate / lazyio_synchronize seem like they could be implemented as
ioctls if you don't care about other filesystems.

It is possible to add new open flags (we're running low, but that's a problem
we'll hit sooner or later anyway), but before we can do anything here, O_LAZY
needs to be defined in a way that makes sense for application developers across
filesystems. How does this change behavior on ext4, xfs or btrfs, for instance?
What about nfs or cifs?

I suggest that before you even dive into writing patches for any of this, you
draft a small manpage update for open(2). What would an O_LAZY entry look like?

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply [flat|nested] 20+ messages in thread
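If you only cared about a single filesystem, the two lazyio_* calls could
indeed be prototyped as ioctls. A minimal sketch (the 'L' magic, the command
numbers, and the range struct are entirely hypothetical, not any real ABI):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/types.h>

/* Hypothetical byte range argument for the two lazy-i/o calls. */
struct lazyio_range {
    off_t  offset;   /* start of the range */
    size_t count;    /* length of the range */
};

/* Hypothetical command numbers; the 'L' magic and 1/2 are illustrative. */
#define LAZYIO_PROPAGATE   _IOW('L', 1, struct lazyio_range) /* my writes are safe */
#define LAZYIO_SYNCHRONIZE _IOW('L', 2, struct lazyio_range) /* see others' writes */
```

An application would then call ioctl(fd, LAZYIO_PROPAGATE, &range) after its
writes; the obvious downside, as noted above, is that no other filesystem
would understand these commands.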
* RE: Provision for filesystem specific open flags 2017-11-14 17:35 ` Fu, Rodney 2017-11-20 13:53 ` Jeff Layton @ 2017-12-04 5:29 ` NeilBrown 2017-12-05 21:36 ` Andreas Dilger 1 sibling, 1 reply; 20+ messages in thread From: NeilBrown @ 2017-12-04 5:29 UTC (permalink / raw) To: Fu, Rodney, Dave Chinner; +Cc: Matthew Wilcox, hch, viro, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 3990 bytes --] On Tue, Nov 14 2017, Fu, Rodney wrote: >> The filesystem can still choose to do that for O_DIRECT if it wants - look at >> all the filesystems that have a "fall back to buffered IO because this is too >> hard to implement in the direct Io path". > > Yes, I agree that the filesystem can still decide to buffer IO even with > O_DIRECT, but the application's intent is that the effects of caching are > minimized. Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching. > >> IOWs, you've got another set of custom userspace APIs that are needed to make >> proper use of this open flag? > > Yes and no. Applications can make ioctls to the filesystem to query or set > layout details but don't have to. Directory level default layout attributes can > be set up by an admin to meet the requirements of the application. > >> > In panfs, a well behaved CONCURRENT_WRITE application will consider >> > the file's layout on storage. Access from different machines will not >> > overlap within the same RAID stripe so as not to cause distributed >> > stripe lock contention. Writes to the file that are page aligned can >> > be cached and the filesystem can aggregate multiple such writes before >> > writing out to storage. Conversely, a CONCURRENT_WRITE application >> > that ends up colliding on the same stripe will see worse performance. 
>> > Non page aligned writes are treated by panfs as write-through and >> > non-cachable, as the filesystem will have to assume that the region of >> > the page that is untouched by this machine might in fact be written to >> > on another machine. Caching such a page and writing it out later might lead to data corruption. > >> That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if >> the app doesn't do correctly aligned and sized IO then performance is going to >> suck, and if the apps doesn't serialize access to the file correctly it can and >> will corrupt data in the file.... > > I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have > opposite intents with respect to caching. Our filesystem handles them > differently, so we need to distinguish between the two. > >> > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the >> > application does not have to implement any caching to see good performance. > >> Sure, but it has to be aware of layout and where/how it can write, which is >> exactly the same constraints that local filesystems place on O_DIRECT access. > >> Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics >> and behaviour, IMO. > > I'd like to make a slight adjustment to my proposal. The HPC community had > talked about extensions to POSIX to include O_LAZY as a way for filesystems to > relax data coherency requirements. There is code in the ceph filesystem that > uses that flag if defined. Can we get O_LAZY defined? This O_LAZY sounds exactly like what NFS has always done. If different clients do page aligned writes and have their own protocol to keep track of who owns which page, then everything is fine and write-back caching does good things. If different clients use byte-range locks, then write-back caching is curtailed a bit, but clients don't need to be so careful. If clients do non-aligned writes without locking, then corruption can result. 
So:

    #define O_LAZY 0

and NFS already has it implemented :-)

For NFS, we have O_SYNC, which tries to provide cache coherency as strong as
other filesystems provide without it.

Do we really want O_LAZY? Or are other filesystems trying too hard to provide
coherency when apps don't use locks?

NeilBrown

> HEC POSIX extension:
> http://www.pdsw.org/pdsw06/resources/hec-posix-extensions-sc2006-workshop.pdf
>
> Ceph usage of O_LAZY:
> https://github.com/ceph/ceph-client/blob/1e37f2f84680fa7f8394fd444b6928e334495ccc/net/ceph/ceph_fs.c#L78
>
> Regards,
> Rodney

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Provision for filesystem specific open flags 2017-12-04 5:29 ` NeilBrown @ 2017-12-05 21:36 ` Andreas Dilger 0 siblings, 0 replies; 20+ messages in thread From: Andreas Dilger @ 2017-12-05 21:36 UTC (permalink / raw) To: NeilBrown Cc: Fu, Rodney, Dave Chinner, Matthew Wilcox, hch, viro, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 4741 bytes --] On Dec 3, 2017, at 10:29 PM, NeilBrown <neilb@suse.com> wrote: > > On Tue, Nov 14 2017, Fu, Rodney wrote: > >>> The filesystem can still choose to do that for O_DIRECT if it wants - look at >>> all the filesystems that have a "fall back to buffered IO because this is too >>> hard to implement in the direct Io path". >> >> Yes, I agree that the filesystem can still decide to buffer IO even with >> O_DIRECT, but the application's intent is that the effects of caching are >> minimized. Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching. >> >>> IOWs, you've got another set of custom userspace APIs that are needed to make >>> proper use of this open flag? >> >> Yes and no. Applications can make ioctls to the filesystem to query or set >> layout details but don't have to. Directory level default layout attributes can >> be set up by an admin to meet the requirements of the application. >> >>>> In panfs, a well behaved CONCURRENT_WRITE application will consider >>>> the file's layout on storage. Access from different machines will not >>>> overlap within the same RAID stripe so as not to cause distributed >>>> stripe lock contention. Writes to the file that are page aligned can >>>> be cached and the filesystem can aggregate multiple such writes before >>>> writing out to storage. Conversely, a CONCURRENT_WRITE application >>>> that ends up colliding on the same stripe will see worse performance. 
>>>> Non page aligned writes are treated by panfs as write-through and >>>> non-cachable, as the filesystem will have to assume that the region of >>>> the page that is untouched by this machine might in fact be written to >>>> on another machine. Caching such a page and writing it out later might lead to data corruption. >> >>> That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if >>> the app doesn't do correctly aligned and sized IO then performance is going to >>> suck, and if the apps doesn't serialize access to the file correctly it can and >>> will corrupt data in the file.... >> >> I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have >> opposite intents with respect to caching. Our filesystem handles them >> differently, so we need to distinguish between the two. >> >>>> The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the >>>> application does not have to implement any caching to see good performance. >> >>> Sure, but it has to be aware of layout and where/how it can write, which is >>> exactly the same constraints that local filesystems place on O_DIRECT access. >> >>> Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics >>> and behaviour, IMO. >> >> I'd like to make a slight adjustment to my proposal. The HPC community had >> talked about extensions to POSIX to include O_LAZY as a way for filesystems to >> relax data coherency requirements. There is code in the ceph filesystem that >> uses that flag if defined. Can we get O_LAZY defined? > > This O_LAZY sounds exactly like what NFS has always done. > If different clients do page aligned writes and have their own protocol > to keep track of who owns which page, then everything is fine and > write-back caching does good things. > If different clients use byte-range locks, then write-back caching > is curtailed a bit, but clients don't need to be so careful. 
> If clients do non-aligned writes without locking, then corruption can
> result.
> So:
> #define O_LAZY 0
> and NFS already has it implemented :-)
>
> For NFS, we have O_SYNC which tries to provide cache coherency as strong
> as other filesystems provide without it.
>
> Do we really want O_LAZY? Or are other filesystems trying too hard to
> provide coherency when apps don't use locks?

Well, POSIX requires the correct read-after-write behaviour regardless of
whether applications are being careful or not. As you wrote above, "If clients
do non-aligned writes without locking, then corruption can result," and there
definitely are apps that expect the filesystem to work correctly even at very
large scales.

I think O_LAZY would be reasonable to add, as long as that is what applications
are asking for, but we can't just break long-standing data correctness behind
their backs because it would go faster, and there is no way for the filesystem
to know, without a flag like O_LAZY, whether they are doing their own locking
or not.

There is also a simple fallback to "#define O_LAZY 0" if it is not defined on
older systems, and then POSIX-compliant filesystems (not NFS) will still work
correctly, without the speedup that O_LAZY provides.

Cheers, Andreas

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply [flat|nested] 20+ messages in thread
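The fallback described above is a one-line compile-time guard, in the same
spirit as the ceph client's conditional use of the flag; with the flag defined
to 0 the hint simply becomes a no-op. A sketch (the helper name is
illustrative):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* If this libc's <fcntl.h> does not define O_LAZY, fall back to 0 so the
 * open() degrades to an ordinary, fully coherent POSIX open. */
#ifndef O_LAZY
#define O_LAZY 0
#endif

/* Open a file for relaxed-coherency writes where the flag is supported. */
int open_lazy(const char *path)
{
    return open(path, O_RDWR | O_CREAT | O_LAZY, 0644);
}
```

On an old system the program still compiles and runs correctly; it just gives
up the O_LAZY speedup.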
* Re: Provision for filesystem specific open flags 2017-11-10 21:04 ` Fu, Rodney 2017-11-11 0:37 ` Matthew Wilcox 2017-11-13 0:48 ` Dave Chinner @ 2017-11-13 17:45 ` Bernd Schubert 2017-11-13 20:19 ` Fu, Rodney 2 siblings, 1 reply; 20+ messages in thread From: Bernd Schubert @ 2017-11-13 17:45 UTC (permalink / raw) To: Fu, Rodney, Matthew Wilcox; +Cc: hch, viro, linux-fsdevel On 11/10/2017 10:04 PM, Fu, Rodney wrote: >> No. If you want new flags bits, make a public proposal. Maybe some other >> filesystem would also benefit from them. > > Ah, I see what you mean now, thanks. > > I would like to propose O_CONCURRENT_WRITE as a new open flag. It is > currently used in the Panasas filesystem (panfs) and defined with value: > > #define O_CONCURRENT_WRITE 020000000000 > > This flag has been provided by panfs to HPC users via the mpich package for > well over a decade. See: > > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344 > > O_CONCURRENT_WRITE indicates to the filesystem that the application doing the > open is participating in a coordinated distributed manner with other such > applications, possibly running on different hosts. This allows the panfs > filesystem to delegate some of the cache coherency responsibilities to the > application, improving performance. > > The reason this flag is used on open as opposed to having a post-open ioctl or > fcntl SETFL is to allow panfs to catch and reject opens by applications that > attempt to access files that have already been opened by applications that have > set O_CONCURRENT_WRITE. Hmm, while I see why adding this flag is convenient, it still should be possible to have an ioctl to open the file and to set the flag? If a wrong panfs-inode flag is set, failing either the normal- or the ioctl-open would also work. Cheers, Bernd ^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: Provision for filesystem specific open flags 2017-11-13 17:45 ` Bernd Schubert @ 2017-11-13 20:19 ` Fu, Rodney 2017-11-20 14:03 ` Florian Weimer 0 siblings, 1 reply; 20+ messages in thread From: Fu, Rodney @ 2017-11-13 20:19 UTC (permalink / raw) To: Bernd Schubert, Matthew Wilcox; +Cc: hch, viro, linux-fsdevel > Hmm, while I see why adding this flag is convenient, it still should be possible > to have an ioctl to open the file and to set the flag? If a wrong panfs-inode > flag is set, failing either the normal- or the ioctl-open would also work. > Cheers, > Bernd Yes, an ioctl open is possible but not ideal. The interface would require an additional open to perform the ioctl against. The open system call is really a great place to pass control information to the filesystem and any other solution seems less elegant. There is also the issue of backward compatibility with existing MPI applications that have been built using the existing O_CONCURRENT_WRITE flag. A user wanting to ensure compatibility would have to consider four pieces of software: the kernel, the filesystem version, the MPI package and finally the application. Having a single flag bit can make a big difference in this regard. Regards, Rodney ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Provision for filesystem specific open flags @ 2017-11-20 14:03 ` Florian Weimer 0 siblings, 0 replies; 20+ messages in thread From: Florian Weimer @ 2017-11-20 14:03 UTC (permalink / raw) To: Fu, Rodney, Bernd Schubert, Matthew Wilcox Cc: hch, viro, linux-fsdevel, Linux API On 11/13/2017 09:19 PM, Fu, Rodney wrote: > Yes, an ioctl open is possible but not ideal. The interface would require an > additional open to perform the ioctl against. The open system call is really a > great place to pass control information to the filesystem and any other solution > seems less elegant. But with the FS-specific open flag, you would have to do an open call with O_PATH, check that the file system is what you expect, and then openat the O_PATH descriptor to get a full descriptor. If you don't follow this protocol, you might end up using a custom open flag with a different file system which has completely different semantics for the flag. So ioctl actually is much simpler here and needs fewer system calls. (Due to per-file bind mounts, there is no way to figure out the file system on which a file is located without actually opening the file.) Thanks, Florian ^ permalink raw reply [flat|nested] 20+ messages in thread
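The check-then-reopen protocol Florian describes can be sketched roughly as
follows (Linux-specific; expected_magic would be the target filesystem's
f_type constant, e.g. a hypothetical panfs magic, and the re-open goes through
/proc/self/fd because an O_PATH descriptor cannot be used for I/O directly):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/vfs.h>
#include <unistd.h>

/* Take an O_PATH handle, verify the filesystem type, then re-open the
 * same object through /proc/self/fd with the real (possibly
 * filesystem-specific) flags.  Returns -1 on any mismatch or error. */
int open_on_fs(const char *path, int flags, long expected_magic)
{
    char proc[64];
    struct statfs sb;
    int full, pfd = open(path, O_PATH | O_CLOEXEC);

    if (pfd < 0)
        return -1;
    if (fstatfs(pfd, &sb) < 0 || sb.f_type != expected_magic) {
        close(pfd);
        return -1;              /* wrong filesystem: refuse the flag */
    }
    snprintf(proc, sizeof(proc), "/proc/self/fd/%d", pfd);
    full = open(proc, flags);   /* re-open with the fs-specific flags */
    close(pfd);
    return full;
}
```

As the message notes, this costs extra system calls compared with a plain
ioctl on an already-open descriptor, and the statfs check is the only way to
know which filesystem is actually behind the path.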
end of thread, other threads:[~2017-12-05 21:36 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-11-10 16:49 Provision for filesystem specific open flags Fu, Rodney 2017-11-10 17:23 ` hch 2017-11-10 17:39 ` Fu, Rodney 2017-11-10 19:29 ` Matthew Wilcox 2017-11-10 21:04 ` Fu, Rodney 2017-11-11 0:37 ` Matthew Wilcox 2017-11-13 15:16 ` Fu, Rodney 2017-11-20 13:38 ` Jeff Layton 2017-11-13 0:48 ` Dave Chinner 2017-11-13 17:02 ` Fu, Rodney 2017-11-13 21:58 ` Dave Chinner 2017-11-14 17:35 ` Fu, Rodney 2017-11-20 13:53 ` Jeff Layton 2017-11-20 13:53 ` Jeff Layton 2017-12-04 5:29 ` NeilBrown 2017-12-05 21:36 ` Andreas Dilger 2017-11-13 17:45 ` Bernd Schubert 2017-11-13 20:19 ` Fu, Rodney 2017-11-20 14:03 ` Florian Weimer 2017-11-20 14:03 ` Florian Weimer