* Provision for filesystem specific open flags
@ 2017-11-10 16:49 Fu, Rodney
  2017-11-10 17:23 ` hch
  0 siblings, 1 reply; 20+ messages in thread
From: Fu, Rodney @ 2017-11-10 16:49 UTC (permalink / raw)
  To: hch, viro; +Cc: linux-fsdevel

Hello,

With,

commit 59724793983177d1b51c8cdd0134326977a1cabc
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Apr 27 09:42:25 2017 +0200

    fs: completely ignore unknown open flags
    
    [ Upstream commit 629e014bb8349fcf7c1e4df19a842652ece1c945 ]
    
    Currently we just stash anything we got into file->f_flags, and the
    report it in fcntl(F_GETFD).  This patch just clears out all unknown
    flags so that we don't pass them to the fs or report them.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Sasha Levin <alexander.levin@verizon.com>

and,

commit 666d1fc2023eee3bf723c764eeeabb21d71a11f2
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Apr 27 09:42:24 2017 +0200

    fs: add a VALID_OPEN_FLAGS
    
    [ Upstream commit 80f18379a7c350c011d30332658aa15fe49a8fa5 ]
    
    Add a central define for all valid open flags, and use it in the uniqueness
    check.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Sasha Levin <alexander.levin@verizon.com>

The kernel prevents unknown open flags from being passed through to the 
underlying filesystem.  I am wondering if people would be for or against the 
idea of provisioning some number of bits in the open flags that are opaque to 
the VFS layer but get passed down to the underlying filesystem?  The motivation 
would be to allow filesystem specific semantics to be controllable via open, 
much like the more generic and pre-existing open flags.
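
For context, the net effect of the two commits above is roughly the following
mask-and-clear step in the open path.  This is a condensed sketch, not the
verbatim kernel source (VALID_OPEN_FLAGS is the real define, some flags are
omitted here, and the helper name is invented):

/* kernel-side sketch, condensed from include/linux/fcntl.h and fs/open.c */
#define VALID_OPEN_FLAGS \
        (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | \
         O_TRUNC | O_APPEND | O_NONBLOCK | O_DSYNC | O_DIRECT | \
         O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | O_NOATIME | \
         O_CLOEXEC | O_PATH)

static inline int filter_open_flags(int flags)
{
        return flags & VALID_OPEN_FLAGS;   /* unknown bits are silently dropped */
}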

Thanks,
Rodney Fu


* Re: Provision for filesystem specific open flags
  2017-11-10 16:49 Provision for filesystem specific open flags Fu, Rodney
@ 2017-11-10 17:23 ` hch
  2017-11-10 17:39   ` Fu, Rodney
  0 siblings, 1 reply; 20+ messages in thread
From: hch @ 2017-11-10 17:23 UTC (permalink / raw)
  To: Fu, Rodney; +Cc: hch, viro, linux-fsdevel

On Fri, Nov 10, 2017 at 04:49:33PM +0000, Fu, Rodney wrote:
> The kernel prevents unknown open flags from being passed through to the 
> underlying filesystem.  I am wondering if people would be for or against the 
> idea of provisioning some number of bits in the open flags that are opaque to 
> the VFS layer but get passed down to the underlying filesystem?  The motivation 
> would be to allow filesystem specific semantics to be controllable via open, 
> much like the more generic and pre-existing open flags.

Absolutely against.  Open flags need to be defined in common code or
you are in a massive world of trouble.


* RE: Provision for filesystem specific open flags
  2017-11-10 17:23 ` hch
@ 2017-11-10 17:39   ` Fu, Rodney
  2017-11-10 19:29     ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Fu, Rodney @ 2017-11-10 17:39 UTC (permalink / raw)
  To: hch; +Cc: viro, linux-fsdevel

> Absolutely against.  Open flags need to be defined in common code or you are in a massive world of trouble.

I'm suggesting this can be done with definitions in common code via generically named open flags.  Say, for example, we defined O_FS1 and O_FS2 (someone pick better names), exposed for applications to use, that have no meaning to the VFS layer but can be interpreted by the filesystem.  Could that work?


* Re: Provision for filesystem specific open flags
  2017-11-10 17:39   ` Fu, Rodney
@ 2017-11-10 19:29     ` Matthew Wilcox
  2017-11-10 21:04       ` Fu, Rodney
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2017-11-10 19:29 UTC (permalink / raw)
  To: Fu, Rodney; +Cc: hch, viro, linux-fsdevel

On Fri, Nov 10, 2017 at 05:39:21PM +0000, Fu, Rodney wrote:
> > Absolutely against.  Open flags need to be defined in common code or you are in a massive world of trouble.
> 
> I'm suggesting this can be done with definitions in common code via generically named open flags.  Say for example if there was defined an O_FS1, O_FS2 (someone pick a better name) exposed for an application to use that has no meaning to the VFS layer, but could be interpreted by the filesystem.  Could that work?

No.  If you want new flags bits, make a public proposal.  Maybe some
other filesystem would also benefit from them.


* RE: Provision for filesystem specific open flags
  2017-11-10 19:29     ` Matthew Wilcox
@ 2017-11-10 21:04       ` Fu, Rodney
  2017-11-11  0:37         ` Matthew Wilcox
                           ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Fu, Rodney @ 2017-11-10 21:04 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: hch, viro, linux-fsdevel

> No.  If you want new flags bits, make a public proposal.  Maybe some other
> filesystem would also benefit from them.

Ah, I see what you mean now, thanks.

I would like to propose O_CONCURRENT_WRITE as a new open flag.  It is
currently used in the Panasas filesystem (panfs) and defined with value:

#define O_CONCURRENT_WRITE 020000000000

This flag has been provided by panfs to HPC users via the mpich package for
well over a decade.  See:

https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344

O_CONCURRENT_WRITE indicates to the filesystem that the application doing the
open is participating in a coordinated distributed manner with other such
applications, possibly running on different hosts.  This allows the panfs
filesystem to delegate some of the cache coherency responsibilities to the
application, improving performance.
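
As a rough illustration of the application side, the usage in the mpich code
linked above boils down to something like this (a hypothetical sketch, not
the actual ad_panfs code; the fallback handling is simplified):

#include <fcntl.h>

#ifndef O_CONCURRENT_WRITE
#define O_CONCURRENT_WRITE 020000000000   /* bit historically used by panfs */
#endif

/* Hypothetical helper: ask for coordinated concurrent-write mode, and fall
 * back to a plain open if the flag is rejected or conflicts. */
static int open_for_coordinated_io(const char *path)
{
        int fd = open(path, O_RDWR | O_CREAT | O_CONCURRENT_WRITE, 0644);

        if (fd < 0)
                fd = open(path, O_RDWR | O_CREAT, 0644);
        return fd;
}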

The reason this flag is used on open as opposed to having a post-open ioctl or
fcntl SETFL is to allow panfs to catch and reject opens by applications that
attempt to access files that have already been opened by applications that have
set O_CONCURRENT_WRITE.

Recent changes to reject non-VALID_OPEN_FLAGS now prevent this facility from
working, so getting our flag bit back would help tremendously.  Thank you!

Regards,
Rodney Fu


* Re: Provision for filesystem specific open flags
  2017-11-10 21:04       ` Fu, Rodney
@ 2017-11-11  0:37         ` Matthew Wilcox
  2017-11-13 15:16           ` Fu, Rodney
  2017-11-13  0:48         ` Dave Chinner
  2017-11-13 17:45         ` Bernd Schubert
  2 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2017-11-11  0:37 UTC (permalink / raw)
  To: Fu, Rodney; +Cc: hch, viro, linux-fsdevel

On Fri, Nov 10, 2017 at 09:04:31PM +0000, Fu, Rodney wrote:
> > No.  If you want new flags bits, make a public proposal.  Maybe some other
> > filesystem would also benefit from them.
> 
> Ah, I see what you mean now, thanks.
> 
> I would like to propose O_CONCURRENT_WRITE as a new open flag.  It is
> currently used in the Panasas filesystem (panfs) and defined with value:
> 
> #define O_CONCURRENT_WRITE 020000000000
> 
> This flag has been provided by panfs to HPC users via the mpich package for
> well over a decade.  See:
> 
> https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344
> 
> O_CONCURRENT_WRITE indicates to the filesystem that the application doing the
> open is participating in a coordinated distributed manner with other such
> applications, possibly running on different hosts.  This allows the panfs
> filesystem to delegate some of the cache coherency responsibilities to the
> application, improving performance.
> 
> The reason this flag is used on open as opposed to having a post-open ioctl or
> fcntl SETFL is to allow panfs to catch and reject opens by applications that
> attempt to access files that have already been opened by applications that have
> set O_CONCURRENT_WRITE.

OK, let me just check I understand.  Once any application has opened
the inode with O_CONCURRENT_WRITE, all subsequent attempts to open the
same inode without O_CONCURRENT_WRITE will fail.  Presumably also if
somebody already has the inode open without O_CONCURRENT_WRITE set,
the first open with O_CONCURRENT_WRITE will fail?

Are opens with O_RDONLY also blocked?

This feels a lot like leases ... maybe there's an opportunity to
give better semantics here -- rather than rejecting opens without
O_CONCURRENT_WRITE, all existing users could be forced to use the stricter
coherency model?


* Re: Provision for filesystem specific open flags
  2017-11-10 21:04       ` Fu, Rodney
  2017-11-11  0:37         ` Matthew Wilcox
@ 2017-11-13  0:48         ` Dave Chinner
  2017-11-13 17:02           ` Fu, Rodney
  2017-11-13 17:45         ` Bernd Schubert
  2 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2017-11-13  0:48 UTC (permalink / raw)
  To: Fu, Rodney; +Cc: Matthew Wilcox, hch, viro, linux-fsdevel

On Fri, Nov 10, 2017 at 09:04:31PM +0000, Fu, Rodney wrote:
> > No.  If you want new flags bits, make a public proposal.  Maybe some other
> > filesystem would also benefit from them.
> 
> Ah, I see what you mean now, thanks.
> 
> I would like to propose O_CONCURRENT_WRITE as a new open flag.  It is
> currently used in the Panasas filesystem (panfs) and defined with value:
> 
> #define O_CONCURRENT_WRITE 020000000000
> 
> This flag has been provided by panfs to HPC users via the mpich package for
> well over a decade.  See:
> 
> https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344
> 
> O_CONCURRENT_WRITE indicates to the filesystem that the application doing the
> open is participating in a coordinated distributed manner with other such
> applications, possibly running on different hosts.  This allows the panfs
> filesystem to delegate some of the cache coherency responsibilities to the
> application, improving performance.

O_DIRECT already delegates responsibility for cache coherency to
userspace applications and it allows for concurrent writes to a
single file. Why do we need a new flag for this?

> The reason this flag is used on open as opposed to having a post-open ioctl or
> fcntl SETFL is to allow panfs to catch and reject opens by applications that
> attempt to access files that have already been opened by applications that have
> set O_CONCURRENT_WRITE.

Sounds kinda like how we already use O_EXCL on block devices.
Perhaps something like:

#define O_CONCURRENT_WRITE  (O_DIRECT | O_EXCL)

To tell open to reject mixed mode access to the file on open?

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* RE: Provision for filesystem specific open flags
  2017-11-11  0:37         ` Matthew Wilcox
@ 2017-11-13 15:16           ` Fu, Rodney
  2017-11-20 13:38             ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Fu, Rodney @ 2017-11-13 15:16 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: hch, viro, linux-fsdevel

> > > No.  If you want new flags bits, make a public proposal.  Maybe some 
> > > other filesystem would also benefit from them.
> > 
> > Ah, I see what you mean now, thanks.
> > 
> > I would like to propose O_CONCURRENT_WRITE as a new open flag.  It is 
> > currently used in the Panasas filesystem (panfs) and defined with value:
> > 
> > #define O_CONCURRENT_WRITE 020000000000
> > 
> > This flag has been provided by panfs to HPC users via the mpich 
> > package for well over a decade.  See:
> > 
> > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344
> > 
> > O_CONCURRENT_WRITE indicates to the filesystem that the application 
> > doing the open is participating in a coordinated distributed manner 
> > with other such applications, possibly running on different hosts.  
> > This allows the panfs filesystem to delegate some of the cache 
> > coherency responsibilities to the application, improving performance.
> > 
> > The reason this flag is used on open as opposed to having a post-open 
> > ioctl or fcntl SETFL is to allow panfs to catch and reject opens by 
> > applications that attempt to access files that have already been 
> > opened by applications that have set O_CONCURRENT_WRITE.

> OK, let me just check I understand.  Once any application has opened the inode
> with O_CONCURRENT_WRITE, all subsequent attempts to open the same inode without
> O_CONCURRENT_WRITE will fail.  Presumably also if somebody already has the inode
> open without O_CONCURRENT_WRITE set, the first open with O_CONCURRENT_WRITE will
> fail?

Yes on both counts.  Opening with O_CONCURRENT_WRITE, followed by an open
without will fail.  Opening without O_CONCURRENT_WRITE followed by one with it
will also fail.

> Are opens with O_RDONLY also blocked?

No they are not.  The decision to grant access is based solely on the
O_CONCURRENT_WRITE flag.

> This feels a lot like leases ... maybe there's an opportunity to give better
> semantics here -- rather than rejecting opens without O_CONCURRENT_WRITE, all
> existing users could be forced to use the stricter coherency model?

I don't think that will work, at least not from the perspective of trying to
maintain good performance.  A user that does not open with O_CONCURRENT_WRITE
does not know how to adhere to the proper access patterns that maintain
coherency.  To continue to allow all users access after that point, the
filesystem will have to force all users into a non-cacheable mode.  Instead, we
reject stray opens to allow any existing CONCURRENT_WRITE application to
complete in a higher performance mode.
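
In pseudo-C, the admission check behaves roughly as below.  This is only an
illustration of the semantics described in this thread, not the actual panfs
code, and the struct and function names are invented:

#include <errno.h>
#include <fcntl.h>

#ifndef O_CONCURRENT_WRITE
#define O_CONCURRENT_WRITE 020000000000
#endif

struct cw_state {                       /* hypothetical per-inode bookkeeping */
        int open_count;
        int concurrent_write_mode;
};

static int cw_check_open(struct cw_state *st, int flags)
{
        int want_cw = (flags & O_CONCURRENT_WRITE) != 0;

        if ((flags & O_ACCMODE) == O_RDONLY)
                return 0;                        /* read-only opens are not blocked */
        if (st->open_count == 0)
                st->concurrent_write_mode = want_cw;   /* first opener sets the mode */
        else if (st->concurrent_write_mode != want_cw)
                return -EACCES;                  /* mixed-mode open is rejected */
        st->open_count++;
        return 0;
}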


* RE: Provision for filesystem specific open flags
  2017-11-13  0:48         ` Dave Chinner
@ 2017-11-13 17:02           ` Fu, Rodney
  2017-11-13 21:58             ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Fu, Rodney @ 2017-11-13 17:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Matthew Wilcox, hch, viro, linux-fsdevel

> > > No.  If you want new flags bits, make a public proposal.  Maybe some 
> > > other filesystem would also benefit from them.
> > 
> > Ah, I see what you mean now, thanks.
> > 
> > I would like to propose O_CONCURRENT_WRITE as a new open flag.  It is 
> > currently used in the Panasas filesystem (panfs) and defined with value:
> > 
> > #define O_CONCURRENT_WRITE 020000000000
> > 
> > This flag has been provided by panfs to HPC users via the mpich 
> > package for well over a decade.  See:
> > 
> > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344
> > 
> > O_CONCURRENT_WRITE indicates to the filesystem that the application 
> > doing the open is participating in a coordinated distributed manner 
> > with other such applications, possibly running on different hosts.  
> > This allows the panfs filesystem to delegate some of the cache 
> > coherency responsibilities to the application, improving performance.

> O_DIRECT already delegates responsibility for cache coherency to userspace
> applications and it allows for concurrent writes to a single file. Why do we
> need a new flag for this?

> > The reason this flag is used on open as opposed to having a post-open 
> > ioctl or fcntl SETFL is to allow panfs to catch and reject opens by 
> > applications that attempt to access files that have already been 
> > opened by applications that have set O_CONCURRENT_WRITE.

> Sounds kinda like how we already use O_EXCL on block devices.
> Perhaps something like:

> #define O_CONCURRENT_WRITE  (O_DIRECT | O_EXCL)

> To tell open to reject mixed mode access to the file on open?

> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

Thanks for this suggestion, but O_DIRECT has a significantly different meaning
to O_CONCURRENT_WRITE.  O_DIRECT forces the filesystem to not cache read or
write data, while O_CONCURRENT_WRITE allows caching and concurrent distributed
access.  I was not clear in my initial description of CONCURRENT_WRITE, so let
me add more details here.

When O_CONCURRENT_WRITE is used, portions of read and write data are still
cachable in the filesystem.  The filesystem continues to be responsible for
maintaining distributed coherency.  The user application is expected to provide
an access pattern that will allow the filesystem to cache data, thereby
improving performance.  If the application misbehaves, the filesystem will still
guarantee coherency but at a performance cost, as portions of the file will have
to be treated as non-cacheable.

In panfs, a well behaved CONCURRENT_WRITE application will consider the file's
layout on storage.  Access from different machines will not overlap within the
same RAID stripe so as not to cause distributed stripe lock contention.  Writes
to the file that are page aligned can be cached and the filesystem can aggregate
multiple such writes before writing out to storage.  Conversely, a
CONCURRENT_WRITE application that ends up colliding on the same stripe will see
worse performance.  Non page aligned writes are treated by panfs as
write-through and non-cachable, as the filesystem will have to assume that the
region of the page that is untouched by this machine might in fact be written to
on another machine.  Caching such a page and writing it out later might lead to
data corruption.
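
To make the 'well behaved' pattern concrete, a layout-aware application might
partition the file like this (hypothetical sketch; the stripe width is an
assumed example value and the helper name is invented):

#include <sys/types.h>

#define STRIPE_SIZE (8L << 20)    /* assumed full RAID stripe width, page aligned */

/* Hypothetical partitioning: whole stripes are handed out round-robin so no
 * two machines write within the same RAID stripe, and every write is page
 * aligned, cacheable and can be aggregated before going out to storage. */
static off_t stripe_offset(int rank, int nranks, long n)
{
        return (off_t)(n * nranks + rank) * STRIPE_SIZE;
}

/* each client then only writes its own stripes, e.g.
 *      pwrite(fd, buf, STRIPE_SIZE, stripe_offset(rank, nranks, n));
 */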

The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the application does
not have to implement any caching to see good performance.  The intricacies of
maintaining distributed coherency are left to the filesystem instead of to
the application developer.  Caching at the filesystem layer allows multiple
CONCURRENT_WRITE processes on the same machine to enjoy the performance benefits
of the page cache.

Think of this as a hybrid between exclusive access to a file, where the
filesystem can cache everything and a simplistic shared mode where the
filesystem caches nothing.

So we really do need a separate flag defined.  Thanks!


* Re: Provision for filesystem specific open flags
  2017-11-10 21:04       ` Fu, Rodney
  2017-11-11  0:37         ` Matthew Wilcox
  2017-11-13  0:48         ` Dave Chinner
@ 2017-11-13 17:45         ` Bernd Schubert
  2017-11-13 20:19           ` Fu, Rodney
  2 siblings, 1 reply; 20+ messages in thread
From: Bernd Schubert @ 2017-11-13 17:45 UTC (permalink / raw)
  To: Fu, Rodney, Matthew Wilcox; +Cc: hch, viro, linux-fsdevel



On 11/10/2017 10:04 PM, Fu, Rodney wrote:
>> No.  If you want new flags bits, make a public proposal.  Maybe some other
>> filesystem would also benefit from them.
> 
> Ah, I see what you mean now, thanks.
> 
> I would like to propose O_CONCURRENT_WRITE as a new open flag.  It is
> currently used in the Panasas filesystem (panfs) and defined with value:
> 
> #define O_CONCURRENT_WRITE 020000000000
> 
> This flag has been provided by panfs to HPC users via the mpich package for
> well over a decade.  See:
> 
> https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344
> 
> O_CONCURRENT_WRITE indicates to the filesystem that the application doing the
> open is participating in a coordinated distributed manner with other such
> applications, possibly running on different hosts.  This allows the panfs
> filesystem to delegate some of the cache coherency responsibilities to the
> application, improving performance.
> 
> The reason this flag is used on open as opposed to having a post-open ioctl or
> fcntl SETFL is to allow panfs to catch and reject opens by applications that
> attempt to access files that have already been opened by applications that have
> set O_CONCURRENT_WRITE.

Hmm, while I see why adding this flag is convenient, shouldn't it still be
possible to open the file and then set the flag with an ioctl? If a
conflicting panfs inode flag is already set, failing either the normal open
or the ioctl would also work.


Cheers,
Bernd


* RE: Provision for filesystem specific open flags
  2017-11-13 17:45         ` Bernd Schubert
@ 2017-11-13 20:19           ` Fu, Rodney
  2017-11-20 14:03               ` Florian Weimer
  0 siblings, 1 reply; 20+ messages in thread
From: Fu, Rodney @ 2017-11-13 20:19 UTC (permalink / raw)
  To: Bernd Schubert, Matthew Wilcox; +Cc: hch, viro, linux-fsdevel

> Hmm, while I see why adding this flag is convenient, it still should be possible
> to have an ioctl to open the file and to set the flag? If a wrong panfs-inode
> flag is set, failing either the normal- or the ioctl-open would also work.


> Cheers,
> Bernd

Yes, an ioctl open is possible but not ideal.  The interface would require an
additional open to perform the ioctl against.  The open system call is really a
great place to pass control information to the filesystem and any other solution
seems less elegant.

There is also the issue of backward compatibility with existing MPI applications
that have been built using the existing O_CONCURRENT_WRITE flag.  A user wanting
to ensure compatibility would have to consider four pieces of software: the
kernel, the filesystem version, the MPI package and finally the application.
Having a single flag bit can make a big difference in this regard.
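
For completeness, a hypothetical runtime check for that stack (not something
mpich or panfs actually ships) would be to open with the flag and ask the
kernel whether the bit survived into f_flags:

#include <fcntl.h>

#ifndef O_CONCURRENT_WRITE
#define O_CONCURRENT_WRITE 020000000000
#endif

/* Returns 1 if the kernel kept the bit in f_flags, 0 if it was silently
 * dropped by the VALID_OPEN_FLAGS mask. */
static int concurrent_write_preserved(int fd)
{
        int fl = fcntl(fd, F_GETFL);

        return fl >= 0 && (fl & O_CONCURRENT_WRITE) != 0;
}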

Regards,
Rodney


* Re: Provision for filesystem specific open flags
  2017-11-13 17:02           ` Fu, Rodney
@ 2017-11-13 21:58             ` Dave Chinner
  2017-11-14 17:35               ` Fu, Rodney
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2017-11-13 21:58 UTC (permalink / raw)
  To: Fu, Rodney; +Cc: Matthew Wilcox, hch, viro, linux-fsdevel

On Mon, Nov 13, 2017 at 05:02:20PM +0000, Fu, Rodney wrote:
> > > > No.  If you want new flags bits, make a public proposal.  Maybe some 
> > > > other filesystem would also benefit from them.
> > > 
> > > Ah, I see what you mean now, thanks.
> > > 
> > > I would like to propose O_CONCURRENT_WRITE as a new open flag.  It is 
> > > currently used in the Panasas filesystem (panfs) and defined with value:
> > > 
> > > #define O_CONCURRENT_WRITE 020000000000
> > > 
> > > This flag has been provided by panfs to HPC users via the mpich 
> > > package for well over a decade.  See:
> > > 
> > > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344
> > > 
> > > O_CONCURRENT_WRITE indicates to the filesystem that the application 
> > > doing the open is participating in a coordinated distributed manner 
> > > with other such applications, possibly running on different hosts.  
> > > This allows the panfs filesystem to delegate some of the cache 
> > > coherency responsibilities to the application, improving performance.
> 
> > O_DIRECT already delegates responsibility for cache coherency to userspace
> > applications and it allows for concurrent writes to a single file. Why do we
> > need a new flag for this?
> 
> > > The reason this flag is used on open as opposed to having a post-open 
> > > ioctl or fcntl SETFL is to allow panfs to catch and reject opens by 
> > > applications that attempt to access files that have already been 
> > > opened by applications that have set O_CONCURRENT_WRITE.
> 
> > Sounds kinda like how we already use O_EXCL on block devices.
> > Perhaps something like:
> 
> > #define O_CONCURRENT_WRITE  (O_DIRECT | O_EXCL)
> 
> > To tell open to reject mixed mode access to the file on open?
> 
> > -Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> 
> Thanks for this suggestion, but O_DIRECT has a significantly different meaning
> to O_CONCURRENT_WRITE.  O_DIRECT forces the filesystem to not cache read or
> write data, while O_CONCURRENT_WRITE allows caching and concurrent distributed
> access.  I was not clear in my initial description of CONCURRENT_WRITE, so let
> me add more details here.
> 
> When O_CONCURRENT_WRITE is used, portions of read and write data are still
> cachable in the filesystem.

The filesystem can still choose to do that for O_DIRECT if it wants
- look at all the filesystems that have a "fall back to buffered IO
because this is too hard to implement in the direct IO path".

> The filesystem continues to be responsible for
> maintaining distributed coherency.

Just like gfs2 and ocfs2 maintain distributed coherency when doing
direct IO...

> The user application is expected to provide
> an access pattern that will allow the filesystem to cache data, thereby
> improving performance.  If the application misbehaves, the filesystem will still
> guarantee coherency but at a performance cost, as portions of the file will have
> to be treated as non-cacheable.

IOWs, you've got another set of custom userspace APIs that are
needed to make proper use of this open flag?

> In panfs, a well behaved CONCURRENT_WRITE application will consider the file's
> layout on storage.  Access from different machines will not overlap within the
> same RAID stripe so as not to cause distributed stripe lock contention.  Writes
> to the file that are page aligned can be cached and the filesystem can aggregate
> multiple such writes before writing out to storage.  Conversely, a
> CONCURRENT_WRITE application that ends up colliding on the same stripe will see
> worse performance.  Non page aligned writes are treated by panfs as
> write-through and non-cachable, as the filesystem will have to assume that the
> region of the page that is untouched by this machine might in fact be written to
> on another machine.  Caching such a page and writing it out later might lead to
> data corruption.

That seems to fit the expected behaviour of O_DIRECT pretty damn
closely - if the app doesn't do correctly aligned and sized IO then
performance is going to suck, and if the app doesn't serialise
access to the file correctly it can and will corrupt data in the
file....

> The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the application does
> not have to implement any caching to see good performance.

Sure, but it has to be aware of layout and where/how it can write,
which is exactly the same constraints that local filesystems place
on O_DIRECT access.

> The intricacies of
> maintaining distributed coherency are left to the filesystem instead of to
> the application developer.  Caching at the filesystem layer allows multiple
> CONCURRENT_WRITE processes on the same machine to enjoy the performance benefits
> of the page cache.
> 
> Think of this as a hybrid between exclusive access to a file, where the
> filesystem can cache everything and a simplistic shared mode where the
> filesystem caches nothing.
> 
> So we really do need a separate flag defined.  Thanks!

Not convinced. The use case fits pretty neatly into expected
O_DIRECT semantics and behaviour, IMO.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* RE: Provision for filesystem specific open flags
  2017-11-13 21:58             ` Dave Chinner
@ 2017-11-14 17:35               ` Fu, Rodney
  2017-11-20 13:53                   ` Jeff Layton
  2017-12-04  5:29                 ` NeilBrown
  0 siblings, 2 replies; 20+ messages in thread
From: Fu, Rodney @ 2017-11-14 17:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Matthew Wilcox, hch, viro, linux-fsdevel

> The filesystem can still choose to do that for O_DIRECT if it wants - look at
> all the filesystems that have a "fall back to buffered IO because this is too
> hard to implement in the direct Io path".

Yes, I agree that the filesystem can still decide to buffer IO even with
O_DIRECT, but the application's intent is that the effects of caching are
minimized.  Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching.

> IOWs, you've got another set of custom userspace APIs that are needed to make
> proper use of this open flag?

Yes and no.  Applications can make ioctls to the filesystem to query or set
layout details but don't have to.  Directory level default layout attributes can
be set up by an admin to meet the requirements of the application.

> > In panfs, a well behaved CONCURRENT_WRITE application will consider 
> > the file's layout on storage.  Access from different machines will not 
> > overlap within the same RAID stripe so as not to cause distributed 
> > stripe lock contention.  Writes to the file that are page aligned can 
> > be cached and the filesystem can aggregate multiple such writes before 
> > writing out to storage.  Conversely, a CONCURRENT_WRITE application 
> > that ends up colliding on the same stripe will see worse performance.  
> > Non page aligned writes are treated by panfs as write-through and 
> > non-cachable, as the filesystem will have to assume that the region of 
> > the page that is untouched by this machine might in fact be written to 
> > on another machine.  Caching such a page and writing it out later might lead to data corruption.

> That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if
> the app doesn't do correctly aligned and sized IO then performance is going to
> suck, and if the apps doesn't serialize access to the file correctly it can and
> will corrupt data in the file....

I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have
opposite intents with respect to caching.  Our filesystem handles them
differently, so we need to distinguish between the two.

> > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the 
> > application does not have to implement any caching to see good performance.

> Sure, but it has to be aware of layout and where/how it can write, which is
> exactly the same constraints that local filesystems place on O_DIRECT access.

> Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics
> and behaviour, IMO.

I'd like to make a slight adjustment to my proposal.  The HPC community had
talked about extensions to POSIX to include O_LAZY as a way for filesystems to
relax data coherency requirements.  There is code in the ceph filesystem that
uses that flag if defined.  Can we get O_LAZY defined?

HEC POSIX extension:
http://www.pdsw.org/pdsw06/resources/hec-posix-extensions-sc2006-workshop.pdf

Ceph usage of O_LAZY:
https://github.com/ceph/ceph-client/blob/1e37f2f84680fa7f8394fd444b6928e334495ccc/net/ceph/ceph_fs.c#L78
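
Application-side usage would be conditional on the flag existing, much as the
ceph code above only compiles its O_LAZY handling when the macro is defined.
A hypothetical sketch (macro and helper names invented):

#include <fcntl.h>

#ifdef O_LAZY
#define LAZY_OPEN_FLAG O_LAZY
#else
#define LAZY_OPEN_FLAG 0          /* headers without O_LAZY: normal coherency */
#endif

/* Hypothetical usage: request relaxed coherency where the flag exists,
 * ordinary POSIX semantics everywhere else. */
static int open_lazy(const char *path)
{
        return open(path, O_RDWR | LAZY_OPEN_FLAG);
}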

Regards,
Rodney


* Re: Provision for filesystem specific open flags
  2017-11-13 15:16           ` Fu, Rodney
@ 2017-11-20 13:38             ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2017-11-20 13:38 UTC (permalink / raw)
  To: Fu, Rodney, Matthew Wilcox; +Cc: hch, viro, linux-fsdevel, linux-api

On Mon, 2017-11-13 at 15:16 +0000, Fu, Rodney wrote:
> > > > No.  If you want new flags bits, make a public proposal.  Maybe some 
> > > > other filesystem would also benefit from them.
> > > 
> > > Ah, I see what you mean now, thanks.
> > > 
> > > I would like to propose O_CONCURRENT_WRITE as a new open flag.  It is 
> > > currently used in the Panasas filesystem (panfs) and defined with value:
> > > 
> > > #define O_CONCURRENT_WRITE 020000000000
> > > 
> > > This flag has been provided by panfs to HPC users via the mpich 
> > > package for well over a decade.  See:
> > > 
> > > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344
> > > 
> > > O_CONCURRENT_WRITE indicates to the filesystem that the application 
> > > doing the open is participating in a coordinated distributed manner 
> > > with other such applications, possibly running on different hosts.  
> > > This allows the panfs filesystem to delegate some of the cache 
> > > coherency responsibilities to the application, improving performance.
> > > 
> > > The reason this flag is used on open as opposed to having a post-open 
> > > ioctl or fcntl SETFL is to allow panfs to catch and reject opens by 
> > > applications that attempt to access files that have already been 
> > > opened by applications that have set O_CONCURRENT_WRITE.
> > OK, let me just check I understand.  Once any application has opened the inode
> > with O_CONCURRENT_WRITE, all subsequent attempts to open the same inode without
> > O_CONCURRENT_WRITE will fail.  Presumably also if somebody already has the inode
> > open without O_CONCURRENT_WRITE set, the first open with O_CONCURRENT_WRITE will
> > fail?
> 
> Yes on both counts.  Opening with O_CONCURRENT_WRITE, followed by an open
> without will fail.  Opening without O_CONCURRENT_WRITE followed by one with it
> will also fail.
> 
> > Are opens with O_RDONLY also blocked?
> 
> No they are not.  The decision to grant access is based solely on the
> O_CONCURRENT_WRITE flag.
> 
> > This feels a lot like leases ... maybe there's an opportunity to give better
> > semantics here -- rather than rejecting opens without O_CONCURRENT_WRITE, all
> > existing users could be forced to use the stricter coherency model?
> 
> I don't think that will work, at least not from the perspective of trying to
> maintain good performance.  A user that does not open with O_CONCURRENT_WRITE
> does not know how to adhere to the proper access patterns that maintain
> coherency.  To continue to allow all users access after that point, the
> filesystem will have to force all users into a non-cacheable mode.  Instead, we
> reject stray opens to allow any existing CONCURRENT_WRITE application to
> complete in a higher performance mode.
> 

(added linux-api@vger.kernel.org to the cc list...)

Actually, it feels more like O_EXLOCK / O_SHLOCK to me:

    https://www.gnu.org/software/libc/manual/html_node/Open_002dtime-Flags.html

Those are not quite the same semantics as what you're describing for
O_CONCURRENT_WRITE, but the handling of conflicts would be similar. 

Maybe it's possible to dovetail your new flag on top of a credible
O_EXLOCK/O_SHLOCK implementation? It'd be nice to have those to
implement VFS-level share/deny locking. Most NFS and SMB servers could
make good use of it.
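
For reference, on systems that provide the BSD-style flags the open-time
locking usage looks roughly like the following (hedged example; Linux does
not implement these flags today, and the helper names are invented):

#include <fcntl.h>

/* O_SHLOCK takes a shared lock and O_EXLOCK an exclusive lock atomically
 * with the open itself; adding O_NONBLOCK makes a contended open fail
 * instead of blocking. */
#if defined(O_EXLOCK) && defined(O_SHLOCK)
static int open_exclusive(const char *path)
{
        return open(path, O_RDWR | O_EXLOCK | O_NONBLOCK);
}

static int open_shared(const char *path)
{
        return open(path, O_RDONLY | O_SHLOCK);
}
#endif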


-- 
Jeff Layton <jlayton@kernel.org>


* Re: Provision for filesystem specific open flags
@ 2017-11-20 13:53                   ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2017-11-20 13:53 UTC (permalink / raw)
  To: Fu, Rodney, Dave Chinner
  Cc: Matthew Wilcox, hch, viro, linux-fsdevel, linux-api

On Tue, 2017-11-14 at 17:35 +0000, Fu, Rodney wrote:
> > The filesystem can still choose to do that for O_DIRECT if it wants - look at
> > all the filesystems that have a "fall back to buffered IO because this is too
> > hard to implement in the direct Io path".
> 
> Yes, I agree that the filesystem can still decide to buffer IO even with
> O_DIRECT, but the application's intent is that the effects of caching are
> minimized.  Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching.
> 
> > IOWs, you've got another set of custom userspace APIs that are needed to make
> > proper use of this open flag?
> 
> Yes and no.  Applications can make ioctls to the filesystem to query or set
> layout details but don't have to.  Directory level default layout attributes can
> be set up by an admin to meet the requirements of the application.
> 
> > > In panfs, a well behaved CONCURRENT_WRITE application will consider 
> > > the file's layout on storage.  Access from different machines will not 
> > > overlap within the same RAID stripe so as not to cause distributed 
> > > stripe lock contention.  Writes to the file that are page aligned can 
> > > be cached and the filesystem can aggregate multiple such writes before 
> > > writing out to storage.  Conversely, a CONCURRENT_WRITE application 
> > > that ends up colliding on the same stripe will see worse performance.  
> > > Non page aligned writes are treated by panfs as write-through and 
> > > non-cachable, as the filesystem will have to assume that the region of 
> > > the page that is untouched by this machine might in fact be written to 
> > > on another machine.  Caching such a page and writing it out later might lead to data corruption.
> > That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if
> > the app doesn't do correctly aligned and sized IO then performance is going to
> > suck, and if the apps doesn't serialize access to the file correctly it can and
> > will corrupt data in the file....
> 
> I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have
> opposite intents with respect to caching.  Our filesystem handles them
> differently, so we need to distinguish between the two.
> 
> > > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the 
> > > application does not have to implement any caching to see good performance.
> > Sure, but it has to be aware of layout and where/how it can write, which is
> > exactly the same constraints that local filesystems place on O_DIRECT access.
> > Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics
> > and behaviour, IMO.
> 
> I'd like to make a slight adjustment to my proposal.  The HPC community had
> talked about extensions to POSIX to include O_LAZY as a way for filesystems to
> relax data coherency requirements.  There is code in the ceph filesystem that
> uses that flag if defined.  Can we get O_LAZY defined?
> 
> HEC POSIX extension:
> http://www.pdsw.org/pdsw06/resources/hec-posix-extensions-sc2006-workshop.pdf
> 
> Ceph usage of O_LAZY:
> https://github.com/ceph/ceph-client/blob/1e37f2f84680fa7f8394fd444b6928e334495ccc/net/ceph/ceph_fs.c#L78


O_LAZY support was removed from cephfs userland client in 2013:

    commit 94afedf02d07ad4678222aa66289a74b87768810
    Author: Sage Weil <sage@inktank.com>
    Date:   Mon Jul 8 11:24:48 2013 -0700

        client: remove O_LAZY

...part of the problem (and this may just be my lack of understanding)
is that it's not clear what O_LAZY semantics actually are. The ceph
sources have a textfile with this in it:

"-- lazy i/o integrity

  FIXME: currently missing call to flag an Fd/file has lazy.  used to be
O_LAZY on open, but no more.

  * relax data coherency
  * writes may not be visible until lazyio_propagate, fsync, close

  lazyio_propagate(int fd, off_t offset, size_t count);
   * my writes are safe

  lazyio_synchronize(int fd, off_t offset, size_t count);
   * i will see everyone else's propagated writes


lazyio_propagate / lazyio_synchronize. Those seem like they could be
implemented as ioctls if you don't care about other filesystems.
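
Going by those prototypes, the intended usage is presumably something like
this sketch (lazyio_propagate/lazyio_synchronize are the proposed calls from
the text above, not an existing Linux API, so this is not working code):

#include <unistd.h>

/* prototypes as given in the ceph text, proposed rather than implemented */
int lazyio_propagate(int fd, off_t offset, size_t count);
int lazyio_synchronize(int fd, off_t offset, size_t count);

static void lazy_io_example(int fd, char *buf, size_t len, off_t off)
{
        pwrite(fd, buf, len, off);           /* may not be visible to other nodes yet */
        lazyio_propagate(fd, off, len);      /* "my writes are safe" */

        lazyio_synchronize(fd, off, len);    /* "i will see everyone else's writes" */
        pread(fd, buf, len, off);
}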

It is possible to add new open flags (we're running low, but that's a
problem we'll hit sooner or later anyway), but before we can do anything
here, O_LAZY needs to be defined in a way that makes sense for
application developers across filesystems.

How does this change behavior on ext4, xfs or btrfs, for instance? What
about nfs or cifs?

I suggest that before you even dive into writing patches for any of
this, that you draft a small manpage update for open(2). What would an
O_LAZY entry look like?

-- 
Jeff Layton <jlayton@kernel.org>


* Re: Provision for filesystem specific open flags
@ 2017-11-20 14:03               ` Florian Weimer
  0 siblings, 0 replies; 20+ messages in thread
From: Florian Weimer @ 2017-11-20 14:03 UTC (permalink / raw)
  To: Fu, Rodney, Bernd Schubert, Matthew Wilcox
  Cc: hch, viro, linux-fsdevel, Linux API

On 11/13/2017 09:19 PM, Fu, Rodney wrote:
> Yes, an ioctl open is possible but not ideal.  The interface would require an
> additional open to perform the ioctl against.  The open system call is really a
> great place to pass control information to the filesystem and any other solution
> seems less elegant.

But with the FS-specific open flag, you would have to do an open call 
with O_PATH, check that the file system is what you expect, and then 
openat the O_PATH descriptor to get a full descriptor.  If you don't 
follow this protocol, you might end up using a custom open flag with a 
different file system which has completely different semantics for the flag.
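
A hypothetical sketch of that protocol (PANFS_SUPER_MAGIC is a made-up name
and value; the re-open goes through /proc/self/fd, the usual way to turn an
O_PATH descriptor into a normal one):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/statfs.h>
#include <unistd.h>

#define PANFS_SUPER_MAGIC 0xaad7aaea    /* made-up value, illustration only */

static int open_checked(const char *path, int flags)
{
        struct statfs st;
        char proc[64];
        int pfd, fd;

        pfd = open(path, O_PATH);
        if (pfd < 0)
                return -1;
        if (fstatfs(pfd, &st) < 0 || st.f_type != PANFS_SUPER_MAGIC) {
                close(pfd);
                return -1;              /* not the filesystem we expected */
        }
        snprintf(proc, sizeof(proc), "/proc/self/fd/%d", pfd);
        fd = open(proc, flags);         /* the "real" open, with the fs flag */
        close(pfd);
        return fd;
}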

So ioctl actually is much simpler here and needs fewer system calls.

(Due to per-file bind mounts, there is no way to figure out the file 
system on which a file is located without actually opening the file.)

Thanks,
Florian


* RE: Provision for filesystem specific open flags
  2017-11-14 17:35               ` Fu, Rodney
  2017-11-20 13:53                   ` Jeff Layton
@ 2017-12-04  5:29                 ` NeilBrown
  2017-12-05 21:36                   ` Andreas Dilger
  1 sibling, 1 reply; 20+ messages in thread
From: NeilBrown @ 2017-12-04  5:29 UTC (permalink / raw)
  To: Fu, Rodney, Dave Chinner; +Cc: Matthew Wilcox, hch, viro, linux-fsdevel

On Tue, Nov 14 2017, Fu, Rodney wrote:

>> The filesystem can still choose to do that for O_DIRECT if it wants - look at
>> all the filesystems that have a "fall back to buffered IO because this is too
>> hard to implement in the direct Io path".
>
> Yes, I agree that the filesystem can still decide to buffer IO even with
> O_DIRECT, but the application's intent is that the effects of caching are
> minimized.  Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching.
>
>> IOWs, you've got another set of custom userspace APIs that are needed to make
>> proper use of this open flag?
>
> Yes and no.  Applications can make ioctls to the filesystem to query or set
> layout details but don't have to.  Directory level default layout attributes can
> be set up by an admin to meet the requirements of the application.
>
>> > In panfs, a well behaved CONCURRENT_WRITE application will consider 
>> > the file's layout on storage.  Access from different machines will not 
>> > overlap within the same RAID stripe so as not to cause distributed 
>> > stripe lock contention.  Writes to the file that are page aligned can 
>> > be cached and the filesystem can aggregate multiple such writes before 
>> > writing out to storage.  Conversely, a CONCURRENT_WRITE application 
>> > that ends up colliding on the same stripe will see worse performance.  
>> > Non page aligned writes are treated by panfs as write-through and 
>> > non-cachable, as the filesystem will have to assume that the region of 
>> > the page that is untouched by this machine might in fact be written to 
>> > on another machine.  Caching such a page and writing it out later might lead to data corruption.
>
>> That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if
>> the app doesn't do correctly aligned and sized IO then performance is going to
>> suck, and if the apps doesn't serialize access to the file correctly it can and
>> will corrupt data in the file....
>
> I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have
> opposite intents with respect to caching.  Our filesystem handles them
> differently, so we need to distinguish between the two.
>
>> > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the 
>> > application does not have to implement any caching to see good performance.
>
>> Sure, but it has to be aware of layout and where/how it can write, which is
>> exactly the same constraints that local filesystems place on O_DIRECT access.
>
>> Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics
>> and behaviour, IMO.
>
> I'd like to make a slight adjustment to my proposal.  The HPC community had
> talked about extensions to POSIX to include O_LAZY as a way for filesystems to
> relax data coherency requirements.  There is code in the ceph filesystem that
> uses that flag if defined.  Can we get O_LAZY defined?

This O_LAZY sounds exactly like what NFS has always done.
If different clients do page aligned writes and have their own protocol
to keep track of who owns which page, then everything is fine and
write-back caching does good things.
If different clients use byte-range locks, then write-back caching
is curtailed a bit, but clients don't need to be so careful.
If clients do non-aligned writes without locking, then corruption can
result.
So:
  #define O_LAZY 0
and NFS already has it implemented :-)

For NFS, we have O_SYNC, which tries to provide cache-coherency as strong
as other filesystems provide without it.

Do we really want O_LAZY?  Or are other filesystems trying too hard to
provide coherency when apps don't use locks?

NeilBrown


>
> HEC POSIX extension:
> http://www.pdsw.org/pdsw06/resources/hec-posix-extensions-sc2006-workshop.pdf
>
> Ceph usage of O_LAZY:
> https://github.com/ceph/ceph-client/blob/1e37f2f84680fa7f8394fd444b6928e334495ccc/net/ceph/ceph_fs.c#L78
>
> Regards,
> Rodney


* Re: Provision for filesystem specific open flags
  2017-12-04  5:29                 ` NeilBrown
@ 2017-12-05 21:36                   ` Andreas Dilger
  0 siblings, 0 replies; 20+ messages in thread
From: Andreas Dilger @ 2017-12-05 21:36 UTC (permalink / raw)
  To: NeilBrown
  Cc: Fu, Rodney, Dave Chinner, Matthew Wilcox, hch, viro, linux-fsdevel

On Dec 3, 2017, at 10:29 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Tue, Nov 14 2017, Fu, Rodney wrote:
> 
>>> The filesystem can still choose to do that for O_DIRECT if it wants - look at
>>> all the filesystems that have a "fall back to buffered IO because this is too
>>> hard to implement in the direct Io path".
>> 
>> Yes, I agree that the filesystem can still decide to buffer IO even with
>> O_DIRECT, but the application's intent is that the effects of caching are
>> minimized.  Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching.
>> 
>>> IOWs, you've got another set of custom userspace APIs that are needed to make
>>> proper use of this open flag?
>> 
>> Yes and no.  Applications can make ioctls to the filesystem to query or set
>> layout details but don't have to.  Directory level default layout attributes can
>> be set up by an admin to meet the requirements of the application.
>> 
>>>> In panfs, a well behaved CONCURRENT_WRITE application will consider
>>>> the file's layout on storage.  Access from different machines will not
>>>> overlap within the same RAID stripe so as not to cause distributed
>>>> stripe lock contention.  Writes to the file that are page aligned can
>>>> be cached and the filesystem can aggregate multiple such writes before
>>>> writing out to storage.  Conversely, a CONCURRENT_WRITE application
>>>> that ends up colliding on the same stripe will see worse performance.
>>>> Non page aligned writes are treated by panfs as write-through and
>>>> non-cachable, as the filesystem will have to assume that the region of
>>>> the page that is untouched by this machine might in fact be written to
>>>> on another machine.  Caching such a page and writing it out later might lead to data corruption.
>> 
>>> That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if
>>> the app doesn't do correctly aligned and sized IO then performance is going to
>>> suck, and if the apps doesn't serialize access to the file correctly it can and
>>> will corrupt data in the file....
>> 
>> I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have
>> opposite intents with respect to caching.  Our filesystem handles them
>> differently, so we need to distinguish between the two.
>> 
>>>> The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the
>>>> application does not have to implement any caching to see good performance.
>> 
>>> Sure, but it has to be aware of layout and where/how it can write, which is
>>> exactly the same constraints that local filesystems place on O_DIRECT access.
>> 
>>> Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics
>>> and behaviour, IMO.
>> 
>> I'd like to make a slight adjustment to my proposal.  The HPC community had
>> talked about extensions to POSIX to include O_LAZY as a way for filesystems to
>> relax data coherency requirements.  There is code in the ceph filesystem that
>> uses that flag if defined.  Can we get O_LAZY defined?
> 
> This O_LAZY sounds exactly like what NFS has always done.
> If different clients do page aligned writes and have their own protocol
> to keep track of who owns which page, then everything is fine and
> write-back caching does good things.
> If different clients use byte-range locks, then write-back caching
> is curtailed a bit, but clients don't need to be so careful.
> If clients do non-aligned writes without locking, then corruption can
> result.
> So:
>  #define O_LAZY 0
> and NFS already has it implemented :-)
> 
> For NFS, with have O_SYNC which tries to provide cache-coherency as strong
> as other filesystems provide without it.
> 
> Do we really want O_LAZY?  Or are other filesystems trying too hard to
> provide coherency when apps don't use locks?

Well, POSIX requires the correct read-after-write behaviour regardless
of whether applications are being careful or not. As you wrote above,
"If clients do non-aligned writes without locking, then corruption can
result," and there definitely are apps that expect the filesystem to
work correctly even at very large scales.

I think O_LAZY would be reasonable to add, as long as that is what
applications are asking for, but we can't just break long-standing
data correctness behind their backs because it would go faster, and
there is no way for the filesystem to know without a flag like O_LAZY
if they are doing their own locking or not.

There is also a simple fallback to "#define O_LAZY 0" if it is not
defined on older systems, and then POSIX-compliant filesystems (not NFS)
will still work correctly, without the speedup that O_LAZY provides.
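
Concretely, that fallback is just the usual guard:

#ifndef O_LAZY
#define O_LAZY 0        /* older systems: flag compiles away, normal coherency */
#endif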

Cheers, Andreas






