Linux-RDMA Archive on lore.kernel.org
 help / color / Atom feed
* Lease semantic proposal
@ 2019-09-23 19:08 Ira Weiny
  2019-09-23 20:17 ` Jeff Layton
  2019-09-23 22:26 ` Dave Chinner
  0 siblings, 2 replies; 18+ messages in thread
From: Ira Weiny @ 2019-09-23 19:08 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-ext4, linux-rdma, linux-kernel,
	linux-nvdimm, linux-mm
  Cc: Jeff Layton, Dave Chinner, Jan Kara, Theodore Ts'o,
	John Hubbard, Dan Williams, Jason Gunthorpe


Since the last RFC patch set[1] much of the discussion of supporting RDMA with
FS DAX has been around the semantics of the lease mechanism.[2]  Within that
thread it was suggested I try and write some documentation and/or tests for the
new mechanism being proposed.  I have created a foundation to test lease
functionality within xfstests.[3] This should be close to being accepted.
Before writing additional lease tests, or changing lots of kernel code, this
email presents documentation for the new proposed "layout lease" semantic.

At Linux Plumbers[4] just over a week ago, I presented the current state of the
patch set and the outstanding issues.  Based on the discussion there, well as
follow up emails, I propose the following addition to the fcntl() man page.

Thank you,
Ira

[1] https://lkml.org/lkml/2019/8/9/1043
[2] https://lkml.org/lkml/2019/8/9/1062
[3] https://www.spinics.net/lists/fstests/msg12620.html
[4] https://linuxplumbersconf.org/event/4/contributions/368/


<fcntl man page addition>
Layout Leases
-------------

Layout (F_LAYOUT) leases are special leases which can be used to control and/or
be informed about the manipulation of the underlying layout of a file.

A layout is defined as the logical file block -> physical file block mapping
including the file size and sharing of physical blocks among files.  Note that
the unwritten state of a block is not considered part of file layout.

**Read layout lease F_RDLCK | F_LAYOUT**

Read layout leases can be used to be informed of layout changes by the
system or other users.  This lease is similar to the standard read (F_RDLCK)
lease in that any attempt to change the _layout_ of the file will be reported to
the process through the lease break process.  But this lease is different
because the file can be opened for write and data can be read and/or written to
the file as long as the underlying layout of the file does not change.
Therefore, the lease is not broken if the file is simply open for write, but
_may_ be broken if an operation such as, truncate(), fallocate() or write()
results in changing the underlying layout.

**Write layout lease (F_WRLCK | F_LAYOUT)**

Write Layout leases can be used to break read layout leases to indicate that
the process intends to change the underlying layout lease of the file.

A process which has taken a write layout lease has exclusive ownership of the
file layout and can modify that layout as long as the lease is held.
Operations which change the layout are allowed by that process.  But operations
from other file descriptors which attempt to change the layout will break the
lease through the standard lease break process.  The F_LAYOUT flag is used to
indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
the F_LAYOUT case opens for write do not break the lease.  But some operations,
if they change the underlying layout, may.

The distinction between read layout leases and write layout leases is that
write layout leases can change the layout without breaking the lease within the
owning process.  This is useful to guarantee a layout prior to specifying the
unbreakable flag described below.


**Unbreakable Layout Leases (F_UNBREAK)**

In order to support pinning of file pages by direct user space users an
unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
lease.  When specified, F_UNBREAK indicates that any user attempting to break
the lease will fail with ETXTBUSY rather than follow the normal breaking
procedure.

Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
specified.  The difference between an unbreakable read layout lease and an
unbreakable write layout lease are that an unbreakable read layout lease is
_not_ exclusive.  This means that once a layout is established on a file,
multiple unbreakable read layout leases can be taken by multiple processes and
used to pin the underlying pages of that file.

Care must therefore be taken to ensure that the layout of the file is as the
user wants prior to using the unbreakable read layout lease.  A safe mechanism
to do this would be to take a write layout lease and use fallocate() to set the
layout of the file.  The layout lease can then be "downgraded" to unbreakable
read layout as long as no other user broke the write layout lease.

</fcntl man page addition>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-09-23 19:08 Lease semantic proposal Ira Weiny
@ 2019-09-23 20:17 ` Jeff Layton
  2019-10-01 18:17   ` Ira Weiny
  2019-09-23 22:26 ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2019-09-23 20:17 UTC (permalink / raw)
  To: Ira Weiny, linux-fsdevel, linux-xfs, linux-ext4, linux-rdma,
	linux-kernel, linux-nvdimm, linux-mm
  Cc: Dave Chinner, Jan Kara, Theodore Ts'o, John Hubbard,
	Dan Williams, Jason Gunthorpe

On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> Since the last RFC patch set[1] much of the discussion of supporting RDMA with
> FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> thread it was suggested I try and write some documentation and/or tests for the
> new mechanism being proposed.  I have created a foundation to test lease
> functionality within xfstests.[3] This should be close to being accepted.
> Before writing additional lease tests, or changing lots of kernel code, this
> email presents documentation for the new proposed "layout lease" semantic.
> 
> At Linux Plumbers[4] just over a week ago, I presented the current state of the
> patch set and the outstanding issues.  Based on the discussion there, well as
> follow up emails, I propose the following addition to the fcntl() man page.
> 
> Thank you,
> Ira
> 
> [1] https://lkml.org/lkml/2019/8/9/1043
> [2] https://lkml.org/lkml/2019/8/9/1062
> [3] https://www.spinics.net/lists/fstests/msg12620.html
> [4] https://linuxplumbersconf.org/event/4/contributions/368/
> 
> 

Thank you so much for doing this, Ira. This allows us to debate the
user-visible behavior semantics without getting bogged down in the
implementation details. More comments below:

> <fcntl man page addition>
> Layout Leases
> -------------
> 
> Layout (F_LAYOUT) leases are special leases which can be used to control and/or
> be informed about the manipulation of the underlying layout of a file.
> 
> A layout is defined as the logical file block -> physical file block mapping
> including the file size and sharing of physical blocks among files.  Note that
> the unwritten state of a block is not considered part of file layout.
> 
> **Read layout lease F_RDLCK | F_LAYOUT**
> 
> Read layout leases can be used to be informed of layout changes by the
> system or other users.  This lease is similar to the standard read (F_RDLCK)
> lease in that any attempt to change the _layout_ of the file will be reported to
> the process through the lease break process.  But this lease is different
> because the file can be opened for write and data can be read and/or written to
> the file as long as the underlying layout of the file does not change.
> Therefore, the lease is not broken if the file is simply open for write, but
> _may_ be broken if an operation such as, truncate(), fallocate() or write()
> results in changing the underlying layout.
> 
> **Write layout lease (F_WRLCK | F_LAYOUT)**
> 
> Write Layout leases can be used to break read layout leases to indicate that
> the process intends to change the underlying layout lease of the file.
> 
> A process which has taken a write layout lease has exclusive ownership of the
> file layout and can modify that layout as long as the lease is held.
> Operations which change the layout are allowed by that process.  But operations
> from other file descriptors which attempt to change the layout will break the
> lease through the standard lease break process.  The F_LAYOUT flag is used to
> indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
> the F_LAYOUT case opens for write do not break the lease.  But some operations,
> if they change the underlying layout, may.
> 
> The distinction between read layout leases and write layout leases is that
> write layout leases can change the layout without breaking the lease within the
> owning process.  This is useful to guarantee a layout prior to specifying the
> unbreakable flag described below.
> 
> 

The above sounds totally reasonable. You're essentially exposing the
behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
leases work the same way as "normal" leases, wrt signals and timeouts?

I do wonder if we're better off not trying to "or" in flags for this,
and instead have a separate set of commands (maybe F_RDLAYOUT,
F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
feel terribly strongly about it.

Also, at least in NFSv4, layouts are handed out for a particular byte
range in a file. Should we consider doing this with an API that allows
for that in the future? Is this something that would be desirable for
your RDMA+DAX use-cases?

We could add a new F_SETLEASE variant that takes a struct with a byte
range (something like struct flock).

> **Unbreakable Layout Leases (F_UNBREAK)**
> 
> In order to support pinning of file pages by direct user space users an
> unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
> lease.  When specified, F_UNBREAK indicates that any user attempting to break
> the lease will fail with ETXTBUSY rather than follow the normal breaking
> procedure.
> 
> Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
> specified.  The difference between an unbreakable read layout lease and an
> unbreakable write layout lease are that an unbreakable read layout lease is
> _not_ exclusive.  This means that once a layout is established on a file,
> multiple unbreakable read layout leases can be taken by multiple processes and
> used to pin the underlying pages of that file.
> 
> Care must therefore be taken to ensure that the layout of the file is as the
> user wants prior to using the unbreakable read layout lease.  A safe mechanism
> to do this would be to take a write layout lease and use fallocate() to set the
> layout of the file.  The layout lease can then be "downgraded" to unbreakable
> read layout as long as no other user broke the write layout lease.
> 

Will userland require any special privileges in order to set an
F_UNBREAK lease? This seems like something that could be used for DoS. I
assume that these will never time out.

How will we deal with the case where something is is squatting on an
F_UNBREAK lease and isn't letting it go?

Leases are technically "owned" by the file description -- we can't
necessarily trace it back to a single task in a threaded program. The
kernel task that set the lease may have exited by the time we go
looking.

Will we be content trying to determine this using /proc/locks+lsof, etc,
or will we need something better?

> </fcntl man page addition>

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-09-23 19:08 Lease semantic proposal Ira Weiny
  2019-09-23 20:17 ` Jeff Layton
@ 2019-09-23 22:26 ` Dave Chinner
  2019-09-25 23:46   ` Ira Weiny
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2019-09-23 22:26 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-fsdevel, linux-xfs, linux-ext4, linux-rdma, linux-kernel,
	linux-nvdimm, linux-mm, Jeff Layton, Jan Kara, Theodore Ts'o,
	John Hubbard, Dan Williams, Jason Gunthorpe

On Mon, Sep 23, 2019 at 12:08:53PM -0700, Ira Weiny wrote:
> 
> Since the last RFC patch set[1] much of the discussion of supporting RDMA with
> FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> thread it was suggested I try and write some documentation and/or tests for the
> new mechanism being proposed.  I have created a foundation to test lease
> functionality within xfstests.[3] This should be close to being accepted.
> Before writing additional lease tests, or changing lots of kernel code, this
> email presents documentation for the new proposed "layout lease" semantic.
> 
> At Linux Plumbers[4] just over a week ago, I presented the current state of the
> patch set and the outstanding issues.  Based on the discussion there, well as
> follow up emails, I propose the following addition to the fcntl() man page.
> 
> Thank you,
> Ira
> 
> [1] https://lkml.org/lkml/2019/8/9/1043
> [2] https://lkml.org/lkml/2019/8/9/1062
> [3] https://www.spinics.net/lists/fstests/msg12620.html
> [4] https://linuxplumbersconf.org/event/4/contributions/368/
> 
> 
> <fcntl man page addition>
> Layout Leases
> -------------
> 
> Layout (F_LAYOUT) leases are special leases which can be used to control and/or
> be informed about the manipulation of the underlying layout of a file.
> 
> A layout is defined as the logical file block -> physical file block mapping
> including the file size and sharing of physical blocks among files.  Note that
> the unwritten state of a block is not considered part of file layout.

Why even mention "unwritten" state if it's not considered something
that the layout lease treats differently?

i.e. Unwritten extents are a filesystem implementation detail that
is not exposed to userspace by anything other than FIEMAP. If they
have no impact on layout lease behaviour, then why raise it as
something the user needs to know about?

> **Read layout lease F_RDLCK | F_LAYOUT**
> 
> Read layout leases can be used to be informed of layout changes by the
> system or other users.  This lease is similar to the standard read (F_RDLCK)
> lease in that any attempt to change the _layout_ of the file will be reported to
> the process through the lease break process. 

Similar in what way? The standard F_RDLCK lease triggers on open or
truncate - a layout lease does nothing of the sort.

> But this lease is different
> because the file can be opened for write and data can be read and/or written to
> the file as long as the underlying layout of the file does not change.

So a F_RDLCK|F_LAYOUT can be taken on a O_WRONLY fd, unlike a
F_RDLCK which can only be taken on O_RDONLY fd.

I think these semantics are sufficiently different to F_RDLCK they
need to be explicitly documented, because I see problems here.

> Therefore, the lease is not broken if the file is simply open for write, but
> _may_ be broken if an operation such as, truncate(), fallocate() or write()
> results in changing the underlying layout.

As will mmap(), any number of XFS and ext4 ioctls, etc. 

So this really needs to say "_will_ be broken if *any* modification to
the file _might_ need to change the underlying physical layout".

Now, the big question: what happens to a process with a
F_RDLCK|F_LAYOUT lease held does a write that triggers a layout
change? What happens then?

Also, have you noticed that XFS will unconditionally break layouts on
write() because it /might/ need to change the layout? i.e. the
BREAK_WRITE case in xfs_file_aio_write_checks()? This is needed for
correctly supporting pNFS layout coherency against local IO. i.e.
local write() breaks layouts held by NFS server to get the
delegation recalled.

So by the above definition of F_RDLCK|F_LAYOUT behaviour, a holder
of such a lease doing a write() to that file would trigger a lease
break of their own lease as the filesystem has notified the lease
layer that there is a layout change about to happen. What's expected
to happen here?

Hence, AFIACT, the above definition of a F_RDLCK|F_LAYOUT lease
doesn't appear to be compatible with the semantics required by
existing users of layout leases.

> **Write layout lease (F_WRLCK | F_LAYOUT)**
> 
> Write Layout leases can be used to break read layout leases to indicate that
> the process intends to change the underlying layout lease of the file.

Any write() can change the layout of the file, and userspace cannot
tell in advance whether that will occur (neither can the
filesystem), so it seems to me that any application that needs to
write data is going to have to use F_WRLCK|F_LAYOUT.

> A process which has taken a write layout lease has exclusive ownership of the
> file layout and can modify that layout as long as the lease is held.

Which further implies single writer semantics and leases are
associated with a single open fd. Single writers are something we
are always trying to avoid in XFS.

> Operations which change the layout are allowed by that process.  But operations
> from other file descriptors which attempt to change the layout will break the
> lease through the standard lease break process.

If the F_WRLCK|F_LAYOUT lease is exclusive, who is actually able to
modify the layout?  Are you talking about processes that don't
actually hold leases modifying the layout? i.e. what are the
constraints on "exclusive access" here - is F_WRLCK|F_LAYOUT is
only exclusive when every modification is co-operating and taking
the appropriate layout lease for every access to the file that is
made?

If that's the case, what happens when someone fails to get a read
lock and decides "I can break write locks just by using ftruncate()
to the same size without a layout lease". Or fallocate() to
preallocate space that is already allocated. Or many other things I
can think of.

IOWs, this seems to me like a very fragile sort of construct that is
open to abuse and that will lead to everyone using F_UNBREAK, which
is highly unfriendly to everyone else...

> The F_LAYOUT flag is used to
> indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
> the F_LAYOUT case opens for write do not break the lease.  But some operations,
> if they change the underlying layout, may.
> 
> The distinction between read layout leases and write layout leases is that
> write layout leases can change the layout without breaking the lease within the
> owning process.  This is useful to guarantee a layout prior to specifying the
> unbreakable flag described below.

Ok, so now you really are saying that F_RDLCK leases can only be
used on O_RDONLY file descriptors because any modification under a
F_RDLCK|LAYOUT will trigger a layout break.

> **Unbreakable Layout Leases (F_UNBREAK)**
> 
> In order to support pinning of file pages by direct user space users an
> unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
> lease.  When specified, F_UNBREAK indicates that any user attempting to break
> the lease will fail with ETXTBUSY rather than follow the normal breaking
> procedure.
> 
> Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
> specified.  The difference between an unbreakable read layout lease and an
> unbreakable write layout lease are that an unbreakable read layout lease is
> _not_ exclusive. 

Oh, this doesn't work at all. Now we get write()s to F_RDLCK leases
that can't break the leases and so all writes, even to processes
that own RDLCK|UNBREAK, will fail with ETXTBSY.

> This means that once a layout is established on a file,
> multiple unbreakable read layout leases can be taken by multiple processes and
> used to pin the underlying pages of that file.

Ok, so what happens when someone now takes a
F_WRLOCK|F_LAYOUT|F_UNBREAK? Is that supposed to break
F_RDLCK|F_LAYOUT|F_UNBREAK, as the wording about F_WRLCK behaviour
implies it should?

> Care must therefore be taken to ensure that the layout of the file is as the
> user wants prior to using the unbreakable read layout lease.  A safe mechanism
> to do this would be to take a write layout lease and use fallocate() to set the
> layout of the file.  The layout lease can then be "downgraded" to unbreakable
> read layout as long as no other user broke the write layout lease.

What are the semantics of this "downgrade" behaviour you speak of? :)

My thoughts are:
	- RDLCK can only be used for O_RDONLY because write()
	  requires breaking of leases
	- WRLCK is open to abuse simply by not using a layout lease
	  to do a "no change" layout modification
	- RDLCK|F_UNBREAK is entirely unusable
	- WRLCK|F_UNBREAK will be what every application uses
	  because everything else either doesn't work or is too easy
	  to abuse.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-09-23 22:26 ` Dave Chinner
@ 2019-09-25 23:46   ` Ira Weiny
  2019-09-26 11:29     ` Jeff Layton
  2019-09-30  8:42     ` Dave Chinner
  0 siblings, 2 replies; 18+ messages in thread
From: Ira Weiny @ 2019-09-25 23:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-ext4, linux-rdma, linux-kernel,
	linux-nvdimm, linux-mm, Jeff Layton, Jan Kara, Theodore Ts'o,
	John Hubbard, Dan Williams, Jason Gunthorpe

On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> On Mon, Sep 23, 2019 at 12:08:53PM -0700, Ira Weiny wrote:
> > 
> > Since the last RFC patch set[1] much of the discussion of supporting RDMA with
> > FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> > thread it was suggested I try and write some documentation and/or tests for the
> > new mechanism being proposed.  I have created a foundation to test lease
> > functionality within xfstests.[3] This should be close to being accepted.
> > Before writing additional lease tests, or changing lots of kernel code, this
> > email presents documentation for the new proposed "layout lease" semantic.
> > 
> > At Linux Plumbers[4] just over a week ago, I presented the current state of the
> > patch set and the outstanding issues.  Based on the discussion there, well as
> > follow up emails, I propose the following addition to the fcntl() man page.
> > 
> > Thank you,
> > Ira
> > 
> > [1] https://lkml.org/lkml/2019/8/9/1043
> > [2] https://lkml.org/lkml/2019/8/9/1062
> > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > 
> > 
> > <fcntl man page addition>
> > Layout Leases
> > -------------
> > 
> > Layout (F_LAYOUT) leases are special leases which can be used to control and/or
> > be informed about the manipulation of the underlying layout of a file.
> > 
> > A layout is defined as the logical file block -> physical file block mapping
> > including the file size and sharing of physical blocks among files.  Note that
> > the unwritten state of a block is not considered part of file layout.
> 
> Why even mention "unwritten" state if it's not considered something
> that the layout lease treats differently?
> 
> i.e. Unwritten extents are a filesystem implementation detail that
> is not exposed to userspace by anything other than FIEMAP. If they
> have no impact on layout lease behaviour, then why raise it as
> something the user needs to know about?

This paragraph was intended to define a layout.  So I guess one could say our
internal discussion on what defines a "layout" has leaked into the external
documentation.  Do you think we should just remove the second sentence or the
whole paragraph?

> 
> > **Read layout lease F_RDLCK | F_LAYOUT**
> > 
> > Read layout leases can be used to be informed of layout changes by the
> > system or other users.  This lease is similar to the standard read (F_RDLCK)
> > lease in that any attempt to change the _layout_ of the file will be reported to
> > the process through the lease break process. 
> 
> Similar in what way? The standard F_RDLCK lease triggers on open or
> truncate - a layout lease does nothing of the sort.

Similar in that attempts to "write" the layout will result in breaking the
lease just like attempts to write the file would break the standard F_RDLCK
lease.  I'm not stuck on the verbiage though; similar may be the wrong word.

> 
> > But this lease is different
> > because the file can be opened for write and data can be read and/or written to
> > the file as long as the underlying layout of the file does not change.
> 
> So a F_RDLCK|F_LAYOUT can be taken on a O_WRONLY fd, unlike a
> F_RDLCK which can only be taken on O_RDONLY fd.

That was the idea, yes.

> 
> I think these semantics are sufficiently different to F_RDLCK they
> need to be explicitly documented, because I see problems here.

I agree, and I intended this document to indicate how they are different.

Anther option may be to define F_RDLAYOUT and not have F_LAYOUT such that it is
clear that this lease is not associated with F_RDLCK at all.  It is different.

> 
> > Therefore, the lease is not broken if the file is simply open for write, but
> > _may_ be broken if an operation such as, truncate(), fallocate() or write()
> > results in changing the underlying layout.
> 
> As will mmap(), any number of XFS and ext4 ioctls, etc. 
> 
> So this really needs to say "_will_ be broken if *any* modification to
> the file _might_ need to change the underlying physical layout".

Agreed.  I used the word "may" because a simple write() does not necessarily
change the layout of the file.  But I like your verbiage better.  I did wonder
if listing operations was a bad idea.  So I'm ok simply leaving that detail
out.

> 
> Now, the big question: what happens to a process with a
> F_RDLCK|F_LAYOUT lease held does a write that triggers a layout
> change? What happens then?

Because F_UNBREAK is not specified, the write() operation is held for lease
break time and then the lease is broken if not voluntarily released.  This
would be the same pattern as a process holding a F_RDLCK and opening the file
O_RDWR.

> 
> Also, have you noticed that XFS will unconditionally break layouts on
> write() because it /might/ need to change the layout? i.e. the
> BREAK_WRITE case in xfs_file_aio_write_checks()? This is needed for
> correctly supporting pNFS layout coherency against local IO. i.e.
> local write() breaks layouts held by NFS server to get the
> delegation recalled.
> 
> So by the above definition of F_RDLCK|F_LAYOUT behaviour, a holder
> of such a lease doing a write() to that file would trigger a lease
> break of their own lease as the filesystem has notified the lease
> layer that there is a layout change about to happen. What's expected
> to happen here?

That is not ideal but the proposed semantics say a write() may fail in this
situation.  So depending on the implementation requirements of the underlying
FS the semantics still hold for our current use case.  It would be nice to be
able to enhance the implementation in the future such that a write() could work
but maybe they can't.  For RDMA the application is probably going to have the
region mmap'ed anyway and will not need, nor in fact want to use a write()
call.

Also, I think I missed a need to specify that a F_RDLCK|F_LAYOUT needs to have
write permission on (or be the owner of) the file for the user to be able to
specify F_UNBREAK on it.

> 
> Hence, AFIACT, the above definition of a F_RDLCK|F_LAYOUT lease
> doesn't appear to be compatible with the semantics required by
> existing users of layout leases.

I disagree.  Other than the addition of F_UNBREAK, I think this is consistent
with what is currently implemented.  Also, by exporting all this to user space
we can now write tests for it independent of the RDMA pinning.

> 
> > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > 
> > Write Layout leases can be used to break read layout leases to indicate that
> > the process intends to change the underlying layout lease of the file.
> 
> Any write() can change the layout of the file, and userspace cannot
> tell in advance whether that will occur (neither can the
> filesystem), so it seems to me that any application that needs to
> write data is going to have to use F_WRLCK|F_LAYOUT.

Sure, but the use case of F_WRLCK|F_LAYOUT is that the user is creating the
layout.  So using write() to create the file would be ok.

On the surface it seems like using a standard F_WRLCK lease could be used
instead of F_WRLCK|F_LAYOUT.  But it actually can't because that does not
protect against the internals of the file system changing the lease.  This is
where the semantics are exactly exported to user space.

> 
> > A process which has taken a write layout lease has exclusive ownership of the
> > file layout and can modify that layout as long as the lease is held.
> 
> Which further implies single writer semantics and leases are
> associated with a single open fd. Single writers are something we
> are always trying to avoid in XFS.

The discussion at LPC revealed that we need a way for the user to ensure the
file layout is realized prior to any unbreakable lease being taken.  So yes, for
some period we will need a single writer.

> 
> > Operations which change the layout are allowed by that process.  But operations
> > from other file descriptors which attempt to change the layout will break the
> > lease through the standard lease break process.
> 
> If the F_WRLCK|F_LAYOUT lease is exclusive, who is actually able to
> modify the layout?  Are you talking about processes that don't
> actually hold leases modifying the layout?

That was the idea, yes.

> i.e. what are the
> constraints on "exclusive access" here - is F_WRLCK|F_LAYOUT is
> only exclusive when every modification is co-operating and taking
> the appropriate layout lease for every access to the file that is
> made?

I'm not following but IIUC...  no...  The idea is that if you hold the
F_WRLCK|F_LAYOUT lease then any attempt by _other_ processes to change the
layout (intentional or otherwise) would result in you getting a SIGIO signal
which indicates someone _else_ changed the file.

Then you can atomically downgrade the lock to F_RDLCK|F_LAYOUT|F_UNBREAK or
atomically upgrade to F_WRLCK|F_LAYOUT|F_UNBREAK.  Either way you know you have
the layout you want and can rely on the pin working.

> 
> If that's the case, what happens when someone fails to get a read
> lock and decides "I can break write locks just by using ftruncate()
> to the same size without a layout lease". Or fallocate() to
> preallocate space that is already allocated. Or many other things I
> can think of.

The intended use case for F_WRLCK|F_LAYOUT is that a single process is
attempting to set the layout prior to setting F_UNBREAK.  While
F_WRLCK|F_LAYOUT can be broken, breaking the lease will not happen without that
process knowing about it.

I don't see this being different from the current lease semantics which
requires some external coordination amongst process/file users to resolve any
races or coordination.

> 
> IOWs, this seems to me like a very fragile sort of construct that is
> open to abuse and that will lead to everyone using F_UNBREAK, which
> is highly unfriendly to everyone else...

FWIW, my use case does require F_UNBREAK.  All of the semantics presented have
a real use case except for F_RDLCK|F_LAYOUT.  However, I think F_RDLCK|F_LAYOUT
does have a use case in testing.

Also, I do think that we need to have some check on file ownership for
F_UNBREAK.  That needs to be added.

> 
> > The F_LAYOUT flag is used to
> > indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
> > the F_LAYOUT case opens for write do not break the lease.  But some operations,
> > if they change the underlying layout, may.
> > 
> > The distinction between read layout leases and write layout leases is that
> > write layout leases can change the layout without breaking the lease within the
> > owning process.  This is useful to guarantee a layout prior to specifying the
> > unbreakable flag described below.
> 
> Ok, so now you really are saying that F_RDLCK leases can only be
> used on O_RDONLY file descriptors because any modification under a
> F_RDLCK|LAYOUT will trigger a layout break.

I don't necessarily agree.  We also have the mmap() case.  What I was really
trying to do is define a relaxed lease semantic which allows some shared
reading/writing of the file as long as the underlying layout does not change.
I am _not_ a file system expert but it seems like that should be possible.

Perhaps we need something more fine grained between BREAK_UNMAP and
BREAK_WRITE?

> 
> > **Unbreakable Layout Leases (F_UNBREAK)**
> > 
> > In order to support pinning of file pages by direct user space users an
> > unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
> > lease.  When specified, F_UNBREAK indicates that any user attempting to break
> > the lease will fail with ETXTBUSY rather than follow the normal breaking
> > procedure.
> > 
> > Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
> > specified.  The difference between an unbreakable read layout lease and an
> > unbreakable write layout lease are that an unbreakable read layout lease is
> > _not_ exclusive. 
> 
> Oh, this doesn't work at all. Now we get write()s to F_RDLCK leases
> that can't break the leases and so all writes, even to processes
> that own RDLCK|UNBREAK, will fail with ETXTBSY.

Yes I agree writes()'s to F_RDLCK|F_LAYOUT|F_UNBREAK _may_ fail.  I don't see
how this is broken if the file owner is opting into it.  RDMA's can still occur
to that file.  mmap'ed areas of the file can still be used (especially in the
no-page cache case of FS DAX).

> 
> > This means that once a layout is established on a file,
> > multiple unbreakable read layout leases can be taken by multiple processes and
> > used to pin the underlying pages of that file.
> 
> Ok, so what happens when someone now takes a
> F_WRLOCK|F_LAYOUT|F_UNBREAK? Is that supposed to break
> F_RDLCK|F_LAYOUT|F_UNBREAK, as the wording about F_WRLCK behaviour
> implies it should?

Ah no.  F_RDLCK|F_LAYOUT|F_UNBREAK could not be broken.  I'll have to update
the text for this.  The idea here is that no one can be changing the layout but
multiple readers could be using that layout.  So I'll update the text that a
F_WRLCK|F_LAYOUT|F_UNBREAK would fail in this case.

> 
> > Care must therefore be taken to ensure that the layout of the file is as the
> > user wants prior to using the unbreakable read layout lease.  A safe mechanism
> > to do this would be to take a write layout lease and use fallocate() to set the
> > layout of the file.  The layout lease can then be "downgraded" to unbreakable
> > read layout as long as no other user broke the write layout lease.
> 
> What are the semantics of this "downgrade" behaviour you speak of? :)

As I said above it may be a downgrade or an upgrade but the idea is to
atomically convert the lease to F_UNBREAK.

> 
> My thoughts are:
> 	- RDLCK can only be used for O_RDONLY because write()
> 	  requires breaking of leases

Does the file system require write() break the layout lease?  Or is that just
the way the file system is currently implemented?  What about mmap()?  I need
to have the file open WR to mmap() the file for RDMA.

To be clear I'm intending F_RDLCK|F_LAYOUT to be something new.  As I said
above we could use something like F_RDLAYOUT instead?

> 	- WRLCK is open to abuse simply by not using a layout lease
> 	  to do a "no change" layout modification

I'm sorry, I don't understand this comment.

> 	- RDLCK|F_UNBREAK is entirely unusable

Well even if write() fails with ETXTBSY this should give us the ability to do
RDMA and XDP to these areas from multiple processes.  Furthermore, for FS DAX
which bypasses the page cache mmap'ed areas can be written without write() with
CPU stores.  Which is how many RDMA applications are likely to write this data.

> 	- WRLCK|F_UNBREAK will be what every application uses
> 	  because everything else either doesn't work or is too easy
> 	  to abuse.

Maybe.  IMO that is still a step in the right direction as at least 1 process
can use this now.  And these semantics allow for a shared unbreakable lease
(F_RDLCK|F_LAYOUT|F_UNBREAK) which can be used with some configurations (FS DAX
in particular).

Also, I do think we will need something to ensure file ownership for F_UNBREAK.

It sounds like the difficulty here is in potential implementation of allowing
write() to not break layouts.  And dealing with writes to mmap'ed page cache
pages.  The file system is free to do better later.

Thanks for the review,
Ira


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-09-25 23:46   ` Ira Weiny
@ 2019-09-26 11:29     ` Jeff Layton
  2019-09-30  8:42     ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Jeff Layton @ 2019-09-26 11:29 UTC (permalink / raw)
  To: Ira Weiny, Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-ext4, linux-rdma, linux-kernel,
	linux-nvdimm, linux-mm, Jan Kara, Theodore Ts'o,
	John Hubbard, Dan Williams, Jason Gunthorpe

On Wed, 2019-09-25 at 16:46 -0700, Ira Weiny wrote:
> On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > On Mon, Sep 23, 2019 at 12:08:53PM -0700, Ira Weiny wrote:
> > > Since the last RFC patch set[1] much of the discussion of supporting RDMA with
> > > FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> > > thread it was suggested I try and write some documentation and/or tests for the
> > > new mechanism being proposed.  I have created a foundation to test lease
> > > functionality within xfstests.[3] This should be close to being accepted.
> > > Before writing additional lease tests, or changing lots of kernel code, this
> > > email presents documentation for the new proposed "layout lease" semantic.
> > > 
> > > At Linux Plumbers[4] just over a week ago, I presented the current state of the
> > > patch set and the outstanding issues.  Based on the discussion there, well as
> > > follow up emails, I propose the following addition to the fcntl() man page.
> > > 
> > > Thank you,
> > > Ira
> > > 
> > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > 
> > > 
> > > <fcntl man page addition>
> > > Layout Leases
> > > -------------
> > > 
> > > Layout (F_LAYOUT) leases are special leases which can be used to control and/or
> > > be informed about the manipulation of the underlying layout of a file.
> > > 
> > > A layout is defined as the logical file block -> physical file block mapping
> > > including the file size and sharing of physical blocks among files.  Note that
> > > the unwritten state of a block is not considered part of file layout.
> > 
> > Why even mention "unwritten" state if it's not considered something
> > that the layout lease treats differently?
> > 
> > i.e. Unwritten extents are a filesystem implementation detail that
> > is not exposed to userspace by anything other than FIEMAP. If they
> > have no impact on layout lease behaviour, then why raise it as
> > something the user needs to know about?
> 
> This paragraph was intended to define a layout.  So I guess one could say our
> internal discussion on what defines a "layout" has leaked into the external
> documentation.  Do you think we should just remove the second sentence or the
> whole paragraph?
> 
> > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > 
> > > Read layout leases can be used to be informed of layout changes by the
> > > system or other users.  This lease is similar to the standard read (F_RDLCK)
> > > lease in that any attempt to change the _layout_ of the file will be reported to
> > > the process through the lease break process. 
> > 
> > Similar in what way? The standard F_RDLCK lease triggers on open or
> > truncate - a layout lease does nothing of the sort.
> 
> Similar in that attempts to "write" the layout will result in breaking the
> lease just like attempts to write the file would break the standard F_RDLCK
> lease.  I'm not stuck on the verbiage though; similar may be the wrong word.
> 
> > > But this lease is different
> > > because the file can be opened for write and data can be read and/or written to
> > > the file as long as the underlying layout of the file does not change.
> > 
> > So a F_RDLCK|F_LAYOUT can be taken on a O_WRONLY fd, unlike a
> > F_RDLCK which can only be taken on O_RDONLY fd.
> 
> That was the idea, yes.
> 
> > I think these semantics are sufficiently different to F_RDLCK they
> > need to be explicitly documented, because I see problems here.
> 
> I agree, and I intended this document to indicate how they are different.
> 
> Anther option may be to define F_RDLAYOUT and not have F_LAYOUT such that it is
> clear that this lease is not associated with F_RDLCK at all.  It is different.
> 
> > > Therefore, the lease is not broken if the file is simply open for write, but
> > > _may_ be broken if an operation such as, truncate(), fallocate() or write()
> > > results in changing the underlying layout.
> > 
> > As will mmap(), any number of XFS and ext4 ioctls, etc. 
> > 
> > So this really needs to say "_will_ be broken if *any* modification to
> > the file _might_ need to change the underlying physical layout".
> 
> Agreed.  I used the word "may" because a simple write() does not necessarily
> change the layout of the file.  But I like your verbiage better.  I did wonder
> if listing operations was a bad idea.  So I'm ok simply leaving that detail
> out.
> 
> > Now, the big question: what happens to a process with a
> > F_RDLCK|F_LAYOUT lease held does a write that triggers a layout
> > change? What happens then?
> 
> Because F_UNBREAK is not specified, the write() operation is held for lease
> break time and then the lease is broken if not voluntarily released.  This
> would be the same pattern as a process holding a F_RDLCK and opening the file
> O_RDWR.
> 
> > Also, have you noticed that XFS will unconditionally break layouts on
> > write() because it /might/ need to change the layout? i.e. the
> > BREAK_WRITE case in xfs_file_aio_write_checks()? This is needed for
> > correctly supporting pNFS layout coherency against local IO. i.e.
> > local write() breaks layouts held by NFS server to get the
> > delegation recalled.
> > 
> > So by the above definition of F_RDLCK|F_LAYOUT behaviour, a holder
> > of such a lease doing a write() to that file would trigger a lease
> > break of their own lease as the filesystem has notified the lease
> > layer that there is a layout change about to happen. What's expected
> > to happen here?
> 
> That is not ideal but the proposed semantics say a write() may fail in this
> situation.  So depending on the implementation requirements of the underlying
> FS the semantics still hold for our current use case.  It would be nice to be
> able to enhance the implementation in the future such that a write() could work
> but maybe they can't.  For RDMA the application is probably going to have the
> region mmap'ed anyway and will not need, nor in fact want to use a write()
> call.
> 
> Also, I think I missed a need to specify that a F_RDLCK|F_LAYOUT needs to have
> write permission on (or be the owner of) the file for the user to be able to
> specify F_UNBREAK on it.
> 
> > Hence, AFIACT, the above definition of a F_RDLCK|F_LAYOUT lease
> > doesn't appear to be compatible with the semantics required by
> > existing users of layout leases.
> 
> I disagree.  Other than the addition of F_UNBREAK, I think this is consistent
> with what is currently implemented.  Also, by exporting all this to user space
> we can now write tests for it independent of the RDMA pinning.
> 
> > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > 
> > > Write Layout leases can be used to break read layout leases to indicate that
> > > the process intends to change the underlying layout lease of the file.
> > 
> > Any write() can change the layout of the file, and userspace cannot
> > tell in advance whether that will occur (neither can the
> > filesystem), so it seems to me that any application that needs to
> > write data is going to have to use F_WRLCK|F_LAYOUT.
> 
> Sure, but the use case of F_WRLCK|F_LAYOUT is that the user is creating the
> layout.  So using write() to create the file would be ok.
> 
> On the surface it seems like using a standard F_WRLCK lease could be used
> instead of F_WRLCK|F_LAYOUT.  But it actually can't because that does not
> protect against the internals of the file system changing the lease.  This is
> where the semantics are exactly exported to user space.
> 
> > > A process which has taken a write layout lease has exclusive ownership of the
> > > file layout and can modify that layout as long as the lease is held.
> > 
> > Which further implies single writer semantics and leases are
> > associated with a single open fd. Single writers are something we
> > are always trying to avoid in XFS.
> 
> The discussion at LPC revealed that we need a way for the user to ensure the
> file layout is realized prior to any unbreakable lease being taken.  So yes, for
> some period we will need a single writer.
> 
> > > Operations which change the layout are allowed by that process.  But operations
> > > from other file descriptors which attempt to change the layout will break the
> > > lease through the standard lease break process.
> > 
> > If the F_WRLCK|F_LAYOUT lease is exclusive, who is actually able to
> > modify the layout?  Are you talking about processes that don't
> > actually hold leases modifying the layout?
> 
> That was the idea, yes.
> 
> > i.e. what are the
> > constraints on "exclusive access" here - is F_WRLCK|F_LAYOUT is
> > only exclusive when every modification is co-operating and taking
> > the appropriate layout lease for every access to the file that is
> > made?
> 
> I'm not following but IIUC...  no...  The idea is that if you hold the
> F_WRLCK|F_LAYOUT lease then any attempt by _other_ processes to change the
> layout (intentional or otherwise) would result in you getting a SIGIO signal
> which indicates someone _else_ changed the file.
> 
> Then you can atomically downgrade the lock to F_RDLCK|F_LAYOUT|F_UNBREAK or
> atomically upgrade to F_WRLCK|F_LAYOUT|F_UNBREAK.  Either way you know you have
> the layout you want and can rely on the pin working.
> 
> > If that's the case, what happens when someone fails to get a read
> > lock and decides "I can break write locks just by using ftruncate()
> > to the same size without a layout lease". Or fallocate() to
> > preallocate space that is already allocated. Or many other things I
> > can think of.
> 
> The intended use case for F_WRLCK|F_LAYOUT is that a single process is
> attempting to set the layout prior to setting F_UNBREAK.  While
> F_WRLCK|F_LAYOUT can be broken, breaking the lease will not happen without that
> process knowing about it.
> 
> I don't see this being different from the current lease semantics which
> requires some external coordination amongst process/file users to resolve any
> races or coordination.
> 
> > IOWs, this seems to me like a very fragile sort of construct that is
> > open to abuse and that will lead to everyone using F_UNBREAK, which
> > is highly unfriendly to everyone else...
> 
> FWIW, my use case does require F_UNBREAK.  All of the semantics presented have
> a real use case except for F_RDLCK|F_LAYOUT.  However, I think F_RDLCK|F_LAYOUT
> does have a use case in testing.
> 
> Also, I do think that we need to have some check on file ownership for
> F_UNBREAK.  That needs to be added.
> 
> > > The F_LAYOUT flag is used to
> > > indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
> > > the F_LAYOUT case opens for write do not break the lease.  But some operations,
> > > if they change the underlying layout, may.
> > > 
> > > The distinction between read layout leases and write layout leases is that
> > > write layout leases can change the layout without breaking the lease within the
> > > owning process.  This is useful to guarantee a layout prior to specifying the
> > > unbreakable flag described below.
> > 
> > Ok, so now you really are saying that F_RDLCK leases can only be
> > used on O_RDONLY file descriptors because any modification under a
> > F_RDLCK|LAYOUT will trigger a layout break.
> 
> I don't necessarily agree.  We also have the mmap() case.  What I was really
> trying to do is define a relaxed lease semantic which allows some shared
> reading/writing of the file as long as the underlying layout does not change.
> I am _not_ a file system expert but it seems like that should be possible.
> 
> Perhaps we need something more fine grained between BREAK_UNMAP and
> BREAK_WRITE?
> 
> > > **Unbreakable Layout Leases (F_UNBREAK)**
> > > 
> > > In order to support pinning of file pages by direct user space users an
> > > unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
> > > lease.  When specified, F_UNBREAK indicates that any user attempting to break
> > > the lease will fail with ETXTBUSY rather than follow the normal breaking
> > > procedure.
> > > 
> > > Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
> > > specified.  The difference between an unbreakable read layout lease and an
> > > unbreakable write layout lease are that an unbreakable read layout lease is
> > > _not_ exclusive. 
> > 
> > Oh, this doesn't work at all. Now we get write()s to F_RDLCK leases
> > that can't break the leases and so all writes, even to processes
> > that own RDLCK|UNBREAK, will fail with ETXTBSY.
> 
> Yes I agree writes()'s to F_RDLCK|F_LAYOUT|F_UNBREAK _may_ fail.  I don't see
> how this is broken if the file owner is opting into it.  RDMA's can still occur
> to that file.  mmap'ed areas of the file can still be used (especially in the
> no-page cache case of FS DAX).
> 
> > > This means that once a layout is established on a file,
> > > multiple unbreakable read layout leases can be taken by multiple processes and
> > > used to pin the underlying pages of that file.
> > 
> > Ok, so what happens when someone now takes a
> > F_WRLOCK|F_LAYOUT|F_UNBREAK? Is that supposed to break
> > F_RDLCK|F_LAYOUT|F_UNBREAK, as the wording about F_WRLCK behaviour
> > implies it should?
> 
> Ah no.  F_RDLCK|F_LAYOUT|F_UNBREAK could not be broken.  I'll have to update
> the text for this.  The idea here is that no one can be changing the layout but
> multiple readers could be using that layout.  So I'll update the text that a
> F_WRLCK|F_LAYOUT|F_UNBREAK would fail in this case.
> 
> > > Care must therefore be taken to ensure that the layout of the file is as the
> > > user wants prior to using the unbreakable read layout lease.  A safe mechanism
> > > to do this would be to take a write layout lease and use fallocate() to set the
> > > layout of the file.  The layout lease can then be "downgraded" to unbreakable
> > > read layout as long as no other user broke the write layout lease.
> > 
> > What are the semantics of this "downgrade" behaviour you speak of? :)
> 
> As I said above it may be a downgrade or an upgrade but the idea is to
> atomically convert the lease to F_UNBREAK.
> 
> > My thoughts are:
> > 	- RDLCK can only be used for O_RDONLY because write()
> > 	  requires breaking of leases
> 
> Does the file system require write() break the layout lease?  Or is that just
> the way the file system is currently implemented?  What about mmap()?  I need
> to have the file open WR to mmap() the file for RDMA.
> 
> To be clear I'm intending F_RDLCK|F_LAYOUT to be something new.  As I said
> above we could use something like F_RDLAYOUT instead?
> 
> > 	- WRLCK is open to abuse simply by not using a layout lease
> > 	  to do a "no change" layout modification
> 
> I'm sorry, I don't understand this comment.
> 
> > 	- RDLCK|F_UNBREAK is entirely unusable
> 
> Well even if write() fails with ETXTBSY this should give us the ability to do
> RDMA and XDP to these areas from multiple processes.  Furthermore, for FS DAX
> which bypasses the page cache mmap'ed areas can be written without write() with
> CPU stores.  Which is how many RDMA applications are likely to write this data.
> 
> > 	- WRLCK|F_UNBREAK will be what every application uses
> > 	  because everything else either doesn't work or is too easy
> > 	  to abuse.
> 
> Maybe.  IMO that is still a step in the right direction as at least 1 process
> can use this now.  And these semantics allow for a shared unbreakable lease
> (F_RDLCK|F_LAYOUT|F_UNBREAK) which can be used with some configurations (FS DAX
> in particular).
> 
> Also, I do think we will need something to ensure file ownership for F_UNBREAK.
> 
> It sounds like the difficulty here is in potential implementation of allowing
> write() to not break layouts.  And dealing with writes to mmap'ed page cache
> pages.  The file system is free to do better later.
> 

Whatever the semantics, from an API standpoint, I think we should not
try to multiplex layout behavior on top of F_SETLEASE. Layouts should
get new fcntl cmd values (maybe F_SETLAYOUT/F_GETLAYOUT) and then you
wouldn't need to expose the F_LAYOUT flag to userland.

The behavior of layouts is different enough from "traditional" leases
that trying to mash them together like this is only going to cause
confusion.

Note that I'm mainly concerned with the user-facing API with this. The
kernel internals can still use F_LAYOUT flag, and you can do the
F_SETLAYOUT -> F_LAYOUT translation at a high level.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-09-25 23:46   ` Ira Weiny
  2019-09-26 11:29     ` Jeff Layton
@ 2019-09-30  8:42     ` Dave Chinner
  2019-10-01 21:01       ` Ira Weiny
  2019-10-04  7:51       ` Jan Kara
  1 sibling, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2019-09-30  8:42 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-fsdevel, linux-xfs, linux-ext4, linux-rdma, linux-kernel,
	linux-nvdimm, linux-mm, Jeff Layton, Jan Kara, Theodore Ts'o,
	John Hubbard, Dan Williams, Jason Gunthorpe

On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > Hence, AFIACT, the above definition of a F_RDLCK|F_LAYOUT lease
> > doesn't appear to be compatible with the semantics required by
> > existing users of layout leases.
> 
> I disagree.  Other than the addition of F_UNBREAK, I think this is consistent
> with what is currently implemented.  Also, by exporting all this to user space
> we can now write tests for it independent of the RDMA pinning.

The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
layout changes to occur to the file while the layout lease is held.
IOWs, your definition of F_RDLCK | F_LAYOUT not being allowed
to change the is in direct contradition to existing users.

I've said this several times over the past few months now: shared
layout leases must allow layout modifications to be made. Only
allowing an exclusive layout lease to modify the layout rules out
many potential use cases for direct data placement and p2p DMA
applications, not to mention conflicts with the existing pNFS usage.
Layout leases need to support more than just RDMA, and tailoring the
API to exactly the immediate needs of RDMA is just going to make it
useless for anything else.

I'm getting frustrated now because we still seem to be going around
in circles and getting nowhere.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-09-23 20:17 ` Jeff Layton
@ 2019-10-01 18:17   ` Ira Weiny
  2019-10-02 12:28     ` Jeff Layton
  2019-10-03  9:01     ` Jan Kara
  0 siblings, 2 replies; 18+ messages in thread
From: Ira Weiny @ 2019-10-01 18:17 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, linux-xfs, linux-ext4, linux-rdma, linux-kernel,
	linux-nvdimm, linux-mm, Dave Chinner, Jan Kara,
	Theodore Ts'o, John Hubbard, Dan Williams, Jason Gunthorpe

On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > Since the last RFC patch set[1] much of the discussion of supporting RDMA with
> > FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> > thread it was suggested I try and write some documentation and/or tests for the
> > new mechanism being proposed.  I have created a foundation to test lease
> > functionality within xfstests.[3] This should be close to being accepted.
> > Before writing additional lease tests, or changing lots of kernel code, this
> > email presents documentation for the new proposed "layout lease" semantic.
> > 
> > At Linux Plumbers[4] just over a week ago, I presented the current state of the
> > patch set and the outstanding issues.  Based on the discussion there, well as
> > follow up emails, I propose the following addition to the fcntl() man page.
> > 
> > Thank you,
> > Ira
> > 
> > [1] https://lkml.org/lkml/2019/8/9/1043
> > [2] https://lkml.org/lkml/2019/8/9/1062
> > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > 
> > 
> 
> Thank you so much for doing this, Ira. This allows us to debate the
> user-visible behavior semantics without getting bogged down in the
> implementation details. More comments below:

Thanks.  Sorry for the delay in response.  Turns out this email was in my
spam...  :-/  I'll need to work out why.

> 
> > <fcntl man page addition>
> > Layout Leases
> > -------------
> > 
> > Layout (F_LAYOUT) leases are special leases which can be used to control and/or
> > be informed about the manipulation of the underlying layout of a file.
> > 
> > A layout is defined as the logical file block -> physical file block mapping
> > including the file size and sharing of physical blocks among files.  Note that
> > the unwritten state of a block is not considered part of file layout.
> > 
> > **Read layout lease F_RDLCK | F_LAYOUT**
> > 
> > Read layout leases can be used to be informed of layout changes by the
> > system or other users.  This lease is similar to the standard read (F_RDLCK)
> > lease in that any attempt to change the _layout_ of the file will be reported to
> > the process through the lease break process.  But this lease is different
> > because the file can be opened for write and data can be read and/or written to
> > the file as long as the underlying layout of the file does not change.
> > Therefore, the lease is not broken if the file is simply open for write, but
> > _may_ be broken if an operation such as, truncate(), fallocate() or write()
> > results in changing the underlying layout.
> > 
> > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > 
> > Write Layout leases can be used to break read layout leases to indicate that
> > the process intends to change the underlying layout lease of the file.
> > 
> > A process which has taken a write layout lease has exclusive ownership of the
> > file layout and can modify that layout as long as the lease is held.
> > Operations which change the layout are allowed by that process.  But operations
> > from other file descriptors which attempt to change the layout will break the
> > lease through the standard lease break process.  The F_LAYOUT flag is used to
> > indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
> > the F_LAYOUT case opens for write do not break the lease.  But some operations,
> > if they change the underlying layout, may.
> > 
> > The distinction between read layout leases and write layout leases is that
> > write layout leases can change the layout without breaking the lease within the
> > owning process.  This is useful to guarantee a layout prior to specifying the
> > unbreakable flag described below.
> > 
> > 
> 
> The above sounds totally reasonable. You're essentially exposing the
> behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
> leases work the same way as "normal" leases, wrt signals and timeouts?

That was my intention, yes.

> 
> I do wonder if we're better off not trying to "or" in flags for this,
> and instead have a separate set of commands (maybe F_RDLAYOUT,
> F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
> feel terribly strongly about it.

I'm leaning that was as well.  To make these even more distinct from
F_SETLEASE.

> 
> Also, at least in NFSv4, layouts are handed out for a particular byte
> range in a file. Should we consider doing this with an API that allows
> for that in the future? Is this something that would be desirable for
> your RDMA+DAX use-cases?

I don't see this.  I've thought it would be a nice thing to have but I don't
know of any hard use case.  But first I'd like to understand how this works for
NFS.

> 
> We could add a new F_SETLEASE variant that takes a struct with a byte
> range (something like struct flock).

I think this would be another reason to introduce F_[RD|WR|UN]LAYOUT as a
command.  Perhaps supporting smaller byte ranges could be added later?

> 
> > **Unbreakable Layout Leases (F_UNBREAK)**
> > 
> > In order to support pinning of file pages by direct user space users an
> > unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
> > lease.  When specified, F_UNBREAK indicates that any user attempting to break
> > the lease will fail with ETXTBUSY rather than follow the normal breaking
> > procedure.
> > 
> > Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
> > specified.  The difference between an unbreakable read layout lease and an
> > unbreakable write layout lease are that an unbreakable read layout lease is
> > _not_ exclusive.  This means that once a layout is established on a file,
> > multiple unbreakable read layout leases can be taken by multiple processes and
> > used to pin the underlying pages of that file.
> > 
> > Care must therefore be taken to ensure that the layout of the file is as the
> > user wants prior to using the unbreakable read layout lease.  A safe mechanism
> > to do this would be to take a write layout lease and use fallocate() to set the
> > layout of the file.  The layout lease can then be "downgraded" to unbreakable
> > read layout as long as no other user broke the write layout lease.
> > 
> 
> Will userland require any special privileges in order to set an
> F_UNBREAK lease? This seems like something that could be used for DoS. I
> assume that these will never time out.

Dan and I discussed this some more and yes I think the uid of the process needs
to be the owner of the file.  I think that is a reasonable mechanism.

> 
> How will we deal with the case where something is is squatting on an
> F_UNBREAK lease and isn't letting it go?

That is a good question.  I had not considered someone taking the UNBREAK
without pinning the file.

> 
> Leases are technically "owned" by the file description -- we can't
> necessarily trace it back to a single task in a threaded program. The
> kernel task that set the lease may have exited by the time we go
> looking.
> 
> Will we be content trying to determine this using /proc/locks+lsof, etc,
> or will we need something better?

I think using /proc/locks is our best bet.  Similar to my intention to report
files being pinned.[1]

In fact should we consider files with F_UNBREAK leases "pinned" and just report
them there?

Ira

[1] https://lkml.org/lkml/2019/8/9/1043

> 
> > </fcntl man page addition>
> 
> -- 
> Jeff Layton <jlayton@kernel.org>
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-09-30  8:42     ` Dave Chinner
@ 2019-10-01 21:01       ` Ira Weiny
  2019-10-02 13:07         ` Dan Williams
  2019-10-10 10:39         ` Dave Chinner
  2019-10-04  7:51       ` Jan Kara
  1 sibling, 2 replies; 18+ messages in thread
From: Ira Weiny @ 2019-10-01 21:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, linux-ext4, linux-rdma, linux-kernel,
	linux-nvdimm, linux-mm, Jeff Layton, Jan Kara, Theodore Ts'o,
	John Hubbard, Dan Williams, Jason Gunthorpe

On Mon, Sep 30, 2019 at 06:42:33PM +1000, Dave Chinner wrote:
> On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> > On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > > Hence, AFIACT, the above definition of a F_RDLCK|F_LAYOUT lease
> > > doesn't appear to be compatible with the semantics required by
> > > existing users of layout leases.
> > 
> > I disagree.  Other than the addition of F_UNBREAK, I think this is consistent
> > with what is currently implemented.  Also, by exporting all this to user space
> > we can now write tests for it independent of the RDMA pinning.
> 
> The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
> layout changes to occur to the file while the layout lease is held.

This was not my understanding.

> IOWs, your definition of F_RDLCK | F_LAYOUT not being allowed
> to change the is in direct contradition to existing users.
> 
> I've said this several times over the past few months now: shared
> layout leases must allow layout modifications to be made.

I don't understand what the point of having a layout lease is then?

>
> Only
> allowing an exclusive layout lease to modify the layout rules out
> many potential use cases for direct data placement and p2p DMA
> applications,

How?  I think that having a typical design pattern of multiple readers
and only a single writer would actually make all these use cases easier.

> not to mention conflicts with the existing pNFS usage.

I apologize for not understanding this.  My reading of the code is that layout
changes require the read layout to be broken prior to proceeding.

The break layout code does this by creating a F_WRLCK of type FL_LAYOUT which
conflicts with the F_RDLCK of type FL_LAYOUT...

int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
{
...
        struct file_lock *new_fl, *fl, *tmp;
...

        new_fl = lease_alloc(NULL, want_write ? F_WRLCK : F_RDLCK, 0);
        if (IS_ERR(new_fl))
                return PTR_ERR(new_fl);
        new_fl->fl_flags = type;
...
        list_for_each_entry_safe(fl, tmp, &ctx->flc_lease, fl_list) {
                if (!leases_conflict(fl, new_fl))
                        continue;
...
}

type == FL_LAYOUT from the call here.

static inline int break_layout(struct inode *inode, bool wait)
{
        smp_mb();
        if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease))
                return __break_lease(inode,
                                wait ? O_WRONLY : O_WRONLY | O_NONBLOCK,
                                FL_LAYOUT);
        return 0;
}       

Also, I don't see any code which limits the number of read layout holders which
can be present and all of them will be revoked by the above code.

What am I missing?

Ira


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-10-01 18:17   ` Ira Weiny
@ 2019-10-02 12:28     ` Jeff Layton
  2019-10-02 19:27       ` bfields
  2019-10-03  9:01     ` Jan Kara
  1 sibling, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2019-10-02 12:28 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-fsdevel, linux-xfs, linux-ext4, linux-rdma, linux-kernel,
	linux-nvdimm, linux-mm, Dave Chinner, Jan Kara,
	Theodore Ts'o, John Hubbard, Dan Williams, Jason Gunthorpe

On Tue, 2019-10-01 at 11:17 -0700, Ira Weiny wrote:
> On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > Since the last RFC patch set[1] much of the discussion of supporting RDMA with
> > > FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> > > thread it was suggested I try and write some documentation and/or tests for the
> > > new mechanism being proposed.  I have created a foundation to test lease
> > > functionality within xfstests.[3] This should be close to being accepted.
> > > Before writing additional lease tests, or changing lots of kernel code, this
> > > email presents documentation for the new proposed "layout lease" semantic.
> > > 
> > > At Linux Plumbers[4] just over a week ago, I presented the current state of the
> > > patch set and the outstanding issues.  Based on the discussion there, well as
> > > follow up emails, I propose the following addition to the fcntl() man page.
> > > 
> > > Thank you,
> > > Ira
> > > 
> > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > 
> > > 
> > 
> > Thank you so much for doing this, Ira. This allows us to debate the
> > user-visible behavior semantics without getting bogged down in the
> > implementation details. More comments below:
> 
> Thanks.  Sorry for the delay in response.  Turns out this email was in my
> spam...  :-/  I'll need to work out why.
> 
> > > <fcntl man page addition>
> > > Layout Leases
> > > -------------
> > > 
> > > Layout (F_LAYOUT) leases are special leases which can be used to control and/or
> > > be informed about the manipulation of the underlying layout of a file.
> > > 
> > > A layout is defined as the logical file block -> physical file block mapping
> > > including the file size and sharing of physical blocks among files.  Note that
> > > the unwritten state of a block is not considered part of file layout.
> > > 
> > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > 
> > > Read layout leases can be used to be informed of layout changes by the
> > > system or other users.  This lease is similar to the standard read (F_RDLCK)
> > > lease in that any attempt to change the _layout_ of the file will be reported to
> > > the process through the lease break process.  But this lease is different
> > > because the file can be opened for write and data can be read and/or written to
> > > the file as long as the underlying layout of the file does not change.
> > > Therefore, the lease is not broken if the file is simply open for write, but
> > > _may_ be broken if an operation such as, truncate(), fallocate() or write()
> > > results in changing the underlying layout.
> > > 
> > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > 
> > > Write Layout leases can be used to break read layout leases to indicate that
> > > the process intends to change the underlying layout lease of the file.
> > > 
> > > A process which has taken a write layout lease has exclusive ownership of the
> > > file layout and can modify that layout as long as the lease is held.
> > > Operations which change the layout are allowed by that process.  But operations
> > > from other file descriptors which attempt to change the layout will break the
> > > lease through the standard lease break process.  The F_LAYOUT flag is used to
> > > indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
> > > the F_LAYOUT case opens for write do not break the lease.  But some operations,
> > > if they change the underlying layout, may.
> > > 
> > > The distinction between read layout leases and write layout leases is that
> > > write layout leases can change the layout without breaking the lease within the
> > > owning process.  This is useful to guarantee a layout prior to specifying the
> > > unbreakable flag described below.
> > > 
> > > 
> > 
> > The above sounds totally reasonable. You're essentially exposing the
> > behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
> > leases work the same way as "normal" leases, wrt signals and timeouts?
> 
> That was my intention, yes.
>
> > I do wonder if we're better off not trying to "or" in flags for this,
> > and instead have a separate set of commands (maybe F_RDLAYOUT,
> > F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
> > feel terribly strongly about it.
> 
> I'm leaning that was as well.  To make these even more distinct from
> F_SETLEASE.
> 
> > Also, at least in NFSv4, layouts are handed out for a particular byte
> > range in a file. Should we consider doing this with an API that allows
> > for that in the future? Is this something that would be desirable for
> > your RDMA+DAX use-cases?
> 
> I don't see this.  I've thought it would be a nice thing to have but I don't
> know of any hard use case.  But first I'd like to understand how this works for
> NFS.
> 

The NFSv4.1 spec allows the client to request the layouts for a
particular range in the file:

https://tools.ietf.org/html/rfc5661#page-538

The knfsd only hands out whole-file layouts at present. Eventually we
may want to make better use of segmented layouts, at which point we'd
need something like a byte-range lease.

> > We could add a new F_SETLEASE variant that takes a struct with a byte
> > range (something like struct flock).
> 
> I think this would be another reason to introduce F_[RD|WR|UN]LAYOUT as a
> command.  Perhaps supporting smaller byte ranges could be added later?
> 

I'd definitely not multiplex this over F_SETLEASE. An F_SETLAYOUT cmd
would probably be sufficient, and maybe just reuse
F_RDLCK/F_WRLCK/F_UNLCK for the iomode?

For the byte ranges, the catch there is that extending the userland
interface for that later will be difficult. What I'd probably suggest
(and what would jive with the way pNFS works) would be to go ahead and
add an offset and length to the arguments and result (maybe also
whence?).

The current kernel implementation could be free to deliver a larger
range than requested and only hand out full-file layouts for now.
Eventually we could rework the internals to allow for byte-range layout
leases.

I think this means that you'll probably require an argument struct for
layouts, analogous to struct flock for F_SETLK.

> > > **Unbreakable Layout Leases (F_UNBREAK)**
> > > 
> > > In order to support pinning of file pages by direct user space users an
> > > unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
> > > lease.  When specified, F_UNBREAK indicates that any user attempting to break
> > > the lease will fail with ETXTBUSY rather than follow the normal breaking
> > > procedure.
> > > 
> > > Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
> > > specified.  The difference between an unbreakable read layout lease and an
> > > unbreakable write layout lease are that an unbreakable read layout lease is
> > > _not_ exclusive.  This means that once a layout is established on a file,
> > > multiple unbreakable read layout leases can be taken by multiple processes and
> > > used to pin the underlying pages of that file.
> > > 
> > > Care must therefore be taken to ensure that the layout of the file is as the
> > > user wants prior to using the unbreakable read layout lease.  A safe mechanism
> > > to do this would be to take a write layout lease and use fallocate() to set the
> > > layout of the file.  The layout lease can then be "downgraded" to unbreakable
> > > read layout as long as no other user broke the write layout lease.
> > > 
> > 
> > Will userland require any special privileges in order to set an
> > F_UNBREAK lease? This seems like something that could be used for DoS. I
> > assume that these will never time out.
> 
> Dan and I discussed this some more and yes I think the uid of the process needs
> to be the owner of the file.  I think that is a reasonable mechanism.
> 

If that's the model we want to use, then I think the owner of the file
will need some mechanism to forcibly seize the layout in this event.

What happens when the file is chowned in that case. Is that also denied?
If I set an F_UNBREAK layout (maybe as root) and then setuid(), do I get
to keep the layout?

> > How will we deal with the case where something is is squatting on an
> > F_UNBREAK lease and isn't letting it go?
> 
> That is a good question.  I had not considered someone taking the UNBREAK
> without pinning the file.
> 

Even if the file is "pinned", I think this is still something that could
be abused. We need to try to consider how we will address those
situations up front.

In that same vein, I know you mentioned that conflicting activity will
just be denied when there is an outstanding F_UNBREAK lease. Will the
process holding one be notified in some fashion when another task
attempts to do some conflicting activity?

> > Leases are technically "owned" by the file description -- we can't
> > necessarily trace it back to a single task in a threaded program. The
> > kernel task that set the lease may have exited by the time we go
> > looking.
> > 
> > Will we be content trying to determine this using /proc/locks+lsof, etc,
> > or will we need something better?
> 
> I think using /proc/locks is our best bet.  Similar to my intention to report
> files being pinned.[1]
> 
> In fact should we consider files with F_UNBREAK leases "pinned" and just report
> them there?
> 
> [1] https://lkml.org/lkml/2019/8/9/1043

Sure, but eventually you'll want to track that back to a process that is
holding the thing. /proc/locks just shows you dev+ino for a particular
lock. You'll need to use something like lsof to figure out who is
holding the file open.

Since layouts aren't necessarily broken on open, there may be a bunch of
tasks that have the file open. How will you identify which one holds the
F_UNBREAK layout?
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-10-01 21:01       ` Ira Weiny
@ 2019-10-02 13:07         ` Dan Williams
  2019-10-10 10:39         ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Dan Williams @ 2019-10-02 13:07 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Chinner, linux-fsdevel, linux-xfs, linux-ext4, linux-rdma,
	Linux Kernel Mailing List, linux-nvdimm, Linux MM, Jeff Layton,
	Jan Kara, Theodore Ts'o, John Hubbard, Jason Gunthorpe,
	Christoph Hellwig

On Tue, Oct 1, 2019 at 2:02 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Mon, Sep 30, 2019 at 06:42:33PM +1000, Dave Chinner wrote:
> > On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> > > On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > > > Hence, AFIACT, the above definition of a F_RDLCK|F_LAYOUT lease
> > > > doesn't appear to be compatible with the semantics required by
> > > > existing users of layout leases.
> > >
> > > I disagree.  Other than the addition of F_UNBREAK, I think this is consistent
> > > with what is currently implemented.  Also, by exporting all this to user space
> > > we can now write tests for it independent of the RDMA pinning.
> >
> > The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
> > layout changes to occur to the file while the layout lease is held.
>
> This was not my understanding.

I think you guys are talking past each other. F_RDLCK | F_LAYOUT can
be broken to allow writes to the file / layout. The new unbreakable
case would require explicit SIGKILL as "revocation method of last
resort", but that's the new incremental extension being proposed. No
changes to the current behavior of F_RDLCK | F_LAYOUT.

Dave, the question at hand is whether this new layout lease mode being
proposed is going to respond to BREAK_WRITE, or just BREAK_UNMAP. It
seems longterm page pinning conflicts really only care about
BREAK_UNMAP where pages that were part of the file are being removed
from the file. The unbreakable case can tolerate layout changes that
keep pinned pages mapped / allocated to the file.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-10-02 12:28     ` Jeff Layton
@ 2019-10-02 19:27       ` bfields
  2019-10-02 20:35         ` Jeff Layton
  0 siblings, 1 reply; 18+ messages in thread
From: bfields @ 2019-10-02 19:27 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Ira Weiny, linux-fsdevel, linux-xfs, linux-ext4, linux-rdma,
	linux-kernel, linux-nvdimm, linux-mm, Dave Chinner, Jan Kara,
	Theodore Ts'o, John Hubbard, Dan Williams, Jason Gunthorpe

On Wed, Oct 02, 2019 at 08:28:40AM -0400, Jeff Layton wrote:
> On Tue, 2019-10-01 at 11:17 -0700, Ira Weiny wrote:
> > On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > > Since the last RFC patch set[1] much of the discussion of supporting RDMA with
> > > > FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> > > > thread it was suggested I try and write some documentation and/or tests for the
> > > > new mechanism being proposed.  I have created a foundation to test lease
> > > > functionality within xfstests.[3] This should be close to being accepted.
> > > > Before writing additional lease tests, or changing lots of kernel code, this
> > > > email presents documentation for the new proposed "layout lease" semantic.
> > > > 
> > > > At Linux Plumbers[4] just over a week ago, I presented the current state of the
> > > > patch set and the outstanding issues.  Based on the discussion there, well as
> > > > follow up emails, I propose the following addition to the fcntl() man page.
> > > > 
> > > > Thank you,
> > > > Ira
> > > > 
> > > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > > 
> > > > 
> > > 
> > > Thank you so much for doing this, Ira. This allows us to debate the
> > > user-visible behavior semantics without getting bogged down in the
> > > implementation details. More comments below:
> > 
> > Thanks.  Sorry for the delay in response.  Turns out this email was in my
> > spam...  :-/  I'll need to work out why.
> > 
> > > > <fcntl man page addition>
> > > > Layout Leases
> > > > -------------
> > > > 
> > > > Layout (F_LAYOUT) leases are special leases which can be used to control and/or
> > > > be informed about the manipulation of the underlying layout of a file.
> > > > 
> > > > A layout is defined as the logical file block -> physical file block mapping
> > > > including the file size and sharing of physical blocks among files.  Note that
> > > > the unwritten state of a block is not considered part of file layout.
> > > > 
> > > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > > 
> > > > Read layout leases can be used to be informed of layout changes by the
> > > > system or other users.  This lease is similar to the standard read (F_RDLCK)
> > > > lease in that any attempt to change the _layout_ of the file will be reported to
> > > > the process through the lease break process.  But this lease is different
> > > > because the file can be opened for write and data can be read and/or written to
> > > > the file as long as the underlying layout of the file does not change.
> > > > Therefore, the lease is not broken if the file is simply open for write, but
> > > > _may_ be broken if an operation such as, truncate(), fallocate() or write()
> > > > results in changing the underlying layout.
> > > > 
> > > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > > 
> > > > Write Layout leases can be used to break read layout leases to indicate that
> > > > the process intends to change the underlying layout lease of the file.
> > > > 
> > > > A process which has taken a write layout lease has exclusive ownership of the
> > > > file layout and can modify that layout as long as the lease is held.
> > > > Operations which change the layout are allowed by that process.  But operations
> > > > from other file descriptors which attempt to change the layout will break the
> > > > lease through the standard lease break process.  The F_LAYOUT flag is used to
> > > > indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
> > > > the F_LAYOUT case opens for write do not break the lease.  But some operations,
> > > > if they change the underlying layout, may.
> > > > 
> > > > The distinction between read layout leases and write layout leases is that
> > > > write layout leases can change the layout without breaking the lease within the
> > > > owning process.  This is useful to guarantee a layout prior to specifying the
> > > > unbreakable flag described below.
> > > > 
> > > > 
> > > 
> > > The above sounds totally reasonable. You're essentially exposing the
> > > behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
> > > leases work the same way as "normal" leases, wrt signals and timeouts?
> > 
> > That was my intention, yes.
> >
> > > I do wonder if we're better off not trying to "or" in flags for this,
> > > and instead have a separate set of commands (maybe F_RDLAYOUT,
> > > F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
> > > feel terribly strongly about it.
> > 
> > I'm leaning that was as well.  To make these even more distinct from
> > F_SETLEASE.
> > 
> > > Also, at least in NFSv4, layouts are handed out for a particular byte
> > > range in a file. Should we consider doing this with an API that allows
> > > for that in the future? Is this something that would be desirable for
> > > your RDMA+DAX use-cases?
> > 
> > I don't see this.  I've thought it would be a nice thing to have but I don't
> > know of any hard use case.  But first I'd like to understand how this works for
> > NFS.
> > 
> 
> The NFSv4.1 spec allows the client to request the layouts for a
> particular range in the file:
> 
> https://tools.ietf.org/html/rfc5661#page-538
> 
> The knfsd only hands out whole-file layouts at present. Eventually we
> may want to make better use of segmented layouts, at which point we'd
> need something like a byte-range lease.
> 
> > > We could add a new F_SETLEASE variant that takes a struct with a byte
> > > range (something like struct flock).
> > 
> > I think this would be another reason to introduce F_[RD|WR|UN]LAYOUT as a
> > command.  Perhaps supporting smaller byte ranges could be added later?
> > 
> 
> I'd definitely not multiplex this over F_SETLEASE. An F_SETLAYOUT cmd
> would probably be sufficient, and maybe just reuse
> F_RDLCK/F_WRLCK/F_UNLCK for the iomode?
> 
> For the byte ranges, the catch there is that extending the userland
> interface for that later will be difficult.

Why would it be difficult?

> What I'd probably suggest
> (and what would jive with the way pNFS works) would be to go ahead and
> add an offset and length to the arguments and result (maybe also
> whence?).

Why not add new commands with range arguments later if it turns out to
be necessary?

--b.

> 
> The current kernel implementation could be free to deliver a larger
> range than requested and only hand out full-file layouts for now.
> Eventually we could rework the internals to allow for byte-range layout
> leases.
> 
> I think this means that you'll probably require an argument struct for
> layouts, analogous to struct flock for F_SETLK.
> 
> > > > **Unbreakable Layout Leases (F_UNBREAK)**
> > > > 
> > > > In order to support pinning of file pages by direct user space users an
> > > > unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
> > > > lease.  When specified, F_UNBREAK indicates that any user attempting to break
> > > > the lease will fail with ETXTBUSY rather than follow the normal breaking
> > > > procedure.
> > > > 
> > > > Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
> > > > specified.  The difference between an unbreakable read layout lease and an
> > > > unbreakable write layout lease are that an unbreakable read layout lease is
> > > > _not_ exclusive.  This means that once a layout is established on a file,
> > > > multiple unbreakable read layout leases can be taken by multiple processes and
> > > > used to pin the underlying pages of that file.
> > > > 
> > > > Care must therefore be taken to ensure that the layout of the file is as the
> > > > user wants prior to using the unbreakable read layout lease.  A safe mechanism
> > > > to do this would be to take a write layout lease and use fallocate() to set the
> > > > layout of the file.  The layout lease can then be "downgraded" to unbreakable
> > > > read layout as long as no other user broke the write layout lease.
> > > > 
> > > 
> > > Will userland require any special privileges in order to set an
> > > F_UNBREAK lease? This seems like something that could be used for DoS. I
> > > assume that these will never time out.
> > 
> > Dan and I discussed this some more and yes I think the uid of the process needs
> > to be the owner of the file.  I think that is a reasonable mechanism.
> > 
> 
> If that's the model we want to use, then I think the owner of the file
> will need some mechanism to forcibly seize the layout in this event.
> 
> What happens when the file is chowned in that case. Is that also denied?
> If I set an F_UNBREAK layout (maybe as root) and then setuid(), do I get
> to keep the layout?
> 
> > > How will we deal with the case where something is is squatting on an
> > > F_UNBREAK lease and isn't letting it go?
> > 
> > That is a good question.  I had not considered someone taking the UNBREAK
> > without pinning the file.
> > 
> 
> Even if the file is "pinned", I think this is still something that could
> be abused. We need to try to consider how we will address those
> situations up front.
> 
> In that same vein, I know you mentioned that conflicting activity will
> just be denied when there is an outstanding F_UNBREAK lease. Will the
> process holding one be notified in some fashion when another task
> attempts to do some conflicting activity?
> 
> > > Leases are technically "owned" by the file description -- we can't
> > > necessarily trace it back to a single task in a threaded program. The
> > > kernel task that set the lease may have exited by the time we go
> > > looking.
> > > 
> > > Will we be content trying to determine this using /proc/locks+lsof, etc,
> > > or will we need something better?
> > 
> > I think using /proc/locks is our best bet.  Similar to my intention to report
> > files being pinned.[1]
> > 
> > In fact should we consider files with F_UNBREAK leases "pinned" and just report
> > them there?
> > 
> > [1] https://lkml.org/lkml/2019/8/9/1043
> 
> Sure, but eventually you'll want to track that back to a process that is
> holding the thing. /proc/locks just shows you dev+ino for a particular
> lock. You'll need to use something like lsof to figure out who is
> holding the file open.
> 
> Since layouts aren't necessarily broken on open, there may be a bunch of
> tasks that have the file open. How will you identify which one holds the
> F_UNBREAK layout?
> -- 
> Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-10-02 19:27       ` bfields
@ 2019-10-02 20:35         ` Jeff Layton
  2019-10-03  8:43           ` Jan Kara
  2019-10-03 15:37           ` J. Bruce Fields
  0 siblings, 2 replies; 18+ messages in thread
From: Jeff Layton @ 2019-10-02 20:35 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Ira Weiny, linux-fsdevel, linux-xfs, linux-ext4, linux-rdma,
	linux-kernel, linux-nvdimm, linux-mm, Dave Chinner, Jan Kara,
	Theodore Ts'o, John Hubbard, Dan Williams, Jason Gunthorpe

On Wed, 2019-10-02 at 15:27 -0400, J. Bruce Fields wrote:
> On Wed, Oct 02, 2019 at 08:28:40AM -0400, Jeff Layton wrote:
> > On Tue, 2019-10-01 at 11:17 -0700, Ira Weiny wrote:
> > > On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > > > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > > > Since the last RFC patch set[1] much of the discussion of supporting RDMA with
> > > > > FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> > > > > thread it was suggested I try and write some documentation and/or tests for the
> > > > > new mechanism being proposed.  I have created a foundation to test lease
> > > > > functionality within xfstests.[3] This should be close to being accepted.
> > > > > Before writing additional lease tests, or changing lots of kernel code, this
> > > > > email presents documentation for the new proposed "layout lease" semantic.
> > > > > 
> > > > > At Linux Plumbers[4] just over a week ago, I presented the current state of the
> > > > > patch set and the outstanding issues.  Based on the discussion there, well as
> > > > > follow up emails, I propose the following addition to the fcntl() man page.
> > > > > 
> > > > > Thank you,
> > > > > Ira
> > > > > 
> > > > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > > > 
> > > > > 
> > > > 
> > > > Thank you so much for doing this, Ira. This allows us to debate the
> > > > user-visible behavior semantics without getting bogged down in the
> > > > implementation details. More comments below:
> > > 
> > > Thanks.  Sorry for the delay in response.  Turns out this email was in my
> > > spam...  :-/  I'll need to work out why.
> > > 
> > > > > <fcntl man page addition>
> > > > > Layout Leases
> > > > > -------------
> > > > > 
> > > > > Layout (F_LAYOUT) leases are special leases which can be used to control and/or
> > > > > be informed about the manipulation of the underlying layout of a file.
> > > > > 
> > > > > A layout is defined as the logical file block -> physical file block mapping
> > > > > including the file size and sharing of physical blocks among files.  Note that
> > > > > the unwritten state of a block is not considered part of file layout.
> > > > > 
> > > > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > > > 
> > > > > Read layout leases can be used to be informed of layout changes by the
> > > > > system or other users.  This lease is similar to the standard read (F_RDLCK)
> > > > > lease in that any attempt to change the _layout_ of the file will be reported to
> > > > > the process through the lease break process.  But this lease is different
> > > > > because the file can be opened for write and data can be read and/or written to
> > > > > the file as long as the underlying layout of the file does not change.
> > > > > Therefore, the lease is not broken if the file is simply open for write, but
> > > > > _may_ be broken if an operation such as, truncate(), fallocate() or write()
> > > > > results in changing the underlying layout.
> > > > > 
> > > > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > > > 
> > > > > Write Layout leases can be used to break read layout leases to indicate that
> > > > > the process intends to change the underlying layout lease of the file.
> > > > > 
> > > > > A process which has taken a write layout lease has exclusive ownership of the
> > > > > file layout and can modify that layout as long as the lease is held.
> > > > > Operations which change the layout are allowed by that process.  But operations
> > > > > from other file descriptors which attempt to change the layout will break the
> > > > > lease through the standard lease break process.  The F_LAYOUT flag is used to
> > > > > indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
> > > > > the F_LAYOUT case opens for write do not break the lease.  But some operations,
> > > > > if they change the underlying layout, may.
> > > > > 
> > > > > The distinction between read layout leases and write layout leases is that
> > > > > write layout leases can change the layout without breaking the lease within the
> > > > > owning process.  This is useful to guarantee a layout prior to specifying the
> > > > > unbreakable flag described below.
> > > > > 
> > > > > 
> > > > 
> > > > The above sounds totally reasonable. You're essentially exposing the
> > > > behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
> > > > leases work the same way as "normal" leases, wrt signals and timeouts?
> > > 
> > > That was my intention, yes.
> > > 
> > > > I do wonder if we're better off not trying to "or" in flags for this,
> > > > and instead have a separate set of commands (maybe F_RDLAYOUT,
> > > > F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
> > > > feel terribly strongly about it.
> > > 
> > > I'm leaning that was as well.  To make these even more distinct from
> > > F_SETLEASE.
> > > 
> > > > Also, at least in NFSv4, layouts are handed out for a particular byte
> > > > range in a file. Should we consider doing this with an API that allows
> > > > for that in the future? Is this something that would be desirable for
> > > > your RDMA+DAX use-cases?
> > > 
> > > I don't see this.  I've thought it would be a nice thing to have but I don't
> > > know of any hard use case.  But first I'd like to understand how this works for
> > > NFS.
> > > 
> > 
> > The NFSv4.1 spec allows the client to request the layouts for a
> > particular range in the file:
> > 
> > https://tools.ietf.org/html/rfc5661#page-538
> > 
> > The knfsd only hands out whole-file layouts at present. Eventually we
> > may want to make better use of segmented layouts, at which point we'd
> > need something like a byte-range lease.
> > 
> > > > We could add a new F_SETLEASE variant that takes a struct with a byte
> > > > range (something like struct flock).
> > > 
> > > I think this would be another reason to introduce F_[RD|WR|UN]LAYOUT as a
> > > command.  Perhaps supporting smaller byte ranges could be added later?
> > > 
> > 
> > I'd definitely not multiplex this over F_SETLEASE. An F_SETLAYOUT cmd
> > would probably be sufficient, and maybe just reuse
> > F_RDLCK/F_WRLCK/F_UNLCK for the iomode?
> > 
> > For the byte ranges, the catch there is that extending the userland
> > interface for that later will be difficult.
> 
> Why would it be difficult?
> 

Legacy userland code that wanted to use byte range enabled layouts would
have to be rebuilt to take advantage of them. If we require a range from
the get-go, then they will get the benefit of them once they're
available.
 
> > What I'd probably suggest
> > (and what would jive with the way pNFS works) would be to go ahead and
> > add an offset and length to the arguments and result (maybe also
> > whence?).
> 
> Why not add new commands with range arguments later if it turns out to
> be necessary?

We could do that. It'd be a little ugly, IMO, simply because then we'd
end up with two interfaces that do almost the exact same thing.

Should byte-range layouts at that point conflict with non-byte range
layouts, or should they be in different "spaces" (a'la POSIX and flock
locks)? When it's all one interface, those sorts of questions sort of
answer themselves. When they aren't we'll have to document them clearly
and I think the result will be more confusing for userland programmers.

If you felt strongly about leaving those out for now, you could just do
something similar to what Aleksa is planning for openat2 -- have a
struct pointer and length as arguments for this cmd, and only have a
single iomode member in there for now.

The kernel would have to know how to deal with "legacy" and byte-range-
enabled variants if we ever extend it, but that's not too hard to
handle.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-10-02 20:35         ` Jeff Layton
@ 2019-10-03  8:43           ` Jan Kara
  2019-10-03 15:37           ` J. Bruce Fields
  1 sibling, 0 replies; 18+ messages in thread
From: Jan Kara @ 2019-10-03  8:43 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, Ira Weiny, linux-fsdevel, linux-xfs, linux-ext4,
	linux-rdma, linux-kernel, linux-nvdimm, linux-mm, Dave Chinner,
	Jan Kara, Theodore Ts'o, John Hubbard, Dan Williams,
	Jason Gunthorpe

On Wed 02-10-19 16:35:55, Jeff Layton wrote:
> On Wed, 2019-10-02 at 15:27 -0400, J. Bruce Fields wrote:
> > On Wed, Oct 02, 2019 at 08:28:40AM -0400, Jeff Layton wrote:
> > > On Tue, 2019-10-01 at 11:17 -0700, Ira Weiny wrote:
> > > > On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > > > > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > > > > Since the last RFC patch set[1] much of the discussion of supporting RDMA with
> > > > > > FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> > > > > > thread it was suggested I try and write some documentation and/or tests for the
> > > > > > new mechanism being proposed.  I have created a foundation to test lease
> > > > > > functionality within xfstests.[3] This should be close to being accepted.
> > > > > > Before writing additional lease tests, or changing lots of kernel code, this
> > > > > > email presents documentation for the new proposed "layout lease" semantic.
> > > > > > 
> > > > > > At Linux Plumbers[4] just over a week ago, I presented the current state of the
> > > > > > patch set and the outstanding issues.  Based on the discussion there, well as
> > > > > > follow up emails, I propose the following addition to the fcntl() man page.
> > > > > > 
> > > > > > Thank you,
> > > > > > Ira
> > > > > > 
> > > > > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > > > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > > > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > > > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > > > > 
> > > > > > 
> > > > > 
> > > > > Thank you so much for doing this, Ira. This allows us to debate the
> > > > > user-visible behavior semantics without getting bogged down in the
> > > > > implementation details. More comments below:
> > > > 
> > > > Thanks.  Sorry for the delay in response.  Turns out this email was in my
> > > > spam...  :-/  I'll need to work out why.
> > > > 
> > > > > > <fcntl man page addition>
> > > > > > Layout Leases
> > > > > > -------------
> > > > > > 
> > > > > > Layout (F_LAYOUT) leases are special leases which can be used to control and/or
> > > > > > be informed about the manipulation of the underlying layout of a file.
> > > > > > 
> > > > > > A layout is defined as the logical file block -> physical file block mapping
> > > > > > including the file size and sharing of physical blocks among files.  Note that
> > > > > > the unwritten state of a block is not considered part of file layout.
> > > > > > 
> > > > > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > > > > 
> > > > > > Read layout leases can be used to be informed of layout changes by the
> > > > > > system or other users.  This lease is similar to the standard read (F_RDLCK)
> > > > > > lease in that any attempt to change the _layout_ of the file will be reported to
> > > > > > the process through the lease break process.  But this lease is different
> > > > > > because the file can be opened for write and data can be read and/or written to
> > > > > > the file as long as the underlying layout of the file does not change.
> > > > > > Therefore, the lease is not broken if the file is simply open for write, but
> > > > > > _may_ be broken if an operation such as, truncate(), fallocate() or write()
> > > > > > results in changing the underlying layout.
> > > > > > 
> > > > > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > > > > 
> > > > > > Write Layout leases can be used to break read layout leases to indicate that
> > > > > > the process intends to change the underlying layout lease of the file.
> > > > > > 
> > > > > > A process which has taken a write layout lease has exclusive ownership of the
> > > > > > file layout and can modify that layout as long as the lease is held.
> > > > > > Operations which change the layout are allowed by that process.  But operations
> > > > > > from other file descriptors which attempt to change the layout will break the
> > > > > > lease through the standard lease break process.  The F_LAYOUT flag is used to
> > > > > > indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
> > > > > > the F_LAYOUT case opens for write do not break the lease.  But some operations,
> > > > > > if they change the underlying layout, may.
> > > > > > 
> > > > > > The distinction between read layout leases and write layout leases is that
> > > > > > write layout leases can change the layout without breaking the lease within the
> > > > > > owning process.  This is useful to guarantee a layout prior to specifying the
> > > > > > unbreakable flag described below.
> > > > > > 
> > > > > > 
> > > > > 
> > > > > The above sounds totally reasonable. You're essentially exposing the
> > > > > behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
> > > > > leases work the same way as "normal" leases, wrt signals and timeouts?
> > > > 
> > > > That was my intention, yes.
> > > > 
> > > > > I do wonder if we're better off not trying to "or" in flags for this,
> > > > > and instead have a separate set of commands (maybe F_RDLAYOUT,
> > > > > F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
> > > > > feel terribly strongly about it.
> > > > 
> > > > I'm leaning that was as well.  To make these even more distinct from
> > > > F_SETLEASE.
> > > > 
> > > > > Also, at least in NFSv4, layouts are handed out for a particular byte
> > > > > range in a file. Should we consider doing this with an API that allows
> > > > > for that in the future? Is this something that would be desirable for
> > > > > your RDMA+DAX use-cases?
> > > > 
> > > > I don't see this.  I've thought it would be a nice thing to have but I don't
> > > > know of any hard use case.  But first I'd like to understand how this works for
> > > > NFS.
> > > > 
> > > 
> > > The NFSv4.1 spec allows the client to request the layouts for a
> > > particular range in the file:
> > > 
> > > https://tools.ietf.org/html/rfc5661#page-538
> > > 
> > > The knfsd only hands out whole-file layouts at present. Eventually we
> > > may want to make better use of segmented layouts, at which point we'd
> > > need something like a byte-range lease.
> > > 
> > > > > We could add a new F_SETLEASE variant that takes a struct with a byte
> > > > > range (something like struct flock).
> > > > 
> > > > I think this would be another reason to introduce F_[RD|WR|UN]LAYOUT as a
> > > > command.  Perhaps supporting smaller byte ranges could be added later?
> > > > 
> > > 
> > > I'd definitely not multiplex this over F_SETLEASE. An F_SETLAYOUT cmd
> > > would probably be sufficient, and maybe just reuse
> > > F_RDLCK/F_WRLCK/F_UNLCK for the iomode?
> > > 
> > > For the byte ranges, the catch there is that extending the userland
> > > interface for that later will be difficult.
> > 
> > Why would it be difficult?
> > 
> 
> Legacy userland code that wanted to use byte range enabled layouts would
> have to be rebuilt to take advantage of them. If we require a range from
> the get-go, then they will get the benefit of them once they're
> available.

I don't think this is true. Because current implementation of locking the
whole file may hide implementation bugs in userspace. So the new
range lock handling may break userspace and history shows such
problems with APIs are actually rather common. So I think switching to
range locking *must* be conscious decision of the application and as
such having separate API for that is the most natural thing to do.

> > > What I'd probably suggest
> > > (and what would jive with the way pNFS works) would be to go ahead and
> > > add an offset and length to the arguments and result (maybe also
> > > whence?).
> > 
> > Why not add new commands with range arguments later if it turns out to
> > be necessary?
> 
> We could do that. It'd be a little ugly, IMO, simply because then we'd
> end up with two interfaces that do almost the exact same thing.
> 
> Should byte-range layouts at that point conflict with non-byte range
> layouts, or should they be in different "spaces" (a'la POSIX and flock
> locks)? When it's all one interface, those sorts of questions sort of
> answer themselves. When they aren't we'll have to document them clearly
> and I think the result will be more confusing for userland programmers.
> 
> If you felt strongly about leaving those out for now, you could just do
> something similar to what Aleksa is planning for openat2 -- have a
> struct pointer and length as arguments for this cmd, and only have a
> single iomode member in there for now.
> 
> The kernel would have to know how to deal with "legacy" and byte-range-
> enabled variants if we ever extend it, but that's not too hard to
> handle.

Yeah, so we can discuss how to make possible future extension towards
range locking the least confusing to userspace. E.g. we can just put the
ranges in the API and require that start is always 0 and end is always
ULONG_MAX or whatever. But switching to smaller ranges must be the decision
in the application after the kernel supports it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-10-01 18:17   ` Ira Weiny
  2019-10-02 12:28     ` Jeff Layton
@ 2019-10-03  9:01     ` Jan Kara
  2019-10-03 17:05       ` Ira Weiny
  1 sibling, 1 reply; 18+ messages in thread
From: Jan Kara @ 2019-10-03  9:01 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jeff Layton, linux-fsdevel, linux-xfs, linux-ext4, linux-rdma,
	linux-kernel, linux-nvdimm, linux-mm, Dave Chinner, Jan Kara,
	Theodore Ts'o, John Hubbard, Dan Williams, Jason Gunthorpe

On Tue 01-10-19 11:17:00, Ira Weiny wrote:
> On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > 
> > Will userland require any special privileges in order to set an
> > F_UNBREAK lease? This seems like something that could be used for DoS. I
> > assume that these will never time out.
> 
> Dan and I discussed this some more and yes I think the uid of the process needs
> to be the owner of the file.  I think that is a reasonable mechanism.

Honestly, I'm not convinced anything more than open-for-write should be
required. Sure unbreakable lease may result in failing truncate and other
ops but as we discussed at LFS/MM, this is not hugely different from
executing a file resulting in ETXTBUSY for any truncate attempt (even from
root). So sufficiently priviledged user has to be able to easily find which
process(es) owns the lease so that he can kill it / take other
administrative action to release the lease. But that's about it.
 
> > How will we deal with the case where something is is squatting on an
> > F_UNBREAK lease and isn't letting it go?
> 
> That is a good question.  I had not considered someone taking the UNBREAK
> without pinning the file.

IMHO the same answer as above - sufficiently priviledged user should be
able to easily find the process holding the lease and kill it. Given the
lease owner has to have write access to the file, he better should be from
the same "security domain"...

> > Leases are technically "owned" by the file description -- we can't
> > necessarily trace it back to a single task in a threaded program. The
> > kernel task that set the lease may have exited by the time we go
> > looking.
> > 
> > Will we be content trying to determine this using /proc/locks+lsof, etc,
> > or will we need something better?
> 
> I think using /proc/locks is our best bet.  Similar to my intention to report
> files being pinned.[1]
> 
> In fact should we consider files with F_UNBREAK leases "pinned" and just report
> them there?

As Jeff wrote later, /proc/locks is not enough. You need PID(s) which have
access to the lease and hold it alive. Your /proc/<pid>/ files you had in your
patches should do that, shouldn't they? Maybe they were not tied to the
right structure... They really need to be tied to the existence of a lease.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-10-02 20:35         ` Jeff Layton
  2019-10-03  8:43           ` Jan Kara
@ 2019-10-03 15:37           ` J. Bruce Fields
  1 sibling, 0 replies; 18+ messages in thread
From: J. Bruce Fields @ 2019-10-03 15:37 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Ira Weiny, linux-fsdevel, linux-xfs, linux-ext4, linux-rdma,
	linux-kernel, linux-nvdimm, linux-mm, Dave Chinner, Jan Kara,
	Theodore Ts'o, John Hubbard, Dan Williams, Jason Gunthorpe

On Wed, Oct 02, 2019 at 04:35:55PM -0400, Jeff Layton wrote:
> On Wed, 2019-10-02 at 15:27 -0400, J. Bruce Fields wrote:
> > On Wed, Oct 02, 2019 at 08:28:40AM -0400, Jeff Layton wrote:
> > > For the byte ranges, the catch there is that extending the userland
> > > interface for that later will be difficult.
> > 
> > Why would it be difficult?
> 
> Legacy userland code that wanted to use byte range enabled layouts would
> have to be rebuilt to take advantage of them. If we require a range from
> the get-go, then they will get the benefit of them once they're
> available.

I can't see writing byte-range code for a kernel that doesn't support
that yet.  How would I test it?

> > > What I'd probably suggest
> > > (and what would jive with the way pNFS works) would be to go ahead and
> > > add an offset and length to the arguments and result (maybe also
> > > whence?).
> > 
> > Why not add new commands with range arguments later if it turns out to
> > be necessary?
> 
> We could do that. It'd be a little ugly, IMO, simply because then we'd
> end up with two interfaces that do almost the exact same thing.
> 
> Should byte-range layouts at that point conflict with non-byte range
> layouts, or should they be in different "spaces" (a'la POSIX and flock
> locks)? When it's all one interface, those sorts of questions sort of
> answer themselves. When they aren't we'll have to document them clearly
> and I think the result will be more confusing for userland programmers.

I was hoping they'd be in the same space, with the old interface just
defined to deal in locks with range [0,∞).

I'm just worried about getting the interface wrong if it's specified
without being implemented.  Maybe this is straightforward enough that
there's not a risk, I don't know.

--b.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-10-03  9:01     ` Jan Kara
@ 2019-10-03 17:05       ` Ira Weiny
  0 siblings, 0 replies; 18+ messages in thread
From: Ira Weiny @ 2019-10-03 17:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Layton, linux-fsdevel, linux-xfs, linux-ext4, linux-rdma,
	linux-kernel, linux-nvdimm, linux-mm, Dave Chinner,
	Theodore Ts'o, John Hubbard, Dan Williams, Jason Gunthorpe

On Thu, Oct 03, 2019 at 11:01:10AM +0200, Jan Kara wrote:
> On Tue 01-10-19 11:17:00, Ira Weiny wrote:
> > On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > 
> > > Will userland require any special privileges in order to set an
> > > F_UNBREAK lease? This seems like something that could be used for DoS. I
> > > assume that these will never time out.
> > 
> > Dan and I discussed this some more and yes I think the uid of the process needs
> > to be the owner of the file.  I think that is a reasonable mechanism.
> 
> Honestly, I'm not convinced anything more than open-for-write should be
> required. Sure unbreakable lease may result in failing truncate and other
> ops but as we discussed at LFS/MM, this is not hugely different from
> executing a file resulting in ETXTBUSY for any truncate attempt (even from
> root). So sufficiently priviledged user has to be able to easily find which
> process(es) owns the lease so that he can kill it / take other
> administrative action to release the lease. But that's about it.

Well that was kind of what I was thinking.  However I wanted to be careful
about requiring write permission when doing a F_RDLCK.  I think that it has to
be clearly documented _why_ write permission is required.

>  
> > > How will we deal with the case where something is is squatting on an
> > > F_UNBREAK lease and isn't letting it go?
> > 
> > That is a good question.  I had not considered someone taking the UNBREAK
> > without pinning the file.
> 
> IMHO the same answer as above - sufficiently priviledged user should be
> able to easily find the process holding the lease and kill it. Given the
> lease owner has to have write access to the file, he better should be from
> the same "security domain"...
> 
> > > Leases are technically "owned" by the file description -- we can't
> > > necessarily trace it back to a single task in a threaded program. The
> > > kernel task that set the lease may have exited by the time we go
> > > looking.
> > > 
> > > Will we be content trying to determine this using /proc/locks+lsof, etc,
> > > or will we need something better?
> > 
> > I think using /proc/locks is our best bet.  Similar to my intention to report
> > files being pinned.[1]
> > 
> > In fact should we consider files with F_UNBREAK leases "pinned" and just report
> > them there?
> 
> As Jeff wrote later, /proc/locks is not enough. You need PID(s) which have
> access to the lease and hold it alive. Your /proc/<pid>/ files you had in your
> patches should do that, shouldn't they? Maybe they were not tied to the
> right structure... They really need to be tied to the existence of a lease.

Yes, sorry.  I misspoke above.

Right now /proc/<pid>/file_pins indicates that the file is pinned by GUP.  I
think it may be reasonable to extend that to any file which has F_UNBREAK
specified.  'file_pins' may be the wrong name when we include F_UNBREAK'ed
leased files, so I will think on the name.  But I think this is possible and
desired.

Ira


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-09-30  8:42     ` Dave Chinner
  2019-10-01 21:01       ` Ira Weiny
@ 2019-10-04  7:51       ` Jan Kara
  1 sibling, 0 replies; 18+ messages in thread
From: Jan Kara @ 2019-10-04  7:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ira Weiny, linux-fsdevel, linux-xfs, linux-ext4, linux-rdma,
	linux-kernel, linux-nvdimm, linux-mm, Jeff Layton, Jan Kara,
	Theodore Ts'o, John Hubbard, Dan Williams, Jason Gunthorpe

On Mon 30-09-19 18:42:33, Dave Chinner wrote:
> On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> > On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > > Hence, AFIACT, the above definition of a F_RDLCK|F_LAYOUT lease
> > > doesn't appear to be compatible with the semantics required by
> > > existing users of layout leases.
> > 
> > I disagree.  Other than the addition of F_UNBREAK, I think this is consistent
> > with what is currently implemented.  Also, by exporting all this to user space
> > we can now write tests for it independent of the RDMA pinning.
> 
> The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
> layout changes to occur to the file while the layout lease is held.

I remember you saying that in the past conversations. But I agree with Ira
that I don't see where in the code this would be implemented. AFAICS
break_layout() called from xfs_break_leased_layouts() simply breaks all the
leases with F_LAYOUT set attached to the inode... Now I'm not any expert on
file leases but what am I missing?

> IOWs, your definition of F_RDLCK | F_LAYOUT not being allowed
> to change the is in direct contradition to existing users.
> 
> I've said this several times over the past few months now: shared
> layout leases must allow layout modifications to be made. Only
> allowing an exclusive layout lease to modify the layout rules out
> many potential use cases for direct data placement and p2p DMA
> applications, not to mention conflicts with the existing pNFS usage.
> Layout leases need to support more than just RDMA, and tailoring the
> API to exactly the immediate needs of RDMA is just going to make it
> useless for anything else.

I agree we should not tailor the layout lease definition to just RDMA
usecase. But let's talk about the semantics once our confusion about how
pNFS currently uses layout leases is clear out.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Lease semantic proposal
  2019-10-01 21:01       ` Ira Weiny
  2019-10-02 13:07         ` Dan Williams
@ 2019-10-10 10:39         ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2019-10-10 10:39 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-fsdevel, linux-xfs, linux-ext4, linux-rdma, linux-kernel,
	linux-nvdimm, linux-mm, Jeff Layton, Jan Kara, Theodore Ts'o,
	John Hubbard, Dan Williams, Jason Gunthorpe

On Tue, Oct 01, 2019 at 02:01:57PM -0700, Ira Weiny wrote:
> On Mon, Sep 30, 2019 at 06:42:33PM +1000, Dave Chinner wrote:
> > On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> > > On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > > > Hence, AFIACT, the above definition of a F_RDLCK|F_LAYOUT lease
> > > > doesn't appear to be compatible with the semantics required by
> > > > existing users of layout leases.
> > > 
> > > I disagree.  Other than the addition of F_UNBREAK, I think this is consistent
> > > with what is currently implemented.  Also, by exporting all this to user space
> > > we can now write tests for it independent of the RDMA pinning.
> > 
> > The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
> > layout changes to occur to the file while the layout lease is held.
> 
> This was not my understanding.

These are the remote procerdeure calls that the pNFS client uses to
map and/or allocate space in the file it has a lease on:

struct export_operations {
....
        int (*map_blocks)(struct inode *inode, loff_t offset,
                          u64 len, struct iomap *iomap,
                          bool write, u32 *device_generation);
        int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
                             int nr_iomaps, struct iattr *iattr);
};

.map_blocks() allows the pnfs client to allocate blocks in the
storage.  .commit_blocks() is called once the write is complete to
do things like unwritten extent conversion on extents that it
allocated. In the XFS implementation of these methods, they call
directly into the XFS same block mapping code that the
read/write/mmap IO paths call into.

A typical pNFS use case is a HPC clusters, where thousands of nodes
might all be writing to separate parts of a huge sparse file (e.g.
out of core sparse matrix solver) and are reading/writing direct to
the storage via iSER or some other low level IB/RDMA storage
protocol.  Every write on every pNFS client needs space allocation,
so the pNFS server is basically just providing a remote interface to
the XFS space allocation interfaces for direct IO on the pNFS
clients.

IOWs, there can be thousands of concurrent pNFS layout leases on a
single inode at any given time and they can all be allocating space,
too.

> > IOWs, your definition of F_RDLCK | F_LAYOUT not being allowed
> > to change the is in direct contradition to existing users.
> > 
> > I've said this several times over the past few months now: shared
> > layout leases must allow layout modifications to be made.
> 
> I don't understand what the point of having a layout lease is then?

It's a *notification* mechanism.

Multiple processes can modify the file layout at the same time -
XFs was designed as a multi-write filesystem from the ground up and
we make use of that with shared IO locks for direct IO writes. 

The read layout lease model we've used for pNFS is essentially the
same concurrent writer model that direct IO in XFS uses. And to
enable concurrent writers, we use shared locking for the the layout
leases.

IOWs, the pNFS client IO model is very similar to local direct IO,
except for the fact they can remotely cache layout mappings.  Hence
if you do a server-side local buffered write (delayed allocation),
truncate, punch a hole, etc, (or a remote operation through the NFS
server that ends up in these same paths) the mappings those pNFS
clients hold are no longer guaranteed to cover valid data and/or
have correct physical mappings for valid data held on the server.

At this point, the layouts need to be invalidated, and so the layout
lease is broken by the filesystem operations that may cause an
issue. The pNFS server reacts to the lease break by recalling the
client layout(s) and the pNFS client has to request a new layout
from the server to be able to directly access the storage again.

i.e. the layout lease is primarily a *notification* mechanism to
allow safe interop between application level direct access
mechanisms and local filesystem access.

What you are trying to do is turn this multi-writer layout lease
notification mechanism into a single writer access control
mechanism. i.e. F_UNBREAK is all about /control/ of the layout and
who can and can't modify it, regardless of whether they write
permissions have been granted or not.

It seems I have been unable to get this point across despite trying
for months now: access control is not a function provided by layout
leases. If you need to guarantee exclusive access to a file so
nobody else can modify it, direct access or through the filesystem,
then that is what permission bits, ACLs, file locks, LSMs, etc are
for. Don't try to overload a layout change notification mechanism
with data access controls.

> I apologize for not understanding this.  My reading of the code is that layout
> changes require the read layout to be broken prior to proceeding.

There's a difference between additive layout changes (such as
allocating unwritten extents over a hole before a write) that don't
pose any risk of uninitialised data exposure or use-after free.
These sorts of layout changes are allowed while holding a layout
lease.

pNFS clients can only do additive changes to the layout via the
export ops above. Further, technically speaking (because we don't
currently implement this), local direct IO read/write is allowed
without breaking layout leases as DIO writes can only trigger
additive changes to the layout.

The layout changes we need notification about (i.e. lease breaks)
are subtractive layout changes (truncate, hole punch, copy-on-write)
and ethereal layout changes (e.g. delayed allocation, where data is
in memory but has no physical space allocated). Those are the ones
that lead to problems with direct access, either in terms of
in-correct in-memory state (pages mapped into RDMA hardware that no
longer have backing store) or the file mapping the application has
cached (e.g. via fiemap or pNFS file layouts) is no longer valid.

These subtractive/ethereal layout changes are the ones that need to
break _all_ outstanding layout leases, because nobody knows ahead of
time which applications might be impacted by the layout modification
that is about to occur.

IOWs, layout leases are not intended to directly control who can and
who can't modify the layout of a file, they are for issuing
notifications to parties using direct storage access that a
potentially dangerous layout change is about to be made to a file
they are directly accessing....

> The break layout code does this by creating a F_WRLCK of type FL_LAYOUT which
> conflicts with the F_RDLCK of type FL_LAYOUT...

Yes, but that's an internal implementation detail of how leases are
broken. __break_lease(O_WRONLY) means "break all leases", and it
does that by creating a temporary exclusive lease and then breaking
all the leases on the inode that conflict with that lease. Which, by
definition, is all of the leases of the same type. It never attaches
that temporary lease to the inode - it is only used for comparison
and is discarded once the lease break is done.

That doesn't mean this behaviour is intended to be part of the
visible layout lease user API, nor that it means F_WRLCK layout
leases are something that is supposed to provide exlusive
modification access to the file layout. It's just an implementation
mechanism that simplifies breaking existing leases.

> Also, I don't see any code which limits the number of read layout holders which
> can be present

There is no limit on the number of holders that can have read
layouts...

> and all of them will be revoked by the above code.

Yup, that's what breaking leases does right now - it notifies all
lease holders that a potentially problematic layout change is about
to be made.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, back to index

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-23 19:08 Lease semantic proposal Ira Weiny
2019-09-23 20:17 ` Jeff Layton
2019-10-01 18:17   ` Ira Weiny
2019-10-02 12:28     ` Jeff Layton
2019-10-02 19:27       ` bfields
2019-10-02 20:35         ` Jeff Layton
2019-10-03  8:43           ` Jan Kara
2019-10-03 15:37           ` J. Bruce Fields
2019-10-03  9:01     ` Jan Kara
2019-10-03 17:05       ` Ira Weiny
2019-09-23 22:26 ` Dave Chinner
2019-09-25 23:46   ` Ira Weiny
2019-09-26 11:29     ` Jeff Layton
2019-09-30  8:42     ` Dave Chinner
2019-10-01 21:01       ` Ira Weiny
2019-10-02 13:07         ` Dan Williams
2019-10-10 10:39         ` Dave Chinner
2019-10-04  7:51       ` Jan Kara

Linux-RDMA Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-rdma/0 linux-rdma/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-rdma linux-rdma/ https://lore.kernel.org/linux-rdma \
		linux-rdma@vger.kernel.org
	public-inbox-index linux-rdma

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-rdma


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git