* directory delegations
@ 2019-04-01 16:21 Bradley C. Kuszmaul
  2019-04-02 16:11 ` J. Bruce Fields
  0 siblings, 1 reply; 20+ messages in thread
From: Bradley C. Kuszmaul @ 2019-04-01 16:21 UTC (permalink / raw)
  To: linux-nfs

Hi, I'm the architect for Oracle's File Storage Service.   FSS is 
basically a big scalable NFS server that runs in Oracle's cloud.

Our metadata operations have higher latency than those of a vanilla NFS 
server (e.g., a Linux NFS server serving an XFS filesystem stored on a 
block device), and we suspect that directory delegations would yield a 
big performance improvement.

I understand, however, that essentially no one implements directory 
delegations.

Can anyone fill me in on the current thinking about the future of support 
for directory delegations in the Linux NFS client?

-Bradley




* Re: directory delegations
  2019-04-01 16:21 directory delegations Bradley C. Kuszmaul
@ 2019-04-02 16:11 ` J. Bruce Fields
  2019-04-02 17:26   ` Bradley C. Kuszmaul
  0 siblings, 1 reply; 20+ messages in thread
From: J. Bruce Fields @ 2019-04-02 16:11 UTC (permalink / raw)
  To: Bradley C. Kuszmaul; +Cc: linux-nfs

On Mon, Apr 01, 2019 at 12:21:49PM -0400, Bradley C. Kuszmaul wrote:
> Hi, I'm the architect for Oracle's File Storage Service.   FSS is
> basically a big scalable NFS server that runs in Oracle's cloud.
> 
> Our metadata operations have higher latency than those of a vanilla
> NFS server (e.g., a Linux NFS server serving an XFS filesystem stored
> on a block device), and we suspect that directory delegations would
> yield a big performance improvement.
> 
> I understand, however, that essentially no one implements directory
> delegations.
> 
> Can anyone fill me in on the current thinking about the future of
> support for directory delegations in the Linux NFS client?

Maybe somebody else will speak up, but I don't know of any effort to
implement directory delegations.

What metadata operations specifically are you worried about?  The
directory delegations that are specified in RFC 5661 are read-only.
Which might explain some of the lack of interest.

But there may be other steps that we could take to improve matters.

--b.


* Re: directory delegations
  2019-04-02 16:11 ` J. Bruce Fields
@ 2019-04-02 17:26   ` Bradley C. Kuszmaul
  2019-04-02 17:29     ` Bradley C. Kuszmaul
  2019-04-02 19:41     ` J. Bruce Fields
  0 siblings, 2 replies; 20+ messages in thread
From: Bradley C. Kuszmaul @ 2019-04-02 17:26 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

My simple model of metadata operations is to untar something like the 
linux sources.

Each file incurs a LOOKUP, CREATE, SETATTR, and WRITE, each of which is 
fairly high latency (even the WRITE ends up being done essentially 
synchronously because tar closes the file after its write(2) call.)
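
To make the access pattern concrete, here is a rough per-file sketch 
(mine, not lifted from tar) of the syscalls an untar performs, with the 
NFS traffic each step tends to generate noted in the comments:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative only; error handling omitted. */
static void extract_one(int dirfd, const char *name,
                        const void *data, size_t len,
                        const struct timespec mtime[2])
{
        /* CREATE (plus any lookup of the name) */
        int fd = openat(dirfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        write(fd, data, len);   /* WRITE */
        futimens(fd, mtime);    /* SETATTR, to restore the archived mtime */

        /* close-to-open semantics flush the dirty data to the server
         * before close() returns, which is what makes the WRITE
         * effectively synchronous */
        close(fd);
}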

I guess directory delegations might save the cost of LOOKUP.

Is there any hope for getting write delegations?

What other steps might be possible?

-Bradley

On 4/2/19 12:11 PM, bfields@fieldses.org wrote:
> On Mon, Apr 01, 2019 at 12:21:49PM -0400, Bradley C. Kuszmaul wrote:
>> Hi, I'm the architect for Oracle's File Storage Service.   FSS is
>> basically a big scalable NFS server that runs in Oracle's cloud.
>>
>> Our metadata operations have higher latency than those of a vanilla
>> NFS server (e.g., a Linux NFS server serving an XFS filesystem stored
>> on a block device), and we suspect that directory delegations would
>> yield a big performance improvement.
>>
>> I understand, however, that essentially no one implements directory
>> delegations.
>>
>> Can anyone fill me in on the current thinking about the future of
>> support for directory delegations in the Linux NFS client?
> Maybe somebody else will speak up, but I don't know of any effort to
> implement directory delegations.
>
> What metadata operations specifically are you worried about?  The
> directory delegations that are specified in RFC 5661 are read-only.
> Which might explain some of the lack of interest.
>
> But there may be other steps that we could take to improve matters.
>
> --b.


* Re: directory delegations
  2019-04-02 17:26   ` Bradley C. Kuszmaul
@ 2019-04-02 17:29     ` Bradley C. Kuszmaul
  2019-04-02 19:41     ` J. Bruce Fields
  1 sibling, 0 replies; 20+ messages in thread
From: Bradley C. Kuszmaul @ 2019-04-02 17:29 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

I stand corrected.  Apparently there is no LOOKUP, so maybe read-only 
directory delegations don't help at all with untar.

-Bradley

On 4/2/19 1:26 PM, Bradley C. Kuszmaul wrote:
> My simple model of metadata operations is to untar something like the 
> linux sources.
>
> Each file incurs a LOOKUP, CREATE, SETATTR, and WRITE, each of which 
> is fairly high latency (even the WRITE ends up being done essentially 
> synchronously because tar closes the file after its write(2) call.)
>
> I guess directory delegations might save the cost of LOOKUP.
>
> Is there any hope for getting write delegations?
>
> What other steps might be possible?
>
> -Bradley
>
> On 4/2/19 12:11 PM, bfields@fieldses.org wrote:
>> On Mon, Apr 01, 2019 at 12:21:49PM -0400, Bradley C. Kuszmaul wrote:
>>> Hi, I'm the architect for Oracle's File Storage Service.   FSS is
>>> basically a big scalable NFS server that runs in Oracle's cloud.
>>>
>>> Our metadata operations have higher latency than those of a vanilla
>>> NFS server (e.g., a Linux NFS server serving an XFS filesystem stored
>>> on a block device), and we suspect that directory delegations would
>>> yield a big performance improvement.
>>>
>>> I understand, however, that essentially no one implements directory
>>> delegations.
>>>
>>> Can anyone fill me in on the current thinking about the future of
>>> support for directory delegations in the Linux NFS client?
>> Maybe somebody else will speak up, but I don't know of any effort to
>> implement directory delegations.
>>
>> What metadata operations specifically are you worried about? The
>> directory delegations that are specified in RFC 5661 are read-only.
>> Which might explain some of the lack of interest.
>>
>> But there may be other steps that we could take to improve matters.
>>
>> --b.


* Re: directory delegations
  2019-04-02 17:26   ` Bradley C. Kuszmaul
  2019-04-02 17:29     ` Bradley C. Kuszmaul
@ 2019-04-02 19:41     ` J. Bruce Fields
  2019-04-02 21:51       ` Trond Myklebust
  1 sibling, 1 reply; 20+ messages in thread
From: J. Bruce Fields @ 2019-04-02 19:41 UTC (permalink / raw)
  To: Bradley C. Kuszmaul; +Cc: linux-nfs, Trond Myklebust

On Tue, Apr 02, 2019 at 01:26:19PM -0400, Bradley C. Kuszmaul wrote:
> My simple model of metadata operations is to untar something like
> the linux sources.
> 
> Each file incurs a LOOKUP, CREATE, SETATTR, and WRITE, each of which
> is fairly high latency (even the WRITE ends up being done
> essentially synchronously because tar closes the file after its
> write(2) call.)

An ordinary file write delegation can help with some of that.

> I guess directory delegations might save the cost of LOOKUP.
> 
> Is there any hope for getting write delegations?
> 
> What other steps might be possible?

Trond, wasn't there a draft describing your idea that a server should be
able to grant a write delegation on create and delay the sync?  I can't
find it right now.

--b.


* Re: directory delegations
  2019-04-02 19:41     ` J. Bruce Fields
@ 2019-04-02 21:51       ` Trond Myklebust
  2019-04-02 22:33         ` Trond Myklebust
  2019-04-03  0:28         ` bfields
  0 siblings, 2 replies; 20+ messages in thread
From: Trond Myklebust @ 2019-04-02 21:51 UTC (permalink / raw)
  To: bfields, bradley.kuszmaul; +Cc: linux-nfs

On Tue, 2019-04-02 at 15:41 -0400, J. Bruce Fields wrote:
> On Tue, Apr 02, 2019 at 01:26:19PM -0400, Bradley C. Kuszmaul wrote:
> > My simple model of metadata operations is to untar something like
> > the linux sources.
> > 
> > Each file incurs a LOOKUP, CREATE, SETATTR, and WRITE, each of
> > which
> > is fairly high latency (even the WRITE ends up being done
> > essentially synchronously because tar closes the file after its
> > write(2) call.)
> 
> An ordinary file write delegation can help with some of that.
> 
> > I guess directory delegations might save the cost of LOOKUP.
> > 
> > Is there any hope for getting write delegations?
> > 
> > What other steps might be possible?
> 
> Trond, wasn't there a draft describing your idea that a server should
> be
> able to grant a write delegation on create and delay the sync?  I
> can't
> find it right now.


Do you mean this one? 
https://tools.ietf.org/html/draft-haynes-nfsv4-delstid-00

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: directory delegations
  2019-04-02 21:51       ` Trond Myklebust
@ 2019-04-02 22:33         ` Trond Myklebust
  2019-04-03  0:28         ` bfields
  1 sibling, 0 replies; 20+ messages in thread
From: Trond Myklebust @ 2019-04-02 22:33 UTC (permalink / raw)
  To: bfields, bradley.kuszmaul; +Cc: linux-nfs

On Tue, 2019-04-02 at 21:51 +0000, Trond Myklebust wrote:
> On Tue, 2019-04-02 at 15:41 -0400, J. Bruce Fields wrote:
> > On Tue, Apr 02, 2019 at 01:26:19PM -0400, Bradley C. Kuszmaul
> > wrote:
> > > My simple model of metadata operations is to untar something like
> > > the linux sources.
> > > 
> > > Each file incurs a LOOKUP, CREATE, SETATTR, and WRITE, each of
> > > which
> > > is fairly high latency (even the WRITE ends up being done
> > > essentially synchronously because tar closes the file after its
> > > write(2) call.)
> > 
> > An ordinary file write delegation can help with some of that.
> > 
> > > I guess directory delegations might save the cost of LOOKUP.
> > > 
> > > Is there any hope for getting write delegations?
> > > 
> > > What other steps might be possible?
> > 
> > Trond, wasn't there a draft describing your idea that a server
> > should
> > be
> > able to grant a write delegation on create and delay the sync?  I
> > can't
> > find it right now.
> 
> Do you mean this one? 
> https://tools.ietf.org/html/draft-haynes-nfsv4-delstid-00
> 

BTW: assuming that we do get the above draft through the IETF, then we
should look into delaying that SETATTR as well. Recall that the SETATTR
exists because the server pushes the exclusive create verifier into the
file timestamps. With the attribute delegations proposed in the above
draft, then we can essentially defer that SETATTR to when we return the
delegation, which means you would end up with 3 operations per file to
untar: OPEN, WRITE, DELEGRETURN (with embedded SETATTR op). Only the
first operation would need to be synchronous...

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: directory delegations
  2019-04-02 21:51       ` Trond Myklebust
  2019-04-02 22:33         ` Trond Myklebust
@ 2019-04-03  0:28         ` bfields
  2019-04-03  2:02           ` Trond Myklebust
  1 sibling, 1 reply; 20+ messages in thread
From: bfields @ 2019-04-03  0:28 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bradley.kuszmaul, linux-nfs

On Tue, Apr 02, 2019 at 09:51:42PM +0000, Trond Myklebust wrote:
> On Tue, 2019-04-02 at 15:41 -0400, J. Bruce Fields wrote:
> > On Tue, Apr 02, 2019 at 01:26:19PM -0400, Bradley C. Kuszmaul wrote:
> > > My simple model of metadata operations is to untar something like
> > > the linux sources.
> > > 
> > > Each file incurs a LOOKUP, CREATE, SETATTR, and WRITE, each of
> > > which
> > > is fairly high latency (even the WRITE ends up being done
> > > essentially synchronously because tar closes the file after its
> > > write(2) call.)
> > 
> > An ordinary file write delegation can help with some of that.
> > 
> > > I guess directory delegations might save the cost of LOOKUP.
> > > 
> > > Is there any hope for getting write delegations?
> > > 
> > > What other steps might be possible?
> > 
> > Trond, wasn't there a draft describing your idea that a server should
> > be
> > able to grant a write delegation on create and delay the sync?  I
> > can't
> > find it right now.
> 
> 
> Do you mean this one? 
> https://tools.ietf.org/html/draft-haynes-nfsv4-delstid-00

Maybe it's too subtle for me.  What's the part that allows delaying sync
on create?

--b.


* Re: directory delegations
  2019-04-03  0:28         ` bfields
@ 2019-04-03  2:02           ` Trond Myklebust
  2019-04-03  2:07             ` bfields
  0 siblings, 1 reply; 20+ messages in thread
From: Trond Myklebust @ 2019-04-03  2:02 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, bradley.kuszmaul

On Tue, 2019-04-02 at 20:28 -0400, bfields@fieldses.org wrote:
> On Tue, Apr 02, 2019 at 09:51:42PM +0000, Trond Myklebust wrote:
> > On Tue, 2019-04-02 at 15:41 -0400, J. Bruce Fields wrote:
> > > On Tue, Apr 02, 2019 at 01:26:19PM -0400, Bradley C. Kuszmaul
> > > wrote:
> > > > My simple model of metadata operations is to untar something
> > > > like
> > > > the linux sources.
> > > > 
> > > > Each file incurs a LOOKUP, CREATE, SETATTR, and WRITE, each of
> > > > which
> > > > is fairly high latency (even the WRITE ends up being done
> > > > essentially synchronously because tar closes the file after its
> > > > write(2) call.)
> > > 
> > > An ordinary file write delegation can help with some of that.
> > > 
> > > > I guess directory delegations might save the cost of LOOKUP.
> > > > 
> > > > Is there any hope for getting write delegations?
> > > > 
> > > > What other steps might be possible?
> > > 
> > > Trond, wasn't there a draft describing your idea that a server
> > > should
> > > be
> > > able to grant a write delegation on create and delay the sync?  I
> > > can't
> > > find it right now.
> > 
> > Do you mean this one? 
> > https://tools.ietf.org/html/draft-haynes-nfsv4-delstid-00
> 
> Maybe it's too subtle for me.  What's the part that allows delaying
> sync
> on create?
> 

The create itself needs to be sync, but the attribute delegations mean
that the client, not the server, is authoritative for the timestamps.
So the client now owns the atime and mtime, and just sets them as part
of the (asynchronous) delegreturn some time after you are done writing.
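
In client-side pseudo-code the idea looks about like this (a sketch
only; the helper at the bottom is invented, this is not the actual
Linux client code):

#include <time.h>

struct deleg_attrs {
        struct timespec atime, mtime;
        int dirty;
};

/* Invented stand-in for the RPC machinery. */
void send_setattr_plus_delegreturn(struct deleg_attrs *d);

static void local_touch(struct deleg_attrs *d)
{
        /* Holding an attribute delegation, timestamp updates are
         * purely local; no RPC. */
        clock_gettime(CLOCK_REALTIME, &d->mtime);
        d->atime = d->mtime;
        d->dirty = 1;
}

static void return_delegation(struct deleg_attrs *d)
{
        /* One asynchronous compound, some time after the writes are
         * done: SETATTR(atime, mtime) + DELEGRETURN. */
        if (d->dirty)
                send_setattr_plus_delegreturn(d);
}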

Were you perhaps thinking about this earlier proposal? 
https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: directory delegations
  2019-04-03  2:02           ` Trond Myklebust
@ 2019-04-03  2:07             ` bfields
  2019-04-03 16:56               ` Bradley C. Kuszmaul
  0 siblings, 1 reply; 20+ messages in thread
From: bfields @ 2019-04-03  2:07 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, bradley.kuszmaul

On Wed, Apr 03, 2019 at 02:02:54AM +0000, Trond Myklebust wrote:
> The create itself needs to be sync, but the attribute delegations mean
> that the client, not the server, is authoritative for the timestamps.
> So the client now owns the atime and mtime, and just sets them as part
> of the (asynchronous) delegreturn some time after you are done writing.
> 
> Were you perhaps thinking about this earlier proposal? 
> https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01

That's it, thanks!

Bradley is concerned about performance of something like untar on a
backend filesystem with particularly high-latency metadata operations,
so something like your unstable file creation proposal (or actual write
delegations) seems like it should help.

--b.


* Re: directory delegations
  2019-04-03  2:07             ` bfields
@ 2019-04-03 16:56               ` Bradley C. Kuszmaul
  2019-04-04  1:05                 ` bfields
  0 siblings, 1 reply; 20+ messages in thread
From: Bradley C. Kuszmaul @ 2019-04-03 16:56 UTC (permalink / raw)
  To: bfields, Trond Myklebust; +Cc: linux-nfs

This proposal does look like it would be helpful.   How does this kind 
of proposal play out in terms of actually seeing the light of day in 
deployed systems?

-Bradley

On 4/2/19 10:07 PM, bfields@fieldses.org wrote:
> On Wed, Apr 03, 2019 at 02:02:54AM +0000, Trond Myklebust wrote:
>> The create itself needs to be sync, but the attribute delegations mean
>> that the client, not the server, is authoritative for the timestamps.
>> So the client now owns the atime and mtime, and just sets them as part
>> of the (asynchronous) delegreturn some time after you are done writing.
>>
>> Were you perhaps thinking about this earlier proposal?
>> https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01
> That's it, thanks!
>
> Bradley is concerned about performance of something like untar on a
> backend filesystem with particularly high-latency metadata operations,
> so something like your unstable file creation proposal (or actual write
> delegations) seems like it should help.
>
> --b.


* Re: directory delegations
  2019-04-03 16:56               ` Bradley C. Kuszmaul
@ 2019-04-04  1:05                 ` bfields
  2019-04-04 15:09                   ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: bfields @ 2019-04-04  1:05 UTC (permalink / raw)
  To: Bradley C. Kuszmaul; +Cc: Trond Myklebust, linux-nfs

On Wed, Apr 03, 2019 at 12:56:24PM -0400, Bradley C. Kuszmaul wrote:
> This proposal does look like it would be helpful.   How does this
> kind of proposal play out in terms of actually seeing the light of
> day in deployed systems?

We need some people to commit to implementing it.

We have 2-3 testing events a year, so ideally we'd agree to show up with
implementations at one of those to test and hash out any issues.

We revise the draft based on any experience or feedback we get.  If
nothing else, it looks like it needs some updates for v4.2.

The on-the-wire protocol change seems small, and my feeling is that if
there's running code then documenting the protocol and getting it
through the IETF process shouldn't be a big deal.

--b.

> On 4/2/19 10:07 PM, bfields@fieldses.org wrote:
> >On Wed, Apr 03, 2019 at 02:02:54AM +0000, Trond Myklebust wrote:
> >>The create itself needs to be sync, but the attribute delegations mean
> >>that the client, not the server, is authoritative for the timestamps.
> >>So the client now owns the atime and mtime, and just sets them as part
> >>of the (asynchronous) delegreturn some time after you are done writing.
> >>
> >>Were you perhaps thinking about this earlier proposal?
> >>https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01
> >That's it, thanks!
> >
> >Bradley is concerned about performance of something like untar on a
> >backend filesystem with particularly high-latency metadata operations,
> >so something like your unstable file creation proposal (or actual write
> >delegations) seems like it should help.
> >
> >--b.


* Re: directory delegations
  2019-04-04  1:05                 ` bfields
@ 2019-04-04 15:09                   ` Jeff Layton
  2019-04-04 15:22                     ` Chuck Lever
  2019-04-04 15:37                     ` bfields
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff Layton @ 2019-04-04 15:09 UTC (permalink / raw)
  To: bfields; +Cc: Bradley C. Kuszmaul, Trond Myklebust, linux-nfs

On Wed, Apr 3, 2019 at 9:06 PM bfields@fieldses.org
<bfields@fieldses.org> wrote:
>
> On Wed, Apr 03, 2019 at 12:56:24PM -0400, Bradley C. Kuszmaul wrote:
> > This proposal does look like it would be helpful.   How does this
> > kind of proposal play out in terms of actually seeing the light of
> > day in deployed systems?
>
> We need some people to commit to implementing it.
>
> We have 2-3 testing events a year, so ideally we'd agree to show up with
> implementations at one of those to test and hash out any issues.
>
> We revise the draft based on any experience or feedback we get.  If
> nothing else, it looks like it needs some updates for v4.2.
>
> The on-the-wire protocol change seems small, and my feeling is that if
> there's running code then documenting the protocol and getting it
> through the IETF process shouldn't be a big deal.
>
> --b.
>
> > On 4/2/19 10:07 PM, bfields@fieldses.org wrote:
> > >On Wed, Apr 03, 2019 at 02:02:54AM +0000, Trond Myklebust wrote:
> > >>The create itself needs to be sync, but the attribute delegations mean
> > >>that the client, not the server, is authoritative for the timestamps.
> > >>So the client now owns the atime and mtime, and just sets them as part
> > >>of the (asynchronous) delegreturn some time after you are done writing.
> > >>
> > >>Were you perhaps thinking about this earlier proposal?
> > >>https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01
> > >That's it, thanks!
> > >
> > >Bradley is concerned about performance of something like untar on a
> > >backend filesystem with particularly high-latency metadata operations,
> > >so something like your unstable file creation proposal (or actual write
> > >delegations) seems like it should help.
> > >
> > >--b.

The serialized create with something like an untar is a
performance-killer though.

FWIW, I'm working on something similar right now for Ceph. If a ceph
client has adequate caps [1] for a directory and the dentry inode,
then we should (in principle) be able to buffer up directory morphing
operations and flush them out to the server asynchronously.

I'm starting with unlink (mostly because it's simpler), and am mainly
just returning early when we do have the right caps -- after issuing
the call but before the reply comes in. We should be able to do the
same for link, rename and create too. Create will require the Ceph MDS
to delegate out a range of inode numbers (and that bit hasn't been
implemented yet).

My thinking with all of this is that the buffering of directory
morphing operations is not as helpful as something like a pagecache
write is, as we aren't that interested in merging operations that
change the same dentry. However, being able to do them asynchronously
should work really well. That should allow us to better parallelize
create/link/unlink/rename on different dentries even when they are
issued serially by a single task.
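
Roughly, the early-return unlink path looks like this (pseudo-C; the
helper names are invented, not the actual ceph client code):

/* Return early from unlink when we hold adequate caps. */
static int async_unlink(struct dir_handle *dir, const char *name)
{
        struct mds_request *req = prepare_unlink(dir, name);

        if (!caps_cover_unlink(dir)) {
                /* Slow path: wait for the MDS reply as before. */
                send_request(req);
                return wait_for_reply(req);
        }

        send_request(req);          /* fire the request off to the MDS */
        track_inflight(dir, req);   /* remember it, so a later fsync on
                                     * the parent can report any
                                     * asynchronous failure */
        return 0;                   /* return to the caller immediately */
}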

RFC5661 doesn't currently provide for writeable directory delegations,
AFAICT, but they could eventually be implemented in a similar way.

[1]: cephfs capabilities (aka caps) are like a delegation for a subset
of inode metadata
--
Jeff Layton <jlayton@poochiereds.net>


* Re: directory delegations
  2019-04-04 15:09                   ` Jeff Layton
@ 2019-04-04 15:22                     ` Chuck Lever
  2019-04-04 15:36                       ` Jeff Layton
  2019-04-04 20:03                       ` Bradley C. Kuszmaul
  2019-04-04 15:37                     ` bfields
  1 sibling, 2 replies; 20+ messages in thread
From: Chuck Lever @ 2019-04-04 15:22 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Bruce Fields, Bradley C. Kuszmaul, Trond Myklebust,
	Linux NFS Mailing List



> On Apr 4, 2019, at 11:09 AM, Jeff Layton <jlayton@poochiereds.net> wrote:
> 
> On Wed, Apr 3, 2019 at 9:06 PM bfields@fieldses.org
> <bfields@fieldses.org> wrote:
>> 
>> On Wed, Apr 03, 2019 at 12:56:24PM -0400, Bradley C. Kuszmaul wrote:
>>> This proposal does look like it would be helpful.   How does this
>>> kind of proposal play out in terms of actually seeing the light of
>>> day in deployed systems?
>> 
>> We need some people to commit to implementing it.
>> 
>> We have 2-3 testing events a year, so ideally we'd agree to show up with
>> implementations at one of those to test and hash out any issues.
>> 
>> We revise the draft based on any experience or feedback we get.  If
>> nothing else, it looks like it needs some updates for v4.2.
>> 
>> The on-the-wire protocol change seems small, and my feeling is that if
>> there's running code then documenting the protocol and getting it
>> through the IETF process shouldn't be a big deal.
>> 
>> --b.
>> 
>>> On 4/2/19 10:07 PM, bfields@fieldses.org wrote:
>>>> On Wed, Apr 03, 2019 at 02:02:54AM +0000, Trond Myklebust wrote:
>>>>> The create itself needs to be sync, but the attribute delegations mean
>>>>> that the client, not the server, is authoritative for the timestamps.
>>>>> So the client now owns the atime and mtime, and just sets them as part
>>>>> of the (asynchronous) delegreturn some time after you are done writing.
>>>>> 
>>>>> Were you perhaps thinking about this earlier proposal?
>>>>> https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01
>>>> That's it, thanks!
>>>> 
>>>> Bradley is concerned about performance of something like untar on a
>>>> backend filesystem with particularly high-latency metadata operations,
>>>> so something like your unstable file creation proposal (or actual write
>>>> delegations) seems like it should help.
>>>> 
>>>> --b.
> 
> The serialized create with something like an untar is a
> performance-killer though.
> 
> FWIW, I'm working on something similar right now for Ceph. If a ceph
> client has adequate caps [1] for a directory and the dentry inode,
> then we should (in principle) be able to buffer up directory morphing
> operations and flush them out to the server asynchronously.
> 
> I'm starting with unlink (mostly because it's simpler), and am mainly
> just returning early when we do have the right caps -- after issuing
> the call but before the reply comes in. We should be able to do the
> same for link, rename and create too. Create will require the Ceph MDS
> to delegate out a range of inode numbers (and that bit hasn't been
> implemented yet).
> 
> My thinking with all of this is that the buffering of directory
> morphing operations is not as helpful as something like a pagecache
> write is, as we aren't that interested in merging operations that
> change the same dentry. However, being able to do them asynchronously
> should work really well. That should allow us to better parallelize
> create/link/unlink/rename on different dentries even when they are
> issued serially by a single task.

What happens if an asynchronous directory change fails (e.g. ENOSPC)?


> RFC5661 doesn't currently provide for writeable directory delegations,
> AFAICT, but they could eventually be implemented in a similar way.
> 
> [1]: cephfs capabilities (aka caps) are like a delegation for a subset
> of inode metadata
> --
> Jeff Layton <jlayton@poochiereds.net>

--
Chuck Lever





* Re: directory delegations
  2019-04-04 15:22                     ` Chuck Lever
@ 2019-04-04 15:36                       ` Jeff Layton
  2019-04-04 20:03                       ` Bradley C. Kuszmaul
  1 sibling, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2019-04-04 15:36 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Bruce Fields, Bradley C. Kuszmaul, Trond Myklebust,
	Linux NFS Mailing List

On Thu, Apr 4, 2019 at 11:22 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>
> > On Apr 4, 2019, at 11:09 AM, Jeff Layton <jlayton@poochiereds.net> wrote:
> >
> > On Wed, Apr 3, 2019 at 9:06 PM bfields@fieldses.org
> > <bfields@fieldses.org> wrote:
> >>
> >> On Wed, Apr 03, 2019 at 12:56:24PM -0400, Bradley C. Kuszmaul wrote:
> >>> This proposal does look like it would be helpful.   How does this
> >>> kind of proposal play out in terms of actually seeing the light of
> >>> day in deployed systems?
> >>
> >> We need some people to commit to implementing it.
> >>
> >> We have 2-3 testing events a year, so ideally we'd agree to show up with
> >> implementations at one of those to test and hash out any issues.
> >>
> >> We revise the draft based on any experience or feedback we get.  If
> >> nothing else, it looks like it needs some updates for v4.2.
> >>
> >> The on-the-wire protocol change seems small, and my feeling is that if
> >> there's running code then documenting the protocol and getting it
> >> through the IETF process shouldn't be a big deal.
> >>
> >> --b.
> >>
> >>> On 4/2/19 10:07 PM, bfields@fieldses.org wrote:
> >>>> On Wed, Apr 03, 2019 at 02:02:54AM +0000, Trond Myklebust wrote:
> >>>>> The create itself needs to be sync, but the attribute delegations mean
> >>>>> that the client, not the server, is authoritative for the timestamps.
> >>>>> So the client now owns the atime and mtime, and just sets them as part
> >>>>> of the (asynchronous) delegreturn some time after you are done writing.
> >>>>>
> >>>>> Were you perhaps thinking about this earlier proposal?
> >>>>> https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01
> >>>> That's it, thanks!
> >>>>
> >>>> Bradley is concerned about performance of something like untar on a
> >>>> backend filesystem with particularly high-latency metadata operations,
> >>>> so something like your unstable file creation proposal (or actual write
> >>>> delegations) seems like it should help.
> >>>>
> >>>> --b.
> >
> > The serialized create with something like an untar is a
> > performance-killer though.
> >
> > FWIW, I'm working on something similar right now for Ceph. If a ceph
> > client has adequate caps [1] for a directory and the dentry inode,
> > then we should (in principle) be able to buffer up directory morphing
> > operations and flush them out to the server asynchronously.
> >
> > I'm starting with unlink (mostly because it's simpler), and am mainly
> > just returning early when we do have the right caps -- after issuing
> > the call but before the reply comes in. We should be able to do the
> > same for link, rename and create too. Create will require the Ceph MDS
> > to delegate out a range of inode numbers (and that bit hasn't been
> > implemented yet).
> >
> > My thinking with all of this is that the buffering of directory
> > morphing operations is not as helpful as something like a pagecache
> > write is, as we aren't that interested in merging operations that
> > change the same dentry. However, being able to do them asynchronously
> > should work really well. That should allow us to better parallelize
> > create/link/unlink/rename on different dentries even when they are
> > issued serially by a single task.
>
> What happens if an asynchronous directory change fails (e.g. ENOSPC)?
>

We have a well-established expectation with most local filesystems
that directory changes are not necessarily persisted until you issue
fsync on the parent(s). My thinking is that we'd report these sorts of
errors to that fsync.
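
From the application's side the pattern would be the familiar one. A
usage sketch, assuming errors really are funneled to fsync on the
parent:

#include <fcntl.h>
#include <unistd.h>

/* Remove a name and learn whether the buffered unlink succeeded. */
static int remove_durably(const char *dirpath, const char *name)
{
        int ret, dfd = open(dirpath, O_RDONLY | O_DIRECTORY);

        if (dfd < 0)
                return -1;
        if (unlinkat(dfd, name, 0) < 0) {   /* may return before the
                                             * server has replied */
                close(dfd);
                return -1;
        }
        ret = fsync(dfd);   /* asynchronous failures surface here */
        close(dfd);
        return ret;
}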

All of this is _really_ experimental so far, so I don't claim to have
worked out all of the gory details as of yet. :)
-- 
Jeff Layton <jlayton@poochiereds.net>


* Re: directory delegations
  2019-04-04 15:09                   ` Jeff Layton
  2019-04-04 15:22                     ` Chuck Lever
@ 2019-04-04 15:37                     ` bfields
  2019-04-04 15:44                       ` Jeff Layton
  1 sibling, 1 reply; 20+ messages in thread
From: bfields @ 2019-04-04 15:37 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Bradley C. Kuszmaul, Trond Myklebust, linux-nfs

On Thu, Apr 04, 2019 at 11:09:47AM -0400, Jeff Layton wrote:
> On Wed, Apr 3, 2019 at 9:06 PM bfields@fieldses.org <bfields@fieldses.org> wrote:
> The serialized create with something like an untar is a
> performance-killer though.

Yes.  And Trond's proposal only allows hiding the server-to-disk round
trip time, not the client-to-server round trip time.  On the other hand,
it seems a lot easier than write delegations.

> FWIW, I'm working on something similar right now for Ceph. If a ceph
> client has adequate caps [1] for a directory and the dentry inode,
> then we should (in principle) be able to buffer up directory morphing
> operations and flush them out to the server asynchronously.
> 
> I'm starting with unlink (mostly because it's simpler), and am mainly
> just returning early when we do have the right caps -- after issuing
> the call but before the reply comes in. We should be able to do the
> same for link, rename and create too. Create will require the Ceph MDS
> to delegate out a range of inode numbers (and that bit hasn't been
> implemented yet).

Is there some reason it's impossible for the client to return from
create before it has an inode number?

> My thinking with all of this is that the buffering of directory
> morphing operations is not as helpful as something like a pagecache
> write is, as we aren't that interested in merging operations that
> change the same dentry. However, being able to do them asynchronously
> should work really well. That should allow us to better parallelize
> create/link/unlink/rename on different dentries even when they are
> issued serially by a single task.
> 
> RFC5661 doesn't currently provide for writeable directory delegations,
> AFAICT, but they could eventually be implemented in a similar way.

People also worried about delegating create in the face of differing
rules about case insensitivity and about which characters are legal in
filenames.  But I really think there should be some way to manage that.

--b.


* Re: directory delegations
  2019-04-04 15:37                     ` bfields
@ 2019-04-04 15:44                       ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2019-04-04 15:44 UTC (permalink / raw)
  To: bfields; +Cc: Bradley C. Kuszmaul, Trond Myklebust, linux-nfs

On Thu, Apr 4, 2019 at 11:37 AM bfields@fieldses.org
<bfields@fieldses.org> wrote:
>
> On Thu, Apr 04, 2019 at 11:09:47AM -0400, Jeff Layton wrote:
> > On Wed, Apr 3, 2019 at 9:06 PM bfields@fieldses.org <bfields@fieldses.org> wrote:
> > The serialized create with something like an untar is a
> > performance-killer though.
>
> Yes.  And Trond's proposal only allows hiding the server-to-disk round
> trip time, not the client-to-server round trip time.  On the other hand,
> it seems a lot easier than write delegations.
>
> > FWIW, I'm working on something similar right now for Ceph. If a ceph
> > client has adequate caps [1] for a directory and the dentry inode,
> > then we should (in principle) be able to buffer up directory morphing
> > operations and flush them out to the server asynchronously.
> >
> > I'm starting with unlink (mostly because it's simpler), and am mainly
> > just returning early when we do have the right caps -- after issuing
> > the call but before the reply comes in. We should be able to do the
> > same for link, rename and create too. Create will require the Ceph MDS
> > to delegate out a range of inode numbers (and that bit hasn't been
> > implemented yet).
>
> Is there some reason it's impossible for the client to return from
> create before it has an inode number?
>

Not necessarily, but you can't handle a stat() at that point until the
create returns. Also for cephfs, we can't issue data writes to the
OSDs until we know the inode number (the underlying objects are named
with the format "inode_number.chunk_index"). Cephfs works a little
like pNFS, in that we do reads and writes directly to/from the OSDs,
but the data is placed algorithmically so we know what the layout will
be if we know the inode number.
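
For instance, the client can compute object names as soon as it knows
the inode number; something like this (the exact format string here is
my shorthand, not necessarily the real one):

#include <stdio.h>
#include <stdint.h>

/* Name of the object backing one chunk of a file's data. */
static void data_object_name(char *buf, size_t len,
                             uint64_t ino, uint64_t chunk)
{
        snprintf(buf, len, "%llx.%08llx",
                 (unsigned long long)ino, (unsigned long long)chunk);
}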

> > My thinking with all of this is that the buffering of directory
> > morphing operations is not as helpful as something like a pagecache
> > write is, as we aren't that interested in merging operations that
> > change the same dentry. However, being able to do them asynchronously
> > should work really well. That should allow us to better parallelize
> > create/link/unlink/rename on different dentries even when they are
> > issued serially by a single task.
> >
> > RFC5661 doesn't currently provide for writeable directory delegations,
> > AFAICT, but they could eventually be implemented in a similar way.
>
> People also worried about delegating create in the face of differing
> rules about case insensitivity and about which characters are legal in
> filenames.  But I really think there should be some way to manage that.
>

Oh, good god. I hadn't even considered that.

I tend to think at that point, we could just return EINVAL on a
subsequent fsync of the dir or something, and let the program sort out
what went wrong.
-- 
Jeff Layton <jlayton@poochiereds.net>


* Re: directory delegations
  2019-04-04 15:22                     ` Chuck Lever
  2019-04-04 15:36                       ` Jeff Layton
@ 2019-04-04 20:03                       ` Bradley C. Kuszmaul
  2019-04-04 20:41                         ` Bruce Fields
  1 sibling, 1 reply; 20+ messages in thread
From: Bradley C. Kuszmaul @ 2019-04-04 20:03 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton
  Cc: Bruce Fields, Trond Myklebust, Linux NFS Mailing List

It would also be possible with our file system to preallocate inode 
numbers (inumbers).

This isn't necessarily directly related to NFS, but one could imagine 
further extending NFS to allow a CREATE to happen entirely on the client 
by letting the client maintain a cache of preallocated inumbers.

Just for the fun of it, I'll tell you a little bit more about how we 
preallocate inumbers.

For Oracle's File Storage Service (FSS), inumbers are cheap to allocate, 
and it's not a big deal if a few of them end up unused. Unused inode 
numbers don't use up any space. I would imagine that most B-tree-based 
file systems are like this.   In contrast, in an ext-style file system, 
unused inumbers imply unused storage.

Furthermore, FSS never reuses inumbers when files are deleted. It just 
keeps allocating new ones.

There's a tradeoff between preallocating lots of inumbers to get better 
performance and potentially wasting the inumbers if the client were to 
crash just after getting a batch.   If you only ask for one at a time, 
you don't get much performance, but if you ask for 1000 at a time, 
there's a chance that the client could start, ask for 1000 and then 
immediately crash, and then repeat the cycle, quickly using up many 
inumbers.  Here's a 2-competitive algorithm to solve this problem (by 
"2-competitive" I mean that it's guaranteed to waste at most half of the 
inode numbers):

  * A client that has successfully created K files without crashing is 
allowed, when its preallocated cache of inumbers goes empty, to ask for 
another K inumbers.

The worst-case lossage occurs if the client crashes just after getting K 
inumbers, and those inumbers go to waste.   But we know that the client 
successfully created K files, so we are wasting at most half the inumbers.

For a long-running client, each time it asks for another batch of 
inumbers, it doubles the size of the request.  For the first file 
created, it does it the old-fashioned way.   For the second file, it 
preallocates a single inumber.   For the third file, it preallocates 2 
inumbers.   On the fifth file creation, it preallocates 4 inumbers.  And 
so forth.
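
In code, the allocator is about this simple (a sketch; the RPC is
hypothetical, and the first file's request of size 1 stands in for the
"old-fashioned way" above):

#include <stdint.h>

/* Hypothetical RPC: reserve n fresh inumbers on the server and
 * return the first of the range. */
uint64_t server_prealloc_inumbers(uint64_t n);

struct inum_cache {
        uint64_t next;      /* next unused preallocated inumber */
        uint64_t avail;     /* how many remain in the cache */
        uint64_t created;   /* K: files successfully created so far */
};

static uint64_t alloc_inum(struct inum_cache *c)
{
        if (c->avail == 0) {
                /* Never ask for more than we have already used, so a
                 * crash wastes at most half the inumbers handed out. */
                uint64_t k = c->created ? c->created : 1;
                c->next = server_prealloc_inumbers(k);
                c->avail = k;
        }
        c->avail--;
        c->created++;
        return c->next++;
}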

One obstacle to getting FSS to use any of these ideas is that we 
currently support only NFSv3.   We need to get an NFSv4 server going, 
and then we'll be interested in doing the server work to speed up these 
kinds of metadata workloads.

-Bradley

On 4/4/19 11:22 AM, Chuck Lever wrote:
>
>> On Apr 4, 2019, at 11:09 AM, Jeff Layton <jlayton@poochiereds.net> wrote:
>>
>> On Wed, Apr 3, 2019 at 9:06 PM bfields@fieldses.org
>> <bfields@fieldses.org> wrote:
>>> On Wed, Apr 03, 2019 at 12:56:24PM -0400, Bradley C. Kuszmaul wrote:
>>>> This proposal does look like it would be helpful.   How does this
>>>> kind of proposal play out in terms of actually seeing the light of
>>>> day in deployed systems?
>>> We need some people to commit to implementing it.
>>>
>>> We have 2-3 testing events a year, so ideally we'd agree to show up with
>>> implementations at one of those to test and hash out any issues.
>>>
>>> We revise the draft based on any experience or feedback we get.  If
>>> nothing else, it looks like it needs some updates for v4.2.
>>>
>>> The on-the-wire protocol change seems small, and my feeling is that if
>>> there's running code then documenting the protocol and getting it
>>> through the IETF process shouldn't be a big deal.
>>>
>>> --b.
>>>
>>>> On 4/2/19 10:07 PM, bfields@fieldses.org wrote:
>>>>> On Wed, Apr 03, 2019 at 02:02:54AM +0000, Trond Myklebust wrote:
>>>>>> The create itself needs to be sync, but the attribute delegations mean
>>>>>> that the client, not the server, is authoritative for the timestamps.
>>>>>> So the client now owns the atime and mtime, and just sets them as part
>>>>>> of the (asynchronous) delegreturn some time after you are done writing.
>>>>>>
>>>>>> Were you perhaps thinking about this earlier proposal?
>>>>>> https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01
>>>>> That's it, thanks!
>>>>>
>>>>> Bradley is concerned about performance of something like untar on a
>>>>> backend filesystem with particularly high-latency metadata operations,
>>>>> so something like your unstable file creation proposal (or actual write
>>>>> delegations) seems like it should help.
>>>>>
>>>>> --b.
>> The serialized create with something like an untar is a
>> performance-killer though.
>>
>> FWIW, I'm working on something similar right now for Ceph. If a ceph
>> client has adequate caps [1] for a directory and the dentry inode,
>> then we should (in principle) be able to buffer up directory morphing
>> operations and flush them out to the server asynchronously.
>>
>> I'm starting with unlink (mostly because it's simpler), and am mainly
>> just returning early when we do have the right caps -- after issuing
>> the call but before the reply comes in. We should be able to do the
>> same for link, rename and create too. Create will require the Ceph MDS
>> to delegate out a range of inode numbers (and that bit hasn't been
>> implemented yet).
>>
>> My thinking with all of this is that the buffering of directory
>> morphing operations is not as helpful as something like a pagecache
>> write is, as we aren't that interested in merging operations that
>> change the same dentry. However, being able to do them asynchronously
>> should work really well. That should allow us to better parallelize
>> create/link/unlink/rename on different dentries even when they are
>> issued serially by a single task.
> What happens if an asynchronous directory change fails (e.g. ENOSPC)?
>
>
>> RFC5661 doesn't currently provide for writeable directory delegations,
>> AFAICT, but they could eventually be implemented in a similar way.
>>
>> [1]: cephfs capabilities (aka caps) are like a delegation for a subset
>> of inode metadata
>> --
>> Jeff Layton <jlayton@poochiereds.net>
> --
> Chuck Lever
>
>
>


* Re: directory delegations
  2019-04-04 20:03                       ` Bradley C. Kuszmaul
@ 2019-04-04 20:41                         ` Bruce Fields
  2019-04-04 20:45                           ` Bradley C. Kuszmaul
  0 siblings, 1 reply; 20+ messages in thread
From: Bruce Fields @ 2019-04-04 20:41 UTC (permalink / raw)
  To: Bradley C. Kuszmaul
  Cc: Chuck Lever, Jeff Layton, Trond Myklebust, Linux NFS Mailing List

On Thu, Apr 04, 2019 at 04:03:42PM -0400, Bradley C. Kuszmaul wrote:
> It would also be possible with our file system to preallocate inode
> numbers (inumbers).
> 
> This isn't necessarily directly related to NFS, but one could
> imagine further extending NFS to allow a CREATE to happen entirely
> on the client by letting the client maintain a cache of preallocated
> inumbers.

So, we'd need new protocol to allow clients to request inode numbers,
and I guess we'd also need vfs interfaces to allow our server to request
them from various filesystems.  Naively, it sounds doable.  From what
Jeff says, this isn't a requirement for correctness, it's an
optimization for a case when the client creates and then immediately
does a stat (or readdir?).  Is that important?
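
Purely as a strawman, the vfs side might be no more than a new
super_operations hook; nothing like this exists today:

/* Strawman: let nfsd ask the filesystem to reserve a batch of
 * inode numbers on a client's behalf. */
struct inum_range {
        u64 first;
        u64 count;
};

int (*prealloc_inums)(struct super_block *sb, u64 want,
                      struct inum_range *range);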

--b.

> 
> Just for the fun of it, I'll tell you a little bit more about how we
> preallocate inumbers.
> 
> For Oracle's File Storage Service (FSS), inumbers are cheap to
> allocate, and it's not a big deal if a few of them end up unused.
> Unused inode numbers don't use up any space. I would imagine that
> most B-tree-based file systems are like this.   In contrast, in an
> ext-style file system, unused inumbers imply unused storage.
> 
> Furthermore, FSS never reuses inumbers when files are deleted. It
> just keeps allocating new ones.
> 
> There's a tradeoff between preallocating lots of inumbers to get
> better performance and potentially wasting the inumbers if the
> client were to crash just after getting a batch.   If you only ask
> for one at a time, you don't get much performance, but if you ask
> for 1000 at a time, there's a chance that the client could start,
> ask for 1000 and then immediately crash, and then repeat the cycle,
> quickly using up many inumbers.  Here's a 2-competitive algorithm to
> solve this problem (by "2-competitive" I mean that it's guaranteed
> to waste at most half of the inode numbers):
> 
>  * A client that has successfully created K files without crashing
> is allowed, when its preallocated cache of inumbers goes empty, to
> ask for another K inumbers.
> 
> The worst-case lossage occurs if the client crashes just after
> getting K inumbers, and those inumbers go to waste.   But we know
> that the client successfully created K files, so we are wasting at
> most half the inumbers.
> 
> For a long-running client, each time it asks for another batch of
> inumbers, it doubles the size of the request.  For the first file
> created, it does it the old-fashioned way.   For the second file, it
> preallocates a single inumber.   For the third file, it preallocates
> 2 inumbers.   On the fifth file creation, it preallocates 4
> inumbers.  And so forth.
> 
> One obstacle to getting FSS to use any of these ideas is that we
> currently support only NFSv3.   We need to get an NFSv4 server
> going, and then we'll be interested in doing the server work to
> speed up these kinds of metadata workloads.
> 
> -Bradley
> 
> On 4/4/19 11:22 AM, Chuck Lever wrote:
> >
> >>On Apr 4, 2019, at 11:09 AM, Jeff Layton <jlayton@poochiereds.net> wrote:
> >>
> >>On Wed, Apr 3, 2019 at 9:06 PM bfields@fieldses.org
> >><bfields@fieldses.org> wrote:
> >>>On Wed, Apr 03, 2019 at 12:56:24PM -0400, Bradley C. Kuszmaul wrote:
> >>>>This proposal does look like it would be helpful.   How does this
> >>>>kind of proposal play out in terms of actually seeing the light of
> >>>>day in deployed systems?
> >>>We need some people to commit to implementing it.
> >>>
> >>>We have 2-3 testing events a year, so ideally we'd agree to show up with
> >>>implementations at one of those to test and hash out any issues.
> >>>
> >>>We revise the draft based on any experience or feedback we get.  If
> >>>nothing else, it looks like it needs some updates for v4.2.
> >>>
> >>>The on-the-wire protocol change seems small, and my feeling is that if
> >>>there's running code then documenting the protocol and getting it
> >>>through the IETF process shouldn't be a big deal.
> >>>
> >>>--b.
> >>>
> >>>>On 4/2/19 10:07 PM, bfields@fieldses.org wrote:
> >>>>>On Wed, Apr 03, 2019 at 02:02:54AM +0000, Trond Myklebust wrote:
> >>>>>>The create itself needs to be sync, but the attribute delegations mean
> >>>>>>that the client, not the server, is authoritative for the timestamps.
> >>>>>>So the client now owns the atime and mtime, and just sets them as part
> >>>>>>of the (asynchronous) delegreturn some time after you are done writing.
> >>>>>>
> >>>>>>Were you perhaps thinking about this earlier proposal?
> >>>>>>https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01
> >>>>>That's it, thanks!
> >>>>>
> >>>>>Bradley is concerned about performance of something like untar on a
> >>>>>backend filesystem with particularly high-latency metadata operations,
> >>>>>so something like your unstable file creation proposal (or actual write
> >>>>>delegations) seems like it should help.
> >>>>>
> >>>>>--b.
> >>The serialized create with something like an untar is a
> >>performance-killer though.
> >>
> >>FWIW, I'm working on something similar right now for Ceph. If a ceph
> >>client has adequate caps [1] for a directory and the dentry inode,
> >>then we should (in principle) be able to buffer up directory morphing
> >>operations and flush them out to the server asynchronously.
> >>
> >>I'm starting with unlink (mostly because it's simpler), and am mainly
> >>just returning early when we do have the right caps -- after issuing
> >>the call but before the reply comes in. We should be able to do the
> >>same for link, rename and create too. Create will require the Ceph MDS
> >>to delegate out a range of inode numbers (and that bit hasn't been
> >>implemented yet).
> >>
> >>My thinking with all of this is that the buffering of directory
> >>morphing operations is not as helpful as something like a pagecache
> >>write is, as we aren't that interested in merging operations that
> >>change the same dentry. However, being able to do them asynchronously
> >>should work really well. That should allow us to better parallelize
> >>create/link/unlink/rename on different dentries even when they are
> >>issued serially by a single task.
> >What happens if an asynchronous directory change fails (e.g. ENOSPC)?
> >
> >
> >>RFC5661 doesn't currently provide for writeable directory delegations,
> >>AFAICT, but they could eventually be implemented in a similar way.
> >>
> >>[1]: cephfs capabilities (aka caps) are like a delegation for a subset
> >>of inode metadata
> >>--
> >>Jeff Layton <jlayton@poochiereds.net>
> >--
> >Chuck Lever
> >
> >
> >


* Re: directory delegations
  2019-04-04 20:41                         ` Bruce Fields
@ 2019-04-04 20:45                           ` Bradley C. Kuszmaul
  0 siblings, 0 replies; 20+ messages in thread
From: Bradley C. Kuszmaul @ 2019-04-04 20:45 UTC (permalink / raw)
  To: Bruce Fields
  Cc: Chuck Lever, Jeff Layton, Trond Myklebust, Linux NFS Mailing List

Yes, maybe it's not important.

-Bradley

On 4/4/19 4:41 PM, Bruce Fields wrote:
> On Thu, Apr 04, 2019 at 04:03:42PM -0400, Bradley C. Kuszmaul wrote:
>> It would also be possible with our file system to preallocate inode
>> numbers (inumbers).
>>
>> This isn't necessarily directly related to NFS, but one could
>> imagine further extending NFS to allow a CREATE to happen entirely
>> on the client by letting the client maintain a cache of preallocated
>> inumbers.
> So, we'd need new protocol to allow clients to request inode numbers,
> and I guess we'd also need vfs interfaces to allow our server to request
> them from various filesystems.  Naively, it sounds doable.  From what
> Jeff says, this isn't a requirement for correctness, it's an
> optimization for a case when the client creates and then immediately
> does a stat (or readdir?).  Is that important?
>
> --b.
>
>> Just for the fun of it, I'll tell you a little bit more about how we
>> preallocate inumbers.
>>
>> For Oracle's File Storage Service (FSS), inumbers are cheap to
>> allocate, and it's not a big deal if a few of them end up unused.
>> Unused inode numbers don't use up any space. I would imagine that
>> most B-tree-based file systems are like this.   In contrast, in an
>> ext-style file system, unused inumbers imply unused storage.
>>
>> Furthermore, FSS never reuses inumbers when files are deleted. It
>> just keeps allocating new ones.
>>
>> There's a tradeoff between preallocating lots of inumbers to get
>> better performance and potentially wasting the inumbers if the
>> client were to crash just after getting a batch.   If you only ask
>> for one at a time, you don't get much performance, but if you ask
>> for 1000 at a time, there's a chance that the client could start,
>> ask for 1000 and then immediately crash, and then repeat the cycle,
>> quickly using up many inumbers.  Here's a 2-competitive algorithm to
>> solve this problem (by "2-competitive" I mean that it's guaranteed
>> to waste at most half of the inode numbers):
>>
>>   * A client that has successfully created K files without crashing
>> is allowed, when its preallocated cache of inumbers goes empty, to
>> ask for another K inumbers.
>>
>> The worst-case lossage occurs if the client crashes just after
>> getting K inumbers, and those inumbers go to waste.   But we know
>> that the client successfully created K files, so we are wasting at
>> most half the inumbers.
>>
>> For a long-running client, each time it asks for another batch of
>> inumbers, it doubles the size of the request.  For the first file
>> created, it does it the old-fashioned way.   For the second file, it
>> preallocates a single inumber.   For the third file, it preallocates
>> 2 inumbers.   On the fifth file creation, it preallocates 4
>> inumbers.  And so forth.
>>
>> One obstacle to getting FSS to use any of these ideas is that we
>> currently support only NFSv3.   We need to get an NFSv4 server
>> going, and then we'll be interested in doing the server work to
>> speed up these kinds of metadata workloads.
>>
>> -Bradley
>>
>> On 4/4/19 11:22 AM, Chuck Lever wrote:
>>>> On Apr 4, 2019, at 11:09 AM, Jeff Layton <jlayton@poochiereds.net> wrote:
>>>>
>>>> On Wed, Apr 3, 2019 at 9:06 PM bfields@fieldses.org
>>>> <bfields@fieldses.org> wrote:
>>>>> On Wed, Apr 03, 2019 at 12:56:24PM -0400, Bradley C. Kuszmaul wrote:
>>>>>> This proposal does look like it would be helpful.   How does this
>>>>>> kind of proposal play out in terms of actually seeing the light of
>>>>>> day in deployed systems?
>>>>> We need some people to commit to implementing it.
>>>>>
>>>>> We have 2-3 testing events a year, so ideally we'd agree to show up with
>>>>> implementations at one of those to test and hash out any issues.
>>>>>
>>>>> We revise the draft based on any experience or feedback we get.  If
>>>>> nothing else, it looks like it needs some updates for v4.2.
>>>>>
>>>>> The on-the-wire protocol change seems small, and my feeling is that if
>>>>> there's running code then documenting the protocol and getting it
>>>>> through the IETF process shouldn't be a big deal.
>>>>>
>>>>> --b.
>>>>>
>>>>>> On 4/2/19 10:07 PM, bfields@fieldses.org wrote:
>>>>>>> On Wed, Apr 03, 2019 at 02:02:54AM +0000, Trond Myklebust wrote:
>>>>>>>> The create itself needs to be sync, but the attribute delegations mean
>>>>>>>> that the client, not the server, is authoritative for the timestamps.
>>>>>>>> So the client now owns the atime and mtime, and just sets them as part
>>>>>>>> of the (asynchronous) delegreturn some time after you are done writing.
>>>>>>>>
>>>>>>>> Were you perhaps thinking about this earlier proposal?
>>>>>>>> https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01
>>>>>>> That's it, thanks!
>>>>>>>
>>>>>>> Bradley is concerned about performance of something like untar on a
>>>>>>> backend filesystem with particularly high-latency metadata operations,
>>>>>>> so something like your unstable file creation proposal (or actual write
>>>>>>> delegations) seems like it should help.
>>>>>>>
>>>>>>> --b.
>>>> The serialized create with something like an untar is a
>>>> performance-killer though.
>>>>
>>>> FWIW, I'm working on something similar right now for Ceph. If a ceph
>>>> client has adequate caps [1] for a directory and the dentry inode,
>>>> then we should (in principle) be able to buffer up directory morphing
>>>> operations and flush them out to the server asynchronously.
>>>>
>>>> I'm starting with unlink (mostly because it's simpler), and am mainly
>>>> just returning early when we do have the right caps -- after issuing
>>>> the call but before the reply comes in. We should be able to do the
>>>> same for link, rename and create too. Create will require the Ceph MDS
>>>> to delegate out a range of inode numbers (and that bit hasn't been
>>>> implemented yet).
>>>>
>>>> My thinking with all of this is that the buffering of directory
>>>> morphing operations is not as helpful as something like a pagecache
>>>> write is, as we aren't that interested in merging operations that
>>>> change the same dentry. However, being able to do them asynchronously
>>>> should work really well. That should allow us to better parallelize
>>>> create/link/unlink/rename on different dentries even when they are
>>>> issued serially by a single task.
>>> What happens if an asynchronous directory change fails (e.g. ENOSPC)?
>>>
>>>
>>>> RFC5661 doesn't currently provide for writeable directory delegations,
>>>> AFAICT, but they could eventually be implemented in a similar way.
>>>>
>>>> [1]: cephfs capabilities (aka caps) are like a delegation for a subset
>>>> of inode metadata
>>>> --
>>>> Jeff Layton <jlayton@poochiereds.net>
>>> --
>>> Chuck Lever
>>>
>>>
>>>

