* fixing redundant network opens on Linux file creation
@ 2003-01-06 17:25 Steven French
  2003-01-06 18:14 ` Richard Sharpe
  2003-01-06 22:18 ` Marcos Dione
  0 siblings, 2 replies; 19+ messages in thread
From: Steven French @ 2003-01-06 17:25 UTC
  To: samba-technical, linux-fsdevel





On Linux, the creat() system call results in a call to create (via
vfs_create) and then a later call to open (via dentry_open).  For the cifs
vfs, both of these end up doing a network open of the file from the
perspective of the CIFS protocol, which degrades performance (every creat
does one more open and close than is ideal).  In the cifs protocol, file
creation is handled as a flag on the open request, so create has the side
effect of opening the file.  Unfortunately, since mknod can call vfs_create
(presumably without calling open immediately afterwards), it seems that a
vfs can't assume that every create will be immediately followed by a file
open (server file handle leaks would be possible if it made that
assumption).  smbfs in effect ignores the subsequent open, and the nfs vfs
doesn't have this problem because it doesn't send a remote open request in
nfs_open (v2 and v3 nfs don't really need an open file handle for
file-based operations the way smb/cifs does).  To improve creat()
performance for cifs without changing namei.c itself, there seem to be
only two obvious alternatives:

1) Have the cifs vfs ignore subsequent opens of the same file (never have
more than one open per inode, a la smbfs).  This has the disadvantage of
making the open flags (and pid) incorrect for subsequent opens, would cause
server problems with handling byte range locks, and could cause problems
for other clients accessing a file that was just created via mknod and
therefore should no longer be considered open.

2) Have the cifs vfs do "lazy close" of files, perhaps using the original
"opbatch" distributed caching mechanism in the smb/cifs protocol (which
cached opens for optimal performance when running batch files on network
drives) for distributed cache management, so that the client will not
cause sharing violations if other clients try to access the same file.

I prefer the latter and am now working on proving that it works.   Any
other approaches?
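
For illustration only, here is a toy Python model of the extra open and
close described above, and of how a "lazy close" could avoid it.  It is a
sketch under stated assumptions, not the cifs code: ToyServer, NaiveClient,
LazyCloseClient, flush() and creat() are all made-up names, the model
ignores open flags and pids, and a real client would need the protocol's
caching mechanism to decide when cached handles must be given up.

```python
class ToyServer:
    """Counts protocol-level opens and closes (stand-in for a CIFS server)."""

    def __init__(self):
        self.opens = 0
        self.closes = 0
        self.next_handle = 1
        self.live = set()          # handles currently open on the server

    def open(self, path, create=False):
        self.opens += 1
        handle = self.next_handle
        self.next_handle += 1
        self.live.add(handle)
        return handle

    def close(self, handle):
        self.closes += 1
        self.live.remove(handle)


class NaiveClient:
    """->create opens and closes on the wire; ->open then opens again."""

    def __init__(self, server):
        self.server = server

    def vfs_create(self, path):
        handle = self.server.open(path, create=True)
        self.server.close(handle)

    def vfs_open(self, path):
        return self.server.open(path)


class LazyCloseClient:
    """->create leaves its handle cached; a following ->open reuses it."""

    def __init__(self, server):
        self.server = server
        self.cached = {}           # path -> handle left open by create

    def vfs_create(self, path):
        self.cached[path] = self.server.open(path, create=True)

    def vfs_open(self, path):
        handle = self.cached.pop(path, None)
        if handle is None:
            handle = self.server.open(path)
        return handle

    def flush(self):
        """Close handles that no open ever claimed (the mknod case)."""
        for handle in self.cached.values():
            self.server.close(handle)
        self.cached.clear()


def creat(client, path):
    """Mimic the VFS sequence for creat(): ->create, then ->open."""
    client.vfs_create(path)
    return client.vfs_open(path)
```

In this model the naive client costs two opens and one close per creat(),
while the lazy-close client costs one open; a create that is never followed
by an open leaves a handle behind until flush() runs, which is the leak
the mknod case raises.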

Steve French
Senior Software Engineer
Linux Technology Center - IBM Austin
phone: 512-838-2294
email: sfrench@us.ibm.com


* Re: fixing redundant network opens on Linux file creation
  2003-01-06 18:14 ` Richard Sharpe
@ 2003-01-06 17:59   ` Jan Hudec
  2003-01-06 19:42     ` Bryan Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Hudec @ 2003-01-06 17:59 UTC
  To: Richard Sharpe; +Cc: Steven French, samba-technical, linux-fsdevel

On Mon, Jan 06, 2003 at 10:14:10AM -0800, Richard Sharpe wrote:
> On Mon, 6 Jan 2003, Steven French wrote:
> 
> > The creat() system call results (for the Linux kernel) in calls to create
> > (via vfs_create) then later a call to open (via dentry_open) both of which
> > eventually end up (for the cifs vfs) doing a network open of the file from
> > the perspective of the CIFS protocol which degrades performance (because
> > every creat does one additional open & close than ideal).    In the cifs
> > protocol file creation is handled as a flag on the open request so create
> > has a side effect of opening the file.   Unfortunately since mknod can call
> > vfs_create (presumably without immediately afterwards calling open), it
> > seems like a vfs can't assume that all creates are necessarily going to be
> > immediately followed by a file open (server file handle leaks would be
> > possible if such an assumption were made).    smbfs in effect ignores the
> > subsequent open and the nfs vfs doesn't have this problem because it
> > doesn't send a remote open request in nfs_open (since v2 and v3 nfs doesn't
> > really need an open file handle for file based operations like smb/cifs
> > does).  To improve creat() performance for cifs (without changing namei.c
> > itself) it seems like there are only two obvious alternatives:
>
> Isn't creat() a legacy call? I have never used it, and use open(..., 
> O_CREAT,...) instead.
> 
> Isn't this just a cost of using legacy calls? Why complicate things overly 
> for a call that might not be used all that much? 

I am not sure what "legacy call" means here, but I am pretty sure that
creat and open(... O_CREAT) end up calling exactly the same filesystem
methods with exactly the same parameters. (First lookup is called, and it
does not know what is going to happen to the file; then create is called,
and it does not know the open mode for the file; and last, open is called
with the appropriate mode.)
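
As a user-space illustration of that equivalence: POSIX defines
creat(path, mode) as open(path, O_WRONLY|O_CREAT|O_TRUNC, mode).  Python
has no os.creat, so the sketch below writes the wrapper by hand; the
creat() and demo() helpers are made up for this example and say nothing
about what happens inside the kernel.

```python
import os
import tempfile


def creat(path, mode=0o644):
    """creat(path, mode) expressed as the open() call POSIX defines it to be."""
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "newfile")
        fd = creat(path)
        os.write(fd, b"hello")
        os.close(fd)
        # A second creat() truncates the existing file instead of failing.
        os.close(creat(path))
        return os.path.getsize(path)
```

Calling demo() creates a file, writes five bytes, then calls creat() again
and reports the resulting size, which shows the O_TRUNC part of the
definition.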

> > 1) Have the cifs vfs ignore subsequent opens of the same file (never have
> > more than one open per inode - ala smbfs) - which has the disadvantage of
> > making the open flags (and pid) incorrect for subsequent opens and would
> > cause server problems with handling byte range locks and potentially causes
> > problems with other clients accessing a file that was just created via
> > mknod and therefore should not be considered open anymore.
> > 
> > 2) Have the cifs vfs do "lazy close" of files - perhaps using the original
> > "opbatch" distributed caching mechanism in the smb/cifs protocol (which
> > cached opens for optimal performance running batch files on network drives)
> > for distributed cache management (so the client will not cause sharing
> > violations if other clients try to access the same file).
> > 
> > I prefer the latter but am working on proving that it works now.   Any
> > other approaches?

There is a lookup intent patch from the Lustre group. It can be found
somewhere in the archives. Pushing that (or something along those lines)
to mainline and using it would IMHO be most beneficial (because all
networking filesystems could benefit from this patch). However, that does
not fall into the category "not changing namei.c".

-------------------------------------------------------------------------------
						 Jan 'Bulb' Hudec <bulb@ucw.cz>


* Re: fixing redundant network opens on Linux file creation
  2003-01-06 17:25 fixing redundant network opens on Linux file creation Steven French
@ 2003-01-06 18:14 ` Richard Sharpe
  2003-01-06 17:59   ` Jan Hudec
  2003-01-06 22:18 ` Marcos Dione
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Sharpe @ 2003-01-06 18:14 UTC
  To: Steven French; +Cc: samba-technical, linux-fsdevel

On Mon, 6 Jan 2003, Steven French wrote:

> The creat() system call results (for the Linux kernel) in calls to create
> (via vfs_create) then later a call to open (via dentry_open) both of which
> eventually end up (for the cifs vfs) doing a network open of the file from
> the perspective of the CIFS protocol which degrades performance (because
> every creat does one additional open & close than ideal).    In the cifs
> protocol file creation is handled as a flag on the open request so create
> has a side effect of opening the file.   Unfortunately since mknod can call
> vfs_create (presumably without immediately afterwards calling open), it
> seems like a vfs can't assume that all creates are necessarily going to be
> immediately followed by a file open (server file handle leaks would be
> possible if such an assumption were made).    smbfs in effect ignores the
> subsequent open and the nfs vfs doesn't have this problem because it
> doesn't send a remote open request in nfs_open (since v2 and v3 nfs doesn't
> really need an open file handle for file based operations like smb/cifs
> does).  To improve creat() performance for cifs (without changing namei.c
> itself) it seems like there are only two obvious alternatives:

Isn't creat() a legacy call? I have never used it, and use open(..., 
O_CREAT,...) instead.

Isn't this just a cost of using legacy calls? Why complicate things overly 
for a call that might not be used all that much? 
 
> 1) Have the cifs vfs ignore subsequent opens of the same file (never have
> more than one open per inode - ala smbfs) - which has the disadvantage of
> making the open flags (and pid) incorrect for subsequent opens and would
> cause server problems with handling byte range locks and potentially causes
> problems with other clients accessing a file that was just created via
> mknod and therefore should not be considered open anymore.
> 
> 2) Have the cifs vfs do "lazy close" of files - perhaps using the original
> "opbatch" distributed caching mechanism in the smb/cifs protocol (which
> cached opens for optimal performance running batch files on network drives)
> for distributed cache management (so the client will not cause sharing
> violations if other clients try to access the same file).
> 
> I prefer the latter but am working on proving that it works now.   Any
> other approaches?
> 
> Steve French
> Senior Software Engineer
> Linux Technology Center - IBM Austin
> phone: 512-838-2294
> email: sfrench@us.ibm.com
> 

-- 
Regards
-----
Richard Sharpe, rsharpe[at]ns.aus.com, rsharpe[at]samba.org, 
sharpe[at]ethereal.com, http://www.richardsharpe.com


* Re: fixing redundant network opens on Linux file creation
  2003-01-06 17:59   ` Jan Hudec
@ 2003-01-06 19:42     ` Bryan Henderson
  2003-01-06 19:56       ` Jan Harkes
  2003-01-06 21:31       ` Andreas Dilger
  0 siblings, 2 replies; 19+ messages in thread
From: Bryan Henderson @ 2003-01-06 19:42 UTC
  To: Jan Hudec; +Cc: linux-fsdevel, Richard Sharpe, samba-technical, Steven French





>There is a lookup intent patch from lustre group. It can be found
>somewhere in the archives. Pushing that (or something along that lines)
>to mainline and using that would be IMHO most beneficial

Better still would be to add a "create-and-open" VFS call and have namei
use it.  This solves a number of problems, including the fact that it is
impossible to correctly implement an exclusive create and open with a
shared filesystem (because between when Linux confirms that the file
doesn't exist and when Linux does the VFS create, another system may have
created the file).
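
A local, user-space analogy of that race, as a sketch only: a
check-then-create sequence has a window between its two steps, while
O_CREAT|O_EXCL performs the check and the creation as one operation.  The
helper names (racy_exclusive_create, atomic_exclusive_create,
other_system_creates, demo) and the "intruder" hook are made up to make
the window visible; the thread's point is that the VFS splits the
operation in a similar way, so a shared filesystem cannot close the
window itself.

```python
import os
import tempfile


def racy_exclusive_create(path, intruder=None):
    """Check, then create: two steps with a window in between."""
    if os.path.exists(path):            # step 1: confirm the file is absent
        raise FileExistsError(path)
    if intruder is not None:
        intruder(path)                  # hypothetical: another system wins here
    # Step 2: plain create. It opens the intruder's file without complaint.
    return os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)


def atomic_exclusive_create(path):
    """Existence check and creation happen as a single operation."""
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)


def other_system_creates(path):
    with open(path, "wb") as f:
        f.write(b"theirs")


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "lockfile")
        fd = racy_exclusive_create(path, intruder=other_system_creates)
        os.close(fd)
        with open(path, "rb") as f:
            racy_saw = f.read()         # the other system's data
        try:
            os.close(atomic_exclusive_create(path))
            atomic_failed = False
        except FileExistsError:
            atomic_failed = True
        return racy_saw, atomic_failed
```

In demo() the racy version returns a descriptor for a file it did not
create, so the "exclusive" guarantee is silently lost, whereas the atomic
version reports the conflict to its caller.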

"Intent," as it's generally understood, is not a promise of future activity
-- it's either a hint to improve efficiency or it's a promise to restrict
future activity, but it should be possible simply to bail out before
exercising that intent.  E.g. you can't open a file at the same time as it
is looked up just because the looker upper says he intends to open it
later.


* Re: fixing redundant network opens on Linux file creation
  2003-01-06 19:42     ` Bryan Henderson
@ 2003-01-06 19:56       ` Jan Harkes
  2003-01-06 21:58         ` Bryan Henderson
  2003-01-06 21:31       ` Andreas Dilger
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Harkes @ 2003-01-06 19:56 UTC
  To: linux-fsdevel

On Mon, Jan 06, 2003 at 11:42:26AM -0800, Bryan Henderson wrote:
> >There is a lookup intent patch from lustre group. It can be found
> >somewhere in the archives. Pushing that (or something along that lines)
> >to mainline and using that would be IMHO most beneficial
> 
> Better still would be to add a "create-and-open" VFS call and have namei
> use it.  This solves a number of problems, including the fact that it is
> impossible to correctly implement an exclusive create and open with a
> shared filesystem (because between when Linux confirms that the file
> doesn't exist and when Linux does the VFS create, another system may have
> created the file).

But create is a directory operation, while open is an operation on a
file. It is just a matter of convenience that open(..., O_CREAT)
happens to create the directory entry if it doesn't yet exist. Logically
they shouldn't be combined.

Perhaps the exclusive create could lock the object and pass that info
on to the associated open. In Coda these objects are named 'virgin
files', and they have some special properties, such as being able to
write to a file you were allowed to create even when the ACLs are set
so that you have no write permission.

I sometimes wish that file creation would have been done the other way
around.

- Open/create a new, unnamed object, which gives a file handle.
- Link this open handle into the filesystem's namespace.

That way the application can lock the object, or write the data to it,
etc. before making it visible to the world. It might have solved some of
the possible inconsistencies for networked filesystems and is probably
more resilient wrt. symlink attacks.
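
A rough user-space approximation of that two-step idea, as a sketch
only: the object here is not truly unnamed, since mkstemp gives it a
temporary name in the same directory, but the data is complete before the
final name appears, and link() refuses to replace an existing name.
create_then_link is a made-up helper name.

```python
import os
import tempfile


def create_then_link(directory, name, data):
    """Fill in a temporary object first, then link it under its final name."""
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, data)
        os.fsync(fd)
        # Raises FileExistsError if someone else already took the name.
        os.link(tmp, os.path.join(directory, name))
    finally:
        os.close(fd)
        os.unlink(tmp)                  # drop the temporary name either way
```

If the final name is already taken, the link step fails and the caller
has to decide what to do with the data it prepared; the existing file is
left untouched.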

Jan


* Re: fixing redundant network opens on Linux file creation
  2003-01-06 19:42     ` Bryan Henderson
  2003-01-06 19:56       ` Jan Harkes
@ 2003-01-06 21:31       ` Andreas Dilger
  2003-01-06 22:23         ` Bryan Henderson
  1 sibling, 1 reply; 19+ messages in thread
From: Andreas Dilger @ 2003-01-06 21:31 UTC
  To: Bryan Henderson
  Cc: Jan Hudec, linux-fsdevel, Richard Sharpe, samba-technical,
	Steven French, Lustre Development Mailing List

On Jan 06, 2003  11:42 -0800, Bryan Henderson wrote:
> >There is a lookup intent patch from lustre group. It can be found
> >somewhere in the archives. Pushing that (or something along that lines)
> >to mainline and using that would be IMHO most beneficial
> 
> "Intent," as it's generally understood, is not a promise of future activity
> -- it's either a hint to improve efficiency or it's a promise to restrict
> future activity, but it should be possible simply to bail out before
> exercising that intent.  E.g. you can't open a file at the same time as it
> is looked up just because the looker upper says he intends to open it
> later.

In our code, the lookup-with-intent actually performs both of the operations
on the server, and it is up to the client methods to detect that the operation
was done and deal with it appropriately.  We have very well-tested code for
2.4, and the 2.5 code is mostly functional (2.5 is a much neater implementation,
but the changes mean that it isn't yet as functional as the 2.4 code).

In the Lustre code, the premise is that the lookup-with-intent operation
(called lookup2 for now) does one of:
1) the lookup + operation on the server in one RPC (i.e. lookup+create[+open],
   lookup+unlink, lookup+rename) and tells the client "I just did this
   for you, here are the attributes of the new entry and a lock on it if
   necessary, please fix up your local state to match", and the actual VFS
   operations are only doing the post-facto state cleanup.

2) OR it returns a lock to the client that grants the client exclusive
   control over the item(s) in question (normally the parent dir(s)) and
   lets the client do the operations locally and send the operations to
   the server separately.

We implement only (1) right now, but the goal is to implement
(2) in the future (which would be back to nearly what the VFS currently
does, except that we are now granted the locks in advance) so that
a client can do many operations locally without the need for getting
lots of locks.  For example, in the future, a Lustre client creating a new
directory could be granted the lock on that directory, and it could then
create files in that directory without further RPCs to the server very
efficiently (e.g.  untarring a file) until another client revokes the
lock(s) and forces the client to flush all of its updates to the server.
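
To make case (1) concrete, here is a toy Python model that counts RPCs
for an open-with-create done as separate steps versus one
lookup-with-intent call.  It is a sketch under stated assumptions and not
the Lustre interface: ToyMetadataServer, lookup_with_intent and its
arguments are made-up names, and locking is left out entirely.

```python
class ToyMetadataServer:
    """Stand-in for a metadata server; every method call is one RPC."""

    def __init__(self):
        self.files = {}
        self.rpcs = 0

    def lookup(self, name):
        self.rpcs += 1
        return self.files.get(name)

    def create(self, name):
        self.rpcs += 1
        self.files[name] = {"name": name, "open_count": 0}
        return self.files[name]

    def open(self, name):
        self.rpcs += 1
        self.files[name]["open_count"] += 1
        return self.files[name]

    def lookup_with_intent(self, name, intent):
        """One RPC: look up, then carry out the declared operation(s)."""
        self.rpcs += 1
        done = []
        entry = self.files.get(name)
        if entry is None and "create" in intent:
            entry = self.files[name] = {"name": name, "open_count": 0}
            done.append("create")
        if entry is not None and "open" in intent:
            entry["open_count"] += 1
            done.append("open")
        # Attributes of the entry plus what the server already did for us.
        return entry, done


def open_create_plain(server, name):
    """lookup, create if missing, then open: up to three RPCs."""
    if server.lookup(name) is None:
        server.create(name)
    return server.open(name)


def open_create_intent(server, name):
    """One RPC; later create/open steps would only fix up local state."""
    entry, done = server.lookup_with_intent(name, ("create", "open"))
    return entry
```

For a file that does not yet exist, the plain path costs three RPCs in
this model and the intent path costs one, with the same end state on the
server.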

For an updated version of the intent patch, see:

ftp://ftp.lustre.org/pub/kernels/patches/2.4.18-hp1_pnnl18_l5.patch
ftp://ftp.lustre.org/pub/kernels/patches/37chaos-l5.patch

The first patch is good for vanilla kernels, and the second for RH
2.4.18-17ish kernels.  There is a bit of extra stuff therein which
isn't really related to the intent changes.

Cheers, Andreas

PS - I've added lustre-devel to this thread so that the Lustre developers
     also see any discussion related to the intent changes.
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


* Re: fixing redundant network opens on Linux file creation
  2003-01-06 19:56       ` Jan Harkes
@ 2003-01-06 21:58         ` Bryan Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Bryan Henderson @ 2003-01-06 21:58 UTC
  To: Jan Harkes; +Cc: linux-fsdevel





>But create is a directory operation,

But it isn't.  Here's the thing: directories and files are intimately tied
together in Unix.  I often wish they weren't.  If they weren't, as in VMS,
creating a directory entry and creating a file would be independent
operations.  Most of a lookup would be done above the kernel.  System calls
would address filesystem objects by inode number.

>It is just a matter of convenience that open(..., O_CREAT)
>happens to create the directory entry if it doesn't yet exist. Logically
>they shouldn't be combined.

Logically, POSIX shouldn't require them to be combined, but it does.

An alternative to having an atomic create-and-open VFS call would be to
define VFS lock/unlock directory calls.  For a shared filesystem, a central
lock manager would have to coordinate these locks among the various systems
-- and deal with the problems of systems crashing or dropping off the
network while holding a directory lock.  Considerably more implementation
work.

>Perhaps having the exclusive create lock the object and pass that info
>on to the associated open.

I don't see how this is different from just having the create open the
file.  You still have the call that adds an entry to a directory also doing
a file operation (creating a file) and then another file operation (locking
the file).  Might as well just let it open the file.

>I sometimes wish that file creation would have been done the other way
>around.
>
>- Open/create a new, unnamed object, which gives a file handle.
>- Link this open handle into the filesystem's namespace.

Assuming the POSIX directory-file binding, this has a similar problem.
User asks to open a file and create it if "it" doesn't already exist.
namei determines the file doesn't exist, so creates and opens a new,
unnamed file.  Another system then creates the file and adds it to the
directory.  namei now goes to add the file it created into the directory,
but can't.  Now what?

Incidentally, AIX VFS has the create-and-open call (consistent with the
system call interface, creating is done just by flags on an open VFS call).



* Re: fixing redundant network opens on Linux file creation
  2003-01-06 17:25 fixing redundant network opens on Linux file creation Steven French
  2003-01-06 18:14 ` Richard Sharpe
@ 2003-01-06 22:18 ` Marcos Dione
  2003-01-07  9:35   ` Jan Hudec
  1 sibling, 1 reply; 19+ messages in thread
From: Marcos Dione @ 2003-01-06 22:18 UTC
  To: Steven French; +Cc: samba-technical, linux-fsdevel

On Mon, Jan 06, 2003 at 11:25:32AM -0600, Steven French wrote:
> The creat() system call results (for the Linux kernel) in calls to create
> (via vfs_create) then later a call to open (via dentry_open) both of which
> eventually end up (for the cifs vfs) doing a network open of the file from
> the perspective of the CIFS protocol which degrades performance (because

    why not implement create as a separate feature? you can use a
different message and mknod(2) on the server.

    I'm asking 'cause I'll have the same problem when implementing my
thesis.

-- 
well-designed technology should allow people the luxury of ignorance
              -- Eric S. Raymond


* Re: fixing redundant network opens on Linux file creation
  2003-01-06 21:31       ` Andreas Dilger
@ 2003-01-06 22:23         ` Bryan Henderson
  2003-01-06 22:48           ` Andreas Dilger
  0 siblings, 1 reply; 19+ messages in thread
From: Bryan Henderson @ 2003-01-06 22:23 UTC
  To: Andreas Dilger
  Cc: Jan Hudec, linux-fsdevel, Lustre Development Mailing List,
	Richard Sharpe, samba-technical, Steven French





>In our code, the lookup-with-intent actually performs both of the
>operations on the server,

What I don't get is why is the concept of "intent" even involved here?  If
lookup-with-intent does the lookup and open (and, I guess, create where
appropriate), why don't you call it "lookup-and-open" and then skip the
subsequent VFS open call?

You also mention the distributed version of the Lustre lookup-with-intent:

>OR it returns a lock to the client that grants the client exclusive
>   control over the item(s) in question (normally the parent dir(s)) and
>   lets the client do the operations locally and send the operations to
>   the server separately.

and the same question applies in that case.  While the client may do the
open separately, there's no reason it shouldn't do it before returning from
the VFS lookup-with-intent call, which means it would be simpler as a
lookup-and-open.



* Re: fixing redundant network opens on Linux file creation
  2003-01-06 22:23         ` Bryan Henderson
@ 2003-01-06 22:48           ` Andreas Dilger
  2003-01-07  1:06             ` Bryan Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Andreas Dilger @ 2003-01-06 22:48 UTC
  To: Bryan Henderson
  Cc: Jan Hudec, linux-fsdevel, Lustre Development Mailing List,
	Richard Sharpe, samba-technical, Steven French

On Jan 06, 2003  14:23 -0800, Bryan Henderson wrote:
> >In our code, the lookup-with-intent actually performs both of the
> >operations on the server,
> 
> What I don't get is why is the concept of "intent" even involved here?  If
> lookup-with-intent does the lookup and open (and, I guess, create where
> appropriate), why don't you call it "lookup-and-open" and then skip the
> subsequent VFS open call?

Because the intent code is much more than just "lookup-and-open".
It is also lookup-and-create, lookup-and-mkdir, lookup-and-unlink,
lookup-and-setattr, etc.  I don't think we want separate VFS ops for
every possible VFS op.

Also, in the Linux VFS, the lookup call is the one which is actually
doing the locking on the appropriate objects for atomicity purposes,
which is actually the critical thing here - we use lookup2 for doing
the distributed locking as much as for the RPC savings.

Like another Lustre developer remarked "it's a lock-with-intent on the
wire, and a lookup-with-intent in the kernel".

> You also mention the distributed version of the Lustre lookup-with-intent:
> 
> >OR it returns a lock to the client that grants the client exclusive
> >   control over the item(s) in question (normally the parent dir(s)) and
> >   lets the client do the operations locally and send the operations to
> >   the server separately.
> 
> and the same question applies in that case.  While the client may do the
> open separately, there's no reason it shouldn't do it before returning from
> the VFS lookup-with-intent call, which means it would be simpler as a
> lookup-and-open.

There are several reasons we still do a VFS open call after we do the
lookup-with-intent:
1) like I said above, we don't want to have 2x every VFS op (one with
   lookup and another without) either in our code or in the VFS proper
2) the number of changes needed to the VFS would be quite large if it had
   to determine whether it should do a lookup-with-intent and no regular
   op, or the lookup + regular op
3) doing lookup-with-intent allows us to manage the locking internal to
   the filesystem however we want instead of having to live within the
   VFS's ideas of locking (e.g. we could split up the locks within a
   single directory so you could do concurrent creates/renames/unlinks
   in a single directory if we so choose, and we may).

The thing to focus on here is that lookup2 is as much a locking API as
it is a lookup+operation API.  I was thinking a clever name for it would
be "loockup", but that has some unfortunate connotations ;-).

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


* Re: fixing redundant network opens on Linux file creation
  2003-01-06 22:48           ` Andreas Dilger
@ 2003-01-07  1:06             ` Bryan Henderson
  2003-01-07 13:19               ` [Lustre-devel] " Mike Shaver
  0 siblings, 1 reply; 19+ messages in thread
From: Bryan Henderson @ 2003-01-07  1:06 UTC
  To: Andreas Dilger
  Cc: Jan Hudec, linux-fsdevel, linux-fsdevel-owner,
	Lustre Development Mailing List, Richard Sharpe, samba-technical,
	Steven French





>Because the intent code is much more than just "lookup-and-open".
>It is also lookup-and-create, lookup-and-mkdir, lookup-and-unlink,
>lookup-and-setattr, etc.  I don't think we want separate VFS ops for
>every possible VFS op.

That's really orthogonal to this discussion.  If you want to conserve the
number of VFS operation routines, you can have a single routine with
parameters for a dozen different operations whether it is
lookup-with-intent or lookup-and-do.  Pretty much the only difference in
the C code is the name of the routine.

But my discomfort with the lookup-with-intent approach is focused on the
open/create operation in particular.  From what I can tell, these intents
are more than just declaration of intent.  They're promises.  If the VFS
caller did a lookup with intent to create if not found, and then didn't
follow through on that intent, I guess that would cause trouble on Lustre
since the implementation of lookup-with-intent actually created the file.

That's not the concept of intent declaration as I've seen it everywhere
else.  Something like "open with write intent" always means either "open
the file and I won't do anything but write to it," or "open the file and
I'll probably be writing to it," but never "open the file and the next
thing you see from me will be a write of 10 bytes at offset 20."

Another thing the structure of this "intent" interface says to me is that a
filesystem driver might choose in some cases not to open the file but wait
until the open is actually requested.  If so, doesn't the filesystem driver
have to maintain some cognizance of the thread of file accesses, so it can
match up an open with a previous lookup-with-intent and know if that
particular open is already done?  That kind of state has always been
intentionally omitted from the VFS interface.


* Re: fixing redundant network opens on Linux file creation
  2003-01-06 22:18 ` Marcos Dione
@ 2003-01-07  9:35   ` Jan Hudec
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Hudec @ 2003-01-07  9:35 UTC
  To: Marcos Dione; +Cc: Steven French, samba-technical, linux-fsdevel

On Mon, Jan 06, 2003 at 07:18:30PM -0300, Marcos Dione wrote:
> On Mon, Jan 06, 2003 at 11:25:32AM -0600, Steven French wrote:
> > The creat() system call results (for the Linux kernel) in calls to create
> > (via vfs_create) then later a call to open (via dentry_open) both of which
> > eventually end up (for the cifs vfs) doing a network open of the file from
> > the perspective of the CIFS protocol which degrades performance (because
> 
>     why not implement create as a separate feature? you can use a
> different message and mknod(2) on the server.
> 
>     I'm asking 'cause I'll have the same problem when implementing my
> thesis.

That won't help. You are still doing two upcalls, it still isn't atomic,
etc. The problem is that the vfs always calls ->create and then
->open, both for open(O_CREAT) and creat().

-------------------------------------------------------------------------------
						 Jan 'Bulb' Hudec <bulb@ucw.cz>


* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation
  2003-01-07  1:06             ` Bryan Henderson
@ 2003-01-07 13:19               ` Mike Shaver
  2003-01-07 17:28                 ` Bryan Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Shaver @ 2003-01-07 13:19 UTC
  To: Bryan Henderson
  Cc: Andreas Dilger, Jan Hudec, linux-fsdevel, linux-fsdevel-owner,
	Lustre Development Mailing List, Richard Sharpe, samba-technical,
	Steven French

On Jan 06, Bryan Henderson wrote:
> That's really orthogonal to this discussion.  If you want to conserve the
> number of VFS operation routines, you can have a single routine with
> parameters for a dozen different operations whether it is
> lookup-with-intent or lookup-and-do.  Pretty much the only difference in
> the C code is the name of the routine.

That may be true, but the invasiveness of the change to the Linux VFS
would likely be much greater.  Our intent patches are pretty small, and
therefore much easier to port between versions, as well as more likely
to be integrated into 2.5/2.6.

> But my discomfort with the lookup-with-intent approach is focused on the
> open/create operation in particular.  From what I can tell, these intents
> are more than just declaration of intent. They're promises.  If the VFS
> caller did a lookup with intent to create if not found, and then didn't
> follow through on that intent, I guess that would cause trouble on Lustre
> since the implementation of lookup-with-intent actually created the file.

Do you use "the VFS caller" to mean "the code that calls into the VFS",
or "the caller of the intent-handling operations, which is the VFS"?
It's my understanding that these changes are transparent to the caller
of the VFS, but if the VFS itself were to "abort" halfway we might well
have a problem.  Not because something created the file, but because we
wouldn't necessarily clean up the intent structures correctly.  I expect
that this is a soluble problem, at the expense of more changes to the
VFS.

We haven't seen any problems with "aborted intent" in part because we
don't depend on the caller-into-the-VFS to cooperate; the VFS itself
completes the intent protocol correctly, every time, in no small part
because the intent is declarative and binding, rather than just
speculative.

> That's not the concept of intent declaration as I've seen it everywhere
> else.  Something like "open with write intent" always means either "open
> the file and I won't do anything but write to it," or "open the file and
> I'll probably be writing to it," but never "open the file and the next
> thing you see from me will be a write of 10 bytes at offset 20."

Is the objection really just to the terminology, then?  JFS, VxFS and
NetApp seem to use "intent logging" to mean something similar ("I will
be doing this next", rather than "I might be doing this next, but maybe
not").  Maybe I misunderstand the intent log, though, and the time at
which it gets updated.  It certainly does seem to describe fact rather
than a fallible expectation.

The origin of the intent stuff is really, to my understanding, in the
locking: the client requests a lock with the declared intent of performing
some other FS operation (getattr, create of a child, etc.).  The
presence of that intent information, in the form of a fully-specified FS
operation, is what permits the server to perform the desired operation
on behalf of the client, where system performance would be degraded
unacceptably by giving one client an exclusive lock on a contended
resource.  That we have intent-driven behaviour in lookup/lookup2 is
largely due to the fact that it's in lookup that we need to acquire our
locks.

> Another thing the structure of this "intent" interface says to me is that a
> filesystem driver might choose in some cases not to open the file but wait
> until the open is actually requested.  If so, doesn't the filesystem driver
> have to maintain some cognizance of the thread of file accesses, so it can
> match up an open with a previous lookup-with-intent and know if that
> particular open is already done?  That kind of state has always been
> intentionally omitted from the VFS interface.

I think it's that state, specifically, that's represented by the intent
parameters added to the various ops.  I understand that it was a design
compromise motivated in no small part by the desire to minimize changes
to the Linux VFS at this stage.  I'm not at all certain that we would
structure things in this form if we were writing an intent-enabled VFS
from first principles.

Mike

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation
  2003-01-07 13:19               ` [Lustre-devel] " Mike Shaver
@ 2003-01-07 17:28                 ` Bryan Henderson
  2003-01-07 18:50                   ` Andreas Dilger
  0 siblings, 1 reply; 19+ messages in thread
From: Bryan Henderson @ 2003-01-07 17:28 UTC (permalink / raw)
  To: Mike Shaver
  Cc: Andreas Dilger, Jan Hudec, linux-fsdevel, linux-fsdevel-owner,
	Lustre Development Mailing List, Richard Sharpe, samba-technical,
	Steven French





>Is the objection really just to the terminology, then?

Partly the terminology, and partly the things that the terminology implies.
Those who support this name must do so because they expect the interface to
have some properties of intent.  But I've argued that it cannot be a true
intent declaration, and therefore any approximation to intent will cause
trouble.

>JFS, VxFS and
>NetApp seem to use "intent logging" to mean something similar ("I will
>be doing this next", rather than "I might be doing this next, but maybe
>not").  Maybe I misunderstand the intent log, though, and the time at
>which it gets updated.  It certainly does seem to describe fact rather
>than a fallible expectation.

I'm not a big fan of this use of the word "intent" either, and in fact the
technique it refers to is often called other things.  But it definitely
_is_ a case where the intent can be abandoned.  That's the whole point --
you log an intent to create a file, but don't actually commit to creating
it.  If the system should crash before all the corequisites of that
creation are complete, the file ends up never having been created.  In
contrast, the proposed Linux lookup-with-intent scheme appears actually to
irrevocably create a file as soon as the "intent" to create it is declared.
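
As a rough illustration of intent logging in the sense just described,
here is a toy in-memory journal: an intent is recorded first, and only a
later commit makes the creation real, so a crash in between leaves no
file.  All names here (intent_log, log_intent, log_commit, recover) are
hypothetical; this is not the format or API of JFS, VxFS, or NetApp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy journal; not the on-disk format of any real filesystem. */
enum { LOG_MAX = 8, NAME_MAX_LEN = 32 };

struct intent_rec {
    char name[NAME_MAX_LEN];
    bool committed;   /* set only once every corequisite step is done */
};

struct intent_log {
    struct intent_rec rec[LOG_MAX];
    size_t used;
};

/* Record the intent to create `name`.  Returns a slot index, or -1 if full. */
static int log_intent(struct intent_log *il, const char *name)
{
    assert(il != NULL && name != NULL);
    if (il->used == LOG_MAX)
        return -1;
    struct intent_rec *r = &il->rec[il->used];
    strncpy(r->name, name, NAME_MAX_LEN - 1);
    r->name[NAME_MAX_LEN - 1] = '\0';
    r->committed = false;   /* logged, but not yet binding */
    return (int)il->used++;
}

/* Mark a previously logged intent as fully carried out. */
static void log_commit(struct intent_log *il, int slot)
{
    assert(il != NULL && slot >= 0 && (size_t)slot < il->used);
    il->rec[slot].committed = true;
}

/* Crash recovery: count the files that exist afterwards.
 * Intents that were logged but never committed are simply dropped. */
static size_t recover(const struct intent_log *il)
{
    size_t n = 0;
    for (size_t i = 0; i < il->used; i++)
        if (il->rec[i].committed)
            n++;
    return n;
}
```

Logging two intents and committing only one leaves recover() reporting a
single file, which is the "abandonable" property at issue.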

>I understand that it was a design
>compromise motivated in no small part by the desire to minimize changes
>to the Linux VFS at this stage.

That explanation makes a lot of sense.  What it really boils down to is
that the parameters aren't so much a declaration of intent as a revelation
of the context in which the caller is making the call.  In other words, a
contravention of modularity.  That is always a minimal-lines-of-code
solution to a protocol problem.  It's a heavy design tradeoff, though.

>Do you use "the VFS caller" to mean "the code that calls into the VFS",
>or "the caller of the intent-handling operations, which is the VFS"?

One of the irritating things about Linux filesystem discussions is the
diversity of terminology.  Several of the key terms, including "VFS",
are used to mean multiple very different things.  In this case, you are
clearly using "VFS" to mean the Linux code found in the 'fs' directory.
That's common, but it is also common to use it to refer to the code found
in directories such as 'fs/ext2'.  I don't find either of those definitions
useful.  To me, VFS has always been the name of the protocol that said
pieces of code use to talk to each other.  And it applies in general to all
operating systems that have such an interface inside them.  The name "FS"
works better for the code in the 'fs' directory (not just because that's
what the directory is called, but also because the oldest documents
describing it call it that).  The term "filesystem driver" is far more
descriptive, unambiguous, and universal for the code in fs/ext2.  But
people most often refer to that code as "a filesystem."  Along with 5 other
things they use the word "filesystem" for.

Note that in theory, the invoker of the VFS protocol operations could be
anything, and the filesystem driver should not care.  Even in practice, it
is not always the 'fs/' component.  Sometimes it is the NFS server code.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation
  2003-01-07 17:28                 ` Bryan Henderson
@ 2003-01-07 18:50                   ` Andreas Dilger
  2003-01-08 17:52                     ` Bryan Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Andreas Dilger @ 2003-01-07 18:50 UTC (permalink / raw)
  To: Bryan Henderson
  Cc: Mike Shaver, Jan Hudec, linux-fsdevel, linux-fsdevel-owner,
	Lustre Development Mailing List, Richard Sharpe, samba-technical,
	Steven French

On Jan 07, 2003  09:28 -0800, Bryan Henderson wrote:
> >JFS, VxFS and
> >NetApp seem to use "intent logging" to mean something similar ("I will
> >be doing this next", rather than "I might be doing this next, but maybe
> >not").  Maybe I misunderstand the intent log, though, and the time at
> >which it gets updated.  It certainly does seem to describe fact rather
> >than a fallible expectation.
> 
> I'm not a big fan of this use of the word "intent" either, and in fact the
> technique it refers to is often called other things.  But it definitely
> _is_ a case where the intent can be abandoned.  That's the whole point --
> you log an intent to create a file, but don't actually commit to creating
> it.  If the system should crash before all the corequisites of that
> creation are complete, the file ends up never having been created.  In
> contrast, the proposed Linux lookup-with-intent scheme appears actually to
> irrevocably create a file as soon as the "intent" to create it is declared.

I don't see where you are coming from here.  Could you be more specific on
whether you think the entity declaring an "intent" is user-space, the VFS
code in fs/*.c, the filesystem driver code in fs/*/*.c or what?  I don't
really see where you can "change your mind" in the middle of creating a
file, unless there was an error somewhere along the way.  If you call
sys_mkdir() you have declared an "intent" to create a directory, and the
VFS better not arbitrarily decide that it doesn't feel like creating
directories today.

What I am getting at is that once an application has called a system call,
either that system call will do what it was supposed to do (e.g. create,
rename, remove, change a file/dir) or it will have an error.  Whether that
operation was done in the "lookup-with-intent call on server + op fixup on
client" or as a lookup+op call on a local filesystem is unrelated to the
fact that the operation will complete either way.

The "intent" that we are talking about in regards to Lustre is not a "maybe"
thing like open(..., O_RDWR) where you may or may not read or write to a
file after opening it.  The intent is set up at entry to the kernel syscall
code, and is destroyed before the syscall returns to user code again.  The
only two options are that the server acted on the intent and did the operation
there and the kernel code on the client handles this, or the server granted
a lock to the client, and the kernel code on the client is required to
complete the operation itself.  Anything else is a bug.
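
The lifecycle above can be sketched roughly as follows.  This is an
illustration only: the types and functions (struct intent,
lookup_with_intent, do_syscall) are hypothetical names, not the interface
of the Lustre patch, and the "contended" flag merely stands in for
whatever the server really uses to decide between the two outcomes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the Lustre patch's API. */
enum intent_op { INTENT_MKDIR, INTENT_CREATE };

enum intent_disposition {
    DISP_SERVER_EXECUTED,   /* server performed the operation itself */
    DISP_LOCK_GRANTED       /* server granted a lock; client must finish */
};

struct intent {
    enum intent_op op;
    enum intent_disposition disp;
    int done_on_client;
};

/* Stand-in for the lookup RPC.  Assumption: the server executes the
 * operation itself when the resource is contended, else grants a lock. */
static void lookup_with_intent(struct intent *it, int contended)
{
    assert(it != NULL);
    it->disp = contended ? DISP_SERVER_EXECUTED : DISP_LOCK_GRANTED;
}

/* Models one syscall: the intent is created on entry and is gone before
 * return.  Returns 0 on success; *client_did_op reports who did the work. */
static int do_syscall(enum intent_op op, int contended, int *client_did_op)
{
    struct intent it = { .op = op, .disp = DISP_LOCK_GRANTED,
                         .done_on_client = 0 };

    lookup_with_intent(&it, contended);
    switch (it.disp) {
    case DISP_SERVER_EXECUTED:
        /* Only client-side fixup is needed; the operation is already done. */
        break;
    case DISP_LOCK_GRANTED:
        /* The client holds the lock and must complete the operation. */
        it.done_on_client = 1;
        break;
    default:
        return -1;          /* anything else is a bug */
    }
    *client_did_op = it.done_on_client;
    return 0;               /* `it` goes out of scope with the syscall */
}
```

Either branch ends with the operation complete; the intent never outlives
the call that created it.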

> Note that in theory, the invoker of the VFS protocol operations could be
> anything, and the filesystem driver should not care.  Even in practice, it
> is not always the 'fs/' component.  Sometimes it is the NFS server code.

Or, by no small coincidence, the Lustre target code.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation
  2003-01-07 18:50                   ` Andreas Dilger
@ 2003-01-08 17:52                     ` Bryan Henderson
  2003-01-08 19:11                       ` Peter Braam
  0 siblings, 1 reply; 19+ messages in thread
From: Bryan Henderson @ 2003-01-08 17:52 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jan Hudec, linux-fsdevel, linux-fsdevel-owner,
	Lustre Development Mailing List, Richard Sharpe, samba-technical,
	Mike Shaver, Steven French





>I don't see where you are coming from here.  Could you be more specific on
>whether you think the entity declaring an "intent" is user-space, the VFS
>code in fs/*.c, the filesystem driver code in fs/*/*.c or what?

As a general principle, any of those things could declare intent.  In the
Lustre design we're talking about, I don't believe any of them does.  Hence
my objection to the term "intent."  Based on that word, I thought at first
I might just have missed something in the definition of the interface, but
I don't think so anymore.

>I don't
>really see where you can "change your mind" in the middle of creating a
>file, unless there was an error somewhere along the way.

I don't either.  (And apparently, simple errors are no exception in the
Lustre design.)  Hence, you have declared significantly more than an intent
when you did the lookup.

>If you call
>sys_mkdir() you have declared an "intent" to create a directory

Not as "intent" is usually understood.  If you call sys_mkdir(), you have
commanded the kernel to create the directory.  That's a lot different from
declaring that you intend to create the directory.

I believe the Lustre patch works.  I also believe it uses the wrong
terminology, creates an interface to filesystem drivers that is brittle and
hard to understand, and doesn't solve as wide a range of problems as it
could.  I believe that what it calls a declaration of intent is really a
declaration of what POSIX system call the caller is in the middle of
performing.

On the other hand, it has been pointed out that one of its goals was to
minimize the changes to fs/*.c.  I agree the patch is a good way to achieve
that goal.

If it were my decision, I would solve the Lustre problem, and the Samba
problem, and some of my own as well, by putting higher level filesystem
driver interfaces into Linux, such as some other kernels do.  Let the
filesystem driver do the whole "lookup, create directory, add directory
entry" operation if it wants to, and in that case make just that one call
to the filesystem driver and be done.  Let the filesystem driver deal with
the problems of failures halfway through the sequence.
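
A minimal sketch of what such a higher-level interface might look like,
assuming a purely hypothetical operations table (fs_driver_ops and its
fields are illustrative names, not Linux's inode_operations): the generic
layer makes one call if the driver offers the whole operation, and falls
back to the step-wise sequence otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical driver interface; field names are illustrative only. */
struct fs_driver_ops {
    int (*lookup)(const char *name);         /* 0 if the name is free */
    int (*mkdir)(const char *name);          /* 0 on success */
    int (*mkdir_by_name)(const char *name);  /* optional: whole operation */
};

/* Generic layer.  Returns the number of driver calls made, or -1 on
 * failure, so the caller can see which path was taken. */
static int generic_mkdir(const struct fs_driver_ops *ops, const char *name)
{
    assert(ops != NULL && name != NULL);

    /* The driver takes the whole job, including mid-sequence failures. */
    if (ops->mkdir_by_name)
        return ops->mkdir_by_name(name) == 0 ? 1 : -1;

    /* Otherwise the generic code drives lookup and mkdir separately. */
    if (ops->lookup(name) != 0)
        return -1;
    return ops->mkdir(name) == 0 ? 2 : -1;
}

/* Trivial stub used by both demonstration drivers below. */
static int stub_ok(const char *name) { (void)name; return 0; }

static const struct fs_driver_ops stepwise_driver = {
    .lookup = stub_ok, .mkdir = stub_ok
};
static const struct fs_driver_ops compound_driver = {
    .lookup = stub_ok, .mkdir = stub_ok, .mkdir_by_name = stub_ok
};
```

With these stubs, generic_mkdir() makes one driver call for
compound_driver and two for stepwise_driver.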

But suggestions I've made to give more power to filesystem drivers have in
the past met resistance from those who want to keep centralized control and
maintain uniformity among the various filesystem types.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation
  2003-01-08 17:52                     ` Bryan Henderson
@ 2003-01-08 19:11                       ` Peter Braam
  2003-01-09  2:08                         ` Bryan Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Braam @ 2003-01-08 19:11 UTC (permalink / raw)
  To: Bryan Henderson
  Cc: Andreas Dilger, Jan Hudec, linux-fsdevel, linux-fsdevel-owner,
	Lustre Development Mailing List, Richard Sharpe, samba-technical,
	Mike Shaver, Steven French

Hi, 

I have no objections to a name change.  We are not so religious about
"intent" as a name.

On Wed, Jan 08, 2003 at 10:52:51AM -0700, Bryan Henderson wrote:

> >I don't see where you are coming from here.  Could you be more specific on
> >whether you think the entity declaring an "intent" is user-space, the VFS
> >code in fs/*.c, the filesystem driver code in fs/*/*.c or what?
> 
> As a general principle, any of those things could declare intent.  In the
> Lustre design we're talking about, I don't believe any of them does.  Hence
> my objection to the term "intent."  Based on that word, I thought at first
> I might just have missed something in the definition of the interface, but
> I don't think so anymore.
> 
> >I don't
> >really see where you can "change your mind" in the middle of creating a
> >file, unless there was an error somewhere along the way.

open with O_CREAT | O_EXCL is a good example.
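
For reference, a small user-space sketch of that flag combination, using
only standard POSIX calls; create_exclusive() is a hypothetical helper
name.  The point, as I read it, is that whether the create happens at all
is decided by what the lookup finds: if the name already exists, the call
fails with EEXIST instead of creating anything.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Returns 0 if this call created the file, or the errno value from
 * open() otherwise (EEXIST when the name was already taken). */
static int create_exclusive(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```

Calling it twice on the same fresh path succeeds the first time and
returns EEXIST the second.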

> I don't either.  (And apparently, simple errors are no exception in the
> Lustre design.)  Hence, you have declared significantly more than an intent
> when you did the lookup.
> 
> >If you call
> >sys_mkdir() you have declared an "intent" to create a directory
> 
> Not as "intent" is usually understood.  If you call sys_mkdir(), you have
> commanded the kernel to create the directory.  That's a lot different from
> declaring that you intend to create the directory.
> 
> I believe the Lustre patch works.  I also believe it uses the wrong
> terminology, creates an interface to filesystem drivers that is brittle and
> hard to understand, and doesn't solve as wide a range of problems as it
> could.  I believe that what it calls a declaration of intent is really a
> declaration of what POSIX system call the caller is in the middle of
> performing.
> 
> On the other hand, it has been pointed out that one of its goals was to
> minimize the changes to fs/*.c.  I agree the patch is a good way to achieve
> that goal.
> 
> If it were my decision, I would solve the Lustre problem, and the Samba
> problem, and some of my own as well, by putting higher level filesystem
> driver interfaces into Linux, such as some other kernels do.  Let the
> filesystem driver do the whole "lookup, create directory, add directory
> entry" operation if it wants to, and in that case make just that one call
> to the filesystem driver and be done.  Let the filesystem driver deal with
> the problems of failures halfway through the sequence.
> 
> But suggestions I've made to give more power to filesystem drivers have in
> the past met resistance from those who want to keep centralized control and
> maintain uniformity among the various filesystem types.

That proposal has been made by many other people, everywhere.  Of
course we could work with that too. 

Personally I rather like the Linux VFS because it does the locking, etc.
Al Viro has made it very clear that, e.g., locking for renames, which is
incredibly hard, is better done once (what you call centralized) than
many times by different file systems.

This is the one single reason that we used the "intent" solution: it
can make use of the VFS infrastructure better than high level calls. 

But again, I'm not religious about this -- I am religious about
getting correctness for clustering file systems. And we have had to do
some other things (like dealing with dentries in highly non-standard
ways) to get correctness.  And of course, we have many problems
left...

- Peter -


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation
  2003-01-08 19:11                       ` Peter Braam
@ 2003-01-09  2:08                         ` Bryan Henderson
  2003-01-09  3:36                           ` Peter Braam
  0 siblings, 1 reply; 19+ messages in thread
From: Bryan Henderson @ 2003-01-09  2:08 UTC (permalink / raw)
  To: Peter Braam
  Cc: Andreas Dilger, Jan Hudec, linux-fsdevel, linux-fsdevel-owner,
	Lustre Development Mailing List, Richard Sharpe, samba-technical,
	Mike Shaver, Steven French





>I have no objections to a name change.  We are not so religious about
>"intent" as a name.

How religious are you about the idea of having to have BOTH a lookup2()
that contains all the information necessary to create a directory if the
name is available, AND a subsequent "create directory" call?  Because once
you remove the word "intent" from the description, that looks even more
silly.

It is the relationship between those two (sometimes 3) redundant calls that
is the real substance in what otherwise appears to be just a naming issue.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Lustre-devel] Re: fixing redundant network opens on Linux file creation
  2003-01-09  2:08                         ` Bryan Henderson
@ 2003-01-09  3:36                           ` Peter Braam
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Braam @ 2003-01-09  3:36 UTC (permalink / raw)
  To: Bryan Henderson
  Cc: Andreas Dilger, Jan Hudec, linux-fsdevel, linux-fsdevel-owner,
	Lustre Development Mailing List, Richard Sharpe, samba-technical,
	Mike Shaver, Steven French

Bryan, 

On Wed, Jan 08, 2003 at 06:08:48PM -0800, Bryan Henderson wrote:
> >I have no objections to a name change.  We are not so religious about
> >"intent" as a name.
> 
> How religious are you about the idea of having to have BOTH a lookup2()
> that contains all the information necessary to create a directory if the
> name is available, AND a subsequent "create directory" call?  Because once
> you remove the word "intent" from the description, that looks even more
> silly.

Good question.  For mkdir your solution is much preferable.  So no
religion here at all.  But mkdir is an easy case, possibly the easiest.

For open, rename, setattr and dealing with symbolic links we found
having the separation of the lookup phase with intents and actual
execution to be quite useful, since the symbolic links may bring you
back to another file system.

> It is the relationship between those two (sometimes 3) redundant calls that
> is the real substance in what otherwise appears to be just a naming issue.

Yes, and the answer is "sometimes" - in the mkdir case it is (moderately)
easy to give the whole task to the file system (symlinks remain
hairy), in open, rename, setattr we found a lot of useful VFS
functionality between lookup and operation.

- Peter -

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2003-01-09  3:36 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-01-06 17:25 fixing redundant network opens on Linux file creation Steven French
2003-01-06 18:14 ` Richard Sharpe
2003-01-06 17:59   ` Jan Hudec
2003-01-06 19:42     ` Bryan Henderson
2003-01-06 19:56       ` Jan Harkes
2003-01-06 21:58         ` Bryan Henderson
2003-01-06 21:31       ` Andreas Dilger
2003-01-06 22:23         ` Bryan Henderson
2003-01-06 22:48           ` Andreas Dilger
2003-01-07  1:06             ` Bryan Henderson
2003-01-07 13:19               ` [Lustre-devel] " Mike Shaver
2003-01-07 17:28                 ` Bryan Henderson
2003-01-07 18:50                   ` Andreas Dilger
2003-01-08 17:52                     ` Bryan Henderson
2003-01-08 19:11                       ` Peter Braam
2003-01-09  2:08                         ` Bryan Henderson
2003-01-09  3:36                           ` Peter Braam
2003-01-06 22:18 ` Marcos Dione
2003-01-07  9:35   ` Jan Hudec
