linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found] ` <Pine.GSO.4.21.0108022312211.1494-100000@weyl.math.psu.edu>
@ 2001-08-03 13:09   ` Daniel Phillips
  2001-08-03 14:43     ` Horst von Brand
                       ` (2 more replies)
  2001-08-03 18:36   ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
  1 sibling, 3 replies; 77+ messages in thread
From: Daniel Phillips @ 2001-08-03 13:09 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Horst von Brand, linux-kernel

On Friday 03 August 2001 05:13, Alexander Viro wrote:
> On Fri, 3 Aug 2001, Daniel Phillips wrote:
> > There is only one chain of directories from the fd's dentry up to
> > the root, that's the one to sync.
>
> You forgot ".. at any given moment". IOW, operation you propose is
> inherently racy. You want to do that - you do that in userland.

Are you saying that there may not be a ".." some of the time?  Or just 
that it may spontaneously be relinked?  If it does spontaneously change 
it doesn't matter, you have still made sure there is access by at least 
one path.

The trouble with doing this in userland is, the locked chain of dcache 
entries isn't there.

--
Daniel

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 13:09   ` intermediate summary of ext3-2.4-0.9.4 thread Daniel Phillips
@ 2001-08-03 14:43     ` Horst von Brand
  2001-08-03 17:49     ` Mike Castle
  2001-08-03 18:08     ` Alexander Viro
  2 siblings, 0 replies; 77+ messages in thread
From: Horst von Brand @ 2001-08-03 14:43 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

Daniel Phillips <phillips@bonn-fries.net> said:
> On Friday 03 August 2001 05:13, Alexander Viro wrote:

[...]

> > You forgot ".. at any given moment". IOW, operation you propose is
> > inherently racy. You want to do that - you do that in userland.

> Are you saying that there may not be a ".." some of the time?  Or just 
> that it may spontaneously be relinked?  If it does spontaneously change 
> it doesn't matter, you have still made sure there is access by at least 
> one path.

Think "mv thisdir somewhereelse"
-- 
Dr. Horst H. von Brand                Usuario #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 13:09   ` intermediate summary of ext3-2.4-0.9.4 thread Daniel Phillips
  2001-08-03 14:43     ` Horst von Brand
@ 2001-08-03 17:49     ` Mike Castle
  2001-08-04  3:23       ` Daniel Phillips
  2001-08-03 18:08     ` Alexander Viro
  2 siblings, 1 reply; 77+ messages in thread
From: Mike Castle @ 2001-08-03 17:49 UTC (permalink / raw)
  To: linux-kernel

On Fri, Aug 03, 2001 at 03:09:06PM +0200, Daniel Phillips wrote:
> On Friday 03 August 2001 05:13, Alexander Viro wrote:
> > On Fri, 3 Aug 2001, Daniel Phillips wrote:
> > > There is only one chain of directories from the fd's dentry up to
> > > the root, that's the one to sync.
> >
> > You forgot ".. at any given moment". IOW, operation you propose is
> > inherently racy. You want to do that - you do that in userland.
> 
> Are you saying that there may not be a ".." some of the time?  Or just 
> that it may spontaneously be relinked?  If it does spontaneously change 
> it doesn't matter, you have still made sure there is access by at least 
> one path.


I read the ".." as a typo for "..."  As in Al was suggesting the sentence
should read "There is only one chain of directories at any given moment
from the fd's dentry up to the root...."

At least, that was my reading on it.

mrc
-- 
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 13:09   ` intermediate summary of ext3-2.4-0.9.4 thread Daniel Phillips
  2001-08-03 14:43     ` Horst von Brand
  2001-08-03 17:49     ` Mike Castle
@ 2001-08-03 18:08     ` Alexander Viro
  2001-08-03 18:26       ` Daniel Phillips
                         ` (2 more replies)
  2 siblings, 3 replies; 77+ messages in thread
From: Alexander Viro @ 2001-08-03 18:08 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Horst von Brand, linux-kernel



On Fri, 3 Aug 2001, Daniel Phillips wrote:

> Are you saying that there may not be a ".." some of the time?  Or just 
> that it may spontaneously be relinked?  If it does spontaneously change 
> it doesn't matter, you have still made sure there is access by at least 
> one path.
> 
> The trouble with doing this in userland is, the locked chain of dcache 
> entries isn't there.

There is no _locked_ chain. And if you want to grab the locks on all
ancestors - think again. It means sorting the inodes by address _and_
relocking if any of them had been moved while you were locking the
previous ones. I absolutely refuse to add such crap to the tree and I
seriously suspect that Linus and Alan will do the same.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 18:08     ` Alexander Viro
@ 2001-08-03 18:26       ` Daniel Phillips
  2001-08-03 18:53         ` Alexander Viro
  2001-08-03 18:34       ` Linus Torvalds
  2001-08-03 22:29       ` Anton Altaparmakov
  2 siblings, 1 reply; 77+ messages in thread
From: Daniel Phillips @ 2001-08-03 18:26 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Horst von Brand, linux-kernel

On Friday 03 August 2001 20:08, Alexander Viro wrote:
> On Fri, 3 Aug 2001, Daniel Phillips wrote:
> > Are you saying that there may not be a ".." some of the time?  Or just
> > that it may spontaneously be relinked?  If it does spontaneously change
> > it doesn't matter, you have still made sure there is access by at least
> > one path.
> >
> > The trouble with doing this in userland is, the locked chain of dcache
> > entries isn't there.
>
> There is no _locked_ chain.

Locked as in can't be destroyed (refcount) not i_sem or such, sorry for the
loose usage.

> And if you want to grab the locks on all
> ancestors - think again. It means sorting the inodes by address _and_
> relocking if any of them had been moved while you were locking the
> previous ones. I absolutely refuse to add such crap to the tree and I
> seriously suspect that Linus and Alan will do the same.

--
Daniel

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 18:08     ` Alexander Viro
  2001-08-03 18:26       ` Daniel Phillips
@ 2001-08-03 18:34       ` Linus Torvalds
  2001-08-03 22:01         ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch) Chris Wedgwood
  2001-08-03 22:29       ` Anton Altaparmakov
  2 siblings, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2001-08-03 18:34 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.GSO.4.21.0108031400590.3272-100000@weyl.math.psu.edu>,
Alexander Viro  <viro@math.psu.edu> wrote:
>
>
>On Fri, 3 Aug 2001, Daniel Phillips wrote:
>
>> Are you saying that there may not be a ".." some of the time?  Or just 
>> that it may spontaneously be relinked?  If it does spontaneously change 
>> it doesn't matter, you have still made sure there is access by at least 
>> one path.
>> 
>> The trouble with doing this in userland is, the locked chain of dcache 
>> entries isn't there.
>
>There is no _locked_ chain. And if you want to grab the locks on all
>ancestors - think again. It means sorting the inodes by address _and_
>relocking if any of them had been moved while you were locking the
>previous ones. I absolutely refuse to add such crap to the tree and I
>seriously suspect that Linus and Alan will do the same.

Note that while there is no "absoilutely correct" thing here, I think
the right thing to do (as in "do what the user _expects_ you to do")
would be fairly simple to implement with a simple

	fsync(int fd)
	{
		dentry = fdget(fd);
		do_fsync(dentry);
		for (;;) {
			tmp = dentry;
			dentry = dentry->d_parent;
			if (dentry == tmp)
				break;
			do_fdatasync(dentry);
		}
	}

Add dcount increments as needed (and they _are_ needed, to make sure
that we don't hold on to a dentry while the child has been moved
somewhere else and the dentry now has a zero count). And we only need to
sync up to the closest mount-point, not the global root.

Does this guarantee that we fsync the whole path in the presense of
concurrent renames? No. That's a user problem, why should we care? He
should fsync his other renames too, he didn't ask us to fsync them.

And we don't care about any other paths that the file may have. We're
syncing the path that the user opened, and no others. Again, if the user
opened another path, he should have synced _that_ one.

		Linus

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found] ` <Pine.GSO.4.21.0108022312211.1494-100000@weyl.math.psu.edu>
  2001-08-03 13:09   ` intermediate summary of ext3-2.4-0.9.4 thread Daniel Phillips
@ 2001-08-03 18:36   ` Matthias Andree
  2001-08-03 19:16     ` Alexander Viro
  1 sibling, 1 reply; 77+ messages in thread
From: Matthias Andree @ 2001-08-03 18:36 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Daniel Phillips, Horst von Brand, linux-kernel

On Thu, 02 Aug 2001, Alexander Viro wrote:

> On Fri, 3 Aug 2001, Daniel Phillips wrote:
> 
> > There is only one chain of directories from the fd's dentry up to the 
> > root, that's the one to sync.
> 
> You forgot ".. at any given moment". IOW, operation you propose is inherently
> racy. You want to do that - you do that in userland.

Applications usually protect their playgrounds - separate uid for
instance. If only the application has access to that area, and itself
does not trigger races, "at any given moment" is not a restriction.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 18:26       ` Daniel Phillips
@ 2001-08-03 18:53         ` Alexander Viro
  2001-08-03 20:50           ` Daniel Phillips
  2001-08-04  3:43           ` Matthias Andree
  0 siblings, 2 replies; 77+ messages in thread
From: Alexander Viro @ 2001-08-03 18:53 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Horst von Brand, linux-kernel



On Fri, 3 Aug 2001, Daniel Phillips wrote:

> > There is no _locked_ chain.
> 
> Locked as in can't be destroyed (refcount) not i_sem or such, sorry for the
> loose usage.

Sigh... You need i_sem for fsync(). Moreover, there is no warranty that
set of objects you sync has _anything_ to path by the time when you start
syncing the second one. Application has information about the use of
parts of tree it's interested in. Kernel doesn't. Notice that all this
wankage was full of "oh, but MTA doesn't care for symlinks", "oh, but MTA
doesn't deal with parents renamed", ad nausea. You know what it means?
Right, that kernel shouldn't try to second-guess the userland. Application
knows what fs objects it wants synced. Kernel provides a primitive for
syncing an object and leaves the choice of objects to sync to application.

Folks, putting policy into the kernel is Wrong(tm). And that's precisely
what you are advocating here.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 18:36   ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
@ 2001-08-03 19:16     ` Alexander Viro
  2001-08-04  3:40       ` fdatasync(2) is also there (was: intermediate summary of ext3-2.4-0.9.4 thread) Matthias Andree
  0 siblings, 1 reply; 77+ messages in thread
From: Alexander Viro @ 2001-08-03 19:16 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Daniel Phillips, Horst von Brand, linux-kernel



On Fri, 3 Aug 2001, Matthias Andree wrote:

> Applications usually protect their playgrounds - separate uid for
> instance. If only the application has access to that area, and itself
> does not trigger races, "at any given moment" is not a restriction.

Bingo. The whole thing relies on second-guessing the application.
BTW, I can think of very legitimate cases when we want to create
a bunch of files, fsync them as we go and then fsync the directory
where they had been created. Application knows what and when should
be synced _and_ it has a way to ask kernel to sync an object.

BTW, symlinks may be a problem - they can't be opened and symlink
creation is asynchronous. And there are applications that _do_
care about them - for all they care, lost symlink is as bad as
the need to scan lost+found


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 18:53         ` Alexander Viro
@ 2001-08-03 20:50           ` Daniel Phillips
  2001-08-04  3:43           ` Matthias Andree
  1 sibling, 0 replies; 77+ messages in thread
From: Daniel Phillips @ 2001-08-03 20:50 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Horst von Brand, linux-kernel

On Friday 03 August 2001 20:53, Alexander Viro wrote:
> On Fri, 3 Aug 2001, Daniel Phillips wrote:
> > > There is no _locked_ chain.
> >
> > Locked as in can't be destroyed (refcount) not i_sem or such, sorry for
> > the loose usage.
>
> Sigh... You need i_sem for fsync().

We can drop it before syncing the parent, the point is, the dcache entry
stays.

> Moreover, there is no warranty that
> set of objects you sync has _anything_ to path by the time when you start
> syncing the second one.

OK, you win, I'll provide this example:

	   A				   B

  echo hello >/a/b/c/d
  fsync(d)->
    fsync_one(d)
				rename /a/b/c/d as /x/y/z/d
      fsync_one(c)
        fsync_one(b)
          fsync_one(a)
            fsync_one(/)

where fsync_one looks roughly like our current sys_fsync.

So we fsynced and we still might have lost the path to d.

The moral of the story: if the filesystem isn't designed to provide a
guaranteed commit on rename then we shouldn't try to fix the VFS so it
sorta does.

> Application has information about the use of
> parts of tree it's interested in. Kernel doesn't. Notice that all this
> wankage was full of "oh, but MTA doesn't care for symlinks", "oh, but MTA
> doesn't deal with parents renamed", ad nausea. You know what it means?
> Right, that kernel shouldn't try to second-guess the userland. Application
> knows what fs objects it wants synced. Kernel provides a primitive for
> syncing an object and leaves the choice of objects to sync to application.
>
> Folks, putting policy into the kernel is Wrong(tm). And that's precisely
> what you are advocating here.

I needed an example where even a new, improved sys_fsync would fail to
do the right thing for Ext2 in the absence of specific coding in the
MTA.  So the MTA is either designed to handle dumb filesystems or it
isn't.  It's pretty much a moot point since we have four filesystems
now in a nearly usable state that can provide the needed commit
guarantees, which should, as has been pointed out in this thread,
provide MTA's with both faster and more reliable operation.

--
Daniel

^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 18:34       ` Linus Torvalds
@ 2001-08-03 22:01         ` Chris Wedgwood
  2001-08-03 22:33           ` Alexander Viro
                             ` (4 more replies)
  0 siblings, 5 replies; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-03 22:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Alan Cox, Chris Mason

[-- Attachment #1: Type: text/plain, Size: 863 bytes --]

On Fri, Aug 03, 2001 at 06:34:14PM +0000, Linus Torvalds wrote:

    	fsync(int fd)
    	{
    		dentry = fdget(fd);
    		do_fsync(dentry);
    		for (;;) {
    			tmp = dentry;
    			dentry = dentry->d_parent;
    			if (dentry == tmp)
    				break;
    			do_fdatasync(dentry);
    		}
    	}

I really like this idea. Can people please try out the attached patch?

Please note, it contains a couple of things that need not be there in
the final version.

Note, there is also a reiserfs fix in here because we can call
f_op->fsync on a directory and without this fix it will BUG!  Chris,
perhaps you can suggest a better fix?


Linus, one more thing --- the first argument to ->fsync is struct file*
and nothing uses it, I'd like to blow it away or would you prefer we
wait to 2.5.x as its essentially and API change and will break XFS,
JFS, etc.



   --cw

[-- Attachment #2: 2.4.8-pre3_fsync-entire-path.patch --]
[-- Type: text/plain, Size: 2056 bytes --]

diff -Nur --exclude=*.o --exclude=*~ linux-2.4.8-pre3/fs/buffer.c linux-2.4.8-pre3+expt/fs/buffer.c
--- linux-2.4.8-pre3/fs/buffer.c	Wed Jul 25 19:03:05 2001
+++ linux-2.4.8-pre3+expt/fs/buffer.c	Sat Aug  4 10:53:07 2001
@@ -379,33 +379,45 @@
 
 asmlinkage long sys_fsync(unsigned int fd)
 {
-	struct file * file;
-	struct dentry * dentry;
-	struct inode * inode;
+	struct file *file;
+	struct dentry *dentry;
+	struct inode *inode;
 	int err;
 
 	err = -EBADF;
 	file = fget(fd);
 	if (!file)
 		goto out;
-
-	dentry = file->f_dentry;
-	inode = dentry->d_inode;
-
 	err = -EINVAL;
 	if (!file->f_op || !file->f_op->fsync)
 		goto out_putf;
 
-	/* We need to protect against concurrent writers.. */
-	down(&inode->i_sem);
-	filemap_fdatasync(inode->i_mapping);
-	err = file->f_op->fsync(file, dentry, 0);
-	filemap_fdatawait(inode->i_mapping);
-	up(&inode->i_sem);
+	/* walk the path (dentry chain) fsync'ing everything along
+	   it. This algorithm to do "do what the user _expects_ you to
+	   do" was suggested by Linus. --cw */
+	dentry = file->f_dentry;
+	do {
+		inode = dentry->d_inode;
+		/* We need to protect against concurrent writers.. */
+		down(&inode->i_sem);
+		filemap_fdatasync(inode->i_mapping);
+		err = -EINVAL;
+		if (!file->f_op || !file->f_op->fsync)
+			goto out_putf;
+		err = file->f_op->fsync(file, dentry, 0);
+		filemap_fdatawait(inode->i_mapping);
+		up(&inode->i_sem);
+
+		/* stop at root of fs */
+		if(dentry == dentry->d_parent)
+			break;
+
+		dentry = dentry->d_parent;
+	} while(dentry && !err);
 
-out_putf:
+ out_putf:
 	fput(file);
-out:
+ out:
 	return err;
 }
 
diff -Nur --exclude=*.o --exclude=*~ linux-2.4.8-pre3/fs/reiserfs/file.c linux-2.4.8-pre3+expt/fs/reiserfs/file.c
--- linux-2.4.8-pre3/fs/reiserfs/file.c	Fri Jul 20 20:37:09 2001
+++ linux-2.4.8-pre3+expt/fs/reiserfs/file.c	Sat Aug  4 09:06:14 2001
@@ -90,6 +90,9 @@
 
   lock_kernel() ;
 
+  if (S_ISDIR(p_s_inode->i_mode))
+          return reiserfs_dir_fsync(p_s_filp, p_s_dentry, datasync);
+
   if (!S_ISREG(p_s_inode->i_mode))
       BUG ();
 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 18:08     ` Alexander Viro
  2001-08-03 18:26       ` Daniel Phillips
  2001-08-03 18:34       ` Linus Torvalds
@ 2001-08-03 22:29       ` Anton Altaparmakov
  2001-08-03 23:06         ` Chris Wedgwood
  2001-08-06 15:23         ` looking for resources for designing loopback filesystem dave-mlist
  2 siblings, 2 replies; 77+ messages in thread
From: Anton Altaparmakov @ 2001-08-03 22:29 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

Your patch down()s a semaphore without calling up() in the error code path...

Best regards,

Anton

At 23:01 03/08/2001, Chris Wedgwood wrote:
>On Fri, Aug 03, 2001 at 06:34:14PM +0000, Linus Torvalds wrote:
>
>         fsync(int fd)
>         {
>                 dentry = fdget(fd);
>                 do_fsync(dentry);
>                 for (;;) {
>                         tmp = dentry;
>                         dentry = dentry->d_parent;
>                         if (dentry == tmp)
>                                 break;
>                         do_fdatasync(dentry);
>                 }
>         }
>
>I really like this idea. Can people please try out the attached patch?
>
>Please note, it contains a couple of things that need not be there in
>the final version.
>
>Note, there is also a reiserfs fix in here because we can call
>f_op->fsync on a directory and without this fix it will BUG!  Chris,
>perhaps you can suggest a better fix?
>
>
>Linus, one more thing --- the first argument to ->fsync is struct file*
>and nothing uses it, I'd like to blow it away or would you prefer we
>wait to 2.5.x as its essentially and API change and will break XFS,
>JFS, etc.
>
>
>
>    --cw

-- 
   "Nothing succeeds like success." - Alexandre Dumas
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 22:01         ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch) Chris Wedgwood
@ 2001-08-03 22:33           ` Alexander Viro
  2001-08-03 23:16             ` Chris Wedgwood
  2001-08-03 22:45           ` Alexander Viro
                             ` (3 subsequent siblings)
  4 siblings, 1 reply; 77+ messages in thread
From: Alexander Viro @ 2001-08-03 22:33 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason



On Sat, 4 Aug 2001, Chris Wedgwood wrote:

> I really like this idea. Can people please try out the attached patch?
> 
> Please note, it contains a couple of things that need not be there in
> the final version.

Like an oopsable race absolutely trivial to exploit for any user with write
access to fs?

	a) dentry can freed under you. Just rename the parent while you
are syncing it. Then you'll block on attempt to take ->i_sem on
grandparent and merrily go to hell when parent will be moved away and
rename(2) will do dput() on grandparent.

	b) access to ->d_parent requires at least one of the following:
dcache_lock, BKL, i_sem or ->i_zombie on inode of parent.

	c) as it is, you will get a hell of IO load on a dumb fs.
dumb == anything that uses file_fsync() as ->fsync() of directories.
You'll do full sync of fs on every bloody step.

	d) sequence of inodes you sync has only one property guaranteed:
at some moment nth inode is a parent of (n-1)th. That's it. E.g. it's easy
to get a situation when _none_ of the inodes you sync had ever been a
grandparent of the original inode.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 22:01         ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch) Chris Wedgwood
  2001-08-03 22:33           ` Alexander Viro
@ 2001-08-03 22:45           ` Alexander Viro
  2001-08-03 23:09             ` Chris Wedgwood
  2001-08-04  1:08           ` Andrew Morton
                             ` (2 subsequent siblings)
  4 siblings, 1 reply; 77+ messages in thread
From: Alexander Viro @ 2001-08-03 22:45 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason



On Sat, 4 Aug 2001, Chris Wedgwood wrote:

> Linus, one more thing --- the first argument to ->fsync is struct file*
> and nothing uses it, I'd like to blow it away or would you prefer we

nfs_fsync().


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 22:29       ` Anton Altaparmakov
@ 2001-08-03 23:06         ` Chris Wedgwood
  2001-08-06 15:23         ` looking for resources for designing loopback filesystem dave-mlist
  1 sibling, 0 replies; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-03 23:06 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

On Fri, Aug 03, 2001 at 11:29:29PM +0100, Anton Altaparmakov wrote:

    Your patch down()s a semaphore without calling up() in the error
    code path...

Errors? Nah, you won't get errors :)


Yes, it does... I did a quick copy of the code from above and didn't
check it. I'll fix that shortly.



  --cw

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 22:45           ` Alexander Viro
@ 2001-08-03 23:09             ` Chris Wedgwood
  2001-08-03 23:15               ` Alexander Viro
  0 siblings, 1 reply; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-03 23:09 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

On Fri, Aug 03, 2001 at 06:45:47PM -0400, Alexander Viro wrote:

    nfs_fsync().

Ah, well... then I'm not sure how the loop should look.  I use
f->fsync(file, dentry, ...) where file references the original file
not any of the parent path components.

Obviously, for ext2 and reiserfs (which is what I have here) this
won't matter --- will it for NFS?  If so, so I need to open/etc. for
each parent component to get a valid struct file*?

Please bear with me, I'm still learning :)



  --cw

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 23:09             ` Chris Wedgwood
@ 2001-08-03 23:15               ` Alexander Viro
  2001-08-03 23:20                 ` Chris Wedgwood
  0 siblings, 1 reply; 77+ messages in thread
From: Alexander Viro @ 2001-08-03 23:15 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason



On Sat, 4 Aug 2001, Chris Wedgwood wrote:

> On Fri, Aug 03, 2001 at 06:45:47PM -0400, Alexander Viro wrote:
> 
>     nfs_fsync().
> 
> Ah, well... then I'm not sure how the loop should look.  I use
> f->fsync(file, dentry, ...) where file references the original file
> not any of the parent path components.
> 
> Obviously, for ext2 and reiserfs (which is what I have here) this
> won't matter --- will it for NFS?  If so, so I need to open/etc. for
> each parent component to get a valid struct file*?

<shrug> credentials cache. 2.5 fodder. Notice that with NFS you don't
have fsync for directories. At all. So that's not a problem in that
particular case - you can pass NULL on all subsequent iterations.
On the first step you need struct file * - NFS needs credentials
to pass to the server.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 22:33           ` Alexander Viro
@ 2001-08-03 23:16             ` Chris Wedgwood
  2001-08-03 23:23               ` Alexander Viro
  0 siblings, 1 reply; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-03 23:16 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

On Fri, Aug 03, 2001 at 06:33:37PM -0400, Alexander Viro wrote:

    Like an oopsable race absolutely trivial to exploit for any user
    with write access to fs?

:(

    	a) dentry can freed under you. Just rename the parent while
    you are syncing it. Then you'll block on attempt to take ->i_sem
    on grandparent and merrily go to hell when parent will be moved
    away and rename(2) will do dput() on grandparent.

Can I lock a dentry to prevent this?

    	b) access to ->d_parent requires at least one of the
    following: dcache_lock, BKL, i_sem or ->i_zombie on inode of
    parent.

I assume BLK lock is undesirable here?  dcache_lock seems quite
reasonable and very cheap.  I can't see how else to do it as getting
inode of parent required d_parent.

    	c) as it is, you will get a hell of IO load on a dumb fs.
    dumb == anything that uses file_fsync() as ->fsync() of
    directories.  You'll do full sync of fs on every bloody step.

Yes, but it reality it's not that bad. As is, reiserfs and ext3 (used
to, might not now) sync pending transaction on fsync anyhow.  Run lilo
recently on reiserfs of ext3?  Ever seemed slow?  strace -f and you'll
see it calls fsync after writing to the map files each time (no idea
why) and it it still acceptable.

    	d) sequence of inodes you sync has only one property
    guaranteed: at some moment nth inode is a parent of
    (n-1)th. That's it. E.g. it's easy to get a situation when _none_
    of the inodes you sync had ever been a grandparent of the original
    inode.

I'm not sure I follow you hear... is this because of bind semantics or
somethng else?  I can see it relevant in the case of a fs mount in
/var/somewhere/blah when you fsync /var/somewhere/blah/dir/file, you
don't need to go past blah --- but how would I detect this 'edge'?




Thanks!

  --cw

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 23:15               ` Alexander Viro
@ 2001-08-03 23:20                 ` Chris Wedgwood
  2001-08-03 23:25                   ` Alexander Viro
  0 siblings, 1 reply; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-03 23:20 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

On Fri, Aug 03, 2001 at 07:15:59PM -0400, Alexander Viro wrote:

    <shrug> credentials cache. 2.5 fodder. Notice that with NFS you
    don't have fsync for directories. At all. So that's not a problem
    in that particular case - you can pass NULL on all subsequent
    iterations.  On the first step you need struct file * - NFS needs
    credentials to pass to the server.

How about passing NULL then for now? --- and explicitly requiring
->fsync accept NULL for the first argument and ignore it?  It would be
nice to know that this is the case then though, because we could stop
immediately.



  --cw


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 23:16             ` Chris Wedgwood
@ 2001-08-03 23:23               ` Alexander Viro
  2001-08-04  3:53                 ` Matthias Andree
  0 siblings, 1 reply; 77+ messages in thread
From: Alexander Viro @ 2001-08-03 23:23 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason



On Sat, 4 Aug 2001, Chris Wedgwood wrote:

> Can I lock a dentry to prevent this?

dget_locked() under dcache_lock + dput() when you are done.

>     	b) access to ->d_parent requires at least one of the
>     following: dcache_lock, BKL, i_sem or ->i_zombie on inode of
>     parent.
> 
> I assume BLK lock is undesirable here?  dcache_lock seems quite
> reasonable and very cheap.  I can't see how else to do it as getting
> inode of parent required d_parent.
> 
>     	c) as it is, you will get a hell of IO load on a dumb fs.
>     dumb == anything that uses file_fsync() as ->fsync() of
>     directories.  You'll do full sync of fs on every bloody step.
> 
> Yes, but it reality it's not that bad. As is, reiserfs and ext3 (used
> to, might not now) sync pending transaction on fsync anyhow.  Run lilo
> recently on reiserfs of ext3?  Ever seemed slow?  strace -f and you'll
> see it calls fsync after writing to the map files each time (no idea
> why) and it it still acceptable.

Sure. And now each fsync() turns into a bunch of such calls.
 
>     	d) sequence of inodes you sync has only one property
>     guaranteed: at some moment nth inode is a parent of
>     (n-1)th. That's it. E.g. it's easy to get a situation when _none_
>     of the inodes you sync had ever been a grandparent of the original
>     inode.
> 
> I'm not sure I follow you hear... is this because of bind semantics or
> somethng else?  I can see it relevant in the case of a fs mount in
> /var/somewhere/blah when you fsync /var/somewhere/blah/dir/file, you
> don't need to go past blah --- but how would I detect this 'edge'?

It has nothing to bindings/mount/etc. fsync /a/b/c. While c is written
out, mv a/b/c /a/d/c. While d is written out, mv a/d/c a/b/c && mv a/d e/d
Through all these renames /a remained the grandparent of c. You won't sync it -
you sync c, then d, then e, then root.

Constructing less convoluted examples with similar results is left as an
exercise to reader...


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 23:20                 ` Chris Wedgwood
@ 2001-08-03 23:25                   ` Alexander Viro
  2001-08-03 23:35                     ` Chris Wedgwood
  0 siblings, 1 reply; 77+ messages in thread
From: Alexander Viro @ 2001-08-03 23:25 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason



On Sat, 4 Aug 2001, Chris Wedgwood wrote:

> On Fri, Aug 03, 2001 at 07:15:59PM -0400, Alexander Viro wrote:
> 
>     <shrug> credentials cache. 2.5 fodder. Notice that with NFS you
>     don't have fsync for directories. At all. So that's not a problem
>     in that particular case - you can pass NULL on all subsequent
>     iterations.  On the first step you need struct file * - NFS needs
>     credentials to pass to the server.
> 
> How about passing NULL then for now? --- and explicitly requiring
> ->fsync accept NULL for the first argument and ignore it?  It would be
> nice to know that this is the case then though, because we could stop
> immediately.

You need credentials to sync a regular file on any network filesystem.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 23:25                   ` Alexander Viro
@ 2001-08-03 23:35                     ` Chris Wedgwood
  2001-08-03 23:41                       ` Alexander Viro
  2001-08-03 23:42                       ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch) Alan Cox
  0 siblings, 2 replies; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-03 23:35 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

On Fri, Aug 03, 2001 at 07:25:19PM -0400, Alexander Viro wrote:

    You need credentials to sync a regular file on any network
    filesystem.

For 2.5.x I assume your planning or a credentials cache?  Something
like dentry->d_creds or something?  If that's the case we still don't
need the struct file* to be passed --- but I suspect that's not the
case and I really don't understand.



  --cw




^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 23:35                     ` Chris Wedgwood
@ 2001-08-03 23:41                       ` Alexander Viro
  2001-08-03 23:46                         ` Chris Wedgwood
  2001-08-03 23:42                       ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch) Alan Cox
  1 sibling, 1 reply; 77+ messages in thread
From: Alexander Viro @ 2001-08-03 23:41 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason



On Sat, 4 Aug 2001, Chris Wedgwood wrote:

> On Fri, Aug 03, 2001 at 07:25:19PM -0400, Alexander Viro wrote:
> 
>     You need credentials to sync a regular file on any network
>     filesystem.
> 
> For 2.5.x I assume your planning or a credentials cache?  Something
> like dentry->d_creds or something?  If that's the case we still don't
> need the struct file* to be passed --- but I suspect that's not the
> case and I really don't understand.

file->f_cred. Different people opening the same file can have different
credentials (e.g. credentials can be revoked)


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 23:35                     ` Chris Wedgwood
  2001-08-03 23:41                       ` Alexander Viro
@ 2001-08-03 23:42                       ` Alan Cox
  2001-08-03 23:44                         ` Chris Wedgwood
  1 sibling, 1 reply; 77+ messages in thread
From: Alan Cox @ 2001-08-03 23:42 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Alexander Viro, Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

> For 2.5.x I assume your planning or a credentials cache?  Something
> like dentry->d_creds or something?  If that's the case we still don't
> need the struct file* to be passed --- but I suspect that's not the
> case and I really don't understand.

It can't come off the dentry as multiple people can have the same file open
with different rights.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 23:42                       ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch) Alan Cox
@ 2001-08-03 23:44                         ` Chris Wedgwood
  0 siblings, 0 replies; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-03 23:44 UTC (permalink / raw)
  To: Alan Cox; +Cc: Alexander Viro, Linus Torvalds, linux-kernel, Chris Mason

On Sat, Aug 04, 2001 at 12:42:38AM +0100, Alan Cox wrote:

    It can't come off the dentry as multiple people can have the same
    file open with different rights.

Yes, of course.

If that's the case, I'm not sure how you pick reasonable creds. for
fsyncing path components for a network filesystem, unless you (as the
calling user) open each component and sync that bit by bit --- which
could almost be done in libc (kernel is easier though).


  --cw


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 23:41                       ` Alexander Viro
@ 2001-08-03 23:46                         ` Chris Wedgwood
  2001-08-03 23:53                           ` Alexander Viro
  0 siblings, 1 reply; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-03 23:46 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

On Fri, Aug 03, 2001 at 07:41:40PM -0400, Alexander Viro wrote:

    file->f_cred. Different people opening the same file can have
    different credentials (e.g. credentials can be revoked)

Sure, but if I

        f = open("/home/viro/file", ... );
        fsync(f);

I only have creds for 'file' --- I have no such thing for 'viro' or
'home', so I can't do anything sensible here.



  --cw





^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 23:46                         ` Chris Wedgwood
@ 2001-08-03 23:53                           ` Alexander Viro
  2001-08-04  7:45                             ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semanticchange patch) Hans Reiser
  0 siblings, 1 reply; 77+ messages in thread
From: Alexander Viro @ 2001-08-03 23:53 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason



On Sat, 4 Aug 2001, Chris Wedgwood wrote:

> On Fri, Aug 03, 2001 at 07:41:40PM -0400, Alexander Viro wrote:
> 
>     file->f_cred. Different people opening the same file can have
>     different credentials (e.g. credentials can be revoked)
> 
> Sure, but if I
> 
>         f = open("/home/viro/file", ... );
>         fsync(f);
> 
> I only have creds for 'file' --- I have no such thing for 'viro' or
> 'home', so I can't do anything sensible here.

Credentials are about "I'm foo", not "I'm allowed to do this with that".
Server decides what you can do.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic  change patch)
  2001-08-03 22:01         ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch) Chris Wedgwood
  2001-08-03 22:33           ` Alexander Viro
  2001-08-03 22:45           ` Alexander Viro
@ 2001-08-04  1:08           ` Andrew Morton
  2001-08-04  1:19             ` Chris Wedgwood
  2001-08-04 17:35           ` Jan Harkes
  2001-08-06 15:02           ` Chris Mason
  4 siblings, 1 reply; 77+ messages in thread
From: Andrew Morton @ 2001-08-04  1:08 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

Chris Wedgwood wrote:
> 
> On Fri, Aug 03, 2001 at 06:34:14PM +0000, Linus Torvalds wrote:
> 
>         fsync(int fd)
>         {
>                 dentry = fdget(fd);
>                 do_fsync(dentry);
>                 for (;;) {
>                         tmp = dentry;
>                         dentry = dentry->d_parent;
>                         if (dentry == tmp)
>                                 break;
>                         do_fdatasync(dentry);
>                 }
>         }
> 
> I really like this idea. Can people please try out the attached patch?

Ow.  You just crippled ext3.

I don't think an ext2 problem (which I don't think is a problem
at all) should be "fixed" at the VFS layer when other filesystems
are perfectly happy without it, no?

This whole thread, talking about "linux this" and "linux that"
is off-base.  It's ext2 we're talking about.  This MTA requirement
is a highly unusual and specialised thing - I don't see why the
general-purpose ext2 should bear the burden of supporting it when
other filesystems such as reiserfs (I think?) and ext3 support it
naturally and better than ext2 ever will.

-

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-04  1:08           ` Andrew Morton
@ 2001-08-04  1:19             ` Chris Wedgwood
  2001-08-04  1:45               ` Andrew Morton
  2001-08-04 21:07               ` Andrew Morton
  0 siblings, 2 replies; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-04  1:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

On Fri, Aug 03, 2001 at 06:08:49PM -0700, Andrew Morton wrote:

    Ow.  You just crippled ext3.

How so? The Flush all transactions on fsync behaviour that resierfs
did/does have at present too?  (There are 'fixes' to reiserfs for
this).

    I don't think an ext2 problem (which I don't think is a problem at
    all) should be "fixed" at the VFS layer when other filesystems are
    perfectly happy without it, no?

If you want to be sure that when you fsync a file, that, silly bugger
rename games further up the path aside, the entire path is also on
disk, the VFS is the only place to do it with the current fs API.

really, there is _some_ merit in the argument that

        open
        fsync
        close
<crash>

shouldn't loose the file...

    This whole thread, talking about "linux this" and "linux that" is
    off-base.  It's ext2 we're talking about.  This MTA requirement is
    a highly unusual and specialised thing - I don't see why the
    general-purpose ext2 should bear the burden of supporting it when
    other filesystems such as reiserfs (I think?) and ext3 support it
    naturally and better than ext2 ever will.

Well, since it will only sync dirty blocks, it will hardly hurt ext2
that much at all --- and it will only force the dirty blocks in path
components to be written when you fsync the file, thats probably only
a single block anyhow.

FWIW, I don't think it's unreasonable we do this, nor does it fix all
the potential MTA problems :)



Anyhow, that patch was bogus.  As soon as I get some more clues, I'll
send another one (trying to get more clueful now, not having much
success!).




  --cw

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic  change patch)
  2001-08-04  1:19             ` Chris Wedgwood
@ 2001-08-04  1:45               ` Andrew Morton
  2001-08-04  4:04                 ` Matthias Andree
  2001-08-04 21:07               ` Andrew Morton
  1 sibling, 1 reply; 77+ messages in thread
From: Andrew Morton @ 2001-08-04  1:45 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

Chris Wedgwood wrote:
> 
> On Fri, Aug 03, 2001 at 06:08:49PM -0700, Andrew Morton wrote:
> 
>     Ow.  You just crippled ext3.
> 
> How so? The Flush all transactions on fsync behaviour that resierfs
> did/does have at present too?  (There are 'fixes' to reiserfs for
> this).

That, plus the fact that ext3 gets its synchronous-op scalability
from batching the transactions from multiple threads together.
The holding of parent->parent->i_sem across synchronous writes could
defeat that.

ext3 does physical, block-level journalling.  (its journalling layer is
actually designed to be fs-independent so one could journal other filesystems
with it).  So there is no tracking between a particular fs event and a
particular transaction.  A transaction is just a blob of blocks which
can encompass thousands of fs events.

>     I don't think an ext2 problem (which I don't think is a problem at
>     all) should be "fixed" at the VFS layer when other filesystems are
>     perfectly happy without it, no?
> 
> If you want to be sure that when you fsync a file, that, silly bugger
> rename games further up the path aside, the entire path is also on
> disk, the VFS is the only place to do it with the current fs API.
> 
> really, there is _some_ merit in the argument that
> 
>         open
>         fsync
>         close
> <crash>
> 
> shouldn't loose the file...

Agreed - I think it's the expected and sensible behaviour.  But I've seen
no complaints about it except for use in a few specialised applications.
Where "a few" == "one", actually.

>     This whole thread, talking about "linux this" and "linux that" is
>     off-base.  It's ext2 we're talking about.  This MTA requirement is
>     a highly unusual and specialised thing - I don't see why the
>     general-purpose ext2 should bear the burden of supporting it when
>     other filesystems such as reiserfs (I think?) and ext3 support it
>     naturally and better than ext2 ever will.
> 
> Well, since it will only sync dirty blocks, it will hardly hurt ext2
> that much at all --- and it will only force the dirty blocks in path
> components to be written when you fsync the file, thats probably only
> a single block anyhow.

mmm... Holding i_sem across multiple revs of the disk will hurt.  It
doesn't *need* to be held while we're waiting on IO, but fixing that
would be a big change, and there has been little motivation to change
things because it is for specialised apps.

-

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 17:49     ` Mike Castle
@ 2001-08-04  3:23       ` Daniel Phillips
  0 siblings, 0 replies; 77+ messages in thread
From: Daniel Phillips @ 2001-08-04  3:23 UTC (permalink / raw)
  To: Mike Castle, linux-kernel

On Friday 03 August 2001 19:49, Mike Castle wrote:
> On Fri, Aug 03, 2001 at 03:09:06PM +0200, Daniel Phillips wrote:
> > On Friday 03 August 2001 05:13, Alexander Viro wrote:
> > > On Fri, 3 Aug 2001, Daniel Phillips wrote:
> > > > There is only one chain of directories from the fd's dentry up to
> > > > the root, that's the one to sync.
> > >
> > > You forgot ".. at any given moment". IOW, operation you propose is
> > > inherently racy. You want to do that - you do that in userland.
> >
> > Are you saying that there may not be a ".." some of the time?  Or just
> > that it may spontaneously be relinked?  If it does spontaneously change
> > it doesn't matter, you have still made sure there is access by at least
> > one path.
>
> I read the ".." as a typo for "..."  As in Al was suggesting the sentence
> should read "There is only one chain of directories at any given moment
> from the fd's dentry up to the root...."

Heh, after some practice you get good at decoding Alspeak, it's not harder
than MIX.

--
Daniel


^ permalink raw reply	[flat|nested] 77+ messages in thread

* fdatasync(2) is also there (was: intermediate summary of ext3-2.4-0.9.4 thread)
  2001-08-03 19:16     ` Alexander Viro
@ 2001-08-04  3:40       ` Matthias Andree
  2001-08-05  0:28         ` Mike Castle
  0 siblings, 1 reply; 77+ messages in thread
From: Matthias Andree @ 2001-08-04  3:40 UTC (permalink / raw)
  To: Alexander Viro; +Cc: linux-kernel

On Fri, 03 Aug 2001, Alexander Viro wrote:

> Bingo. The whole thing relies on second-guessing the application.
> BTW, I can think of very legitimate cases when we want to create
> a bunch of files, fsync them as we go and then fsync the directory
> where they had been created. Application knows what and when should
> be synced _and_ it has a way to ask kernel to sync an object.

How portable is fsync()ing the directory?

How USEFUL is it to the application if all other boxen fsync() the
directory entries along with the file?

You want to sync the files, but the directory only after you created all
of the files? Use fdatasync(2) for the files - it doesn't flush meta
data, then sync the directory. It's POSIX, it's SUS v2.

Not everyone has it, though, so you may need to fall back to fsync() on
those systems.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 18:53         ` Alexander Viro
  2001-08-03 20:50           ` Daniel Phillips
@ 2001-08-04  3:43           ` Matthias Andree
  1 sibling, 0 replies; 77+ messages in thread
From: Matthias Andree @ 2001-08-04  3:43 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Daniel Phillips, Horst von Brand, linux-kernel

On Fri, 03 Aug 2001, Alexander Viro wrote:

> Sigh... You need i_sem for fsync(). Moreover, there is no warranty that
> set of objects you sync has _anything_ to path by the time when you start
> syncing the second one. Application has information about the use of
> parts of tree it's interested in. Kernel doesn't. Notice that all this
> wankage was full of "oh, but MTA doesn't care for symlinks", "oh, but MTA
> doesn't deal with parents renamed", ad nausea. You know what it means?

I know a few MTAs, but none of them use symlinks, they always use hard
links (if at all).

MTAs don't rename parent directories.

> Folks, putting policy into the kernel is Wrong(tm). And that's precisely
> what you are advocating here.

Is putting options with a user-space interface (mount option, file
system option such as chattr) wrong as well? Is making fsync() more
compatible wrong?

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 23:23               ` Alexander Viro
@ 2001-08-04  3:53                 ` Matthias Andree
  2001-08-04  5:48                   ` Alexander Viro
  0 siblings, 1 reply; 77+ messages in thread
From: Matthias Andree @ 2001-08-04  3:53 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linux-Kernel mailing list

On Fri, 03 Aug 2001, Alexander Viro wrote:

> It has nothing to bindings/mount/etc. fsync /a/b/c. While c is written
> out, mv a/b/c /a/d/c. While d is written out, mv a/d/c a/b/c && mv a/d e/d
> Through all these renames /a remained the grandparent of c. You won't sync it -
> you sync c, then d, then e, then root.

Which looks like the right thing, what used to be a/b/c is now e/d/c --
and you synced c, d, and e.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-04  1:45               ` Andrew Morton
@ 2001-08-04  4:04                 ` Matthias Andree
  2001-08-04 18:30                   ` Chris Wedgwood
  2001-08-04 21:07                   ` Andrew Morton
  0 siblings, 2 replies; 77+ messages in thread
From: Matthias Andree @ 2001-08-04  4:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chris Wedgwood, Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

On Fri, 03 Aug 2001, Andrew Morton wrote:

> > really, there is _some_ merit in the argument that
> > 
> >         open
> >         fsync
> >         close
> > <crash>
> > 
> > shouldn't loose the file...

...

> mmm... Holding i_sem across multiple revs of the disk will hurt.  It
> doesn't *need* to be held while we're waiting on IO, but fixing that
> would be a big change, and there has been little motivation to change
> things because it is for specialised apps.

I have no clues of the inode, dentry whatever kernel structure names of
Linux. However, aren't we already at the point that ext3 fsync() flushes
the corresponding dirents? How difficult would it be to have the inode
track changes to its dirents such as rename or link? Without breaking it
if someone renames a parent or grandparent? I mean, FFS + softupdates
can do it. The MTA doesn't really care for the implementation, it cares
that it never loses its files over a crash -- where finding it renamed
to /mount/point/lost+found is considered a loss. For people that want
the data synced and want to sync the meta data later, there is
fdatasync() as they go and either fsync()ing the files or fsync()ing
their directory as they finish.

unlink and particularly symlink will need separate consideration, but
that's left for later.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-04  3:53                 ` Matthias Andree
@ 2001-08-04  5:48                   ` Alexander Viro
  0 siblings, 0 replies; 77+ messages in thread
From: Alexander Viro @ 2001-08-04  5:48 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Linux-Kernel mailing list



On Sat, 4 Aug 2001, Matthias Andree wrote:

> On Fri, 03 Aug 2001, Alexander Viro wrote:
> 
> > It has nothing to bindings/mount/etc. fsync /a/b/c. While c is written
> > out, mv a/b/c /a/d/c. While d is written out, mv a/d/c a/b/c && mv a/d e/d
> > Through all these renames /a remained the grandparent of c. You won't sync it -
> > you sync c, then d, then e, then root.
> 
> Which looks like the right thing, what used to be a/b/c is now e/d/c --
> and you synced c, d, and e.

Like hell it is.

/        /a        /a/b        /a/b/c        /a/d        /e
                               ^^^^^^
/        /a        /a/b        /a/d/c        /a/d        /e
                               ^^^^^^
/        /a        /a/b        /a/b/c        /a/d        /e
                                             ^^^^
/        /a        /a/b        /a/b/c        /e/d        /e
                                             ^^^^
/        /a        /a/b        /a/b/c        /e/d        /e
                                                         ^^
/        /a        /a/b        /a/b/c        /e/d        /e
^

Result of these renames is the same as mv /a/d /e/d. See above for names
and currently synced inode during that sequence...


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semanticchange  patch)
  2001-08-03 23:53                           ` Alexander Viro
@ 2001-08-04  7:45                             ` Hans Reiser
  2001-08-04 18:31                               ` Chris Wedgwood
  0 siblings, 1 reply; 77+ messages in thread
From: Hans Reiser @ 2001-08-04  7:45 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Chris Wedgwood, Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

Alexander Viro wrote:
> 
> On Sat, 4 Aug 2001, Chris Wedgwood wrote:
> 
> > On Fri, Aug 03, 2001 at 07:41:40PM -0400, Alexander Viro wrote:
> >
> >     file->f_cred. Different people opening the same file can have
> >     different credentials (e.g. credentials can be revoked)
> >
> > Sure, but if I
> >
> >         f = open("/home/viro/file", ... );
> >         fsync(f);
> >
> > I only have creds for 'file' --- I have no such thing for 'viro' or
> > 'home', so I can't do anything sensible here.
> 
> Credentials are about "I'm foo", not "I'm allowed to do this with that".
> Server decides what you can do.
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


Can you define f_cred for us?

Hans

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 22:01         ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch) Chris Wedgwood
                             ` (2 preceding siblings ...)
  2001-08-04  1:08           ` Andrew Morton
@ 2001-08-04 17:35           ` Jan Harkes
  2001-08-04 18:18             ` Chris Wedgwood
  2001-08-06 15:02           ` Chris Mason
  4 siblings, 1 reply; 77+ messages in thread
From: Jan Harkes @ 2001-08-04 17:35 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

On Sat, Aug 04, 2001 at 10:01:43AM +1200, Chris Wedgwood wrote:
> On Fri, Aug 03, 2001 at 06:34:14PM +0000, Linus Torvalds wrote:
> 
>     	fsync(int fd)
>     	{
>     		dentry = fdget(fd);
>     		do_fsync(dentry);
>     		for (;;) {
>     			tmp = dentry;
>     			dentry = dentry->d_parent;
>     			if (dentry == tmp)
>     				break;
>     			do_fdatasync(dentry);
>     		}
>     	}
> 
> I really like this idea. Can people please try out the attached patch?

Deadly for Coda, every fsync triggers an upcall to userspace, which
costs at least 2 context switches and a whole bunch of network traffic
as updates are sent back to the server, i.e. the fact that some parent
has a new child or timestamp is not related to or interesting for this
delivery.

The existing fsync(dir) is far better and the reasons are simple and
explained multiple times throughout this far too long thread. So let me
reiterate.

- fsync(dir) is an _explicit_ indication by an application that actually
  cares what precisely needs to be written to disk.
- fsync(dir) doesn't negatively affect applications that don't care.
- The proposed patch doesn't solve the 'oops I relinked the file, or I
  renamed a parent' problems where the path leading to a file is lost
  anyways. And the relinking is exactly what is done by both qmail and
  cyrus imap, i.e. this patch wouldn't even solve the problem that is
  being discussed.
- fsync(dir) doesn't write out anymore that has to be written, i.e. a
  open/unlink in a temporary directory doesn't need to hit the disk if
  there is an fsync in the middle (see qmail/cyrus syscall paths in a
  previous email)
- We don't need to modify the code to make fsync(dir) work! If it turns
  out to be too slow, let's fix fsync(dir) to be more efficient.

Jan


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-04 17:35           ` Jan Harkes
@ 2001-08-04 18:18             ` Chris Wedgwood
  0 siblings, 0 replies; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-04 18:18 UTC (permalink / raw)
  To: Jan Harkes, linux-kernel

(cc' list trimmed)

On Sat, Aug 04, 2001 at 01:35:00PM -0400, Jan Harkes wrote:

    Deadly for Coda, every fsync triggers an upcall to userspace,
    which costs at least 2 context switches and a whole bunch of
    network traffic as updates are sent back to the server, i.e. the
    fact that some parent has a new child or timestamp is not related
    to or interesting for this delivery.

would limiting it to just the parent dentry be acceptable?

    - fsync(dir) is an _explicit_ indication by an application that
      actually cares what precisely needs to be written to disk.

i though nothing used it (maybe qmail?)

    - fsync(dir) doesn't negatively affect applications that don't care.

fsync(dir) or fsync(file)?

    - The proposed patch doesn't solve the 'oops I relinked the file,
      or I renamed a parent' problems where the path leading to a file
      is lost anyways.

      And the relinking is exactly what is done by both qmail and
      cyrus imap, i.e. this patch wouldn't even solve the problem that
      is being discussed.

cryrus/qmail do link("tmp_place", "store/where_i_want_it") ?

do they also do fsync(open("store")) ?




  --cw

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-04  4:04                 ` Matthias Andree
@ 2001-08-04 18:30                   ` Chris Wedgwood
  2001-08-05 12:15                     ` Matthias Andree
  2001-08-04 21:07                   ` Andrew Morton
  1 sibling, 1 reply; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-04 18:30 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

(cc' list trimmed)

On Sat, Aug 04, 2001 at 06:04:23AM +0200, Matthias Andree wrote:

    However, aren't we already at the point that ext3 fsync() flushes
    the corresponding dirents?

Ideally an acceptable solution will also obvioate the needs for +S
under ext2 and also work for reiserfs and more.

    How difficult would it be to have the inode track changes to its
    dirents such as rename or link?

For simple cases, not terrible (I don't think),  but you would have to
limit how much you tracked in extreme cases (like 2 directory inode at
most or something).

    I mean, FFS + softupdates can do it.

So run FFS :)

    The MTA doesn't really care for the implementation, it cares that
    it never loses its files over a crash

By MTA you mean Postfix.  Wietse recently stated he might start
looking at other alternatives which don't relay on undocument FFS
semantics or synchronous metadata updates.

    where finding it renamed
    to /mount/point/lost+found is considered a loss.

Actually, there is enough information on the file to recover things
for postfix.  For sendmail (which has two files, your pretty much
stuffed though).

    unlink and particularly symlink will need separate consideration,
    but that's left for later.

Unlink should be ok, I don't see why symlink should even be
considered (it would make things way too complex for little or no
gain).



Anyhow, if you read recent comments it looks like thre are filesystems
which will be badly affected by placing this logic into the VFS
(eg. Coda).  It may well be this should become a fs-specific thing
(which sucks a little, because it makes the suggestion of tracking
link/unlink directories ugly) and for some filesystems such as
resierfs and ext3, they could be modified to merge the 'related
directories' writes with the metadata write of the file itself.




  --cw

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semanticchange patch)
  2001-08-04  7:45                             ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semanticchange patch) Hans Reiser
@ 2001-08-04 18:31                               ` Chris Wedgwood
  2001-08-05  1:47                                 ` Alexander Viro
  0 siblings, 1 reply; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-04 18:31 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Alexander Viro, Linus Torvalds, linux-kernel, Alan Cox, Chris Mason

On Sat, Aug 04, 2001 at 11:45:47AM +0400, Hans Reiser wrote:

    Can you define f_cred for us?

Surely is becomes a fs-specific opaque type, void* or whatever?  For
many filesystems it somply won't be relevant, and for some filesystems
such as NFS and Coda, presumably it will need deep fs-specific
details?



  --cw


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic  change patch)
  2001-08-04  4:04                 ` Matthias Andree
  2001-08-04 18:30                   ` Chris Wedgwood
@ 2001-08-04 21:07                   ` Andrew Morton
  1 sibling, 0 replies; 77+ messages in thread
From: Andrew Morton @ 2001-08-04 21:07 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

Matthias Andree wrote:
> 
> aren't we already at the point that ext3 fsync() flushes
> the corresponding dirents?

_any_ synchronous operation on ext3 has flushed _everything_
by the time it returns to the caller.  Every last little bit
is on disk.

This applies to fsync() against any file/dir, write() on an
O_SYNC file, any metadata operation or write() against a `chattr +S'
object, any metadata operation or write() against a `mount -o sync'
filesystem and msync().

The only exception is pageout of mmap'ed files - you'll need to
run msync() to guarantee that these are crashproofed.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic  change patch)
  2001-08-04  1:19             ` Chris Wedgwood
  2001-08-04  1:45               ` Andrew Morton
@ 2001-08-04 21:07               ` Andrew Morton
  2001-08-05  7:25                 ` Chris Wedgwood
  2001-08-09 13:25                 ` Matthias Andree
  1 sibling, 2 replies; 77+ messages in thread
From: Andrew Morton @ 2001-08-04 21:07 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: linux-kernel

Chris Wedgwood wrote:
> 
> On Fri, Aug 03, 2001 at 06:08:49PM -0700, Andrew Morton wrote:
> 
>     Ow.  You just crippled ext3.
> 
> How so? The Flush all transactions on fsync behaviour that resierfs
> did/does have at present too?  (There are 'fixes' to reiserfs for
> this).

Sorry, Chris - I was being even more than usually stupid.  All
you need do is overload the return value from file_operations.fsync().
It currently returns zero on success or negative error.  You can
just define a return value of "1" to mean that the fs has taken
care of syncing the directory info and no further action is needed
at the fs layer.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: fdatasync(2) is also there (was: intermediate summary of ext3-2.4-0.9.4 thread)
  2001-08-04  3:40       ` fdatasync(2) is also there (was: intermediate summary of ext3-2.4-0.9.4 thread) Matthias Andree
@ 2001-08-05  0:28         ` Mike Castle
  0 siblings, 0 replies; 77+ messages in thread
From: Mike Castle @ 2001-08-05  0:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Viro

On Sat, Aug 04, 2001 at 05:40:43AM +0200, Matthias Andree wrote:
> How portable is fsync()ing the directory?

It is as portable as assuming that the directory entries are synced during
fsync().

> 
> How USEFUL is it to the application if all other boxen fsync() the
> directory entries along with the file?

Except, you can NOT guarantee that this is the case.  The spec does NOT
require it.

> You want to sync the files, but the directory only after you created all
> of the files? Use fdatasync(2) for the files - it doesn't flush meta
> data, then sync the directory. It's POSIX, it's SUS v2.

Is fsync(dir) what you mean here?

mrc
-- 
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semanticchange patch)
  2001-08-04 18:31                               ` Chris Wedgwood
@ 2001-08-05  1:47                                 ` Alexander Viro
  2001-08-08 17:22                                   ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semanticchangepatch) Hans Reiser
  0 siblings, 1 reply; 77+ messages in thread
From: Alexander Viro @ 2001-08-05  1:47 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Hans Reiser, Linus Torvalds, linux-kernel, Alan Cox, Chris Mason



On Sun, 5 Aug 2001, Chris Wedgwood wrote:

> On Sat, Aug 04, 2001 at 11:45:47AM +0400, Hans Reiser wrote:
> 
>     Can you define f_cred for us?
> 
> Surely is becomes a fs-specific opaque type, void* or whatever?  For
> many filesystems it somply won't be relevant, and for some filesystems
> such as NFS and Coda, presumably it will need deep fs-specific
> details?

Not really. It _is_ relevant for, say it, ext2. Why? Because we need that
to handle, e.g. 5% reserve. Look:

	Disk is almost full. We have only 3% of blocks free and in that
situation only root should be able to allocate more.
	User foo creates an empty file. No blocks allocated, so we are OK.
He does truncate() to, say it, 10Mb. Still no block allocations.
	He mmaps that file. And does memset() on mmaped area. Now we have
a bunch of dirty pages. Well, eventually kswapd will write them out, right?
	What UID does kswapd run under? Exactly. Welcome to the fun - we've
managed to trick the system into allocating 10Mb of disk space for us, since
in the allocation time UID and GID were 0.
	Now that we have these blocks allocated, usual write() will succeeed
just fine - no block allocation, no checks.

See what I mean? Blind use of current->fsuid et.al. in filesystem code is a
Bad Thing(tm). We really want to know who is responsible for the operation
and "current task" is not always the right answer.

What we need is a cache of objects that would contain (at least) UID,
GID and auxillary groups. That, and ability to add extra authentication
tokens. Process should have a pointer to its current credentials (quite
possibly shared by many processes). Ditto for an opened file. When we
open a file we should simply inherit credentials of opening process.
When we do seteuid(2), etc., we should create a separate copy of current
credentials, modify that copy, drop the reference to old creds and set
the pointer to credentials to the modified copy. Fs operations should
(eventually) be changed to get credentials explicitly - as an argument.
That, BTW, would allow knfsd to avoid the silly games with setting
->fsuid for the duration of fs operation - we could simply pass the
credentials of the party responsible for NFS request to filesystem
methods.

It's nothing new - both the problem and solution are well-known since
at least early 90s. All this stuff is pretty straightforward, but it
can't be done in 2.4 - it involves changing the prototypes of fs
methods. Changes to the bodies of methods (i.e. to fs code) are
minimal - basically, it's a search-and-replace job that can be done
at once on the whole tree. However, such things do not belong to
-STABLE.

PS: probably "identity" would be a better term than "credentials". It's
a "look, I'm acting on behalf of Joe R. User and I carry a bunch of IDs to
prove that" thing.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-04 21:07               ` Andrew Morton
@ 2001-08-05  7:25                 ` Chris Wedgwood
  2001-08-09 13:25                 ` Matthias Andree
  1 sibling, 0 replies; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-05  7:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Sat, Aug 04, 2001 at 02:07:59PM -0700, Andrew Morton wrote:

    Sorry, Chris - I was being even more than usually stupid.  All you
    need do is overload the return value from file_operations.fsync().
    It currently returns zero on success or negative error.  You can
    just define a return value of "1" to mean that the fs has taken
    care of syncing the directory info and no further action is needed
    at the fs layer.

Seems quite reasonable.  I'm quite keen on ehat approach except it
seems for some filesystems, walking the path is expensive.  How about
a flag in the superblock on whether we do this or not?

That said, it's been pointed out it doesn't actually help in several
cases (link/unlink) for example, so maybe I should forget the whole
thing?




  --cw

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-04 18:30                   ` Chris Wedgwood
@ 2001-08-05 12:15                     ` Matthias Andree
  2001-08-05 12:32                       ` Chris Wedgwood
  0 siblings, 1 reply; 77+ messages in thread
From: Matthias Andree @ 2001-08-05 12:15 UTC (permalink / raw)
  To: linux-kernel

On Sun, 05 Aug 2001, Chris Wedgwood wrote:

> Anyhow, if you read recent comments it looks like thre are filesystems
> which will be badly affected by placing this logic into the VFS
> (eg. Coda).  It may well be this should become a fs-specific thing
> (which sucks a little, because it makes the suggestion of tracking
> link/unlink directories ugly) and for some filesystems such as

Why does it? Each file-system is self-contained with respect to hard
links. You cannot have link cross file system boundaries.

Common code can be placed into a library. (Probably 2.5 stuff though.)

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-05 12:15                     ` Matthias Andree
@ 2001-08-05 12:32                       ` Chris Wedgwood
  2001-08-05 13:02                         ` Matti Aarnio
  0 siblings, 1 reply; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-05 12:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: Matthias Andree


On Sun, Aug 05, 2001 at 02:15:46PM +0200, Matthias Andree wrote:

    Why does it? Each file-system is self-contained with respect to hard
    links. You cannot have link cross file system boundaries.

    Common code can be placed into a library. (Probably 2.5 stuff though.)

As pointed out by Jan Harkes, logic that works for ext2 (eg. walking
the dentry chain and sync'ing all the components) sucks for things
like Coda, where the performance impact may be noticable (actually,
I'm not conviced it will be, but what do I know).

Not only that, it doesn't help qmail, cyrus imapd or Postfix
completely.




  --cw

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-05 12:32                       ` Chris Wedgwood
@ 2001-08-05 13:02                         ` Matti Aarnio
  0 siblings, 0 replies; 77+ messages in thread
From: Matti Aarnio @ 2001-08-05 13:02 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: linux-kernel

On Mon, Aug 06, 2001 at 12:32:42AM +1200, Chris Wedgwood wrote:
> On Sun, Aug 05, 2001 at 02:15:46PM +0200, Matthias Andree wrote:
> 
>     Why does it? Each file-system is self-contained with respect to hard
>     links. You cannot have link cross file system boundaries.
> 
>     Common code can be placed into a library. (Probably 2.5 stuff though.)
> 
> As pointed out by Jan Harkes, logic that works for ext2 (eg. walking
> the dentry chain and sync'ing all the components) sucks for things
> like Coda, where the performance impact may be noticable (actually,
> I'm not conviced it will be, but what do I know).
> 
> Not only that, it doesn't help qmail, cyrus imapd or Postfix
> completely.

	And for that matter, it (full access-tree fsync()ing) isn't
	necessary for systems which don't go around creating directories,
	instead place things into existing ones, and move things around
	in between the directories.


>   --cw

/Matti Aarnio

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-03 22:01         ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch) Chris Wedgwood
                             ` (3 preceding siblings ...)
  2001-08-04 17:35           ` Jan Harkes
@ 2001-08-06 15:02           ` Chris Mason
  2001-08-06 20:48             ` Chris Wedgwood
  4 siblings, 1 reply; 77+ messages in thread
From: Chris Mason @ 2001-08-06 15:02 UTC (permalink / raw)
  To: Chris Wedgwood, Linus Torvalds; +Cc: linux-kernel, Alan Cox



On Saturday, August 04, 2001 10:01:43 AM +1200 Chris Wedgwood <cw@f00f.org>
wrote:
 
> Note, there is also a reiserfs fix in here because we can call
> f_op->fsync on a directory and without this fix it will BUG!  Chris,
> perhaps you can suggest a better fix?
> 

Aside from the rest of the discussion on this, did you actually trigger
this BUG for fsync(dir)?  f_op->fsync should already be set to
reiserfs_dir_fsync.

-chris


^ permalink raw reply	[flat|nested] 77+ messages in thread

* looking for resources for designing loopback filesystem
  2001-08-03 22:29       ` Anton Altaparmakov
  2001-08-03 23:06         ` Chris Wedgwood
@ 2001-08-06 15:23         ` dave-mlist
  2001-08-06 16:11           ` Ville Herva
  1 sibling, 1 reply; 77+ messages in thread
From: dave-mlist @ 2001-08-06 15:23 UTC (permalink / raw)
  To: linux-kernel

I'd like to create a filesystem which mirrors another local
filesystem, but with different permissions, and with some content
invisible to some users.  Is there a kernel module that does this
already?  Is the loopback module the one?  If so, where can I learn
about applying the module to this purpose?

Dave

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: looking for resources for designing loopback filesystem
  2001-08-06 15:23         ` looking for resources for designing loopback filesystem dave-mlist
@ 2001-08-06 16:11           ` Ville Herva
  0 siblings, 0 replies; 77+ messages in thread
From: Ville Herva @ 2001-08-06 16:11 UTC (permalink / raw)
  To: dave-mlist; +Cc: linux-kernel

On Mon, Aug 06, 2001 at 08:23:47AM -0700, you [dave-mlist@bfnet.com] claimed:
> I'd like to create a filesystem which mirrors another local
> filesystem, but with different permissions, and with some content
> invisible to some users.  Is there a kernel module that does this
> already?  Is the loopback module the one?  If so, where can I learn
> about applying the module to this purpose?

Perhaps you could hack an nfs-server (knfsd or nfsd) to do the permission
tweaks and hiding and then just mount the stuff over to another location
with ordinary nfs client stuff.


-- v --

v@iki.fi

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-06 15:02           ` Chris Mason
@ 2001-08-06 20:48             ` Chris Wedgwood
  0 siblings, 0 replies; 77+ messages in thread
From: Chris Wedgwood @ 2001-08-06 20:48 UTC (permalink / raw)
  To: Chris Mason; +Cc: Linus Torvalds, linux-kernel, Alan Cox

On Mon, Aug 06, 2001 at 11:02:27AM -0400, Chris Mason wrote:

    Aside from the rest of the discussion on this, did you actually trigger
    this BUG for fsync(dir)?  f_op->fsync should already be set to
    reiserfs_dir_fsync.

Yes, but when walking the dentry chain I got to resierfs_file_sync or
something rather than reiserfs_dir_fsync --- two possible fixes,
either have reiserfs internally deal with this (my patch just calls
the dir function from the file function) or try and make the code I
wrote aware of the difference (which technicaly is probably more
correct, I just couldn't figure out how to do it).

I actually made a note to myself to look at merging reiserfs_dir_fsync
and resierfs_file_sync at some point.




  --cw


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync  semanticchangepatch)
  2001-08-05  1:47                                 ` Alexander Viro
@ 2001-08-08 17:22                                   ` Hans Reiser
  0 siblings, 0 replies; 77+ messages in thread
From: Hans Reiser @ 2001-08-08 17:22 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Chris Wedgwood, linux-kernel, Alan Cox, Chris Mason

Does fseuid capture the meaning you intend?  And if it does, what exactly is
missing from the tradional paradigm of effective vs. real uid that cannot be
applied to kswapd?

Can you give an example of when does the filesytem need to know that the
effective user id is A but the real user id is kswapd?

I understand that kswapd is not managing its uids, but I don't yet understand
why the fs needs two fields rather than one to track the effective uid
completely not caring that the real uid is kswapd.  

I suspect I just need more details on the issue to understand you....

Hans

Alexander Viro wrote:
> 
> On Sun, 5 Aug 2001, Chris Wedgwood wrote:
> 
> > On Sat, Aug 04, 2001 at 11:45:47AM +0400, Hans Reiser wrote:
> >
> >     Can you define f_cred for us?
> >
> > Surely is becomes a fs-specific opaque type, void* or whatever?  For
> > many filesystems it somply won't be relevant, and for some filesystems
> > such as NFS and Coda, presumably it will need deep fs-specific
> > details?
> 
> Not really. It _is_ relevant for, say it, ext2. Why? Because we need that
> to handle, e.g. 5% reserve. Look:
> 
>         Disk is almost full. We have only 3% of blocks free and in that
> situation only root should be able to allocate more.
>         User foo creates an empty file. No blocks allocated, so we are OK.
> He does truncate() to, say it, 10Mb. Still no block allocations.
>         He mmaps that file. And does memset() on mmaped area. Now we have
> a bunch of dirty pages. Well, eventually kswapd will write them out, right?
>         What UID does kswapd run under? Exactly. Welcome to the fun - we've
> managed to trick the system into allocating 10Mb of disk space for us, since
> in the allocation time UID and GID were 0.
>         Now that we have these blocks allocated, usual write() will succeeed
> just fine - no block allocation, no checks.
> 
> See what I mean? Blind use of current->fsuid et.al. in filesystem code is a
> Bad Thing(tm). We really want to know who is responsible for the operation
> and "current task" is not always the right answer.
> 
> What we need is a cache of objects that would contain (at least) UID,
> GID and auxillary groups. That, and ability to add extra authentication
> tokens. Process should have a pointer to its current credentials (quite
> possibly shared by many processes). Ditto for an opened file. When we
> open a file we should simply inherit credentials of opening process.
> When we do seteuid(2), etc., we should create a separate copy of current
> credentials, modify that copy, drop the reference to old creds and set
> the pointer to credentials to the modified copy. Fs operations should
> (eventually) be changed to get credentials explicitly - as an argument.
> That, BTW, would allow knfsd to avoid the silly games with setting
> ->fsuid for the duration of fs operation - we could simply pass the
> credentials of the party responsible for NFS request to filesystem
> methods.
> 
> It's nothing new - both the problem and solution are well-known since
> at least early 90s. All this stuff is pretty straightforward, but it
> can't be done in 2.4 - it involves changing the prototypes of fs
> methods. Changes to the bodies of methods (i.e. to fs code) are
> minimal - basically, it's a search-and-replace job that can be done
> at once on the whole tree. However, such things do not belong to
> -STABLE.
> 
> PS: probably "identity" would be a better term than "credentials". It's
> a "look, I'm acting on behalf of Joe R. User and I carry a bunch of IDs to
> prove that" thing.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch)
  2001-08-04 21:07               ` Andrew Morton
  2001-08-05  7:25                 ` Chris Wedgwood
@ 2001-08-09 13:25                 ` Matthias Andree
  1 sibling, 0 replies; 77+ messages in thread
From: Matthias Andree @ 2001-08-09 13:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Chris Wedgwood, linux-kernel

On Sat, 04 Aug 2001, Andrew Morton wrote:

> Sorry, Chris - I was being even more than usually stupid.  All
> you need do is overload the return value from file_operations.fsync().
> It currently returns zero on success or negative error.  You can
> just define a return value of "1" to mean that the fs has taken
> care of syncing the directory info and no further action is needed
> at the fs layer.

But ensure that the user space never sees the "1" return value from
fsync() itself.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:39               ` Matthias Andree
  2001-08-03  9:57                 ` Christoph Hellwig
@ 2001-08-04  7:55                 ` Eric W. Biederman
  1 sibling, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2001-08-04  7:55 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Paul Jakma, linux-kernel

Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> On Fri, 03 Aug 2001, Eric W. Biederman wrote:
> 
> > Actually given that this thread keeps coming up, but no one does anything
> > about it.  I'm tempted to suggest we remove chatrr +S support from ext2.
> > Then there will be enough pain that someone will fix the MTA instead of
> > moaning that kernel is slow...
> 
> They'd just drop Linux from the list of supported OS's, Linux will
> disappoint people who trusted it, nothing is gained. Deliberate breakage
> will not happen, because it would not help anyone except people with
> twisted minds.

There are some other uses for a fully synchronous disk accesses so I'm
not going to run out and do it.  The point is that work arounds for
strange programs is not a right, it is a nice optional feature.
 
> NO-ONE, including you, has come up with SERIOUS objections against a
> dirsync option, except "is it really so much slower than chattr +S? show
> figures" -- ext3 is being tuned to be fast in spite of chattr +S.

Clear objects against dirsync.  
- Extra code maintenance, makes the fs less reliable 
  (A reason for removing even synchrouns fs operations BTW).
- Unnecessary. fsync(dir) works today.
- dirsync is unlikely to be faster than fsync(file); fsync(dir) [not chattr +S]
  You really need something that can say remember these 5 syscalls,
  and sync the all their changes to disk togeter to really get an
  improvement in sync speed.
- I don't see anyone volunteering to write the code.

> Reconsider your position.

Nope.  Right now I would rather
a) Patch the mail programs to do the needed fsync(dir)
b) Totally remove synchrous disk updates from my OS, and
   make life really painful.
Before adding a dirsync option.

> Stop trolling please.

It wasn't trying to troll, just get this conversation on some productive
grounds.  I think supporting the MTA's is good, so long as it is a two
way relationship.

If someone went out and tried using fsync(dir) and then saying it
sucked we could definentily have more peace.

Using dirsync, and chattr +S hide the real problems that need to get
fixed.  Getting a good reliable, and high performance way to commit
actions to a filesystem.  

We already have one work around on linux that will work reliably.  So
now let's see if we can get a functional high performance solution.

And oh btw, new, functional high performance solutions are not
portable because they haven't been implemented in every operating
system.  Full understanding of the problem, and the solutions are two
new for the implementations to have gotten around.

Eric




^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found]               ` <20010803021406.A9845@emma1.emma.line.org>
  2001-08-03 16:20                 ` Jan Harkes
@ 2001-08-03 22:48                 ` Andreas Dilger
  1 sibling, 0 replies; 77+ messages in thread
From: Andreas Dilger @ 2001-08-03 22:48 UTC (permalink / raw)
  To: Matthias Andree
  Cc: Anton Altaparmakov, Matthias Andree, Andreas Dilger,
	Alexander Viro, Daniel Phillips, Stephen C. Tweedie,
	linux-kernel



^ permalink raw reply	[flat|nested] 77+ messages in thread

* RE: intermediate summary of ext3-2.4-0.9.4 thread
@ 2001-08-03 20:25 Sam James
  0 siblings, 0 replies; 77+ messages in thread
From: Sam James @ 2001-08-03 20:25 UTC (permalink / raw)
  To: Albert D. Cahalan, tao; +Cc: phillips, sct, linux-kernel

>
>This is just completely true. One wonders why we seem to enjoy
>getting screwed this way. We shouldn't be patching these MTAs or
>hacking Linux to act like BSD. We should be avoiding these MTAs.
>
>Somebody can create a big MTA list, listing the good and bad ones.
>Then we get the Linux-hostile MTAs out of the Linux distributions,
>demanding compliance like we do for filesystem layout. We also hunt
>down Linux-related web pages that mention these MTAs and get the
>pages changed or removed. The point is to make these MTAs just
>disappear, never to be seen again. Nice MTAs get promoted.


Your not related to Bill Gates are you?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:50         ` David Weinehall
  2001-08-03 18:31           ` Matthias Andree
@ 2001-08-03 19:59           ` Albert D. Cahalan
  2001-08-03 19:54             ` Gregory Maxwell
  1 sibling, 1 reply; 77+ messages in thread
From: Albert D. Cahalan @ 2001-08-03 19:59 UTC (permalink / raw)
  To: David Weinehall; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel

David Weinehall writes:
> On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:

>> Still, some people object to a dirsync mount option. But this has been
>> the actual reason for the thread - MTA authors are refusing to pamper
>> Linux and use chattr +S instead which gives unnecessary (premature) sync
>> operations on write() - but MTAs know how to fsync().
>
> So what you mean is that MTA authors refuse to pamper Linux through use
> of fsync of the directory, but can accept to "pamper" Linux through use
> of chattr +S?! This seem ridiculous.  It seems equally ridiculous to
> demand that Linux should pamper for MTA authors that can't implement
> fsync on the directory instead of writing BSD-specific code.
>
> [snip]
>
> To me this seems mostly like a way of saying "Hey, we've finally found
> a way to make Linux look really bad compared to BSD-systems; let's
> complain instead of writing alternative code that suits Linux systems
> better than this code does." A lot like all the discussions on threads,
> ueally.

This is just completely true. One wonders why we seem to enjoy
getting screwed this way. We shouldn't be patching these MTAs or
hacking Linux to act like BSD. We should be avoiding these MTAs.

Somebody can create a big MTA list, listing the good and bad ones.
Then we get the Linux-hostile MTAs out of the Linux distributions,
demanding compliance like we do for filesystem layout. We also hunt
down Linux-related web pages that mention these MTAs and get the
pages changed or removed. The point is to make these MTAs just
disappear, never to be seen again. Nice MTAs get promoted.



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 19:59           ` Albert D. Cahalan
@ 2001-08-03 19:54             ` Gregory Maxwell
  0 siblings, 0 replies; 77+ messages in thread
From: Gregory Maxwell @ 2001-08-03 19:54 UTC (permalink / raw)
  To: Albert D. Cahalan
  Cc: David Weinehall, Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Fri, Aug 03, 2001 at 03:59:02PM -0400, Albert D. Cahalan wrote:
[snip]
> Somebody can create a big MTA list, listing the good and bad ones.
> Then we get the Linux-hostile MTAs out of the Linux distributions,
> demanding compliance like we do for filesystem layout. We also hunt
> down Linux-related web pages that mention these MTAs and get the
> pages changed or removed. The point is to make these MTAs just
> disappear, never to be seen again. Nice MTAs get promoted.

Think we could just get their authors to 'disappear'? It might be more cost
effective, and I can think of at least one example where removing the author
would have other benefits beyond MTAs. :) :)

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found] ` <200108030207.f7326OpR003086@sleipnir.valparaiso.cl>
@ 2001-08-03 18:34   ` Matthias Andree
  0 siblings, 0 replies; 77+ messages in thread
From: Matthias Andree @ 2001-08-03 18:34 UTC (permalink / raw)
  To: Horst von Brand; +Cc: Daniel Phillips, linux-kernel

On Thu, 02 Aug 2001, Horst von Brand wrote:

> > He wants to be able to set a bit on the directory that specifies such 
> > behaviour for all files in the directory.  I don't see what's wrong 
> > with that.
> 
> That there isn't THE directory in which the file resides. There might be
> several, and setting the bit on one of them at random and expect it to work
> is a _lot_ of work. For no real use.

Well, if the file resides in a directory without that flag, it's not
synched. The application or its installation process should take care of
that. Suppose the MTA knows where it writes its queue files and which
must have this flag set accordingly.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:50         ` David Weinehall
@ 2001-08-03 18:31           ` Matthias Andree
  2001-08-03 19:59           ` Albert D. Cahalan
  1 sibling, 0 replies; 77+ messages in thread
From: Matthias Andree @ 2001-08-03 18:31 UTC (permalink / raw)
  To: David Weinehall; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Fri, 03 Aug 2001, David Weinehall wrote:

> On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> > Still, some people object to a dirsync mount option. But this has been
> > the actual reason for the thread - MTA authors are refusing to pamper
> > Linux and use chattr +S instead which gives unnecessary (premature) sync
> > operations on write() - but MTAs know how to fsync().
> 
> So what you mean is that MTA authors refuse to pamper Linux through use
> of fsync of the directory, but can accept to "pamper" Linux through use
> of chattr +S?! This seem ridiculous.  It seems equally ridiculous to
> demand that Linux should pamper for MTA authors that can't implement
> fsync on the directory instead of writing BSD-specific code.

It's a maintenance issue.

You effectively start wrapping up all relevant syscalls and have
system-specific interfaces. One wants the directory fsync()ed, the other
offers a special other trick to get the data flushed... what useful is
portability then if systems are so different?

> To me this seems mostly like a way of saying "Hey, we've finally found
> a way to make Linux look really bad compared to BSD-systems; let's

No wonder if the application chooses fully-synchronous operation on
Linux.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:30         ` Stephen C. Tweedie
@ 2001-08-03 18:28           ` Matthias Andree
  0 siblings, 0 replies; 77+ messages in thread
From: Matthias Andree @ 2001-08-03 18:28 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Daniel Phillips, linux-kernel

On Fri, 03 Aug 2001, Stephen Tweedie wrote:

> > > The prescription for symlinks is, if you want them safely on disk you 
> > > have to explicitly fsync the containing directory.
> > 
> > Yes, and it doesn't matter, since MTAs don't use symlinks (symlinks
> > waste inodes on most systems).
> 
> Irrelevant.   We're talking about what makes sensible semantics, not
> what assumptions any specific application makes.  It makes no sense to
> say that dirsync won't affect symlinks just because some existing
> applications don't rely on that!

It's rather my imagination that tracking hard links might be easier than
symlinks because hard links share the inode number. A more advanced (and
complex) implementation might prove the imagination wrong. I don't want
to consider which one is more efficient.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 19:47         ` Bill Rugolsky Jr.
@ 2001-08-03 18:22           ` Matthias Andree
  0 siblings, 0 replies; 77+ messages in thread
From: Matthias Andree @ 2001-08-03 18:22 UTC (permalink / raw)
  To: Bill Rugolsky Jr.; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Thu, 02 Aug 2001, Bill Rugolsky Jr. wrote:

> I have no idea where BSD falls, but the basic point stands:  unused
> features should not penalize other applications.  Andrew Morton has
> figured out how to do this efficiently with ext3, and many kudos to him
> for doing the work.  Absent that, why should I have to go get a cup of
> coffee every time I want to patch a tree, just so some MTA can make
> naive assumptions?

The whole idea is to have a switch to turn on BSD-style synchronous
directory update semantics. Nothing more, nothing you would not be able
to get rid off. In fact, you can mount file systems async on BSD as
well, but you'd better not have the machine crash. Irrecoverable file
system damage can result. As a compromise, softupdates are nearly as
fast as async, but FS damage is guaranteed to be recoverable.

In either case (async or soft-updates), files can end up in lost+found
after the control had been returned to the application that called open
or link.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found]               ` <20010803021406.A9845@emma1.emma.line.org>
@ 2001-08-03 16:20                 ` Jan Harkes
  2001-08-03 22:48                 ` Andreas Dilger
  1 sibling, 0 replies; 77+ messages in thread
From: Jan Harkes @ 2001-08-03 16:20 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

On Fri, Aug 03, 2001 at 02:14:06AM +0200, Matthias Andree wrote:
> On Fri, 03 Aug 2001, Anton Altaparmakov wrote:
> > filedescriptor to be synced to disk, the ONLY possible way to do this it to 
> > sync the parent directory in order to commit the file name to disk. On some 
> 
> Do I really need to sync the WHOLE parent directory? Not just the
> relevant part? My directories hardly have only 1 disk block.

Only dirty blocks are written back to disk, i.e. the parts of the
directory that were modified by adding/removing names. It should be
pretty efficient.

> > to explicitly sync the directory filedescriptor afterwards.
> 
> Which is non-portable and will not be done by many application
> programmers which just use chattr +S instead (makes things S)afe and
> S)low) - and spoil performance that way since it makes not only
> directory writes synchronous, but file (data) writes as well.

"chattr +S" is about as portable as adding fsync(parent), or even worse
as it only works on an ext2 file system. So I'm assuming that this is
just a nice exercise in annoying people.

Jan


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:39               ` Matthias Andree
@ 2001-08-03  9:57                 ` Christoph Hellwig
  2001-08-04  7:55                 ` Eric W. Biederman
  1 sibling, 0 replies; 77+ messages in thread
From: Christoph Hellwig @ 2001-08-03  9:57 UTC (permalink / raw)
  To: Matthias Andree
  Cc: Matthias Andree, Paul Jakma, linux-kernel, Eric W. Biederman

In article <20010803103954.A11584@emma1.emma.line.org> you wrote:
> They'd just drop Linux from the list of supported OS's, Linux will
> disappoint people who trusted it, nothing is gained. Deliberate breakage
> will not happen, because it would not help anyone except people with
> twisted minds.

Who cares?  There are more than enough sane mailer around..

> NO-ONE, including you, has come up with SERIOUS objections against a
> dirsync option, except "is it really so much slower than chattr +S? show
> figures" -- ext3 is being tuned to be fast in spite of chattr +S.

Talk is cheap.  Code up a non-invasive dirsync option and submit it to
Linus.  I don't see any reason why it won't be accepted..

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found]               ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
@ 2001-08-03  9:16                 ` Matthias Andree
  0 siblings, 0 replies; 77+ messages in thread
From: Matthias Andree @ 2001-08-03  9:16 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: linux-kernel

On Fri, 03 Aug 2001, Anton Altaparmakov wrote:

[dirsync chattr/mount options]
> Me neither. With regards to the parallel discussion on SUS compliance it is 
> probably a good idea to have such a thing in some form anyway (although if 
> I understood the discussion correctly, we really want this to happen by 
> default, not just when some flag is set but then again I never read the 
> standards...).

The standard doesn't really command the behaviour, as it seems, but we
might want to look again after SUS v3 has been released (supposed to
happen later this year) - the SUS compliance was rather on fsync than on
rename/link.

However, I'd rather not choose the default for somebody else, because he
may have different requirements, a compile-time switch to set the
default should be fine, THIS one might indeed default to dirsync/noasync
unless changed by make {x,menu,}config.

Assuming that the chattr +S is accompanied by a corresponding -o sync
mount option, I'd expect that the dirsync option be available as chattr
option and as mount option, and choosing default mount options should be
rather easy.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:37       ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
                           ` (3 preceding siblings ...)
  2001-08-03  8:30         ` Stephen C. Tweedie
@ 2001-08-03  8:50         ` David Weinehall
  2001-08-03 18:31           ` Matthias Andree
  2001-08-03 19:59           ` Albert D. Cahalan
  4 siblings, 2 replies; 77+ messages in thread
From: David Weinehall @ 2001-08-03  8:50 UTC (permalink / raw)
  To: Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> On Thu, 02 Aug 2001, Daniel Phillips wrote:
> 
> [file name must be flushed on fsync()]
> > I don't know why it is hard or inefficient to implement this at the VFS 
> > level, though I'm sure there is a reason or this thread wouldn't 
> > exist.  Stephen, perhaps you could explain for the record why sys_fsync 
> > can't just walk the chain of dentry parent links doing fdatasync?  Does 
> > this create VFS or Ext3 locking problems?  Or maybe it repeats work 
> > that Ext3 is already supposed to have done?
> 
> Well, the course was that I asked whether ext3 would do synchronous
> directory updates, and some people jumped in and said that one should
> fsync() the parent directory, however, since we figure from SUS, that's
> invalid.
> 
> After some forth and back, we finally figured that at least ext2 is
> implementing fsync() improperly.
> 
> So this part is covered.

Yup, and this should be fixed imho.

> The other thing is, that Linux is the only known system that does
> asynchronous rename/link/unlink/symlink -- people have claimed it might
> not be the only one, but failed to name systems.

And this is a feature, not a bug.

> So we need to assume that Linux is the only system that does
> asynchronous rename/link/unlink/symlink, however a directory fsync() is
> believed to be rather expensive.

A directory fsync() might be expensive on non-Linux filesystems...

> Still, some people object to a dirsync mount option. But this has been
> the actual reason for the thread - MTA authors are refusing to pamper
> Linux and use chattr +S instead which gives unnecessary (premature) sync
> operations on write() - but MTAs know how to fsync().

So what you mean is that MTA authors refuse to pamper Linux through use
of fsync of the directory, but can accept to "pamper" Linux through use
of chattr +S?! This seem ridiculous.  It seems equally ridiculous to
demand that Linux should pamper for MTA authors that can't implement
fsync on the directory instead of writing BSD-specific code.

[snip]

To me this seems mostly like a way of saying "Hey, we've finally found
a way to make Linux look really bad compared to BSD-systems; let's
complain instead of writing alternative code that suits Linux systems
better than this code does." A lot like all the discussions on threads,
ueally.

Then again, I'm probably just extra grouchy today because it rained when
I rode my bike to work.


/David Weinehall
  _                                                                 _
 // David Weinehall <tao@acc.umu.se> /> Northern lights wander      \\
//  Project MCA Linux hacker        //  Dance across the winter sky //
\>  http://www.acc.umu.se/~tao/    </   Full colour fire           </

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  7:03             ` Eric W. Biederman
@ 2001-08-03  8:39               ` Matthias Andree
  2001-08-03  9:57                 ` Christoph Hellwig
  2001-08-04  7:55                 ` Eric W. Biederman
  0 siblings, 2 replies; 77+ messages in thread
From: Matthias Andree @ 2001-08-03  8:39 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Matthias Andree, Paul Jakma, linux-kernel

On Fri, 03 Aug 2001, Eric W. Biederman wrote:

> Actually given that this thread keeps coming up, but no one does anything
> about it.  I'm tempted to suggest we remove chatrr +S support from ext2.
> Then there will be enough pain that someone will fix the MTA instead of
> moaning that kernel is slow...

They'd just drop Linux from the list of supported OS's, Linux will
disappoint people who trusted it, nothing is gained. Deliberate breakage
will not happen, because it would not help anyone except people with
twisted minds.

NO-ONE, including you, has come up with SERIOUS objections against a
dirsync option, except "is it really so much slower than chattr +S? show
figures" -- ext3 is being tuned to be fast in spite of chattr +S.

Reconsider your position.

Stop trolling please.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:37       ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
                           ` (2 preceding siblings ...)
       [not found]         ` <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>
@ 2001-08-03  8:30         ` Stephen C. Tweedie
  2001-08-03 18:28           ` Matthias Andree
  2001-08-03  8:50         ` David Weinehall
  4 siblings, 1 reply; 77+ messages in thread
From: Stephen C. Tweedie @ 2001-08-03  8:30 UTC (permalink / raw)
  To: Daniel Phillips, Stephen C. Tweedie, linux-kernel

Hi,

On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:

> So this part is covered.
> 
> The other thing is, that Linux is the only known system that does
> asynchronous rename/link/unlink/symlink -- people have claimed it might
> not be the only one, but failed to name systems.

Not true.  There are tons of others.

The issue was that synchronous directory updates are *optional* on
many systems (Linux included), but that Linux's support for that is
really inefficient since it ends up syncing file metadata updates too
(and it's much more efficient to use fsync for that.)

> Still, some people object to a dirsync mount option.

Who?  People who have discussed this in the past have certainly not
objected to my knowledge.  It would clearly help situations like this
(as would a dirsync chattr option.)

> > The prescription for symlinks is, if you want them safely on disk you 
> > have to explicitly fsync the containing directory.
> 
> Yes, and it doesn't matter, since MTAs don't use symlinks (symlinks
> waste inodes on most systems).

Irrelevant.   We're talking about what makes sensible semantics, not
what assumptions any specific application makes.  It makes no sense to
say that dirsync won't affect symlinks just because some existing
applications don't rely on that!

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found]           ` <20010803021642.B9845@emma1.emma.line.org>
@ 2001-08-03  7:03             ` Eric W. Biederman
  2001-08-03  8:39               ` Matthias Andree
  0 siblings, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2001-08-03  7:03 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Paul Jakma, linux-kernel

Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> On Fri, 03 Aug 2001, Paul Jakma wrote:
> 
> > if the prime directive of MTAs is data integrity paranoia, then
> > surely the best assumption for an MTA to make is that
> > rename/link/unlink/symlink /are/ asynchronous in the general case?
> 
> They do on Linux, use chattr +S, and are much slower than e. g. on
> FreeBSD. Well. Not that I'd written THAT for the first time...

Actually given that this thread keeps coming up, but no one does anything
about it.  I'm tempted to suggest we remove chatrr +S support from ext2.
Then there will be enough pain that someone will fix the MTA instead of
moaning that kernel is slow...

That should be an easy patch to make...

Eric

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 22:18             ` Andreas Dilger
@ 2001-08-02 23:11               ` Matthias Andree
       [not found]               ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
  1 sibling, 0 replies; 77+ messages in thread
From: Matthias Andree @ 2001-08-02 23:11 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Matthias Andree, Alexander Viro, Daniel Phillips,
	Stephen C. Tweedie, linux-kernel

On Thu, 02 Aug 2001, Andreas Dilger wrote:

> > open -> asynchronous, but filename synched on fsync()
> > rename/link/unlink(/symlink) -> synchronous
> > 
> > This way, you never need to fsync() the directory, so you never sync()
> > entries of temporary files. You never lose important files (because the
> > application uses fsync() and the OS synchs rename/link etc.).
> 
> Do you read what you are writing?  How can a "synchronous" operation for
> rename/link/unlink/symlink NOT also write out "temporary" files in the
> same directory?  How does calling fsync() on the directory IF YOU REQUIRE
> SYNCHRONOUS DIRECTORY OPERATIONS differ from making the specific operations
> synchronous from within the kernel???

Can people please try to understand? Can people please start to THINK
before flaming?

I did not say that open() is to be synchronous. I did not write ANYTHING
of fsync()ing directories, I'm trying to get rid of this requirement.

Thus, if the kernel does rename/link synchronously, you'd never ever
fsync() a directory. To synch a filename to disk, you'd just fsync() the
filedescriptor (with a SUS compliant system, that is, i. e. ext3 or
reiserfs, but not ext2).

Now, if someone opens a temporary file, and nukes it later -- unlink()
--, and doesn't want it visible, he never calls fsync() for the file.

However, if some other process then fsync()s the directory, you start
synching the temporary file dirent -> unnecessary, is nuked later on
with an unlink().

That's why fsync() on the directory is on no account the minimum work.

> The only difference I can see is that making these specific operations
> ALWAYS be synchronous hurts the common case when they can be async (see
> Solaris UFS vs. Linux benchmark elsewhere in this thread), while requiring
> an fsync() on the directory == only synchronous operation when it is
> actually needed, and no "extra" performance hit.

In case you haven't noticed, this is about reliability without need to
fsync() the directory that doesn't all belong to your single, stupid
process but may have lots of asynchronous data of other processes -
temporary files for instance. You synch() that as well, which is
unnecessary and brings down other processes' performance.

In case you haven't noticed the other issue:

The whole thread is a FEATURE REQUEST for a dirsync mount option, for
MTAs and other software which requires reliable file systems, where the
name is negotiable. It aims to REDUCE OVERHEAD since chattr +S which is
the only workaround for synch-dirs - and it synchs synchronous files and
writes as well, and rendering things slower than necessary, since
write() can be buffered until you fsync() (and you want that to cut off
seek times).

Call the option bsd_slow_dirs if you like, I don't care. Given the
option, the administrator/user has the choice, currently, he hasn't. He
cannot possibly change all applications ported from other Unices.

Note: hindering this option doesn't get Linux anywhere. Pure file
system benchmarks are not worth a single bit of entropy unless Linux is
benchmarked chattr +S -- it's unreliable otherwise.

I cannot remember how often I explained this during the course of this
thread. Every other day, some ignorant comes out of its cavern and
discusses the whole thing over and over again.

And, once again, fsync()ing the directory is not an option for portable
applications. It's unnecessary on every other system (until someone
shows a production-ready system which by default has asynchronous
directory updates as well, but no-one has so far.)

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 18:47           ` Matthias Andree
@ 2001-08-02 22:18             ` Andreas Dilger
  2001-08-02 23:11               ` Matthias Andree
       [not found]               ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
       [not found]             ` <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>
  1 sibling, 2 replies; 77+ messages in thread
From: Andreas Dilger @ 2001-08-02 22:18 UTC (permalink / raw)
  To: Matthias Andree
  Cc: Alexander Viro, Daniel Phillips, Stephen C. Tweedie, linux-kernel

Matthais Andree writes:
> fsync()ing the dir is not the minimal work possible, if e. g. temporary
> files are open that don't need their names synched. Fsync()ing the
> directory syncs also these temporary file NAMES that other processes may
> have open (but that they unlink rather than fsync()).
> 
> Assume:
> 
> open -> asynchronous, but filename synched on fsync()
> rename/link/unlink(/symlink) -> synchronous
> 
> This way, you never need to fsync() the directory, so you never sync()
> entries of temporary files. You never lose important files (because the
> application uses fsync() and the OS synchs rename/link etc.).

Do you read what you are writing?  How can a "synchronous" operation for
rename/link/unlink/symlink NOT also write out "temporary" files in the
same directory?  How does calling fsync() on the directory IF YOU REQUIRE
SYNCHRONOUS DIRECTORY OPERATIONS differ from making the specific operations
synchronous from within the kernel???

The only difference I can see is that making these specific operations
ALWAYS be synchronous hurts the common case when they can be async (see
Solaris UFS vs. Linux benchmark elsewhere in this thread), while requiring
an fsync() on the directory == only synchronous operation when it is
actually needed, and no "extra" performance hit.

The only slight point of contention is if you have very large directories
which span several filesystem blocks, in which case it _would_ be possible
to write out some blocks synchronously, while leaving other blocks dirty.
In practise however, you will either only be modifying a small number of
blocks (at the end of the directory) because an MTA usually only creates
files and doesn't delete them, and the actual speed of syncing several
blocks at one time is not noticably different than syncing only one.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:37       ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
  2001-08-02 18:35         ` Alexander Viro
@ 2001-08-02 19:47         ` Bill Rugolsky Jr.
  2001-08-03 18:22           ` Matthias Andree
       [not found]         ` <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>
                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 77+ messages in thread
From: Bill Rugolsky Jr. @ 2001-08-02 19:47 UTC (permalink / raw)
  To: Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> The other thing is, that Linux is the only known system that does
> asynchronous rename/link/unlink/symlink -- people have claimed it might
> not be the only one, but failed to name systems.
> 
> So we need to assume that Linux is the only system that does
> asynchronous rename/link/unlink/symlink, however a directory fsync() is
> believed to be rather expensive.
> 
> Still, some people object to a dirsync mount option. But this has been
> the actual reason for the thread - MTA authors are refusing to pamper
> Linux and use chattr +S instead which gives unnecessary (premature) sync
> operations on write() - but MTAs know how to fsync().

Let's inject a little reality into this discussion.  Filesystems are used
for something other than running MTA's written by stubborn "purists".

Solaris: Dell 600 MHz PIII 128MB RAM, largely quiescent:
         Solaris 8 mu4, UFS with logging

Linux:   VA Linux 800 MHZ PIII, 128MB RAM, largely quiescent
         RedHat Linux 7.1 w/ kernel-2.4.6-2.4 (2.4.6-ac5 + ext3-0.9.3).

660MB XFree86-4.1 build tree, cache primed with du -s in each case.

Here's something that we developers probably all do frequently: copy a
tree using hard links, so that we can patch it.

[solaris] find . | wc     
   33027   33027 1251671
[solaris] time find . -depth | cpio -pdul ../foo
0 blocks
 363.46s real    0.84s user   10.13s system 

Plain ext2:

[linux]# time find . -depth | cpio -pdul ../foo
0 blocks

real    0m3.823s user    0m0.240s sys     0m3.570s

Mounted ext3, ordered data mode.

[linux] time find . -depth | cpio -pdul ../foo
0 blocks

real    0m5.106s user    0m0.200s sys     0m3.700s

Mounted ext3, -o sync:

[root@ead51 bar]# time find . -depth | cpio -pdul ../foo
0 blocks

real    1m28.483s user    0m0.470s sys     0m4.410s 

=====================================================

Solaris8 UFS:   363.5 seconds
ext2:             3.8 seconds
ext3:             5.1 seconds
ext3 -o sync:    88.5 seconds

Got it?

Obviously, the last is the result of the poor interaction
of ext3+sync in 0.9.3, but Andrew Morton has already fixed that.
I will try again with 0.9.5 when I have a chance to upgrade that
machine.

I have no idea where BSD falls, but the basic point stands:  unused
features should not penalize other applications.  Andrew Morton has
figured out how to do this efficiently with ext3, and many kudos to him
for doing the work.  Absent that, why should I have to go get a cup of
coffee every time I want to patch a tree, just so some MTA can make
naive assumptions?

Regards,

   Bill Rugolsky

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 18:35         ` Alexander Viro
@ 2001-08-02 18:47           ` Matthias Andree
  2001-08-02 22:18             ` Andreas Dilger
       [not found]             ` <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>
  0 siblings, 2 replies; 77+ messages in thread
From: Matthias Andree @ 2001-08-02 18:47 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Matthias Andree, Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Thu, 02 Aug 2001, Alexander Viro wrote:

> How the fuck it's expensive? It does _exactly_ the same as file fsync() -
> literally the same code. It doesn't write blocks that don't belong to
> directory. It doesn't write blocks that are clean. IOW, it does the
> minimal work possible.

fsync()ing the dir is not the minimal work possible, if e. g. temporary
files are open that don't need their names synched. Fsync()ing the
directory syncs also these temporary file NAMES that other processes may
have open (but that they unlink rather than fsync()).

Assume:

open -> asynchronous, but filename synched on fsync()
rename/link/unlink(/symlink) -> synchronous

This way, you never need to fsync() the directory, so you never sync()
entries of temporary files. You never lose important files (because the
application uses fsync() and the OS synchs rename/link etc.).

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:37       ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
@ 2001-08-02 18:35         ` Alexander Viro
  2001-08-02 18:47           ` Matthias Andree
  2001-08-02 19:47         ` Bill Rugolsky Jr.
                           ` (3 subsequent siblings)
  4 siblings, 1 reply; 77+ messages in thread
From: Alexander Viro @ 2001-08-02 18:35 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel



On Thu, 2 Aug 2001, Matthias Andree wrote:

> asynchronous rename/link/unlink/symlink, however a directory fsync() is
> believed to be rather expensive.

How the fuck it's expensive? It does _exactly_ the same as file fsync() -
literally the same code. It doesn't write blocks that don't belong to
directory. It doesn't write blocks that are clean. IOW, it does the
minimal work possible.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:26     ` ext3-2.4-0.9.4 Daniel Phillips
@ 2001-08-02 17:37       ` Matthias Andree
  2001-08-02 18:35         ` Alexander Viro
                           ` (4 more replies)
  0 siblings, 5 replies; 77+ messages in thread
From: Matthias Andree @ 2001-08-02 17:37 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Stephen C. Tweedie, linux-kernel

On Thu, 02 Aug 2001, Daniel Phillips wrote:

[file name must be flushed on fsync()]
> I don't know why it is hard or inefficient to implement this at the VFS 
> level, though I'm sure there is a reason or this thread wouldn't 
> exist.  Stephen, perhaps you could explain for the record why sys_fsync 
> can't just walk the chain of dentry parent links doing fdatasync?  Does 
> this create VFS or Ext3 locking problems?  Or maybe it repeats work 
> that Ext3 is already supposed to have done?

Well, the course was that I asked whether ext3 would do synchronous
directory updates, and some people jumped in and said that one should
fsync() the parent directory, however, since we figure from SUS, that's
invalid.

After some forth and back, we finally figured that at least ext2 is
implementing fsync() improperly.

So this part is covered.

The other thing is, that Linux is the only known system that does
asynchronous rename/link/unlink/symlink -- people have claimed it might
not be the only one, but failed to name systems.

So we need to assume that Linux is the only system that does
asynchronous rename/link/unlink/symlink, however a directory fsync() is
believed to be rather expensive.

Still, some people object to a dirsync mount option. But this has been
the actual reason for the thread - MTA authors are refusing to pamper
Linux and use chattr +S instead which gives unnecessary (premature) sync
operations on write() - but MTAs know how to fsync().

> The prescription for symlinks is, if you want them safely on disk you 
> have to explicitly fsync the containing directory.

Yes, and it doesn't matter, since MTAs don't use symlinks (symlinks
waste inodes on most systems).

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 77+ messages in thread

end of thread, other threads:[~2001-08-09 13:25 UTC | newest]

Thread overview: 77+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <0108030507330F.00440@starship>
     [not found] ` <Pine.GSO.4.21.0108022312211.1494-100000@weyl.math.psu.edu>
2001-08-03 13:09   ` intermediate summary of ext3-2.4-0.9.4 thread Daniel Phillips
2001-08-03 14:43     ` Horst von Brand
2001-08-03 17:49     ` Mike Castle
2001-08-04  3:23       ` Daniel Phillips
2001-08-03 18:08     ` Alexander Viro
2001-08-03 18:26       ` Daniel Phillips
2001-08-03 18:53         ` Alexander Viro
2001-08-03 20:50           ` Daniel Phillips
2001-08-04  3:43           ` Matthias Andree
2001-08-03 18:34       ` Linus Torvalds
2001-08-03 22:01         ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch) Chris Wedgwood
2001-08-03 22:33           ` Alexander Viro
2001-08-03 23:16             ` Chris Wedgwood
2001-08-03 23:23               ` Alexander Viro
2001-08-04  3:53                 ` Matthias Andree
2001-08-04  5:48                   ` Alexander Viro
2001-08-03 22:45           ` Alexander Viro
2001-08-03 23:09             ` Chris Wedgwood
2001-08-03 23:15               ` Alexander Viro
2001-08-03 23:20                 ` Chris Wedgwood
2001-08-03 23:25                   ` Alexander Viro
2001-08-03 23:35                     ` Chris Wedgwood
2001-08-03 23:41                       ` Alexander Viro
2001-08-03 23:46                         ` Chris Wedgwood
2001-08-03 23:53                           ` Alexander Viro
2001-08-04  7:45                             ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semanticchange patch) Hans Reiser
2001-08-04 18:31                               ` Chris Wedgwood
2001-08-05  1:47                                 ` Alexander Viro
2001-08-08 17:22                                   ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semanticchangepatch) Hans Reiser
2001-08-03 23:42                       ` [PATCH] 2.4.8-pre3 fsync entire path (+reiserfs fsync semantic change patch) Alan Cox
2001-08-03 23:44                         ` Chris Wedgwood
2001-08-04  1:08           ` Andrew Morton
2001-08-04  1:19             ` Chris Wedgwood
2001-08-04  1:45               ` Andrew Morton
2001-08-04  4:04                 ` Matthias Andree
2001-08-04 18:30                   ` Chris Wedgwood
2001-08-05 12:15                     ` Matthias Andree
2001-08-05 12:32                       ` Chris Wedgwood
2001-08-05 13:02                         ` Matti Aarnio
2001-08-04 21:07                   ` Andrew Morton
2001-08-04 21:07               ` Andrew Morton
2001-08-05  7:25                 ` Chris Wedgwood
2001-08-09 13:25                 ` Matthias Andree
2001-08-04 17:35           ` Jan Harkes
2001-08-04 18:18             ` Chris Wedgwood
2001-08-06 15:02           ` Chris Mason
2001-08-06 20:48             ` Chris Wedgwood
2001-08-03 22:29       ` Anton Altaparmakov
2001-08-03 23:06         ` Chris Wedgwood
2001-08-06 15:23         ` looking for resources for designing loopback filesystem dave-mlist
2001-08-06 16:11           ` Ville Herva
2001-08-03 18:36   ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
2001-08-03 19:16     ` Alexander Viro
2001-08-04  3:40       ` fdatasync(2) is also there (was: intermediate summary of ext3-2.4-0.9.4 thread) Matthias Andree
2001-08-05  0:28         ` Mike Castle
2001-08-03 20:25 intermediate summary of ext3-2.4-0.9.4 thread Sam James
     [not found] <0108030354130E.00440@starship>
     [not found] ` <200108030207.f7326OpR003086@sleipnir.valparaiso.cl>
2001-08-03 18:34   ` Matthias Andree
  -- strict thread matches above, loose matches on Subject: below --
2001-07-26  7:34 ext3-2.4-0.9.4 Andrew Morton
2001-08-01 16:02 ` ext3-2.4-0.9.4 Stephen C. Tweedie
2001-08-02  9:03   ` ext3-2.4-0.9.4 Matthias Andree
2001-08-02 17:26     ` ext3-2.4-0.9.4 Daniel Phillips
2001-08-02 17:37       ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
2001-08-02 18:35         ` Alexander Viro
2001-08-02 18:47           ` Matthias Andree
2001-08-02 22:18             ` Andreas Dilger
2001-08-02 23:11               ` Matthias Andree
     [not found]               ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
2001-08-03  9:16                 ` Matthias Andree
     [not found]             ` <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>
     [not found]               ` <20010803021406.A9845@emma1.emma.line.org>
2001-08-03 16:20                 ` Jan Harkes
2001-08-03 22:48                 ` Andreas Dilger
2001-08-02 19:47         ` Bill Rugolsky Jr.
2001-08-03 18:22           ` Matthias Andree
     [not found]         ` <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>
     [not found]           ` <20010803021642.B9845@emma1.emma.line.org>
2001-08-03  7:03             ` Eric W. Biederman
2001-08-03  8:39               ` Matthias Andree
2001-08-03  9:57                 ` Christoph Hellwig
2001-08-04  7:55                 ` Eric W. Biederman
2001-08-03  8:30         ` Stephen C. Tweedie
2001-08-03 18:28           ` Matthias Andree
2001-08-03  8:50         ` David Weinehall
2001-08-03 18:31           ` Matthias Andree
2001-08-03 19:59           ` Albert D. Cahalan
2001-08-03 19:54             ` Gregory Maxwell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).