RE: intermediate summary of ext3-2.4-0.9.4 thread

* RE: intermediate summary of ext3-2.4-0.9.4 thread
@ 2001-08-03 20:25 Sam James
  0 siblings, 0 replies; 34+ messages in thread
From: Sam James @ 2001-08-03 20:25 UTC (permalink / raw)
  To: Albert D. Cahalan, tao; +Cc: phillips, sct, linux-kernel

>
>This is just completely true. One wonders why we seem to enjoy
>getting screwed this way. We shouldn't be patching these MTAs or
>hacking Linux to act like BSD. We should be avoiding these MTAs.
>
>Somebody can create a big MTA list, listing the good and bad ones.
>Then we get the Linux-hostile MTAs out of the Linux distributions,
>demanding compliance like we do for filesystem layout. We also hunt
>down Linux-related web pages that mention these MTAs and get the
>pages changed or removed. The point is to make these MTAs just
>disappear, never to be seen again. Nice MTAs get promoted.


You're not related to Bill Gates, are you?


[parent not found: <0108030507330F.00440@starship>]

[parent not found: <Pine.GSO.4.21.0108022312211.1494-100000@weyl.math.psu.edu>]

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found] ` <Pine.GSO.4.21.0108022312211.1494-100000@weyl.math.psu.edu>
@ 2001-08-03 13:09   ` Daniel Phillips
  2001-08-03 14:43     ` Horst von Brand
                       ` (2 more replies)
  2001-08-03 18:36   ` Matthias Andree
  1 sibling, 3 replies; 34+ messages in thread
From: Daniel Phillips @ 2001-08-03 13:09 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Horst von Brand, linux-kernel

On Friday 03 August 2001 05:13, Alexander Viro wrote:
> On Fri, 3 Aug 2001, Daniel Phillips wrote:
> > There is only one chain of directories from the fd's dentry up to
> > the root, that's the one to sync.
>
> You forgot ".. at any given moment". IOW, operation you propose is
> inherently racy. You want to do that - you do that in userland.

Are you saying that there may not be a ".." some of the time?  Or just 
that it may spontaneously be relinked?  If it does spontaneously change 
it doesn't matter, you have still made sure there is access by at least 
one path.

The trouble with doing this in userland is, the locked chain of dcache 
entries isn't there.

--
Daniel


* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 13:09   ` Daniel Phillips
@ 2001-08-03 14:43     ` Horst von Brand
  2001-08-03 17:49     ` Mike Castle
  2001-08-03 18:08     ` Alexander Viro
  2 siblings, 0 replies; 34+ messages in thread
From: Horst von Brand @ 2001-08-03 14:43 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

Daniel Phillips <phillips@bonn-fries.net> said:
> On Friday 03 August 2001 05:13, Alexander Viro wrote:

[...]

> > You forgot ".. at any given moment". IOW, operation you propose is
> > inherently racy. You want to do that - you do that in userland.

> Are you saying that there may not be a ".." some of the time?  Or just 
> that it may spontaneously be relinked?  If it does spontaneously change 
> it doesn't matter, you have still made sure there is access by at least 
> one path.

Think "mv thisdir somewhereelse"
-- 
Dr. Horst H. von Brand                Usuario #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513


* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 13:09   ` Daniel Phillips
  2001-08-03 14:43     ` Horst von Brand
@ 2001-08-03 17:49     ` Mike Castle
  2001-08-04  3:23       ` Daniel Phillips
  2001-08-03 18:08     ` Alexander Viro
  2 siblings, 1 reply; 34+ messages in thread
From: Mike Castle @ 2001-08-03 17:49 UTC (permalink / raw)
  To: linux-kernel

On Fri, Aug 03, 2001 at 03:09:06PM +0200, Daniel Phillips wrote:
> On Friday 03 August 2001 05:13, Alexander Viro wrote:
> > On Fri, 3 Aug 2001, Daniel Phillips wrote:
> > > There is only one chain of directories from the fd's dentry up to
> > > the root, that's the one to sync.
> >
> > You forgot ".. at any given moment". IOW, operation you propose is
> > inherently racy. You want to do that - you do that in userland.
> 
> Are you saying that there may not be a ".." some of the time?  Or just 
> that it may spontaneously be relinked?  If it does spontaneously change 
> it doesn't matter, you have still made sure there is access by at least 
> one path.


I read the ".." as a typo for "..."  As in Al was suggesting the sentence
should read "There is only one chain of directories at any given moment
from the fd's dentry up to the root...."

At least, that was my reading on it.

mrc
-- 
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc


* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 17:49     ` Mike Castle
@ 2001-08-04  3:23       ` Daniel Phillips
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Phillips @ 2001-08-04  3:23 UTC (permalink / raw)
  To: Mike Castle, linux-kernel

On Friday 03 August 2001 19:49, Mike Castle wrote:
> On Fri, Aug 03, 2001 at 03:09:06PM +0200, Daniel Phillips wrote:
> > On Friday 03 August 2001 05:13, Alexander Viro wrote:
> > > On Fri, 3 Aug 2001, Daniel Phillips wrote:
> > > > There is only one chain of directories from the fd's dentry up to
> > > > the root, that's the one to sync.
> > >
> > > You forgot ".. at any given moment". IOW, operation you propose is
> > > inherently racy. You want to do that - you do that in userland.
> >
> > Are you saying that there may not be a ".." some of the time?  Or just
> > that it may spontaneously be relinked?  If it does spontaneously change
> > it doesn't matter, you have still made sure there is access by at least
> > one path.
>
> I read the ".." as a typo for "..."  As in Al was suggesting the sentence
> should read "There is only one chain of directories at any given moment
> from the fd's dentry up to the root...."

Heh, after some practice you get good at decoding Alspeak, it's not harder
than MIX.

--
Daniel



* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 13:09   ` Daniel Phillips
  2001-08-03 14:43     ` Horst von Brand
  2001-08-03 17:49     ` Mike Castle
@ 2001-08-03 18:08     ` Alexander Viro
  2001-08-03 18:26       ` Daniel Phillips
  2001-08-03 18:34       ` Linus Torvalds
  2 siblings, 2 replies; 34+ messages in thread
From: Alexander Viro @ 2001-08-03 18:08 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Horst von Brand, linux-kernel

On Fri, 3 Aug 2001, Daniel Phillips wrote:

> Are you saying that there may not be a ".." some of the time?  Or just 
> that it may spontaneously be relinked?  If it does spontaneously change 
> it doesn't matter, you have still made sure there is access by at least 
> one path.
> 
> The trouble with doing this in userland is, the locked chain of dcache 
> entries isn't there.

There is no _locked_ chain. And if you want to grab the locks on all
ancestors - think again. It means sorting the inodes by address _and_
relocking if any of them had been moved while you were locking the
previous ones. I absolutely refuse to add such crap to the tree and I
seriously suspect that Linus and Alan will do the same.


* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 18:08     ` Alexander Viro
@ 2001-08-03 18:26       ` Daniel Phillips
  2001-08-03 18:53         ` Alexander Viro
  2001-08-03 18:34       ` Linus Torvalds
  1 sibling, 1 reply; 34+ messages in thread
From: Daniel Phillips @ 2001-08-03 18:26 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Horst von Brand, linux-kernel

On Friday 03 August 2001 20:08, Alexander Viro wrote:
> On Fri, 3 Aug 2001, Daniel Phillips wrote:
> > Are you saying that there may not be a ".." some of the time?  Or just
> > that it may spontaneously be relinked?  If it does spontaneously change
> > it doesn't matter, you have still made sure there is access by at least
> > one path.
> >
> > The trouble with doing this in userland is, the locked chain of dcache
> > entries isn't there.
>
> There is no _locked_ chain.

Locked as in can't be destroyed (refcount) not i_sem or such, sorry for the
loose usage.

> And if you want to grab the locks on all
> ancestors - think again. It means sorting the inodes by address _and_
> relocking if any of them had been moved while you were locking the
> previous ones. I absolutely refuse to add such crap to the tree and I
> seriously suspect that Linus and Alan will do the same.

--
Daniel


* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 18:26       ` Daniel Phillips
@ 2001-08-03 18:53         ` Alexander Viro
  2001-08-03 20:50           ` Daniel Phillips
  2001-08-04  3:43           ` Matthias Andree
  0 siblings, 2 replies; 34+ messages in thread
From: Alexander Viro @ 2001-08-03 18:53 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Horst von Brand, linux-kernel

On Fri, 3 Aug 2001, Daniel Phillips wrote:

> > There is no _locked_ chain.
> 
> Locked as in can't be destroyed (refcount) not i_sem or such, sorry for the
> loose usage.

Sigh... You need i_sem for fsync(). Moreover, there is no guarantee that the
set of objects you sync has _anything_ to do with the path by the time you
start syncing the second one. Application has information about the use of
parts of tree it's interested in. Kernel doesn't. Notice that all this
wankage was full of "oh, but MTA doesn't care for symlinks", "oh, but MTA
doesn't deal with parents renamed", ad nauseam. You know what it means?
Right, that kernel shouldn't try to second-guess the userland. Application
knows what fs objects it wants synced. Kernel provides a primitive for
syncing an object and leaves the choice of objects to sync to application.

Folks, putting policy into the kernel is Wrong(tm). And that's precisely
what you are advocating here.


* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 18:53         ` Alexander Viro
@ 2001-08-03 20:50           ` Daniel Phillips
  2001-08-04  3:43           ` Matthias Andree
  1 sibling, 0 replies; 34+ messages in thread
From: Daniel Phillips @ 2001-08-03 20:50 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Horst von Brand, linux-kernel

On Friday 03 August 2001 20:53, Alexander Viro wrote:
> On Fri, 3 Aug 2001, Daniel Phillips wrote:
> > > There is no _locked_ chain.
> >
> > Locked as in can't be destroyed (refcount) not i_sem or such, sorry for
> > the loose usage.
>
> Sigh... You need i_sem for fsync().

We can drop it before syncing the parent, the point is, the dcache entry
stays.

> Moreover, there is no guarantee that the
> set of objects you sync has _anything_ to do with the path by the time
> you start syncing the second one.

OK, you win, I'll provide this example:

	   A				   B

  echo hello >/a/b/c/d
  fsync(d)->
    fsync_one(d)
				rename /a/b/c/d as /x/y/z/d
      fsync_one(c)
        fsync_one(b)
          fsync_one(a)
            fsync_one(/)

where fsync_one looks roughly like our current sys_fsync.

So we fsynced and we still might have lost the path to d.

The moral of the story: if the filesystem isn't designed to provide a
guaranteed commit on rename then we shouldn't try to fix the VFS so it
sorta does.

> Application has information about the use of
> parts of tree it's interested in. Kernel doesn't. Notice that all this
> wankage was full of "oh, but MTA doesn't care for symlinks", "oh, but MTA
> doesn't deal with parents renamed", ad nauseam. You know what it means?
> Right, that kernel shouldn't try to second-guess the userland. Application
> knows what fs objects it wants synced. Kernel provides a primitive for
> syncing an object and leaves the choice of objects to sync to application.
>
> Folks, putting policy into the kernel is Wrong(tm). And that's precisely
> what you are advocating here.

I needed an example where even a new, improved sys_fsync would fail to
do the right thing for Ext2 in the absence of specific coding in the
MTA.  So the MTA is either designed to handle dumb filesystems or it
isn't.  It's pretty much a moot point since we have four filesystems
now in a nearly usable state that can provide the needed commit
guarantees, which should, as has been pointed out in this thread,
provide MTAs with both faster and more reliable operation.

--
Daniel


* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 18:53         ` Alexander Viro
  2001-08-03 20:50           ` Daniel Phillips
@ 2001-08-04  3:43           ` Matthias Andree
  1 sibling, 0 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-04  3:43 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Daniel Phillips, Horst von Brand, linux-kernel

On Fri, 03 Aug 2001, Alexander Viro wrote:

> Sigh... You need i_sem for fsync(). Moreover, there is no guarantee that the
> set of objects you sync has _anything_ to do with the path by the time you
> start syncing the second one. Application has information about the use of
> parts of tree it's interested in. Kernel doesn't. Notice that all this
> wankage was full of "oh, but MTA doesn't care for symlinks", "oh, but MTA
> doesn't deal with parents renamed", ad nauseam. You know what it means?

I know a few MTAs, but none of them use symlinks, they always use hard
links (if at all).

MTAs don't rename parent directories.

> Folks, putting policy into the kernel is Wrong(tm). And that's precisely
> what you are advocating here.

Is putting options with a user-space interface (mount option, file
system option such as chattr) wrong as well? Is making fsync() more
compatible wrong?

-- 
Matthias Andree


* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 18:08     ` Alexander Viro
  2001-08-03 18:26       ` Daniel Phillips
@ 2001-08-03 18:34       ` Linus Torvalds
  1 sibling, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2001-08-03 18:34 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.GSO.4.21.0108031400590.3272-100000@weyl.math.psu.edu>,
Alexander Viro  <viro@math.psu.edu> wrote:
>
>
>On Fri, 3 Aug 2001, Daniel Phillips wrote:
>
>> Are you saying that there may not be a ".." some of the time?  Or just 
>> that it may spontaneously be relinked?  If it does spontaneously change 
>> it doesn't matter, you have still made sure there is access by at least 
>> one path.
>> 
>> The trouble with doing this in userland is, the locked chain of dcache 
>> entries isn't there.
>
>There is no _locked_ chain. And if you want to grab the locks on all
>ancestors - think again. It means sorting the inodes by address _and_
>relocking if any of them had been moved while you were locking the
>previous ones. I absolutely refuse to add such crap to the tree and I
>seriously suspect that Linus and Alan will do the same.

Note that while there is no "absolutely correct" thing here, I think
the right thing to do (as in "do what the user _expects_ you to do")
would be fairly simple to implement with a simple

	fsync(int fd)
	{
		dentry = fdget(fd);
		do_fsync(dentry);
		for (;;) {
			tmp = dentry;
			dentry = dentry->d_parent;
			if (dentry == tmp)
				break;
			do_fdatasync(dentry);
		}
	}

Add dcount increments as needed (and they _are_ needed, to make sure
that we don't hold on to a dentry while the child has been moved
somewhere else and the dentry now has a zero count). And we only need to
sync up to the closest mount-point, not the global root.

Does this guarantee that we fsync the whole path in the presence of
concurrent renames? No. That's a user problem, why should we care? He
should fsync his other renames too, he didn't ask us to fsync them.

And we don't care about any other paths that the file may have. We're
syncing the path that the user opened, and no others. Again, if the user
opened another path, he should have synced _that_ one.

		Linus


* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found] ` <Pine.GSO.4.21.0108022312211.1494-100000@weyl.math.psu.edu>
  2001-08-03 13:09   ` Daniel Phillips
@ 2001-08-03 18:36   ` Matthias Andree
  2001-08-03 19:16     ` Alexander Viro
  1 sibling, 1 reply; 34+ messages in thread
From: Matthias Andree @ 2001-08-03 18:36 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Daniel Phillips, Horst von Brand, linux-kernel

On Thu, 02 Aug 2001, Alexander Viro wrote:

> On Fri, 3 Aug 2001, Daniel Phillips wrote:
> 
> > There is only one chain of directories from the fd's dentry up to the 
> > root, that's the one to sync.
> 
> You forgot ".. at any given moment". IOW, operation you propose is inherently
> racy. You want to do that - you do that in userland.

Applications usually protect their playgrounds - separate uid for
instance. If only the application has access to that area, and itself
does not trigger races, "at any given moment" is not a restriction.

-- 
Matthias Andree


* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 18:36   ` Matthias Andree
@ 2001-08-03 19:16     ` Alexander Viro
  0 siblings, 0 replies; 34+ messages in thread
From: Alexander Viro @ 2001-08-03 19:16 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Daniel Phillips, Horst von Brand, linux-kernel

On Fri, 3 Aug 2001, Matthias Andree wrote:

> Applications usually protect their playgrounds - separate uid for
> instance. If only the application has access to that area, and itself
> does not trigger races, "at any given moment" is not a restriction.

Bingo. The whole thing relies on second-guessing the application.
BTW, I can think of very legitimate cases when we want to create
a bunch of files, fsync them as we go and then fsync the directory
where they had been created. Application knows what and when should
be synced _and_ it has a way to ask kernel to sync an object.

BTW, symlinks may be a problem - they can't be opened and symlink
creation is asynchronous. And there are applications that _do_
care about them - for all they care, lost symlink is as bad as
the need to scan lost+found
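
The "create a bunch of files, fsync them as we go, then fsync the directory"
case can be sketched in userland as below. This is an illustration only: the
helper names write_synced() and create_batch() and the fixed payload are
assumptions, and a short write is simply treated as an error.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write `data` to a new file `name` inside the directory open at `dirfd`
 * and fsync the file itself. Returns 0 on success, -1 on error. */
static int write_synced(int dirfd, const char *name, const char *data)
{
	size_t len = strlen(data);
	int fd = openat(dirfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}

/* Create `count` files in `dir`, syncing each one as we go, then sync
 * the directory once so that all the new names are committed together. */
int create_batch(const char *dir, const char *const names[], int count)
{
	int dirfd, i, ret = -1;

	assert(dir != NULL && names != NULL);

	dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0)
		return -1;

	for (i = 0; i < count; i++)
		if (write_synced(dirfd, names[i], "payload\n") < 0)
			goto out;

	if (fsync(dirfd) == 0)
		ret = 0;
out:
	close(dirfd);
	return ret;
}
```

The point of the design is that the application decides which objects to
sync and when: one fsync() per file, and a single fsync() on the directory
at the end instead of one per file.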


[parent not found: <0108030354130E.00440@starship>]

[parent not found: <200108030207.f7326OpR003086@sleipnir.valparaiso.cl>]

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found] ` <200108030207.f7326OpR003086@sleipnir.valparaiso.cl>
@ 2001-08-03 18:34   ` Matthias Andree
  0 siblings, 0 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-03 18:34 UTC (permalink / raw)
  To: Horst von Brand; +Cc: Daniel Phillips, linux-kernel

On Thu, 02 Aug 2001, Horst von Brand wrote:

> > He wants to be able to set a bit on the directory that specifies such 
> > behaviour for all files in the directory.  I don't see what's wrong 
> > with that.
> 
> That there isn't THE directory in which the file resides. There might be
> several, and setting the bit on one of them at random and expect it to work
> is a _lot_ of work. For no real use.

Well, if the file resides in a directory without that flag, it's not
synched. The application or its installation process should take care of
that. Presumably the MTA knows where it writes its queue files, and
therefore which directories must have this flag set.


* ext3-2.4-0.9.4
@ 2001-07-26  7:34 Andrew Morton
  2001-08-01 16:02 ` ext3-2.4-0.9.4 Stephen C. Tweedie
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2001-07-26  7:34 UTC (permalink / raw)
  To: lkml, ext3-users

An update to the ext3 filesystem for 2.4 kernels is available at

	http://www.uow.edu.au/~andrewm/linux/ext3/

The diffs are against linux-2.4.7 and linux-2.4.6-ac5.

The changelog is there.  One rarely-occurring but oopsable bug
was fixed and several quite significant performance enhancements
have been made.  These are in addition to the performance fixes
which went into 0.9.3.

Ted has put out a prerelease of e2fsprogs-1.23 which supports
filesystem type `auto' in /etc/fstab, so it is now possible to
switch between ext3- and non-ext3-kernels without changing
any configuration.

It is recommended that users of earlier ext3 releases upgrade
to 0.9.4.

For people who are undertaking performance testing, it is perhaps
useful to point out that ext3 operates in one of three different
journalling modes, and that these modes have very different
functionality and very different performance characteristics.
Really, you need to test all three and balance the functionality
which each mode offers against the throughput which you obtain
in your application.

The modes are:

data=writeback

  This is classic metadata-only journalling.  File data is written
  back to the main fs lazily.  After a crash+recovery the fs's
  structural integrity is preserved, but the *contents* of files
  can and will contain old, stale data.  Potentially hundreds of
  megabytes of it.

  This is the fastest mode for normal filesystem applications.

data=ordered

  The fs ensures that file data is written into the main fs prior
  to committing its metadata.  Hence after a crash+recovery, your
  files will contain the correct data.

  This is the default operating mode and throughput is good. It
  adds about one second to a four minute kernel compile when
  compared with ext2.   Under heavier loads the difference
  becomes larger.

data=journal

  All data (as well as metadata) is written to the journal
  before it is released to the main fs for writeback.

  This is a specialised mode - for normal fs usage you're better
  off using ordered data, which has the same benefits of not corrupting
  data after crash+recovery.  However for applications which require
  synchronous operation such as mail spools and synchronously exported
  NFS servers, this can be a performance win.  I have seen dbench
  figures in this mode (where the files were opened O_SYNC) running
  at ten times the throughput of ext2.  Not that this is the expected
  benefit for other applications!

Looking at the above issues, one may initially think that the
post-recovery data corruption is a serious issue with writeback mode,
and that there are big advantages to using journalled or ordered data.

However, even in these modes the affected files may be shorter-than-expected
after recovery, because the app hadn't finished writing them yet.  And
usually, a truncated file is just as useless as one which contains
garbage - it needs to be deleted.

It's not really as simple as that - for small (< a few hundred k) files,
it tends to be the case that either the whole file is intact after a crash,
or none of it is.  This is because the journalling mechanism starts a
new transaction every five seconds, and a typical open/write/close operation
usually fits entirely inside this window.

There is also a security issue to be considered: a recovered writeback-mode
filesystem will expose other people's old data to unintended recipients.

Hopefully this description will help people make their deployment choices.
If not, assistance is available on the ext3-users@redhat.com mailing list.
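
For anyone selecting a mode programmatically rather than through mount(8) or
/etc/fstab, here is a minimal sketch using mount(2) on Linux. The helper
names ext3_data_option() and ext3_mount() are invented for illustration, the
device and mount point are whatever the caller passes in, and the call needs
root. Only the three option strings listed above are accepted.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Build the "data=<mode>" option string for one of the three journalling
 * modes described above. Returns 0, or -1 if the mode is unknown or the
 * buffer is too small. */
int ext3_data_option(const char *mode, char *out, size_t outlen)
{
	int n;

	assert(mode != NULL && out != NULL);

	if (strcmp(mode, "writeback") != 0 &&
	    strcmp(mode, "ordered") != 0 &&
	    strcmp(mode, "journal") != 0)
		return -1;

	n = snprintf(out, outlen, "data=%s", mode);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Mount `dev` on `dir` as ext3 in the given mode (requires root).
 * Returns the result of mount(2), or -1 with errno = EINVAL for an
 * unknown mode. */
int ext3_mount(const char *dev, const char *dir, const char *mode)
{
	char opts[32];

	if (ext3_data_option(mode, opts, sizeof opts) < 0) {
		errno = EINVAL;
		return -1;
	}
	return mount(dev, dir, "ext3", 0, opts);
}
```

When comparing throughput, the same test would be run three times, once per
mode, with the filesystem unmounted and remounted in between.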

-


* Re: ext3-2.4-0.9.4
@ 2001-08-01 16:02 ` Stephen C. Tweedie
  2001-08-02  9:03   ` ext3-2.4-0.9.4 Matthias Andree
  0 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 2001-08-01 16:02 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel; +Cc: Stephen Tweedie, Matthias Andree

Hi,

> Chase up to the root manually, because Linux' ext2 violates SUS v2
> fsync() (which requires meta data synched BTW)

Please quote chapter and verse --- my reading of SUS shows no such
requirement.  

fsync is required to force "all currently queued I/O operations
associated with the file indicated by file descriptor fildes to the
synchronised I/O completion state."  But as you should know, directory
entries and files are NOT the same thing in Unix/SUS.  

Are we expected to fsync the metadata belonging to just the file
itself?  Or all symlinks to the file?  Or all hard links?  Answer, as
best I can determine --- just the file.  That's all SUS talks about.
There can be many ways of reaching that file in the directory
hierarchy, or there can be none, but fsync() doesn't talk at all about
the status of those dirents after the sync.

> , as has been pointed out
> (and fixed in ReiserFS and ext3)?

ext3 happens to provide the guarantee, but that's coincidental and
does not imply that I think of it as being "fixed".  It's just changed
behaviour relative to ext2.

> So, please tell me why Single Unix Specification v2 specifies EIO for
> rename. Asynchronous I/O cannot possibly trigger immediate EIO.

Yes it can --- we may need to read metadata to complete the rename,
and such reads can fail.  

Cheers,
 Stephen


* Re: ext3-2.4-0.9.4
  2001-08-01 16:02 ` ext3-2.4-0.9.4 Stephen C. Tweedie
@ 2001-08-02  9:03   ` Matthias Andree
  2001-08-02 17:26     ` ext3-2.4-0.9.4 Daniel Phillips
  0 siblings, 1 reply; 34+ messages in thread
From: Matthias Andree @ 2001-08-02  9:03 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-kernel

On Wed, 01 Aug 2001, Stephen Tweedie wrote:

> > Chase up to the root manually, because Linux' ext2 violates SUS v2
> > fsync() (which requires meta data synched BTW)
> 
> Please quote chapter and verse --- my reading of SUS shows no such
> requirement.  
> 
> fsync is required to force "all currently queued I/O operations
> associated with the file indicated by file descriptor fildes to the
> synchronised I/O completion state."  But as you should know, directory
> entries and files are NOT the same thing in Unix/SUS.  

Read on: "All I/O operations are completed as defined for synchronised
I/O _file_ integrity completion.". To show what that means, see the
glossary.

http://www.opengroup.org/onlinepubs/007908799/xbd/glossary.html#tag_004_000_291

  "synchronised I/O data integrity completion

  [...]

  * For write, when the operation has been completed or diagnosed if
  unsuccessful.  The write is complete only when the data specified in
  the write request is successfully transferred and all file system
  information required to retrieve the data is successfully transferred.

  File attributes that are not necessary for data retrieval (access
  time, modification time, status change time) need not be successfully
  transferred prior to returning to the calling process.

  synchronised I/O file integrity completion

  Identical to a synchronised I/O data integrity completion with the
  addition that all file attributes relative to the I/O operation
  (including access time, modification time, status change time) will be
  successfully transferred prior to returning to the calling process."

As I understand it, the directory entry's st_ino is a file attribute
necessary for data retrieval and also contains the m/a/ctime, so it must
be flushed to disk on fsync() as well.

> There can be many ways of reaching that file in the directory
> hierarchy, or there can be none, but fsync() doesn't talk at all about
> the status of those dirents after the sync.

Well, if there's not a single dirent, you cannot retrieve the data, so
I'd assume at least one dirent needs to be flushed as well. If there's a
simple way to get unflushed dentries to disk (hard links included),
flush them. Not sure about symlinks, but since they don't share the
inode number, that might be rather difficult for the kernel (I didn't
check):

touch 1 ; ln 1 2 ; ln -s 1 3 ; ls -li

 303464 -rw-r--r--   2 emma     users           0 Aug  2 10:56 1
 303464 -rw-r--r--   2 emma     users           0 Aug  2 10:56 2
 303466 lrwxrwxrwx   1 emma     users           1 Aug  2 10:56 3 -> 1
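
The same check can be made from a program instead of reading ls -li output.
The sketch below is illustrative only: make_links() and same_inode() are
made-up helper names, and lstat() is used so that the symlink itself is
examined rather than the file it points to.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Recreate "touch 1 ; ln 1 2 ; ln -s 1 3" with caller-chosen names.
 * Returns 0 on success, -1 on error. */
int make_links(const char *file, const char *hard, const char *soft)
{
	int fd;

	assert(file != NULL && hard != NULL && soft != NULL);

	fd = open(file, O_WRONLY | O_CREAT, 0644);
	if (fd < 0)
		return -1;
	close(fd);

	if (link(file, hard) < 0)
		return -1;
	return symlink(file, soft);
}

/* Returns 1 if the two paths name the same inode on the same device
 * (without following a final symlink), 0 if not, -1 on error. */
int same_inode(const char *a, const char *b)
{
	struct stat sa, sb;

	assert(a != NULL && b != NULL);

	if (lstat(a, &sa) < 0 || lstat(b, &sb) < 0)
		return -1;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

After make_links("1", "2", "3"), same_inode("1", "2") reports a shared inode
for the hard link, while same_inode("1", "3") does not, matching the listing
above.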

-- 
Matthias Andree


* Re: ext3-2.4-0.9.4
  2001-08-02  9:03   ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-08-02 17:26     ` Daniel Phillips
  2001-08-02 17:37       ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Phillips @ 2001-08-02 17:26 UTC (permalink / raw)
  To: Matthias Andree, Stephen C. Tweedie; +Cc: linux-kernel

On Thursday 02 August 2001 11:03, Matthias Andree wrote:
> On Wed, 01 Aug 2001, Stephen Tweedie wrote:
> > Matthias Andree wrote:
> > > Chase up to the root manually, because Linux' ext2 violates SUS
> > > v2 fsync() (which requires meta data synched BTW)
> >
> > Please quote chapter and verse --- my reading of SUS shows no such
> > requirement.
> >
> > fsync is required to force "all currently queued I/O operations
> > associated with the file indicated by file descriptor fildes to the
> > synchronised I/O completion state."  But as you should know,
> > directory entries and files are NOT the same thing in Unix/SUS.
>
> Read on: "All I/O operations are completed as defined for
> synchronised I/O _file_ integrity completion.". To show what that
> means, see the glossary.
>
> http://www.opengroup.org/onlinepubs/007908799/xbd/glossary.html#tag_004_000_291
>
>   "synchronised I/O data integrity completion
>
>   [...]
>
>   * For write, when the operation has been completed or diagnosed if
>   unsuccessful.  The write is complete only when the data specified in
>   the write request is successfully transferred and all file system
>   information required to retrieve the data is successfully transferred.
>
>   File attributes that are not necessary for data retrieval (access
>   time, modification time, status change time) need not be successfully
>   transferred prior to returning to the calling process.
>
>   synchronised I/O file integrity completion
>
>   Identical to a synchronised I/O data integrity completion with the
>   addition that all file attributes relative to the I/O operation
>   (including access time, modification time, status change time) will be
>   successfully transferred prior to returning to the calling process."
>
> As I understand it, the directory entry's st_ino is a file attribute
> necessary for data retrieval and also contains the m/a/ctime, so it
> must be flushed to disk on fsync() as well.

I believe you've summarized the SUS requirements very well.  Apart 
from legalistic arguments, SUS quite clearly states that fsync should 
not return until you are sure of having recorded not only the file's 
data, but the access path to it.  I interpret this as being able to 
"access the file by its name", and being able to guess by looking in 
lost+found doesn't count.  I don't see the point in niggling about that.

So, it seems clear that an fsync which leaves any window of 
vulnerability where an interruption can leave a file unlinked is not 
SUS-compliant.

> > There can be many ways of reaching that file in the directory
> > hierarchy, or there can be none, but fsync() doesn't talk at all
> > about the status of those dirents after the sync.

This is a legalistic argument.  I don't think we should be looking for 
loopholes in SUS here.  To achieve SUS compliance there are two 
reasonable courses: "fix SUS" or "fix sys_fsync".  Since what SUS 
clearly wants here seems eminently reasonable, I'd suggest putting the 
energy that's currently going into this thread into fixing fsync 
instead.

> Well, if there's not a single dirent, you cannot retrieve the data,
> so I'd assume at least one dirent needs to be flushed as well. If
> there's a simple way to get unflushed dentries to disk (hard links
> included)...

*All* hard links?  No, there is no general way to do that.  However, 
any hard links[1] in the path used to open the file - yes.  There is 
always a chain of parent dentries held locked in the dcache for any 
open file.

I don't know why it is hard or inefficient to implement this at the VFS 
level, though I'm sure there is a reason or this thread wouldn't 
exist.  Stephen, perhaps you could explain for the record why sys_fsync 
can't just walk the chain of dentry parent links doing fdatasync?  Does 
this create VFS or Ext3 locking problems?  Or maybe it repeats work 
that Ext3 is already supposed to have done?

> ...flush them. Not sure about symlinks, but since they don't
> share the inode number, that might be rather difficult for the kernel
> (I didn't check)

The prescription for symlinks is, if you want them safely on disk you 
have to explicitly fsync the containing directory.

[1] In Ext2, all filename dirents are "hard links", i.e., there is no 
way to tell which of the two names is the original after creating a new 
hard link.

--
Daniel

* intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:26     ` ext3-2.4-0.9.4 Daniel Phillips
@ 2001-08-02 17:37       ` Matthias Andree
  2001-08-02 18:35         ` Alexander Viro
                           ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-02 17:37 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Stephen C. Tweedie, linux-kernel

On Thu, 02 Aug 2001, Daniel Phillips wrote:

[file name must be flushed on fsync()]
> I don't know why it is hard or inefficient to implement this at the VFS 
> level, though I'm sure there is a reason or this thread wouldn't 
> exist.  Stephen, perhaps you could explain for the record why sys_fsync 
> can't just walk the chain of dentry parent links doing fdatasync?  Does 
> this create VFS or Ext3 locking problems?  Or maybe it repeats work 
> that Ext3 is already supposed to have done?

Well, the course was that I asked whether ext3 would do synchronous
directory updates, and some people jumped in and said that one should
fsync() the parent directory. However, as far as we can tell from SUS,
requiring that is invalid.

After some back and forth, we finally figured that at least ext2 is
implementing fsync() improperly.

So this part is covered.

The other thing is, that Linux is the only known system that does
asynchronous rename/link/unlink/symlink -- people have claimed it might
not be the only one, but failed to name systems.

So we need to assume that Linux is the only system that does
asynchronous rename/link/unlink/symlink, however a directory fsync() is
believed to be rather expensive.

Still, some people object to a dirsync mount option. But this has been
the actual reason for the thread - MTA authors are refusing to pamper
Linux and use chattr +S instead which gives unnecessary (premature) sync
operations on write() - but MTAs know how to fsync().

> The prescription for symlinks is, if you want them safely on disk you 
> have to explicitly fsync the containing directory.

Yes, and it doesn't matter, since MTAs don't use symlinks (symlinks
waste inodes on most systems).

-- 
Matthias Andree

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:37       ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
@ 2001-08-02 18:35         ` Alexander Viro
  2001-08-02 18:47           ` Matthias Andree
  2001-08-02 19:47         ` Bill Rugolsky Jr.
                           ` (3 subsequent siblings)
  4 siblings, 1 reply; 34+ messages in thread
From: Alexander Viro @ 2001-08-02 18:35 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel



On Thu, 2 Aug 2001, Matthias Andree wrote:

> asynchronous rename/link/unlink/symlink, however a directory fsync() is
> believed to be rather expensive.

How the fuck it's expensive? It does _exactly_ the same as file fsync() -
literally the same code. It doesn't write blocks that don't belong to
directory. It doesn't write blocks that are clean. IOW, it does the
minimal work possible.


* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 18:35         ` Alexander Viro
@ 2001-08-02 18:47           ` Matthias Andree
  2001-08-02 22:18             ` Andreas Dilger
       [not found]             ` <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>
  0 siblings, 2 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-02 18:47 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Matthias Andree, Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Thu, 02 Aug 2001, Alexander Viro wrote:

> How the fuck it's expensive? It does _exactly_ the same as file fsync() -
> literally the same code. It doesn't write blocks that don't belong to
> directory. It doesn't write blocks that are clean. IOW, it does the
> minimal work possible.

fsync()ing the dir is not the minimal work possible, if e. g. temporary
files are open that don't need their names synched. Fsync()ing the
directory syncs also these temporary file NAMES that other processes may
have open (but that they unlink rather than fsync()).

Assume:

open -> asynchronous, but filename synched on fsync()
rename/link/unlink(/symlink) -> synchronous

This way, you never need to fsync() the directory, so you never sync()
entries of temporary files. You never lose important files (because the
application uses fsync() and the OS synchs rename/link etc.).
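
The sequence an MTA-style writer goes through can be sketched in C as
follows.  This is a sketch under assumptions: store_message() and
write_all() are hypothetical helpers and the ".tmp." prefix is an
arbitrary naming choice.  Step 3 is the directory fsync() that is needed
where rename() may be asynchronous; under the semantics assumed above
(synchronous rename/link/unlink) that step would be redundant.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes.  Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: store `data` as dir/name.  Returns 0 or -1. */
int store_message(const char *dir, const char *name,
                  const char *data, size_t len)
{
    char tmp[4096], dst[4096];
    int n;

    n = snprintf(tmp, sizeof tmp, "%s/.tmp.%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    n = snprintf(dst, sizeof dst, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof dst)
        return -1;

    /* 1. Create the temporary file and get its data onto disk. */
    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    if (write_all(fd, data, len) != 0 || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* 2. Move it to its final name. */
    if (rename(tmp, dst) != 0) {
        unlink(tmp);
        return -1;
    }

    /* 3. Flush the directory so the new name itself is on disk. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

Note that step 3 also flushes whatever other dirty entries the directory
holds, which is the cost being argued about here.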

-- 
Matthias Andree

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 18:47           ` Matthias Andree
@ 2001-08-02 22:18             ` Andreas Dilger
  2001-08-02 23:11               ` Matthias Andree
       [not found]               ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
       [not found]             ` <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>
  1 sibling, 2 replies; 34+ messages in thread
From: Andreas Dilger @ 2001-08-02 22:18 UTC (permalink / raw)
  To: Matthias Andree
  Cc: Alexander Viro, Daniel Phillips, Stephen C. Tweedie, linux-kernel

Matthias Andree writes:
> fsync()ing the dir is not the minimal work possible, if e. g. temporary
> files are open that don't need their names synched. Fsync()ing the
> directory syncs also these temporary file NAMES that other processes may
> have open (but that they unlink rather than fsync()).
> 
> Assume:
> 
> open -> asynchronous, but filename synched on fsync()
> rename/link/unlink(/symlink) -> synchronous
> 
> This way, you never need to fsync() the directory, so you never sync()
> entries of temporary files. You never lose important files (because the
> application uses fsync() and the OS synchs rename/link etc.).

Do you read what you are writing?  How can a "synchronous" operation for
rename/link/unlink/symlink NOT also write out "temporary" files in the
same directory?  How does calling fsync() on the directory IF YOU REQUIRE
SYNCHRONOUS DIRECTORY OPERATIONS differ from making the specific operations
synchronous from within the kernel???

The only difference I can see is that making these specific operations
ALWAYS be synchronous hurts the common case when they can be async (see
Solaris UFS vs. Linux benchmark elsewhere in this thread), while requiring
an fsync() on the directory == only synchronous operation when it is
actually needed, and no "extra" performance hit.

The only slight point of contention is if you have very large directories
which span several filesystem blocks, in which case it _would_ be possible
to write out some blocks synchronously, while leaving other blocks dirty.
In practice, however, you will only be modifying a small number of
blocks (at the end of the directory) because an MTA usually only creates
files and doesn't delete them, and the actual speed of syncing several
blocks at one time is not noticeably different from syncing only one.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 22:18             ` Andreas Dilger
@ 2001-08-02 23:11               ` Matthias Andree
       [not found]               ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
  1 sibling, 0 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-02 23:11 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Matthias Andree, Alexander Viro, Daniel Phillips,
	Stephen C. Tweedie, linux-kernel

On Thu, 02 Aug 2001, Andreas Dilger wrote:

> > open -> asynchronous, but filename synched on fsync()
> > rename/link/unlink(/symlink) -> synchronous
> > 
> > This way, you never need to fsync() the directory, so you never sync()
> > entries of temporary files. You never lose important files (because the
> > application uses fsync() and the OS synchs rename/link etc.).
> 
> Do you read what you are writing?  How can a "synchronous" operation for
> rename/link/unlink/symlink NOT also write out "temporary" files in the
> same directory?  How does calling fsync() on the directory IF YOU REQUIRE
> SYNCHRONOUS DIRECTORY OPERATIONS differ from making the specific operations
> synchronous from within the kernel???

Can people please try to understand? Can people please start to THINK
before flaming?

I did not say that open() is to be synchronous. I did not write ANYTHING
of fsync()ing directories, I'm trying to get rid of this requirement.

Thus, if the kernel does rename/link synchronously, you'd never ever
fsync() a directory. To synch a filename to disk, you'd just fsync() the
filedescriptor (with a SUS compliant system, that is, i. e. ext3 or
reiserfs, but not ext2).

Now, if someone opens a temporary file, and nukes it later -- unlink()
--, and doesn't want it visible, he never calls fsync() for the file.

However, if some other process then fsync()s the directory, you start
synching the temporary file dirent -> unnecessary, is nuked later on
with an unlink().

That's why fsync() on the directory is on no account the minimum work.

> The only difference I can see is that making these specific operations
> ALWAYS be synchronous hurts the common case when they can be async (see
> Solaris UFS vs. Linux benchmark elsewhere in this thread), while requiring
> an fsync() on the directory == only synchronous operation when it is
> actually needed, and no "extra" performance hit.

In case you haven't noticed, this is about reliability without need to
fsync() the directory that doesn't all belong to your single, stupid
process but may have lots of asynchronous data of other processes -
temporary files for instance. You synch() that as well, which is
unnecessary and brings down other processes' performance.

In case you haven't noticed the other issue:

The whole thread is a FEATURE REQUEST for a dirsync mount option, for
MTAs and other software which requires reliable file systems, where the
name is negotiable. It aims to REDUCE OVERHEAD, since chattr +S is the
only workaround for synchronous directories, and it makes file writes
synchronous as well, rendering things slower than necessary, since
write() can be buffered until you fsync() (and you want that to cut
down on seek times).

Call the option bsd_slow_dirs if you like, I don't care. Given the
option, the administrator/user has the choice, currently, he hasn't. He
cannot possibly change all applications ported from other Unices.

Note: hindering this option doesn't get Linux anywhere. Pure file
system benchmarks are not worth a single bit of entropy unless Linux is
benchmarked chattr +S -- it's unreliable otherwise.

I cannot remember how often I explained this during the course of this
thread. Every other day, some ignorant comes out of its cavern and
discusses the whole thing over and over again.

And, once again, fsync()ing the directory is not an option for portable
applications. It's unnecessary on every other system (until someone
shows a production-ready system which by default has asynchronous
directory updates as well, but no-one has so far.)

[parent not found: <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>]

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found]               ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
@ 2001-08-03  9:16                 ` Matthias Andree
  0 siblings, 0 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-03  9:16 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: linux-kernel

On Fri, 03 Aug 2001, Anton Altaparmakov wrote:

[dirsync chattr/mount options]
> Me neither. With regards to the parallel discussion on SUS compliance it is 
> probably a good idea to have such a thing in some form anyway (although if 
> I understood the discussion correctly, we really want this to happen by 
> default, not just when some flag is set but then again I never read the 
> standards...).

The standard doesn't really command the behaviour, as it seems, but we
might want to look again after SUS v3 has been released (supposed to
happen later this year) - the SUS compliance was rather on fsync than on
rename/link.

However, I'd rather not choose the default for somebody else, because he
may have different requirements, a compile-time switch to set the
default should be fine, THIS one might indeed default to dirsync/noasync
unless changed by make {x,menu,}config.

Assuming that the chattr +S is accompanied by a corresponding -o sync
mount option, I'd expect that the dirsync option be available as chattr
option and as mount option, and choosing default mount options should be
rather easy.

[parent not found: <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>]

[parent not found: <20010803021406.A9845@emma1.emma.line.org>]

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found]               ` <20010803021406.A9845@emma1.emma.line.org>
@ 2001-08-03 16:20                 ` Jan Harkes
  2001-08-03 22:48                 ` Andreas Dilger
  1 sibling, 0 replies; 34+ messages in thread
From: Jan Harkes @ 2001-08-03 16:20 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

On Fri, Aug 03, 2001 at 02:14:06AM +0200, Matthias Andree wrote:
> On Fri, 03 Aug 2001, Anton Altaparmakov wrote:
> > filedescriptor to be synced to disk, the ONLY possible way to do this is to 
> > sync the parent directory in order to commit the file name to disk. On some 
> 
> Do I really need to sync the WHOLE parent directory? Not just the
> relevant part? My directories hardly have only 1 disk block.

Only dirty blocks are written back to disk, i.e. the parts of the
directory that were modified by adding/removing names. It should be
pretty efficient.

> > to explicitly sync the directory filedescriptor afterwards.
> 
> Which is non-portable and will not be done by many application
> programmers which just use chattr +S instead (makes things S)afe and
> S)low) - and spoil performance that way since it makes not only
> directory writes synchronous, but file (data) writes as well.

"chattr +S" is about as portable as adding fsync(parent), or even worse
as it only works on an ext2 file system. So I'm assuming that this is
just a nice exercise in annoying people.

Jan


* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found]               ` <20010803021406.A9845@emma1.emma.line.org>
  2001-08-03 16:20                 ` Jan Harkes
@ 2001-08-03 22:48                 ` Andreas Dilger
  1 sibling, 0 replies; 34+ messages in thread
From: Andreas Dilger @ 2001-08-03 22:48 UTC (permalink / raw)
  To: Matthias Andree
  Cc: Anton Altaparmakov, Matthias Andree, Andreas Dilger,
	Alexander Viro, Daniel Phillips, Stephen C. Tweedie,
	linux-kernel



* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:37       ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
  2001-08-02 18:35         ` Alexander Viro
@ 2001-08-02 19:47         ` Bill Rugolsky Jr.
  2001-08-03 18:22           ` Matthias Andree
       [not found]         ` <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>
                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 34+ messages in thread
From: Bill Rugolsky Jr. @ 2001-08-02 19:47 UTC (permalink / raw)
  To: Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> The other thing is, that Linux is the only known system that does
> asynchronous rename/link/unlink/symlink -- people have claimed it might
> not be the only one, but failed to name systems.
> 
> So we need to assume that Linux is the only system that does
> asynchronous rename/link/unlink/symlink, however a directory fsync() is
> believed to be rather expensive.
> 
> Still, some people object to a dirsync mount option. But this has been
> the actual reason for the thread - MTA authors are refusing to pamper
> Linux and use chattr +S instead which gives unnecessary (premature) sync
> operations on write() - but MTAs know how to fsync().

Let's inject a little reality into this discussion.  Filesystems are used
for something other than running MTA's written by stubborn "purists".

Solaris: Dell 600 MHz PIII 128MB RAM, largely quiescent:
         Solaris 8 mu4, UFS with logging

Linux:   VA Linux 800 MHz PIII, 128MB RAM, largely quiescent
         RedHat Linux 7.1 w/ kernel-2.4.6-2.4 (2.4.6-ac5 + ext3-0.9.3).

660MB XFree86-4.1 build tree, cache primed with du -s in each case.

Here's something that we developers probably all do frequently: copy a
tree using hard links, so that we can patch it.

[solaris] find . | wc     
   33027   33027 1251671
[solaris] time find . -depth | cpio -pdul ../foo
0 blocks
 363.46s real    0.84s user   10.13s system 

Plain ext2:

[linux]# time find . -depth | cpio -pdul ../foo
0 blocks

real    0m3.823s user    0m0.240s sys     0m3.570s

Mounted ext3, ordered data mode.

[linux] time find . -depth | cpio -pdul ../foo
0 blocks

real    0m5.106s user    0m0.200s sys     0m3.700s

Mounted ext3, -o sync:

[root@ead51 bar]# time find . -depth | cpio -pdul ../foo
0 blocks

real    1m28.483s user    0m0.470s sys     0m4.410s 

=====================================================

Solaris8 UFS:   363.5 seconds
ext2:             3.8 seconds
ext3:             5.1 seconds
ext3 -o sync:    88.5 seconds

Got it?

Obviously, the last is the result of the poor interaction
of ext3+sync in 0.9.3, but Andrew Morton has already fixed that.
I will try again with 0.9.5 when I have a chance to upgrade that
machine.

I have no idea where BSD falls, but the basic point stands:  unused
features should not penalize other applications.  Andrew Morton has
figured out how to do this efficiently with ext3, and many kudos to him
for doing the work.  Absent that, why should I have to go get a cup of
coffee every time I want to patch a tree, just so some MTA can make
naive assumptions?

Regards,

   Bill Rugolsky

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 19:47         ` Bill Rugolsky Jr.
@ 2001-08-03 18:22           ` Matthias Andree
  0 siblings, 0 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-03 18:22 UTC (permalink / raw)
  To: Bill Rugolsky Jr.; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Thu, 02 Aug 2001, Bill Rugolsky Jr. wrote:

> I have no idea where BSD falls, but the basic point stands:  unused
> features should not penalize other applications.  Andrew Morton has
> figured out how to do this efficiently with ext3, and many kudos to him
> for doing the work.  Absent that, why should I have to go get a cup of
> coffee every time I want to patch a tree, just so some MTA can make
> naive assumptions?

The whole idea is to have a switch to turn on BSD-style synchronous
directory update semantics. Nothing more, nothing you would not be able
to get rid of. In fact, you can mount file systems async on BSD as
well, but you'd better not have the machine crash. Irrecoverable file
system damage can result. As a compromise, softupdates are nearly as
fast as async, but FS damage is guaranteed to be recoverable.

In either case (async or soft-updates), files can end up in lost+found
after the control had been returned to the application that called open
or link.

-- 
Matthias Andree

[parent not found: <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>]

[parent not found: <20010803021642.B9845@emma1.emma.line.org>]

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found]           ` <20010803021642.B9845@emma1.emma.line.org>
@ 2001-08-03  7:03             ` Eric W. Biederman
  2001-08-03  8:39               ` Matthias Andree
  0 siblings, 1 reply; 34+ messages in thread
From: Eric W. Biederman @ 2001-08-03  7:03 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Paul Jakma, linux-kernel

Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> On Fri, 03 Aug 2001, Paul Jakma wrote:
> 
> > if the prime directive of MTAs is data integrity paranoia, then
> > surely the best assumption for an MTA to make is that
> > rename/link/unlink/symlink /are/ asynchronous in the general case?
> 
> They do on Linux, use chattr +S, and are much slower than e. g. on
> FreeBSD. Well. Not that I'd written THAT for the first time...

Actually given that this thread keeps coming up, but no one does anything
about it.  I'm tempted to suggest we remove chattr +S support from ext2.
Then there will be enough pain that someone will fix the MTA instead of
moaning that kernel is slow...

That should be an easy patch to make...

Eric

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  7:03             ` Eric W. Biederman
@ 2001-08-03  8:39               ` Matthias Andree
  2001-08-03  9:57                 ` Christoph Hellwig
  2001-08-04  7:55                 ` Eric W. Biederman
  0 siblings, 2 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-03  8:39 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Matthias Andree, Paul Jakma, linux-kernel

On Fri, 03 Aug 2001, Eric W. Biederman wrote:

> Actually given that this thread keeps coming up, but no one does anything
> about it.  I'm tempted to suggest we remove chattr +S support from ext2.
> Then there will be enough pain that someone will fix the MTA instead of
> moaning that kernel is slow...

They'd just drop Linux from the list of supported OS's, Linux will
disappoint people who trusted it, nothing is gained. Deliberate breakage
will not happen, because it would not help anyone except people with
twisted minds.

NO-ONE, including you, has come up with SERIOUS objections against a
dirsync option, except "is it really so much slower than chattr +S? show
figures" -- ext3 is being tuned to be fast in spite of chattr +S.

Reconsider your position.

Stop trolling please.

-- 
Matthias Andree

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:39               ` Matthias Andree
@ 2001-08-03  9:57                 ` Christoph Hellwig
  2001-08-04  7:55                 ` Eric W. Biederman
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2001-08-03  9:57 UTC (permalink / raw)
  To: Matthias Andree
  Cc: Matthias Andree, Paul Jakma, linux-kernel, Eric W. Biederman

In article <20010803103954.A11584@emma1.emma.line.org> you wrote:
> They'd just drop Linux from the list of supported OS's, Linux will
> disappoint people who trusted it, nothing is gained. Deliberate breakage
> will not happen, because it would not help anyone except people with
> twisted minds.

Who cares?  There are more than enough sane mailers around.

> NO-ONE, including you, has come up with SERIOUS objections against a
> dirsync option, except "is it really so much slower than chattr +S? show
> figures" -- ext3 is being tuned to be fast in spite of chattr +S.

Talk is cheap.  Code up a non-invasive dirsync option and submit it to
Linus.  I don't see any reason why it won't be accepted..

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:39               ` Matthias Andree
  2001-08-03  9:57                 ` Christoph Hellwig
@ 2001-08-04  7:55                 ` Eric W. Biederman
  1 sibling, 0 replies; 34+ messages in thread
From: Eric W. Biederman @ 2001-08-04  7:55 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Paul Jakma, linux-kernel

Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> On Fri, 03 Aug 2001, Eric W. Biederman wrote:
> 
> > Actually given that this thread keeps coming up, but no one does anything
> > about it.  I'm tempted to suggest we remove chattr +S support from ext2.
> > Then there will be enough pain that someone will fix the MTA instead of
> > moaning that kernel is slow...
> 
> They'd just drop Linux from the list of supported OS's, Linux will
> disappoint people who trusted it, nothing is gained. Deliberate breakage
> will not happen, because it would not help anyone except people with
> twisted minds.

There are some other uses for fully synchronous disk accesses, so I'm
not going to run out and do it.  The point is that workarounds for
strange programs are not a right; they are a nice optional feature.

> NO-ONE, including you, has come up with SERIOUS objections against a
> dirsync option, except "is it really so much slower than chattr +S? show
> figures" -- ext3 is being tuned to be fast in spite of chattr +S.

Clear objections to dirsync.  
- Extra code maintenance, makes the fs less reliable 
  (A reason for removing even synchronous fs operations BTW).
- Unnecessary. fsync(dir) works today.
- dirsync is unlikely to be faster than fsync(file); fsync(dir) [not chattr +S]
  You really need something that can say "remember these 5 syscalls,
  and sync all their changes to disk together" to really get an
  improvement in sync speed.
- I don't see anyone volunteering to write the code.

> Reconsider your position.

Nope.  Right now I would rather
a) Patch the mail programs to do the needed fsync(dir)
b) Totally remove synchronous disk updates from my OS, and
   make life really painful.
Before adding a dirsync option.

> Stop trolling please.

I wasn't trying to troll, just to get this conversation onto productive
ground.  I think supporting the MTAs is good, so long as it is a two
way relationship.

If someone went out and tried using fsync(dir), and then said it
sucked, we could definitely have more peace.

Using dirsync and chattr +S hides the real problem that needs to get
fixed: getting a good, reliable, high-performance way to commit
actions to a filesystem.  

We already have one work around on linux that will work reliably.  So
now let's see if we can get a functional high performance solution.

And oh btw, new, functional high performance solutions are not
portable because they haven't been implemented in every operating
system.  Full understanding of the problem and of the solutions is too
new for the implementations to have gotten around.

Eric

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:37       ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
                           ` (2 preceding siblings ...)
       [not found]         ` <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>
@ 2001-08-03  8:30         ` Stephen C. Tweedie
  2001-08-03 18:28           ` Matthias Andree
  2001-08-03  8:50         ` David Weinehall
  4 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 2001-08-03  8:30 UTC (permalink / raw)
  To: Daniel Phillips, Stephen C. Tweedie, linux-kernel

Hi,

On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:

> So this part is covered.
> 
> The other thing is, that Linux is the only known system that does
> asynchronous rename/link/unlink/symlink -- people have claimed it might
> not be the only one, but failed to name systems.

Not true.  There are tons of others.

The issue was that synchronous directory updates are *optional* on
many systems (Linux included), but that Linux's support for that is
really inefficient since it ends up syncing file metadata updates too
(and it's much more efficient to use fsync for that.)

> Still, some people object to a dirsync mount option.

Who?  People who have discussed this in the past have certainly not
objected to my knowledge.  It would clearly help situations like this
(as would a dirsync chattr option.)

> > The prescription for symlinks is, if you want them safely on disk you 
> > have to explicitly fsync the containing directory.
> 
> Yes, and it doesn't matter, since MTAs don't use symlinks (symlinks
> waste inodes on most systems).

Irrelevant.   We're talking about what makes sensible semantics, not
what assumptions any specific application makes.  It makes no sense to
say that dirsync won't affect symlinks just because some existing
applications don't rely on that!

Cheers,
 Stephen

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:30         ` Stephen C. Tweedie
@ 2001-08-03 18:28           ` Matthias Andree
  0 siblings, 0 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-03 18:28 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Daniel Phillips, linux-kernel

On Fri, 03 Aug 2001, Stephen Tweedie wrote:

> > > The prescription for symlinks is, if you want them safely on disk you 
> > > have to explicitly fsync the containing directory.
> > 
> > Yes, and it doesn't matter, since MTAs don't use symlinks (symlinks
> > waste inodes on most systems).
> 
> Irrelevant.   We're talking about what makes sensible semantics, not
> what assumptions any specific application makes.  It makes no sense to
> say that dirsync won't affect symlinks just because some existing
> applications don't rely on that!

It's rather my imagination that tracking hard links might be easier than
symlinks because hard links share the inode number. A more advanced (and
complex) implementation might prove the imagination wrong. I don't want
to consider which one is more efficient.

-- 
Matthias Andree

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:37       ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
                           ` (3 preceding siblings ...)
  2001-08-03  8:30         ` Stephen C. Tweedie
@ 2001-08-03  8:50         ` David Weinehall
  2001-08-03 18:31           ` Matthias Andree
  2001-08-03 19:59           ` Albert D. Cahalan
  4 siblings, 2 replies; 34+ messages in thread
From: David Weinehall @ 2001-08-03  8:50 UTC (permalink / raw)
  To: Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> On Thu, 02 Aug 2001, Daniel Phillips wrote:
> 
> [file name must be flushed on fsync()]
> > I don't know why it is hard or inefficient to implement this at the VFS 
> > level, though I'm sure there is a reason or this thread wouldn't 
> > exist.  Stephen, perhaps you could explain for the record why sys_fsync 
> > can't just walk the chain of dentry parent links doing fdatasync?  Does 
> > this create VFS or Ext3 locking problems?  Or maybe it repeats work 
> > that Ext3 is already supposed to have done?
> 
> Well, the course was that I asked whether ext3 would do synchronous
> directory updates, and some people jumped in and said that one should
> fsync() the parent directory, however, since we figure from SUS, that's
> invalid.
> 
> After some back and forth, we finally figured that at least ext2 is
> implementing fsync() improperly.
> 
> So this part is covered.

Yup, and this should be fixed imho.

> The other thing is, that Linux is the only known system that does
> asynchronous rename/link/unlink/symlink -- people have claimed it might
> not be the only one, but failed to name systems.

And this is a feature, not a bug.

> So we need to assume that Linux is the only system that does
> asynchronous rename/link/unlink/symlink, however a directory fsync() is
> believed to be rather expensive.

A directory fsync() might be expensive on non-Linux filesystems...

> Still, some people object to a dirsync mount option. But this has been
> the actual reason for the thread - MTA authors are refusing to pamper
> Linux and use chattr +S instead which gives unnecessary (premature) sync
> operations on write() - but MTAs know how to fsync().

So what you mean is that MTA authors refuse to pamper Linux through use
of fsync of the directory, but are willing to "pamper" Linux through use
of chattr +S?! This seems ridiculous.  It seems equally ridiculous to
demand that Linux should pamper MTA authors that can't implement
fsync on the directory instead of writing BSD-specific code.
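
For concreteness, the "fsync on the directory" approach might look
roughly like the sketch below. It assumes Linux semantics, where
fsync() on a directory file descriptor flushes pending entry changes.
The function name deliver() and the ".tmp" naming scheme are made up
for illustration and are not taken from any MTA.

```python
import os

def deliver(spool_dir, name, data):
    """Store data as spool_dir/name so that contents and name are on disk."""
    tmp_path = os.path.join(spool_dir, name + ".tmp")  # hypothetical naming
    final_path = os.path.join(spool_dir, name)

    # 1. Write the message and flush its contents.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may be a short write
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)

    # 2. Give the file its final name (not synchronous on Linux).
    os.rename(tmp_path, final_path)

    # 3. Flush the new directory entry by fsync()ing the directory itself.
    dir_fd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return final_path
```

Unlike chattr +S, this costs one extra fsync() per delivery rather than
a sync on every write().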

[snip]

To me this seems mostly like a way of saying "Hey, we've finally found
a way to make Linux look really bad compared to BSD-systems; let's
complain instead of writing alternative code that suits Linux systems
better than this code does." A lot like all the discussions on threads,
really.

Then again, I'm probably just extra grouchy today because it rained when
I rode my bike to work.


/David Weinehall
  _                                                                 _
 // David Weinehall <tao@acc.umu.se> /> Northern lights wander      \\
//  Project MCA Linux hacker        //  Dance across the winter sky //
\>  http://www.acc.umu.se/~tao/    </   Full colour fire           </


* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:50         ` David Weinehall
@ 2001-08-03 18:31           ` Matthias Andree
  2001-08-03 19:59           ` Albert D. Cahalan
  1 sibling, 0 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-03 18:31 UTC (permalink / raw)
  To: David Weinehall; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Fri, 03 Aug 2001, David Weinehall wrote:

> On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> > Still, some people object to a dirsync mount option. But this has been
> > the actual reason for the thread - MTA authors are refusing to pamper
> > Linux and use chattr +S instead which gives unnecessary (premature) sync
> > operations on write() - but MTAs know how to fsync().
> 
> So what you mean is that MTA authors refuse to pamper Linux through use
> of fsync of the directory, but are willing to "pamper" Linux through use
> of chattr +S?! This seems ridiculous.  It seems equally ridiculous to
> demand that Linux should pamper MTA authors that can't implement
> fsync on the directory instead of writing BSD-specific code.

It's a maintenance issue.

You effectively start wrapping up all relevant syscalls and have
system-specific interfaces. One wants the directory fsync()ed, the other
offers some other special trick to get the data flushed... what use is
portability then if systems are so different?
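
A sketch of the kind of wrapper this leads to, assuming an application
that wants rename() to be durable everywhere. The helper names
sync_directory() and durable_rename() are hypothetical, and the
assumption that only Linux needs the extra directory fsync() reflects
the claims made in this thread, not a verified survey of systems.

```python
import os
import sys

def sync_directory(path):
    """fsync() a directory so pending entry changes reach the disk."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def durable_rename(src, dst):
    """Rename src to dst and return only once the new name is on disk."""
    os.rename(src, dst)
    if sys.platform.startswith("linux"):
        # Linux: the rename is asynchronous, so sync both directories
        # whose entries changed (they may be the same directory).
        dirs = {os.path.dirname(os.path.abspath(src)),
                os.path.dirname(os.path.abspath(dst))}
        for d in dirs:
            sync_directory(d)
    # Elsewhere (assumed): rename() is already synchronous.
```

Every further system with its own rules adds another branch here, and
every such syscall needs its own wrapper.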

> To me this seems mostly like a way of saying "Hey, we've finally found
> a way to make Linux look really bad compared to BSD-systems; let's

No wonder if the application chooses fully-synchronous operation on
Linux.

-- 
Matthias Andree


* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:50         ` David Weinehall
  2001-08-03 18:31           ` Matthias Andree
@ 2001-08-03 19:59           ` Albert D. Cahalan
  2001-08-03 19:54             ` Gregory Maxwell
  1 sibling, 1 reply; 34+ messages in thread
From: Albert D. Cahalan @ 2001-08-03 19:59 UTC (permalink / raw)
  To: David Weinehall; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel

David Weinehall writes:
> On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:

>> Still, some people object to a dirsync mount option. But this has been
>> the actual reason for the thread - MTA authors are refusing to pamper
>> Linux and use chattr +S instead which gives unnecessary (premature) sync
>> operations on write() - but MTAs know how to fsync().
>
> So what you mean is that MTA authors refuse to pamper Linux through use
> of fsync of the directory, but are willing to "pamper" Linux through use
> of chattr +S?! This seems ridiculous.  It seems equally ridiculous to
> demand that Linux should pamper MTA authors that can't implement
> fsync on the directory instead of writing BSD-specific code.
>
> [snip]
>
> To me this seems mostly like a way of saying "Hey, we've finally found
> a way to make Linux look really bad compared to BSD-systems; let's
> complain instead of writing alternative code that suits Linux systems
> better than this code does." A lot like all the discussions on threads,
> really.

This is just completely true. One wonders why we seem to enjoy
getting screwed this way. We shouldn't be patching these MTAs or
hacking Linux to act like BSD. We should be avoiding these MTAs.

Somebody can create a big MTA list, listing the good and bad ones.
Then we get the Linux-hostile MTAs out of the Linux distributions,
demanding compliance like we do for filesystem layout. We also hunt
down Linux-related web pages that mention these MTAs and get the
pages changed or removed. The point is to make these MTAs just
disappear, never to be seen again. Nice MTAs get promoted.




* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 19:59           ` Albert D. Cahalan
@ 2001-08-03 19:54             ` Gregory Maxwell
  0 siblings, 0 replies; 34+ messages in thread
From: Gregory Maxwell @ 2001-08-03 19:54 UTC (permalink / raw)
  To: Albert D. Cahalan
  Cc: David Weinehall, Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Fri, Aug 03, 2001 at 03:59:02PM -0400, Albert D. Cahalan wrote:
[snip]
> Somebody can create a big MTA list, listing the good and bad ones.
> Then we get the Linux-hostile MTAs out of the Linux distributions,
> demanding compliance like we do for filesystem layout. We also hunt
> down Linux-related web pages that mention these MTAs and get the
> pages changed or removed. The point is to make these MTAs just
> disappear, never to be seen again. Nice MTAs get promoted.

Think we could just get their authors to 'disappear'? It might be more cost
effective, and I can think of at least one example where removing the author
would have other benefits beyond MTAs. :) :)


end of thread, other threads:[~2001-08-07 17:01 UTC | newest]

Thread overview: 34+ messages
2001-08-03 20:25 intermediate summary of ext3-2.4-0.9.4 thread Sam James
     [not found] <0108030507330F.00440@starship>
     [not found] ` <Pine.GSO.4.21.0108022312211.1494-100000@weyl.math.psu.edu>
2001-08-03 13:09   ` Daniel Phillips
2001-08-03 14:43     ` Horst von Brand
2001-08-03 17:49     ` Mike Castle
2001-08-04  3:23       ` Daniel Phillips
2001-08-03 18:08     ` Alexander Viro
2001-08-03 18:26       ` Daniel Phillips
2001-08-03 18:53         ` Alexander Viro
2001-08-03 20:50           ` Daniel Phillips
2001-08-04  3:43           ` Matthias Andree
2001-08-03 18:34       ` Linus Torvalds
2001-08-03 18:36   ` Matthias Andree
2001-08-03 19:16     ` Alexander Viro
     [not found] <0108030354130E.00440@starship>
     [not found] ` <200108030207.f7326OpR003086@sleipnir.valparaiso.cl>
2001-08-03 18:34   ` Matthias Andree
  -- strict thread matches above, loose matches on Subject: below --
2001-07-26  7:34 ext3-2.4-0.9.4 Andrew Morton
2001-08-01 16:02 ` ext3-2.4-0.9.4 Stephen C. Tweedie
2001-08-02  9:03   ` ext3-2.4-0.9.4 Matthias Andree
2001-08-02 17:26     ` ext3-2.4-0.9.4 Daniel Phillips
2001-08-02 17:37       ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
2001-08-02 18:35         ` Alexander Viro
2001-08-02 18:47           ` Matthias Andree
2001-08-02 22:18             ` Andreas Dilger
2001-08-02 23:11               ` Matthias Andree
     [not found]               ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
2001-08-03  9:16                 ` Matthias Andree
     [not found]             ` <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>
     [not found]               ` <20010803021406.A9845@emma1.emma.line.org>
2001-08-03 16:20                 ` Jan Harkes
2001-08-03 22:48                 ` Andreas Dilger
2001-08-02 19:47         ` Bill Rugolsky Jr.
2001-08-03 18:22           ` Matthias Andree
     [not found]         ` <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>
     [not found]           ` <20010803021642.B9845@emma1.emma.line.org>
2001-08-03  7:03             ` Eric W. Biederman
2001-08-03  8:39               ` Matthias Andree
2001-08-03  9:57                 ` Christoph Hellwig
2001-08-04  7:55                 ` Eric W. Biederman
2001-08-03  8:30         ` Stephen C. Tweedie
2001-08-03 18:28           ` Matthias Andree
2001-08-03  8:50         ` David Weinehall
2001-08-03 18:31           ` Matthias Andree
2001-08-03 19:59           ` Albert D. Cahalan
2001-08-03 19:54             ` Gregory Maxwell
