* RE: intermediate summary of ext3-2.4-0.9.4 thread
@ 2001-08-03 20:25 Sam James
0 siblings, 0 replies; 34+ messages in thread
From: Sam James @ 2001-08-03 20:25 UTC (permalink / raw)
To: Albert D. Cahalan, tao; +Cc: phillips, sct, linux-kernel
>
>This is just completely true. One wonders why we seem to enjoy
>getting screwed this way. We shouldn't be patching these MTAs or
>hacking Linux to act like BSD. We should be avoiding these MTAs.
>
>Somebody can create a big MTA list, listing the good and bad ones.
>Then we get the Linux-hostile MTAs out of the Linux distributions,
>demanding compliance like we do for filesystem layout. We also hunt
>down Linux-related web pages that mention these MTAs and get the
>pages changed or removed. The point is to make these MTAs just
>disappear, never to be seen again. Nice MTAs get promoted.
You're not related to Bill Gates, are you?
^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <0108030507330F.00440@starship>]
[parent not found: <0108030354130E.00440@starship>]
* ext3-2.4-0.9.4
@ 2001-07-26 7:34 Andrew Morton
2001-08-01 16:02 ` ext3-2.4-0.9.4 Stephen C. Tweedie
0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2001-07-26 7:34 UTC (permalink / raw)
To: lkml, ext3-users
An update to the ext3 filesystem for 2.4 kernels is available at
http://www.uow.edu.au/~andrewm/linux/ext3/
The diffs are against linux-2.4.7 and linux-2.4.6-ac5.
The changelog is there. One rarely-occurring but oopsable bug
was fixed and several quite significant performance enhancements
have been made. These are in addition to the performance fixes
which went into 0.9.3.
Ted has put out a prerelease of e2fsprogs-1.23 which supports
filesystem type `auto' in /etc/fstab, so it is now possible to
switch between ext3- and non-ext3-kernels without changing
any configuration.
It is recommended that users of earlier ext3 releases upgrade
to 0.9.4.
For people who are undertaking performance testing, it is perhaps
useful to point out that ext3 operates in one of three different
journalling modes, and that these modes have very different
functionality and very different performance characteristics.
Really, you need to test all three and balance the functionality
which each mode offers against the throughput which you obtain
in your application.
The modes are:

  data=writeback

    This is classic metadata-only journalling. File data is written
    back to the main fs lazily. After a crash+recovery the fs's
    structural integrity is preserved, but the *contents* of files
    can and will contain old, stale data. Potentially hundreds of
    megabytes of it.

    This is the fastest mode for normal filesystem applications.

  data=ordered

    The fs ensures that file data is written into the main fs prior
    to committing its metadata. Hence after a crash+recovery, your
    files will contain the correct data.

    This is the default operating mode and throughput is good. It
    adds about one second to a four minute kernel compile when
    compared with ext2. Under heavier loads the difference
    becomes larger.

  data=journal

    All data (as well as metadata) is written to the journal
    before it is released to the main fs for writeback.

    This is a specialised mode - for normal fs usage you're better
    off using ordered data, which has the same benefits of not corrupting
    data after crash+recovery. However for applications which require
    synchronous operation such as mail spools and synchronously exported
    NFS servers, this can be a performance win. I have seen dbench
    figures in this mode (where the files were opened O_SYNC) running
    at ten times the throughput of ext2. Not that this is the expected
    benefit for other applications!

Looking at the above issues, one may initially think that the
post-recovery data corruption is a serious issue with writeback mode,
and that there are big advantages to using journalled or ordered data.
However, even in these modes the affected files may be shorter-than-expected
after recovery, because the app hadn't finished writing them yet. And
usually, a truncated file is just as useless as one which contains
garbage - it needs to be deleted.
It's not really as simple as that - for small (< a few hundred k) files,
it tends to be the case that either the whole file is intact after a crash,
or none of it is. This is because the journalling mechanism starts a
new transaction every five seconds, and a typical open/write/close operation
usually fits entirely inside this window.
There is also a security issue to be considered: a recovered writeback-mode
filesystem will expose other people's old data to unintended recipients.
Hopefully this description will help people make their deployment choices.
If not, assistance is available on the ext3-users@redhat.com mailing list.
* Re: ext3-2.4-0.9.4
@ 2001-08-01 16:02 ` Stephen C. Tweedie
2001-08-02 9:03 ` ext3-2.4-0.9.4 Matthias Andree
0 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 2001-08-01 16:02 UTC (permalink / raw)
To: Linus Torvalds, linux-kernel; +Cc: Stephen Tweedie, Matthias Andree
Hi,
> Chase up to the root manually, because Linux' ext2 violates SUS v2
> fsync() (which requires meta data synched BTW)
Please quote chapter and verse --- my reading of SUS shows no such
requirement.
fsync is required to force "all currently queued I/O operations
associated with the file indicated by file descriptor fildes to the
synchronised I/O completion state." But as you should know, directory
entries and files are NOT the same thing in Unix/SUS.
Are we expected to fsync the metadata belonging to just the file
itself? Or all symlinks to the file? Or all hard links? Answer, as
best I can determine --- just the file. That's all SUS talks about.
There can be many ways of reaching that file in the directory
hierarchy, or there can be none, but fsync() doesn't talk at all about
the status of those dirents after the sync.
> , as has been pointed out
> (and fixed in ReiserFS and ext3)?
ext3 happens to provide the guarantee, but that's coincidental and
does not imply that I think of it as being "fixed". It's just changed
behaviour relative to ext2.
> So, please tell my why Single Unix Specification v2 specifies EIO for
> rename. Asynchronous I/O cannot possibly trigger immediate EIO.
Yes it can --- we may need to read metadata to complete the rename,
and such reads can fail.
Cheers,
Stephen
* Re: ext3-2.4-0.9.4
2001-08-01 16:02 ` ext3-2.4-0.9.4 Stephen C. Tweedie
@ 2001-08-02 9:03 ` Matthias Andree
2001-08-02 17:26 ` ext3-2.4-0.9.4 Daniel Phillips
0 siblings, 1 reply; 34+ messages in thread
From: Matthias Andree @ 2001-08-02 9:03 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: linux-kernel
On Wed, 01 Aug 2001, Stephen Tweedie wrote:
> > Chase up to the root manually, because Linux' ext2 violates SUS v2
> > fsync() (which requires meta data synched BTW)
>
> Please quote chapter and verse --- my reading of SUS shows no such
> requirement.
>
> fsync is required to force "all currently queued I/O operations
> associated with the file indicated by file descriptor fildes to the
> synchronised I/O completion state." But as you should know, directory
> entries and files are NOT the same thing in Unix/SUS.
Read on: "All I/O operations are completed as defined for synchronised
I/O _file_ integrity completion.". To show what that means, see the
glossary.
http://www.opengroup.org/onlinepubs/007908799/xbd/glossary.html#tag_004_000_291
"synchronised I/O data integrity completion
[...]
* For write, when the operation has been completed or diagnosed if
unsuccessful. The write is complete only when the data specified in
the write request is successfully transferred and all file system
information required to retrieve the data is successfully transferred.
File attributes that are not necessary for data retrieval (access
time, modification time, status change time) need not be successfully
transferred prior to returning to the calling process.
synchronised I/O file integrity completion
Identical to a synchronised I/O data integrity completion with the
addition that all file attributes relative to the I/O operation
(including access time, modification time, status change time) will be
successfully transferred prior to returning to the calling process."
As I understand it, the directory entry's st_ino is a file attribute
necessary for data retrieval and also contains the m/a/ctime, so it must
be flushed to disk on fsync() as well.
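For readers following the two glossary terms quoted above: POSIX specifies
fdatasync() in terms of synchronised I/O data integrity completion and
fsync() in terms of file integrity completion. A minimal Python sketch of
the two calls follows. The helper name write_durably and its data_only
parameter are made up for illustration, and the sketch takes no position on
whether either call covers the directory entry, which is the point under
dispute here.

```python
import os
import tempfile


def write_durably(path, payload, data_only=False):
    """Write payload (bytes) to path and flush it before returning.

    data_only=True uses fdatasync() (data integrity completion) where the
    platform provides it; otherwise fsync() (file integrity completion,
    which also covers attributes such as timestamps).
    Returns the number of bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(payload):
            # os.write() may write fewer bytes than requested.
            written += os.write(fd, payload[written:])
        if data_only and hasattr(os, "fdatasync"):
            os.fdatasync(fd)
        else:
            os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_durably(os.path.join(d, "msg"), b"hello"))
```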
> There can be many ways of reaching that file in the directory
> hierarchy, or there can be none, but fsync() doesn't talk at all about
> the status of those dirents after the sync.
Well, if there's not a single dirent, you cannot retrieve the data, so
I'd assume at least one dirent needs to be flushed as well. If there's a
simple way to get unflushed dentries to disk (hard links included),
flush them. Not sure about symlinks, but since they don't share the
inode number, that might be rather difficult for the kernel (I didn't
check):
touch 1 ; ln 1 2 ; ln -s 1 3 ; ls -li
303464 -rw-r--r-- 2 emma users 0 Aug 2 10:56 1
303464 -rw-r--r-- 2 emma users 0 Aug 2 10:56 2
303466 lrwxrwxrwx 1 emma users 1 Aug 2 10:56 3 -> 1
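The same observation can be reproduced programmatically. The sketch below
is only an illustration (the function name inode_numbers is invented): it
repeats the touch/ln/ln -s sequence in a scratch directory and returns the
three inode numbers, using lstat() so that the symlink itself is examined
rather than its target.

```python
import os
import tempfile


def inode_numbers(directory):
    """Create file "1", hard link "2" and symlink "3" in directory.

    Returns a tuple with the inode numbers of "1", "2" and "3".
    """
    one = os.path.join(directory, "1")
    two = os.path.join(directory, "2")
    three = os.path.join(directory, "3")
    with open(one, "w"):
        pass                    # touch 1
    os.link(one, two)           # ln 1 2
    os.symlink("1", three)      # ln -s 1 3
    # lstat() reports on the symlink itself instead of following it.
    return tuple(os.lstat(p).st_ino for p in (one, two, three))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(inode_numbers(d))
```

The first two numbers are expected to match and the third to differ, as in
the ls -li listing above.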
--
Matthias Andree
* Re: ext3-2.4-0.9.4
2001-08-02 9:03 ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-08-02 17:26 ` Daniel Phillips
2001-08-02 17:37 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
0 siblings, 1 reply; 34+ messages in thread
From: Daniel Phillips @ 2001-08-02 17:26 UTC (permalink / raw)
To: Matthias Andree, Stephen C. Tweedie; +Cc: linux-kernel
On Thursday 02 August 2001 11:03, Matthias Andree wrote:
> On Wed, 01 Aug 2001, Stephen Tweedie wrote:
> > Matthias Andree wrote:
> > > Chase up to the root manually, because Linux' ext2 violates SUS
> > > v2 fsync() (which requires meta data synched BTW)
> >
> > Please quote chapter and verse --- my reading of SUS shows no such
> > requirement.
> >
> > fsync is required to force "all currently queued I/O operations
> > associated with the file indicated by file descriptor fildes to the
> > synchronised I/O completion state." But as you should know,
> > directory entries and files are NOT the same thing in Unix/SUS.
>
> Read on: "All I/O operations are completed as defined for
> synchronised I/O _file_ integrity completion.". To show what that
> means, see the glossary.
>
> http://www.opengroup.org/onlinepubs/007908799/xbd/glossary.html#tag_004_000_291
>
> "synchronised I/O data integrity completion
>
> [...]
>
> * For write, when the operation has been completed or diagnosed if
> unsuccessful. The write is complete only when the data specified
> in the write request is successfully transferred and all file system
> information required to retrieve the data is successfully
> transferred.
>
> File attributes that are not necessary for data retrieval (access
> time, modification time, status change time) need not be
> successfully transferred prior to returning to the calling process.
>
> synchronised I/O file integrity completion
>
> Identical to a synchronised I/O data integrity completion with the
> addition that all file attributes relative to the I/O operation
> (including access time, modification time, status change time) will
> be successfully transferred prior to returning to the calling
> process."
>
> As I understand it, the directory entry's st_ino is a file attribute
> necessary for data retrieval and also contains the m/a/ctime, so it
> must be flushed to disk on fsync() as well.
I believe you've summarized the SUS requirements very well. Apart
from legalistic arguments, SUS quite clearly states that fsync should
not return until you are sure of having recorded not only the file's
data, but the access path to it. I interpret this as being able to
"access the file by its name", and being able to guess by looking in
lost+found doesn't count. I don't see the point in niggling about that.
So, it seems clear that an fsync which leaves any window of
vulnerability where an interruption can leave a file unlinked is not
SUS-compliant.
> > There can be many ways of reaching that file in the directory
> > hierarchy, or there can be none, but fsync() doesn't talk at all
> > about the status of those dirents after the sync.
This is a legalistic argument. I don't think we should be looking for
loopholes in SUS here. To achieve SUS compliance there are two
reasonable courses: "fix SUS" or "fix sys_fsync". Since what SUS
clearly wants here seems eminently reasonable, I'd suggest putting the
energy that's currently going into this thread into fixing fsync
instead.
> Well, if there's not a single dirent, you cannot retrieve the data,
> so I'd assume at least one dirent needs to be flushed as well. If
> there's a simple way to get unflushed dentries to disk (hard links
> included)...
*All* hard links? No, there is no general way to do that. However,
any hard links[1] in the path used to open the file - yes. There is
always a chain of parent dentries held locked in the dcache for any
open file.
I don't know why it is hard or inefficient to implement this at the VFS
level, though I'm sure there is a reason or this thread wouldn't
exist. Stephen, perhaps you could explain for the record why sys_fsync
can't just walk the chain of dentry parent links doing fdatasync? Does
this create VFS or Ext3 locking problems? Or maybe it repeats work
that Ext3 is already supposed to have done?
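For comparison, the "chase up to the root manually" workaround mentioned
earlier in the thread can be approximated from user space. The following
Python sketch is an assumption-laden illustration, not the in-kernel dentry
walk asked about above: it works on path names, assumes a Linux-like system
where a directory can be opened read-only and passed to fsync(), and the
name fsync_with_parents and its stop_at parameter are hypothetical.

```python
import os
import tempfile


def fsync_with_parents(path, stop_at="/"):
    """fsync() a file, then each ancestor directory up to stop_at.

    Returns the list of paths that were synced, in order.
    """
    current = os.path.abspath(path)
    stop_at = os.path.abspath(stop_at)
    synced = []
    while True:
        # Assumption: O_RDONLY is enough to fsync() files and directories,
        # as it is on Linux.
        fd = os.open(current, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
        synced.append(current)
        parent = os.path.dirname(current)
        # Stop at the requested directory, or at the filesystem root.
        if current == stop_at or parent == current:
            break
        current = parent
    return synced


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "msg")
        with open(target, "w") as fh:
            fh.write("x")
        print(fsync_with_parents(target, stop_at=d))
```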
> ...flush them. Not sure about symlinks, but since they don't
> share the inode number, that might be rather difficult for the kernel
> (I didn't check)
The prescription for symlinks is, if you want them safely on disk you
have to explicitly fsync the containing directory.
[1] In Ext2, all filename dirents are "hard links", i.e., there is no
way to tell which of the two names is the original after creating a new
hard link.
--
Daniel
* intermediate summary of ext3-2.4-0.9.4 thread
2001-08-02 17:26 ` ext3-2.4-0.9.4 Daniel Phillips
@ 2001-08-02 17:37 ` Matthias Andree
2001-08-02 18:35 ` Alexander Viro
` (4 more replies)
0 siblings, 5 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-02 17:37 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Stephen C. Tweedie, linux-kernel
On Thu, 02 Aug 2001, Daniel Phillips wrote:
[file name must be flushed on fsync()]
> I don't know why it is hard or inefficient to implement this at the VFS
> level, though I'm sure there is a reason or this thread wouldn't
> exist. Stephen, perhaps you could explain for the record why sys_fsync
> can't just walk the chain of dentry parent links doing fdatasync? Does
> this create VFS or Ext3 locking problems? Or maybe it repeats work
> that Ext3 is already supposed to have done?
Well, the course was that I asked whether ext3 would do synchronous
directory updates, and some people jumped in and said that one should
fsync() the parent directory; however, as far as we can figure from SUS,
that requirement is invalid.
After some back and forth, we finally figured that at least ext2 is
implementing fsync() improperly.
So this part is covered.
The other thing is, that Linux is the only known system that does
asynchronous rename/link/unlink/symlink -- people have claimed it might
not be the only one, but failed to name systems.
So we need to assume that Linux is the only system that does
asynchronous rename/link/unlink/symlink, however a directory fsync() is
believed to be rather expensive.
Still, some people object to a dirsync mount option. But this has been
the actual reason for the thread - MTA authors are refusing to pamper
Linux and use chattr +S instead which gives unnecessary (premature) sync
operations on write() - but MTAs know how to fsync().
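To make the disputed sequence concrete, here is a minimal Python sketch of
a spool delivery as it would have to look under the "fsync() the directory"
advice. It is an illustration under assumptions, not code from any real
MTA: the function name deliver, the "tmp." prefix and the file mode are
invented, and it assumes a Linux-like system where a directory opened
read-only can be passed to fsync().

```python
import os
import tempfile


def deliver(spool_dir, name, payload):
    """Store payload (bytes) as spool_dir/name; sync the data and the name.

    Returns the final path.
    """
    tmp_path = os.path.join(spool_dir, "tmp." + name)
    final_path = os.path.join(spool_dir, name)

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        written = 0
        while written < len(payload):
            written += os.write(fd, payload[written:])
        os.fsync(fd)                  # flush file data and inode
    finally:
        os.close(fd)

    # With asynchronous directory updates, the new name may not be on
    # disk yet when rename() returns.
    os.rename(tmp_path, final_path)

    # The extra, Linux-specific step discussed in this thread.
    dir_fd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(deliver(d, "msg1", b"Subject: test\n"))
```

With synchronous rename/link semantics, the final directory fsync() would
be unnecessary; that is the saving a dirsync option is meant to offer.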
> The prescription for symlinks is, if you want them safely on disk you
> have to explicitly fsync the containing directory.
Yes, and it doesn't matter, since MTAs don't use symlinks (symlinks
waste inodes on most systems).
--
Matthias Andree
* Re: intermediate summary of ext3-2.4-0.9.4 thread
2001-08-02 17:37 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
@ 2001-08-02 18:35 ` Alexander Viro
2001-08-02 18:47 ` Matthias Andree
2001-08-02 19:47 ` Bill Rugolsky Jr.
` (3 subsequent siblings)
4 siblings, 1 reply; 34+ messages in thread
From: Alexander Viro @ 2001-08-02 18:35 UTC (permalink / raw)
To: Matthias Andree; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel
On Thu, 2 Aug 2001, Matthias Andree wrote:
> asynchronous rename/link/unlink/symlink, however a directory fsync() is
> believed to be rather expensive.
How the fuck it's expensive? It does _exactly_ the same as file fsync() -
literally the same code. It doesn't write blocks that don't belong to
directory. It doesn't write blocks that are clean. IOW, it does the
minimal work possible.
* Re: intermediate summary of ext3-2.4-0.9.4 thread
2001-08-02 18:35 ` Alexander Viro
@ 2001-08-02 18:47 ` Matthias Andree
2001-08-02 22:18 ` Andreas Dilger
[not found] ` <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>
0 siblings, 2 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-02 18:47 UTC (permalink / raw)
To: Alexander Viro
Cc: Matthias Andree, Daniel Phillips, Stephen C. Tweedie, linux-kernel
On Thu, 02 Aug 2001, Alexander Viro wrote:
> How the fuck it's expensive? It does _exactly_ the same as file fsync() -
> literally the same code. It doesn't write blocks that don't belong to
> directory. It doesn't write blocks that are clean. IOW, it does the
> minimal work possible.
fsync()ing the dir is not the minimal work possible, if e. g. temporary
files are open that don't need their names synched. Fsync()ing the
directory syncs also these temporary file NAMES that other processes may
have open (but that they unlink rather than fsync()).
Assume:
open -> asynchronous, but filename synched on fsync()
rename/link/unlink(/symlink) -> synchronous
This way, you never need to fsync() the directory, so you never sync()
entries of temporary files. You never lose important files (because the
application uses fsync() and the OS synchs rename/link etc.).
--
Matthias Andree
* Re: intermediate summary of ext3-2.4-0.9.4 thread
2001-08-02 18:47 ` Matthias Andree
@ 2001-08-02 22:18 ` Andreas Dilger
2001-08-02 23:11 ` Matthias Andree
[not found] ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
[not found] ` <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>
1 sibling, 2 replies; 34+ messages in thread
From: Andreas Dilger @ 2001-08-02 22:18 UTC (permalink / raw)
To: Matthias Andree
Cc: Alexander Viro, Daniel Phillips, Stephen C. Tweedie, linux-kernel
Matthias Andree writes:
> fsync()ing the dir is not the minimal work possible, if e. g. temporary
> files are open that don't need their names synched. Fsync()ing the
> directory syncs also these temporary file NAMES that other processes may
> have open (but that they unlink rather than fsync()).
>
> Assume:
>
> open -> asynchronous, but filename synched on fsync()
> rename/link/unlink(/symlink) -> synchronous
>
> This way, you never need to fsync() the directory, so you never sync()
> entries of temporary files. You never lose important files (because the
> application uses fsync() and the OS synchs rename/link etc.).
Do you read what you are writing? How can a "synchronous" operation for
rename/link/unlink/symlink NOT also write out "temporary" files in the
same directory? How does calling fsync() on the directory IF YOU REQUIRE
SYNCHRONOUS DIRECTORY OPERATIONS differ from making the specific operations
synchronous from within the kernel???
The only difference I can see is that making these specific operations
ALWAYS be synchronous hurts the common case when they can be async (see
Solaris UFS vs. Linux benchmark elsewhere in this thread), while requiring
an fsync() on the directory == only synchronous operation when it is
actually needed, and no "extra" performance hit.
The only slight point of contention is if you have very large directories
which span several filesystem blocks, in which case it _would_ be possible
to write out some blocks synchronously, while leaving other blocks dirty.
In practice, however, you will only be modifying a small number of
blocks (at the end of the directory), because an MTA usually only creates
files and doesn't delete them, and the actual speed of syncing several
blocks at one time is not noticeably different from syncing only one.
Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
* Re: intermediate summary of ext3-2.4-0.9.4 thread
2001-08-02 22:18 ` Andreas Dilger
@ 2001-08-02 23:11 ` Matthias Andree
[not found] ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
1 sibling, 0 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-02 23:11 UTC (permalink / raw)
To: Andreas Dilger
Cc: Matthias Andree, Alexander Viro, Daniel Phillips,
Stephen C. Tweedie, linux-kernel
On Thu, 02 Aug 2001, Andreas Dilger wrote:
> > open -> asynchronous, but filename synched on fsync()
> > rename/link/unlink(/symlink) -> synchronous
> >
> > This way, you never need to fsync() the directory, so you never sync()
> > entries of temporary files. You never lose important files (because the
> > application uses fsync() and the OS synchs rename/link etc.).
>
> Do you read what you are writing? How can a "synchronous" operation for
> rename/link/unlink/symlink NOT also write out "temporary" files in the
> same directory? How does calling fsync() on the directory IF YOU REQUIRE
> SYNCHRONOUS DIRECTORY OPERATIONS differ from making the specific operations
> synchronous from within the kernel???
Can people please try to understand? Can people please start to THINK
before flaming?
I did not say that open() is to be synchronous. I did not write ANYTHING
of fsync()ing directories, I'm trying to get rid of this requirement.
Thus, if the kernel does rename/link synchronously, you'd never ever
fsync() a directory. To synch a filename to disk, you'd just fsync() the
filedescriptor (with a SUS compliant system, that is, i. e. ext3 or
reiserfs, but not ext2).
Now, if someone opens a temporary file, and nukes it later -- unlink()
--, and doesn't want it visible, he never calls fsync() for the file.
However, if some other process then fsync()s the directory, you start
synching the temporary file dirent -> unnecessary, is nuked later on
with an unlink().
That's why fsync() on the directory is on no account the minimum work.
> The only difference I can see is that making these specific operations
> ALWAYS be synchronous hurts the common case when they can be async (see
> Solaris UFS vs. Linux benchmark elsewhere in this thread), while requiring
> an fsync() on the directory == only synchronous operation when it is
> actually needed, and no "extra" performance hit.
In case you haven't noticed, this is about reliability without need to
fsync() the directory that doesn't all belong to your single, stupid
process but may have lots of asynchronous data of other processes -
temporary files for instance. You synch() that as well, which is
unnecessary and brings down other processes' performance.
In case you haven't noticed the other issue:
The whole thread is a FEATURE REQUEST for a dirsync mount option, for
MTAs and other software which requires reliable file systems, where the
name is negotiable. It aims to REDUCE OVERHEAD: chattr +S is currently the
only workaround for getting synchronous directories, but it makes file
writes synchronous as well, rendering things slower than necessary, since
write() can be buffered until you fsync() (and you want that to cut down
on seek times).
Call the option bsd_slow_dirs if you like, I don't care. Given the
option, the administrator/user has the choice, currently, he hasn't. He
cannot possibly change all applications ported from other Unices.
Note: hindering this option doesn't get Linux anywhere. Pure file
system benchmarks are not worth a single bit of entropy unless Linux is
benchmarked chattr +S -- it's unreliable otherwise.
I cannot remember how often I explained this during the course of this
thread. Every other day, some ignorant comes out of its cavern and
discusses the whole thing over and over again.
And, once again, fsync()ing the directory is not an option for portable
applications. It's unnecessary on every other system (until someone
shows a production-ready system which by default has asynchronous
directory updates as well, but no-one has so far.)
[parent not found: <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>]
* Re: intermediate summary of ext3-2.4-0.9.4 thread
[not found] ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
@ 2001-08-03 9:16 ` Matthias Andree
0 siblings, 0 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-03 9:16 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: linux-kernel
On Fri, 03 Aug 2001, Anton Altaparmakov wrote:
[dirsync chattr/mount options]
> Me neither. With regards to the parallel discussion on SUS compliance it is
> probably a good idea to have such a thing in some form anyway (although if
> I understood the discussion correctly, we really want this to happen by
> default, not just when some flag is set but then again I never read the
> standards...).
The standard doesn't really command the behaviour, as it seems, but we
might want to look again after SUS v3 has been released (supposed to
happen later this year) - the SUS compliance was rather on fsync than on
rename/link.
However, I'd rather not choose the default for somebody else, because he
may have different requirements. A compile-time switch to set the
default should be fine; THIS one might indeed default to dirsync/noasync
unless changed by make {x,menu,}config.
Assuming that the chattr +S is accompanied by a corresponding -o sync
mount option, I'd expect that the dirsync option be available as chattr
option and as mount option, and choosing default mount options should be
rather easy.
[parent not found: <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>]
* Re: intermediate summary of ext3-2.4-0.9.4 thread
2001-08-02 17:37 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
2001-08-02 18:35 ` Alexander Viro
@ 2001-08-02 19:47 ` Bill Rugolsky Jr.
2001-08-03 18:22 ` Matthias Andree
[not found] ` <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>
` (2 subsequent siblings)
4 siblings, 1 reply; 34+ messages in thread
From: Bill Rugolsky Jr. @ 2001-08-02 19:47 UTC (permalink / raw)
To: Daniel Phillips, Stephen C. Tweedie, linux-kernel
On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> The other thing is, that Linux is the only known system that does
> asynchronous rename/link/unlink/symlink -- people have claimed it might
> not be the only one, but failed to name systems.
>
> So we need to assume that Linux is the only system that does
> asynchronous rename/link/unlink/symlink, however a directory fsync() is
> believed to be rather expensive.
>
> Still, some people object to a dirsync mount option. But this has been
> the actual reason for the thread - MTA authors are refusing to pamper
> Linux and use chattr +S instead which gives unnecessary (premature) sync
> operations on write() - but MTAs know how to fsync().
Let's inject a little reality into this discussion. Filesystems are used
for something other than running MTA's written by stubborn "purists".
Solaris: Dell 600 MHz PIII 128MB RAM, largely quiescent:
Solaris 8 mu4, UFS with logging
Linux: VA Linux 800 MHz PIII, 128MB RAM, largely quiescent
RedHat Linux 7.1 w/ kernel-2.4.6-2.4 (2.4.6-ac5 + ext3-0.9.3).
660MB XFree86-4.1 build tree, cache primed with du -s in each case.
Here's something that we developers probably all do frequently: copy a
tree using hard links, so that we can patch it.
[solaris] find . | wc
33027 33027 1251671
[solaris] time find . -depth | cpio -pdul ../foo
0 blocks
363.46s real 0.84s user 10.13s system
Plain ext2:
[linux]# time find . -depth | cpio -pdul ../foo
0 blocks
real 0m3.823s user 0m0.240s sys 0m3.570s
Mounted ext3, ordered data mode.
[linux] time find . -depth | cpio -pdul ../foo
0 blocks
real 0m5.106s user 0m0.200s sys 0m3.700s
Mounted ext3, -o sync:
[root@ead51 bar]# time find . -depth | cpio -pdul ../foo
0 blocks
real 1m28.483s user 0m0.470s sys 0m4.410s
=====================================================
Solaris8 UFS:  363.5 seconds
ext2:            3.8 seconds
ext3:            5.1 seconds
ext3 -o sync:   88.5 seconds
Got it?
Obviously, the last is the result of the poor interaction
of ext3+sync in 0.9.3, but Andrew Morton has already fixed that.
I will try again with 0.9.5 when I have a chance to upgrade that
machine.
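For anyone wanting to repeat this kind of measurement without cpio, the
hard-link tree copy can be roughly approximated in Python. This is only a
sketch under assumptions: the names link_tree and timed_link_tree are
invented, it handles directories and regular files only, the destination
must lie outside the source tree, and its timings are not comparable with
the cpio figures above.

```python
import os
import tempfile
import time


def link_tree(src, dst):
    """Recreate the directory tree src under dst, hard-linking every file.

    Returns the number of hard links created.
    """
    links = 0
    for dirpath, dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        target_dir = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            # One link() call per file: a pure directory-update workload.
            os.link(os.path.join(dirpath, name),
                    os.path.join(target_dir, name))
            links += 1
    return links


def timed_link_tree(src, dst):
    """Run link_tree() and return (links created, elapsed seconds)."""
    start = time.perf_counter()
    links = link_tree(src, dst)
    return links, time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "src")
        os.makedirs(src)
        with open(os.path.join(src, "a.c"), "w") as fh:
            fh.write("int x;\n")
        print(timed_link_tree(src, os.path.join(d, "foo")))
```

Because the workload consists almost entirely of link() and mkdir() calls,
it is dominated by how the filesystem handles directory updates, which is
why the synchronous configurations fare so badly in the figures above.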
I have no idea where BSD falls, but the basic point stands: unused
features should not penalize other applications. Andrew Morton has
figured out how to do this efficiently with ext3, and many kudos to him
for doing the work. Absent that, why should I have to go get a cup of
coffee every time I want to patch a tree, just so some MTA can make
naive assumptions?
Regards,
Bill Rugolsky
* Re: intermediate summary of ext3-2.4-0.9.4 thread
2001-08-02 19:47 ` Bill Rugolsky Jr.
@ 2001-08-03 18:22 ` Matthias Andree
0 siblings, 0 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-03 18:22 UTC (permalink / raw)
To: Bill Rugolsky Jr.; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel
On Thu, 02 Aug 2001, Bill Rugolsky Jr. wrote:
> I have no idea where BSD falls, but the basic point stands: unused
> features should not penalize other applications. Andrew Morton has
> figured out how to do this efficiently with ext3, and many kudos to him
> for doing the work. Absent that, why should I have to go get a cup of
> coffee every time I want to patch a tree, just so some MTA can make
> naive assumptions?
The whole idea is to have a switch to turn on BSD-style synchronous
directory update semantics. Nothing more, nothing you would not be able
to get rid of. In fact, you can mount file systems async on BSD as
well, but you'd better not have the machine crash. Irrecoverable file
system damage can result. As a compromise, softupdates are nearly as
fast as async, but FS damage is guaranteed to be recoverable.
In either case (async or soft-updates), files can end up in lost+found
after the control had been returned to the application that called open
or link.
--
Matthias Andree
[parent not found: <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>]
* Re: intermediate summary of ext3-2.4-0.9.4 thread
2001-08-02 17:37 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
` (2 preceding siblings ...)
[not found] ` <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>
@ 2001-08-03 8:30 ` Stephen C. Tweedie
2001-08-03 18:28 ` Matthias Andree
2001-08-03 8:50 ` David Weinehall
4 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 2001-08-03 8:30 UTC (permalink / raw)
To: Daniel Phillips, Stephen C. Tweedie, linux-kernel
Hi,
On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> So this part is covered.
>
> The other thing is, that Linux is the only known system that does
> asynchronous rename/link/unlink/symlink -- people have claimed it might
> not be the only one, but failed to name systems.
Not true. There are tons of others.
The issue was that synchronous directory updates are *optional* on
many systems (Linux included), but that Linux's support for that is
really inefficient since it ends up syncing file metadata updates too
(and it's much more efficient to use fsync for that.)
> Still, some people object to a dirsync mount option.
Who? People who have discussed this in the past have certainly not
objected to my knowledge. It would clearly help situations like this
(as would a dirsync chattr option.)
> > The prescription for symlinks is, if you want them safely on disk you
> > have to explicitly fsync the containing directory.
>
> Yes, and it doesn't matter, since MTAs don't use symlinks (symlinks
> waste inodes on most systems).
Irrelevant. We're talking about what makes sensible semantics, not
what assumptions any specific application makes. It makes no sense to
say that dirsync won't affect symlinks just because some existing
applications don't rely on that!
Cheers,
Stephen
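
The prescription discussed above (fsync the file, then fsync the
containing directory so that the new name itself is on disk) can be
sketched roughly as follows. This is a minimal illustration, not code
from the thread: the helper names fsync_directory, create_durably and
symlink_durably are hypothetical, and it assumes a filesystem that
accepts fsync() on a directory descriptor opened read-only.

```python
import os
import tempfile


def fsync_directory(dirpath):
    """Flush a directory's entries by fsync()ing a descriptor opened on it."""
    dirfd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)


def create_durably(path, data):
    """Create a new file, then flush its contents and its directory entry."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        offset = 0
        while offset < len(data):
            # os.write may write fewer bytes than requested, so loop.
            offset += os.write(fd, data[offset:])
        os.fsync(fd)  # file data and inode metadata
    finally:
        os.close(fd)
    # Without this step the name may still be missing after a crash.
    fsync_directory(os.path.dirname(os.path.abspath(path)))


def symlink_durably(target, linkpath):
    """Create a symlink and flush the directory that contains it."""
    os.symlink(target, linkpath)
    fsync_directory(os.path.dirname(os.path.abspath(linkpath)))


if __name__ == "__main__":
    # Hypothetical spool directory, used only for the demonstration.
    with tempfile.TemporaryDirectory() as spool:
        message = os.path.join(spool, "msg.0001")
        create_durably(message, b"queued message\n")
        symlink_durably("msg.0001", os.path.join(spool, "latest"))
        print(sorted(os.listdir(spool)))
```

The directory fsync is what a dirsync-style option would make
unnecessary for the application; the file fsync would still be needed
for the data.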
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: intermediate summary of ext3-2.4-0.9.4 thread
2001-08-03 8:30 ` Stephen C. Tweedie
@ 2001-08-03 18:28 ` Matthias Andree
0 siblings, 0 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-03 18:28 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Daniel Phillips, linux-kernel
On Fri, 03 Aug 2001, Stephen Tweedie wrote:
> > > The prescription for symlinks is, if you want them safely on disk you
> > > have to explicitly fsync the containing directory.
> >
> > Yes, and it doesn't matter, since MTAs don't use symlinks (symlinks
> > waste inodes on most systems).
>
> Irrelevant. We're talking about what makes sensible semantics, not
> what assumptions any specific application makes. It makes no sense to
> say that dirsync won't affect symlinks just because some existing
> applications don't rely on that!
I merely imagined that tracking hard links might be easier than
tracking symlinks, because hard links share the inode number. A more
advanced (and complex) implementation might prove me wrong. I don't want
to speculate about which one is more efficient.
--
Matthias Andree
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: intermediate summary of ext3-2.4-0.9.4 thread
2001-08-02 17:37 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
` (3 preceding siblings ...)
2001-08-03 8:30 ` Stephen C. Tweedie
@ 2001-08-03 8:50 ` David Weinehall
2001-08-03 18:31 ` Matthias Andree
2001-08-03 19:59 ` Albert D. Cahalan
4 siblings, 2 replies; 34+ messages in thread
From: David Weinehall @ 2001-08-03 8:50 UTC (permalink / raw)
To: Daniel Phillips, Stephen C. Tweedie, linux-kernel
On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> On Thu, 02 Aug 2001, Daniel Phillips wrote:
>
> [file name must be flushed on fsync()]
> > I don't know why it is hard or inefficient to implement this at the VFS
> > level, though I'm sure there is a reason or this thread wouldn't
> > exist. Stephen, perhaps you could explain for the record why sys_fsync
> > can't just walk the chain of dentry parent links doing fdatasync? Does
> > this create VFS or Ext3 locking problems? Or maybe it repeats work
> > that Ext3 is already supposed to have done?
>
> Well, the course was that I asked whether ext3 would do synchronous
> directory updates, and some people jumped in and said that one should
> fsync() the parent directory, however, since we figure from SUS, that's
> invalid.
>
> After some back and forth, we finally figured that at least ext2 is
> implementing fsync() improperly.
>
> So this part is covered.
Yup, and this should be fixed imho.
> The other thing is, that Linux is the only known system that does
> asynchronous rename/link/unlink/symlink -- people have claimed it might
> not be the only one, but failed to name systems.
And this is a feature, not a bug.
> So we need to assume that Linux is the only system that does
> asynchronous rename/link/unlink/symlink, however a directory fsync() is
> believed to be rather expensive.
A directory fsync() might be expensive on non-Linux filesystems...
> Still, some people object to a dirsync mount option. But this has been
> the actual reason for the thread - MTA authors are refusing to pamper
> Linux and use chattr +S instead which gives unnecessary (premature) sync
> operations on write() - but MTAs know how to fsync().
So what you mean is that MTA authors refuse to pamper Linux through use
of fsync of the directory, but are willing to "pamper" Linux through use
of chattr +S?! This seems ridiculous. It seems equally ridiculous to
demand that Linux should pamper MTA authors who can't implement
fsync on the directory instead of writing BSD-specific code.
[snip]
To me this seems mostly like a way of saying "Hey, we've finally found
a way to make Linux look really bad compared to BSD-systems; let's
complain instead of writing alternative code that suits Linux systems
better than this code does." A lot like all the discussions on threads,
really.
Then again, I'm probably just extra grouchy today because it rained when
I rode my bike to work.
/David Weinehall
_ _
// David Weinehall <tao@acc.umu.se> /> Northern lights wander \\
// Project MCA Linux hacker // Dance across the winter sky //
\> http://www.acc.umu.se/~tao/ </ Full colour fire </
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: intermediate summary of ext3-2.4-0.9.4 thread
2001-08-03 8:50 ` David Weinehall
@ 2001-08-03 18:31 ` Matthias Andree
2001-08-03 19:59 ` Albert D. Cahalan
1 sibling, 0 replies; 34+ messages in thread
From: Matthias Andree @ 2001-08-03 18:31 UTC (permalink / raw)
To: David Weinehall; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel
On Fri, 03 Aug 2001, David Weinehall wrote:
> On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> > Still, some people object to a dirsync mount option. But this has been
> > the actual reason for the thread - MTA authors are refusing to pamper
> > Linux and use chattr +S instead which gives unnecessary (premature) sync
> > operations on write() - but MTAs know how to fsync().
>
> So what you mean is that MTA authors refuse to pamper Linux through use
> of fsync of the directory, but are willing to "pamper" Linux through use
> of chattr +S?! This seems ridiculous. It seems equally ridiculous to
> demand that Linux should pamper MTA authors who can't implement
> fsync on the directory instead of writing BSD-specific code.
It's a maintenance issue.
You effectively start wrapping up all relevant syscalls and end up with
system-specific interfaces. One system wants the directory fsync()ed, the
other offers some other special trick to get the data flushed... what use
is portability then, if systems are so different?
> To me this seems mostly like a way of saying "Hey, we've finally found
> a way to make Linux look really bad compared to BSD-systems; let's
No wonder if the application chooses fully-synchronous operation on
Linux.
--
Matthias Andree
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: intermediate summary of ext3-2.4-0.9.4 thread
2001-08-03 8:50 ` David Weinehall
2001-08-03 18:31 ` Matthias Andree
@ 2001-08-03 19:59 ` Albert D. Cahalan
2001-08-03 19:54 ` Gregory Maxwell
1 sibling, 1 reply; 34+ messages in thread
From: Albert D. Cahalan @ 2001-08-03 19:59 UTC (permalink / raw)
To: David Weinehall; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel
David Weinehall writes:
> On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
>> Still, some people object to a dirsync mount option. But this has been
>> the actual reason for the thread - MTA authors are refusing to pamper
>> Linux and use chattr +S instead which gives unnecessary (premature) sync
>> operations on write() - but MTAs know how to fsync().
>
> So what you mean is that MTA authors refuse to pamper Linux through use
> of fsync of the directory, but are willing to "pamper" Linux through use
> of chattr +S?! This seems ridiculous. It seems equally ridiculous to
> demand that Linux should pamper MTA authors who can't implement
> fsync on the directory instead of writing BSD-specific code.
>
> [snip]
>
> To me this seems mostly like a way of saying "Hey, we've finally found
> a way to make Linux look really bad compared to BSD-systems; let's
> complain instead of writing alternative code that suits Linux systems
> better than this code does." A lot like all the discussions on threads,
> really.
This is just completely true. One wonders why we seem to enjoy
getting screwed this way. We shouldn't be patching these MTAs or
hacking Linux to act like BSD. We should be avoiding these MTAs.
Somebody can create a big MTA list, listing the good and bad ones.
Then we get the Linux-hostile MTAs out of the Linux distributions,
demanding compliance like we do for filesystem layout. We also hunt
down Linux-related web pages that mention these MTAs and get the
pages changed or removed. The point is to make these MTAs just
disappear, never to be seen again. Nice MTAs get promoted.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: intermediate summary of ext3-2.4-0.9.4 thread
2001-08-03 19:59 ` Albert D. Cahalan
@ 2001-08-03 19:54 ` Gregory Maxwell
0 siblings, 0 replies; 34+ messages in thread
From: Gregory Maxwell @ 2001-08-03 19:54 UTC (permalink / raw)
To: Albert D. Cahalan
Cc: David Weinehall, Daniel Phillips, Stephen C. Tweedie, linux-kernel
On Fri, Aug 03, 2001 at 03:59:02PM -0400, Albert D. Cahalan wrote:
[snip]
> Somebody can create a big MTA list, listing the good and bad ones.
> Then we get the Linux-hostile MTAs out of the Linux distributions,
> demanding compliance like we do for filesystem layout. We also hunt
> down Linux-related web pages that mention these MTAs and get the
> pages changed or removed. The point is to make these MTAs just
> disappear, never to be seen again. Nice MTAs get promoted.
Think we could just get their authors to 'disappear'? It might be more cost
effective, and I can think of at least one example where removing the author
would have other benefits beyond MTAs. :) :)
^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2001-08-07 17:01 UTC | newest]
Thread overview: 34+ messages
2001-08-03 20:25 intermediate summary of ext3-2.4-0.9.4 thread Sam James
[not found] <0108030507330F.00440@starship>
[not found] ` <Pine.GSO.4.21.0108022312211.1494-100000@weyl.math.psu.edu>
2001-08-03 13:09 ` Daniel Phillips
2001-08-03 14:43 ` Horst von Brand
2001-08-03 17:49 ` Mike Castle
2001-08-04 3:23 ` Daniel Phillips
2001-08-03 18:08 ` Alexander Viro
2001-08-03 18:26 ` Daniel Phillips
2001-08-03 18:53 ` Alexander Viro
2001-08-03 20:50 ` Daniel Phillips
2001-08-04 3:43 ` Matthias Andree
2001-08-03 18:34 ` Linus Torvalds
2001-08-03 18:36 ` Matthias Andree
2001-08-03 19:16 ` Alexander Viro
[not found] <0108030354130E.00440@starship>
[not found] ` <200108030207.f7326OpR003086@sleipnir.valparaiso.cl>
2001-08-03 18:34 ` Matthias Andree
-- strict thread matches above, loose matches on Subject: below --
2001-07-26 7:34 ext3-2.4-0.9.4 Andrew Morton
2001-08-01 16:02 ` ext3-2.4-0.9.4 Stephen C. Tweedie
2001-08-02 9:03 ` ext3-2.4-0.9.4 Matthias Andree
2001-08-02 17:26 ` ext3-2.4-0.9.4 Daniel Phillips
2001-08-02 17:37 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
2001-08-02 18:35 ` Alexander Viro
2001-08-02 18:47 ` Matthias Andree
2001-08-02 22:18 ` Andreas Dilger
2001-08-02 23:11 ` Matthias Andree
[not found] ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
2001-08-03 9:16 ` Matthias Andree
[not found] ` <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>
[not found] ` <20010803021406.A9845@emma1.emma.line.org>
2001-08-03 16:20 ` Jan Harkes
2001-08-03 22:48 ` Andreas Dilger
2001-08-02 19:47 ` Bill Rugolsky Jr.
2001-08-03 18:22 ` Matthias Andree
[not found] ` <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>
[not found] ` <20010803021642.B9845@emma1.emma.line.org>
2001-08-03 7:03 ` Eric W. Biederman
2001-08-03 8:39 ` Matthias Andree
2001-08-03 9:57 ` Christoph Hellwig
2001-08-04 7:55 ` Eric W. Biederman
2001-08-03 8:30 ` Stephen C. Tweedie
2001-08-03 18:28 ` Matthias Andree
2001-08-03 8:50 ` David Weinehall
2001-08-03 18:31 ` Matthias Andree
2001-08-03 19:59 ` Albert D. Cahalan
2001-08-03 19:54 ` Gregory Maxwell