linux-kernel.vger.kernel.org archive mirror
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
       [not found] ` <mit.lcs.mail.linux-kernel/20020712162306$aa7d@traf.lcs.mit.edu>
@ 2002-07-15 15:22   ` Patrick J. LoPresti
  2002-07-15 17:31     ` Chris Mason
                       ` (3 more replies)
  0 siblings, 4 replies; 82+ messages in thread
From: Patrick J. LoPresti @ 2002-07-15 15:22 UTC (permalink / raw)
  To: linux-kernel

Consider this argument:

  Given: On ext3, fsync() of any file on a partition commits all
         outstanding transactions on that partition to the log.

  Given: data=ordered forces pending data writes for a file to happen
         before related transactions are committed to the log.

  Therefore: With data=ordered, fsync() of any file on a partition
             syncs the outstanding writes of EVERY file on that
             partition.

Is this argument correct?  If so, it suggests that data=ordered is
actually the *worst* possible journalling mode for a mail spool.

One other thing.  I think this statement is misleading:

    IF your server is stable and not prone to crashing, and/or you
    have the write cache on your hard drives battery backed, you
    should strongly consider using the writeback journaling mode of
    Ext3 versus ordered.

This makes it sound like data=writeback is somehow unsafe when
machines crash.  I do not think this is true.  If your application
(e.g., Postfix) is written correctly (which it is), so it calls
fsync() when it is supposed to, then data=writeback is *exactly* as
safe as any other journalling mode.  "Battery backed caches" and the
like have nothing to do with it.  And if your application is written
incorrectly, then other journalling modes will reduce but not
eliminate the chances for things to break catastrophically on a crash.

So if the partition is dedicated to correct applications, like a mail
spool is, then data=writeback is perfectly safe.  If it is faster,
too, then it really is a no-brainer.

 - Pat


* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 15:22   ` [ANNOUNCE] Ext3 vs Reiserfs benchmarks Patrick J. LoPresti
@ 2002-07-15 17:31     ` Chris Mason
  2002-07-15 18:33     ` Matthias Andree
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 82+ messages in thread
From: Chris Mason @ 2002-07-15 17:31 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel

On Mon, 2002-07-15 at 11:22, Patrick J. LoPresti wrote:
> Consider this argument:
> 
>   Given: On ext3, fsync() of any file on a partition commits all
>          outstanding transactions on that partition to the log.
> 
>   Given: data=ordered forces pending data writes for a file to happen
>          before related transactions are committed to the log.
> 
>   Therefore: With data=ordered, fsync() of any file on a partition
>              syncs the outstanding writes of EVERY file on that
>              partition.
> 
> Is this argument correct?  If so, it suggests that data=ordered is
> actually the *worst* possible journalling mode for a mail spool.
> 

Yes.  In practice this doesn't hurt as much as it could, because ext3
does a good job of letting more writers come in before forcing the
commit.  What hurts you is when a forced commit comes in the middle of
creating the file.  A data write that could have been contiguous gets
broken into two or more writes instead.

> One other thing.  I think this statement is misleading:
> 
>     IF your server is stable and not prone to crashing, and/or you
>     have the write cache on your hard drives battery backed, you
>     should strongly consider using the writeback journaling mode of
>     Ext3 versus ordered.
> 
> This makes it sound like data=writeback is somehow unsafe when
> machines crash.  I do not think this is true.  If your application
> (e.g., Postfix) is written correctly (which it is), so it calls
> fsync() when it is supposed to, then data=writeback is *exactly* as
> safe as any other journalling mode.  

Almost.  data=writeback makes it possible for the old contents of a
block to end up in a newly grown file.  There are a few ways this can
screw you up:

1) that newly grown file is someone's inbox, and the old contents of the
new block include someone else's private message.

2) That newly grown file is a control file for the application, and the
application expects it to contain valid data within (think sendmail).  

> "Battery backed caches" and the
> like have nothing to do with it.  

Nope, battery backed caches don't make data=writeback more or less safe
(with respect to the data anyway).  They do make data=ordered and
data=journal more safe.

> And if your application is written
> incorrectly, then other journalling modes will reduce but not
> eliminate the chances for things to break catastrophically on a crash.
> 
> So if the partition is dedicated to correct applications, like a mail
> spool is, then data=writeback is perfectly safe.  If it is faster,
> too, then it really is a no-brainer.

For mail servers, data=journal is your friend.  ext3 sometimes needs a
bigger log for it (reiserfs data=journal patches don't), but the
performance increase can be significant.

-chris




* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 15:22   ` [ANNOUNCE] Ext3 vs Reiserfs benchmarks Patrick J. LoPresti
  2002-07-15 17:31     ` Chris Mason
@ 2002-07-15 18:33     ` Matthias Andree
       [not found]     ` <20020715173337$acad@traf.lcs.mit.edu>
  2002-07-16  7:07     ` Dax Kelson
  3 siblings, 0 replies; 82+ messages in thread
From: Matthias Andree @ 2002-07-15 18:33 UTC (permalink / raw)
  To: linux-kernel

On Mon, 15 Jul 2002, Patrick J. LoPresti wrote:

> One other thing.  I think this statement is misleading:
> 
>     IF your server is stable and not prone to crashing, and/or you
>     have the write cache on your hard drives battery backed, you
>     should strongly consider using the writeback journaling mode of
>     Ext3 versus ordered.
> 
> This makes it sound like data=writeback is somehow unsafe when
> machines crash.  I do not think this is true.  If your application

Well, if your fsync() completes...

> (e.g., Postfix) is written correctly (which it is), so it calls
> fsync() when it is supposed to, then data=writeback is *exactly* as
> safe as any other journalling mode.  "Battery backed caches" and the
> like have nothing to do with it.  And if your application is written
> incorrectly, then other journalling modes will reduce but not
> eliminate the chances for things to break catastrophically on a crash.

...then you're right. If the machine crashes in the middle of the
fsync() operation, but has scheduled metadata before file contents,
then journal recovery can present you with a file that contains bogus
data, which will confuse some applications. I believe Postfix will
recover from this condition either way: it will see that its file is
hosed and ignore or discard it (depending on what it is), but software
that blindly relies on a special format without checking will barf.

All of this assumes two things:

1. the application actually calls fsync()

2. the application can detect if fsync() succeeded before the crash
(like fsync -> fchmod -> fsync, structured file contents, whatever).

> So if the partition is dedicated to correct applications, like a mail
> spool is, then data=writeback is perfectly safe.  If it is faster,
> too, then it really is a no-brainer.

These ordering promises also apply to applications that do not call
fsync() or that cannot detect hosed files. Been there, seen that, with
CVS on unpatched ReiserFS as of Linux-2.4.19-presomething: suddenly one
,v file contained NUL blocks. The server barfed, the (remote!) client
segfaulted... yes, it's almost as bad as it can get.

Not catastrophic, a tape backup was available, but it still took some
time to restore the file and investigate the issue. It boiled down to
"nobody's fault, but a missing feature". With data=ordered or
data=journal, I would have had either my old ,v file around or a proper
new one.

I'm now using Chris Mason's data-logging patches to see how things work
out. I had one crash with an old version, then updated to the -11
version and have yet to see anything break again.

I'd certainly appreciate it if these patches were merged early in
2.4.20-pre, so that they get some testing and can be in 2.4.20, and
Linux would have two file systems with data=ordered to choose from.

Disclaimer: I know nothing about XFS or JFS beyond their bare
existence. Feel free to add comments.

-- 
Matthias Andree


* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
       [not found]       ` <mit.lcs.mail.linux-kernel/20020715173337$acad@traf.lcs.mit.edu>
@ 2002-07-15 19:13         ` Patrick J. LoPresti
  2002-07-15 20:55           ` Matthias Andree
  2002-07-15 21:14           ` Chris Mason
  0 siblings, 2 replies; 82+ messages in thread
From: Patrick J. LoPresti @ 2002-07-15 19:13 UTC (permalink / raw)
  To: linux-kernel

Chris Mason <mason@suse.com> writes:

> > One other thing.  I think this statement is misleading:
> > 
> >     IF your server is stable and not prone to crashing, and/or you
> >     have the write cache on your hard drives battery backed, you
> >     should strongly consider using the writeback journaling mode of
> >     Ext3 versus ordered.
> > 
> > This makes it sound like data=writeback is somehow unsafe when
> > machines crash.  I do not think this is true.  If your application
> > (e.g., Postfix) is written correctly (which it is), so it calls
> > fsync() when it is supposed to, then data=writeback is *exactly* as
> > safe as any other journalling mode.  
> 
> Almost.  data=writeback makes it possible for the old contents of a
> block to end up in a newly grown file.

Only if the application is already broken.

> There are a few ways this can screw you up:
> 
> 1) that newly grown file is someone's inbox, and the old contents of the
> new block include someone else's private message.
>
> 2) That newly grown file is a control file for the application, and the
> application expects it to contain valid data within (think sendmail).  

In a correctly-written application, neither of these things can
happen.  (See my earlier message today on fsync() and MTAs.)  To get a
file onto disk reliably, the application must 1) flush the data, and
then 2) flush a "validity" indicator.  This could be a sequence like:

  create temp file
  flush data to temp file
  rename temp file
  flush rename operation

In this sequence, the file's existence under a particular name is the
indicator of its validity.
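
A minimal sketch of this sequence in Python, assuming a POSIX system
where a directory can be opened read-only and passed to fsync() (true
on Linux, not guaranteed elsewhere). The helper name reliable_write and
the ".tmp" suffix are hypothetical, chosen only for illustration:

```python
import os

def reliable_write(path, data):
    """Write data so that `path` only ever names complete contents."""
    tmp = path + ".tmp"  # hypothetical temp-file naming scheme
    dirname = os.path.dirname(os.path.abspath(path))

    # create temp file, then flush its data
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:                       # os.write may write only part
            view = view[os.write(fd, view):]
        os.fsync(fd)                      # "flush data to temp file"
    finally:
        os.close(fd)

    # rename temp file; its new name is the validity indicator
    os.rename(tmp, path)

    # "flush rename operation": assumed here to mean fsync() on the
    # directory that contains the name
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Only after the last fsync() returns would the caller report success to
whoever is waiting on the write.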

If you skip either of these flush operations, you are not behaving
reliably.  Skipping the first flush means the validity indicator might
hit the disk before the data; so after a crash, you might see invalid
data in an allegedly valid file.  Skipping the second flush means you
do not know that the validity indicator has been set, so you cannot
report success to whoever is waiting for this "reliable write" to
happen.

It is possible to make an application which relies on data=ordered
semantics; for example, skipping the "flush data to temp file" step
above.  But such an application would be broken for every version of
Unix *except* Linux in data=ordered mode.  I would call that an
incorrect application.

> Nope, battery backed caches don't make data=writeback more or less safe
> (with respect to the data anyway).  They do make data=ordered and
> data=journal more safe.

A theorist would say that "more safe" is a sloppy concept.  Either an
operation is safe or it is not.  As I said in my last message,
data=ordered (and data=journal) can reduce the risk for poorly written
apps.  But they cannot eliminate that risk, and for a correctly
written app, data=writeback is 100% as safe.

 - Pat


* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 19:13         ` Patrick J. LoPresti
@ 2002-07-15 20:55           ` Matthias Andree
  2002-07-15 21:23             ` Patrick J. LoPresti
  2002-07-15 22:55             ` Alan Cox
  2002-07-15 21:14           ` Chris Mason
  1 sibling, 2 replies; 82+ messages in thread
From: Matthias Andree @ 2002-07-15 20:55 UTC (permalink / raw)
  To: linux-kernel

On Mon, 15 Jul 2002, Patrick J. LoPresti wrote:

> In a correctly-written application, neither of these things can
> happen.  (See my earlier message today on fsync() and MTAs.)  To get a
> file onto disk reliably, the application must 1) flush the data, and
> then 2) flush a "validity" indicator.  This could be a sequence like:
> 
>   create temp file
>   flush data to temp file
>   rename temp file
>   flush rename operation
> 
> In this sequence, the file's existence under a particular name is the
> indicator of its validity.

Assume that most applications are broken then.

I assume that most will just call close() or fclose() and exit() right
away. Does fclose() imply fsync()? 

Some applications will not even check the [f]close() return value...

> It is possible to make an application which relies on data=ordered
> semantics; for example, skipping the "flush data to temp file" step
> above.  But such an application would be broken for every version of
> Unix *except* Linux in data=ordered mode.  I would call that an
> incorrect application.

Or very specific, at least.

> > Nope, battery backed caches don't make data=writeback more or less safe
> > (with respect to the data anyway).  They do make data=ordered and
> > data=journal more safe.
> 
> A theorist would say that "more safe" is a sloppy concept.  Either an
> operation is safe or it is not.  As I said in my last message,
> data=ordered (and data=journal) can reduce the risk for poorly written
> apps.  But they cannot eliminate that risk, and for a correctly
> written app, data=writeback is 100% as safe.

IF that application uses a marker to mark completion. If it does not,
data=ordered will be the safe bet, regardless of fsync() or not. The
machine can crash BEFORE the fsync() is called.

-- 
Matthias Andree


* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 19:13         ` Patrick J. LoPresti
  2002-07-15 20:55           ` Matthias Andree
@ 2002-07-15 21:14           ` Chris Mason
  2002-07-15 21:31             ` Patrick J. LoPresti
  2002-07-16 12:35             ` Matthias Andree
  1 sibling, 2 replies; 82+ messages in thread
From: Chris Mason @ 2002-07-15 21:14 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel

On Mon, 2002-07-15 at 15:13, Patrick J. LoPresti wrote:

> > 1) that newly grown file is someone's inbox, and the old contents of the
> > new block include someone else's private message.
> >
> > 2) That newly grown file is a control file for the application, and the
> > application expects it to contain valid data within (think sendmail).  
> 
> In a correctly-written application, neither of these things can
> happen.  (See my earlier message today on fsync() and MTAs.)  To get a
> file onto disk reliably, the application must 1) flush the data, and
> then 2) flush a "validity" indicator.  This could be a sequence like:
> 
>   create temp file
>   flush data to temp file
>   rename temp file
>   flush rename operation

Yes, most MTAs do this for queue files; I'm not sure how many do it for
the actual spool file.  Mail server authors are more than welcome to
recommend the best safety/performance combo for their product, and to
ask the FS guys which combinations are safe.

-chris




* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 20:55           ` Matthias Andree
@ 2002-07-15 21:23             ` Patrick J. LoPresti
  2002-07-15 21:38               ` Thunder from the hill
  2002-07-15 21:59               ` Ketil Froyn
  2002-07-15 22:55             ` Alan Cox
  1 sibling, 2 replies; 82+ messages in thread
From: Patrick J. LoPresti @ 2002-07-15 21:23 UTC (permalink / raw)
  To: linux-kernel

Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> I assume that most will just call close() or fclose() and exit() right
> away. Does fclose() imply fsync()? 

Not according to my close(2) man page:

       A successful close does not guarantee that  the  data  has
       been  successfully  saved  to  disk,  as the kernel defers
       writes. It is not common for a  filesystem  to  flush  the
       buffers  when the stream is closed. If you need to be sure
       that the data is physically stored use fsync(2).  (It will
       depend on the disk hardware at this point.)

Note that this means writing a truly reliable shell or Perl script is
tricky.  I suppose you can "use POSIX qw(fsync);" in Perl.  But what
do you do for a shell script?  /bin/sync :-) ?

> Some applications will not even check the [f]close() return value...

Such applications are broken, of course.

> > It is possible to make an application which relies on data=ordered
> > semantics; for example, skipping the "flush data to temp file" step
> > above.  But such an application would be broken for every version of
> > Unix *except* Linux in data=ordered mode.  I would call that an
> > incorrect application.
> 
> Or very specific, at least.

Hm.  Does BSD with soft updates guarantee anything about write
ordering on fsync()?  In particular, does it promise to commit the
data before the metadata?

> > A theorist would say that "more safe" is a sloppy concept.  Either an
> > operation is safe or it is not.  As I said in my last message,
> > data=ordered (and data=journal) can reduce the risk for poorly written
> > apps.  But they cannot eliminate that risk, and for a correctly
> > written app, data=writeback is 100% as safe.
> 
> IF that application uses a marker to mark completion. If it does not,
> data=ordered will be the safe bet, regardless of fsync() or not. The
> machine can crash BEFORE the fsync() is called.

Without marking completion, there is no safe bet.  Without calling
fsync(), you *never* know when the data will hit the disk.  It is very
hard to build a reliable system that way...  For an MTA, for example,
you can never safely inform the remote mailer that you have accepted
the message.  But this problem goes beyond MTAs; very few applications
live in a vacuum.

Reliable systems are tricky.  I guess this is why Oracle and Sybase
make all that money.

 - Pat


* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 21:14           ` Chris Mason
@ 2002-07-15 21:31             ` Patrick J. LoPresti
  2002-07-15 22:12               ` Richard A Nelson
  2002-07-16  1:02               ` Lawrence Greenfield
  2002-07-16 12:35             ` Matthias Andree
  1 sibling, 2 replies; 82+ messages in thread
From: Patrick J. LoPresti @ 2002-07-15 21:31 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-kernel

Chris Mason <mason@suse.com> writes:

> Yes, most mtas do this for queue files, I'm not sure how many do it for
> the actual spool file.

Maybe the control files are small enough to fit in one disk block,
making the operations atomic in practice.  Or something.

> mail server authors are more than welcome to recommend the best
> safety/performance combo for their product, and to ask the FS guys
> which combinations are safe.

Yeah, but it's a shame if those combinations require performance hits
like "synchronous directory updates" or, worse, "fsync() == sync()".

I really wish MTA authors would just support Linux's "fsync the
directory" approach.  It is simple, reliable, and fast.  Yes, it does
require Linux-specific support in the application, but that's what
application authors should expect when there is a gap in the
standards.

 - Pat


* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 21:23             ` Patrick J. LoPresti
@ 2002-07-15 21:38               ` Thunder from the hill
  2002-07-16 12:31                 ` Matthias Andree
  2002-07-15 21:59               ` Ketil Froyn
  1 sibling, 1 reply; 82+ messages in thread
From: Thunder from the hill @ 2002-07-15 21:38 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel

Hi,

On 15 Jul 2002, Patrick J. LoPresti wrote:
> Note that this means writing a truly reliable shell or Perl script is
> tricky.  I suppose you can "use POSIX qw(fsync);" in Perl.  But what do
> you do for a shell script?  /bin/sync :-) ?

Write a binary (/usr/bin/fsync) which opens an fd on the file, calls
fsync() on it, closes it, and is done.
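
A rough sketch of such a tool, written in Python rather than as a
compiled binary. The function names and the exit-status convention are
assumptions for illustration; it relies on fsync() being accepted on a
descriptor opened read-only, which holds on Linux:

```python
import os
import sys

def fsync_path(path):
    """Open path, fsync it, close it."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def main(argv):
    """fsync every path in argv[1:]; return 0 if all succeeded, else 1."""
    status = 0
    for path in argv[1:]:
        try:
            fsync_path(path)
        except OSError as exc:
            print(f"fsync: {path}: {exc.strerror}", file=sys.stderr)
            status = 1
    return status
```

Installed as a script that ends with sys.exit(main(sys.argv)), a shell
script could then run "fsync FILE..." after writing, and check the exit
status, instead of falling back to /bin/sync.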

							Regards,
							Thunder
-- 
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o?  K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y- 
------END GEEK CODE BLOCK------



* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 22:55             ` Alan Cox
@ 2002-07-15 21:58               ` Matthias Andree
  0 siblings, 0 replies; 82+ messages in thread
From: Matthias Andree @ 2002-07-15 21:58 UTC (permalink / raw)
  To: linux-kernel

On Mon, 15 Jul 2002, Alan Cox wrote:

> We are only interested in reliable code. Anything else is already
> fatally broken.
> 
> -- quote --
>        Not checking the return value of close  is  a  common  but
>        nevertheless   serious  programming  error.   File  system

As in 6. on http://www.apocalypse.org/pub/u/paul/docs/commandments.html
(The Ten Commandments for C Programmers, by Henry Spencer).


* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 21:23             ` Patrick J. LoPresti
  2002-07-15 21:38               ` Thunder from the hill
@ 2002-07-15 21:59               ` Ketil Froyn
  2002-07-15 23:08                 ` Matti Aarnio
  1 sibling, 1 reply; 82+ messages in thread
From: Ketil Froyn @ 2002-07-15 21:59 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel

On 15 Jul 2002, Patrick J. LoPresti wrote:

> Without calling fsync(), you *never* know when the data will hit the
> disk.

Doesn't bdflush ensure that data is written to disk within 30 seconds or 
some tunable number of seconds?

Ketil



* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 21:31             ` Patrick J. LoPresti
@ 2002-07-15 22:12               ` Richard A Nelson
  2002-07-16  1:02               ` Lawrence Greenfield
  1 sibling, 0 replies; 82+ messages in thread
From: Richard A Nelson @ 2002-07-15 22:12 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: Chris Mason, linux-kernel

On 15 Jul 2002, Patrick J. LoPresti wrote:

> I really wish MTA authors would just support Linux's "fsync the
> directory" approach.  It is simple, reliable, and fast.  Yes, it does
> require Linux-specific support in the application, but that's what
> application authors should expect when there is a gap in the
> standards.

This is exactly what sendmail did in its 8.12.0 release (2001/09/08)

-- 
Rick Nelson
"...very few phenomena can pull someone out of Deep Hack Mode, with two
noted exceptions: being struck by lightning, or worse, your *computer*
being struck by lightning."
(By Matt Welsh)



* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 20:55           ` Matthias Andree
  2002-07-15 21:23             ` Patrick J. LoPresti
@ 2002-07-15 22:55             ` Alan Cox
  2002-07-15 21:58               ` Matthias Andree
  1 sibling, 1 reply; 82+ messages in thread
From: Alan Cox @ 2002-07-15 22:55 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

On Mon, 2002-07-15 at 21:55, Matthias Andree wrote:
> I assume that most will just call close() or fclose() and exit() right
> away. Does fclose() imply fsync()? 

It doesn't.

> Some applications will not even check the [f]close() return value...

We are only interested in reliable code. Anything else is already
fatally broken.

-- quote --
       Not checking the return value of close  is  a  common  but
       nevertheless   serious  programming  error.   File  system
       implementations which use techniques  as  ``write-behind''
       to  increase  performance may lead to write(2) succeeding,
       although the data has not been  written  yet.   The  error
       status  may be reported at a later write operation, but it
       is guaranteed to be reported on  closing  the  file.   Not
       checking  the  return value when closing the file may lead
       to silent loss of data.  This can especially  be  observed
       with NFS and disk quotas.
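
For illustration only, a sketch of what "checking close" looks like in
Python, where a failing close() surfaces as an OSError rather than a
return value. The helper name write_checked is hypothetical:

```python
import os

def write_checked(path, data):
    """Return True only if write, fsync and close all succeeded.

    A failure to open the file is not caught here and raises OSError.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    ok = True
    try:
        view = memoryview(data)
        while view:                       # os.write may write only part
            view = view[os.write(fd, view):]
        os.fsync(fd)
    except OSError:
        ok = False
    finally:
        try:
            os.close(fd)                  # a failed close is a failed write
        except OSError:
            ok = False
    return ok
```

The point is only that the close error is folded into the result
instead of being dropped on the floor.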



* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 21:59               ` Ketil Froyn
@ 2002-07-15 23:08                 ` Matti Aarnio
  2002-07-16 12:33                   ` Matthias Andree
  0 siblings, 1 reply; 82+ messages in thread
From: Matti Aarnio @ 2002-07-15 23:08 UTC (permalink / raw)
  To: Ketil Froyn; +Cc: linux-kernel

On Mon, Jul 15, 2002 at 11:59:48PM +0200, Ketil Froyn wrote:
> On 15 Jul 2002, Patrick J. LoPresti wrote:
> > Without calling fsync(), you *never* know when the data will hit the
> > disk.
> 
> Doesn't bdflush ensure that data is written to disk within 30 seconds or 
> some tunable number of seconds?

  It TRIES TO, it does not guarantee anything.

  The MTA systems are an example of software suites which have
  transaction requirements.  The goal is usually stated as:
  must not fail to deliver.

  Practical implementations without full-blown all-encompassing
  transactions will usually mean that the message "will be delivered
  at least once", i.e. double-delivery can happen.

  One view of MTA behaviour is that the message moves from one
  substate to another during its processing.

  These days the transaction database for MTAs is usually the UNIX
  filesystem.  For ZMailer I have considered (although not actually
  done, yet) using SleepyCat DB files for the transaction subsystem.
  There are great challenges in failure compartmentalisation and
  integrity when using that kind of integrated database mechanism.
  Getting a SEGV is potentially a _very_ bad thing!

> Ketil

/Matti Aarnio


* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 21:31             ` Patrick J. LoPresti
  2002-07-15 22:12               ` Richard A Nelson
@ 2002-07-16  1:02               ` Lawrence Greenfield
       [not found]                 ` <mit.lcs.mail.linux-kernel/200207160102.g6G12BiH022986@lin2.andrew.cmu.edu>
  1 sibling, 1 reply; 82+ messages in thread
From: Lawrence Greenfield @ 2002-07-16  1:02 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel

   From: "Patrick J. LoPresti" <patl@curl.com>
   Date: 	15 Jul 2002 17:31:07 -0400
[...]
   I really wish MTA authors would just support Linux's "fsync the
   directory" approach.  It is simple, reliable, and fast.  Yes, it does
   require Linux-specific support in the application, but that's what
   application authors should expect when there is a gap in the
   standards.

Actually, it's not all that simple (you have to find the enclosing
directories of any files you're modifying, which might require string
manipulation) or necessarily all that fast (you're doubling the number
of system calls and now the application is imposing an ordering on the
filesystem that didn't exist before).

It's only necessary for ext2. Modern Linux filesystems (such as ext3
or reiserfs) don't require it.

Finally: ext2 isn't safe even if you do call fsync() on the directory!

Let's consider: some filesystem operation modifies two different
blocks. This operation is safe if block A is written before block
B. 

. FFS guarantees this by performing the writes synchronously: block A
is written when it is changed, followed by block B when it is changed.

. Journalling filesystems (ext3, reiserfs) guarantee this by
journalling the operation and forcing that journal entry to disk
before either A or B can be modified.

. What does ext2 do (in the default mode)? It modifies A, it modifies
B, and then leaves it up to the buffer cache to write them back---and
the buffer cache might decide to write B before A.

We're finally getting to some decent shared semantics on
filesystems. Reiserfs, ext3, FFS w/ softupdates, vxfs, etc., all work
with just fsync()ing the file (though an fsync() is required after a
link() or rename() operation). Let's encourage all filesystems to
provide these semantics and make it slightly easier on us stupid
application programmers.

Larry






* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
       [not found]                 ` <mit.lcs.mail.linux-kernel/200207160102.g6G12BiH022986@lin2.andrew.cmu.edu>
@ 2002-07-16  1:43                   ` Patrick J. LoPresti
  2002-07-16  1:56                     ` Thunder from the hill
                                       ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Patrick J. LoPresti @ 2002-07-16  1:43 UTC (permalink / raw)
  To: linux-kernel

Lawrence Greenfield <leg+@andrew.cmu.edu> writes:

> Actually, it's not all that simple (you have to find the enclosing
> directories of any files you're modifying, which might require string
> manipulation)

No, you have to find the directories you are modifying.  And the
application knows darn well which directories it is modifying.

Don't speculate.  Show some sample code, and let's see how hard it
would be to use the "Linux way".  I am betting on "not hard at all".

> or necessarily all that fast (you're doubling the number of system
> calls and now the application is imposing an ordering on the
> filesystem that didn't exist before).

No, you are not doubling the number of system calls.  As I have tried
to point out repeatedly, doing this stuff reliably and portably
already requires a sequence like this:

   write data
   flush data
   write "validity" indicator (e.g., rename() or fchmod())
   flush validity indicator

On Linux, flushing a rename() means calling fsync() on the directory
instead of the file.  That's it.  Doing that instead of fsync'ing the
file adds at most two system calls (to open and close the directory),
and those can be amortized over many operations on that directory
(think "mail spool").  So the system call overhead is non-existent.
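
A sketch of the fchmod() variant of that four-step sequence. Using the
owner-execute bit as the "validity" indicator is an assumed convention
for this example, and both function names are hypothetical:

```python
import os
import stat

def write_then_mark_valid(path, data):
    """Write a new file and set the validity bit only after the data is flushed."""
    # O_EXCL: this sketch assumes the file does not exist yet
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(data)
        while view:                       # write data
            view = view[os.write(fd, view):]
        os.fsync(fd)                      # flush data
        os.fchmod(fd, 0o700)              # write "validity" indicator
        os.fsync(fd)                      # flush validity indicator
    finally:
        os.close(fd)

def is_valid(path):
    """A reader treats files without the owner-execute bit as incomplete."""
    return bool(os.stat(path).st_mode & stat.S_IXUSR)
```

Here both flushes are fsync() on the file itself, because the indicator
lives in the file's own inode; with rename() as the indicator, the
second flush moves to the directory instead.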

As for "imposing an ordering on the filesystem that didn't exist
before", that is complete nonsense.  This is imposing *precisely* the
ordering required for reliable operation; no more, no less.  Relying
on mount options, "chattr +S", or journaling artifacts for your
ordering is the inefficient approach; since they impose extra
ordering, they can never be faster and will usually be slower.

> It's only necessary for ext2. Modern Linux filesystems (such as ext3
> or reiserfs) don't require it.

Only because they take the performance hit of flushing the whole log
to disk on every fsync().  Combine that with "data=ordered" and see
what happens to your performance.  (Perhaps "data=ordered" should be
called "fsync=sync".)  I would rather get back the performance and
convince application authors to understand what they are doing.

> Finally: ext2 isn't safe even if you do call fsync() on the directory!

Wrong.

   write temp file
   fsync() temp file
   rename() temp file to actual file
   fsync() directory

No matter where this crashes, it is perfectly safe on ext2.  (If not,
ext2 is badly broken.)  The worst that can happen after a crash is
that the file might exist with both the old name and the new name.
But an application can detect this case on startup and clean it up.
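
A minimal sketch of that four-step sequence in C, assuming a POSIX
system where a directory can be opened read-only and passed to fsync().
The helper name durable_replace() and the ".NAME.tmp" temporary naming
are hypothetical, chosen only for illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: durably replace dir/name with len bytes from buf.
 * Returns 0 on success, -1 on failure. */
int durable_replace(const char *dir, const char *name,
                    const void *buf, size_t len)
{
    char tmp[4096], dst[4096];
    const char *p = buf;
    int n, fd, dd, rc;

    n = snprintf(tmp, sizeof tmp, "%s/.%s.tmp", dir, name);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    n = snprintf(dst, sizeof dst, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof dst)
        return -1;

    /* 1. write temp file */
    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0)
            goto fail;
        p += w;
        len -= (size_t)w;
    }

    /* 2. fsync() temp file */
    if (fsync(fd) != 0)
        goto fail;
    close(fd);
    fd = -1;

    /* 3. rename() temp file to actual file */
    if (rename(tmp, dst) != 0)
        goto fail;

    /* 4. fsync() directory, so the rename itself reaches the disk */
    dd = open(dir, O_RDONLY);
    if (dd < 0)
        return -1;
    rc = fsync(dd);
    close(dd);
    return rc;

fail:
    if (fd >= 0)
        close(fd);
    unlink(tmp);
    return -1;
}
```

A temporary file left behind by a crash is the kind of leftover that
the startup cleanup mentioned above would look for.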

 - Pat

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-16  1:43                   ` Patrick J. LoPresti
@ 2002-07-16  1:56                     ` Thunder from the hill
  2002-07-16 12:47                     ` Matthias Andree
  2002-07-16 21:09                     ` James Antill
  2 siblings, 0 replies; 82+ messages in thread
From: Thunder from the hill @ 2002-07-16  1:56 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel

Hi,

On 15 Jul 2002, Patrick J. LoPresti wrote:
> Doing that instead of fsync'ing the
> file adds at most two system calls (to open and close the directory),

Keep the directory fd open all the time, and flush it when needed. This 
gets rid of the open(dir, dd); fsync(dd); close(dd);, you just have:
open(dir, dd); once, then fsync(dd); fsync(dd); ... and then one close(dd);

Not too much of an overhead, is it?
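
A minimal sketch of that idea in C, assuming a long-running process
that works in a single spool directory.  The struct and the spool_*()
names are hypothetical:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical spool handle: one directory descriptor, opened once and
 * reused for every flush. */
struct spool {
    int dd;
};

/* Open the spool directory once, at startup. */
int spool_open(struct spool *s, const char *dir)
{
    s->dd = open(dir, O_RDONLY);
    return s->dd < 0 ? -1 : 0;
}

/* Rename tmp to final (both paths inside the spool directory), then
 * flush the directory through the descriptor that is already open. */
int spool_commit(struct spool *s, const char *tmp, const char *final)
{
    if (rename(tmp, final) != 0)
        return -1;
    return fsync(s->dd);
}

/* Close the directory descriptor once, at shutdown. */
void spool_close(struct spool *s)
{
    close(s->dd);
    s->dd = -1;
}
```

Each commit then costs one rename() and one fsync(); the open() and
close() of the directory are paid once per process, not per message.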

							Regards,
							Thunder
-- 
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o?  K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y- 
------END GEEK CODE BLOCK------


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 15:22   ` [ANNOUNCE] Ext3 vs Reiserfs benchmarks Patrick J. LoPresti
                       ` (2 preceding siblings ...)
       [not found]     ` <20020715173337$acad@traf.lcs.mit.edu>
@ 2002-07-16  7:07     ` Dax Kelson
  3 siblings, 0 replies; 82+ messages in thread
From: Dax Kelson @ 2002-07-16  7:07 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel

On Mon, 2002-07-15 at 09:22, Patrick J. LoPresti wrote:

> One other thing.  I think this statement is misleading:
> 
>     IF your server is stable and not prone to crashing, and/or you
>     have the write cache on your hard drives battery backed, you
>     should strongly consider using the writeback journaling mode of
>     Ext3 versus ordered.

I rewrote that statement on the website.

Dax Kelson
Guru Labs


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 21:38               ` Thunder from the hill
@ 2002-07-16 12:31                 ` Matthias Andree
  2002-07-16 15:53                   ` Thunder from the hill
  0 siblings, 1 reply; 82+ messages in thread
From: Matthias Andree @ 2002-07-16 12:31 UTC (permalink / raw)
  To: linux-kernel

On Mon, 15 Jul 2002, Thunder from the hill wrote:

> Hi,
> 
> On 15 Jul 2002, Patrick J. LoPresti wrote:
> > Note that this means writing a truly reliable shell or Perl script is
> > tricky.  I suppose you can "use POSIX qw(fsync);" in Perl.  But what do
> > you do for a shell script?  /bin/sync :-) ?
> 
> Write a binary (/usr/bin/fsync) which opens a fd, fsync it, close it, be 
> done with it.

Or steal one from FreeBSD (written by Paul Saab), fix the err() function
and be done with it.

.../usr.bin/fsync/fsync.{1,c}

Interesting side note -- mind the O_RDONLY:

        for (i = 1; i < argc; ++i) {
                if ((fd = open(argv[i], O_RDONLY)) < 0)
                        err(1, "open %s", argv[i]);

                if (fsync(fd) != 0)
                        err(1, "fsync %s", argv[i]);
                close(fd);
        }
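
The O_RDONLY detail matters for directories: a directory cannot be
opened for writing, so this pattern only works where fsync() accepts a
descriptor opened read-only (assumed here, as on Linux and FreeBSD).  A
minimal sketch of the same open/fsync/close steps as a reusable
function; fsync_path() is a hypothetical name, not part of the FreeBSD
utility:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: fsync a file or a directory given its path.
 * Returns 0 on success, or the errno value of the call that failed. */
int fsync_path(const char *path)
{
    int fd, err;

    fd = open(path, O_RDONLY);      /* read-only is enough for fsync() */
    if (fd < 0)
        return errno;
    err = (fsync(fd) != 0) ? errno : 0;
    close(fd);
    return err;
}
```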

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 23:08                 ` Matti Aarnio
@ 2002-07-16 12:33                   ` Matthias Andree
  0 siblings, 0 replies; 82+ messages in thread
From: Matthias Andree @ 2002-07-16 12:33 UTC (permalink / raw)
  To: linux-kernel

On Tue, 16 Jul 2002, Matti Aarnio wrote:

>   These days, usually, the transaction database for MTAs is UNIX
>   filesystem.   For ZMailer I have considered (although not actually
>   done - yet) using SleepyCat DB files for the transaction subsystem.
>   There are great challenges in failure compartmentalisation, and
>   integrity, when using that kind of integrated database mechanisms.
>   Getting SEGV is potentially _very_ bad thing!

Read: lethal to the spool. Has SleepyCat DB learned to recover from
ENOSPC in the meanwhile? I had a db1.85 file corrupt after ENOSPC once...

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-15 21:14           ` Chris Mason
  2002-07-15 21:31             ` Patrick J. LoPresti
@ 2002-07-16 12:35             ` Matthias Andree
  1 sibling, 0 replies; 82+ messages in thread
From: Matthias Andree @ 2002-07-16 12:35 UTC (permalink / raw)
  To: linux-kernel

On Mon, 15 Jul 2002, Chris Mason wrote:

> On Mon, 2002-07-15 at 15:13, Patrick J. LoPresti wrote:
> 
> > > 1) that newly grown file is someone's inbox, and the old contents of the
> > > new block include someone else's private message.
> > >
> > > 2) That newly grown file is a control file for the application, and the
> > > application expects it to contain valid data within (think sendmail).  
> > 
> > In a correctly-written application, neither of these things can
> > happen.  (See my earlier message today on fsync() and MTAs.)  To get a
> > file onto disk reliably, the application must 1) flush the data, and
> > then 2) flush a "validity" indicator.  This could be a sequence like:
> > 
> >   create temp file
> >   flush data to temp file
> >   rename temp file
> >   flush rename operation
> 
> Yes, most mtas do this for queue files, I'm not sure how many do it for
> the actual spool file.  mail server authors are more than welcome to

Less. For one, Postfix' local(8) daemon relies on synchronous directory
update for Maildir spools. For mbox spool, the problem is less
prevalent, because spool files usually exist already and fsync() is
sufficient (and fsync() is done before local(8) reports success to the
queue manager).

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-16  1:43                   ` Patrick J. LoPresti
  2002-07-16  1:56                     ` Thunder from the hill
@ 2002-07-16 12:47                     ` Matthias Andree
  2002-07-16 21:09                     ` James Antill
  2 siblings, 0 replies; 82+ messages in thread
From: Matthias Andree @ 2002-07-16 12:47 UTC (permalink / raw)
  To: linux-kernel

On Mon, 15 Jul 2002, Patrick J. LoPresti wrote:

> On Linux, flushing a rename() means calling fsync() on the directory
> instead of the file.  That's it.  Doing that instead of fsync'ing the
> file adds at most two system calls (to open and close the directory),
> and those can be amortized over many operations on that directory
> (think "mail spool").  So the system call overhead is non-existent.

Indeed, but I can also leave the file descriptor open on any file system
on any system except SOME of Linux'. (Ok, this precludes systems that
don't offer POSIX synchronous completion semantics, but these systems
don't necessarily have fsync() either).

> ordering required for reliable operation; no more, no less.  Relying
> on mount options, "chattr +S", or journaling artifacts for your
> ordering is the inefficient approach; since they impose extra
> ordering, they can never be faster and will usually be slower.

It is sometimes the only way, if the application is unaware. I hope I'm
not unleashing a flame war if I mention qmail now, which is not even
softupdates aware. Without chattr +S or mount -o sync, nothing is to be
gained. OTOH, where mount -o sync only makes directory updates
synchronous, it's not too expensive, which is why the +D approach is
still useful there.

> > It's only necessary for ext2. Modern Linux filesystems (such as ext3
> > or reiserfs) don't require it.
> 
> Only because they take the performance hit of flushing the whole log
> to disk on every fsync().  Combine that with "data=ordered" and see
> what happens to your performance.  (Perhaps "data=ordered" should be
> called "fsync=sync".)  I would rather get back the performance and
> convince application authors to understand what they are doing.

1. data=ordered is more than fsync=sync. It guarantees that data blocks
are flushed before flushing the meta data blocks that reference the data
blocks. Try this on ext2fs and lose.

2. sync() is unreliable, it can return control to the caller earlier
than what is sound. It can "complete" at any time it desires without
having completed.
(Probably so that it can return at all while other processes keep
writing new blocks, but at least SUS v2 did not go into detail on this.)

3. Application authors do not desire fsync=sync semantics, but they want
to rely on "fsync(fd) also syncs recent renames". It comes as a
now-guaranteed side effect of how ext3fs works, so I am told.

I'm not sure how the ext3fs journal works internally, but it'd be fine with
all applications if only that part of a file system be synched that is
really relevant to the current fsync(fd). No more. It seems as though
fsync==sync is an artifact that ext2 also suffers from.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-16 12:31                 ` Matthias Andree
@ 2002-07-16 15:53                   ` Thunder from the hill
  2002-07-16 19:26                     ` Matthias Andree
  0 siblings, 1 reply; 82+ messages in thread
From: Thunder from the hill @ 2002-07-16 15:53 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

Hi,

On Tue, 16 Jul 2002, Matthias Andree wrote:
> > Write a binary (/usr/bin/fsync) which opens a fd, fsync it, close it, be 
> > done with it.
> 
> Or steal one from FreeBSD (written by Paul Saab), fix the err() function
> and be done with it.
> 
> .../usr.bin/fsync/fsync.{1,c}
> 
> Interesting side note -- mind the O_RDONLY:
> 
>         for (i = 1; i < argc; ++i) {
>                 if ((fd = open(argv[i], O_RDONLY)) < 0)
>                         err(1, "open %s", argv[i]);
> 
>                 if (fsync(fd) != 0)
>                         err(1, "fsync %s", argv[i]);
>                 close(fd);
>         }

Pretty much the thing I had in mind, except that the close return code is 
disregarded here...

							Regards,
							Thunder
-- 
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o?  K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y- 
------END GEEK CODE BLOCK------


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-16 15:53                   ` Thunder from the hill
@ 2002-07-16 19:26                     ` Matthias Andree
  2002-07-16 19:38                       ` Thunder from the hill
  0 siblings, 1 reply; 82+ messages in thread
From: Matthias Andree @ 2002-07-16 19:26 UTC (permalink / raw)
  To: linux-kernel

On Tue, 16 Jul 2002, Thunder from the hill wrote:

> >                 if (fsync(fd) != 0)
> >                         err(1, "fsync %s", argv[i]);
> >                 close(fd);
> >         }
> 
> Pretty much the thing I had in mind, except that the close return code is 
> disregarded here...

Indeed, but OTOH, what error is close to report when the file is opened
read-only?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-16 19:26                     ` Matthias Andree
@ 2002-07-16 19:38                       ` Thunder from the hill
  2002-07-16 23:22                         ` close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks) Zack Weinberg
  0 siblings, 1 reply; 82+ messages in thread
From: Thunder from the hill @ 2002-07-16 19:38 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

Hi,

On Tue, 16 Jul 2002, Matthias Andree wrote:
> Indeed, but OTOH, what error is close to report when the file is opened
> read-only?

Well, you can still get EIO, EINTR, EBADF. Whatever you say, disregarding 
the close return code is never any good.

							Regards,
							Thunder
-- 
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o?  K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y- 
------END GEEK CODE BLOCK------


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-16  1:43                   ` Patrick J. LoPresti
  2002-07-16  1:56                     ` Thunder from the hill
  2002-07-16 12:47                     ` Matthias Andree
@ 2002-07-16 21:09                     ` James Antill
  2 siblings, 0 replies; 82+ messages in thread
From: James Antill @ 2002-07-16 21:09 UTC (permalink / raw)
  To: Lawrence Greenfield, Patrick J. LoPresti; +Cc: linux-kernel

"Patrick J. LoPresti" <patl@curl.com> writes:

> Lawrence Greenfield <leg+@andrew.cmu.edu> writes:
> 
> > Actually, it's not all that simple (you have to find the enclosing
> > directories of any files you're modifying, which might require string
> > manipulation)
> 
> No, you have to find the directories you are modifying.  And the
> application knows darn well which directories it is modifying.
> 
> Don't speculate.  Show some sample code, and let's see how hard it
> would be to use the "Linux way".  I am betting on "not hard at all".

 I added fsync() on directories to exim-3.31, it took about 2hrs
coding and another hour testing it (with strace) to make sure it was
doing the right thing. That was from almost never seeing the source
before.
 The only reason it took that long was because that version of exim
altered the spool in a couple of different places. Forward porting to
3.951 took about 20 minutes IIRC (that version only plays with the
spool in one place).
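
For an application that only has the file's path at hand, finding the
enclosing directory need not involve hand-written string manipulation.
A minimal sketch, assuming POSIX dirname() from <libgen.h>;
fsync_parent_dir() is a hypothetical name and is not taken from the
exim change described above:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <libgen.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fsync the directory that contains `path`.
 * Returns 0 on success, -1 on failure. */
int fsync_parent_dir(const char *path)
{
    char *copy;
    int dd, rc;

    /* dirname() may modify its argument, so work on a private copy. */
    copy = strdup(path);
    if (copy == NULL)
        return -1;
    dd = open(dirname(copy), O_RDONLY);
    free(copy);
    if (dd < 0)
        return -1;

    rc = fsync(dd);
    close(dd);
    return rc;
}
```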

-- 
# James Antill -- james@and.org
:0:
* ^From: .*james@and\.org
/dev/null

^ permalink raw reply	[flat|nested] 82+ messages in thread

* close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-16 19:38                       ` Thunder from the hill
@ 2002-07-16 23:22                         ` Zack Weinberg
  2002-07-17  1:03                           ` Alan Cox
  0 siblings, 1 reply; 82+ messages in thread
From: Zack Weinberg @ 2002-07-16 23:22 UTC (permalink / raw)
  To: linux-kernel

Thunder wrote:
> On Tue, 16 Jul 2002, Matthias Andree wrote:
> > Indeed, but OTOH, what error is close to report when the file is
> > opened read-only?
>
> Well, you can still get EIO, EINTR, EBADF. Whatever you say,
> disregarding the close return code is never any good.

Making use of the close return value is also never any good.

Consider: There is no guarantee that close will detect errors.  Only
NFS and Coda implement f_op->flush methods.  For files on all other
file systems, sys_close will always return success (assuming the file
descriptor was open in the first place); the data may still be sitting
in the page cache.  If you need the data pushed to the physical disk,
you have to call fsync.

Consider: If you have called fsync, and it returned successfully, an
immediate call to close is guaranteed to return successfully.  (Any
hypothetical f_op->flush method would have nothing to do; if not, that
filesystem does not correctly implement fsync.)

Therefore, I would argue that it is wrong for any application ever to
inspect close's return value.  Either the program does not need data
integrity guarantees, or it should be using fsync and paying attention
to that instead.
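
A minimal sketch of the policy argued for above, assuming an mbox-style
append to a file.  The name append_durably() is hypothetical.  It only
illustrates the argument; it does not settle whether close() can report
an error that fsync() missed.  For a newly created file the directory
entry would still need its own fsync().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: append len bytes to path and make them durable.
 * Errors are taken from open(), write() and fsync() only; close() is
 * called exactly once and its return value is deliberately ignored. */
int append_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd, rc = 0, saved;

    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            rc = -1;
            break;
        }
        p += w;
        len -= (size_t)w;
    }
    if (rc == 0 && fsync(fd) != 0)
        rc = -1;

    saved = errno;          /* keep the errno of the call that failed */
    (void)close(fd);
    errno = saved;
    return rc;
}
```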

There's also an ugly semantic bind if you make close detect errors.
If close returns an error other than EBADF, has that file descriptor
been closed?  The standards do not specify.  If it has not been
closed, you have a descriptor leak.  But if it has been closed, it is
too late to recover from the error.  [As far as I know, Unix
implementations generally do close the descriptor.]

The manpage that was quoted earlier in this thread is incorrect in
claiming that errors will be detected by close; it should be fixed.

zw

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-17  1:03                           ` Alan Cox
@ 2002-07-16 23:52                             ` David S. Miller
  2002-07-17  1:35                               ` Alan Cox
  2002-07-17  0:10                             ` close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks) Zack Weinberg
  2002-07-17  2:22                             ` Elladan
  2 siblings, 1 reply; 82+ messages in thread
From: David S. Miller @ 2002-07-16 23:52 UTC (permalink / raw)
  To: alan; +Cc: zack, linux-kernel

   From: Alan Cox <alan@lxorguk.ukuu.org.uk>
   Date: 17 Jul 2002 02:03:02 +0100
   
   close() checking is not about physical disk guarantees. It's about more
   basic "I/O completed". In some future Linux only close() might tell you
   about some kinds of I/O error. The fact it doesn't do it now is no
   excuse for sloppy programming

Practice dictates that if you make close() return error values
your whole system will blow up.  Try it out for yourself.
I can tell you of at least 1 app that is going to explode :-)

I believe Linus mentioned way back when that this is a "shall not"
when we had similar problems with NFS returning errors from close().

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  1:03                           ` Alan Cox
  2002-07-16 23:52                             ` close return value David S. Miller
@ 2002-07-17  0:10                             ` Zack Weinberg
  2002-07-17  1:45                               ` Alan Cox
  2002-07-17  8:00                               ` Lars Marowsky-Bree
  2002-07-17  2:22                             ` Elladan
  2 siblings, 2 replies; 82+ messages in thread
From: Zack Weinberg @ 2002-07-17  0:10 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

On Wed, Jul 17, 2002 at 02:03:02AM +0100, Alan Cox wrote:
> On Wed, 2002-07-17 at 00:22, Zack Weinberg wrote:
> > Making use of the close return value is also never any good.
> 
> This is untrue

I beg to differ.

> > Consider: There is no guarantee that close will detect errors.  Only
> > NFS and Coda implement f_op->flush methods.  For files on all other
> > file systems, sys_close will always return success (assuming the file
> > descriptor was open in the first place); the data may still be sitting
> > in the page cache.  If you need the data pushed to the physical disk,
> > you have to call fsync.
> 
> close() checking is not about physical disk guarantees. It's about more
> basic "I/O completed". In some future Linux only close() might tell you
> about some kinds of I/O error.

I think we're talking past each other.

My first point is that a portable application cannot rely on close to
detect any error.  Only fsync guarantees to detect any errors at all
(except ENOSPC/EDQUOT, which should come back on write; yes, I know
about the buggy NFS implementations that report them only on close).

My second point, which you deleted, is that if some hypothetical close
implementation reports an error under some circumstances, an
immediately preceding fsync call MUST also report the same error under
the same circumstances.

Therefore, if you've checked the return value of fsync, there's no
point in checking the subsequent close; and if you don't care to call
fsync, the close return value is useless since it isn't guaranteed to
detect anything.

> > There's also an ugly semantic bind if you make close detect errors.
> > If close returns an error other than EBADF, has that file descriptor
> > been closed?  The standards do not specify.  If it has not been
> > closed, you have a descriptor leak.  But if it has been closed, it is
> > too late to recover from the error.  [As far as I know, Unix
> > implementations generally do close the descriptor.]
> 
> If it bothers you close it again 8)

And watch it come back with an error again, repeat ad infinitum?

> > The manpage that was quoted earlier in this thread is incorrect in
> > claiming that errors will be detected by close; it should be fixed.
> 
> The man page matches the standard. Implementation may be a subset of the
> allowed standard right now, but don't program to implementation
> assumptions, it leads to nasty accidents

You missed the point.  The manpage asserts that I/O errors are
guaranteed to be detected by close; there is no such guarantee.

zw

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-17  1:35                               ` Alan Cox
@ 2002-07-17  0:20                                 ` David S. Miller
  2002-07-17  1:05                                   ` Linus Torvalds
       [not found]                                   ` <mailman.1026868201.10433.linux-kernel2news@redhat.com>
  0 siblings, 2 replies; 82+ messages in thread
From: David S. Miller @ 2002-07-17  0:20 UTC (permalink / raw)
  To: alan; +Cc: zack, linux-kernel

   From: Alan Cox <alan@lxorguk.ukuu.org.uk>
   Date: 17 Jul 2002 02:35:41 +0100

   Our NFS can return errors from close().

Better tell Linus.

   So I'd get fixing the applications.

I wish you luck, it is quite a daunting task and nothing I would
sanely sign up for :-)


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-16 23:22                         ` close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks) Zack Weinberg
@ 2002-07-17  1:03                           ` Alan Cox
  2002-07-16 23:52                             ` close return value David S. Miller
                                               ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Alan Cox @ 2002-07-17  1:03 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: linux-kernel

On Wed, 2002-07-17 at 00:22, Zack Weinberg wrote:
> Making use of the close return value is also never any good.

This is untrue

> Consider: There is no guarantee that close will detect errors.  Only
> NFS and Coda implement f_op->flush methods.  For files on all other
> file systems, sys_close will always return success (assuming the file
> descriptor was open in the first place); the data may still be sitting
> in the page cache.  If you need the data pushed to the physical disk,
> you have to call fsync.

close() checking is not about physical disk guarantees. It's about more
basic "I/O completed". In some future Linux only close() might tell you
about some kinds of I/O error. The fact it doesn't do it now is no
excuse for sloppy programming

> There's also an ugly semantic bind if you make close detect errors.
> If close returns an error other than EBADF, has that file descriptor
> been closed?  The standards do not specify.  If it has not been
> closed, you have a descriptor leak.  But if it has been closed, it is
> too late to recover from the error.  [As far as I know, Unix
> implementations generally do close the descriptor.]

If it bothers you close it again 8)

> The manpage that was quoted earlier in this thread is incorrect in
> claiming that errors will be detected by close; it should be fixed.

The man page matches the standard. Implementation may be a subset of the
allowed standard right now, but don't program to implementation
assumptions, it leads to nasty accidents


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-17  0:20                                 ` David S. Miller
@ 2002-07-17  1:05                                   ` Linus Torvalds
  2002-07-17  1:05                                     ` David S. Miller
       [not found]                                   ` <mailman.1026868201.10433.linux-kernel2news@redhat.com>
  1 sibling, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2002-07-17  1:05 UTC (permalink / raw)
  To: linux-kernel

In article <20020716.172026.55847426.davem@redhat.com>,
David S. Miller <davem@redhat.com> wrote:
>   From: Alan Cox <alan@lxorguk.ukuu.org.uk>
>   Date: 17 Jul 2002 02:35:41 +0100
>
>   Our NFS can return errors from close().
>
>Better tell Linus.

Oh, Linus knows.  In fact, Linus wrote some of the code in question. 

But the thing is, Linus doesn't want to have people have the same issues
with local filesystems.  I _know_ there are broken applications that do
not test the error return from close(), and I think it is a politeness
issue to return error codes that you can know about as soon as humanly
possible. 

For NFS, you simply cannot do any reasonable performance without doing
deferred error reporting.  The same isn't true of other filesystems. 
Even in the presence of delayed block allocation, a local filesystem can
_reserve_ the blocks early, and has no excuse for giving errors late
(except, of course, for actual IO errors). 

			Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-17  1:05                                   ` Linus Torvalds
@ 2002-07-17  1:05                                     ` David S. Miller
  2002-07-17  1:23                                       ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: David S. Miller @ 2002-07-17  1:05 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel

   From: torvalds@transmeta.com (Linus Torvalds)
   Date: Wed, 17 Jul 2002 01:05:00 +0000 (UTC)

   In article <20020716.172026.55847426.davem@redhat.com>,
   David S. Miller <davem@redhat.com> wrote:
   >Better tell Linus.
   
   Oh, Linus knows.  In fact, Linus wrote some of the code in question. 

Ok, I think the issue here is different.

Several years ago we were returning -EAGAIN from close() via NFS and
that is what caused the problems.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-17  1:05                                     ` David S. Miller
@ 2002-07-17  1:23                                       ` Linus Torvalds
  2002-07-17 11:51                                         ` Matthias Andree
  2002-07-20  8:00                                         ` Florian Weimer
  0 siblings, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2002-07-17  1:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel



On Tue, 16 Jul 2002, David S. Miller wrote:
>
>    Oh, Linus knows.  In fact, Linus wrote some of the code in question.
>
> Ok, I think the issue here is different.
>
> Several years ago we were returning -EAGAIN from close() via NFS and
> that is what caused the problems.

Oh.

Yes, EAGAIN doesn't really work as a close return value, simply because
_nobody_ expects that (and leaving the file descriptor open after a
close() is definitely unexpected, ie people can very validly complain
about buggy behaviour).

		Linus


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-16 23:52                             ` close return value David S. Miller
@ 2002-07-17  1:35                               ` Alan Cox
  2002-07-17  0:20                                 ` David S. Miller
  0 siblings, 1 reply; 82+ messages in thread
From: Alan Cox @ 2002-07-17  1:35 UTC (permalink / raw)
  To: David S. Miller; +Cc: zack, linux-kernel

On Wed, 2002-07-17 at 00:52, David S. Miller wrote:
>    From: Alan Cox <alan@lxorguk.ukuu.org.uk>
>    Date: 17 Jul 2002 02:03:02 +0100
>    
>    close() checking is not about physical disk guarantees. It's about more
>    basic "I/O completed". In some future Linux only close() might tell you
>    about some kinds of I/O error. The fact it doesn't do it now is no
>    excuse for sloppy programming
> 
> Practice dictates that if you make close() return error values
> your whole system will blow up.  Try it out for yourself.
> I can tell you of at least 1 app that is going to explode :-)
> 
> I believe Linus mentioned way back when that this is a "shall not"
> when we had similar problems with NFS returning errors from close().

Our NFS can return errors from close(). So I'd get fixing the
applications.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  0:10                             ` close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks) Zack Weinberg
@ 2002-07-17  1:45                               ` Alan Cox
  2002-07-17 18:24                                 ` Zack Weinberg
  2002-07-22 16:42                                 ` Rogier Wolff
  2002-07-17  8:00                               ` Lars Marowsky-Bree
  1 sibling, 2 replies; 82+ messages in thread
From: Alan Cox @ 2002-07-17  1:45 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: linux-kernel

On Wed, 2002-07-17 at 01:10, Zack Weinberg wrote:
> My first point is that a portable application cannot rely on close to
> detect any error.  Only fsync guarantees to detect any errors at all
> (except ENOSPC/EDQUOT, which should come back on write; yes, I know
> about the buggy NFS implementations that report them only on close).

They are not buggy, merely inconvenient. The reality of the NFS protocol
makes it the only viable way to do it

> My second point, which you deleted, is that if some hypothetical close
> implementation reports an error under some circumstances, an
> immediately preceding fsync call MUST also report the same error under
> the same circumstances.

I can't think of a case where I'd disagree

> Therefore, if you've checked the return value of fsync, there's no
> point in checking the subsequent close; and if you don't care to call
> fsync, the close return value is useless since it isn't guaranteed to
> detect anything.

If you don't check the return code it might not detect anything. If you
do check the return code it might detect something. In fact you
contradict yourself IMHO by giving the NFS example.

> > If it bothers you close it again 8)
> 
> And watch it come back with an error again, repeat ad infinitum?

The use of intelligence doesn't help. Come on I know you aren't a cobol
programmer. Check for -EBADF ...

> You missed the point.  The manpage asserts that I/O errors are
> guaranteed to be detected by close; there is no such guarantee.

Disagree. It says

It is quite possible that errors on a  previous  write(2)  operation 
are first  reported  at  the  final  close

Not checking the return value when closing the file may lead to silent
loss of  data.

       A successful close does not guarantee that  the  data  has
       been  successfully  saved  to  disk,  as the kernel defers
       writes. It is not common for a  filesystem  to  flush  the
       buffers  when the stream is closed. If you need to be sure
       that the data is physically stored use fsync(2).  (It will
       depend on the disk hardware at this point.)

None of which guarantee what you say, and which agree about the use of
fsync being appropriate now and then


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  1:03                           ` Alan Cox
  2002-07-16 23:52                             ` close return value David S. Miller
  2002-07-17  0:10                             ` close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks) Zack Weinberg
@ 2002-07-17  2:22                             ` Elladan
  2002-07-17  2:54                               ` Thunder from the hill
                                                 ` (2 more replies)
  2 siblings, 3 replies; 82+ messages in thread
From: Elladan @ 2002-07-17  2:22 UTC (permalink / raw)
  To: Alan Cox; +Cc: Zack Weinberg, linux-kernel

On Wed, Jul 17, 2002 at 02:03:02AM +0100, Alan Cox wrote:
> On Wed, 2002-07-17 at 00:22, Zack Weinberg wrote:
> 
> > There's also an ugly semantic bind if you make close detect errors.
> > If close returns an error other than EBADF, has that file descriptor
> > been closed?  The standards do not specify.  If it has not been
> > closed, you have a descriptor leak.  But if it has been closed, it is
> > too late to recover from the error.  [As far as I know, Unix
> > implementations generally do close the descriptor.]
> 
> If it bothers you close it again 8)

Consider:

Two threads share the file descriptor table.  

  1. Thread 1 performs close() on a file descriptor.  close fails.
  2. Thread 2 performs open().
* 3. Thread 1 performs close() again, just to make sure.


open() may return any file descriptor not currently in use.

Is step 3 necessary?  Is it dangerous?  The question is, is close
guaranteed to work, or isn't it?


Case 1: Close is guaranteed to close the file.

Thread 2 may have just re-used the file descriptor.  Thus, Thread 1
closes a different file in step 3.  Thread 2 is now using a bad file
descriptor, and becomes very angry because the kernel just said all was
right with the world, and then claims there was a mistake.  Thread 2
leaves in a huff.


Case 2: Close is guaranteed to leave the file open on error.

Thread 2 can't have just re-used the descriptor, so the world is ok in
that sense.  However, Thread 1 *must* perform step 3, or it leaks a
descriptor, the tables fill, and the world becomes a frozen wasteland.


Case 3: Close may or may not leave it open due to random chance or
filesystem peculiarities.

Thread 1 may be required to close it twice, or it may be required not to
close it twice.  It doesn't know!  Night is falling!  The world is in
flames!  Aaaaaaugh!


I believe this demonstrates the need for a standard, one way, or the
other.  :-)
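
A small C sketch of Case 1, collapsed into one thread so that it is
deterministic. It assumes the first close() releases the descriptor, and
it relies on open() handing out the lowest free descriptor number. The
function name and the use of /dev/null are illustrative assumptions.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if a second close() on a stale descriptor number closed a
 * file opened in between, 0 if it did not, -1 if setup failed.
 */
int stale_close_hits_new_file(void)
{
    int old_fd = open("/dev/null", O_RDONLY);   /* "thread 1" descriptor */
    if (old_fd == -1)
        return -1;
    close(old_fd);                  /* step 1: descriptor is released */

    int new_fd = open("/dev/null", O_RDONLY);   /* step 2: "thread 2" opens */
    if (new_fd == -1)
        return -1;
    if (new_fd != old_fd) {         /* number was not reused */
        close(new_fd);
        return 0;
    }

    close(old_fd);                  /* step 3: closing "just to make sure" */

    /* The owner of new_fd never closed it, yet it is no longer valid. */
    if (fcntl(new_fd, F_GETFD) == -1 && errno == EBADF)
        return 1;

    close(new_fd);
    return 0;
}
```

With two real threads the same thing happens only when the open() lands
between the two close() calls, which is what makes it hard to track down.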

-J

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  2:22                             ` Elladan
@ 2002-07-17  2:54                               ` Thunder from the hill
  2002-07-17  3:00                                 ` Elladan
  2002-07-17  4:17                               ` Stevie O
  2002-07-17  7:34                               ` close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks Kai Henningsen
  2 siblings, 1 reply; 82+ messages in thread
From: Thunder from the hill @ 2002-07-17  2:54 UTC (permalink / raw)
  To: Elladan; +Cc: Alan Cox, Zack Weinberg, linux-kernel

Hi,

On Tue, 16 Jul 2002, Elladan wrote:
> Two threads share the file descriptor table.  
> 
>   1. Thread 1 performs close() on a file descriptor.  close fails.
>   2. Thread 2 performs open().
> * 3. Thread 1 performs close() again, just to make sure.

Thread 2 shouldn't be able to reuse a currently open fd. This application 
design is seriously broken.

							Regards,
							Thunder
-- 
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o?  K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y- 
------END GEEK CODE BLOCK------


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  2:54                               ` Thunder from the hill
@ 2002-07-17  3:00                                 ` Elladan
  2002-07-17  3:10                                   ` Thunder from the hill
  0 siblings, 1 reply; 82+ messages in thread
From: Elladan @ 2002-07-17  3:00 UTC (permalink / raw)
  To: Thunder from the hill; +Cc: Elladan, Alan Cox, Zack Weinberg, linux-kernel

On Tue, Jul 16, 2002 at 08:54:54PM -0600, Thunder from the hill wrote:
> Hi,
> 
> On Tue, 16 Jul 2002, Elladan wrote:
> > Two threads share the file descriptor table.  
> > 
> >   1. Thread 1 performs close() on a file descriptor.  close fails.
> >   2. Thread 2 performs open().
> > * 3. Thread 1 performs close() again, just to make sure.
> 
> Thread 2 shouldn't be able to reuse a currently open fd. This application 
> design is seriously broken.

No.

Thread 2 doesn't manage the file descriptor table, the kernel does.
Whether the kernel may re-use the descriptor or not depends on whether
the descriptor is closed or not.  The kernel knows, but unless close()
behaves in a defined way, the application does not at this point.  Thus,
step 3 may either be required, forbidden, or undefined.

-J

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  3:00                                 ` Elladan
@ 2002-07-17  3:10                                   ` Thunder from the hill
  2002-07-17  3:31                                     ` Elladan
  0 siblings, 1 reply; 82+ messages in thread
From: Thunder from the hill @ 2002-07-17  3:10 UTC (permalink / raw)
  To: Elladan; +Cc: Thunder from the hill, Alan Cox, Zack Weinberg, linux-kernel

Hi,

On Tue, 16 Jul 2002, Elladan wrote:
> > Thread 2 shouldn't be able to reuse a currently open fd. This application 
> > design is seriously broken.

Okay, again. It's about doing a second close() in case the first one fails 
with EAGAIN. If we have to do it again, the filehandle is not closed, and 
if the filehandle is not closed, the kernel knows that, and if the kernel 
knows that the filehandle is still open, it won't get reassigned. Problem 
gone.

							Regards,
							Thunder
-- 
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o?  K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y- 
------END GEEK CODE BLOCK------


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  3:10                                   ` Thunder from the hill
@ 2002-07-17  3:31                                     ` Elladan
  0 siblings, 0 replies; 82+ messages in thread
From: Elladan @ 2002-07-17  3:31 UTC (permalink / raw)
  To: Thunder from the hill; +Cc: Elladan, Alan Cox, Zack Weinberg, linux-kernel

On Tue, Jul 16, 2002 at 09:10:49PM -0600, Thunder from the hill wrote:
> Hi,
> 
> On Tue, 16 Jul 2002, Elladan wrote:
> > > Thread 2 shouldn't be able to reuse a currently open fd. This application 
> > > design is seriously broken.
> 
> Okay, again. It's about doing a second close() in case the first one fails 
> with EAGAIN. If we have to do it again, the filehandle is not closed, and 
> if the filehandle is not closed, the kernel knows that, and if the kernel 
> knows that the filehandle is still open, it won't get reassigned. Problem 
> gone.

This is case 2, "Close is guaranteed to leave the file open on error."

In this case, all applications are required to reissue close commands
upon certain errors, or leak a file descriptor.  This would be a well
defined behavior, though perhaps error prone.

However, note that this is manifestly different from case 1, "Close is
guaranteed to close the file the first time."  If the system behaves via
case 1, closing the handle again is broken as the example illustrated.

The worst, of course, would be undefined behavior for close.  In this
case, the application effectively can't do the right thing without
extreme measures.

-J

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  2:22                             ` Elladan
  2002-07-17  2:54                               ` Thunder from the hill
@ 2002-07-17  4:17                               ` Stevie O
  2002-07-17  4:38                                 ` Elladan
  2002-07-17  7:34                               ` close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks Kai Henningsen
  2 siblings, 1 reply; 82+ messages in thread
From: Stevie O @ 2002-07-17  4:17 UTC (permalink / raw)
  To: Elladan, Alan Cox; +Cc: Zack Weinberg, linux-kernel

At 07:22 PM 7/16/2002 -0700, Elladan wrote:
>  1. Thread 1 performs close() on a file descriptor.  close fails.
>  2. Thread 2 performs open().
>* 3. Thread 1 performs close() again, just to make sure.
>
>
>open() may return any file descriptor not currently in use.

I'm confused here... the only way close() can fail is if the file
descriptor is invalid (EBADF); wouldn't it be rather stupid to close()
a known-to-be-bad descriptor?


--
Stevie-O

Real programmers use COPY CON PROGRAM.EXE


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  4:17                               ` Stevie O
@ 2002-07-17  4:38                                 ` Elladan
  2002-07-17 14:39                                   ` Andreas Schwab
  2002-07-17 17:17                                   ` Andries Brouwer
  0 siblings, 2 replies; 82+ messages in thread
From: Elladan @ 2002-07-17  4:38 UTC (permalink / raw)
  To: Stevie O; +Cc: Elladan, Alan Cox, Zack Weinberg, linux-kernel

On Wed, Jul 17, 2002 at 12:17:40AM -0400, Stevie O wrote:
> At 07:22 PM 7/16/2002 -0700, Elladan wrote:
> >  1. Thread 1 performs close() on a file descriptor.  close fails.
> >  2. Thread 2 performs open().
> >* 3. Thread 1 performs close() again, just to make sure.
> >
> >
> >open() may return any file descriptor not currently in use.
> 
> I'm confused here... the only way close() can fail is if the file
> descriptor is invalid (EBADF); wouldn't it be rather stupid to close()
> a known-to-be-bad descriptor?

Well, obviously, if that's the case.  However, the man page for close(2)
doesn't agree (see below).  close() is allowed to return EBADF, EINTR,
or EIO.

The question is, does the OS standard guarantee that the fd is closed,
even if close() returns EINTR or EIO?  Just going by the normal usage of
EINTR, one might think otherwise.  It doesn't appear to be documented
one way or another.

Alan said you could just issue close again to make sure - the example
shows that this is not the case.  A second close is either required or
forbidden in that example - and the behavior has to be well defined or
you won't know which to do.
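
If the system did promise Case 1, the application side could be as small
as the following C sketch. The assumption that the descriptor is gone
whatever close() returns is exactly the guarantee that the manpage below
does not state; close_once is a hypothetical name.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Close fd exactly once. Returns 0 on success, otherwise the errno value
 * reported by close(). Assumes (this is NOT documented in the manpage)
 * that the descriptor is released even when close() fails, so a retry
 * is never attempted.
 */
int close_once(int fd)
{
    if (close(fd) == 0)
        return 0;
    return errno;
}
```

Under Case 2 the same wrapper would leak a descriptor on EINTR or EIO,
which is why the behavior needs to be defined.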

-J

NAME
       close - close a file descriptor

SYNOPSIS
       #include <unistd.h>

       int close(int fd);

DESCRIPTION
       close closes a file descriptor, so that it no longer refers
       to any file and may be reused. Any locks held on the file it
       was associated with, and owned by the process, are removed
       (regardless of the file descriptor that was used to obtain the
       lock).

       If fd is the last copy of a particular file descriptor the
       resources associated with it are freed; if the descriptor was the
       last reference to a file which has been removed using unlink(2)
       the file is deleted.

RETURN VALUE
       close returns zero on success, or -1 if an error occurred.

ERRORS
       EBADF  fd isn't a valid open file descriptor.

       EINTR  The close() call was interrupted by a signal.

       EIO    An I/O error occurred.

CONFORMING TO
       SVr4,  SVID,  POSIX,  X/OPEN,  BSD 4.3.  SVr4 documents an
       additional ENOLINK error condition.

NOTES
       Not checking the return value of close is a common but
       nevertheless serious programming error.  File system
       implementations which use techniques as `write-behind' to
       increase performance may lead to write(2) succeeding, although
       the data has not been written yet.  The error status may be
       reported at a later write operation, but it is guaranteed to be
       reported on closing the file.  Not checking the return value when
       closing the file may lead to silent loss of data.  This can
       especially be observed with NFS and disk quotas.

       A successful close does not guarantee that the data has
       been successfully saved to  disk, as the kernel defers
       writes.  It is not common for a filesystem to flush the
       buffers when the stream is closed. If you need to be sure
       that the data is physically stored use fsync(2) or
       sync(2), they will get you closer to that goal (it will
       depend on the disk hardware at this point).

SEE ALSO
       open(2), fcntl(2), shutdown(2), unlink(2), fclose(3)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
  2002-07-17  2:22                             ` Elladan
  2002-07-17  2:54                               ` Thunder from the hill
  2002-07-17  4:17                               ` Stevie O
@ 2002-07-17  7:34                               ` Kai Henningsen
  2 siblings, 0 replies; 82+ messages in thread
From: Kai Henningsen @ 2002-07-17  7:34 UTC (permalink / raw)
  To: linux-kernel

elladan@eskimo.com (Elladan)  wrote on 16.07.02 in <20020717022252.GA30570@eskimo.com>:

> I believe this demonstrates the need for a standard, one way, or the
> other.  :-)

So then let's see what the actual standard says ...

--- snip ---

                 The Open Group Base Specifications Issue 6
                            IEEE Std 1003.1-2001
     Copyright © 2001 The IEEE and The Open Group, All Rights reserved.
     _________________________________________________________________

    NAME

     close - close a file descriptor

    SYNOPSIS

     #include <unistd.h>
     int close(int fildes);

    DESCRIPTION

     The close() function shall deallocate the file descriptor indicated
     by fildes. To deallocate means to make the file descriptor
     available for return by subsequent calls to open() or other
     functions that allocate file descriptors. All outstanding record
     locks owned by the process on the file associated with the file
     descriptor shall be removed (that is, unlocked).

     If close() is interrupted by a signal that is to be caught, it
     shall return -1 with errno set to [EINTR] and the state of fildes
     is unspecified. If an I/O error occurred while reading from or
     writing to the file system during close(), it may return -1 with
     errno set to [EIO]; if this error is returned, the state of fildes
     is unspecified.

     When all file descriptors associated with a pipe or FIFO special
     file are closed, any data remaining in the pipe or FIFO shall be
     discarded.

     When all file descriptors associated with an open file description
     have been closed, the open file description shall be freed.

     If the link count of the file is 0, when all file descriptors
     associated with the file are closed, the space occupied by the file
     shall be freed and the file shall no longer be accessible.

     [XSR] [Option Start] If a STREAMS-based fildes is closed and the
     calling process was previously registered to receive a SIGPOLL
     signal for events associated with that STREAM, the calling process
     shall be unregistered for events associated with the STREAM. The
     last close() for a STREAM shall cause the STREAM associated with
     fildes to be dismantled. If O_NONBLOCK is not set and there have
     been no signals posted for the STREAM, and if there is data on the
     module's write queue, close() shall wait for an unspecified time
     (for each module and driver) for any output to drain before
     dismantling the STREAM. The time delay can be changed via an
     I_SETCLTIME ioctl() request. If the O_NONBLOCK flag is set, or if
     there are any pending signals, close() shall not wait for output to
     drain, and shall dismantle the STREAM immediately.

     If the implementation supports STREAMS-based pipes, and fildes is
     associated with one end of a pipe, the last close() shall cause a
     hangup to occur on the other end of the pipe. In addition, if the
     other end of the pipe has been named by fattach(), then the last
     close() shall force the named end to be detached by fdetach(). If
     the named end has no open file descriptors associated with it and
     gets detached, the STREAM associated with that end shall also be
     dismantled. [Option End]

     [XSI] [Option Start] If fildes refers to the master side of a
     pseudo-terminal, and this is the last close, a SIGHUP signal shall
     be sent to the process group, if any, for which the slave side of
     the pseudo-terminal is the controlling terminal. It is unspecified
     whether closing the master side of the pseudo-terminal flushes all
     queued input and output. [Option End]

     [XSR] [Option Start] If fildes refers to the slave side of a
     STREAMS-based pseudo-terminal, a zero-length message may be sent to
     the master. [Option End]

     [AIO] [Option Start] When there is an outstanding cancelable
     asynchronous I/O operation against fildes when close() is called,
     that I/O operation may be canceled. An I/O operation that is not
     canceled completes as if the close() operation had not yet
     occurred. All operations that are not canceled shall complete as if
     the close() blocked until the operations completed. The close()
     operation itself need not block awaiting such I/O completion.
     Whether any I/O operation is canceled, and which I/O operation may
     be canceled upon close(), is implementation-defined. [Option End]

     [MF|SHM] [Option Start] If a shared memory object or a memory
     mapped file remains referenced at the last close (that is, a
     process has it mapped), then the entire contents of the memory
     object shall persist until the memory object becomes unreferenced.
     If this is the last close of a shared memory object or a memory
     mapped file and the close results in the memory object becoming
     unreferenced, and the memory object has been unlinked, then the
     memory object shall be removed. [Option End]

     If fildes refers to a socket, close() shall cause the socket to be
     destroyed. If the socket is in connection-mode, and the SO_LINGER
     option is set for the socket with non-zero linger time, and the
     socket has untransmitted data, then close() shall block for up to
     the current linger interval until all data is transmitted.

    RETURN VALUE

     Upon successful completion, 0 shall be returned; otherwise, -1
     shall be returned and errno set to indicate the error.

    ERRORS

     The close() function shall fail if:
   [EBADF]
          The fildes argument is not a valid file descriptor.
   [EINTR]
          The close() function was interrupted by a signal.

     The close() function may fail if:
   [EIO]
          An I/O error occurred while reading from or writing to the file
          system.
     _________________________________________________________________

   The following sections are informative.

    EXAMPLES

      Reassigning a File Descriptor

     The following example closes the file descriptor associated with
     standard output for the current process, re-assigns standard output
     to a new file descriptor, and closes the original file descriptor
     to clean up. This example assumes that the file descriptor 0 (which
     is the descriptor for standard input) is not closed.
#include <unistd.h>
...
int pfd;
...
close(1);
dup(pfd);
close(pfd);
...

     Incidentally, this is exactly what could be achieved using:
dup2(pfd, 1);
close(pfd);

      Closing a File Descriptor

     In the following example, close() is used to close a file
     descriptor after an unsuccessful attempt is made to associate that
     file descriptor with a stream.
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

#define LOCKFILE "/etc/ptmp"
...
int pfd;
FILE *fpfd;
...
if ((fpfd = fdopen (pfd, "w")) == NULL) {
    close(pfd);
    unlink(LOCKFILE);
    exit(1);
}
...

    APPLICATION USAGE

     An application that had used the stdio routine fopen() to open a
     file should use the corresponding fclose() routine rather than
     close(). Once a file is closed, the file descriptor no longer
     exists, since the integer corresponding to it no longer refers to a
     file.

    RATIONALE

     The use of interruptible device close routines should be
     discouraged to avoid problems with the implicit closes of file
     descriptors by exec and exit(). This volume of IEEE Std 1003.1-2001
     only intends to permit such behavior by specifying the [EINTR]
     error condition.

    FUTURE DIRECTIONS

     None.

    SEE ALSO

     STREAMS , fattach() , fclose() , fdetach() , fopen() , ioctl() ,
     open() , the Base Definitions volume of IEEE Std 1003.1-2001,
     <unistd.h>

    CHANGE HISTORY

     First released in Issue 1. Derived from Issue 1 of the SVID.

    Issue 5

     The DESCRIPTION is updated for alignment with the POSIX Realtime
     Extension.

    Issue 6

     The DESCRIPTION related to a STREAMS-based file or pseudo-terminal
     is marked as part of the XSI STREAMS Option Group.

     The following new requirements on POSIX implementations derive from
     alignment with the Single UNIX Specification:
     * The [EIO] error condition is added as an optional error.
     * The DESCRIPTION is updated to describe the state of the fildes
       file descriptor as unspecified if an I/O error occurs and an [EIO]
       error condition is returned.

     Text referring to sockets is added to the DESCRIPTION.

     The DESCRIPTION is updated for alignment with IEEE Std 1003.1j-2000
     by specifying that shared memory objects and memory mapped files
     (and not typed memory objects) are the types of memory objects to
     which the paragraph on last closes applies.

   End of informative text.
     _________________________________________________________________
     _________________________________________________________________

            UNIX ® is a registered Trademark of The Open Group.
               POSIX ® is a registered Trademark of The IEEE.
     _________________________________________________________________
--- snip ---

The standard is very explicit here: When close() returns an error,
*YOU LOSE*.

MfG Kai

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  0:10                             ` close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks) Zack Weinberg
  2002-07-17  1:45                               ` Alan Cox
@ 2002-07-17  8:00                               ` Lars Marowsky-Bree
  2002-07-17 15:49                                 ` Thunder from the hill
  1 sibling, 1 reply; 82+ messages in thread
From: Lars Marowsky-Bree @ 2002-07-17  8:00 UTC (permalink / raw)
  To: Zack Weinberg, Alan Cox; +Cc: linux-kernel

On 2002-07-16T17:10:32,
   Zack Weinberg <zack@codesourcery.com> said:

> Therefore, if you've checked the return value of fsync, there's no
> point in checking the subsequent close; and if you don't care to call
> fsync, the close return value is useless since it isn't guaranteed to
> detect anything.

There is _always_ a point in checking a return value of non void functions.

EOD.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Immortality is an adequate definition of high availability for me.
	--- Gregory F. Pfister


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-17  1:23                                       ` Linus Torvalds
@ 2002-07-17 11:51                                         ` Matthias Andree
  2002-07-17 17:23                                           ` Andries Brouwer
  2002-07-20  8:00                                         ` Florian Weimer
  1 sibling, 1 reply; 82+ messages in thread
From: Matthias Andree @ 2002-07-17 11:51 UTC (permalink / raw)
  To: linux-kernel

On Tue, 16 Jul 2002, Linus Torvalds wrote:

> Yes, EAGAIN doesn't really work as a close return value, simply because
> _nobody_ expects that (and leaving the file descriptor open after a
> close() is definitely unexpected, ie people can very validly complain
> about buggy behaviour).

non-issue, since EAGAIN would violate the specs that don't list EAGAIN
(and EAGAIN in response does not make sense either; the kernel should
then try harder to get the I/O completed).

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  4:38                                 ` Elladan
@ 2002-07-17 14:39                                   ` Andreas Schwab
  2002-07-17 16:49                                     ` Elladan
  2002-07-17 17:17                                   ` Andries Brouwer
  1 sibling, 1 reply; 82+ messages in thread
From: Andreas Schwab @ 2002-07-17 14:39 UTC (permalink / raw)
  To: Elladan; +Cc: Stevie O, Alan Cox, Zack Weinberg, linux-kernel

Elladan <elladan@eskimo.com> writes:

|> On Wed, Jul 17, 2002 at 12:17:40AM -0400, Stevie O wrote:
|> > At 07:22 PM 7/16/2002 -0700, Elladan wrote:
|> > >  1. Thread 1 performs close() on a file descriptor.  close fails.
|> > >  2. Thread 2 performs open().
|> > >* 3. Thread 1 performs close() again, just to make sure.
|> > >
|> > >
|> > >open() may return any file descriptor not currently in use.
|> > 
|> > I'm confused here... the only way close() can fail is if the file
|> > descriptor is invalid (EBADF); wouldn't it be rather stupid to close()
|> > a known-to-be-bad descriptor?
|> 
|> Well, obviously, if that's the case.  However, the man page for close(2)
|> doesn't agree (see below).  close() is allowed to return EBADF, EINTR,
|> or EIO.
|> 
|> The question is, does the OS standard guarantee that the fd is closed,
|> even if close() returns EINTR or EIO?  Just going by the normal usage of
|> EINTR, one might think otherwise.  It doesn't appear to be documented
|> one way or another.

POSIX says the state of the file descriptor when close fails (with errno
!= EBADF) is unspecified, which means:

    The value or behavior may vary among implementations that conform to
    IEEE Std 1003.1-2001. An application should not rely on the existence
    or validity of the value or behavior. An application that relies on
    any particular value or behavior cannot be assured to be portable
    across conforming implementations.
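
Given that, a single-threaded program can at least ask the kernel what
happened after a failed close. A minimal C sketch, assuming fcntl() with
F_GETFD as the probe; fd_is_open is a hypothetical helper name:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd currently names an open descriptor, 0 if it does not. */
int fd_is_open(int fd)
{
    /* F_GETFD fails with EBADF only when fd is not an open descriptor. */
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}
```

This is only meaningful when no other thread can allocate descriptors in
between: with a shared descriptor table the number may already have been
handed out again, and the probe would then report someone else's file.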

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  8:00                               ` Lars Marowsky-Bree
@ 2002-07-17 15:49                                 ` Thunder from the hill
  0 siblings, 0 replies; 82+ messages in thread
From: Thunder from the hill @ 2002-07-17 15:49 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Zack Weinberg, Alan Cox, linux-kernel

Hi,

On Tue, 16 Jul 2002, Zack Weinberg wrote:
> the close return value is useless since it isn't guaranteed to detect
> anything.

"Isn't guaranteed to detect anything" is still a lot more encouraging to 
see if it does detect anything than "Is guaranteed not to detect anything".

							Regards,
							Thunder
-- 
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o?  K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y- 
------END GEEK CODE BLOCK------


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17 14:39                                   ` Andreas Schwab
@ 2002-07-17 16:49                                     ` Elladan
  2002-07-17 17:43                                       ` Linus Torvalds
  0 siblings, 1 reply; 82+ messages in thread
From: Elladan @ 2002-07-17 16:49 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Elladan, Stevie O, Alan Cox, Zack Weinberg, linux-kernel

On Wed, Jul 17, 2002 at 04:39:28PM +0200, Andreas Schwab wrote:
> Elladan <elladan@eskimo.com> writes:
> 
> |> On Wed, Jul 17, 2002 at 12:17:40AM -0400, Stevie O wrote:
> |> > At 07:22 PM 7/16/2002 -0700, Elladan wrote:
> |> > >  1. Thread 1 performs close() on a file descriptor.  close fails.
> |> > >  2. Thread 2 performs open().
> |> > >* 3. Thread 1 performs close() again, just to make sure.
> |> > >
> |> > >
> |> > >open() may return any file descriptor not currently in use.
> |> > 
> |> > I'm confused here... the only way close() can fail is if the file
> |> > descriptor is invalid (EBADF); wouldn't it be rather stupid to close()
> |> > a known-to-be-bad descriptor?
> |> 
> |> Well, obviously, if that's the case.  However, the man page for close(2)
> |> doesn't agree (see below).  close() is allowed to return EBADF, EINTR,
> |> or EIO.
> |> 
> |> The question is, does the OS standard guarantee that the fd is closed,
> |> even if close() returns EINTR or EIO?  Just going by the normal usage of
> |> EINTR, one might think otherwise.  It doesn't appear to be documented
> |> one way or another.
> 
> POSIX says the state of the file descriptor when close fails (with errno
> != EBADF) is unspecified, which means:
> 
>     The value or behavior may vary among implementations that conform to
>     IEEE Std 1003.1-2001. An application should not rely on the existence
>     or validity of the value or behavior. An application that relies on
>     any particular value or behavior cannot be assured to be portable
>     across conforming implementations.

This doesn't mean an OS shouldn't specify the behavior.  Just because
the cross-platform standard leaves it unspecified doesn't mean the OS
should.

Consider what this says, if a particular OS doesn't pick a standard
which the application can port to.  It means that the *only way* to
correctly close a file descriptor is like this:

int ret;
do {
	ret = close(fd);
} while(ret == -1 && errno != EBADF);

That means, if we get an error, we have to loop until the kernel throws
a BADF error!  We can't detect that the file is closed from any other
error value, because only BADF has a defined behavior.

This would sort of work, though of course be hideous, for a single
threaded app.  Now consider a multithreaded app.  To correctly implement
this we have to lock around all calls to close and
open/socket/dup/pipe/creat/etc...

This is clearly ridiculous, and not at all as intended.  Either standard
will work for an OS (though guaranteeing close the first time is much
simpler all around), but it needs to be specified and stuck to, or you
get horrible things like this to work around a bad spec:


void lock_syscalls();
void unlock_syscalls();

int threadsafe_open(const char *file, int flags, mode_t mode)
{
	int fd;
	lock_syscalls();
	fd = open(file, flags, mode);
	unlock_syscalls();
	return fd;
}

int threadsafe_close(int fd)
{
	int ret;
	lock_syscalls();
	do {
		ret = close(fd);
	} while(ret == -1 && errno != EBADF);
	unlock_syscalls();
	return ret;
}

int threadsafe_socket() ...
int threadsafe_pipe() ...
int threadsafe_dup() ...
int threadsafe_creat() ...
int threadsafe_socketpair() ...
int threadsafe_accept() ...
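
For completeness, one way the hypothetical lock_syscalls() and
unlock_syscalls() pair might be filled in is a single POSIX mutex. This
is a sketch under the assumption that every call which allocates or
frees a descriptor goes through such a wrapper. It also keeps the errno
of the first failed close(), a detail the outline above leaves out.

```c
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* One process-wide lock guarding every change to the descriptor table. */
static pthread_mutex_t fd_table_lock = PTHREAD_MUTEX_INITIALIZER;

void lock_syscalls(void)   { pthread_mutex_lock(&fd_table_lock); }
void unlock_syscalls(void) { pthread_mutex_unlock(&fd_table_lock); }

int threadsafe_open(const char *file, int flags, mode_t mode)
{
    int fd;

    lock_syscalls();
    fd = open(file, flags, mode);
    unlock_syscalls();
    return fd;
}

int threadsafe_close(int fd)
{
    int ret, saved_errno;

    lock_syscalls();
    ret = close(fd);
    saved_errno = errno;
    if (ret == -1 && saved_errno != EBADF) {
        /*
         * State of fd is unknown: keep closing until the kernel says
         * EBADF. If close() failed forever without ever releasing fd,
         * this would spin; that is part of the problem being described.
         */
        while (!(close(fd) == -1 && errno == EBADF))
            ;
    }
    unlock_syscalls();

    if (ret == -1)
        errno = saved_errno;    /* report the first error, not EBADF */
    return ret;
}
```

The lock serializes every open and close in the process, which is the
cost being objected to here.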

-J


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  4:38                                 ` Elladan
  2002-07-17 14:39                                   ` Andreas Schwab
@ 2002-07-17 17:17                                   ` Andries Brouwer
  2002-07-17 17:51                                     ` Richard Gooch
  1 sibling, 1 reply; 82+ messages in thread
From: Andries Brouwer @ 2002-07-17 17:17 UTC (permalink / raw)
  To: Elladan; +Cc: Stevie O, Alan Cox, Zack Weinberg, linux-kernel

On Tue, Jul 16, 2002 at 09:38:53PM -0700, Elladan wrote:

> The question is, does the OS standard guarantee that the fd is closed,
> even if close() returns EINTR or EIO?  Just going by the normal usage of
> EINTR, one might think otherwise.  It doesn't appear to be documented
> one way or another.
> 
> Alan said you could just issue close again to make sure - the example
> shows that this is not the case.  A second close is either required or
> forbidden in that example - and the behavior has to be well defined or
> you won't know which to do.

No, the behaviour is not well-defined at all.
The standard explicitly leaves undefined what happens when close returns
EINTR or EIO.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-17 11:51                                         ` Matthias Andree
@ 2002-07-17 17:23                                           ` Andries Brouwer
  0 siblings, 0 replies; 82+ messages in thread
From: Andries Brouwer @ 2002-07-17 17:23 UTC (permalink / raw)
  To: linux-kernel

On Wed, Jul 17, 2002 at 01:51:25PM +0200, Matthias Andree wrote:

> non-issue, since EAGAIN would violate the specs that don't list EAGAIN

"Implementations may support additional errors not included in this
list, may generate errors included in this list under circumstances
other than those described here, or may contain extensions or
limitations that prevent some errors from occurring. The ERRORS
section on each reference page specifies whether an error shall be
returned, or whether it may be returned. Implementations shall not
generate a different error number from the ones described here for
error conditions described in this volume of IEEE Std 1003.1-2001, but
may generate additional errors unless explicitly disallowed for a
particular function."


Not listing an error in the spec does not mean it cannot occur.
Especially EFAULT is not usually listed.

Andries

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17 16:49                                     ` Elladan
@ 2002-07-17 17:43                                       ` Linus Torvalds
  2002-07-17 22:07                                         ` Elladan
  2002-07-18  9:48                                         ` Ketil Froyn
  0 siblings, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2002-07-17 17:43 UTC (permalink / raw)
  To: linux-kernel

In article <20020717164933.GA2136@eskimo.com>,
Elladan  <elladan@eskimo.com> wrote:
>
>Consider what this says, if a particular OS doesn't pick a standard
>which the application can port to.  It means that the *only way* to
>correctly close a file descriptor is like this:
>
>int ret;
>do {
>	ret = close(fd);
>} while(ret == -1 && errno != EBADF);

NO.

The above is
 (a) not portable
 (b) not current practice

The "not portable" part comes from the fact that (as somebody pointed
out), a threaded environment in which the kernel _does_ close the FD on
errors, the FD may have been validly re-used (by the kernel) for some
other thread, and closing the FD a second time is a BUG.

The "not practice" comes from the fact that applications do not do what
you suggest.

The fact is, what Linux does and has always done is the only reasonable
thing to do: the close _will_ tear down the FD, and the error value is
nothing but a warning to the application that there may still be IO
pending (or there may have been failed IO) on the file that the (now
closed) descriptor pointed to.

The application may want to take evasive action (ie try to write the
file again, make a backup, or just warn the user), but the file
descriptor is _gone_. 

>That means, if we get an error, we have to loop until the kernel throws
>a BADF error!  We can't detect that the file is closed from any other
>error value, because only BADF has a defined behavior.

But your loop is _provably_ incorrect for a threaded application.  Your
explicit system call locking approach doesn't work either, because I'm
pretty certain that POSIX already states that open/close are thread
safe, so you can't just invalidate that _other_ standard. 

		Linus
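
A minimal sketch of the close-once policy described above, assuming the
Linux behaviour that close() always releases the descriptor.  The helper
name close_once() is hypothetical and not taken from any library.  The
error value is passed back to the caller as a warning only; the
descriptor is never retried.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: close fd exactly once.  Returns 0 on success,
 * or the errno value reported by close().  Either way the caller must
 * treat fd as gone and never pass it to close() again. */
int close_once(int fd)
{
	if (close(fd) == 0)
		return 0;

	int err = errno;	/* save before calling anything else */
	if (err != EBADF)
		fprintf(stderr, "close(%d): %s (data may not have been written)\n",
			fd, strerror(err));
	return err;
}
```

A caller that cares about the data (an MTA, say) can use a non-zero
return to defer or rewrite the file, but should open a fresh descriptor
to do so.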

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17 17:17                                   ` Andries Brouwer
@ 2002-07-17 17:51                                     ` Richard Gooch
  0 siblings, 0 replies; 82+ messages in thread
From: Richard Gooch @ 2002-07-17 17:51 UTC (permalink / raw)
  To: Andries Brouwer; +Cc: Elladan, Stevie O, Alan Cox, Zack Weinberg, linux-kernel

Andries Brouwer writes:
> On Tue, Jul 16, 2002 at 09:38:53PM -0700, Elladan wrote:
> 
> > The question is, does the OS standard guarantee that the fd is closed,
> > even if close() returns EINTR or EIO?  Just going by the normal usage of
> > EINTR, one might think otherwise.  It doesn't appear to be documented
> > one way or another.
> > 
> > Alan said you could just issue close again to make sure - the example
> > shows that this is not the case.  A second close is either required or
> > forbidden in that example - and the behavior has to be well defined or
> > you won't know which to do.
> 
> No, the behaviour is not well-defined at all.
> The standard explicitly leaves undefined what happens when close
> returns EINTR or EIO.

However, the only sane thing to do is to explicitly define one way or
another. The standard is broken. Consider a threaded application,
where one thread tries to call close(), gets an error and re-tries,
because it's not sure if the fd was closed or not. If the fd *is*
closed, and the thread loops calling close(), checking for EBADF,
there is a race if another thread tries calling open()/creat()/dup().

The ambiguity in the standard thus results in the impossibility of
writing a race-free application. And no, forcing the application to
protect system calls with mutexes isn't a solution.

Linux should define explicitly what happens on error return from
close(). Let that be the new standard.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca
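
The race can be illustrated without threads.  POSIX has open() return
the lowest-numbered descriptor that is not currently open, so a number
freed by close() is normally handed straight back out.  In the sketch
below the second open() stands in for "another thread calls open()";
the function name and the use of /dev/null are assumptions made for
illustration only.

```c
#include <fcntl.h>
#include <unistd.h>

/* Illustrative only: returns 1 if a descriptor number freed by close()
 * is handed out again by the very next open(), 0 if not, -1 on error. */
int fd_number_is_reused(void)
{
	int first = open("/dev/null", O_RDONLY);
	if (first < 0)
		return -1;
	if (close(first) != 0)
		return -1;

	/* Stands in for "another thread calls open()". */
	int second = open("/dev/null", O_RDONLY);
	if (second < 0)
		return -1;

	int reused = (second == first);
	close(second);
	return reused;
}
```

A thread that retried close(first) after the second open() would be
closing the other thread's file rather than its own.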

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  1:45                               ` Alan Cox
@ 2002-07-17 18:24                                 ` Zack Weinberg
  2002-07-22 16:42                                 ` Rogier Wolff
  1 sibling, 0 replies; 82+ messages in thread
From: Zack Weinberg @ 2002-07-17 18:24 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

On Wed, Jul 17, 2002 at 02:45:40AM +0100, Alan Cox wrote:
> On Wed, 2002-07-17 at 01:10, Zack Weinberg wrote:
> > My first point is that a portable application cannot rely on close to
> > detect any error.  Only fsync guarantees to detect any errors at all
> > (except ENOSPC/EDQUOT, which should come back on write; yes, I know
> > about the buggy NFS implementations that report them only on close).
> 
> They are not buggy merely inconvenient. The reality of the NFS protocol
> makes it the only viable way to do it

You are referring to the way NFSv2 lacks any way to request space
allocation on the server without also flushing data to disk?  It was
my understanding that NFSv2 clients that did not accept the
performance hit and do all writes synchronously were considered
broken.  (since, for instance, POSIX write-visibility guarantees are
violated if writes are delayed on the client.)

In v3 or v4, the WRITE/COMMIT separation lets the implementor generate
prompt ENOSPC and EDQUOT errors without performance penalty.

Another thing to keep in mind is that an application is often in a
much better position to recover from an error, particularly a
disk-full error, if it's reported on write rather than on close.
That's just a quality-of-implementation question, though.

> > > If it bothers you close it again 8)
> > 
> > And watch it come back with an error again, repeat ad infinitum?
> 
> The use of intelligence doesn't help. Come on I know you aren't a cobol
> programmer. Check for -EBADF ...

I wasn't talking about EBADF.  How does the application know the
kernel will ever succeed in closing the file?

> Disagree. It says
> 
> It is quite possible that errors on a  previous  write(2)  operation 
> are first  reported  at  the  final  close
> 
> Not checking the return value when closing the file may lead to silent
> loss of  data.
> 
>        A successful close does not guarantee that  the  data  has
>        been  successfully  saved  to  disk,  as the kernel defers
>        writes. It is not common for a  filesystem  to  flush  the
>        buffers  when the stream is closed. If you need to be sure
>        that the data is physically stored use fsync(2).  (It will
>        depend on the disk hardware at this point.)
> 
> None of which guarantee what you say, and which agree about the use of
> fsync being appropriate now and then

That is not the text quoted upthread.  Looks like the manpage did get
fixed, although I think the current wording is still suboptimal.

zw
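
One way to act on the points above, sketched under stated assumptions:
check write() for prompt errors such as ENOSPC, call fsync() so that
deferred errors are reported while the descriptor is still open, and
then close exactly once.  The helper name save_file() and the 0644 mode
are hypothetical choices, not from the thread.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path, fsync, close once.
 * Returns 0 on success, or -1 with errno set by the first failing call. */
int save_file(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	const char *p = buf;
	int err = 0;
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* nothing written, safe to retry */
			err = errno;		/* e.g. ENOSPC, EDQUOT */
			break;
		}
		p += n;
		len -= (size_t)n;
	}

	if (err == 0 && fsync(fd) != 0)
		err = errno;		/* deferred write errors surface here */

	if (close(fd) != 0 && err == 0)
		err = errno;		/* not retried: the fd may already be gone */

	if (err != 0) {
		errno = err;
		return -1;
	}
	return 0;
}
```

With this ordering an error from close() carries little new
information, because fsync() has already reported anything the kernel
knew about.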

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17 17:43                                       ` Linus Torvalds
@ 2002-07-17 22:07                                         ` Elladan
  2002-07-18  9:48                                         ` Ketil Froyn
  1 sibling, 0 replies; 82+ messages in thread
From: Elladan @ 2002-07-17 22:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Wed, Jul 17, 2002 at 05:43:57PM +0000, Linus Torvalds wrote:
> In article <20020717164933.GA2136@eskimo.com>,
> Elladan  <elladan@eskimo.com> wrote:
> >
> >Consider what this says, if a particular OS doesn't pick a standard
> >which the application can port to.  It means that the *only way* to
> >correctly close a file descriptor is like this:
> >
> >int ret;
> >do {
> >	ret = close(fd);
> >} while(ret == -1 && errno != EBADF);
> 
> NO.
> 
> The above is
>  (a) not portable
>  (b) not current practice
> 
> The "not portable" part comes from the fact that (as somebody pointed
> out), a threaded environment in which the kernel _does_ close the FD on
> errors, the FD may have been validly re-used (by the kernel) for some
> other thread, and closing the FD a second time is a BUG.

That somebody was me.  It appears we're in extremely violent agreement
on this issue.  We both agree the code I wrote is crap.  :-)

-J

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
       [not found]                                   ` <mailman.1026868201.10433.linux-kernel2news@redhat.com>
@ 2002-07-18  0:01                                     ` Pete Zaitcev
  2002-07-18  0:10                                       ` Thunder from the hill
                                                         ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Pete Zaitcev @ 2002-07-18  0:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

>>From: David S. Miller <davem@redhat.com>

>>   From: Alan Cox <alan@lxorguk.ukuu.org.uk>
>>   Date: 17 Jul 2002 02:35:41 +0100
>>
>>   Our NFS can return errors from close().
>>
>>Better tell Linus.
> 
> Oh, Linus knows.  In fact, Linus wrote some of the code in question. 
> 
> But the thing is, Linus doesn't want to have people have the same issues
> with local filesystems.  I _know_ there are broken applications that do
> not test the error return from close(), and I think it is a politeness
> issue to return error codes that you can know about as soon as humanly
> possible. 

> For NFS, you simply cannot do any reasonable performance without doing
> deferred error reporting.  The same isn't true of other filesystems. 
> Even in the presence of delayed block allocation, a local filesystem can
> _reserve_ the blocks early, and has no excuse for giving errors late
> (except, of course, for actual IO errors). 

I really hate to disagree with the chief penguin here, but
it's extremely dumb to return errors from close(). The last
time we trashed this issue on this list was when a newbie used
an error return from release() to communicate with his driver.

The problem with errors from close() is that NOTHING SMART can be
done by the application when it receives it. An application can:

 a) print a message "Your data are lost, have a nice day\n".
 b) loop retrying close() until it works.
 c) do (a) then (b).

The thing about (b) is that the kernel can do it much better.
Another thing proponents of errors from close() better ask themselves
is if the file descriptor stays open or closed if close() abends.
If it remains open, your exit() is bust. If it closes, you
cannot retry the error (b).

-- Pete

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-18  0:01                                     ` close return value Pete Zaitcev
@ 2002-07-18  0:10                                       ` Thunder from the hill
       [not found]                                       ` <mit.lcs.mail.linux-kernel/200207180001.g6I015f02681@devserv.devel.redhat.com>
  2002-07-18 20:09                                       ` Hildo.Biersma
  2 siblings, 0 replies; 82+ messages in thread
From: Thunder from the hill @ 2002-07-18  0:10 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: Linus Torvalds, linux-kernel

Hi,

On Wed, 17 Jul 2002, Pete Zaitcev wrote:
> The problem with errors from close() is that NOTHING SMART can be
> done by the application when it receives it. An application can:
> 
>  a) print a message "Your data are lost, have a nice day\n".
>  b) loop retrying close() until it works.
>  c) do (a) then (b).

(a) is much saner than silently losing data.

							Regards,
							Thunder
-- 
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o?  K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y- 
------END GEEK CODE BLOCK------


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17 17:43                                       ` Linus Torvalds
  2002-07-17 22:07                                         ` Elladan
@ 2002-07-18  9:48                                         ` Ketil Froyn
  1 sibling, 0 replies; 82+ messages in thread
From: Ketil Froyn @ 2002-07-18  9:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Wed, 17 Jul 2002, Linus Torvalds wrote:

> >int ret;
> >do {
> >	ret = close(fd);
> >} while(ret == -1 && errno != EBADF);
>
> NO.
>
> The above is
>  (a) not portable
>  (b) not current practice
>
> The "not portable" part comes from the fact that (as somebody pointed
> out), a threaded environment in which the kernel _does_ close the FD on
> errors, the FD may have been validly re-used (by the kernel) for some
> other thread, and closing the FD a second time is a BUG.
>
> The "not practice" comes from the fact that applications do not do what
> you suggest.
>
> The fact is, what Linux does and has always done is the only reasonable
> thing to do: the close _will_ tear down the FD, and the error value is
> nothing but a warning to the application that there may still be IO
> pending (or there may have been failed IO) on the file that the (now
> closed) descriptor pointed to.

Is this what happens when EINTR is received as well? If so, is there any
point to EINTR? Ie. close() was interrupted, but finished anyway. Would
any application care?

If there is any pending IO when this happens, is it possible to find out
when this is finished? If not, an MTA getting this would have to
temporarily defer the mail it received and hope it doesn't get an EINTR on
close() next time, I guess.

Ketil



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
       [not found]                                       ` <mit.lcs.mail.linux-kernel/200207180001.g6I015f02681@devserv.devel.redhat.com>
@ 2002-07-18 14:42                                         ` Patrick J. LoPresti
  2002-07-18 15:13                                           ` Richard B. Johnson
  2002-07-18 23:47                                           ` Albert D. Cahalan
  0 siblings, 2 replies; 82+ messages in thread
From: Patrick J. LoPresti @ 2002-07-18 14:42 UTC (permalink / raw)
  To: linux-kernel

Pete Zaitcev <zaitcev@redhat.com> writes:

> The problem with errors from close() is that NOTHING SMART can be
> done by the application when it receives it.

This is like saying "nothing smart" can be done when write() returns
ENOSPC.  Such statements are either trivially true or blatantly false,
depending on what you mean by "smart".

Failures happen.  They can happen on write(), they can happen on
close(), and they can happen on any system call for which the API
allows it.  There is no difference!  Your application either deals
with them and is correct or fails to deal with them and is broken.

If the API allows an error return, you *must* check for it, period.
This includes "impossible" errors.  You may think it is impossible for
gettimeofday() to return an error in some case, but if it ever did,
you should darn well want to know about it right away.

If you are that convinced that close() can not return an error in your
particular application (e.g., because you "know" you are using a local
disk, or the file descriptor is read-only), then treat such errors
like assertion failures.  Because that is what they are.

Checking system calls for errors, always, is fundamental to writing
reliable code.  Failing to check them is shoddy and amateurish
programming.  It is amazing that so many people would argue this
point.  Then again, maybe not, given how bad most software is...

 - Pat
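
The "treat such errors like assertion failures" approach might look
like the sketch below.  The helper name close_or_die() is hypothetical;
it is meant only for descriptors where the caller believes close()
cannot fail, so that a failure is reported loudly instead of ignored.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper for descriptors where a close() failure is
 * believed "impossible" (read-only fd, local disk).  A failure is
 * treated like a failed assertion: report it and abort. */
void close_or_die(int fd)
{
	if (close(fd) != 0) {
		fprintf(stderr, "close(%d) failed unexpectedly: %s\n",
			fd, strerror(errno));
		abort();
	}
}
```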

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-18 14:42                                         ` Patrick J. LoPresti
@ 2002-07-18 15:13                                           ` Richard B. Johnson
  2002-07-18 15:32                                             ` Sandy Harris
  2002-07-18 23:47                                           ` Albert D. Cahalan
  1 sibling, 1 reply; 82+ messages in thread
From: Richard B. Johnson @ 2002-07-18 15:13 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel

On 18 Jul 2002, Patrick J. LoPresti wrote:

> Pete Zaitcev <zaitcev@redhat.com> writes:
> 
> > The problem with errors from close() is that NOTHING SMART can be
> > done by the application when it receives it.
> 
> This is like saying "nothing smart" can be done when write() returns
> ENOSPC.  Such statements are either trivially true or blatantly false,
> depending on what you mean by "smart".
> 
> Failures happen.  They can happen on write(), they can happen on
> close(), and they can happen on any system call for which the API
> allows it.  There is no difference!  Your application either deals
> with them and is correct or fails to deal with them and is broken.
> 
> If the API allows an error return, you *must* check for it, period.
[SNIPPED..]

Well no. Many procedures are called for effect. When is the last
time you checked the return-value of printf() or puts()? If your
code does this it's wasting CPU cycles.

When it is necessary to perform code reviews, because your company
does FDA or some similar critical software, then you show that
you know you are ignoring a return value by casting it to void.
This shows that the writer knew that he or she was deliberately
ignoring a return-value.

In the specific close(fd) function, my reading of the man page
on this system says that it can only return an error of EBADF
on Linux. Which means that if you make Linux-only code, you
can ignore any error because the fd has become invalid somehow
and subsequent attempts to close with the same fd will surely
fail in the exact same way.

But most systems can return -1 and have an error code of EINTR
(interrupted system call) on any system call. Also, deferred
writing, such as happens in network file-systems, may not return
an error during the write. Such systems are supposed to return
an error during a later call that uses the same file descriptor.
If that call is a close(), then you may get an error. I don't
know what you do under those circumstances, but at the very least,
somebody/something should 'know' that the network write didn't
go as planned.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-18 15:13                                           ` Richard B. Johnson
@ 2002-07-18 15:32                                             ` Sandy Harris
  0 siblings, 0 replies; 82+ messages in thread
From: Sandy Harris @ 2002-07-18 15:32 UTC (permalink / raw)
  To: linux-kernel

"Richard B. Johnson" wrote:
> 
> On 18 Jul 2002, Patrick J. LoPresti wrote:
> 
> > Pete Zaitcev <zaitcev@redhat.com> writes:
> >
> > > The problem with errors from close() is that NOTHING SMART can be
> > > done by the application when it receives it.
> >
> > This is like saying "nothing smart" can be done when write() returns
> > ENOSPC.  Such statements are either trivially true or blatantly false,
> > depending on what you mean by "smart".
> >
> > Failures happen.  They can happen on write(), they can happen on
> > close(), and they can happen on any system call for which the API
> > allows it.  There is no difference!  Your application either deals
> > with them and is correct or fails to deal with them and is broken.
> >
> > If the API allows an error return, you *must* check for it, period.
> [SNIPPED..]
> 
> Well no. Many procedures are called for effect. When is the last
> time you checked the return-value of printf() or puts()? If your
> code does this it's wasting CPU cycles.

There's a classic paper on this:
http://www.apocalypse.org/pub/u/paul/docs/canthappen.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-18  0:01                                     ` close return value Pete Zaitcev
  2002-07-18  0:10                                       ` Thunder from the hill
       [not found]                                       ` <mit.lcs.mail.linux-kernel/200207180001.g6I015f02681@devserv.devel.redhat.com>
@ 2002-07-18 20:09                                       ` Hildo.Biersma
  2002-07-18 23:55                                         ` Pete Zaitcev
  2 siblings, 1 reply; 82+ messages in thread
From: Hildo.Biersma @ 2002-07-18 20:09 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: Linus Torvalds, linux-kernel

>>>>> "Pete" == Pete Zaitcev <zaitcev@redhat.com> writes:

Pete> I really hate to disagree with the chief penguin here, but it's
Pete> extremely dumb to return errors from close(). The last time we
Pete> trashed this issue on this list was when a newbie used an error
Pete> return from release() to communicate with his driver.

Pete> The problem with errors from close() is that NOTHING SMART can be
Pete> done by the application when it receives it. An application can:

Pete>  a) print a message "Your data are lost, have a nice day\n".
Pete>  b) loop retrying close() until it works.
Pete>  c) do (a) then (b).

I must disagree with you.  We run the Andrew File System (AFS), which
has client-side caching with write-on-close semantics.  If something
goes wrong at close() time, a well-written application can
actually do something useful - such as sending an alert, or letting
the user know the action failed.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-18 14:42                                         ` Patrick J. LoPresti
  2002-07-18 15:13                                           ` Richard B. Johnson
@ 2002-07-18 23:47                                           ` Albert D. Cahalan
  2002-07-19 16:12                                             ` Patrick J. LoPresti
  1 sibling, 1 reply; 82+ messages in thread
From: Albert D. Cahalan @ 2002-07-18 23:47 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel

Patrick J. LoPresti writes:

> Failures happen.  They can happen on write(), they can happen on
> close(), and they can happen on any system call for which the API
> allows it.  There is no difference!  Your application either deals
> with them and is correct or fails to deal with them and is broken.
> 
> If the API allows an error return, you *must* check for it, period.
> This includes "impossible" errors.  You may think it is impossible for
> gettimeofday() to return an error in some case, but if it ever did,
> you should darn well want to know about it right away.
> 
> If you are that convinced that close() can not return an error in your
> particular application (e.g., because you "know" you are using a local
> disk, or the file descriptor is read-only), then treat such errors
> like assertion failures.  Because that is what they are.
> 
> Checking system calls for errors, always, is fundamental to writing
> reliable code.  Failing to check them is shoddy and amateurish
> programming.  It is amazing that so many people would argue this
> point.  Then again, maybe not, given how bad most software is...

You check printf() and fprintf() then? Like this?

///////////////////////////////////////////
void err_print(int err){
  const char *msg;
  int rc;

  msg = strerror(err);
  if(!msg) err_print(errno);

  do{
    rc = fprintf(stderr,"Problem: %s\n",msg);
  }while(rc<0 && errno==EINTR);
  if(rc<0) err_print(errno);
}
///////////////////////////////////////////

Get off your high horse.
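
For comparison, a non-recursive sketch of checking stdio output.  ISO C
specifies that fprintf() returns a negative value when an output error
occurs, and fflush() returns EOF if buffered data cannot be written.
The helper name report_problem() is hypothetical; it reports failure to
its caller instead of trying to print a second message.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print one diagnostic line to out.
 * Returns 0 on success, -1 if the stream reported an output error. */
int report_problem(FILE *out, int err)
{
	if (fprintf(out, "Problem: %s\n", strerror(err)) < 0)
		return -1;
	/* Buffered streams may only notice the failure when flushed. */
	if (fflush(out) != 0)
		return -1;
	return 0;
}
```

What the caller does with a -1 (exit non-zero, try syslog, give up) is a
policy decision the sketch leaves open.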

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-18 20:09                                       ` Hildo.Biersma
@ 2002-07-18 23:55                                         ` Pete Zaitcev
  2002-07-19 11:31                                           ` Hildo.Biersma
  2002-07-23 22:19                                           ` Bill Davidsen
  0 siblings, 2 replies; 82+ messages in thread
From: Pete Zaitcev @ 2002-07-18 23:55 UTC (permalink / raw)
  To: Hildo.Biersma; +Cc: Pete Zaitcev, linux-kernel

> Date: Thu, 18 Jul 2002 16:09:51 -0400 (EDT)
> From: Hildo.Biersma@morganstanley.com

> Pete> The problem with errors from close() is that NOTHING SMART can be
> Pete> done by the application when it receives it. An application can:
> 
> Pete>  a) print a message "Your data are lost, have a nice day\n".
> Pete>  b) loop retrying close() until it works.
> Pete>  c) do (a) then (b).
> 
> I must disagree with you.  We run the Andrew File System (AFS), which
> has client-side caching with write-on-close semantics.  If something
> goes wrong at close() time, a well-written application can
> actually do something useful - such as sending an alert, or letting
> the user know the action failed.

The above is an example of an application covering up for
a filesystem that breaks the general expectations for the
operating environment. Remember your precursor with a broken
driver who received his beating a couple of months ago.
He also had an application which processed his errors from
close just fine. A workaround can be done in every specific
instance, but it does not make this practice any smarter.

What AFS designers should have done if they had a brain larger
than a pea was:

 1. Make close() block indefinitely, retrying writes.
    Allow overlapping writes, let clients sort it out.
 2. Provide an ioctl to flush writes before close() or
    make fsync() work right. Your "smart" applications have had
    to use that, so that no ambiguity existed between tearing down
    the descriptor and writing out the data.

This way, naive applications such as cat and cc would
continue to work. There is no reason to penalize them just
because some application _could_ possibly post idiotic alerts
(Abort, Retry, Fail).

-- Pete

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-18 23:55                                         ` Pete Zaitcev
@ 2002-07-19 11:31                                           ` Hildo.Biersma
  2002-07-19 16:16                                             ` Pete Zaitcev
  2002-07-23 22:19                                           ` Bill Davidsen
  1 sibling, 1 reply; 82+ messages in thread
From: Hildo.Biersma @ 2002-07-19 11:31 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: linux-kernel

>>>>> "Pete" == Pete Zaitcev <zaitcev@redhat.com> writes:

>> Date: Thu, 18 Jul 2002 16:09:51 -0400 (EDT)
>> From: Hildo.Biersma@morganstanley.com

Pete> The problem with errors from close() is that NOTHING SMART can be
Pete> done by the application when it receives it. An application can:
>> 
Pete> a) print a message "Your data are lost, have a nice day\n".
Pete> b) loop retrying close() until it works.
Pete> c) do (a) then (b).
>> 
>> I must disagree with you.  We run the Andrew File System (AFS), which
>> has client-side caching with write-on-close semantics.  If something
>> goes wrong at close() time, a well-written application can
>> actually do something useful - such as sending an alert, or letting
>> the user know the action failed.

Pete> The above is an example of an application covering up for
Pete> a filesystem that breaks the general expectations for the
Pete> operating environment. Remember your precursor with a broken
Pete> driver who received his beating a couple of months ago.
Pete> He also had an application which processed his errors from
Pete> close just fine. A workaround can be done in every specific
Pete> instance, but it does not make this practice any smarter.

I agree in general, but you should realize that there are valid
reasons why Unix filesystem semantics are sometimes violated.

We have slightly over 8,000 Unix hosts using the same networked
filesystem against the same set of file-servers.  This is only
feasible if you minimize the number of client<->server interactions.

This is done in two ways:
- persistent (disk-based) client-side caching, where the server will
  let a client know if a file is updated and needs to be evicted from
  the client's cache
- write-on-close semantics for files

Pete> What AFS designers should have done if they had a brain larger
Pete> than a pea was:

Pete>  1. Make close() block indefinitely, retrying writes.
Pete>     Allow overlapping writes, let clients sort it out.

None of these things work, as security may be denied, a volume may be
taken off-line, or having overlapping writes from clients increases
the amount of client<->server interaction.

Pete>  2. Provide an ioctl to flush writes before close() or
Pete>     make fsync() work right. Your "smart" applications have had
Pete>     to use that, so that no ambiguity existed between tearing down
Pete>     the descriptor and writing out the data.

This is provided - sync, fsync, msync all work.

Pete> This way, naive applications such as cat and cc would
Pete> continue to work. There is no reason to penalize them just
Pete> because some application _could_ possibly post idiotic alerts
Pete> (Abort, Retry, Fail).

That's where the trade-offs come in.  The AFS designers found that
relaxing the Unix filesystem semantics vastly improves scalability.

Many of the high-performance filesystems (not XFS, the _really_
high-performance filesystems) that you run on supercomputers also
violate Unix semantics in various ways.  Yes, that breaks naive
apps, but that trade-off is generally accepted.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-18 23:47                                           ` Albert D. Cahalan
@ 2002-07-19 16:12                                             ` Patrick J. LoPresti
  2002-07-19 16:24                                               ` Joseph Malicki
  0 siblings, 1 reply; 82+ messages in thread
From: Patrick J. LoPresti @ 2002-07-19 16:12 UTC (permalink / raw)
  To: Albert D. Cahalan; +Cc: linux-kernel

"Albert D. Cahalan" <acahalan@cs.uml.edu> writes:

> You check printf() and fprintf() then? Like this?
> 
> ///////////////////////////////////////////
> void err_print(int err){
>   const char *msg;
>   int rc;
> 
>   msg = strerror(err);
>   if(!msg) err_print(errno);
> 
>   do{
>     rc = fprintf(stderr,"Problem: %s\n",msg);
>   }while(rc<0 && errno==EINTR);
>   if(rc<0) err_print(errno);
> }
> ///////////////////////////////////////////

Wow, I hardly know where to begin.

I could point out that, at least according to my man page, fprintf()
returns the number of characters printed; it tells you nothing about
errors.  Also, fprintf() is a library function, not a system call, so
you cannot expect it to put anything meaningful in errno.  (I am not
sure whether these mistakes were part of your sarcasm or your
ignorance.)

Or I could ask, what part of "assertion failure" did you not
understand?  Yes, the code above is idiotic.  But checking that
fprintf() did not return zero, and calling abort() otherwise, is often
the right thing to do.

Yes, I exaggerated.  There are times when you can reasonably skip
checking a system call for errors; namely, when you have coded
defensively enough that any error can do no harm.  If you can show
that the rest of your program operates correctly whether the call
succeeded or not, then you can skip the error check.

But my main point still holds: You should *not* skip error checks
because you "know" that the error is "impossible".  It takes little
experience with real-world systems to learn that the "impossible"
happens with alarming frequency.  And when it does, aborting
immediately is much better than proceeding, because your subsequent
code is unpredictable and therefore dangerous when your assumptions
have been violated.

Once you have taken the hit of making a system call, the additional
cost of checking the return value is irrelevant.  So do yourself and
your users a favor and add the checks.
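
A minimal sketch of the "check it, and abort if the impossible happened"
style, for a write to a state file.  This is an illustration, not code
from anyone in this thread: write_or_die() is a hypothetical name, and
the sketch relies on ISO C's rule that fprintf() returns a negative
value on an output error.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write one line to fp, abort on any output error. */
void write_or_die(FILE *fp, const char *line)
{
    /* ISO C: fprintf() returns a negative value if an output error occurs. */
    if (fprintf(fp, "%s\n", line) < 0) {
        perror("fprintf");
        abort();
    }
    /* Buffered data may only hit the error when it is flushed, so check
     * the flush as well before trusting that the line was written. */
    if (fflush(fp) == EOF) {
        perror("fflush");
        abort();
    }
}
```

A caller would use write_or_die(state_fp, "state=1") wherever losing
the line silently would be worse than dying on the spot.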

> Get off your high horse.

Actually, I would rather give others a lift to join me.  The view is
pretty good from up here.

 - Pat

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-19 11:31                                           ` Hildo.Biersma
@ 2002-07-19 16:16                                             ` Pete Zaitcev
  0 siblings, 0 replies; 82+ messages in thread
From: Pete Zaitcev @ 2002-07-19 16:16 UTC (permalink / raw)
  To: Hildo.Biersma; +Cc: Pete Zaitcev, linux-kernel

> Date: Fri, 19 Jul 2002 07:31:54 -0400 (EDT)
> From: Hildo.Biersma@morganstanley.com

> Pete>  1. Make close block indefinitely, retrying writes.
> Pete>     Allow overlapping writes, let clients sort it out.
> 
> None of these things work, as security may be denied, a volume may be
> taken off-line, or having overlapping writes from clients increases
> the amount of client<->server interaction.
> 
> Pete>  2. Provide an ioctl to flush writes before close() or
> Pete>     make fsync() work right. Your "smart" applications have had
> Pete>     to use that, so that no ambiguity existed between tearing down
> Pete>     the descriptor and writing out the data.
> 
> This is provided - sync, fsync, msync all work.

It is unfair for you to separate 1. and 2. They should work
together. Remember, you said "return error from close is
useful BECAUSE my smart application may deal with it."
If fsync works, the argument does not hold water at all.
Your smart application can do fsync just as easily.
If it does, it does not need the return code from close.
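
The pattern being argued for, as a rough sketch: flush explicitly with
fsync(), so that a write-back failure is reported while the descriptor
is still open, and treat close() as plain teardown.  save_durably() is
a hypothetical name, and whether fsync() really reports every deferred
write error depends on the filesystem in question; take this as an
illustration of the call order, nothing more.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 on success, -1 with errno set on failure. */
int save_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int saved;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: try again */
            goto fail;
        }
        p += n;                 /* short write: advance and keep going */
        len -= (size_t)n;
    }

    /* Ask for the data to be pushed out now, while we can still react. */
    if (fsync(fd) < 0)
        goto fail;

    /* close() is still checked, but it is no longer the only signal. */
    return close(fd);

fail:
    saved = errno;              /* close() may overwrite errno */
    close(fd);
    errno = saved;
    return -1;
}
```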

> That's where the trade-offs come in.  The AFS designers found that
> relaxing the Unix filesystem semantics vastly improves scalability.

I know about the improvements. They are applicable to NFS too.
What I am trying to tell you is that there was NO reason to break
close in particular. Even on ancient AIXes without fsync they
could have used an ioctl.

-- Pete

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-19 16:12                                             ` Patrick J. LoPresti
@ 2002-07-19 16:24                                               ` Joseph Malicki
  2002-07-19 18:48                                                 ` Patrick J. LoPresti
  2002-07-20 14:42                                                 ` Andries Brouwer
  0 siblings, 2 replies; 82+ messages in thread
From: Joseph Malicki @ 2002-07-19 16:24 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel

Those mistakes are your ignorance.  The manpage is wrong.  It does return
-1 on error.  Also, errno is in libc, not the kernel.  Many library
functions do in fact use errno.

And it's not an issue of whether an error is "impossible".  It's whether or
not you would do anything if it failed.  It's not totally uncommon to
actually not care whether or not it succeeds, but a valiant attempt is
enough, such as in the case of printf.

Sure, if you require an event to be successful to continue you should always
check it.  And yes, it's nice to print an error message on close sometimes,
if something is critical.  But the question to ask is what you would
actually _DO_ about an error... if the answer is nothing,
then why check it?

-joe




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-19 16:24                                               ` Joseph Malicki
@ 2002-07-19 18:48                                                 ` Patrick J. LoPresti
  2002-07-19 19:25                                                   ` Lars Marowsky-Bree
  2002-07-20 14:42                                                 ` Andries Brouwer
  1 sibling, 1 reply; 82+ messages in thread
From: Patrick J. LoPresti @ 2002-07-19 18:48 UTC (permalink / raw)
  To: Joseph Malicki; +Cc: linux-kernel

"Joseph Malicki" <jmalicki@starbak.net> writes:

> Those mistakes are your ignorance.  The manpage is wrong.  It does
> return -1 on error.  Also, errno is in libc, not the kernel.  Many
> library functions do in fact use errno.

Sigh.  OK, so I should have read SUSv2 instead of my local man page.
Mea culpa.  (Once upon a time, the buffered I/O libc routines made no
promises about which system calls they made or when.  On such systems,
errno after printf() had no guaranteed semantics.)

> And it's not an issue of whether an error is "impossible".  It's
> whether or not you would do anything if it failed.  It's not totally
> uncommon to actually not care whether or not it succeeds, but a
> valiant attempt is enough, such as in the case of printf.

If it is a diagnostic printf() to the screen, sure.  But an fprintf()
to update some state file on disk is a different matter entirely.

> Sure, if you require an event to be successful to continue you
> should always check it.  And yes, it's nice to print an error
> message on close sometimes, if something is critical.  But the
> question to ask is what you would actually _DO_ about an error... if
> the answer is nothing, then why check it?

To abort, plain and simple.  As I said, if you really think your call
to close() or gettimeofday() or whatever can never fail, you are much
better off dying immediately than proceeding on the assumption that it
succeeded.

Of course, checking errors in order to handle them sanely is a good
thing.  Nobody is arguing that.  What I am arguing is that failing to
check errors when they can "never happen" is wrong.
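
For calls that "can never fail", the check can be made cheap to write.
Below is a sketch, assuming a made-up MUST() macro and a made-up
seconds_now() wrapper (neither is an existing API): the macro aborts
with file and line if the wrapped call returns -1.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Hypothetical macro: abort with file and line if a call returns -1. */
#define MUST(call)                                                  \
    do {                                                            \
        if ((call) == -1) {                                         \
            fprintf(stderr, "%s:%d: %s failed: %s\n",               \
                    __FILE__, __LINE__, #call, strerror(errno));    \
            abort();                                                \
        }                                                           \
    } while (0)

/* Example use: gettimeofday() "cannot fail" here, but is checked anyway. */
long seconds_now(void)
{
    struct timeval tv;

    MUST(gettimeofday(&tv, NULL));
    return (long)tv.tv_sec;
}
```

The design choice is fail-fast: the program stops at the first violated
assumption instead of running on with garbage in tv.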

Anyway, back to lurker mode for me.

 - Pat

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-19 18:48                                                 ` Patrick J. LoPresti
@ 2002-07-19 19:25                                                   ` Lars Marowsky-Bree
  2002-07-19 19:30                                                     ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 82+ messages in thread
From: Lars Marowsky-Bree @ 2002-07-19 19:25 UTC (permalink / raw)
  To: Patrick J. LoPresti, Joseph Malicki; +Cc: linux-kernel

On 2002-07-19T14:48:44,
   "Patrick J. LoPresti" <patl@curl.com> said:

> Of course, checking errors in order to handle them sanely is a good
> thing.  Nobody is arguing that.  What I am arguing is that failing to
> check errors when they can "never happen" is wrong.

Actually, checking for _all_ even remotely possible and checkable error
conditions (if the check doesn't incur an intolerable overhead) is a very very
important requirement for writing high quality code; even if it isn't "fault
tolerant" (because it may not know how to recover, as with the ill-defined
semantics of close() returning error), it will at least be "fail-fast": giving
an error message close to the cause and terminating in a co-ordinated manner
before corrupting data.

It troubles me deeply that some people hacking on the Linux kernel do not
consider this a good thing.

And with that, I conclude my point and step out of the discussion for good.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Immortality is an adequate definition of high availability for me.
	--- Gregory F. Pfister


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-19 19:25                                                   ` Lars Marowsky-Bree
@ 2002-07-19 19:30                                                     ` Arnaldo Carvalho de Melo
  2002-07-19 19:45                                                       ` Joseph Malicki
  0 siblings, 1 reply; 82+ messages in thread
From: Arnaldo Carvalho de Melo @ 2002-07-19 19:30 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Patrick J. LoPresti, Joseph Malicki, linux-kernel

Em Fri, Jul 19, 2002 at 09:25:24PM +0200, Lars Marowsky-Bree escreveu:
> On 2002-07-19T14:48:44,
>    "Patrick J. LoPresti" <patl@curl.com> said:
> 
> > Of course, checking errors in order to handle them sanely is a good
> > thing.  Nobody is arguing that.  What I am arguing is that failing to
> > check errors when they can "never happen" is wrong.
> 
> Actually, checking for _all_ even remotely possible and checkable error
> conditions (if the check doesn't incur an intolerable overhead) is a very very
> important requirement for writing high quality code; even if it isn't "fault

If the function is not to be checked for errors, let's make it return void and
be done with it. There are few _exceptions_, but one has to understand the
meaning of that word 8)

- Arnaldo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-19 19:30                                                     ` Arnaldo Carvalho de Melo
@ 2002-07-19 19:45                                                       ` Joseph Malicki
  2002-07-19 19:55                                                         ` Arnaldo Carvalho de Melo
                                                                           ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Joseph Malicki @ 2002-07-19 19:45 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Lars Marowsky-Bree
  Cc: Patrick J. LoPresti, linux-kernel

It's an issue when it MIGHT be important.  Such as, fprintf to an important
data file should be checked, fprintf to stderr is usually cool not to check.

People are going on the assumption that ignoring an error to a system call
will interfere with program operation or corrupt data - which is NOT
necessarily true.  Sure many people write programs that way.  But it is
quite often that if something fails, you don't particularly care, and you
know, with certainty, that it does not materially affect the operation of
your program.  For instance, should shutdown fail just because it couldn't
write a message to everyone's console?  That would be wonderful.
Administrator wants to shut down system because it is broken - but since a
programmer follows your mantras, the system CANNOT
successfully shutdown anyway because then it wouldn't be "reliable".

-joe



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-19 19:45                                                       ` Joseph Malicki
@ 2002-07-19 19:55                                                         ` Arnaldo Carvalho de Melo
  2002-07-20 18:25                                                         ` Bernd Eckenfels
  2002-07-20 23:06                                                         ` Sandy Harris
  2 siblings, 0 replies; 82+ messages in thread
From: Arnaldo Carvalho de Melo @ 2002-07-19 19:55 UTC (permalink / raw)
  To: Joseph Malicki; +Cc: Lars Marowsky-Bree, Patrick J. LoPresti, linux-kernel

Em Fri, Jul 19, 2002 at 03:45:40PM -0400, Joseph Malicki escreveu:
> programmer follows your mantras, the system CANNOT
> successfully shutdown anyway because then it wouldn't be "reliable".

Oh well, look at the word _exceptions_ in my post.

- Arnaldo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-17  1:23                                       ` Linus Torvalds
  2002-07-17 11:51                                         ` Matthias Andree
@ 2002-07-20  8:00                                         ` Florian Weimer
  2002-07-20 16:45                                           ` Linus Torvalds
  1 sibling, 1 reply; 82+ messages in thread
From: Florian Weimer @ 2002-07-20  8:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> Yes, EAGAIN doesn't really work as a close return value, simply because
> _nobody_ expects that (and leaving the file descriptor open after a
> close() is definitely unexpected, ie people can very validly complain
> about buggy behaviour).

Returning an error and still doing the operation is slightly awkward.
Are there any other syscalls which do similar things?

Of course, a significant portion of TCP related code would leak
descriptors like hell if the behavior of close() is changed (there are
quite a few protocols which do not avoid race conditions resulting in
ECONNRESET connection teardown).

-- 
Florian Weimer 	                  Weimer@CERT.Uni-Stuttgart.DE
University of Stuttgart           http://CERT.Uni-Stuttgart.DE/people/fw/
RUS-CERT                          fax +49-711-685-5898

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-19 16:24                                               ` Joseph Malicki
  2002-07-19 18:48                                                 ` Patrick J. LoPresti
@ 2002-07-20 14:42                                                 ` Andries Brouwer
  1 sibling, 0 replies; 82+ messages in thread
From: Andries Brouwer @ 2002-07-20 14:42 UTC (permalink / raw)
  To: Joseph Malicki; +Cc: Patrick J. LoPresti, linux-kernel

On Fri, Jul 19, 2002 at 12:24:33PM -0400, Joseph Malicki wrote:

> Those mistakes are your ignorance.  The manpage is wrong.
> It does return -1 on error.

Yes, you are right (or, at least, "a negative value").
Now you deserve a beating for noting that there is a bug on
a man page without submitting a correction, or at least
telling the maintainer. (Yes, that's me.)

> Sure, if you require an event to be successful to continue you should always
> check it.  And yes, it's nice to print an error message on close sometimes,
> if something is critical.  But the question to ask is what you would
> actually _DO_ about an error... if the answer is nothing,
> then why check it?

But here you are wrong. Even if the program doesn't know what to do,
the user will want to know about it. If I make a backup and some error
occurs then I would be very unhappy if the program were silent about it.

Andries
aeb@cwi.nl

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-20  8:00                                         ` Florian Weimer
@ 2002-07-20 16:45                                           ` Linus Torvalds
  2002-07-26  0:06                                             ` EFAULT vs. SIGSEGV [was Re: close return value] Pavel Machek
  0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2002-07-20 16:45 UTC (permalink / raw)
  To: Florian Weimer; +Cc: David S. Miller, linux-kernel



On Sat, 20 Jul 2002, Florian Weimer wrote:
>
> Returning an error and still doing the operation is slightly awkward.
> Are there any other syscalls which do similar things?

mmap(MAP_FIXED) may have already unmapped any underlying old area if an
error occurs.

And EFAULT may have strange behaviour for left-over stuff. If I remember
correctly, at some point, for example, EFAULT on a write to a TCP socket
(if the fault happened in the middle) would still send out the full-sized
packet zero-padded, because not doing so would have screwed up the state
machine.

(But EFAULT is a special case in general, it's documented to be undefined
behaviour).

I can't think of any others, but at least close() isn't _completely_
alone.

And as you say, we really cannot change it anyway, even if we wanted to
(which I'm personally convinced we do not).

		Linus


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-19 19:45                                                       ` Joseph Malicki
  2002-07-19 19:55                                                         ` Arnaldo Carvalho de Melo
@ 2002-07-20 18:25                                                         ` Bernd Eckenfels
  2002-07-20 23:06                                                         ` Sandy Harris
  2 siblings, 0 replies; 82+ messages in thread
From: Bernd Eckenfels @ 2002-07-20 18:25 UTC (permalink / raw)
  To: linux-kernel

In article <000e01c22f5c$dce9c600$da5b903f@starbak.net> you wrote:
> It's an issue when it MIGHT be important.  Such as, fprintf to an important
> data file should be checked, fprintf to stderr is usually cool not to check.

well, writing to stdout/stderr can fail with a normal IO Error. It depends
on what kind of data you actually output. If it is a log message and you are
sure you do not need intact audit trails you might ignore it. If you write a
pipe tool (e.g. sort) you better check that write state and terminate.
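
As a sketch of what "check that write state" can look like in a filter
(copy_lines() is a hypothetical helper, and a real sort does far more
than copy): the point is only that every output call and the final
flush are checked, so a full disk or a closed pipe ends the program
with an error instead of silently truncating the output.

```c
#include <stdio.h>

/* Hypothetical helper for a filter: copy lines from in to out.
 * Returns 0 on success, -1 if reading or writing failed. */
int copy_lines(FILE *in, FILE *out)
{
    char line[4096];

    while (fgets(line, sizeof line, in) != NULL) {
        if (fputs(line, out) == EOF)
            return -1;
    }
    /* fgets() returns NULL on both end-of-file and error; tell them apart. */
    if (ferror(in))
        return -1;
    /* Errors on buffered output often surface only at flush time. */
    if (fflush(out) == EOF)
        return -1;
    return 0;
}
```

A main() built on this would exit non-zero when
copy_lines(stdin, stdout) returns -1.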

Greetings
Bernd

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-19 19:45                                                       ` Joseph Malicki
  2002-07-19 19:55                                                         ` Arnaldo Carvalho de Melo
  2002-07-20 18:25                                                         ` Bernd Eckenfels
@ 2002-07-20 23:06                                                         ` Sandy Harris
  2 siblings, 0 replies; 82+ messages in thread
From: Sandy Harris @ 2002-07-20 23:06 UTC (permalink / raw)
  To: linux-kernel

Joseph Malicki wrote:
> 
> It's an issue when it MIGHT be important.  Such as, fprintf to an important
> data file should be checked, fprintf to stderr is usually cool not to check.

That's an application issue. From the kernel point of view, we cannot tell
which errors matter to the application, so we just return error on any we
can detect and let the app worry about it.

> People are going on the assumption that ignoring an error to a system call
> will interfere with program operation or corrupt data - which is NOT
> necessarily true.  Sure many people write programs that way.  But it is
> quite often that if something fails, you don't particularly care, and you
> know, with certainty, that it does not materially affect the operation of
> your program.  For instance, should shutdown fail just because it couldn't
> write a message to everyone's console?

Again, that's an application issue; shutdown should succeed no matter what
files or devices become inaccessible, so it should be written to continue
despite error codes, likely with a console message about the error.

From the kernel point of view, the only question is whether to return an
error when it cannot write where it is asked to. Of course it must.

I don't see why anyone is bothering to argue on the kernel list about
what applications should do with error returns. That's not our problem.

All we need to worry about is:

	what errors are possible,
	whether they can be detected
	whether any merit a panic or kernel logging
	what to return to the application in each case

If the kernel gets those right, it has done its part.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)
  2002-07-17  1:45                               ` Alan Cox
  2002-07-17 18:24                                 ` Zack Weinberg
@ 2002-07-22 16:42                                 ` Rogier Wolff
  1 sibling, 0 replies; 82+ messages in thread
From: Rogier Wolff @ 2002-07-22 16:42 UTC (permalink / raw)
  To: Alan Cox; +Cc: Zack Weinberg, linux-kernel

Alan Cox wrote:
> > And watch it come back with an error again, repeat ad infinitum?
> 
> The use of intelligence doesn't help. Come on I know you aren't a cobol
> programmer. Check for -EBADF ...

Huh? My mgetty/sendfax setup did something interesting lately.

I had not finished installing it, and I got a fax. It received it into
/tmp, tried moving it to /var/spool/fax/incoming, failed, and left the
tempfile in /tmp. It then mailed me about the received fax in /tmp.

This is EXACTLY the intelligent behaviour that an application writer
can choose when checking for error codes. Especially "don't unlink
your tempfiles" is easy if you get errors on conversion or copying....
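
A small sketch of that "keep the tempfile if delivery failed" idea.
This is not mgetty's actual code; deliver() is a hypothetical helper
and the paths are whatever the caller passes in.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: move a finished temp file into its final place.
 * Returns 0 on success.  On failure it returns -1 and deliberately
 * leaves tmp_path alone so the data can still be recovered by hand. */
int deliver(const char *tmp_path, const char *final_path)
{
    if (rename(tmp_path, final_path) == 0)
        return 0;

    fprintf(stderr, "cannot move %s to %s: %s; file left in place\n",
            tmp_path, final_path, strerror(errno));
    return -1;
}
```

Note that rename() fails across filesystems, so a real tool moving
from /tmp into a spool directory would need a copy step as a fallback,
with the same rule: only unlink the original once the copy has been
verified.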

			Roger. 


-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots. 
* There are also old, bald pilots. 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: close return value
  2002-07-18 23:55                                         ` Pete Zaitcev
  2002-07-19 11:31                                           ` Hildo.Biersma
@ 2002-07-23 22:19                                           ` Bill Davidsen
  1 sibling, 0 replies; 82+ messages in thread
From: Bill Davidsen @ 2002-07-23 22:19 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: Hildo.Biersma, linux-kernel

On Thu, 18 Jul 2002, Pete Zaitcev wrote:


>  1. Make close block indefinitely, retrying writes.

We went through this with sync() a while ago. You don't want things to
loop forever. That's what status returns are for, if the program wants to
retry it can. Consider the f/s being out of space, the write can't work,
the process can't die, the f/s can't unmount because there's i/o in
progress, the system can't shutdown cleanly.

Let the program handle the problems, and decide what to retry.
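
A sketch of what "let the program decide what to retry" might look
like in user space (write_bounded() and its retry budget are made up
for illustration): transient errors are retried a bounded number of
times, and everything else, such as ENOSPC, goes straight back to the
caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: write all of buf, retrying transient failures at
 * most max_retries times.  Returns 0 on success, -1 with errno set. */
int write_bounded(int fd, const void *buf, size_t len, int max_retries)
{
    const char *p = buf;
    int retries = 0;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n > 0) {
            p += n;             /* progress, possibly a short write */
            len -= (size_t)n;
            continue;
        }
        if (n == 0)
            errno = EAGAIN;     /* no progress: count it as transient */
        /* Only transient errors are retried, and only a bounded number
         * of times; ENOSPC, EIO, EBADF and the like are returned. */
        if ((errno != EINTR && errno != EAGAIN) || retries++ >= max_retries)
            return -1;
    }
    return 0;
}
```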

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
  for (;;) exit(0);


^ permalink raw reply	[flat|nested] 82+ messages in thread

* EFAULT vs. SIGSEGV [was Re: close return value]
  2002-07-20 16:45                                           ` Linus Torvalds
@ 2002-07-26  0:06                                             ` Pavel Machek
  2002-07-26 14:01                                               ` (no subject) Alexis Deruelle
  0 siblings, 1 reply; 82+ messages in thread
From: Pavel Machek @ 2002-07-26  0:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Florian Weimer, David S. Miller, linux-kernel

Hi!

> > Returning an error and still doing the operation is slightly awkward.
> > Are there any other syscalls which do similar things?
> 
> mmap(MAP_FIXED) may have already unmapped any underlying old area if an
> error occurs.
> 
> And EFAULT may have strange behaviour for left-over stuff. If I remember
> correctly, at some point, for example, EFAULT on a write to a TCP socket
> (if the fault happened in the middle) would still send out the full-sized
> packet zero-padded, because not doing so would have screwed up the state
> machine.
> 
> (But EFAULT is a special case in general, it's documented to be undefined
> behaviour).

Some time ago you said you'd agree to sending SIGSEGV in addition to
returning EFAULT. What about the following patch? It should make the
difference between VSYSCALL and normal syscall smaller...

								Pavel

--- clean.2.5/arch/i386/mm/fault.c	Thu Jul 25 22:21:08 2002
+++ linux/arch/i386/mm/fault.c	Thu Jul 25 22:21:24 2002
@@ -305,6 +305,15 @@
 no_context:
 	/* Are we prepared to handle this kernel fault?  */
 	if ((fixup = search_exception_table(regs->eip)) != 0) {
+		tsk->thread.cr2 = address;
+		tsk->thread.error_code = error_code;
+		tsk->thread.trap_no = 14;
+		info.si_signo = SIGSEGV;
+		info.si_errno = 0;
+		/* info.si_code has been set above */
+		info.si_addr = (void *)address;
+		force_sig_info(SIGSEGV, &info, tsk);
+
 		regs->eip = fixup;
 		return;
 	}

-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org

^ permalink raw reply	[flat|nested] 82+ messages in thread

* (no subject)
  2002-07-26  0:06                                             ` EFAULT vs. SIGSEGV [was Re: close return value] Pavel Machek
@ 2002-07-26 14:01                                               ` Alexis Deruelle
  0 siblings, 0 replies; 82+ messages in thread
From: Alexis Deruelle @ 2002-07-26 14:01 UTC (permalink / raw)
  To: linux-kernel

unsubscribe linux-kernel


^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2002-07-26 13:52 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20020712162306$aa7d@traf.lcs.mit.edu>
     [not found] ` <mit.lcs.mail.linux-kernel/20020712162306$aa7d@traf.lcs.mit.edu>
2002-07-15 15:22   ` [ANNOUNCE] Ext3 vs Reiserfs benchmarks Patrick J. LoPresti
2002-07-15 17:31     ` Chris Mason
2002-07-15 18:33     ` Matthias Andree
     [not found]     ` <20020715173337$acad@traf.lcs.mit.edu>
     [not found]       ` <mit.lcs.mail.linux-kernel/20020715173337$acad@traf.lcs.mit.edu>
2002-07-15 19:13         ` Patrick J. LoPresti
2002-07-15 20:55           ` Matthias Andree
2002-07-15 21:23             ` Patrick J. LoPresti
2002-07-15 21:38               ` Thunder from the hill
2002-07-16 12:31                 ` Matthias Andree
2002-07-16 15:53                   ` Thunder from the hill
2002-07-16 19:26                     ` Matthias Andree
2002-07-16 19:38                       ` Thunder from the hill
2002-07-16 23:22                         ` close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks) Zack Weinberg
2002-07-17  1:03                           ` Alan Cox
2002-07-16 23:52                             ` close return value David S. Miller
2002-07-17  1:35                               ` Alan Cox
2002-07-17  0:20                                 ` David S. Miller
2002-07-17  1:05                                   ` Linus Torvalds
2002-07-17  1:05                                     ` David S. Miller
2002-07-17  1:23                                       ` Linus Torvalds
2002-07-17 11:51                                         ` Matthias Andree
2002-07-17 17:23                                           ` Andries Brouwer
2002-07-20  8:00                                         ` Florian Weimer
2002-07-20 16:45                                           ` Linus Torvalds
2002-07-26  0:06                                             ` EFAULT vs. SIGSEGV [was Re: close return value] Pavel Machek
2002-07-26 14:01                                               ` (no subject) Alexis Deruelle
     [not found]                                   ` <mailman.1026868201.10433.linux-kernel2news@redhat.com>
2002-07-18  0:01                                     ` close return value Pete Zaitcev
2002-07-18  0:10                                       ` Thunder from the hill
     [not found]                                       ` <mit.lcs.mail.linux-kernel/200207180001.g6I015f02681@devserv.devel.redhat.com>
2002-07-18 14:42                                         ` Patrick J. LoPresti
2002-07-18 15:13                                           ` Richard B. Johnson
2002-07-18 15:32                                             ` Sandy Harris
2002-07-18 23:47                                           ` Albert D. Cahalan
2002-07-19 16:12                                             ` Patrick J. LoPresti
2002-07-19 16:24                                               ` Joseph Malicki
2002-07-19 18:48                                                 ` Patrick J. LoPresti
2002-07-19 19:25                                                   ` Lars Marowsky-Bree
2002-07-19 19:30                                                     ` Arnaldo Carvalho de Melo
2002-07-19 19:45                                                       ` Joseph Malicki
2002-07-19 19:55                                                         ` Arnaldo Carvalho de Melo
2002-07-20 18:25                                                         ` Bernd Eckenfels
2002-07-20 23:06                                                         ` Sandy Harris
2002-07-20 14:42                                                 ` Andries Brouwer
2002-07-18 20:09                                       ` Hildo.Biersma
2002-07-18 23:55                                         ` Pete Zaitcev
2002-07-19 11:31                                           ` Hildo.Biersma
2002-07-19 16:16                                             ` Pete Zaitcev
2002-07-23 22:19                                           ` Bill Davidsen
2002-07-17  0:10                             ` close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks) Zack Weinberg
2002-07-17  1:45                               ` Alan Cox
2002-07-17 18:24                                 ` Zack Weinberg
2002-07-22 16:42                                 ` Rogier Wolff
2002-07-17  8:00                               ` Lars Marowsky-Bree
2002-07-17 15:49                                 ` Thunder from the hill
2002-07-17  2:22                             ` Elladan
2002-07-17  2:54                               ` Thunder from the hill
2002-07-17  3:00                                 ` Elladan
2002-07-17  3:10                                   ` Thunder from the hill
2002-07-17  3:31                                     ` Elladan
2002-07-17  4:17                               ` Stevie O
2002-07-17  4:38                                 ` Elladan
2002-07-17 14:39                                   ` Andreas Schwab
2002-07-17 16:49                                     ` Elladan
2002-07-17 17:43                                       ` Linus Torvalds
2002-07-17 22:07                                         ` Elladan
2002-07-18  9:48                                         ` Ketil Froyn
2002-07-17 17:17                                   ` Andries Brouwer
2002-07-17 17:51                                     ` Richard Gooch
2002-07-17  7:34                               ` close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks Kai Henningsen
2002-07-15 21:59               ` Ketil Froyn
2002-07-15 23:08                 ` Matti Aarnio
2002-07-16 12:33                   ` Matthias Andree
2002-07-15 22:55             ` Alan Cox
2002-07-15 21:58               ` Matthias Andree
2002-07-15 21:14           ` Chris Mason
2002-07-15 21:31             ` Patrick J. LoPresti
2002-07-15 22:12               ` Richard A Nelson
2002-07-16  1:02               ` Lawrence Greenfield
     [not found]                 ` <mit.lcs.mail.linux-kernel/200207160102.g6G12BiH022986@lin2.andrew.cmu.edu>
2002-07-16  1:43                   ` Patrick J. LoPresti
2002-07-16  1:56                     ` Thunder from the hill
2002-07-16 12:47                     ` Matthias Andree
2002-07-16 21:09                     ` James Antill
2002-07-16 12:35             ` Matthias Andree
2002-07-16  7:07     ` Dax Kelson
