* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
[not found] ` <cl2N9-6Bj-9@gated-at.bofh.it>
@ 2009-03-31 21:27 ` Bodo Eggert
2009-04-01 0:06 ` Theodore Tso
0 siblings, 1 reply; 20+ messages in thread
From: Bodo Eggert @ 2009-03-31 21:27 UTC (permalink / raw)
To: Pavel Machek, Artem Bityutskiy, Artem Bityutskiy,
Linux Kernel Mailing List
Pavel Machek <pavel@ucw.cz> wrote:
> My proposal is
>
> rename() stays.
>
> replace(src, bar) is rename that ensures that bar will contain valid
> data after powerfail.
This can be done using implicit logic:
-> E.g. on close(), mark inodes that have not been sync()ed as poisoned.
   (I can think of more sophisticated logic, but ...)
-> On completing the delayed allocations for the inode, unpoison it.
-> Don't commit rename()s if the corresponding inode is poisoned.
Et voilà, everything replace() is supposed to guarantee is guaranteed.
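The three rules above can be modelled in a few lines. This is only a toy userspace sketch under stated assumptions, not kernel code: struct toy_inode and all toy_* functions are hypothetical names invented for illustration, and a real filesystem would have to tie this state to its journal transactions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the proposal; every name here is hypothetical. */
struct toy_inode {
    bool dirty;     /* data written but not yet on media (delayed allocation) */
    bool poisoned;  /* closed while still dirty, so renames must wait */
};

/* write(): the inode now has data that is not on media. */
static void toy_write(struct toy_inode *ino)
{
    assert(ino != NULL);
    ino->dirty = true;
}

/* fsync(): data reaches media, so there is nothing left to poison. */
static void toy_fsync(struct toy_inode *ino)
{
    assert(ino != NULL);
    ino->dirty = false;
    ino->poisoned = false;
}

/* close(): poison the inode if it was never synced. */
static void toy_close(struct toy_inode *ino)
{
    assert(ino != NULL);
    if (ino->dirty)
        ino->poisoned = true;
}

/* Writeback completed the delayed allocations: unpoison the inode. */
static void toy_writeback_done(struct toy_inode *ino)
{
    assert(ino != NULL);
    ino->dirty = false;
    ino->poisoned = false;
}

/* A rename() touching this inode may be committed only if it is not poisoned. */
static bool toy_rename_may_commit(const struct toy_inode *ino)
{
    assert(ino != NULL);
    return !ino->poisoned;
}
```

In this model a rename of a poisoned inode is not refused; its commit is merely held back until toy_writeback_done() has run.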
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-31 21:27 ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Bodo Eggert
@ 2009-04-01 0:06 ` Theodore Tso
2009-04-01 20:52 ` Pavel Machek
0 siblings, 1 reply; 20+ messages in thread
From: Theodore Tso @ 2009-04-01 0:06 UTC (permalink / raw)
To: Bodo Eggert
Cc: Pavel Machek, Artem Bityutskiy, Artem Bityutskiy,
Linux Kernel Mailing List
On Tue, Mar 31, 2009 at 11:27:33PM +0200, Bodo Eggert wrote:
> This can be done using implicit logic:
>
> ->E.g. on close(), mark inodes without being sync()ed as poisoned.
> (I can think of more sophisticated logic, but ...)
> ->On completing the inode with the delayed allocations, unpoison it.
> ->Don't commit rename()s if the corresponding inode is poisoned.
Send us patches if you think it's that easy to do what you are
proposing. I assure you it's not easy.
- Ted
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-04-01 0:06 ` Theodore Tso
@ 2009-04-01 20:52 ` Pavel Machek
2009-04-01 22:58 ` Bodo Eggert
0 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2009-04-01 20:52 UTC (permalink / raw)
To: Theodore Tso, Bodo Eggert, Artem Bityutskiy, Artem Bityutskiy,
Linux Kernel Mailing List
On Tue 2009-03-31 20:06:57, Theodore Tso wrote:
> On Tue, Mar 31, 2009 at 11:27:33PM +0200, Bodo Eggert wrote:
> > This can be done using implicit logic:
> >
> > ->E.g. on close(), mark inodes without being sync()ed as poisoned.
> > (I can think of more sophisticated logic, but ...)
> > ->On completing the inode with the delayed allocations, unpoison it.
> > ->Don't commit rename()s if the corresponding inode is poisoned.
>
> Send us patches if you think it's that easy to do what you are
> proposing. I assure it's not easy.
Well, implementing replace() syscall would be quite easy. Would you
want such a patch?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-04-01 20:52 ` Pavel Machek
@ 2009-04-01 22:58 ` Bodo Eggert
0 siblings, 0 replies; 20+ messages in thread
From: Bodo Eggert @ 2009-04-01 22:58 UTC (permalink / raw)
To: Pavel Machek
Cc: Theodore Tso, Bodo Eggert, Artem Bityutskiy, Artem Bityutskiy,
Linux Kernel Mailing List
On Wed, 1 Apr 2009, Pavel Machek wrote:
> On Tue 2009-03-31 20:06:57, Theodore Tso wrote:
> > On Tue, Mar 31, 2009 at 11:27:33PM +0200, Bodo Eggert wrote:
> > > This can be done using implicit logic:
> > >
> > > ->E.g. on close(), mark inodes without being sync()ed as poisoned.
> > > (I can think of more sophisticated logic, but ...)
> > > ->On completing the inode with the delayed allocations, unpoison it.
> > > ->Don't commit rename()s if the corresponding inode is poisoned.
> >
> > Send us patches if you think it's that easy to do what you are
> > proposing. I assure it's not easy.
>
> Well, implementing replace() syscall would be quite easy. Would you
> want such a patch?
You'd need (minus setting the flag on close) about the same logic:
- detecting if the inode is dirty
- detecting when the inode gets into a clean state
- delaying the commit of the rename() until then
- not replaying parts of the journal
As I understand it, replace() should not behave differently from rename(),
except for making a guarantee. Since this guarantee is desired anyway, I
think making it automatically would be a good thing. Especially until
all applications are rewritten ...
Of course, if you call replace(), you know it's going to be OK, but you could
also use pathconf($configdir, _PC_SYNC_BEFORE_RENAME)
(== 0: sync or barrier implicit (data=ordered)
    1: small chance of data loss due to not syncing (vfat?)
    2: guarantee of data loss without sync (ext4))
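// userspace.c
#include <unistd.h>

/* Returns nonzero if the caller should fsync() before rename(). */
static inline int sync_before_rename(const char *where, int is_important_data)
{
#ifndef _PC_SYNC_BEFORE_RENAME
	/* No way to ask the filesystem: be optimistic for unimportant data. */
	(void)where;
	return is_important_data;
#else
	long ret = pathconf(where, _PC_SYNC_BEFORE_RENAME);

	/* Important data: sync unless the fs reports 0 (implicit sync/barrier). */
	if (is_important_data)
		return ret != 0;
	/* Unimportant data: sync only if loss is otherwise guaranteed. */
	return ret > 1;
#endif
}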
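A caller might use the helper roughly like this. It is a sketch only: save_file() and its parameters are made up for the example, _PC_SYNC_BEFORE_RENAME is the proposed constant and is not assumed to exist in any libc (so the #ifndef branch is what actually compiles), and a short write is simply treated as a failure.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Policy helper from the mail; _PC_SYNC_BEFORE_RENAME is hypothetical. */
static int sync_before_rename(const char *where, int is_important_data)
{
#ifndef _PC_SYNC_BEFORE_RENAME
    (void)where;
    return is_important_data;
#else
    long ret = pathconf(where, _PC_SYNC_BEFORE_RENAME);

    if (is_important_data)
        return ret != 0;
    return ret > 1;
#endif
}

/*
 * Hypothetical helper: write buf to tmp, fsync() only if the policy
 * asks for it, then rename tmp over dst. Returns 0 on success.
 */
static int save_file(const char *dir, const char *tmp, const char *dst,
                     const char *buf, size_t len, int important)
{
    int fd;

    assert(dir != NULL && tmp != NULL && dst != NULL && buf != NULL);
    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len)
        goto fail;
    if (sync_before_rename(dir, important) && fsync(fd) != 0)
        goto fail;
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }
    return rename(tmp, dst);
fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

With the constant undefined, important data is always synced and unimportant data never is, which matches the optimistic default in the helper.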
--
How do I set my laser printer on stun?
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-30 17:19 ` Ric Wheeler
@ 2009-03-30 22:11 ` Pavel Machek
0 siblings, 0 replies; 20+ messages in thread
From: Pavel Machek @ 2009-03-30 22:11 UTC (permalink / raw)
To: Ric Wheeler; +Cc: Artem Bityutskiy, Artem Bityutskiy, Linux Kernel Mailing List
Hi!
>> My proposal is
>>
>> rename() stays.
>>
>> replace(src, bar) is rename that ensures that bar will contain valid
>> data after powerfail.
>>
>
> Surely the only way to "insure" this is to spin up the drive, write the
> meta-data and data back and make sure that it is not held in volatile
> write cache?
Well, no. It "will contain valid data", but that may be _old_ valid data.
So the way to do that would be "wait until you have to spin the disk up
anyway or until a timeout, then write the data first, then do the rename".
AFAICT those are the semantics GNOME (etc.) wants.
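The ordering described here can be sketched as a tiny state machine. Everything below is a hypothetical userspace model (toy_event, toy_replace and toy_replace_step are invented names), meant only to show "nothing happens while idle; on spin-up or timeout, data first, rename second":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the (imaginary) writeback code observes. */
enum toy_event { TOY_IDLE, TOY_DISK_SPUN_UP, TOY_TIMEOUT };

struct toy_replace {
    int next_seq;    /* counter used to record the order of operations */
    int data_seq;    /* 0 until the data blocks have been written */
    int rename_seq;  /* 0 until the rename has been committed */
};

/* Returns true once the replace has reached the media. */
static bool toy_replace_step(struct toy_replace *r, enum toy_event ev)
{
    assert(r != NULL);
    if (r->rename_seq != 0)
        return true;                /* already committed */
    if (ev == TOY_IDLE)
        return false;               /* keep waiting; old data stays valid */
    r->data_seq = ++r->next_seq;    /* write the data blocks first ... */
    r->rename_seq = ++r->next_seq;  /* ... and only then commit the rename */
    return true;
}
```

A power failure before the flush leaves the old file in place; after it, the new one. In neither case is the target empty.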
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 12:50 ` Pavel Machek
2009-03-29 13:00 ` Artem Bityutskiy
@ 2009-03-30 17:19 ` Ric Wheeler
2009-03-30 22:11 ` Pavel Machek
1 sibling, 1 reply; 20+ messages in thread
From: Ric Wheeler @ 2009-03-30 17:19 UTC (permalink / raw)
To: Pavel Machek
Cc: Artem Bityutskiy, Artem Bityutskiy, Linux Kernel Mailing List
Pavel Machek wrote:
>>>> We have a problem that user-space people do not want to
>>>> use 'fsync()', even when they are pointed to their code
>>>> which is doing create/write/rename/close without fsync().
>>>>
>>> Well... they really don't want to spin the disk up for the
>>> fsync(). I'm not sure if fsync() is really sensible operation to use
>>> there.
>>>
>> I'm personally concerned about hand-held, and in case of UBIFS
>> fsync is not too expensive - we work on flash and on fsync() we
>> write back only the stuff belonging to inode in question, and
>> nothing else.
>>
>
> Well, I'm more concerned about spinning disks, having one even in my
> zaurus. And I do believe that fsync() will write more data than
> neccessary even in flash case.
>
>
>>>> 1. truncate/write/close leads to empty files
>>>>
>>> this is buggy.
>>>
>> In FS, or in application?
>>
>
> Application is buggy; no way kernel can help there.
>
>
>>>> 2. create/write/rename leads to empty files
>>>>
>>> ..but this should not be. If we want to make that explicit, we should
>>> provide "replace()" operation; where replace is rename that makes sure
>>> that source file is completely on media before commiting the rename.
>>>
>> Well, OK, we can fsync() before rename, we just need clean rules
>> for this, so that all Linux FSes would follow them. Would be nice
>> to have final agreement on all this stuff.
>>
>
> My proposal is
>
> rename() stays.
>
> replace(src, bar) is rename that ensures that bar will contain valid
> data after powerfail.
>
Surely the only way to "insure" this is to spin up the drive, write the
meta-data and data back and make sure that it is not held in volatile
write cache?
Why would calling this replace be better or more power efficient than
what you need to do today?
ric
>
>>> It is somehow similar to fsync()/rename(), but does not force disk
>>> spin up immediately -- it only inserts "barrier" between data blocks
>>> and rename. (And yes, it should be implemented as fsync()+rename() for
>>> filesystems like xfs. It can be implemented as plain rename for ext3
>>> and ext4 after the fixes...)
>>>
>> Right. But I guess only few file-systems would really implement
>> this, because this is complex.
>>
>
> Complex yes, but at least ext3+ext4+btrfs should, and they really have
> 90% of "market share" :-). ext3 and ext4 implementations are already
> done :-).
> Pavel
>
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 12:26 ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Pavel Machek
2009-03-29 12:42 ` Artem Bityutskiy
@ 2009-03-30 15:58 ` Diego Calleja
1 sibling, 0 replies; 20+ messages in thread
From: Diego Calleja @ 2009-03-30 15:58 UTC (permalink / raw)
To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List
On Sunday 29 March 2009 14:26:00 Pavel Machek wrote:
> ...but this should not be. If we want to make that explicit, we should
> provide "replace()" operation; where replace is rename that makes sure
> that source file is completely on media before commiting the rename.
An "ad Linus-em" counterexample:
"And if we have a Linux-specific magic system call or sync action, it's
going to be even more rarely used than fsync(). Do you think anybody
really uses the OS X FSYNC_FULL ioctl? Nope. Outside of a few databases,
it is almost certainly not going to be used, and fsync() will not be
reliable in general.
So rather than come up with new barriers that nobody will use, filesystem
people should aim to make "badly written" code "just work" unless people
are really really unlucky. Because like it or not, that's what 99% of all
code is."
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 13:57 ` Artem Bityutskiy
@ 2009-03-29 14:00 ` Pavel Machek
0 siblings, 0 replies; 20+ messages in thread
From: Pavel Machek @ 2009-03-29 14:00 UTC (permalink / raw)
To: Artem Bityutskiy; +Cc: Artem Bityutskiy, Linux Kernel Mailing List
On Sun 2009-03-29 16:57:06, Artem Bityutskiy wrote:
> ext Pavel Machek wrote:
>> On Sun 2009-03-29 16:07:35, Artem Bityutskiy wrote:
>>> Pavel Machek wrote:
>>>> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>>>>> Pavel Machek wrote:
>>>>>>>>> 2. create/write/rename leads to empty files
>>>>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>>>>> that source file is completely on media before commiting the rename.
>>>>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>>>>> to have final agreement on all this stuff.
>>>>>> My proposal is
>>>>>>
>>>>>> rename() stays.
>>>>> It stays and:
>>>>>
>>>>> 1. does _not_ fsync
>>>> Does not fsync. If someone wants to make sure one of the files is on
>>>> the disk, he should use replace(). [On non-linux systems, replace()
>>>> should be implemented as fsync/rename in libc or something.]
>>> I would be happy with these rules. But the fact is, application
>>> people just refuse to add fsync before rename. They say that the
>>> FS has to do this. And they say that even Linus supports them,
>>
>> That's good. fsync before rename would be ugly regression (on ext3 at
>> least). We should get them to use replace() syscall, not get them to
>> add fsyncs. [Of course, that means we need replace syscall first. :-)]
>
> I'd say it is better to fix ext3 then.
? I don't get this.
ext3's rename() is already equivalent to proposed replace(). The
problem is that btrfs's and ubifs's renames are not.
So doing an extra fsync() on ext3 is actually a performance regression
-> we do not want applications to randomly add open-coded fsync()s.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 13:40 ` Pavel Machek
@ 2009-03-29 13:57 ` Artem Bityutskiy
2009-03-29 14:00 ` Pavel Machek
0 siblings, 1 reply; 20+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:57 UTC (permalink / raw)
To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List
Pavel Machek wrote:
> On Sun 2009-03-29 16:07:35, Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>>>> Pavel Machek wrote:
>>>>>>>> 2. create/write/rename leads to empty files
>>>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>>>> that source file is completely on media before commiting the rename.
>>>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>>>> to have final agreement on all this stuff.
>>>>> My proposal is
>>>>>
>>>>> rename() stays.
>>>> It stays and:
>>>>
>>>> 1. does _not_ fsync
>>> Does not fsync. If someone wants to make sure one of the files is on
>>> the disk, he should use replace(). [On non-linux systems, replace()
>>> should be implemented as fsync/rename in libc or something.]
>> I would be happy with these rules. But the fact is, application
>> people just refuse to add fsync before rename. They say that the
>> FS has to do this. And they say that even Linus supports them,
>
> That's good. fsync before rename would be ugly regression (on ext3 at
> least). We should get them to use replace() syscall, not get them to
> add fsyncs. [Of course, that means we need replace syscall first. :-)]
I'd say it is better to fix ext3 then.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 13:22 ` Andreas T.Auer
@ 2009-03-29 13:55 ` Artem Bityutskiy
0 siblings, 0 replies; 20+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:55 UTC (permalink / raw)
To: Andreas T.Auer; +Cc: Pavel Machek, Artem Bityutskiy, Linux Kernel Mailing List
Andreas T.Auer wrote:
> On 29.03.2009 15:07 Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>> Does not fsync. If someone wants to make sure one of the files is on
>>> the disk, he should use replace(). [On non-linux systems, replace()
>>> should be implemented as fsync/rename in libc or something.]
> As a user I will avoid using any fs, which requires the tons of
> applications to be changed for a reasonable amount of data safety.
>> I would be happy with these rules. But the fact is, application
>> people just refuse to add fsync before rename.
> Because it slows down the performance.
>> They say that the
>> FS has to do this.
> They say that FS should not write metadata for non-existing data and
> even overwrite "clean" metadata with "dirty" metadata. It is up to the
> fs to decide, whether fsync is needed to achieve this.
Well, this makes sense, but the fact is that FS developers did
not keep this in mind. And when we have been developing UBIFS,
we also naively assumed that user-space would just call fsync
if needed. And it was easier to implement stuff this way. And
it looked like POSIX and other Linux FSes assumed that.
But well, we can change UBIFS behavior, but it would be nice
to have some agreement on all this.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 13:07 ` Artem Bityutskiy
2009-03-29 13:22 ` Andreas T.Auer
@ 2009-03-29 13:40 ` Pavel Machek
2009-03-29 13:57 ` Artem Bityutskiy
1 sibling, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2009-03-29 13:40 UTC (permalink / raw)
To: Artem Bityutskiy; +Cc: Artem Bityutskiy, Linux Kernel Mailing List
On Sun 2009-03-29 16:07:35, Artem Bityutskiy wrote:
> Pavel Machek wrote:
>> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>>> Pavel Machek wrote:
>>>>>>> 2. create/write/rename leads to empty files
>>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>>> that source file is completely on media before commiting the rename.
>>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>>> to have final agreement on all this stuff.
>>>> My proposal is
>>>>
>>>> rename() stays.
>>> It stays and:
>>>
>>> 1. does _not_ fsync
>>
>> Does not fsync. If someone wants to make sure one of the files is on
>> the disk, he should use replace(). [On non-linux systems, replace()
>> should be implemented as fsync/rename in libc or something.]
>
> I would be happy with these rules. But the fact is, application
> people just refuse to add fsync before rename. They say that the
> FS has to do this. And they say that even Linus supports them,
That's good. fsync before rename would be ugly regression (on ext3 at
least). We should get them to use replace() syscall, not get them to
add fsyncs. [Of course, that means we need replace syscall first. :-)]
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 13:07 ` Artem Bityutskiy
@ 2009-03-29 13:22 ` Andreas T.Auer
2009-03-29 13:55 ` Artem Bityutskiy
2009-03-29 13:40 ` Pavel Machek
1 sibling, 1 reply; 20+ messages in thread
From: Andreas T.Auer @ 2009-03-29 13:22 UTC (permalink / raw)
To: Artem.Bityutskiy
Cc: Pavel Machek, Artem Bityutskiy, Linux Kernel Mailing List
On 29.03.2009 15:07 Artem Bityutskiy wrote:
> Pavel Machek wrote:
>>
>> Does not fsync. If someone wants to make sure one of the files is on
>> the disk, he should use replace(). [On non-linux systems, replace()
>> should be implemented as fsync/rename in libc or something.]
>
As a user I will avoid using any fs that requires tons of
applications to be changed for a reasonable amount of data safety.
> I would be happy with these rules. But the fact is, application
> people just refuse to add fsync before rename.
Because it slows down the performance.
> They say that the
> FS has to do this.
They say that the FS should not write metadata for non-existing data, and
certainly should not overwrite "clean" metadata with "dirty" metadata. It is
up to the fs to decide whether fsync is needed to achieve this.
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 13:02 ` Pavel Machek
@ 2009-03-29 13:07 ` Artem Bityutskiy
2009-03-29 13:22 ` Andreas T.Auer
2009-03-29 13:40 ` Pavel Machek
0 siblings, 2 replies; 20+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:07 UTC (permalink / raw)
To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List
Pavel Machek wrote:
> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>>>>> 2. create/write/rename leads to empty files
>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>> that source file is completely on media before commiting the rename.
>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>> to have final agreement on all this stuff.
>>> My proposal is
>>>
>>> rename() stays.
>> It stays and:
>>
>> 1. does _not_ fsync
>
> Does not fsync. If someone wants to make sure one of the files is on
> the disk, he should use replace(). [On non-linux systems, replace()
> should be implemented as fsync/rename in libc or something.]
I would be happy with these rules. But the fact is, application
people just refuse to add fsync before rename. They say that the
FS has to do this. And they say that even Linus supports them,
which is an argument I find difficult to fight against. This is
why I want clean rules.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 13:01 ` Andreas T.Auer
@ 2009-03-29 13:06 ` Artem Bityutskiy
0 siblings, 0 replies; 20+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:06 UTC (permalink / raw)
To: Andreas T.Auer; +Cc: Artem Bityutskiy, Pavel Machek, Linux Kernel Mailing List
Andreas T.Auer wrote:
>
> On 29.03.2009 14:42 Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>
>>>> 1. truncate/write/close leads to empty files
>>> this is buggy.
>> In FS, or in application?
> In application of course. If you rewrite a huge file that way, you have
> a long-time risk of loosing data in a crash, even with sychronous writes.
You know, after reading all these blogs and discussions,
I will not be surprised if someone says this is an FS bug.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 13:00 ` Artem Bityutskiy
@ 2009-03-29 13:02 ` Pavel Machek
2009-03-29 13:07 ` Artem Bityutskiy
0 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2009-03-29 13:02 UTC (permalink / raw)
To: Artem Bityutskiy; +Cc: Artem Bityutskiy, Linux Kernel Mailing List
On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
> Pavel Machek wrote:
>>>>> 2. create/write/rename leads to empty files
>>>> ..but this should not be. If we want to make that explicit, we should
>>>> provide "replace()" operation; where replace is rename that makes sure
>>>> that source file is completely on media before commiting the rename.
>>> Well, OK, we can fsync() before rename, we just need clean rules
>>> for this, so that all Linux FSes would follow them. Would be nice
>>> to have final agreement on all this stuff.
>>
>> My proposal is
>>
>> rename() stays.
>
> It stays and:
>
> 1. does _not_ fsync
Does not fsync. If someone wants to make sure one of the files is on
the disk, he should use replace(). [On non-linux systems, replace()
should be implemented as fsync/rename in libc or something.]
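A minimal sketch of what such a libc-style fallback might look like. replace() is only a proposal in this thread, so replace_fallback() is a hypothetical name, and the code assumes fsync() works on a read-only descriptor, as it does on Linux:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical fallback: force src to media, then rename it over dst.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int replace_fallback(const char *src, const char *dst)
{
    int fd, saved;

    assert(src != NULL && dst != NULL);
    fd = open(src, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0) {
        saved = errno;      /* keep the fsync() error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    return rename(src, dst);
}
```

This gives the guarantee but not the benefit discussed here: it forces the disk to spin up immediately, which is exactly what a native replace() is meant to avoid.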
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 12:42 ` Artem Bityutskiy
2009-03-29 12:50 ` Pavel Machek
@ 2009-03-29 13:01 ` Andreas T.Auer
2009-03-29 13:06 ` Artem Bityutskiy
1 sibling, 1 reply; 20+ messages in thread
From: Andreas T.Auer @ 2009-03-29 13:01 UTC (permalink / raw)
To: Artem Bityutskiy, Pavel Machek
Cc: Artem Bityutskiy, Linux Kernel Mailing List
On 29.03.2009 14:42 Artem Bityutskiy wrote:
> Pavel Machek wrote:
>
>>> 1. truncate/write/close leads to empty files
>>
>> this is buggy.
>
> In FS, or in application?
In the application, of course. If you rewrite a huge file that way, you have
a long-lasting risk of losing data in a crash, even with synchronous writes.
>
>>> 2. create/write/rename leads to empty files
In that case the window of risk is reduced to the rename itself, from the
viewpoint of application developers who don't know about modern
re-ordering filesystems.
>> ..but this should not be. If we want to make that explicit, we should
>> provide "replace()" operation; where replace is rename that makes sure
>> that source file is completely on media before commiting the rename.
It is a hard task to change all the applications; there are a lot of
orphaned projects which are still in use.
> Well, OK, we can fsync() before rename, we just need clean rules
> for this, so that all Linux FSes would follow them. Would be nice
> to have final agreement on all this stuff.
>
This slows down things, but you could also delay the writing of the
metadata pointing to non-existing data. Or is there any use for it after
the crash?
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 12:50 ` Pavel Machek
@ 2009-03-29 13:00 ` Artem Bityutskiy
2009-03-29 13:02 ` Pavel Machek
2009-03-30 17:19 ` Ric Wheeler
1 sibling, 1 reply; 20+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:00 UTC (permalink / raw)
To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List
Pavel Machek wrote:
>>>> 2. create/write/rename leads to empty files
>>> ..but this should not be. If we want to make that explicit, we should
>>> provide "replace()" operation; where replace is rename that makes sure
>>> that source file is completely on media before commiting the rename.
>> Well, OK, we can fsync() before rename, we just need clean rules
>> for this, so that all Linux FSes would follow them. Would be nice
>> to have final agreement on all this stuff.
>
> My proposal is
>
> rename() stays.
It stays and:
1. does _not_ fsync
2. has synchronous fsync added
3. has asynchronous fsync added?
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 12:42 ` Artem Bityutskiy
@ 2009-03-29 12:50 ` Pavel Machek
2009-03-29 13:00 ` Artem Bityutskiy
2009-03-30 17:19 ` Ric Wheeler
2009-03-29 13:01 ` Andreas T.Auer
1 sibling, 2 replies; 20+ messages in thread
From: Pavel Machek @ 2009-03-29 12:50 UTC (permalink / raw)
To: Artem Bityutskiy; +Cc: Artem Bityutskiy, Linux Kernel Mailing List
>>> We have a problem that user-space people do not want to
>>> use 'fsync()', even when they are pointed to their code
>>> which is doing create/write/rename/close without fsync().
>>
>> Well... they really don't want to spin the disk up for the
>> fsync(). I'm not sure if fsync() is really sensible operation to use
>> there.
>
> I'm personally concerned about hand-held, and in case of UBIFS
> fsync is not too expensive - we work on flash and on fsync() we
> write back only the stuff belonging to inode in question, and
> nothing else.
Well, I'm more concerned about spinning disks, having one even in my
zaurus. And I do believe that fsync() will write more data than
necessary even in the flash case.
>>> 1. truncate/write/close leads to empty files
>>
>> this is buggy.
>
> In FS, or in application?
Application is buggy; no way kernel can help there.
>>> 2. create/write/rename leads to empty files
>>
>> ..but this should not be. If we want to make that explicit, we should
>> provide "replace()" operation; where replace is rename that makes sure
>> that source file is completely on media before commiting the rename.
>
> Well, OK, we can fsync() before rename, we just need clean rules
> for this, so that all Linux FSes would follow them. Would be nice
> to have final agreement on all this stuff.
My proposal is
rename() stays.
replace(src, bar) is rename that ensures that bar will contain valid
data after powerfail.
>> It is somehow similar to fsync()/rename(), but does not force disk
>> spin up immediately -- it only inserts "barrier" between data blocks
>> and rename. (And yes, it should be implemented as fsync()+rename() for
>> filesystems like xfs. It can be implemented as plain rename for ext3
>> and ext4 after the fixes...)
>
> Right. But I guess only few file-systems would really implement
> this, because this is complex.
Complex yes, but at least ext3+ext4+btrfs should, and they really have
90% of "market share" :-). ext3 and ext4 implementations are already
done :-).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-29 12:26 ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Pavel Machek
@ 2009-03-29 12:42 ` Artem Bityutskiy
2009-03-29 12:50 ` Pavel Machek
2009-03-29 13:01 ` Andreas T.Auer
2009-03-30 15:58 ` Diego Calleja
1 sibling, 2 replies; 20+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 12:42 UTC (permalink / raw)
To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List
Pavel Machek wrote:
> On Fri 2009-03-27 14:48:10, Artem Bityutskiy wrote:
>> UBIFS has exactly the same properties like ext4 - in case
>> of power cuts:
>>
>> 1. truncate/write/close leads to empty files
>> 2. create/write/rename leads to empty files
>>
>> UBIFS is used in hand-held and and power-cuts are very
>> often there, because users just remove battery often.
>>
>> I realize the "reality is different" argument, and already
>> concluded that we need a similar changes as Theo has done
>> for ext4:
>> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=bf1b69c0db7f9b9d8f02e94d40b19fca8336b991
>> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=f32b730a69bd56c5c9d704d8b75f03e90e290971
>> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=8411e347c3306ed36b8ca88611bf5fbf4d27d705
>>
>> We have a problem that user-space people do not want to
>> use 'fsync()', even when they are pointed to their code
>> which is doing create/write/rename/close without fsync().
>
> Well... they really don't want to spin the disk up for the
> fsync(). I'm not sure if fsync() is really sensible operation to use
> there.
I'm personally concerned about hand-held, and in case of UBIFS
fsync is not too expensive - we work on flash and on fsync() we
write back only the stuff belonging to inode in question, and
nothing else.
>> 1. truncate/write/close leads to empty files
>
> this is buggy.
In FS, or in application?
>> 2. create/write/rename leads to empty files
>
> ..but this should not be. If we want to make that explicit, we should
> provide "replace()" operation; where replace is rename that makes sure
> that source file is completely on media before commiting the rename.
Well, OK, we can fsync() before rename, we just need clean rules
for this, so that all Linux FSes would follow them. Would be nice
to have final agreement on all this stuff.
> It is somehow similar to fsync()/rename(), but does not force disk
> spin up immediately -- it only inserts "barrier" between data blocks
> and rename. (And yes, it should be implemented as fsync()+rename() for
> filesystems like xfs. It can be implemented as plain rename for ext3
> and ext4 after the fixes...)
Right. But I guess only few file-systems would really implement
this, because this is complex.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
* replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
2009-03-27 12:48 EXT4-ish "fixes" in UBIFS Artem Bityutskiy
@ 2009-03-29 12:26 ` Pavel Machek
2009-03-29 12:42 ` Artem Bityutskiy
2009-03-30 15:58 ` Diego Calleja
0 siblings, 2 replies; 20+ messages in thread
From: Pavel Machek @ 2009-03-29 12:26 UTC (permalink / raw)
To: Artem Bityutskiy; +Cc: Linux Kernel Mailing List
On Fri 2009-03-27 14:48:10, Artem Bityutskiy wrote:
> UBIFS has exactly the same properties like ext4 - in case
> of power cuts:
>
> 1. truncate/write/close leads to empty files
> 2. create/write/rename leads to empty files
>
> UBIFS is used in hand-held and and power-cuts are very
> often there, because users just remove battery often.
>
> I realize the "reality is different" argument, and already
> concluded that we need a similar changes as Theo has done
> for ext4:
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=bf1b69c0db7f9b9d8f02e94d40b19fca8336b991
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=f32b730a69bd56c5c9d704d8b75f03e90e290971
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=8411e347c3306ed36b8ca88611bf5fbf4d27d705
>
> We have a problem that user-space people do not want to
> use 'fsync()', even when they are pointed to their code
> which is doing create/write/rename/close without fsync().
Well... they really don't want to spin the disk up for the
fsync(). I'm not sure if fsync() is really sensible operation to use
there.
> 1. truncate/write/close leads to empty files
this is buggy.
> 2. create/write/rename leads to empty files
...but this should not be. If we want to make that explicit, we should
provide a "replace()" operation, where replace is a rename that makes sure
that the source file is completely on media before committing the rename.
It is somewhat similar to fsync()/rename(), but does not force disk
spin-up immediately -- it only inserts a "barrier" between the data blocks
and the rename. (And yes, it should be implemented as fsync()+rename() for
filesystems like xfs. It can be implemented as plain rename for ext3
and ext4 after the fixes...)
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html