linux-kernel.vger.kernel.org archive mirror
* Re: replace() system call needed (was Re: EXT4-ish "fixes" in  UBIFS)
       [not found]     ` <cl2N9-6Bj-9@gated-at.bofh.it>
@ 2009-03-31 21:27       ` Bodo Eggert
  2009-04-01  0:06         ` Theodore Tso
  0 siblings, 1 reply; 20+ messages in thread
From: Bodo Eggert @ 2009-03-31 21:27 UTC (permalink / raw)
  To: Pavel Machek, Artem Bityutskiy, Artem Bityutskiy,
	Linux Kernel Mailing List

Pavel Machek <pavel@ucw.cz> wrote:

> My proposal is
> 
> rename() stays.
> 
> replace(src, bar) is rename that ensures that bar will contain valid
> data after powerfail.

This can be done using implicit logic:

-> E.g. on close(), mark inodes that have not been sync()ed as poisoned.
   (I can think of more sophisticated logic, but ...)
-> On completing the delayed allocations for the inode, unpoison it.
-> Don't commit rename()s if the corresponding inode is poisoned.

Et voilà, everything replace() is supposed to guarantee is guaranteed.
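
For illustration only, the three rules above could be modelled in userspace
roughly as follows. This is a toy sketch, not kernel code: struct toy_inode,
struct toy_rename, toy_close(), toy_writeback_done() and toy_commit() are all
made-up names, and the model ignores everything that makes this hard in a real
journalling filesystem.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory model; none of these names exist in the kernel. */
struct toy_inode {
	bool dirty_data;	/* data blocks not yet on stable storage */
	bool poisoned;		/* closed without having been synced */
};

struct toy_rename {
	struct toy_inode *src;	/* inode being renamed over the target */
	bool committed;		/* has the rename been committed yet? */
};

/* Rule 1: on close(), poison an inode that still has unwritten data. */
static void toy_close(struct toy_inode *ino)
{
	if (ino->dirty_data)
		ino->poisoned = true;
}

/* Rule 2: once the delayed allocations are written out, unpoison it. */
static void toy_writeback_done(struct toy_inode *ino)
{
	ino->dirty_data = false;
	ino->poisoned = false;
}

/*
 * Rule 3: a commit pass skips renames whose inode is still poisoned.
 * Returns the number of renames committed in this pass.
 */
static size_t toy_commit(struct toy_rename *r, size_t n)
{
	size_t done = 0;

	for (size_t i = 0; i < n; i++) {
		if (r[i].committed || r[i].src->poisoned)
			continue;
		r[i].committed = true;
		done++;
	}
	return done;
}
```

In this model a rename issued right after close() simply stays pending until
writeback clears the poison flag, so a crash in between would leave the old
name pointing at the old data.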



* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-31 21:27       ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Bodo Eggert
@ 2009-04-01  0:06         ` Theodore Tso
  2009-04-01 20:52           ` Pavel Machek
  0 siblings, 1 reply; 20+ messages in thread
From: Theodore Tso @ 2009-04-01  0:06 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: Pavel Machek, Artem Bityutskiy, Artem Bityutskiy,
	Linux Kernel Mailing List

On Tue, Mar 31, 2009 at 11:27:33PM +0200, Bodo Eggert wrote:
> This can be done using implicit logic:
> 
> ->E.g. on close(), mark inodes without being sync()ed as poisoned.
> (I can think of more sophisticated logic, but ...)
> ->On completing the inode with the delayed allocations, unpoison it.
> ->Don't commit rename()s if the corresponding inode is poisoned.

Send us patches if you think it's that easy to do what you are
proposing.  I assure you it's not easy.

				- Ted


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-04-01  0:06         ` Theodore Tso
@ 2009-04-01 20:52           ` Pavel Machek
  2009-04-01 22:58             ` Bodo Eggert
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2009-04-01 20:52 UTC (permalink / raw)
  To: Theodore Tso, Bodo Eggert, Artem Bityutskiy, Artem Bityutskiy,
	Linux Kernel Mailing List

On Tue 2009-03-31 20:06:57, Theodore Tso wrote:
> On Tue, Mar 31, 2009 at 11:27:33PM +0200, Bodo Eggert wrote:
> > This can be done using implicit logic:
> > 
> > ->E.g. on close(), mark inodes without being sync()ed as poisoned.
> > (I can think of more sophisticated logic, but ...)
> > ->On completing the inode with the delayed allocations, unpoison it.
> > ->Don't commit rename()s if the corresponding inode is poisoned.
> 
> Send us patches if you think it's that easy to do what you are
> proposing.   I assure it's not easy.

Well, implementing a replace() syscall would be quite easy. Would you
want such a patch?
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-04-01 20:52           ` Pavel Machek
@ 2009-04-01 22:58             ` Bodo Eggert
  0 siblings, 0 replies; 20+ messages in thread
From: Bodo Eggert @ 2009-04-01 22:58 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Bodo Eggert, Artem Bityutskiy, Artem Bityutskiy,
	Linux Kernel Mailing List

On Wed, 1 Apr 2009, Pavel Machek wrote:
> On Tue 2009-03-31 20:06:57, Theodore Tso wrote:
> > On Tue, Mar 31, 2009 at 11:27:33PM +0200, Bodo Eggert wrote:

> > > This can be done using implicit logic:
> > > 
> > > ->E.g. on close(), mark inodes without being sync()ed as poisoned.
> > > (I can think of more sophisticated logic, but ...)
> > > ->On completing the inode with the delayed allocations, unpoison it.
> > > ->Don't commit rename()s if the corresponding inode is poisoned.
> > 
> > Send us patches if you think it's that easy to do what you are
> > proposing.   I assure it's not easy.
> 
> Well, implementing replace() syscall would be quite easy. Would you
> want such a patch?

You'd need - minus setting the flag on close - about the same logic:
- detecting if the inode is dirty
- detecting when the inode gets into a clean state
- delaying the commit of the rename() until then
- not replaying parts of the journal

As I understand it, replace() should not behave differently from rename(),
except for making a guarantee. Since this guarantee is desired anyway, I
think making it automatically would be a good thing. Especially until
all applications are rewritten ...

Of course, if you call replace(), you know it's going to be OK, but you could
also use pathconf($configdir, _PC_SYNC_BEFORE_RENAME), returning:
  0: sync or barrier is implicit (data=ordered)
  1: small chance of data loss due to not syncing (vfat?)
  2: guaranteed data loss without sync (ext4)

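// userspace.c
#include <unistd.h>

/* Returns nonzero if the caller should sync before rename()ing into `where`. */
static inline int sync_before_rename(const char *where, int is_important_data)
{
#ifndef _PC_SYNC_BEFORE_RENAME
	/* Proposed constant not available: be optimistic for unimportant data. */
	(void)where;
	return is_important_data;
#else
	long ret = pathconf(where, _PC_SYNC_BEFORE_RENAME);

	if (ret < 0)	/* error or unsupported: same policy as the fallback */
		return is_important_data;
	if (is_important_data)
		return ret > 0;
	return ret > 1;
#endif
}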
-- 
How do I set my laser printer on stun?


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-30 17:19       ` Ric Wheeler
@ 2009-03-30 22:11         ` Pavel Machek
  0 siblings, 0 replies; 20+ messages in thread
From: Pavel Machek @ 2009-03-30 22:11 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Artem Bityutskiy, Artem Bityutskiy, Linux Kernel Mailing List

Hi!

>> My proposal is 
>>
>> rename() stays.
>>
>> replace(src, bar) is rename that ensures that bar will contain valid
>> data after powerfail.
>>   
>
> Surely the only way to "insure" this is to spin up the drive, write the  
> meta-data and data back and make sure that it is not held in volatile  
> write cache?

Well, no. "will contain valid data" but may contain _old_ valid data.

So the way to do that would be "wait until you have to spin disk up
anyway or until timeout, then write data first, then do rename".
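
A toy model of that ordering, assuming only two on-disk steps (the data write
and the rename commit) and a crash that can hit after any number of them. The
names toy_result and toy_after_crash() are hypothetical and exist purely for
illustration:

```c
#include <stdbool.h>

enum toy_result { TOY_OLD, TOY_NEW, TOY_EMPTY };

/*
 * Simulate a crash after `steps_done` of the two steps have reached the disk.
 * data_first == true models the proposed replace() ordering (data, then
 * rename); data_first == false models the rename being committed first.
 */
enum toy_result toy_after_crash(bool data_first, int steps_done)
{
	bool data_on_disk = false;
	bool renamed = false;

	for (int step = 0; step < steps_done && step < 2; step++) {
		bool is_data_step = ((step == 0) == data_first);

		if (is_data_step)
			data_on_disk = true;
		else
			renamed = true;
	}

	if (!renamed)
		return TOY_OLD;		/* target still names the old file */
	return data_on_disk ? TOY_NEW : TOY_EMPTY;
}
```

With data_first set, a crash yields TOY_OLD or TOY_NEW but never TOY_EMPTY;
with the rename committed first, a crash after one step yields TOY_EMPTY,
which is the zero-length-file case under discussion.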

AFAICT those are the semantics GNOME (etc.) wants.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 12:50     ` Pavel Machek
  2009-03-29 13:00       ` Artem Bityutskiy
@ 2009-03-30 17:19       ` Ric Wheeler
  2009-03-30 22:11         ` Pavel Machek
  1 sibling, 1 reply; 20+ messages in thread
From: Ric Wheeler @ 2009-03-30 17:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Artem Bityutskiy, Artem Bityutskiy, Linux Kernel Mailing List

Pavel Machek wrote:
>>>> We have a problem that user-space people do not want to
>>>> use 'fsync()', even when they are pointed to their code
>>>> which is doing create/write/rename/close without fsync().
>>>>         
>>> Well... they really don't want to spin the disk up for the
>>> fsync(). I'm not sure if fsync() is really sensible operation to use
>>> there.
>>>       
>> I'm personally concerned about hand-held, and in case of UBIFS
>> fsync is not too expensive - we work on flash and on fsync() we
>> write back only the stuff belonging to inode in question, and
>> nothing else.
>>     
>
> Well, I'm more concerned about spinning disks, having one even in my
> zaurus. And I do believe that fsync() will write more data than
> neccessary even in flash case.
>
>   
>>>> 1. truncate/write/close leads to empty files
>>>>         
>>> this is buggy.
>>>       
>> In FS, or in application?
>>     
>
> Application is buggy; no way kernel can help there.
>
>   
>>>> 2. create/write/rename leads to empty files
>>>>         
>>> ..but this should not be. If we want to make that explicit, we should
>>> provide "replace()" operation; where replace is rename that makes sure
>>> that source file is completely on media before commiting the rename.
>>>       
>> Well, OK, we can fsync() before rename, we just need clean rules
>> for this, so that all Linux FSes would follow them. Would be nice
>> to have final agreement on all this stuff.
>>     
>
> My proposal is 
>
> rename() stays.
>
> replace(src, bar) is rename that ensures that bar will contain valid
> data after powerfail.
>   

Surely the only way to "ensure" this is to spin up the drive, write the
metadata and data back and make sure that it is not held in a volatile
write cache?

Why would calling this replace() be better or more power efficient than
what you need to do today?

ric

>   
>>> It is somehow similar to fsync()/rename(), but does not force disk
>>> spin up immediately -- it only inserts "barrier" between data blocks
>>> and rename. (And yes, it should be implemented as fsync()+rename() for
>>> filesystems like xfs. It can be implemented as plain rename for ext3
>>> and ext4 after the fixes...)
>>>       
>> Right. But I guess only few file-systems would really implement
>> this, because this is complex.
>>     
>
> Complex yes, but at least ext3+ext4+btrfs should, and they really have
> 90% of "market share" :-). ext3 and ext4 implementations are already
> done :-).
> 								Pavel
>   



* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 12:26 ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Pavel Machek
  2009-03-29 12:42   ` Artem Bityutskiy
@ 2009-03-30 15:58   ` Diego Calleja
  1 sibling, 0 replies; 20+ messages in thread
From: Diego Calleja @ 2009-03-30 15:58 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Sunday 29 March 2009 14:26:00 Pavel Machek wrote:

> ...but this should not be. If we want to make that explicit, we should
> provide "replace()" operation; where replace is rename that makes sure
> that source file is completely on media before committing the rename.

An "ad Linus-em" counterexample:

"And if we have a Linux-specific magic system call or sync action, it's 
going to be even more rarely used than fsync(). Do you think anybody 
really uses the OS X FSYNC_FULL ioctl? Nope. Outside of a few databases, 
it is almost certainly not going to be used, and fsync() will not be 
reliable in general.

So rather than come up with new barriers that nobody will use, filesystem 
people should aim to make "badly written" code "just work" unless people 
are really really unlucky. Because like it or not, that's what 99% of all 
code is."


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:57               ` Artem Bityutskiy
@ 2009-03-29 14:00                 ` Pavel Machek
  0 siblings, 0 replies; 20+ messages in thread
From: Pavel Machek @ 2009-03-29 14:00 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Sun 2009-03-29 16:57:06, Artem Bityutskiy wrote:
> Pavel Machek wrote:
>> On Sun 2009-03-29 16:07:35, Artem Bityutskiy wrote:
>>> Pavel Machek wrote:
>>>> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>>>>> Pavel Machek wrote:
>>>>>>>>> 2. create/write/rename leads to empty files
>>>>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>>>>> that source file is completely on media before commiting the rename.
>>>>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>>>>> to have final agreement on all this stuff.
>>>>>> My proposal is 
>>>>>>
>>>>>> rename() stays.
>>>>> It stays and:
>>>>>
>>>>> 1. does _not_ fsync
>>>> Does not fsync. If someone wants to make sure one of the files is on
>>>> the disk, he should use replace(). [On non-linux systems, replace()
>>>> should be implemented as fsync/rename in libc or something.]
>>> I would be happy with these rules. But the fact is, application
>>> people just refuse to add fsync before rename. They say that the
>>> FS has to do this. And they say that even Linus supports them,
>>
>> That's good. fsync before rename would be ugly regression (on ext3 at
>> least). We should get them to use replace() syscall, not get them to
>> add fsyncs. [Of course, that means we need replace syscall first. :-)]
>
> I'd say it is better to fix ext3 then.

? I don't get this.

ext3's rename() is already equivalent to proposed replace(). The
problem is that btrfs's and ubifs's renames are not.

So doing an extra fsync() on ext3 is actually a performance regression
-> we do not want applications to randomly add open-coded fsync()s.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:40             ` Pavel Machek
@ 2009-03-29 13:57               ` Artem Bityutskiy
  2009-03-29 14:00                 ` Pavel Machek
  0 siblings, 1 reply; 20+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:57 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

Pavel Machek wrote:
> On Sun 2009-03-29 16:07:35, Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>>>> Pavel Machek wrote:
>>>>>>>> 2. create/write/rename leads to empty files
>>>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>>>> that source file is completely on media before commiting the rename.
>>>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>>>> to have final agreement on all this stuff.
>>>>> My proposal is 
>>>>>
>>>>> rename() stays.
>>>> It stays and:
>>>>
>>>> 1. does _not_ fsync
>>> Does not fsync. If someone wants to make sure one of the files is on
>>> the disk, he should use replace(). [On non-linux systems, replace()
>>> should be implemented as fsync/rename in libc or something.]
>> I would be happy with these rules. But the fact is, application
>> people just refuse to add fsync before rename. They say that the
>> FS has to do this. And they say that even Linus supports them,
> 
> That's good. fsync before rename would be ugly regression (on ext3 at
> least). We should get them to use replace() syscall, not get them to
> add fsyncs. [Of course, that means we need replace syscall first. :-)]

I'd say it is better to fix ext3 then.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:22             ` Andreas T.Auer
@ 2009-03-29 13:55               ` Artem Bityutskiy
  0 siblings, 0 replies; 20+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:55 UTC (permalink / raw)
  To: Andreas T.Auer; +Cc: Pavel Machek, Artem Bityutskiy, Linux Kernel Mailing List

Andreas T.Auer wrote:
> On 29.03.2009 15:07 Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>> Does not fsync. If someone wants to make sure one of the files is on
>>> the disk, he should use replace(). [On non-linux systems, replace()
>>> should be implemented as fsync/rename in libc or something.]
> As a user I will avoid using any fs, which requires the tons of
> applications to be changed for a reasonable amount of data safety.
>> I would be happy with these rules. But the fact is, application
>> people just refuse to add fsync before rename.
> Because it slows down the performance.
>> They say that the
>> FS has to do this. 
> They say that FS should not write metadata for non-existing data and
> even overwrite "clean" metadata with "dirty" metadata. It is up to the
> fs to decide, whether fsync is needed to achieve this.
 
Well, this makes sense, but the fact is that FS developers did
not keep this in mind. And when we were developing UBIFS,
we also naively assumed that user-space would just call fsync()
if needed. And it was easier to implement stuff this way. And
it looked like POSIX and other Linux FSes assumed that.

But well, we can change UBIFS behavior, but it would be nice
to have some agreement on all this.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:07           ` Artem Bityutskiy
  2009-03-29 13:22             ` Andreas T.Auer
@ 2009-03-29 13:40             ` Pavel Machek
  2009-03-29 13:57               ` Artem Bityutskiy
  1 sibling, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2009-03-29 13:40 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Sun 2009-03-29 16:07:35, Artem Bityutskiy wrote:
> Pavel Machek wrote:
>> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>>> Pavel Machek wrote:
>>>>>>> 2. create/write/rename leads to empty files
>>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>>> that source file is completely on media before commiting the rename.
>>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>>> to have final agreement on all this stuff.
>>>> My proposal is 
>>>>
>>>> rename() stays.
>>> It stays and:
>>>
>>> 1. does _not_ fsync
>>
>> Does not fsync. If someone wants to make sure one of the files is on
>> the disk, he should use replace(). [On non-linux systems, replace()
>> should be implemented as fsync/rename in libc or something.]
>
> I would be happy with these rules. But the fact is, application
> people just refuse to add fsync before rename. They say that the
> FS has to do this. And they say that even Linus supports them,

That's good. fsync before rename would be ugly regression (on ext3 at
least). We should get them to use replace() syscall, not get them to
add fsyncs. [Of course, that means we need replace syscall first. :-)]
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:07           ` Artem Bityutskiy
@ 2009-03-29 13:22             ` Andreas T.Auer
  2009-03-29 13:55               ` Artem Bityutskiy
  2009-03-29 13:40             ` Pavel Machek
  1 sibling, 1 reply; 20+ messages in thread
From: Andreas T.Auer @ 2009-03-29 13:22 UTC (permalink / raw)
  To: Artem.Bityutskiy
  Cc: Pavel Machek, Artem Bityutskiy, Linux Kernel Mailing List



On 29.03.2009 15:07 Artem Bityutskiy wrote:
> Pavel Machek wrote:
>>
>> Does not fsync. If someone wants to make sure one of the files is on
>> the disk, he should use replace(). [On non-linux systems, replace()
>> should be implemented as fsync/rename in libc or something.]
>
As a user I will avoid using any fs that requires tons of
applications to be changed for a reasonable amount of data safety.
> I would be happy with these rules. But the fact is, application
> people just refuse to add fsync before rename.
Because it slows down the performance.
> They say that the
> FS has to do this. 
They say that the FS should not write metadata for non-existing data, or
even overwrite "clean" metadata with "dirty" metadata. It is up to the
fs to decide whether fsync is needed to achieve this.




* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:02         ` Pavel Machek
@ 2009-03-29 13:07           ` Artem Bityutskiy
  2009-03-29 13:22             ` Andreas T.Auer
  2009-03-29 13:40             ` Pavel Machek
  0 siblings, 2 replies; 20+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:07 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

Pavel Machek wrote:
> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>>>>> 2. create/write/rename leads to empty files
>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>> that source file is completely on media before commiting the rename.
>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>> to have final agreement on all this stuff.
>>> My proposal is 
>>>
>>> rename() stays.
>> It stays and:
>>
>> 1. does _not_ fsync
> 
> Does not fsync. If someone wants to make sure one of the files is on
> the disk, he should use replace(). [On non-linux systems, replace()
> should be implemented as fsync/rename in libc or something.]

I would be happy with these rules. But the fact is, application
people just refuse to add fsync before rename. They say that the
FS has to do this. And they say that even Linus supports them,
which is an argument I find difficult to fight against. This is
why I want clean rules.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:01     ` Andreas T.Auer
@ 2009-03-29 13:06       ` Artem Bityutskiy
  0 siblings, 0 replies; 20+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:06 UTC (permalink / raw)
  To: Andreas T.Auer; +Cc: Artem Bityutskiy, Pavel Machek, Linux Kernel Mailing List

Andreas T.Auer wrote:
> 
> On 29.03.2009 14:42 Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>
>>>> 1. truncate/write/close leads to empty files
>>> this is buggy.
>> In FS, or in application?
> In application of course. If you rewrite a huge file that way, you have
> a long-time risk of loosing data in a crash, even with sychronous writes.

You know, after reading all these blogs and discussions,
I will not be surprised if someone says this is an FS bug.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:00       ` Artem Bityutskiy
@ 2009-03-29 13:02         ` Pavel Machek
  2009-03-29 13:07           ` Artem Bityutskiy
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2009-03-29 13:02 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
> Pavel Machek wrote:
>>>>> 2. create/write/rename leads to empty files
>>>> ..but this should not be. If we want to make that explicit, we should
>>>> provide "replace()" operation; where replace is rename that makes sure
>>>> that source file is completely on media before commiting the rename.
>>> Well, OK, we can fsync() before rename, we just need clean rules
>>> for this, so that all Linux FSes would follow them. Would be nice
>>> to have final agreement on all this stuff.
>>
>> My proposal is 
>>
>> rename() stays.
>
> It stays and:
>
> 1. does _not_ fsync

Does not fsync. If someone wants to make sure one of the files is on
the disk, he should use replace(). [On non-linux systems, replace()
should be implemented as fsync/rename in libc or something.]
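
A minimal sketch of what such a libc-level fallback might look like, assuming
the name replace_fallback() (a hypothetical name, not an existing libc
function). It uses only open(), fsync(), close() and rename(), and it does
force the data out immediately, which is exactly the cost a native replace()
is meant to avoid:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical fallback: flush src's data to stable storage, then rename it
 * over dst. Returns 0 on success, -1 on error with errno set by the call
 * that failed.
 */
int replace_fallback(const char *src, const char *dst)
{
	int fd = open(src, O_WRONLY);

	if (fd < 0)
		return -1;

	if (fsync(fd) < 0) {
		int saved = errno;	/* close() may clobber errno */

		close(fd);
		errno = saved;
		return -1;
	}
	if (close(fd) < 0)
		return -1;

	return rename(src, dst);
}
```

A caller would write the new contents to a temporary file in the same
directory and then call replace_fallback(tmp, final); whether the directory
entry itself also needs an fsync() is a separate question this sketch does
not address.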
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 12:42   ` Artem Bityutskiy
  2009-03-29 12:50     ` Pavel Machek
@ 2009-03-29 13:01     ` Andreas T.Auer
  2009-03-29 13:06       ` Artem Bityutskiy
  1 sibling, 1 reply; 20+ messages in thread
From: Andreas T.Auer @ 2009-03-29 13:01 UTC (permalink / raw)
  To: Artem Bityutskiy, Pavel Machek
  Cc: Artem Bityutskiy, Linux Kernel Mailing List



On 29.03.2009 14:42 Artem Bityutskiy wrote:
> Pavel Machek wrote:
>
>>> 1. truncate/write/close leads to empty files
>>
>> this is buggy.
>
> In FS, or in application?
In the application, of course. If you rewrite a huge file that way, you have
a long-lasting risk of losing data in a crash, even with synchronous writes.
>
>>> 2. create/write/rename leads to empty files
In that case the window of risk is reduced to the rename itself, at least
from the viewpoint of application developers who don't know modern
re-ordering filesystems.

>> ..but this should not be. If we want to make that explicit, we should
>> provide "replace()" operation; where replace is rename that makes sure
>> that source file is completely on media before commiting the rename.
It is a hard task to change all the applications; there are a lot of
orphaned projects which are still used.
> Well, OK, we can fsync() before rename, we just need clean rules
> for this, so that all Linux FSes would follow them. Would be nice
> to have final agreement on all this stuff.
>
This slows down things, but you could also delay the writing of the
metadata pointing to non-existing data. Or is there any use for it after
the crash?



* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 12:50     ` Pavel Machek
@ 2009-03-29 13:00       ` Artem Bityutskiy
  2009-03-29 13:02         ` Pavel Machek
  2009-03-30 17:19       ` Ric Wheeler
  1 sibling, 1 reply; 20+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:00 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

Pavel Machek wrote:
>>>> 2. create/write/rename leads to empty files
>>> ..but this should not be. If we want to make that explicit, we should
>>> provide "replace()" operation; where replace is rename that makes sure
>>> that source file is completely on media before commiting the rename.
>> Well, OK, we can fsync() before rename, we just need clean rules
>> for this, so that all Linux FSes would follow them. Would be nice
>> to have final agreement on all this stuff.
> 
> My proposal is 
> 
> rename() stays.

It stays and:

1. does _not_ fsync
2. has synchronous fsync added
3. has asynchronous fsync added?

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 12:42   ` Artem Bityutskiy
@ 2009-03-29 12:50     ` Pavel Machek
  2009-03-29 13:00       ` Artem Bityutskiy
  2009-03-30 17:19       ` Ric Wheeler
  2009-03-29 13:01     ` Andreas T.Auer
  1 sibling, 2 replies; 20+ messages in thread
From: Pavel Machek @ 2009-03-29 12:50 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Artem Bityutskiy, Linux Kernel Mailing List


>>> We have a problem that user-space people do not want to
>>> use 'fsync()', even when they are pointed to their code
>>> which is doing create/write/rename/close without fsync().
>>
>> Well... they really don't want to spin the disk up for the
>> fsync(). I'm not sure if fsync() is really sensible operation to use
>> there.
>
> I'm personally concerned about hand-held, and in case of UBIFS
> fsync is not too expensive - we work on flash and on fsync() we
> write back only the stuff belonging to inode in question, and
> nothing else.

Well, I'm more concerned about spinning disks, having one even in my
zaurus. And I do believe that fsync() will write more data than
necessary even in the flash case.

>>> 1. truncate/write/close leads to empty files
>>
>> this is buggy.
>
> In FS, or in application?

Application is buggy; no way kernel can help there.

>>> 2. create/write/rename leads to empty files
>>
>> ..but this should not be. If we want to make that explicit, we should
>> provide "replace()" operation; where replace is rename that makes sure
>> that source file is completely on media before commiting the rename.
>
> Well, OK, we can fsync() before rename, we just need clean rules
> for this, so that all Linux FSes would follow them. Would be nice
> to have final agreement on all this stuff.

My proposal is 

rename() stays.

replace(src, bar) is rename that ensures that bar will contain valid
data after powerfail.

>> It is somehow similar to fsync()/rename(), but does not force disk
>> spin up immediately -- it only inserts "barrier" between data blocks
>> and rename. (And yes, it should be implemented as fsync()+rename() for
>> filesystems like xfs. It can be implemented as plain rename for ext3
>> and ext4 after the fixes...)
>
> Right. But I guess only few file-systems would really implement
> this, because this is complex.

Complex yes, but at least ext3+ext4+btrfs should, and they really have
90% of "market share" :-). ext3 and ext4 implementations are already
done :-).
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 12:26 ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Pavel Machek
@ 2009-03-29 12:42   ` Artem Bityutskiy
  2009-03-29 12:50     ` Pavel Machek
  2009-03-29 13:01     ` Andreas T.Auer
  2009-03-30 15:58   ` Diego Calleja
  1 sibling, 2 replies; 20+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 12:42 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

Pavel Machek wrote:
> On Fri 2009-03-27 14:48:10, Artem Bityutskiy wrote:
>> UBIFS has exactly the same properties like ext4 - in case
>> of power cuts:
>>
>> 1. truncate/write/close leads to empty files
>> 2. create/write/rename leads to empty files
>>
>> UBIFS is used in hand-helds, and power cuts are very
>> frequent there, because users often just remove the battery.
>>
>> I realize the "reality is different" argument, and already
>> concluded that we need a similar changes as Theo has done
>> for ext4:
>> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=bf1b69c0db7f9b9d8f02e94d40b19fca8336b991
>> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=f32b730a69bd56c5c9d704d8b75f03e90e290971
>> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=8411e347c3306ed36b8ca88611bf5fbf4d27d705
>>
>> We have a problem that user-space people do not want to
>> use 'fsync()', even when they are pointed to their code
>> which is doing create/write/rename/close without fsync().
> 
> Well... they really don't want to spin the disk up for the
> fsync(). I'm not sure if fsync() is really sensible operation to use
> there.

I'm personally concerned about hand-helds, and in the case of UBIFS
fsync() is not too expensive: we work on flash, and on fsync() we
write back only the stuff belonging to the inode in question, and
nothing else.
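
For reference, the sequence applications are being asked to use
(create/write/fsync/rename) might look like the sketch below. The name
save_atomically() and the ".tmp" suffix are assumptions made for the example;
a careful application may also want to fsync() the containing directory,
which is left out here:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write buf to "<path>.tmp", fsync() it, then rename()
 * it over path. Returns 0 on success, -1 on error.
 */
int save_atomically(const char *path, const char *buf, size_t len)
{
	char tmp[4096];
	int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);

	if (n < 0 || (size_t)n >= sizeof tmp)
		return -1;

	int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);	/* create */
	if (fd < 0)
		return -1;

	size_t off = 0;
	while (off < len) {					/* write */
		ssize_t w = write(fd, buf + off, len - off);

		if (w < 0) {
			if (errno == EINTR)
				continue;
			close(fd);
			unlink(tmp);
			return -1;
		}
		off += (size_t)w;
	}

	if (fsync(fd) < 0) {		/* the step applications tend to skip */
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (close(fd) < 0) {
		unlink(tmp);
		return -1;
	}
	if (rename(tmp, path) < 0) {				/* rename */
		unlink(tmp);
		return -1;
	}
	return 0;
}
```

Dropping the fsync() call from this sketch gives the create/write/rename
pattern that can end up as an empty file after a power cut on filesystems
that commit the rename before the data.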

>> 1. truncate/write/close leads to empty files
> 
> this is buggy.

In FS, or in application?

>> 2. create/write/rename leads to empty files
> 
> ...but this should not be. If we want to make that explicit, we should
> provide a "replace()" operation, where replace is a rename that makes
> sure that the source file is completely on media before committing the
> rename.

Well, OK, we can fsync() before rename(); we just need clear rules
for this, so that all Linux filesystems follow them. It would be nice
to reach a final agreement on all this.

> It is somewhat similar to fsync()/rename(), but does not force the disk
> to spin up immediately -- it only inserts a "barrier" between the data
> blocks and the rename. (And yes, it should be implemented as
> fsync()+rename() for filesystems like xfs. It can be implemented as
> plain rename() for ext3 and ext4 after the fixes...)

Right. But I guess only a few filesystems would really implement
this, because it is complex.
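
For reference, the create/write/rename sequence with the fsync() under
discussion can be sketched as follows. This is a minimal illustration,
not code from any application mentioned in this thread; the function
name write_file_durably() and its parameters are made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes of buf to tmp_path, flush them to
 * media, then rename tmp_path over final_path.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
static int write_file_durably(const char *tmp_path, const char *final_path,
                              const void *buf, size_t len)
{
    assert(tmp_path && final_path && buf);

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be short or interrupted, so loop until all is written. */
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* The step applications tend to skip: make sure the data is on media
     * before the rename can make it visible under the final name. */
    if (fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }

    /* rename() atomically replaces final_path with the new file. */
    if (rename(tmp_path, final_path) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

A caller would use it as write_file_durably("cfg.tmp", "cfg", data, len).
Making the rename itself durable would additionally need an fsync() on
the containing directory, which this sketch leaves out.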

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

* replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-27 12:48 EXT4-ish "fixes" in UBIFS Artem Bityutskiy
@ 2009-03-29 12:26 ` Pavel Machek
  2009-03-29 12:42   ` Artem Bityutskiy
  2009-03-30 15:58   ` Diego Calleja
  0 siblings, 2 replies; 20+ messages in thread
From: Pavel Machek @ 2009-03-29 12:26 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Linux Kernel Mailing List

On Fri 2009-03-27 14:48:10, Artem Bityutskiy wrote:
> UBIFS has exactly the same properties as ext4 - in case
> of power cuts:
>
> 1. truncate/write/close leads to empty files
> 2. create/write/rename leads to empty files
>
> UBIFS is used in hand-held devices, and power cuts are very
> frequent there, because users often just remove the battery.
>
> I realize the "reality is different" argument, and have already
> concluded that we need changes similar to those Theo has made
> for ext4:
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=bf1b69c0db7f9b9d8f02e94d40b19fca8336b991
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=f32b730a69bd56c5c9d704d8b75f03e90e290971
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=8411e347c3306ed36b8ca88611bf5fbf4d27d705
>
> We have a problem that user-space people do not want to
> use 'fsync()', even when they are pointed to their code
> which is doing create/write/rename/close without fsync().

Well... they really don't want to spin the disk up for the
fsync(). I'm not sure if fsync() is really a sensible operation to use
there.

> 1. truncate/write/close leads to empty files

this is buggy.

> 2. create/write/rename leads to empty files

...but this should not be. If we want to make that explicit, we should
provide a "replace()" operation, where replace is a rename that makes
sure that the source file is completely on media before committing the
rename.

It is somewhat similar to fsync()/rename(), but does not force the disk
to spin up immediately -- it only inserts a "barrier" between the data
blocks and the rename. (And yes, it should be implemented as
fsync()+rename() for filesystems like xfs. It can be implemented as
plain rename() for ext3 and ext4 after the fixes...)
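
The fsync()+rename() fallback mentioned above could look roughly like
this in user space. replace() is only a proposal in this thread and no
such system call exists; replace_fallback() is a hypothetical name used
for illustration. Note that this version forces the flush immediately,
so it provides the ordering but not the deferred spin-up that the
proposal is after.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical user-space stand-in for the proposed replace(src, dst):
 * flush src to media, then rename it over dst.
 * Returns 0 on success, -1 on error. */
static int replace_fallback(const char *src, const char *dst)
{
    assert(src && dst);

    /* Linux accepts fsync() on a descriptor opened read-only. */
    int fd = open(src, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Data must reach media before the rename is issued. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    if (close(fd) < 0)
        return -1;

    return rename(src, dst);
}
```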

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

end of thread, other threads:[~2009-04-01 22:58 UTC | newest]

Thread overview: 20+ messages
     [not found] <ckjPq-2Dl-15@gated-at.bofh.it>
     [not found] ` <cl2jy-65z-1@gated-at.bofh.it>
     [not found]   ` <cl2CZ-6q2-21@gated-at.bofh.it>
     [not found]     ` <cl2N9-6Bj-9@gated-at.bofh.it>
2009-03-31 21:27       ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Bodo Eggert
2009-04-01  0:06         ` Theodore Tso
2009-04-01 20:52           ` Pavel Machek
2009-04-01 22:58             ` Bodo Eggert
2009-03-27 12:48 EXT4-ish "fixes" in UBIFS Artem Bityutskiy
2009-03-29 12:26 ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Pavel Machek
2009-03-29 12:42   ` Artem Bityutskiy
2009-03-29 12:50     ` Pavel Machek
2009-03-29 13:00       ` Artem Bityutskiy
2009-03-29 13:02         ` Pavel Machek
2009-03-29 13:07           ` Artem Bityutskiy
2009-03-29 13:22             ` Andreas T.Auer
2009-03-29 13:55               ` Artem Bityutskiy
2009-03-29 13:40             ` Pavel Machek
2009-03-29 13:57               ` Artem Bityutskiy
2009-03-29 14:00                 ` Pavel Machek
2009-03-30 17:19       ` Ric Wheeler
2009-03-30 22:11         ` Pavel Machek
2009-03-29 13:01     ` Andreas T.Auer
2009-03-29 13:06       ` Artem Bityutskiy
2009-03-30 15:58   ` Diego Calleja
