linux-kernel.vger.kernel.org archive mirror
* EXT4-ish "fixes" in UBIFS
@ 2009-03-27 12:48 Artem Bityutskiy
  2009-03-28  1:22 ` Kyungmin Park
                   ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Artem Bityutskiy @ 2009-03-27 12:48 UTC (permalink / raw)
  To: Linux Kernel Mailing List

UBIFS has exactly the same properties as ext4 in case
of power cuts:

1. truncate/write/close leads to empty files
2. create/write/rename leads to empty files

UBIFS is used in hand-held devices, and power cuts are very
frequent there, because users often just remove the battery.

I understand the "reality is different" argument, and have already
concluded that we need changes similar to the ones Theo has made
for ext4:
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=bf1b69c0db7f9b9d8f02e94d40b19fca8336b991
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=f32b730a69bd56c5c9d704d8b75f03e90e290971
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=8411e347c3306ed36b8ca88611bf5fbf4d27d705

The problem is that user-space people do not want to
use 'fsync()', even when they are pointed at their own code
that does create/write/rename/close without fsync().

They just say: this is a file-system bug, it is fixed in
ext4 now, so just fix the bug in UBIFS.

I tell them that this is not a fix but a band-aid, because
ext4 only issues an asynchronous write, and a power cut can
still lead to corruption.

I tell them we can do the same in UBIFS, but please add
fsync() to your application anyway. They say: no, we
will not, you fix your UBIFS.

And because there is so much noise about this, it is
so difficult to have a reasonable argument. I want to tell
people: please still use fsync(), and if this is about the
performance/reliability trade-off, make it optional.
But instead they say: respected people are on our side,
go away. And they point me to this:
http://www.advogato.org/person/mjg59/diary/195.html
http://thread.gmane.org/gmane.linux.kernel/811167/focus=811700
http://article.gmane.org/gmane.comp.lang.perl.perl5.porters/67352

And they say that BTRFS and XFS are going to fix userspace
as well, and point me at this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/175

This has all become so messy and controversial. What should I do
to persuade userspace to use 'fsync()' even if we hack UBIFS
similarly to ext4? Suggestions?

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: EXT4-ish "fixes" in UBIFS
  2009-03-27 12:48 EXT4-ish "fixes" in UBIFS Artem Bityutskiy
@ 2009-03-28  1:22 ` Kyungmin Park
  2009-03-29 12:31   ` Artem Bityutskiy
  2009-03-29 12:26 ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Pavel Machek
  2009-04-03  0:09 ` EXT4-ish "fixes" in UBIFS Christian Kujau
  2 siblings, 1 reply; 45+ messages in thread
From: Kyungmin Park @ 2009-03-28  1:22 UTC (permalink / raw)
  To: Artem.Bityutskiy; +Cc: Linux Kernel Mailing List

Hi,

I have also received these requests: the file is empty after a rename
operation in case of a sudden power-off.
They say it is different from JFFS2; with JFFS2, the name still points
to the old file even after a power-off.
"Then why is UBIFS different? Fix it to behave as before." I said it is
not a filesystem bug; it is expected behavior.

In my case, I persuaded the application people to change their
application to use fsync. Also, if fsync doesn't solve the problem,
add a mirror scheme, i.e. keep a duplicate of the file, to avoid the
empty-file problem.
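
A mirror scheme of the kind mentioned above could be read back roughly
as in the C sketch below. The helper names `is_nonempty` and `pick_copy`,
and the rule that "non-empty means usable", are assumptions made for
this illustration only; they do not come from any application discussed
in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Return 1 if `path` is a regular file with size > 0, else 0. */
static int is_nonempty(const char *path)
{
    struct stat st;

    return stat(path, &st) == 0 && S_ISREG(st.st_mode) && st.st_size > 0;
}

/*
 * Hypothetical helper: choose which copy of a mirrored file to read.
 * Prefers the primary; falls back to the mirror when the primary is
 * missing or empty (e.g. after a power cut). Returns NULL if neither
 * copy is usable.
 */
const char *pick_copy(const char *primary, const char *mirror)
{
    assert(primary != NULL && mirror != NULL);

    if (is_nonempty(primary))
        return primary;
    if (is_nonempty(mirror))
        return mirror;
    return NULL;
}
```

The writer side would update the two copies one after the other, so
that at most one of them can be caught half-written by a power cut.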

Frankly, I'm not sure which approach is better, or how much the
filesystem should support it. But remember that application programmers
also don't want to change their applications when the filesystem changes:
"The application has not changed, only the filesystem has changed, so
it's a filesystem problem, not ours."

Thank you,
Kyungmin Park



* replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-27 12:48 EXT4-ish "fixes" in UBIFS Artem Bityutskiy
  2009-03-28  1:22 ` Kyungmin Park
@ 2009-03-29 12:26 ` Pavel Machek
  2009-03-29 12:42   ` Artem Bityutskiy
  2009-03-30 15:58   ` Diego Calleja
  2009-04-03  0:09 ` EXT4-ish "fixes" in UBIFS Christian Kujau
  2 siblings, 2 replies; 45+ messages in thread
From: Pavel Machek @ 2009-03-29 12:26 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Linux Kernel Mailing List

On Fri 2009-03-27 14:48:10, Artem Bityutskiy wrote:
> UBIFS has exactly the same properties like ext4 - in case
> of power cuts:
>
> 1. truncate/write/close leads to empty files
> 2. create/write/rename leads to empty files
>
> UBIFS is used in hand-held and and power-cuts are very
> often there, because users just remove battery often.
>
> I realize the "reality is different" argument, and already
> concluded that we need a similar changes as Theo has done
> for ext4:
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=bf1b69c0db7f9b9d8f02e94d40b19fca8336b991
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=f32b730a69bd56c5c9d704d8b75f03e90e290971
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=8411e347c3306ed36b8ca88611bf5fbf4d27d705
>
> We have a problem that user-space people do not want to
> use 'fsync()', even when they are pointed to their code
> which is doing create/write/rename/close without fsync().

Well... they really don't want to spin the disk up for the
fsync(). I'm not sure fsync() is really a sensible operation to use
there.

> 1. truncate/write/close leads to empty files

this is buggy.

> 2. create/write/rename leads to empty files

...but this should not be. If we want to make that explicit, we should
provide a "replace()" operation, where replace is a rename that makes sure
that the source file is completely on media before committing the rename.

It is somewhat similar to fsync()/rename(), but does not force an
immediate disk spin-up -- it only inserts a "barrier" between the data
blocks and the rename. (And yes, it should be implemented as
fsync()+rename() for filesystems like xfs. It can be implemented as a
plain rename for ext3 and ext4 after the fixes...)

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: EXT4-ish "fixes" in UBIFS
  2009-03-28  1:22 ` Kyungmin Park
@ 2009-03-29 12:31   ` Artem Bityutskiy
  2009-03-29 12:54     ` Artem Bityutskiy
  0 siblings, 1 reply; 45+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 12:31 UTC (permalink / raw)
  To: Kyungmin Park; +Cc: Artem.Bityutskiy, Linux Kernel Mailing List

Kyungmin Park wrote:
> I also got these request. the file is empty at rename operatoin in
> case of sudden power off.
> they say it's different from jffs2. in case of jffs2, it points old
> files even though power off.

Right, because JFFS2 is synchronous :-)

> then why is UBIFS different. fix it as before. I said it's not
> filesystem bug. it's expected behaviors.

Right, this is what I have always thought: the FS gives no
guarantees, and if you want a 100% guarantee, use fsync() before
renaming. Frankly, I still think so. But we'll make ext4-like
changes in UBIFS as well to help the applications which do not
do the sync.

> Frankly I'm not sure which one is better. how much filesystem support
> it. but remember that application programmer also don't want to change
> their application when filesystem is changed.
> "The application is not changed, only filesystem is changed. so it's
> filesystem problem, not us"

I hope the Linux gurus will settle it clearly after all - to fsync() or
not to fsync(). We do need clear rules of the game. For now, I still
assume the following:

1. If applications want an atomic update with a 100% guarantee,
   they should fsync before rename.
2. If the application does not use fsync, the FS should try to minimize
   the probability of data loss by starting asynchronous write-back
   on a rename which unlinks a direntry.
3. All this performance vs. reliability hassle should be solved
   by fixing the FS, by having good defaults, and by having
   "fsync/no fsync" knobs in applications.

Indeed, people mostly talk about ext3, desktops, etc. But there
is also the embedded world, where battery is removed randomly.

But we will see where this all leads. I really want clean rules
for this.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 12:26 ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Pavel Machek
@ 2009-03-29 12:42   ` Artem Bityutskiy
  2009-03-29 12:50     ` Pavel Machek
  2009-03-29 13:01     ` Andreas T.Auer
  2009-03-30 15:58   ` Diego Calleja
  1 sibling, 2 replies; 45+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 12:42 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

Pavel Machek wrote:
> On Fri 2009-03-27 14:48:10, Artem Bityutskiy wrote:
>> UBIFS has exactly the same properties like ext4 - in case
>> of power cuts:
>>
>> 1. truncate/write/close leads to empty files
>> 2. create/write/rename leads to empty files
>>
>> UBIFS is used in hand-held and and power-cuts are very
>> often there, because users just remove battery often.
>>
>> I realize the "reality is different" argument, and already
>> concluded that we need a similar changes as Theo has done
>> for ext4:
>> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=bf1b69c0db7f9b9d8f02e94d40b19fca8336b991
>> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=f32b730a69bd56c5c9d704d8b75f03e90e290971
>> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=8411e347c3306ed36b8ca88611bf5fbf4d27d705
>>
>> We have a problem that user-space people do not want to
>> use 'fsync()', even when they are pointed to their code
>> which is doing create/write/rename/close without fsync().
> 
> Well... they really don't want to spin the disk up for the
> fsync(). I'm not sure if fsync() is really sensible operation to use
> there.

I'm personally concerned about hand-helds, and in the case of UBIFS
fsync is not too expensive - we work on flash, and on fsync() we
write back only the data belonging to the inode in question, and
nothing else.

>> 1. truncate/write/close leads to empty files
> 
> this is buggy.

In the FS, or in the application?

>> 2. create/write/rename leads to empty files
> 
> ..but this should not be. If we want to make that explicit, we should
> provide "replace()" operation; where replace is rename that makes sure
> that source file is completely on media before commiting the rename.

Well, OK, we can fsync() before rename; we just need clean rules
for this, so that all Linux FSes follow them. It would be nice
to have a final agreement on all this stuff.

> It is somehow similar to fsync()/rename(), but does not force disk
> spin up immediately -- it only inserts "barrier" between data blocks
> and rename. (And yes, it should be implemented as fsync()+rename() for
> filesystems like xfs. It can be implemented as plain rename for ext3
> and ext4 after the fixes...)

Right. But I guess only a few file-systems would really implement
this, because it is complex.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 12:42   ` Artem Bityutskiy
@ 2009-03-29 12:50     ` Pavel Machek
  2009-03-29 13:00       ` Artem Bityutskiy
  2009-03-30 17:19       ` Ric Wheeler
  2009-03-29 13:01     ` Andreas T.Auer
  1 sibling, 2 replies; 45+ messages in thread
From: Pavel Machek @ 2009-03-29 12:50 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Artem Bityutskiy, Linux Kernel Mailing List


>>> We have a problem that user-space people do not want to
>>> use 'fsync()', even when they are pointed to their code
>>> which is doing create/write/rename/close without fsync().
>>
>> Well... they really don't want to spin the disk up for the
>> fsync(). I'm not sure if fsync() is really sensible operation to use
>> there.
>
> I'm personally concerned about hand-held, and in case of UBIFS
> fsync is not too expensive - we work on flash and on fsync() we
> write back only the stuff belonging to inode in question, and
> nothing else.

Well, I'm more concerned about spinning disks, having one even in my
Zaurus. And I do believe that fsync() will write more data than
necessary even in the flash case.

>>> 1. truncate/write/close leads to empty files
>>
>> this is buggy.
>
> In FS, or in application?

The application is buggy; there is no way the kernel can help there.

>>> 2. create/write/rename leads to empty files
>>
>> ..but this should not be. If we want to make that explicit, we should
>> provide "replace()" operation; where replace is rename that makes sure
>> that source file is completely on media before commiting the rename.
>
> Well, OK, we can fsync() before rename, we just need clean rules
> for this, so that all Linux FSes would follow them. Would be nice
> to have final agreement on all this stuff.

My proposal is:

rename() stays.

replace(src, bar) is a rename that ensures that bar will contain valid
data after a power failure.

>> It is somehow similar to fsync()/rename(), but does not force disk
>> spin up immediately -- it only inserts "barrier" between data blocks
>> and rename. (And yes, it should be implemented as fsync()+rename() for
>> filesystems like xfs. It can be implemented as plain rename for ext3
>> and ext4 after the fixes...)
>
> Right. But I guess only few file-systems would really implement
> this, because this is complex.

Complex yes, but at least ext3+ext4+btrfs should, and they really have
90% of "market share" :-). ext3 and ext4 implementations are already
done :-).
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: EXT4-ish "fixes" in UBIFS
  2009-03-29 12:31   ` Artem Bityutskiy
@ 2009-03-29 12:54     ` Artem Bityutskiy
  0 siblings, 0 replies; 45+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 12:54 UTC (permalink / raw)
  To: Kyungmin Park; +Cc: Artem.Bityutskiy, Linux Kernel Mailing List

Artem Bityutskiy wrote:
> Kyungmin Park wrote:
>> I also got these request. the file is empty at rename operatoin in
>> case of sudden power off.
>> they say it's different from jffs2. in case of jffs2, it points old
>> files even though power off.
> 
> Right, because JFFS2 is synchronous :-)
> 
>> then why is UBIFS different. fix it as before. I said it's not
>> filesystem bug. it's expected behaviors.
> 
> Right, this is what I've been always thinking. I've always been
> thinking the FS gives no guarantees, and if you want a 100%
> guarantee, use fsync() before renaming. Frankly, I still think
> so. But we'll make ext4-like changes in UBIFS as well to help
> the applications which do not do the sync.
> 
>> Frankly I'm not sure which one is better. how much filesystem support
>> it. but remember that application programmer also don't want to change
>> their application when filesystem is changed.
>> "The application is not changed, only filesystem is changed. so it's
>> filesystem problem, not us"
> 
> I hope Linux gurus will put it clearly after all - to fsync() or to
> not fsync(). We do need clear rules of the game. For now, I still
> assume the following:
> 
> 1. If applications want atomic update which gives 100% guarantee,
>   they should fsync before rename.
> 2. If the application does not use fsync, FS should try to minimize
>   the probability of data loss by running asynchronous write-back
>   on rename which unlinks a direntry.
> 3. All this performance vs. reliability hassle should be solved
>   by fixing the FS, by having good defaults, by having a
>   "fsync/not fsync" knobs in applications.
> 
> Indeed, people mostly talk about ext3, desktops, etc. But there
> is also the embedded world, where battery is removed randomly.

Let me elaborate on why I bring up embedded. Looking at the
"Linux-2.6.29" thread, it _seems_ people assume that it is enough
if the FS starts _asynchronous_ write-back after rename, so that
dirty data does not sit in the cache for a long time. E.g., many
people are happy with ext3's 5 seconds. So to me it seems that
some people do not care about 100% atomicity guarantees; they are
fine with just a low probability of data loss.

What I am saying is that in embedded we need 100% atomic updates,
because our power cuts may be frequent and random. And at the
moment only fsync() before rename can guarantee this.

And updating a file using truncate/rewrite does not guarantee
anything at all.
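
For illustration, the fsync()-before-rename update could look roughly
like the C sketch below. The function name `atomic_update`, the ".tmp"
suffix and the `durable` flag are assumptions made for this example;
`durable` plays the role of the "fsync/no fsync" knob, and nothing here
is taken from UBIFS or from any particular application.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace `path` with `len` bytes from `data`
 * using create/write/[fsync]/rename on a temporary file.
 * If `durable` is non-zero, fsync() the temporary file before the
 * rename(); if zero, the fsync() is skipped and the outcome after a
 * power cut depends on the file system.
 * Returns 0 on success, -1 on error.
 */
int atomic_update(const char *path, const void *data, size_t len, int durable)
{
    char tmp[4096];
    const char *p = data;
    int fd, n;

    assert(path != NULL && data != NULL);

    n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be partial, so loop until everything is written. */
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0)
            goto fail;
        p += w;
        len -= (size_t)w;
    }

    /* The step under discussion: flush the new data before renaming. */
    if (durable && fsync(fd) < 0)
        goto fail;

    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }
    if (rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

A call such as atomic_update("settings.conf", buf, len, 1) never
truncates the old file in place, which is the difference from the
truncate/rewrite pattern.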

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 12:50     ` Pavel Machek
@ 2009-03-29 13:00       ` Artem Bityutskiy
  2009-03-29 13:02         ` Pavel Machek
  2009-03-30 17:19       ` Ric Wheeler
  1 sibling, 1 reply; 45+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:00 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

Pavel Machek wrote:
>>>> 2. create/write/rename leads to empty files
>>> ..but this should not be. If we want to make that explicit, we should
>>> provide "replace()" operation; where replace is rename that makes sure
>>> that source file is completely on media before commiting the rename.
>> Well, OK, we can fsync() before rename, we just need clean rules
>> for this, so that all Linux FSes would follow them. Would be nice
>> to have final agreement on all this stuff.
> 
> My proposal is 
> 
> rename() stays.

It stays and:

1. does _not_ fsync
2. has synchronous fsync added
3. has asynchronous fsync added?

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 12:42   ` Artem Bityutskiy
  2009-03-29 12:50     ` Pavel Machek
@ 2009-03-29 13:01     ` Andreas T.Auer
  2009-03-29 13:06       ` Artem Bityutskiy
  1 sibling, 1 reply; 45+ messages in thread
From: Andreas T.Auer @ 2009-03-29 13:01 UTC (permalink / raw)
  To: Artem Bityutskiy, Pavel Machek
  Cc: Artem Bityutskiy, Linux Kernel Mailing List



On 29.03.2009 14:42 Artem Bityutskiy wrote:
> Pavel Machek wrote:
>
>>> 1. truncate/write/close leads to empty files
>>
>> this is buggy.
>
> In FS, or in application?
In the application, of course. If you rewrite a huge file that way, you run
a prolonged risk of losing data in a crash, even with synchronous writes.
>
>>> 2. create/write/rename leads to empty files
In that case, the window of risk is reduced to the rename itself, from the
viewpoint of application developers who are not familiar with modern
re-ordering filesystems.

>> ..but this should not be. If we want to make that explicit, we should
>> provide "replace()" operation; where replace is rename that makes sure
>> that source file is completely on media before commiting the rename.
It is a hard task to change all the applications; there are a lot of
orphaned projects which are still in use.
> Well, OK, we can fsync() before rename, we just need clean rules
> for this, so that all Linux FSes would follow them. Would be nice
> to have final agreement on all this stuff.
>
This slows things down, but you could also delay writing the
metadata that points to non-existing data. Or is there any use for it after
the crash?



* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:00       ` Artem Bityutskiy
@ 2009-03-29 13:02         ` Pavel Machek
  2009-03-29 13:07           ` Artem Bityutskiy
  0 siblings, 1 reply; 45+ messages in thread
From: Pavel Machek @ 2009-03-29 13:02 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
> Pavel Machek wrote:
>>>>> 2. create/write/rename leads to empty files
>>>> ..but this should not be. If we want to make that explicit, we should
>>>> provide "replace()" operation; where replace is rename that makes sure
>>>> that source file is completely on media before commiting the rename.
>>> Well, OK, we can fsync() before rename, we just need clean rules
>>> for this, so that all Linux FSes would follow them. Would be nice
>>> to have final agreement on all this stuff.
>>
>> My proposal is 
>>
>> rename() stays.
>
> It stays and:
>
> 1. does _not_ fsync

Does not fsync. If someone wants to make sure one of the files is on
the disk, he should use replace(). [On non-Linux systems, replace()
should be implemented as fsync/rename in libc or something.]
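
The fsync/rename fallback mentioned in the brackets could be sketched in
C as follows. Note that replace() is only a proposal in this thread and
does not exist as a system call or libc function; the name
`replace_fallback` is hypothetical and chosen for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical user-space fallback for the proposed replace():
 * flush the source file to media with fsync(), then rename() it
 * over the destination. Returns 0 on success, -1 on error.
 */
int replace_fallback(const char *src, const char *dst)
{
    int fd, ret;

    assert(src != NULL && dst != NULL);

    /* The source was just written by the caller, so it is writable. */
    fd = open(src, O_WRONLY);
    if (fd < 0)
        return -1;

    ret = fsync(fd);
    close(fd);
    if (ret < 0)
        return -1;

    return rename(src, dst);
}
```

Unlike the proposed in-kernel version, this fallback does force the
data out immediately, so it carries the full cost of fsync().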
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:01     ` Andreas T.Auer
@ 2009-03-29 13:06       ` Artem Bityutskiy
  0 siblings, 0 replies; 45+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:06 UTC (permalink / raw)
  To: Andreas T.Auer; +Cc: Artem Bityutskiy, Pavel Machek, Linux Kernel Mailing List

Andreas T.Auer wrote:
> 
> On 29.03.2009 14:42 Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>
>>>> 1. truncate/write/close leads to empty files
>>> this is buggy.
>> In FS, or in application?
> In application of course. If you rewrite a huge file that way, you have
> a long-time risk of loosing data in a crash, even with sychronous writes.

You know, after reading all these blogs and discussions,
I would not be surprised if someone said this is an FS bug.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:02         ` Pavel Machek
@ 2009-03-29 13:07           ` Artem Bityutskiy
  2009-03-29 13:22             ` Andreas T.Auer
  2009-03-29 13:40             ` Pavel Machek
  0 siblings, 2 replies; 45+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:07 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

Pavel Machek wrote:
> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>>>>> 2. create/write/rename leads to empty files
>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>> that source file is completely on media before commiting the rename.
>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>> to have final agreement on all this stuff.
>>> My proposal is 
>>>
>>> rename() stays.
>> It stays and:
>>
>> 1. does _not_ fsync
> 
> Does not fsync. If someone wants to make sure one of the files is on
> the disk, he should use replace(). [On non-linux systems, replace()
> should be implemented as fsync/rename in libc or something.]

I would be happy with these rules. But the fact is, application
people just refuse to add fsync before rename. They say that the
FS has to do this. And they say that even Linus supports them,
which is an argument I find difficult to fight against. This is
why I want clean rules.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:07           ` Artem Bityutskiy
@ 2009-03-29 13:22             ` Andreas T.Auer
  2009-03-29 13:55               ` Artem Bityutskiy
  2009-03-29 13:40             ` Pavel Machek
  1 sibling, 1 reply; 45+ messages in thread
From: Andreas T.Auer @ 2009-03-29 13:22 UTC (permalink / raw)
  To: Artem.Bityutskiy
  Cc: Pavel Machek, Artem Bityutskiy, Linux Kernel Mailing List



On 29.03.2009 15:07 Artem Bityutskiy wrote:
> Pavel Machek wrote:
>>
>> Does not fsync. If someone wants to make sure one of the files is on
>> the disk, he should use replace(). [On non-linux systems, replace()
>> should be implemented as fsync/rename in libc or something.]
>
As a user I will avoid any fs that requires tons of
applications to be changed for a reasonable amount of data safety.
> I would be happy with these rules. But the fact is, application
> people just refuse to add fsync before rename.
Because it hurts performance.
> They say that the
> FS has to do this. 
They say that the FS should not write metadata for non-existing data, let
alone overwrite "clean" metadata with "dirty" metadata. It is up to the
fs to decide whether fsync is needed to achieve this.




* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:07           ` Artem Bityutskiy
  2009-03-29 13:22             ` Andreas T.Auer
@ 2009-03-29 13:40             ` Pavel Machek
  2009-03-29 13:57               ` Artem Bityutskiy
  1 sibling, 1 reply; 45+ messages in thread
From: Pavel Machek @ 2009-03-29 13:40 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Sun 2009-03-29 16:07:35, Artem Bityutskiy wrote:
> Pavel Machek wrote:
>> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>>> Pavel Machek wrote:
>>>>>>> 2. create/write/rename leads to empty files
>>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>>> that source file is completely on media before commiting the rename.
>>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>>> to have final agreement on all this stuff.
>>>> My proposal is 
>>>>
>>>> rename() stays.
>>> It stays and:
>>>
>>> 1. does _not_ fsync
>>
>> Does not fsync. If someone wants to make sure one of the files is on
>> the disk, he should use replace(). [On non-linux systems, replace()
>> should be implemented as fsync/rename in libc or something.]
>
> I would be happy with these rules. But the fact is, application
> people just refuse to add fsync before rename. They say that the
> FS has to do this. And they say that even Linus supports them,

That's good. fsync before rename would be an ugly regression (on ext3 at
least). We should get them to use a replace() syscall, not get them to
add fsyncs. [Of course, that means we need the replace syscall first. :-)]
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:22             ` Andreas T.Auer
@ 2009-03-29 13:55               ` Artem Bityutskiy
  0 siblings, 0 replies; 45+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:55 UTC (permalink / raw)
  To: Andreas T.Auer; +Cc: Pavel Machek, Artem Bityutskiy, Linux Kernel Mailing List

Andreas T.Auer wrote:
> On 29.03.2009 15:07 Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>> Does not fsync. If someone wants to make sure one of the files is on
>>> the disk, he should use replace(). [On non-linux systems, replace()
>>> should be implemented as fsync/rename in libc or something.]
> As a user I will avoid using any fs, which requires the tons of
> applications to be changed for a reasonable amount of data safety.
>> I would be happy with these rules. But the fact is, application
>> people just refuse to add fsync before rename.
> Because it slows down the performance.
>> They say that the
>> FS has to do this. 
> They say that FS should not write metadata for non-existing data and
> even overwrite "clean" metadata with "dirty" metadata. It is up to the
> fs to decide, whether fsync is needed to achieve this.
 
Well, this makes sense, but the fact is that FS developers did
not keep this in mind. And when we were developing UBIFS,
we also naively assumed that user-space would just call fsync
if needed. And it was easier to implement things this way. And
it looked like POSIX and other Linux FSes assumed that.

Well, we can change UBIFS behavior, but it would be nice
to have some agreement on all this.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:40             ` Pavel Machek
@ 2009-03-29 13:57               ` Artem Bityutskiy
  2009-03-29 14:00                 ` Pavel Machek
  0 siblings, 1 reply; 45+ messages in thread
From: Artem Bityutskiy @ 2009-03-29 13:57 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

Pavel Machek wrote:
> On Sun 2009-03-29 16:07:35, Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>>>> Pavel Machek wrote:
>>>>>>>> 2. create/write/rename leads to empty files
>>>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>>>> that source file is completely on media before commiting the rename.
>>>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>>>> to have final agreement on all this stuff.
>>>>> My proposal is 
>>>>>
>>>>> rename() stays.
>>>> It stays and:
>>>>
>>>> 1. does _not_ fsync
>>> Does not fsync. If someone wants to make sure one of the files is on
>>> the disk, he should use replace(). [On non-linux systems, replace()
>>> should be implemented as fsync/rename in libc or something.]
>> I would be happy with these rules. But the fact is, application
>> people just refuse to add fsync before rename. They say that the
>> FS has to do this. And they say that even Linus supports them,
> 
> That's good. fsync before rename would be ugly regression (on ext3 at
> least). We should get them to use replace() syscall, not get them to
> add fsyncs. [Of course, that means we need replace syscall first. :-)]

I'd say it is better to fix ext3 then.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 13:57               ` Artem Bityutskiy
@ 2009-03-29 14:00                 ` Pavel Machek
  0 siblings, 0 replies; 45+ messages in thread
From: Pavel Machek @ 2009-03-29 14:00 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Sun 2009-03-29 16:57:06, Artem Bityutskiy wrote:
> Pavel Machek wrote:
>> On Sun 2009-03-29 16:07:35, Artem Bityutskiy wrote:
>>> Pavel Machek wrote:
>>>> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>>>>> Pavel Machek wrote:
>>>>>>>>> 2. create/write/rename leads to empty files
>>>>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>>>>> that source file is completely on media before commiting the rename.
>>>>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>>>>> to have final agreement on all this stuff.
>>>>>> My proposal is 
>>>>>>
>>>>>> rename() stays.
>>>>> It stays and:
>>>>>
>>>>> 1. does _not_ fsync
>>>> Does not fsync. If someone wants to make sure one of the files is on
>>>> the disk, he should use replace(). [On non-linux systems, replace()
>>>> should be implemented as fsync/rename in libc or something.]
>>> I would be happy with these rules. But the fact is, application
>>> people just refuse to add fsync before rename. They say that the
>>> FS has to do this. And they say that even Linus supports them,
>>
>> That's good. fsync before rename would be ugly regression (on ext3 at
>> least). We should get them to use replace() syscall, not get them to
>> add fsyncs. [Of course, that means we need replace syscall first. :-)]
>
> I'd say it is better to fix ext3 then.

? I don't get this.

ext3's rename() is already equivalent to the proposed replace(). The
problem is that btrfs's and ubifs's renames are not.

So doing an extra fsync() on ext3 is actually a performance regression
-> we do not want applications to randomly add open-coded fsync() calls.
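
For reference, the open-coded variant argued about here is the
"write a temporary file, fsync it, then rename it over the target"
sequence. A minimal sketch follows, written as a shell function in the
style of the script posted later in this thread. safe_replace is a
hypothetical name, and the sketch assumes a dd that supports conv=fsync
(GNU coreutils or busybox) and a mktemp that accepts a path template.
It does not fsync the containing directory.

```shell
#!/bin/sh
# Sketch only: safe_replace is a hypothetical helper, not an existing tool.
# Usage: some_command | safe_replace /path/to/target
safe_replace() {
    target=$1
    # Create the temporary file in the target's directory so that mv is a
    # plain rename() and never a cross-filesystem copy.
    # Note: mktemp creates the file with mode 0600, which the target inherits.
    tmp=$(mktemp "$(dirname "$target")/.tmp.XXXXXX") || return 1
    # dd copies stdin into the temporary file; conv=fsync makes dd call
    # fsync() on the output file before it exits.
    if dd of="$tmp" conv=fsync 2>/dev/null && mv -f "$tmp" "$target"; then
        return 0
    fi
    # Something failed: leave the old target untouched and clean up.
    rm -f "$tmp"
    return 1
}
```

On a filesystem whose rename() already orders data before the rename,
the fsync step in this sketch is exactly the extra cost being objected to.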

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 12:26 ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Pavel Machek
  2009-03-29 12:42   ` Artem Bityutskiy
@ 2009-03-30 15:58   ` Diego Calleja
  1 sibling, 0 replies; 45+ messages in thread
From: Diego Calleja @ 2009-03-30 15:58 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Sunday 29 March 2009 14:26:00 Pavel Machek wrote:

> ...but this should not be. If we want to make that explicit, we should
> provide "replace()" operation; where replace is rename that makes sure
> that source file is completely on media before commiting the rename.

An "ad Linus-em" counterexample:

"And if we have a Linux-specific magic system call or sync action, it's 
going to be even more rarely used than fsync(). Do you think anybody 
really uses the OS X FSYNC_FULL ioctl? Nope. Outside of a few databases, 
it is almost certainly not going to be used, and fsync() will not be 
reliable in general.

So rather than come up with new barriers that nobody will use, filesystem 
people should aim to make "badly written" code "just work" unless people 
are really really unlucky. Because like it or not, that's what 99% of all 
code is."

* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-29 12:50     ` Pavel Machek
  2009-03-29 13:00       ` Artem Bityutskiy
@ 2009-03-30 17:19       ` Ric Wheeler
  2009-03-30 22:11         ` Pavel Machek
  1 sibling, 1 reply; 45+ messages in thread
From: Ric Wheeler @ 2009-03-30 17:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Artem Bityutskiy, Artem Bityutskiy, Linux Kernel Mailing List

Pavel Machek wrote:
>>>> We have a problem that user-space people do not want to
>>>> use 'fsync()', even when they are pointed to their code
>>>> which is doing create/write/rename/close without fsync().
>>>>         
>>> Well... they really don't want to spin the disk up for the
>>> fsync(). I'm not sure if fsync() is really sensible operation to use
>>> there.
>>>       
>> I'm personally concerned about hand-held, and in case of UBIFS
>> fsync is not too expensive - we work on flash and on fsync() we
>> write back only the stuff belonging to inode in question, and
>> nothing else.
>>     
>
> Well, I'm more concerned about spinning disks, having one even in my
> zaurus. And I do believe that fsync() will write more data than
> neccessary even in flash case.
>
>   
>>>> 1. truncate/write/close leads to empty files
>>>>         
>>> this is buggy.
>>>       
>> In FS, or in application?
>>     
>
> Application is buggy; no way kernel can help there.
>
>   
>>>> 2. create/write/rename leads to empty files
>>>>         
>>> ..but this should not be. If we want to make that explicit, we should
>>> provide "replace()" operation; where replace is rename that makes sure
>>> that source file is completely on media before commiting the rename.
>>>       
>> Well, OK, we can fsync() before rename, we just need clean rules
>> for this, so that all Linux FSes would follow them. Would be nice
>> to have final agreement on all this stuff.
>>     
>
> My proposal is 
>
> rename() stays.
>
> replace(src, bar) is rename that ensures that bar will contain valid
> data after powerfail.
>   

Surely the only way to "ensure" this is to spin up the drive, write the 
meta-data and data back and make sure that it is not held in volatile 
write cache?

Why would calling this replace be better or more power efficient than 
what you need to do today?

ric

>   
>>> It is somehow similar to fsync()/rename(), but does not force disk
>>> spin up immediately -- it only inserts "barrier" between data blocks
>>> and rename. (And yes, it should be implemented as fsync()+rename() for
>>> filesystems like xfs. It can be implemented as plain rename for ext3
>>> and ext4 after the fixes...)
>>>       
>> Right. But I guess only few file-systems would really implement
>> this, because this is complex.
>>     
>
> Complex yes, but at least ext3+ext4+btrfs should, and they really have
> 90% of "market share" :-). ext3 and ext4 implementations are already
> done :-).
> 								Pavel
>   


* Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)
  2009-03-30 17:19       ` Ric Wheeler
@ 2009-03-30 22:11         ` Pavel Machek
  0 siblings, 0 replies; 45+ messages in thread
From: Pavel Machek @ 2009-03-30 22:11 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Artem Bityutskiy, Artem Bityutskiy, Linux Kernel Mailing List

Hi!

>> My proposal is 
>>
>> rename() stays.
>>
>> replace(src, bar) is rename that ensures that bar will contain valid
>> data after powerfail.
>>   
>
> Surely the only way to "ensure" this is to spin up the drive, write the
> meta-data and data back and make sure that it is not held in volatile  
> write cache?

Well, no. "Will contain valid data", but it may contain _old_ valid data.

So the way to do that would be "wait until you have to spin the disk up
anyway or until a timeout, then write the data first, then do the rename".

AFAICT those are the semantics GNOME (etc.) wants.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: EXT4-ish "fixes" in UBIFS
  2009-03-27 12:48 EXT4-ish "fixes" in UBIFS Artem Bityutskiy
  2009-03-28  1:22 ` Kyungmin Park
  2009-03-29 12:26 ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Pavel Machek
@ 2009-04-03  0:09 ` Christian Kujau
  2009-04-03  0:24   ` Trenton D. Adams
                     ` (2 more replies)
  2 siblings, 3 replies; 45+ messages in thread
From: Christian Kujau @ 2009-04-03  0:09 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Linux Kernel Mailing List

On Fri, 27 Mar 2009, Artem Bityutskiy wrote:
> They just say - this is file-system bug, it is fixed in
> ext4 now, just fix the bug in UBIFS.

Would *mounting* the filesystem with "-o sync" help? This way no 
filesystem "fixes" are needed and userland would not have to be rewritten.

Christian.
-- 
Alice and Bob met for the first time at Bruce Schneier's pool-party

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  0:09 ` EXT4-ish "fixes" in UBIFS Christian Kujau
@ 2009-04-03  0:24   ` Trenton D. Adams
  2009-04-03  0:28     ` Trenton D. Adams
  2009-04-03  2:05   ` Theodore Tso
  2009-04-03  6:53   ` Artem Bityutskiy
  2 siblings, 1 reply; 45+ messages in thread
From: Trenton D. Adams @ 2009-04-03  0:24 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 6:09 PM, Christian Kujau <lists@nerdbynature.de> wrote:
> On Fri, 27 Mar 2009, Artem Bityutskiy wrote:
>> They just say - this is file-system bug, it is fixed in
>> ext4 now, just fix the bug in UBIFS.
>
> Would *mounting* the filesystem with "-o sync" help? This way no
> filesystem "fixes" are needed and userland would not have to be rewritten.
>
> Christian.

Yes, mounting "-o sync" does improve ext3 performance.  It sucks
though, because I do want quick writes.  And mounting with sync option
slows down to disk io speeds.  In my case, that's between 20 and 23
megabytes per second *big frown, quivering lip, and tears in my eyes*.
:P

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  0:24   ` Trenton D. Adams
@ 2009-04-03  0:28     ` Trenton D. Adams
  2009-04-03  0:38       ` Christian Kujau
  2009-04-03  1:55       ` David Rees
  0 siblings, 2 replies; 45+ messages in thread
From: Trenton D. Adams @ 2009-04-03  0:28 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 6:24 PM, Trenton D. Adams
<trenton.d.adams@gmail.com> wrote:
> On Thu, Apr 2, 2009 at 6:09 PM, Christian Kujau <lists@nerdbynature.de> wrote:
>> On Fri, 27 Mar 2009, Artem Bityutskiy wrote:
>>> They just say - this is file-system bug, it is fixed in
>>> ext4 now, just fix the bug in UBIFS.
>>
>> Would *mounting* the filesystem with "-o sync" help? This way no
>> filesystem "fixes" are needed and userland would not have to be rewritten.
>>
>> Christian.
>
> Yes, mounting "-o sync" does improve ext3 performance.  It sucks
> though, because I do want quick writes.  And mounting with sync option
> slows down to disk io speeds.  In my case, that's between 20 and 23
> megabytes per second *big frown, quivering lip, and tears in my eyes*.
> :P
>

Oh, I should have clarified.  It improves performance under heavy
load.  Under normal load, mounting without sync is fine.  What I tend
to do is mount with "remount,rw,sync" when heavy load is starting.
Then my system goes slowly, but latency is good.  Then, when it's all
done (say a big compile, or job, or whatever), I remount without sync
again.

I'm thinking of writing a script that monitors performance, and
remounts as needed, lol.  WHAT A HACK. hehe.

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  0:28     ` Trenton D. Adams
@ 2009-04-03  0:38       ` Christian Kujau
  2009-04-03  0:54         ` Trenton D. Adams
  2009-04-03  0:54         ` Trenton D. Adams
  2009-04-03  1:55       ` David Rees
  1 sibling, 2 replies; 45+ messages in thread
From: Christian Kujau @ 2009-04-03  0:38 UTC (permalink / raw)
  To: Trenton D. Adams; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Thu, 2 Apr 2009, Trenton D. Adams wrote:
> Oh, I should have clarified.  It improves performance under heavy
> load.  Under normal load, mounting without sync is fine.  What I tend
> to do is mount with "remount,rw,sync" when heavy load is starting.

Really? How does mounting with "-o sync" *improve* performance? I am 
certainly aware that mounting with "-o sync" has severe performance 
impacts, but was proposing it anyway *only* to tackle the data integrity 
problem. However, I'm curious if use cases in the embedded world are 
equally affected by this.

> I'm thinking of writing a script that monitors performance, and
> remounts as needed, lol.  WHAT A HACK. hehe.

Ugh....my brain hurts :-\

Christian.
-- 
Bruce Schneier once found three distinct natural number divisors of a prime
number.

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  0:38       ` Christian Kujau
@ 2009-04-03  0:54         ` Trenton D. Adams
  2009-04-03  0:54         ` Trenton D. Adams
  1 sibling, 0 replies; 45+ messages in thread
From: Trenton D. Adams @ 2009-04-03  0:54 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 6:38 PM, Christian Kujau <lists@nerdbynature.de> wrote:
> On Thu, 2 Apr 2009, Trenton D. Adams wrote:
>> Oh, I should have clarified.  It improves performance under heavy
>> load.  Under normal load, mounting without sync is fine.  What I tend
>> to do is mount with "remount,rw,sync" when heavy load is starting.
>
> Really? How does mounting with "-o sync" *improve* performance? I am
> certainly aware that mounting with "-o sync" has severe performance
> impacts, but was proposing it anyway *only* to tackle the data integrity
> problem. However, I'm curious if use cases in the embedded world are
> equally affected by this.
>

Oh, well for my system, if I do heavy IO, my *fsync* performance drops
like a rock.  fsync on even 1M takes 15-20 seconds at times.  I have
even seen 50 seconds.  If I mount with sync option, the fsyncs of 1M
take only a couple hundred milliseconds, while the other heavy IO is
happening.

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  0:38       ` Christian Kujau
  2009-04-03  0:54         ` Trenton D. Adams
@ 2009-04-03  0:54         ` Trenton D. Adams
  2009-04-03  0:59           ` Trenton D. Adams
  1 sibling, 1 reply; 45+ messages in thread
From: Trenton D. Adams @ 2009-04-03  0:54 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 6:38 PM, Christian Kujau <lists@nerdbynature.de> wrote:
> On Thu, 2 Apr 2009, Trenton D. Adams wrote:
>> I'm thinking of writing a script that monitors performance, and
>> remounts as needed, lol.  WHAT A HACK. hehe.
>
> Ugh....my brain hurts :-\
>
> Christian.

Yeah, mine too.

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  0:54         ` Trenton D. Adams
@ 2009-04-03  0:59           ` Trenton D. Adams
  0 siblings, 0 replies; 45+ messages in thread
From: Trenton D. Adams @ 2009-04-03  0:59 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 6:54 PM, Trenton D. Adams
<trenton.d.adams@gmail.com> wrote:
> On Thu, Apr 2, 2009 at 6:38 PM, Christian Kujau <lists@nerdbynature.de> wrote:
>> On Thu, 2 Apr 2009, Trenton D. Adams wrote:
>>> I'm thinking of writing a script that monitors performance, and
>>> remounts as needed, lol.  WHAT A HACK. hehe.
>>
>> Ugh....my brain hurts :-\
>>
>> Christian.
>
> Yeah, mine too.
>

Just to make it hurt more for you, here you go...

On one console I run...
dd if=/dev/zero of=/tmp/bigfile bs=1M count=2000

On another I run...
perf-mon.sh
remounting with sync option, performance dropping
remounting without sync option, performance has stabilized

It may be better to write a C program that does a 1M fsync, and if
it's taking too long, then remount, lol.  Also, this script here,
using 1 min load average, will catch CPU intensity as well, which is
not really what I want.  Ah, it is a hack indeed. ROFL

#!/bin/sh
# Remount / with "sync" while the 1-minute load average is above 1,
# and without it once the load has dropped again.

while true; do
    # The first field of /proc/loadavg is the 1-minute load average; unlike
    # a fixed field of the uptime output, it does not move with the uptime.
    LOAD=$(cut -d ' ' -f1 /proc/loadavg)
    if [ "$(echo "$LOAD > 1" | bc)" -eq 1 ]; then
        if ! mount | grep -Eq 's-sys.*sync'; then
            echo "remounting with sync option, performance dropping"
            mount -o remount,rw,sync /dev/s/sys /
        fi
    else
        if mount | grep -Eq 's-sys.*sync'; then
            echo "remounting without sync option, performance has stabilized"
            mount -o remount,rw /dev/s/sys /
        fi
    fi
    sleep 1
done
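
The "1M fsync" probe suggested above does not strictly need a C
program. A rough sketch in shell, under stated assumptions:
fsync_latency_ms is a hypothetical helper name, dd must support
conv=fsync, and date must support the GNU %N (nanoseconds) format.
It measures the write plus the fsync, not the fsync alone.

```shell
#!/bin/sh
# Sketch only: print how many milliseconds it takes to write and fsync
# 1 MiB into the given directory (default: current directory).
# Assumes GNU-style dd (conv=fsync, bs=1M) and GNU date (%N).
fsync_latency_ms() {
    probe="${1:-.}/.fsync-probe.$$"
    start=$(date +%s%N)
    # conv=fsync makes dd call fsync() on the output file before exiting.
    if ! dd if=/dev/zero of="$probe" bs=1M count=1 conv=fsync 2>/dev/null; then
        rm -f "$probe"
        return 1
    fi
    end=$(date +%s%N)
    rm -f "$probe"
    # Nanoseconds to milliseconds.
    echo $(( (end - start) / 1000000 ))
}
```

A monitoring loop could compare the printed value against a threshold
instead of looking at the load average, which also reacts to CPU load.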

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  0:28     ` Trenton D. Adams
  2009-04-03  0:38       ` Christian Kujau
@ 2009-04-03  1:55       ` David Rees
  2009-04-03  2:05         ` Trenton D. Adams
  2009-04-03  2:26         ` Trenton D. Adams
  1 sibling, 2 replies; 45+ messages in thread
From: David Rees @ 2009-04-03  1:55 UTC (permalink / raw)
  To: Trenton D. Adams
  Cc: Christian Kujau, Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 5:28 PM, Trenton D. Adams
<trenton.d.adams@gmail.com> wrote:
> On Thu, Apr 2, 2009 at 6:24 PM, Trenton D. Adams
> <trenton.d.adams@gmail.com> wrote:
>> Yes, mounting "-o sync" does improve ext3 performance.  It sucks
>> though, because I do want quick writes.  And mounting with sync option
>> slows down to disk io speeds.  In my case, that's between 20 and 23
>> megabytes per second *big frown, quivering lip, and tears in my eyes*.
>> :P
>>
>
> Oh, I should have clarified.  It improves performance under heavy
> load.  Under normal load, mounting without sync is fine.  What I tend
> to do is mount with "remount,rw,sync" when heavy load is starting.
> Then my system goes slowly, but latency is good.  Then, when it's all
> done (say a big compile, or job, or whatever), I remount without sync
> again.
>
> I'm thinking of writing a script that monitors performance, and
> remounts as needed, lol.  WHAT A HACK. hehe.

All you're doing here is implementing the lowering of dirty data
limits in the VM dynamically based on how long fsyncs take.

Linus outlined this specific strategy as "the ideal situation"
somewhere in the depths of "That filesystem thread".

Look at the new in 2.6.29 dirty*bytes parameters in
Documentation/sysctl/vm.txt for more info.  By lowering those values,
you can effectively turn normal writes into synchronous writes which
will greatly reduce latency of fsync under heavy write load.

In previous kernels you can tweak dirty_ratio and
dirty_background_ratio, but they don't have the granularity of the new
knobs.  Although if you are talking about just remounting in sync
mode, they may work for you at least as a proof of concept. ;-)
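
As an illustration of the knobs mentioned above, here is a small sketch
for inspecting and lowering them. The sysctl files vm.dirty_bytes and
vm.dirty_background_bytes are the 2.6.29 additions; the helper names and
the VM_DIR override are hypothetical and exist only so the functions
can be tried against a scratch directory. Writing the real files needs
root, and writing a *_bytes knob makes its *_ratio counterpart read as 0.

```shell
#!/bin/sh
# Sketch only: helper names and VM_DIR are illustrative assumptions.
# VM_DIR can point at a scratch directory for a dry run.
VM_DIR=${VM_DIR:-/proc/sys/vm}

# Print the current dirty-memory limits that exist on this kernel.
show_dirty_limits() {
    for knob in dirty_bytes dirty_background_bytes \
                dirty_ratio dirty_background_ratio; do
        [ -r "$VM_DIR/$knob" ] && echo "$knob = $(cat "$VM_DIR/$knob")"
    done
    return 0
}

# set_dirty_limits <dirty_bytes> <dirty_background_bytes>
set_dirty_limits() {
    echo "$2" > "$VM_DIR/dirty_background_bytes" &&
    echo "$1" > "$VM_DIR/dirty_bytes"
}
```

For example, "set_dirty_limits 4194304 2097152" would cap dirty data at
4 MiB and start background writeback at 2 MiB; these numbers are purely
illustrative, not a recommendation.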

-Dave

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  0:09 ` EXT4-ish "fixes" in UBIFS Christian Kujau
  2009-04-03  0:24   ` Trenton D. Adams
@ 2009-04-03  2:05   ` Theodore Tso
  2009-04-03  2:45     ` Christian Kujau
  2009-04-03  6:53   ` Artem Bityutskiy
  2 siblings, 1 reply; 45+ messages in thread
From: Theodore Tso @ 2009-04-03  2:05 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 02, 2009 at 05:09:39PM -0700, Christian Kujau wrote:
> On Fri, 27 Mar 2009, Artem Bityutskiy wrote:
> > They just say - this is file-system bug, it is fixed in
> > ext4 now, just fix the bug in UBIFS.
> 
> Would *mounting* the filesystem with "-o sync" help? This way no 
> filesystem "fixes" are needed and userland would not have to be rewritten.

It will, but you might not like the performance....  the reason why
it's there is that some users might want the particular tradeoff, but
it probably wouldn't make a good default.

						- Ted

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  1:55       ` David Rees
@ 2009-04-03  2:05         ` Trenton D. Adams
  2009-04-03  2:19           ` David Rees
  2009-04-03  2:26         ` Trenton D. Adams
  1 sibling, 1 reply; 45+ messages in thread
From: Trenton D. Adams @ 2009-04-03  2:05 UTC (permalink / raw)
  To: David Rees; +Cc: Christian Kujau, Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 7:55 PM, David Rees <drees76@gmail.com> wrote:
> On Thu, Apr 2, 2009 at 5:28 PM, Trenton D. Adams
> <trenton.d.adams@gmail.com> wrote:
>> On Thu, Apr 2, 2009 at 6:24 PM, Trenton D. Adams
>> <trenton.d.adams@gmail.com> wrote:
>>> Yes, mounting "-o sync" does improve ext3 performance.  It sucks
>>> though, because I do want quick writes.  And mounting with sync option
>>> slows down to disk io speeds.  In my case, that's between 20 and 23
>>> megabytes per second *big frown, quivering lip, and tears in my eyes*.
>>> :P
>>>
>>
>> Oh, I should have clarified.  It improves performance under heavy
>> load.  Under normal load, mounting without sync is fine.  What I tend
>> to do is mount with "remount,rw,sync" when heavy load is starting.
>> Then my system goes slowly, but latency is good.  Then, when it's all
>> done (say a big compile, or job, or whatever), I remount without sync
>> again.
>>
>> I'm thinking of writing a script that monitors performance, and
>> remounts as needed, lol.  WHAT A HACK. hehe.
>
> All you're doing here is implementing the lowering of dirty data
> limits in the VM dynamically based on how long fsyncs take.
>
> Linus outlined this specific strategy as "the ideal situation"
> somewhere in the depths of "That filesystem thread".

I thought he said it was a HORRIBLE solution. :D  I recall him
slamming Andrew over it.  Unless you're referring to the kernel
actually doing it on the fly.

>
> Look at the new in 2.6.29 dirty*bytes parameters in
> Documentation/sysctl/vm.txt for more info.  By lowering those values,
> you can effectively turn normal writes into synchronous writes which
> will greatly reduce latency of fsync under heavy write load.
>
> In previous kernels you can tweak dirty_ratio and
> dirty_background_ratio, but they don't have the granularity of the new
> knobs.  Although if you are talking about just remounting in sync
> mode, they may work for you at least as a proof of concept. ;-)
>
> -Dave
>

dirty_ratio and dirty_background_ratio never really had any effect for me.
I'll look into the other parameters.  Waiting for the checkout again,
as I am currently under a heavy rsync load (*rolls eyes*).

Thanks.

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  2:05         ` Trenton D. Adams
@ 2009-04-03  2:19           ` David Rees
  2009-04-03  2:28             ` Trenton D. Adams
  0 siblings, 1 reply; 45+ messages in thread
From: David Rees @ 2009-04-03  2:19 UTC (permalink / raw)
  To: Trenton D. Adams
  Cc: Christian Kujau, Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 7:05 PM, Trenton D. Adams
<trenton.d.adams@gmail.com> wrote:
> On Thu, Apr 2, 2009 at 7:55 PM, David Rees <drees76@gmail.com> wrote:
>> On Thu, Apr 2, 2009 at 5:28 PM, Trenton D. Adams
>> <trenton.d.adams@gmail.com> wrote:
>>> Oh, I should have clarified.  It improves performance under heavy
>>> load.  Under normal load, mounting without sync is fine.  What I tend
>>> to do is mount with "remount,rw,sync" when heavy load is starting.
>>> Then my system goes slowly, but latency is good.  Then, when it's all
>>> done (say a big compile, or job, or whatever), I remount without sync
>>> again.
>>>
>>> I'm thinking of writing a script that monitors performance, and
>>> remounts as needed, lol.  WHAT A HACK. hehe.
>>
>> All you're doing here is implementing the lowering of dirty data
>> limits in the VM dynamically based on how long fsyncs take.
>>
>> Linus outlined this specific strategy as "the ideal situation"
>> somewhere in the depths of "That filesystem thread".
>
> I thought he said it was a HORRIBLE solution. :D  I recall him
> slamming Andrew over it.  Unless you're referring to the kernel
> actually doing it on the fly.

Yes - you are correct - userspace isn't the best place to do it - but
if you can do it there, the same ideas could then be pushed into the
kernel and further enhanced.

>> Look at the new in 2.6.29 dirty*bytes parameters in
>> Documentation/sysctl/vm.txt for more info.  By lowering those values,
>> you can effectively turn normal writes into synchronous writes which
>> will greatly reduce latency of fsync under heavy write load.
>>
>> In previous kernels you can tweak dirty_ratio and
>> dirty_background_ratio, but they don't have the granularity of the new
>> knobs.  Although if you are talking about just remounting in sync
>> mode, they may work for you at least as a proof of concept. ;-)
>
> dirty_ratio and dirty_background_ratio never really had any effect for me.
> I'll look into the other parameters.  Waiting for the checkout again,
> as I am currently under a heavy rsync load (*rolls eyes*).

How low have you set them?  Try setting them to 2 and 1 respectively.
It cuts down fsync latencies by a significant amount in my experience.

-Dave

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  1:55       ` David Rees
  2009-04-03  2:05         ` Trenton D. Adams
@ 2009-04-03  2:26         ` Trenton D. Adams
  1 sibling, 0 replies; 45+ messages in thread
From: Trenton D. Adams @ 2009-04-03  2:26 UTC (permalink / raw)
  To: David Rees; +Cc: Christian Kujau, Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 7:55 PM, David Rees <drees76@gmail.com> wrote:
> On Thu, Apr 2, 2009 at 5:28 PM, Trenton D. Adams
> <trenton.d.adams@gmail.com> wrote:
>> On Thu, Apr 2, 2009 at 6:24 PM, Trenton D. Adams
>> <trenton.d.adams@gmail.com> wrote:
>>> Yes, mounting "-o sync" does improve ext3 performance.  It sucks
>>> though, because I do want quick writes.  And mounting with sync option
>>> slows down to disk io speeds.  In my case, that's between 20 and 23
>>> megabytes per second *big frown, quivering lip, and tears in my eyes*.
>>> :P
>>>
>>
>> Oh, I should have clarified.  It improves performance under heavy
>> load.  Under normal load, mounting without sync is fine.  What I tend
>> to do is mount with "remount,rw,sync" when heavy load is starting.
>> Then my system goes slowly, but latency is good.  Then, when it's all
>> done (say a big compile, or job, or whatever), I remount without sync
>> again.
>>
>> I'm thinking of writing a script that monitors performance, and
>> remounts as needed, lol.  WHAT A HACK. hehe.
>
> All you're doing here is implementing the lowering of dirty data
> limits in the VM dynamically based on how long fsyncs take.
>
> Linus outlined this specific strategy as "the ideal situation"
> somewhere in the depths of "That filesystem thread".
>
> Look at the new in 2.6.29 dirty*bytes parameters in
> Documentation/sysctl/vm.txt for more info.  By lowering those values,
> you can effectively turn normal writes into synchronous writes which
> will greatly reduce latency of fsync under heavy write load.

WOW, that makes a huge difference.  If I set it to 100M, I get the
10-15 second delay I was talking about.  But, if I set it to 1M, I get
0.3 to 0.4 second delay on a 1M fsync.  That is way better.  Perhaps I
should auto-tune based on that parameter then.  Although I do agree
with Linus that it sucks to do userland auto-tuning. :P

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  2:19           ` David Rees
@ 2009-04-03  2:28             ` Trenton D. Adams
  2009-04-03  2:58               ` David Rees
  0 siblings, 1 reply; 45+ messages in thread
From: Trenton D. Adams @ 2009-04-03  2:28 UTC (permalink / raw)
  To: David Rees; +Cc: Christian Kujau, Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 8:19 PM, David Rees <drees76@gmail.com> wrote:
> On Thu, Apr 2, 2009 at 7:05 PM, Trenton D. Adams
> <trenton.d.adams@gmail.com> wrote:
>> On Thu, Apr 2, 2009 at 7:55 PM, David Rees <drees76@gmail.com> wrote:
>>> On Thu, Apr 2, 2009 at 5:28 PM, Trenton D. Adams
>>> <trenton.d.adams@gmail.com> wrote:
>> dirty_ratio and dirty_background_ratio never really had any effect for me.
>> I'll look into the other parameters.  Waiting for the checkout again,
>> as I am currently under a heavy rsync load (*rolls eyes*).
>
> How low have you set them?  Try setting them to 2 and 1 respectively.
> It cuts down fsync latencies by a significant amount in my experience.
>
> -Dave
>

That's the odd thing, I was setting them to 2 and 1.  I was just
looking at the 2.6.29 code, and it should have made a difference.  I
don't know what version of the kernel I was using at the time.  And,
I'm not sure if I had the 1M fsync tests in place at the time either,
to be sure about what I was testing.  It could be that I wasn't being
very scientific about it at the time.  Thanks though, that setting
makes a huge difference.

* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  2:05   ` Theodore Tso
@ 2009-04-03  2:45     ` Christian Kujau
  2009-04-03  2:49       ` Trenton D. Adams
  0 siblings, 1 reply; 45+ messages in thread
From: Christian Kujau @ 2009-04-03  2:45 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Artem Bityutskiy, Linux Kernel Mailing List

On Thu, 2 Apr 2009, Theodore Tso wrote:
> It will, but you might not like the performance....  the reason why
> it's there is that some users might want the particular tradeoff, but
> it probably wouldn't make a good default.

Thanks for confirming this. Yes, I know about the performance impact, but 
perhaps it's feasible for some setups.

Christian.

PS: I was curious *how* bad the impact was and so I tried generating a
    477 MB tarball, first on an async, then on a sync mounted partition:

   /dev/md0 /mnt/md0 ext4 rw,noatime,barrier=1,data=ordered
   $ time tar -cf /mnt/md0/test.tar /usr
   real	1m36.615s

   /dev/md0 /mnt/md0 ext4 rw,sync,noatime,barrier=1,data=ordered
   $ time tar -cf /mnt/md0/test.tar /usr
   real	5m23.793s
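
Working the two timings above into a single number: 1m36.615s is
96.615 seconds and 5m23.793s is 323.793 seconds, so the sync mount took
about 3.35 times as long for this particular tar run. The arithmetic
can be reproduced with the sketch below; to_seconds and slowdown are
hypothetical helper names.

```shell
#!/bin/sh
# Convert the "XmY.YYYs" format printed by the shell's time keyword
# into plain seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print how many times longer the second timing is than the first.
slowdown() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", b / a }'
}

slowdown 1m36.615s 5m23.793s   # prints 3.35
```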

-- 
Bruce Schneier does not get kidney stones. He gets Rosetta Stones.


* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  2:45     ` Christian Kujau
@ 2009-04-03  2:49       ` Trenton D. Adams
  0 siblings, 0 replies; 45+ messages in thread
From: Trenton D. Adams @ 2009-04-03  2:49 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Theodore Tso, Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 8:45 PM, Christian Kujau <lists@nerdbynature.de> wrote:
> On Thu, 2 Apr 2009, Theodore Tso wrote:
>> It will, but you might not like the performance....  the reason why
>> it's there is that some users might want the particular tradeoff, but
>> it probably wouldn't make a good default.
>
> Thanks for confirming this. Yes, I know about the performance impact, but
> perhaps it's feasible for some setups.
>
> Christian.
>
> PS: I was curious *how* bad the impact was and so I tried generating a
>    477 MB tarball, first on an async, then on a sync-mounted partition:
>
>   /dev/md0 /mnt/md0 ext4 rw,noatime,barrier=1,data=ordered
>   $ time tar -cf /mnt/md0/test.tar /usr
>   real 1m36.615s
>
>   /dev/md0 /mnt/md0 ext4 rw,sync,noatime,barrier=1,data=ordered
>   $ time tar -cf /mnt/md0/test.tar /usr
>   real 5m23.793s

lol, yep sounds about right.  Probably much worse on my machine, given
the disk speed is around 20-23M/sec.


* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  2:28             ` Trenton D. Adams
@ 2009-04-03  2:58               ` David Rees
  2009-04-03  3:13                 ` Trenton D. Adams
  2009-04-03  5:02                 ` Theodore Tso
  0 siblings, 2 replies; 45+ messages in thread
From: David Rees @ 2009-04-03  2:58 UTC (permalink / raw)
  To: Trenton D. Adams
  Cc: Christian Kujau, Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 7:28 PM, Trenton D. Adams
<trenton.d.adams@gmail.com> wrote:
> On Thu, Apr 2, 2009 at 8:19 PM, David Rees <drees76@gmail.com> wrote:
>> On Thu, Apr 2, 2009 at 7:05 PM, Trenton D. Adams <trenton.d.adams@gmail.com> wrote:
>>> On Thu, Apr 2, 2009 at 7:55 PM, David Rees <drees76@gmail.com> wrote:
>>>> On Thu, Apr 2, 2009 at 5:28 PM, Trenton D. Adams <trenton.d.adams@gmail.com> wrote:
>>> dirty_ratio and dirty_background never really had any effect for me.
>>> I'll look into the other parameters.  Waiting for the checkout again,
>>> as I am currently under a heavy rsync load (*rolls eyes*).
>>
>> How low have you set them?  Try setting them to 2 and 1 respectively.
>> It cuts down fsync latencies by a significant amount in my experience.
>
> That's the odd thing, I was setting them to 2 and 1.  I was just
> looking at the 2.6.29 code, and it should have made a difference.  I
> don't know what version of the kernel I was using at the time.  And,
> I'm not sure if I had the 1M fsync tests in place at the time either,
> to be sure about what I was testing.  It could be that I wasn't being
> very scientific about it at the time.  Thanks though, that setting
> makes a huge difference.

Well, it depends on how much memory you have.  Keep in mind that those
are percentages - so if you have 2GB RAM, that's the same as setting
it to 40MB and 20MB respectively - both are a lot larger than the 1M
you were setting the dirty*bytes vm knobs to.

I've got a problematic server with 8GB RAM.  Even if I set both to 1,
that's 80MB and the crappy disks I have in it will often only write
10-20MB/s or less due to the seekiness of the workload.  That means
delays of 5-10 seconds worst case which isn't fun.
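
To make the arithmetic above concrete, here is a rough Python sketch.
It treats the percentages as a fraction of total RAM and uses decimal
units, as the numbers above do; as far as I know the kernel applies
them to dirtyable memory rather than total RAM, so real thresholds
would be somewhat lower.  The function names are made up for
illustration.

```python
def dirty_threshold_mb(ram_gb, ratio_percent):
    """Approximate dirty threshold in MB for a percentage-based vm knob.

    Assumption: the percentage is taken of total RAM (1 GB = 1000 MB).
    """
    return ram_gb * 1000 * ratio_percent / 100


def worst_case_flush_seconds(dirty_mb, disk_mb_per_s):
    """Seconds needed to write out a full threshold at a sustained disk rate."""
    return dirty_mb / disk_mb_per_s


# 2GB box with dirty_ratio=2 and dirty_background_ratio=1
print(dirty_threshold_mb(2, 2), dirty_threshold_mb(2, 1))

# 8GB box with both knobs at 1 and disks that manage 10-20MB/s
threshold = dirty_threshold_mb(8, 1)
print(threshold,
      worst_case_flush_seconds(threshold, 20),
      worst_case_flush_seconds(threshold, 10))
```

At exactly 10-20MB/s this gives 4-8 seconds; the 5-10 second figure
allows for the disks doing worse than that on a seeky workload.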

-Dave


* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  2:58               ` David Rees
@ 2009-04-03  3:13                 ` Trenton D. Adams
  2009-04-03  3:14                   ` Trenton D. Adams
  2009-04-03  5:02                 ` Theodore Tso
  1 sibling, 1 reply; 45+ messages in thread
From: Trenton D. Adams @ 2009-04-03  3:13 UTC (permalink / raw)
  To: David Rees; +Cc: Christian Kujau, Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 8:58 PM, David Rees <drees76@gmail.com> wrote:
> On Thu, Apr 2, 2009 at 7:28 PM, Trenton D. Adams
>> That's the odd thing, I was setting them to 2 and 1.  I was just
>> looking at the 2.6.29 code, and it should have made a difference.  I
>> don't know what version of the kernel I was using at the time.  And,
>> I'm not sure if I had the 1M fsync tests in place at the time either,
>> to be sure about what I was testing.  It could be that I wasn't being
>> very scientific about it at the time.  Thanks though, that setting
>> makes a huge difference.
>
> Well, it depends on how much memory you have.  Keep in mind that those
> are percentages - so if you have 2GB RAM, that's the same as setting
> it to 40MB and 20MB respectively - both are a lot larger than the 1M
> you were setting the dirty*bytes vm knobs to.
>
> I've got a problematic server with 8GB RAM.  Even if set both to 1,
> that's 80MB and the crappy disks I have in it will often only write
> 10-20MB/s or less due to the seekiness of the workload.  That means
> delays of 5-10 seconds worst case which isn't fun.
>
> -Dave
>

Yeah, I just finished doing the calculation. :P  40M is what I'm
seeing.  Yeah, that sounds like the same as my problem.  Even setting
it to 10M dirty_bytes has a very serious latency problem.  I'm glad
that option was added, because 1M works much better.  I'll have to
change my shell script to tune that dynamically, because under
normal load, I want the 40M+ of queueing.  It's just when things get
really heavy, and stuff starts getting flushed, that this problem
starts happening.
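
A minimal sketch of that kind of dynamic tuning, in Python rather than
shell.  The 40M and 1M values are the ones mentioned above; how "heavy
load" is detected is left to the caller, and the helper names are
hypothetical.  Writing the real knob needs root, and writing
vm.dirty_bytes overrides vm.dirty_ratio, since the two are
counterparts.

```python
NORMAL_DIRTY_BYTES = 40 * 1024 * 1024  # roughly the 40M of queueing wanted normally
HEAVY_DIRTY_BYTES = 1 * 1024 * 1024    # 1M, which behaved better under heavy flushing


def pick_dirty_bytes(heavy_load):
    """Choose a dirty_bytes value; the caller decides what counts as heavy load."""
    return HEAVY_DIRTY_BYTES if heavy_load else NORMAL_DIRTY_BYTES


def apply_dirty_bytes(value, knob="/proc/sys/vm/dirty_bytes"):
    """Write the value to the sysctl file (the real knob requires root)."""
    with open(knob, "w") as f:
        f.write(f"{value}\n")
```

Something like apply_dirty_bytes(pick_dirty_bytes(heavy_load=True))
would then be run when the rsync or checkout starts, and again with
heavy_load=False afterwards.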


* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  3:13                 ` Trenton D. Adams
@ 2009-04-03  3:14                   ` Trenton D. Adams
  0 siblings, 0 replies; 45+ messages in thread
From: Trenton D. Adams @ 2009-04-03  3:14 UTC (permalink / raw)
  To: David Rees; +Cc: Christian Kujau, Artem Bityutskiy, Linux Kernel Mailing List

I'm really sorry, I just realized I hijacked this thread.  I'll stop now.

On Thu, Apr 2, 2009 at 9:13 PM, Trenton D. Adams
<trenton.d.adams@gmail.com> wrote:
> On Thu, Apr 2, 2009 at 8:58 PM, David Rees <drees76@gmail.com> wrote:
>> On Thu, Apr 2, 2009 at 7:28 PM, Trenton D. Adams
>>> That's the odd thing, I was setting them to 2 and 1.  I was just
>>> looking at the 2.6.29 code, and it should have made a difference.  I
>>> don't know what version of the kernel I was using at the time.  And,
>>> I'm not sure if I had the 1M fsync tests in place at the time either,
>>> to be sure about what I was testing.  It could be that I wasn't being
>>> very scientific about it at the time.  Thanks though, that setting
>>> makes a huge difference.
>>
>> Well, it depends on how much memory you have.  Keep in mind that those
>> are percentages - so if you have 2GB RAM, that's the same as setting
>> it to 40MB and 20MB respectively - both are a lot larger than the 1M
>> you were setting the dirty*bytes vm knobs to.
>>
>> I've got a problematic server with 8GB RAM.  Even if set both to 1,
>> that's 80MB and the crappy disks I have in it will often only write
>> 10-20MB/s or less due to the seekiness of the workload.  That means
>> delays of 5-10 seconds worst case which isn't fun.
>>
>> -Dave
>>
>
> Yeah, I just finished doing the calculation. :P  40M is what I'm
> seeing.  Yeah, that sounds like the same as my problem.  Even setting
> it to 10M dirty_bytes has a very serious latency problem.  I'm glad
> that option was added, because 1M works much better.  I'll have to
> change my shell script to dynamically tune on that.  Because under
> normal load, I want the 40M+ of queueing.  It's just when things get
> really heavy, and stuff starts getting flushed, that this problem
> starts happening.
>


* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  2:58               ` David Rees
  2009-04-03  3:13                 ` Trenton D. Adams
@ 2009-04-03  5:02                 ` Theodore Tso
  2009-04-03  5:15                   ` Trenton D. Adams
                                     ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Theodore Tso @ 2009-04-03  5:02 UTC (permalink / raw)
  To: David Rees
  Cc: Trenton D. Adams, Christian Kujau, Artem Bityutskiy,
	Linux Kernel Mailing List

On Thu, Apr 02, 2009 at 07:58:17PM -0700, David Rees wrote:
> 
> I've got a problematic server with 8GB RAM.  Even if set both to 1,
> that's 80MB and the crappy disks I have in it will often only write
> 10-20MB/s or less due to the seekiness of the workload.  That means
> delays of 5-10 seconds worst case which isn't fun.
> 

Well, one solution is data=writeback.  If you're confident your server
isn't going to randomly crash (i.e., it's on a UPS, and you're not
running unstable video drivers), that might be a solution.  It has
tradeoffs, though.

One thing which I'll probably implement is some patches to ext3 so
that when it's in data=writeback mode, it will use the same
replace-via-rename and replace-via-truncate heuristics that I added in
ext4 so that it will start an asynchronous writeout on the rename() or
close() w/ truncate().  That should avoid existing files getting
corrupted when they are replaced right before the system crashes.  

People will still be better off moving to ext4, but for people who
aren't quite confident in ext4's stability yet and who want to stick
with ext3, maybe it's a good short-term solution.  Maybe
data=writeback with the rename heuristic would be a better default
than data=ordered for ext3.

						- Ted


* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  5:02                 ` Theodore Tso
@ 2009-04-03  5:15                   ` Trenton D. Adams
  2009-04-03  6:30                     ` Theodore Tso
  2009-04-03 18:05                   ` David Rees
  2009-04-09 20:17                   ` Pavel Machek
  2 siblings, 1 reply; 45+ messages in thread
From: Trenton D. Adams @ 2009-04-03  5:15 UTC (permalink / raw)
  To: Theodore Tso, David Rees, Trenton D. Adams, Christian Kujau,
	Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 11:02 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Thu, Apr 02, 2009 at 07:58:17PM -0700, David Rees wrote:
>>
>> I've got a problematic server with 8GB RAM.  Even if set both to 1,
>> that's 80MB and the crappy disks I have in it will often only write
>> 10-20MB/s or less due to the seekiness of the workload.  That means
>> delays of 5-10 seconds worst case which isn't fun.
>>
>
> People will still be better off moving to ext4, but for people who
> aren't quite confident in ext4's stability yet and who want to stick
> with ext3, maybe it's a good short-term solution.  Maybe
> data=writeback with the rename heuristic would be a better default
> than data=ordered for ext3.
>
>                                                - Ted
>

I've tried that before...

tdamac ~ # mount -t ext3 -o data=writeback,remount,rw /dev/s/sys /
mount: / not mounted already, or bad option

Does it have to be done on initial mount?


* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  5:15                   ` Trenton D. Adams
@ 2009-04-03  6:30                     ` Theodore Tso
  2009-04-03 18:53                       ` Chris Adams
  0 siblings, 1 reply; 45+ messages in thread
From: Theodore Tso @ 2009-04-03  6:30 UTC (permalink / raw)
  To: Trenton D. Adams
  Cc: David Rees, Christian Kujau, Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 02, 2009 at 11:15:29PM -0600, Trenton D. Adams wrote:
> 
> tdamac ~ # mount -t ext3 -o data=writeback,remount,rw /dev/s/sys /
> mount: / not mounted already, or bad option
> 
> Does it have to be done on initial mount?

Yes, which means you have to use the rootflags boot command-line
option.

It's a pain that we can't switch data= modes on the fly.  I believe
the problematic transitions are between data=journal and
data=!journal.  Transitions between data=ordered and data=writeback
should be easy to add.

						- Ted





* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  0:09 ` EXT4-ish "fixes" in UBIFS Christian Kujau
  2009-04-03  0:24   ` Trenton D. Adams
  2009-04-03  2:05   ` Theodore Tso
@ 2009-04-03  6:53   ` Artem Bityutskiy
  2 siblings, 0 replies; 45+ messages in thread
From: Artem Bityutskiy @ 2009-04-03  6:53 UTC (permalink / raw)
  To: ext Christian Kujau; +Cc: Linux Kernel Mailing List

ext Christian Kujau wrote:
> On Fri, 27 Mar 2009, Artem Bityutskiy wrote:
>> They just say - this is file-system bug, it is fixed in
>> ext4 now, just fix the bug in UBIFS.
> 
> Would *mounting* the filesystem with "-o sync" help? This way no 
> filesystem "fixes" are needed and userland would not have to be rewritten.

It would, but the overall FS performance would suffer a lot too.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  5:02                 ` Theodore Tso
  2009-04-03  5:15                   ` Trenton D. Adams
@ 2009-04-03 18:05                   ` David Rees
  2009-04-09 20:17                   ` Pavel Machek
  2 siblings, 0 replies; 45+ messages in thread
From: David Rees @ 2009-04-03 18:05 UTC (permalink / raw)
  To: Theodore Tso, David Rees, Trenton D. Adams, Christian Kujau,
	Artem Bityutskiy, Linux Kernel Mailing List

On Thu, Apr 2, 2009 at 10:02 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Thu, Apr 02, 2009 at 07:58:17PM -0700, David Rees wrote:
>>
>> I've got a problematic server with 8GB RAM.  Even if set both to 1,
>> that's 80MB and the crappy disks I have in it will often only write
>> 10-20MB/s or less due to the seekiness of the workload.  That means
>> delays of 5-10 seconds worst case which isn't fun.
>
> Well, one solution is data=writeback.  If you're confident your server
> isn't going to randomly crash (i.e., it's on a UPS, and you're not
> running unstable video drivers), that might be a solution.  It has
> tradeoffs, though.

Yeah, that's probably a good workaround for the server in question.  I
don't recall it ever crashing.

> One thing which I'll probably implement is some patches to ext3 so
> that when it's in data=writeback mode, it will use the same
> replace-via-rename and replace-via-truncate heuristics that I added in
> ext4 so that it will start an asynchronous writeout on the rename() or
> close() w/ truncate().  That should avoid existing files getting
> corrupted when they are replaced right before the system crashes.

I think that would be a welcome addition to the writeback mode of ext3.

> People will still be better off moving to ext4, but for people who
> aren't quite confident in ext4's stability yet and who want to stick
> with ext3, maybe it's a good short-term solution.  Maybe
> data=writeback with the rename heuristic would be a better default
> than data=ordered for ext3.

I've been waiting for Fedora to ship either the latest stable 2.6.28
or 2.6.29 kernel before putting any serious data on ext4 - from what
I've seen it seems like those kernels should have the vast majority of
stability bugs fixed in them.  Last I remember reading, 2.6.27
doesn't quite have all the fixes due to difficulties in backporting
them to that kernel.

-Dave


* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  6:30                     ` Theodore Tso
@ 2009-04-03 18:53                       ` Chris Adams
  0 siblings, 0 replies; 45+ messages in thread
From: Chris Adams @ 2009-04-03 18:53 UTC (permalink / raw)
  To: linux-kernel

Once upon a time, Theodore Tso  <tytso@mit.edu> said:
>On Thu, Apr 02, 2009 at 11:15:29PM -0600, Trenton D. Adams wrote:
>> 
>> tdamac ~ # mount -t ext3 -o data=writeback,remount,rw /dev/s/sys /
>> mount: / not mounted already, or bad option
>> 
>> Does it have to be done on initial mount?
>
>Yes, which means you have to use the rootflags boot command-line
>option.

Can't you also set this in the superblock options (e.g. "tune2fs -o
+journal_data_writeback /dev/sda1")?

-- 
Chris Adams <cmadams@hiwaay.net>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.


* Re: EXT4-ish "fixes" in UBIFS
  2009-04-03  5:02                 ` Theodore Tso
  2009-04-03  5:15                   ` Trenton D. Adams
  2009-04-03 18:05                   ` David Rees
@ 2009-04-09 20:17                   ` Pavel Machek
  2 siblings, 0 replies; 45+ messages in thread
From: Pavel Machek @ 2009-04-09 20:17 UTC (permalink / raw)
  To: Theodore Tso, David Rees, Trenton D. Adams, Christian Kujau,
	Artem Bityutskiy, Linux Kernel Mailing List

Hi!

> > I've got a problematic server with 8GB RAM.  Even if set both to 1,
> > that's 80MB and the crappy disks I have in it will often only write
> > 10-20MB/s or less due to the seekiness of the workload.  That means
> > delays of 5-10 seconds worst case which isn't fun.
> > 
> 
> Well, one solution is data=writeback.  If you're confident your server
> isn't going to randomly crash (i.e., it's on a UPS, and you're not
> running unstable video drivers), that might be a solution.  It has
> tradeoffs, though.
> 
> One thing which I'll probably implement is some patches to ext3 so
> that when it's in data=writeback mode, it will use the same
> replace-via-rename and replace-via-truncate heuristics that I added in
> ext4 so that it will start an asynchronous writeout on the rename() or
> close() w/ truncate().  That should avoid existing files getting
> corrupted when they are replaced right before the system crashes.  

Truncate case is unfixable, but would it be possible to only do rename
after data are on disk? Because async writeout only makes catastrophic
data loss 'less probable'...
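
For comparison, the userspace way to get "rename only after data are
on disk" is to fsync() the new file before the rename.  A minimal
Python sketch, assuming Linux/POSIX semantics; the function name is
made up for illustration and this is not what any of the filesystem
patches do internally:

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Write data to a temp file, fsync it, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so rename() stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # new data reaches the disk before the rename
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        # Clean up the temp file if anything failed before the rename completed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is made durable as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The intent is that after a crash the name refers to either the old
contents or the complete new contents, at the cost of the fsync()
latency discussed earlier in the thread.  Note that mkstemp() creates
the file with mode 0600, so a real tool would also need to carry over
the old file's permissions.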
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


end of thread, other threads:[~2009-04-09 20:17 UTC | newest]

Thread overview: 45+ messages
2009-03-27 12:48 EXT4-ish "fixes" in UBIFS Artem Bityutskiy
2009-03-28  1:22 ` Kyungmin Park
2009-03-29 12:31   ` Artem Bityutskiy
2009-03-29 12:54     ` Artem Bityutskiy
2009-03-29 12:26 ` replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS) Pavel Machek
2009-03-29 12:42   ` Artem Bityutskiy
2009-03-29 12:50     ` Pavel Machek
2009-03-29 13:00       ` Artem Bityutskiy
2009-03-29 13:02         ` Pavel Machek
2009-03-29 13:07           ` Artem Bityutskiy
2009-03-29 13:22             ` Andreas T.Auer
2009-03-29 13:55               ` Artem Bityutskiy
2009-03-29 13:40             ` Pavel Machek
2009-03-29 13:57               ` Artem Bityutskiy
2009-03-29 14:00                 ` Pavel Machek
2009-03-30 17:19       ` Ric Wheeler
2009-03-30 22:11         ` Pavel Machek
2009-03-29 13:01     ` Andreas T.Auer
2009-03-29 13:06       ` Artem Bityutskiy
2009-03-30 15:58   ` Diego Calleja
2009-04-03  0:09 ` EXT4-ish "fixes" in UBIFS Christian Kujau
2009-04-03  0:24   ` Trenton D. Adams
2009-04-03  0:28     ` Trenton D. Adams
2009-04-03  0:38       ` Christian Kujau
2009-04-03  0:54         ` Trenton D. Adams
2009-04-03  0:54         ` Trenton D. Adams
2009-04-03  0:59           ` Trenton D. Adams
2009-04-03  1:55       ` David Rees
2009-04-03  2:05         ` Trenton D. Adams
2009-04-03  2:19           ` David Rees
2009-04-03  2:28             ` Trenton D. Adams
2009-04-03  2:58               ` David Rees
2009-04-03  3:13                 ` Trenton D. Adams
2009-04-03  3:14                   ` Trenton D. Adams
2009-04-03  5:02                 ` Theodore Tso
2009-04-03  5:15                   ` Trenton D. Adams
2009-04-03  6:30                     ` Theodore Tso
2009-04-03 18:53                       ` Chris Adams
2009-04-03 18:05                   ` David Rees
2009-04-09 20:17                   ` Pavel Machek
2009-04-03  2:26         ` Trenton D. Adams
2009-04-03  2:05   ` Theodore Tso
2009-04-03  2:45     ` Christian Kujau
2009-04-03  2:49       ` Trenton D. Adams
2009-04-03  6:53   ` Artem Bityutskiy
