* Rename+crash behaviour of btrfs - nearly ext3!
@ 2010-05-17 18:04 Jakob Unterwurzacher
  2010-05-17 19:12 ` Ric Wheeler
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Jakob Unterwurzacher @ 2010-05-17 18:04 UTC (permalink / raw)
  To: linux-btrfs


Hi!

Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
solve them all. And it nearly does! Now I wonder whether the remaining
0.2-second window that exposes 0-size files could be closed too.

I tested using two simple scripts (attached for reference) on kernel
2.6.34-rc7:
- rentest creates files $i.tmp and renames to $i.cur,
- owtest does the same but overwrites existing $i.cur files,
letting them run for 30-50 seconds then resetting the virtual machine.

The results for ext3 are as expected: 0-size files are never exposed as
$i.cur, overwrites are atomic.

ext4 overwrites are /almost/ atomic (I get one 0-size file in owtest), but
lots of 0-size files are exposed in rentest (30-second window).

btrfs *nearly* does as well as ext3. Overwrites are atomic.

The rentest exposes only a 0.2-second window of 0-size $i.cur files,
so that a "ls --full-time" after the crash looks like this (notice the
time between 01281.cur and 01292.tmp, only 0.2 seconds):
[...]
-rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
-rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
-rw-r--r-- 1 root root  0 2010-05-17 17:06:25.868035485 +0200 01282.cur
[...]
-rw-r--r-- 1 root root  0 2010-05-17 17:06:26.080003626 +0200 01291.cur
-rw-rw-rw- 1 root root  0 2010-05-17 17:06:26.108010083 +0200 01292.tmp
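
As a quick check of the window size, the two timestamps can simply be
subtracted. This is only a sketch: window_seconds is a hypothetical helper
(not part of the attached scripts), and it assumes both stamps fall on the
same day.

```shell
# window_seconds START END
# Both arguments are time-of-day stamps (HH:MM:SS.NNNNNNNNN) as printed by
# "ls --full-time"; they are assumed to fall on the same day.
window_seconds() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, ":"); split(b, y, ":")
        t0 = x[1] * 3600 + x[2] * 60 + x[3]
        t1 = y[1] * 3600 + y[2] * 60 + y[3]
        printf "%.3f\n", t1 - t0
    }'
}

# Last complete file (01281.cur) to the file being written at reset (01292.tmp)
window_seconds 17:06:25.835999490 17:06:26.108010083
```

For the listing above this prints 0.272, i.e. the window is a bit under 0.3
seconds.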


Finally, xfs kills lots of existing files in owtest and exposes lots of
0-size files in rentest (a 40-second window in both).

If anybody is interested, the bunch of trimmed "ls --full-time" output
for all filesystems is attached.
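
For anyone repeating the experiment, the post-crash check boils down to
counting how many *.cur files came back empty. A rough sketch follows;
check_cur is a made-up helper name, not one of the attached scripts, and it
assumes a find that supports -maxdepth (GNU, BSD and busybox find do).

```shell
# check_cur DIR
# After the VM has been reset and rebooted, report how many $i.cur files
# exist in DIR and how many of them are 0 bytes long.
check_cur() {
    local dir=$1 total zero
    total=$(find "$dir" -maxdepth 1 -type f -name '*.cur' | wc -l)
    # -size 0 matches only files that occupy zero 512-byte units, i.e. empty ones
    zero=$(find "$dir" -maxdepth 1 -type f -name '*.cur' -size 0 | wc -l)
    # $(( )) strips any padding that wc may add around the number
    echo "total=$((total)) zero=$((zero))"
}
```

Running "check_cur rentest-tmp" after the reboot gives one line per test
directory; any non-zero "zero=" count means empty files were exposed under
their final name.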


Thanks,
Jakob

[-- Attachment #2: rentest --]
[-- Type: text/plain, Size: 204 bytes --]

#!/bin/bash

set -eu

cd rentest-tmp
rm -Rf *
sync

echo "$0: Running rename loop"
( sleep 30s ; echo "$0: 30s. " )&
for i in `seq -w 1 10000`
do
	echo "File content $i." > $i.tmp
	mv $i.tmp $i.cur
done


[-- Attachment #3: owtest --]
[-- Type: text/plain, Size: 329 bytes --]

#!/bin/bash

set -eu

cd owtest-tmp
rm -Rf *
sync

echo "$0: Running create loop"

for i in `seq -w 1 10000`
do
	echo "old file content" > $i.cur
done
sync

echo "$0: Running overwrite loop"

( sleep 30s ; echo "$0: 30s" )&
for i in `seq -w 1 10000`
do
	echo "new file content of different size" > $i.tmp
	mv $i.tmp $i.cur
done


[-- Attachment #4: results.txt --]
[-- Type: text/plain, Size: 5295 bytes --]

Linux lucid-crash-burn 2.6.34-020634rc7-generic #020634rc7 SMP Mon May 10 10:08:20 UTC 2010 i686 GNU/Linux

########
ext3 data=ordered
########

********
rentest
********
-rw-r--r-- 1 root root 20 2010-05-17 19:05:05.000000000 +0200 00001.cur
[...]
-rw-r--r-- 1 root root 20 2010-05-17 19:05:50.000000000 +0200 02013.cur

********
owtest
********
-rw-r--r-- 1 root root 35 2010-05-17 19:05:04.000000000 +0200 00001.cur
[...]
-rw-r--r-- 1 root root 35 2010-05-17 19:05:50.000000000 +0200 02027.cur
-rw-r--r-- 1 root root 35 2010-05-17 19:05:50.000000000 +0200 02028.cur
-rw-r--r-- 1 root root 35 2010-05-17 19:05:50.000000000 +0200 02029.cur
-rw-r--r-- 1 root root 17 2010-05-17 19:05:03.000000000 +0200 02030.cur
-rw-r--r-- 1 root root 35 2010-05-17 19:05:50.000000000 +0200 02030.tmp
-rw-r--r-- 1 root root 17 2010-05-17 19:05:03.000000000 +0200 02031.cur
-rw-r--r-- 1 root root 17 2010-05-17 19:05:03.000000000 +0200 02032.cur
-rw-r--r-- 1 root root 17 2010-05-17 19:05:03.000000000 +0200 02033.cur
[...]
-rw-r--r-- 1 root root 17 2010-05-17 19:05:04.000000000 +0200 10000.cur



########
ext4 data=ordered
########

********
rentest
********
-rw-r--r-- 1 root root 20 2010-05-17 17:05:56.000000000 +0200 00001.cur
[...]
-rw-r--r-- 1 root root 20 2010-05-17 17:06:06.000000000 +0200 00429.cur
-rw-r--r-- 1 root root 20 2010-05-17 17:06:06.000000000 +0200 00430.cur
-rw-r--r-- 1 root root 20 2010-05-17 17:06:06.000000000 +0200 00431.cur
-rw-r--r-- 1 root root  0 2010-05-17 17:06:06.000000000 +0200 00432.cur
-rw-r--r-- 1 root root  0 2010-05-17 17:06:06.000000000 +0200 00433.cur
-rw-r--r-- 1 root root  0 2010-05-17 17:06:06.000000000 +0200 00434.cur
[...]
-rw-r--r-- 1 root root  0 2010-05-17 17:06:36.000000000 +0200 01748.cur

********
owtest
********
-rw-r--r-- 1 root root 35 2010-05-17 16:47:25.000000000 +0200 00001.cur
[...]
-rw-r--r-- 1 root root 35 2010-05-17 16:48:01.000000000 +0200 01451.cur
-rw-r--r-- 1 root root 35 2010-05-17 16:48:01.000000000 +0200 01452.cur
-rw-r--r-- 1 root root 35 2010-05-17 16:48:01.000000000 +0200 01453.cur
-rw-r--r-- 1 root root  0 2010-05-17 16:48:01.000000000 +0200 01454.cur
-rw-r--r-- 1 root root 17 2010-05-17 16:47:22.000000000 +0200 01455.cur
-rw-r--r-- 1 root root 17 2010-05-17 16:47:22.000000000 +0200 01456.cur
-rw-r--r-- 1 root root 17 2010-05-17 16:47:22.000000000 +0200 01457.cur
[...]
-rw-r--r-- 1 root root 17 2010-05-17 16:47:24.000000000 +0200 10000.cur



########
btrfs
########

********
rentest
********
-rw-r--r-- 1 root root 20 2010-05-17 17:05:56.260291651 +0200 00001.cur
[...]
-rw-r--r-- 1 root root 20 2010-05-17 17:06:25.788083319 +0200 01279.cur
-rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
-rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
-rw-r--r-- 1 root root  0 2010-05-17 17:06:25.868035485 +0200 01282.cur
-rw-r--r-- 1 root root  0 2010-05-17 17:06:25.887995591 +0200 01283.cur
-rw-r--r-- 1 root root  0 2010-05-17 17:06:25.912002104 +0200 01284.cur
[...]
-rw-r--r-- 1 root root  0 2010-05-17 17:06:26.080003626 +0200 01291.cur
-rw-rw-rw- 1 root root  0 2010-05-17 17:06:26.108010083 +0200 01292.tmp

********
owtest
********
-rw-r--r-- 1 root root 35 2010-05-17 16:47:26.310637092 +0200 00001.cur
[...]
-rw-r--r-- 1 root root 35 2010-05-17 16:47:56.770578127 +0200 01226.cur
-rw-r--r-- 1 root root 35 2010-05-17 16:47:56.794580123 +0200 01227.cur
-rw-r--r-- 1 root root 35 2010-05-17 16:47:56.818574731 +0200 01228.cur
-rw-r--r-- 1 root root 17 2010-05-17 16:47:22.122599650 +0200 01229.cur
-rw-rw-rw- 1 root root  0 2010-05-17 16:47:56.842574557 +0200 01229.new
-rw-r--r-- 1 root root 17 2010-05-17 16:47:22.122599650 +0200 01230.cur
-rw-r--r-- 1 root root 17 2010-05-17 16:47:22.122599650 +0200 01231.cur
-rw-r--r-- 1 root root 17 2010-05-17 16:47:22.122599650 +0200 01232.cur
[...]
-rw-r--r-- 1 root root 17 2010-05-17 16:47:25.906584357 +0200 10000.cur



########
xfs
########

********
rentest
********
-rw-r--r-- 1 root root 20 2010-05-17 17:45:29.856656127 +0200 00001.cur
[...]
-rw-r--r-- 1 root root 20 2010-05-17 17:45:39.836115618 +0200 00346.cur
-rw-r--r-- 1 root root 20 2010-05-17 17:45:39.860120683 +0200 00347.cur
-rw-r--r-- 1 root root 20 2010-05-17 17:45:39.892119985 +0200 00348.cur
-rw-r--r-- 1 root root  0 2010-05-17 17:45:39.928119346 +0200 00349.cur
-rw-r--r-- 1 root root  0 2010-05-17 17:45:39.964136036 +0200 00350.cur
-rw-r--r-- 1 root root  0 2010-05-17 17:45:39.992118460 +0200 00351.cur
[...]
-rw-r--r-- 1 root root  0 2010-05-17 17:46:19.036119284 +0200 01657.cur

********
owtest
********
-rw-r--r-- 1 root root 35 2010-05-17 18:02:00.046591863 +0200 00001.cur
[...]
-rw-r--r-- 1 root root 35 2010-05-17 18:02:04.886617159 +0200 00193.cur
-rw-r--r-- 1 root root 35 2010-05-17 18:02:04.906607613 +0200 00194.cur
-rw-r--r-- 1 root root 35 2010-05-17 18:02:04.930592724 +0200 00195.cur
-rw-r--r-- 1 root root  0 2010-05-17 18:02:04.958586922 +0200 00196.cur
-rw-r--r-- 1 root root  0 2010-05-17 18:02:04.982590376 +0200 00197.cur
-rw-r--r-- 1 root root  0 2010-05-17 18:02:05.006606400 +0200 00198.cur
[...]
-rw-r--r-- 1 root root  0 2010-05-17 18:02:44.878590296 +0200 01775.cur
-rw-r--r-- 1 root root 17 2010-05-17 18:01:46.674588629 +0200 01776.cur
[...]
-rw-r--r-- 1 root root 17 2010-05-17 18:01:53.202593995 +0200 10000.cur


* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-17 18:04 Rename+crash behaviour of btrfs - nearly ext3! Jakob Unterwurzacher
@ 2010-05-17 19:12 ` Ric Wheeler
  2010-05-17 19:25 ` Josef Bacik
  2010-05-17 19:36 ` Chris Mason
  2 siblings, 0 replies; 23+ messages in thread
From: Ric Wheeler @ 2010-05-17 19:12 UTC (permalink / raw)
  To: Jakob Unterwurzacher; +Cc: linux-btrfs

On 05/17/2010 02:04 PM, Jakob Unterwurzacher wrote:
> Hi!
>
> Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
> solve them all. And it nearly does! Now I wonder if the remaining 0.2
> seconds window of exposing 0-size files could be closed too.
>    

Nearly does not seem that reassuring. What would happen if the server 
was under an intense load, swapping away crazily and running multiple 
writers to that same file system?

ric

> I tested using two simple scripts (attached for reference) on kernel
> 2.6.34-rc7:
> - rentest creates files $i.tmp and renames to $i.cur,
> - owtest does the same but overwrites existing $i.cur files,
> letting them run for 30-50 seconds then resetting the virtual machine.
>
> The results for ext3 are as expected: 0-size files are never exposed as
> $i.cur, overwrites are atomic.
>
> ext4 overwrites are /almost/ atomic (I get one 0-size file in owtest),
> lots of 0-size files are exposed in rentest (30 seconds window).
>
> btrfs *nearly* does as well as ext3. Overwrites are atomic.
>
> The rentest exposes only a 0.2-second window of 0-size $i.cur files,
> so that a "ls --full-time" after the crash looks like this (notice the
> time between 01281.cur and 01292.tmp, only 0.2 seconds):
> [...]
> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
> -rw-r--r-- 1 root root  0 2010-05-17 17:06:25.868035485 +0200 01282.cur
> [...]
> -rw-r--r-- 1 root root  0 2010-05-17 17:06:26.080003626 +0200 01291.cur
> -rw-rw-rw- 1 root root  0 2010-05-17 17:06:26.108010083 +0200 01292.tmp
>
>
> Finally, xfs kills lots of existing files in owtest and exposes lots of
> 0-size files in rentest (both 40 seconds window).
>
> If anybody is interested, the bunch of trimmed "ls --full-time" output
> for all filesystems is attached.
>
>
> Thanks,
> Jakob
>    



* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-17 18:04 Rename+crash behaviour of btrfs - nearly ext3! Jakob Unterwurzacher
  2010-05-17 19:12 ` Ric Wheeler
@ 2010-05-17 19:25 ` Josef Bacik
  2010-05-17 20:09   ` Chris Mason
  2010-05-17 19:36 ` Chris Mason
  2 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2010-05-17 19:25 UTC (permalink / raw)
  To: Jakob Unterwurzacher; +Cc: linux-btrfs

On Mon, May 17, 2010 at 08:04:21PM +0200, Jakob Unterwurzacher wrote:
> Hi!
> 
> Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
> solve them all. And it nearly does! Now I wonder if the remaining 0.2
> seconds window of exposing 0-size files could be closed too.
> 
> I tested using two simple scripts (attached for reference) on kernel
> 2.6.34-rc7:
> - rentest creates files $i.tmp and renames to $i.cur,
> - owtest does the same but overwrites existing $i.cur files,
> letting them run for 30-50 seconds then resetting the virtual machine.
> 
> The results for ext3 are as expected: 0-size files are never exposed as
> $i.cur, overwrites are atomic.
> 
> ext4 overwrites are /almost/ atomic (I get one 0-size file in owtest),
> lots of 0-size files are exposed in rentest (30 seconds window).
> 
> btrfs *nearly* does as well as ext3. Overwrites are atomic.
> 
> The rentest exposes only a 0.2-second window of 0-size $i.cur files,
> so that a "ls --full-time" after the crash looks like this (notice the
> time between 01281.cur and 01292.tmp, only 0.2 seconds):
> [...]
> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
> -rw-r--r-- 1 root root  0 2010-05-17 17:06:25.868035485 +0200 01282.cur
> [...]
> -rw-r--r-- 1 root root  0 2010-05-17 17:06:26.080003626 +0200 01291.cur
> -rw-rw-rw- 1 root root  0 2010-05-17 17:06:26.108010083 +0200 01292.tmp
> 

This isn't actually true.  There is no window: the inode isn't written to disk
until all of the data is flushed to disk.  So the in-memory inode will be
updated, and will therefore show an i_size of 0 since the I/O hasn't finished,
but if you were to crash at this point, when you came back up you'd have the
old data in place because the new inode data wasn't written to disk.  I have a
feeling ext4 is the same way, but I'd have to check to be sure.  Thanks,

Josef


* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-17 18:04 Rename+crash behaviour of btrfs - nearly ext3! Jakob Unterwurzacher
  2010-05-17 19:12 ` Ric Wheeler
  2010-05-17 19:25 ` Josef Bacik
@ 2010-05-17 19:36 ` Chris Mason
  2010-05-18  0:14   ` Jakob Unterwurzacher
  2 siblings, 1 reply; 23+ messages in thread
From: Chris Mason @ 2010-05-17 19:36 UTC (permalink / raw)
  To: Jakob Unterwurzacher; +Cc: linux-btrfs

On Mon, May 17, 2010 at 08:04:21PM +0200, Jakob Unterwurzacher wrote:
> Hi!
> 
> Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
> solve them all. And it nearly does! Now I wonder if the remaining 0.2
> seconds window of exposing 0-size files could be closed too.

That should be a zero second window, we try to force things to disk
during renames.

Could you please try this patch:

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index c9f1020..9370a71 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -806,7 +806,7 @@ int btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
 	 * if this file hasn't been changed since the last transaction
 	 * commit, we can safely return without doing anything
 	 */
-	if (last_mod < root->fs_info->last_trans_committed)
+	if (0 && last_mod < root->fs_info->last_trans_committed)
 		return 0;
 
 	/*


* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-17 19:25 ` Josef Bacik
@ 2010-05-17 20:09   ` Chris Mason
  2010-05-17 20:30     ` Jakob Unterwurzacher
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Mason @ 2010-05-17 20:09 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Jakob Unterwurzacher, linux-btrfs

On Mon, May 17, 2010 at 03:25:54PM -0400, Josef Bacik wrote:
> On Mon, May 17, 2010 at 08:04:21PM +0200, Jakob Unterwurzacher wrote:
> > Hi!
> > 
> > Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
> > solve them all. And it nearly does! Now I wonder if the remaining 0.2
> > seconds window of exposing 0-size files could be closed too.
> > 
> > I tested using two simple scripts (attached for reference) on kernel
> > 2.6.34-rc7:
> > - rentest creates files $i.tmp and renames to $i.cur,
> > - owtest does the same but overwrites existing $i.cur files,
> > letting them run for 30-50 seconds then resetting the virtual machine.
> > 
> > The results for ext3 are as expected: 0-size files are never exposed as
> > $i.cur, overwrites are atomic.
> > 
> > ext4 overwrites are /almost/ atomic (I get one 0-size file in owtest),
> > lots of 0-size files are exposed in rentest (30 seconds window).
> > 
> > btrfs *nearly* does as well as ext3. Overwrites are atomic.
> > 
> > The rentest exposes only a 0.2-second window of 0-size $i.cur files,
> > so that a "ls --full-time" after the crash looks like this (notice the
> > time between 01281.cur and 01292.tmp, only 0.2 seconds):
> > [...]
> > -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
> > -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
> > -rw-r--r-- 1 root root  0 2010-05-17 17:06:25.868035485 +0200 01282.cur
> > [...]
> > -rw-r--r-- 1 root root  0 2010-05-17 17:06:26.080003626 +0200 01291.cur
> > -rw-rw-rw- 1 root root  0 2010-05-17 17:06:26.108010083 +0200 01292.tmp
> > 
> 
> This isn't actually true.  There is no window, the inode isn't written to disk
> until all of the data is flushed to disk.  So the in memory inode will be
> updated, and therefore show an i_size of 0 since the io hasn't finished, but if
> you were to crash at this point, when you came back up you'd have the old data
> in place because the new inode data wasn't written to disk.  I have a feeling
> ext4 is the same way, but I'd have to check for sure.  Thanks,

Jakob, could you please confirm whether your test includes a crash?

-chris



* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-17 20:09   ` Chris Mason
@ 2010-05-17 20:30     ` Jakob Unterwurzacher
  0 siblings, 0 replies; 23+ messages in thread
From: Jakob Unterwurzacher @ 2010-05-17 20:30 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, linux-btrfs

On 17/05/10 22:09, Chris Mason wrote:
>>> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
>>> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
>>> -rw-r--r-- 1 root root  0 2010-05-17 17:06:25.868035485 +0200 01282.cur
>>> [...]
>>> -rw-r--r-- 1 root root  0 2010-05-17 17:06:26.080003626 +0200 01291.cur
>>> -rw-rw-rw- 1 root root  0 2010-05-17 17:06:26.108010083 +0200 01292.tmp
>>>
>>
>> This isn't actually true.  There is no window, the inode isn't written to disk
>> until all of the data is flushed to disk.  So the in memory inode will be
>> updated, and therefore show an i_size of 0 since the io hasn't finished, but if
>> you were to crash at this point, when you came back up you'd have the old data
>> in place because the new inode data wasn't written to disk.  I have a feeling
>> ext4 is the same way, but I'd have to check for sure.  Thanks,
> 
> Jacob, could you please confirm if your test includes a crash?
> 
> -chris

Yes, I crash the VM by pressing reset in VirtualBox.
Note that the "ls" above is from the rename test that does NOT overwrite
existing files.

Jakob


* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-17 19:36 ` Chris Mason
@ 2010-05-18  0:14   ` Jakob Unterwurzacher
  2010-05-18  0:30     ` Chris Mason
  0 siblings, 1 reply; 23+ messages in thread
From: Jakob Unterwurzacher @ 2010-05-18  0:14 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs

On 17/05/10 21:36, Chris Mason wrote:
> 
> That should be a zero second window, we try to force things to disk
> during renames.
> 
> Could you please try this patch:
> 
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index c9f1020..9370a71 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -806,7 +806,7 @@ int btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
>  	 * if this file hasn't been changed since the last transaction
>  	 * commit, we can safely return without doing anything
>  	 */
> -	if (last_mod < root->fs_info->last_trans_committed)
> +	if (0 && last_mod < root->fs_info->last_trans_committed)


Ok, I upgraded to 2.6.34 final and switched to defconfig.
I only ran the rename test (i.e. no overwrite); the window is now
1.1s, both with the vanilla kernel and with the patch.

Jakob



* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18  0:14   ` Jakob Unterwurzacher
@ 2010-05-18  0:30     ` Chris Mason
  2010-05-18  0:59       ` Chris Mason
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Mason @ 2010-05-18  0:30 UTC (permalink / raw)
  To: Jakob Unterwurzacher; +Cc: linux-btrfs

On Tue, May 18, 2010 at 02:14:05AM +0200, Jakob Unterwurzacher wrote:
> On 17/05/10 21:36, Chris Mason wrote:
> > 
> > That should be a zero second window, we try to force things to disk
> > during renames.
> > 
> > Could you please try this patch:
> > 
> > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > index c9f1020..9370a71 100644
> > --- a/fs/btrfs/ordered-data.c
> > +++ b/fs/btrfs/ordered-data.c
> > @@ -806,7 +806,7 @@ int btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
> >  	 * if this file hasn't been changed since the last transaction
> >  	 * commit, we can safely return without doing anything
> >  	 */
> > -	if (last_mod < root->fs_info->last_trans_committed)
> > +	if (0 && last_mod < root->fs_info->last_trans_committed)
> 
> 
> Ok, I upgraded to 2.6.34 final and switched to defconfig.
> I only did the rename test ( i.e. no overwrite ), the window is now
> 1.1s, both with vanilla and with the patch.

Thanks.  So much for the easy fix; I'll take a look.

-chris



* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18  0:30     ` Chris Mason
@ 2010-05-18  0:59       ` Chris Mason
  2010-05-18 12:03         ` Jakob Unterwurzacher
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Mason @ 2010-05-18  0:59 UTC (permalink / raw)
  To: Jakob Unterwurzacher, linux-btrfs

On Mon, May 17, 2010 at 08:30:32PM -0400, Chris Mason wrote:
> On Tue, May 18, 2010 at 02:14:05AM +0200, Jakob Unterwurzacher wrote:
> > On 17/05/10 21:36, Chris Mason wrote:
> > > 
> > > That should be a zero second window, we try to force things to disk
> > > during renames.
> > > 
> > > Could you please try this patch:
> > > 
> > > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > > index c9f1020..9370a71 100644
> > > --- a/fs/btrfs/ordered-data.c
> > > +++ b/fs/btrfs/ordered-data.c
> > > @@ -806,7 +806,7 @@ int btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
> > >  	 * if this file hasn't been changed since the last transaction
> > >  	 * commit, we can safely return without doing anything
> > >  	 */
> > > -	if (last_mod < root->fs_info->last_trans_committed)
> > > +	if (0 && last_mod < root->fs_info->last_trans_committed)
> > 
> > 
> > Ok, I upgraded to 2.6.34 final and switched to defconfig.
> > I only did the rename test ( i.e. no overwrite ), the window is now
> > 1.1s, both with vanilla and with the patch.
> 
> Thanks, so much for the easy fix.  I'll take a look.

Ohhhhh, I read your initial email wrong, I'm sorry.  The test we're
failing, the rentest, doesn't overwrite one file with another.  It is
just creating a file and then renaming it.

Btrfs is explicitly choosing not to sync the file in this case because
the rename isn't replacing good old data with new unwritten data.  The
rename is taking new unwritten data and giving it a different name.

Are there applications that rely on this? 

-chris



* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18  0:59       ` Chris Mason
@ 2010-05-18 12:03         ` Jakob Unterwurzacher
  2010-05-18 13:13           ` Chris Mason
  0 siblings, 1 reply; 23+ messages in thread
From: Jakob Unterwurzacher @ 2010-05-18 12:03 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs

On 18/05/10 02:59, Chris Mason wrote:
>>> Ok, I upgraded to 2.6.34 final and switched to defconfig.
>>> I only did the rename test ( i.e. no overwrite ), the window is now
>>> 1.1s, both with vanilla and with the patch.
>>
>> Thanks, so much for the easy fix.  I'll take a look.
> 
> Ohhhhh, I read your initial email wrong, I'm sorry.  The test we're
> failing, the rentest, doesn't overwrite one file with another.  It is
> just creating a file and then renaming it.

Yes, the overwrite test goes perfectly fine.

> Btrfs is explicitly choosing not to sync the file in this case because
> the rename isn't replacing good old data with new unwritten data.  The
> rename is taking new unwritten data and giving it a different name.
> 
> Are there applications that rely on this? 
> 
> -chris

Well, dpkg (the Debian/Ubuntu package manager) did. Then ext4 became the
default fs in Ubuntu and massive breakage was reported [1]. Now dpkg is
fsync()ing everything and is about 2x slower than it was with ext3 [2].

Btrfs is so close to getting it "right" that I wondered whether the new
file name hitting the disk could be delayed that one second for the data
to make it to disk first.
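
For reference, the fsync-before-rename approach looks roughly like the sketch
below when applied to the rentest loop. It is not dpkg's actual code:
write_then_rename is a hypothetical helper, and it assumes a dd that supports
conv=fsync (GNU coreutils dd does).

```shell
# write_then_rename DEST
# Reads the new content from stdin, flushes it to disk, then renames it
# into place.
write_then_rename() {
    local dest=$1
    local tmp="$dest.tmp"
    # conv=fsync makes dd call fsync() on the output file before it exits,
    # so the data is on disk before the name $dest can point at it.
    dd of="$tmp" conv=fsync 2>/dev/null
    mv -f -- "$tmp" "$dest"
}
```

In rentest the loop body would become something like
'echo "File content $i." | write_then_rename "$i.cur"'. The sketch does not
fsync the containing directory, so the rename itself may still be lost in a
crash, but $i.cur should not show up with 0 bytes. The price is one forced
flush per file, which is the slowdown mentioned above.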

Anyway, btrfs is still a factor of 30 better than ext4 or xfs!

Thanks,
Jakob






[1] https://bugs.launchpad.net/ubuntu/+source/dpkg/+bug/512096 (notice
the massive duplicate list on the right!)

[2] https://bugs.launchpad.net/ubuntu/+source/dpkg/+bug/537241


* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 12:03         ` Jakob Unterwurzacher
@ 2010-05-18 13:13           ` Chris Mason
  2010-05-18 13:28             ` Oystein Viggen
                               ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Chris Mason @ 2010-05-18 13:13 UTC (permalink / raw)
  To: Jakob Unterwurzacher; +Cc: linux-btrfs

On Tue, May 18, 2010 at 02:03:49PM +0200, Jakob Unterwurzacher wrote:
> On 18/05/10 02:59, Chris Mason wrote:
> >>> Ok, I upgraded to 2.6.34 final and switched to defconfig.
> >>> I only did the rename test ( i.e. no overwrite ), the window is now
> >>> 1.1s, both with vanilla and with the patch.
> >>
> >> Thanks, so much for the easy fix.  I'll take a look.
> > 
> > Ohhhhh, I read your initial email wrong, I'm sorry.  The test we're
> > failing, the rentest, doesn't overwrite one file with another.  It is
> > just creating a file and then renaming it.
> 
> Yes, the overwrite test goes perfectly fine.
> 
> > Btrfs is explicitly choosing not to sync the file in this case because
> > the rename isn't replacing good old data with new unwritten data.  The
> > rename is taking new unwritten data and giving it a different name.
> > 
> > Are there applications that rely on this? 
> > 
> > -chris
> 
> Well, dpkg (the Debian/Ubuntu package manager) did. Then ext4 became the
> default fs in Ubuntu and massive breakage was reported [1]. Now dpkg is
> fsync()ing everything and is about 2x slower than it was with ext3 [2].
> 
> Btrfs is so close to getting it "right" that i wondered whether the new
> file name hitting the disk could be delayed that one second for the data
> to make it to disk first.
> 

The thing is that different apps have a different version of 'right'.  Rename
is atomically replacing one file with another, and I completely agree
that when we have an established file on disk, we shouldn't replace it
with something that is potentially garbage.

But for the zeros case we have a file that isn't on disk and we're just
giving it a new name.  I can see a different class of applications
getting upset about renames slowing the system down dramatically because
they suddenly imply a lot of IO.

I'm more than open to discussion on this one, but I don't see how:

rm -f foo2
dd if=/dev/zero of=foo bs=1M count=1000
mv foo foo2

Should be expected to write 1GB of data.

-chris


* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 13:13           ` Chris Mason
@ 2010-05-18 13:28             ` Oystein Viggen
  2010-05-18 14:47               ` Thomas Bellman
  2010-05-18 13:39             ` Aidan Van Dyk
                               ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Oystein Viggen @ 2010-05-18 13:28 UTC (permalink / raw)
  To: linux-btrfs

* [Chris Mason]

> I'm more than open to discussion on this one, but I don't see how:
>
> rm -f foo2
> dd if=/dev/zero of=foo bs=1M count=1000
> mv foo foo2
>
> Should be expected to write 1GB of data.

IIRC, the answer you're looking for is "it did with ext3 in the default
data=ordered mode".  Combine that with the ext3 data=ordered fsync()
escalation where (again IIRC) fsync() tended to force a full sync() of
the file system, and it's not that difficult to see why someone would
program with the expectation above.

Anyway, there's still a question of if a new file system should emulate
the quirks of the old file system (read: be bug compatible), or if you
can just expect to be popular enough that userspace adapts to the new
order and lets you do The Right Thing instead.

Øystein
-- 
Outgoing mail is certified Virus Free.
..of course, the virus would tell you the same thing..


* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 13:13           ` Chris Mason
  2010-05-18 13:28             ` Oystein Viggen
@ 2010-05-18 13:39             ` Aidan Van Dyk
  2010-05-18 14:06             ` Jakob Unterwurzacher
                               ` (2 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Aidan Van Dyk @ 2010-05-18 13:39 UTC (permalink / raw)
  To: Chris Mason, Jakob Unterwurzacher, linux-btrfs


* Chris Mason <chris.mason@oracle.com> [100518 09:13]:
 
> I'm more than open to discussion on this one, but I don't see how:

> Should be expected to write 1GB of data.

++

Please don't mess up BTRFS because older, inferior things are messed
up in certain ways.  If we're just going to continually perpetuate the
idea that broken-by-design apps are "right", we might as well just give
up on a better FS, and stick to "what broken apps are expecting" (i.e.
ext3).


-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 13:13           ` Chris Mason
  2010-05-18 13:28             ` Oystein Viggen
  2010-05-18 13:39             ` Aidan Van Dyk
@ 2010-05-18 14:06             ` Jakob Unterwurzacher
  2010-05-18 14:36               ` Chris Mason
  2010-05-18 23:00             ` Ric Wheeler
  2010-05-19  1:34             ` Andy Lutomirski
  4 siblings, 1 reply; 23+ messages in thread
From: Jakob Unterwurzacher @ 2010-05-18 14:06 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs

On 18/05/10 15:13, Chris Mason wrote:
> 
> The thing is that different apps have a different version of 'right'.  Rename
> is atomically replacing one file with another, and I completely agree
> that when we have an established file on disk, we shouldn't replace it
> with something that is potentially garbage.
> 
> But for the zeros case we have a file that isn't on disk and we're just
> giving it a new name.  I can see a different class of applications
> getting upset about renames slowing the system down dramatically because
> they suddenly imply a lot of IO.
> 
> I'm more than open to discussion on this one, but I don't see how:
> 
> rm -f foo2
> dd if=/dev/zero of=foo bs=1M count=1000
> mv foo foo2
> 
> Should be expected to write 1GB of data.
> 
> -chris

The idea would be to delay the rename hitting the disk until the data
has been written anyway.
The mv would return immediately, and someday, after the data has been
written to disk, the rename would be written to disk.
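
A userspace approximation of that ordering, just to illustrate the idea (the
proposal itself is about what the filesystem would do internally):
deferred_rename is a hypothetical helper that returns immediately and performs
the rename only after a flush has completed. It uses a plain sync, which
flushes far more than the one file and is only there to keep the sketch
portable.

```shell
# deferred_rename SRC DEST
# Returns at once; the rename happens in the background, and only after
# the dirty data has been flushed to disk.
deferred_rename() {
    local src=$1 dest=$2
    (
        sync                      # flush all dirty data (heavier than needed)
        mv -f -- "$src" "$dest"   # the new name appears only after the flush
    ) &
}

# Example:
#   echo "File content 1." > 1.tmp
#   deferred_rename 1.tmp 1.cur
#   wait    # only needed if the caller must see 1.cur before continuing
```

A crash before the background job gets to run leaves only the .tmp file
behind, never a 0-size .cur. The cost is that every pending rename holds a
background process until the flush finishes.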

Jakob


* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 14:06             ` Jakob Unterwurzacher
@ 2010-05-18 14:36               ` Chris Mason
  2010-05-18 15:57                 ` Jakob Unterwurzacher
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Mason @ 2010-05-18 14:36 UTC (permalink / raw)
  To: Jakob Unterwurzacher; +Cc: linux-btrfs

On Tue, May 18, 2010 at 04:06:45PM +0200, Jakob Unterwurzacher wrote:
> On 18/05/10 15:13, Chris Mason wrote:
> > 
> > The thing is that different apps have a different version of 'right'.  Rename
> > is atomically replacing one file with another, and I completely agree
> > that when we have an established file on disk, we shouldn't replace it
> > with something that is potentially garbage.
> > 
> > But for the zeros case we have a file that isn't on disk and we're just
> > giving it a new name.  I can see a different class of applications
> > getting upset about renames slowing the system down dramatically because
> > they suddenly imply a lot of IO.
> > 
> > I'm more than open to discussion on this one, but I don't see how:
> > 
> > rm -f foo2
> > dd if=/dev/zero of=foo bs=1M count=1000
> > mv foo foo2
> > 
> > Should be expected to write 1GB of data.
> > 
> > -chris
> 
> The idea would be to delay the rename hitting the disk until the data
> has been written anyway.
> The mv would return immediately, and someday, after the data has been
> written to disk, the rename would be written to disk.

This is possible, but we have to choose between consuming unbounded
resources while we queue up all the mvs or sometimes forcing the things
to disk.  At the end of the day, disks are so slow that eventually you
do end up waiting on them.

-chris


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 13:28             ` Oystein Viggen
@ 2010-05-18 14:47               ` Thomas Bellman
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Bellman @ 2010-05-18 14:47 UTC (permalink / raw)
  To: linux-btrfs

On 05/18/10 15:28, Oystein Viggen wrote:

> * [Chris Mason]
>
>> I'm more than open to discussion on this one, but I don't see how:
>>
>> rm -f foo2
>> dd if=/dev/zero of=foo bs=1M count=1000
>> mv foo foo2
>>
>> Should be expected to write 1GB of data.
>
> IIRC, the answer you're looking for is "it did with ext3 in the default
> data=ordered mode".  Combine that with the ext3 data=ordered fsync()
> escalation where (again IIRC) fsync() tended to force a full sync() of
> the file system, and it's not that difficult to see why someone would
> program with the expectation above.
>
> Anyway, there's still a question of if a new file system should emulate
> the quirks of the old file system (read: be bug compatible), or if you
> can just expect to be popular enough that userspace adapts to the new
> order and lets you do The Right Thing instead.

So what *is* the right thing?  What kind of API should userspace have?
If the obvious thing for an application programmer to do is wrong, and
the right thing requires going through more hoops, that will ensure
that the majority of applications will be buggy.  We should strive
to make it easy to get things right.

It's easy for the kernel, and the filesystem, to just ask the userspace
programmers to jump through the hoops, and declare those programs that
don't to be broken.

On the other hand, if you go *too* far in absolving applications of
responsibility for making things safe, you would end up making all
filesystem operations synchronous, and that obviously hurts performance
in big ways.  So we need some kind of compromise, and where that
compromise should end up being, I don't really have the answer to.
It's just that I feel that often only the kernel programmers' view is
represented here.


The pattern of writing to a file and then changing its name *without*
overwriting an existing file, is quite common when you write files to
a spool directory, and have another program that picks up files from
that directory and processes them.  You do:

     fd = open("foo4711.tmp", O_CREAT|O_EXCL|O_RDWR, 0666);
     write(fd, "data", strlen("data"));
     close(fd);
     link("foo4711.tmp", "foo4711");
     unlink("foo4711.tmp");

(And note that careful programs don't use rename() here, because that
would risk clobbering a file some other process has written, and instead
use link()+unlink().  And I really wish a "safe_rename()" syscall that
didn't clobber existing files existed.)

The programs I personally have written that did this, also had an fsync()
there, because I received data from another system and didn't want to ACK
until I knew it was safely on disk at my end.  But I am a fairly careful
programmer.
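
As a rough sketch, the spool pattern above with the fsync() added could
look like the following.  The function name spool_publish and its
arguments are made up for illustration; only standard POSIX calls are
used.  Because link() fails with EEXIST when the target name is already
taken, an existing spool file is never clobbered.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write buf to tmp, force it to disk, then publish
 * it under final without clobbering an existing file.
 * Returns 0 on success, -1 on error with errno set.
 */
int spool_publish(const char *tmp, const char *final,
                  const void *buf, size_t len)
{
    const char *p = buf;
    int saved;
    int fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0666);

    if (fd < 0)
        return -1;

    /* write() may be short, so loop until everything is written. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0)
            goto fail;
        p += n;
        len -= (size_t)n;
    }

    /* Data must be on disk before the final name becomes visible. */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;

    /* link() refuses to replace an existing file (EEXIST). */
    if (link(tmp, final) < 0)
        goto fail;
    return unlink(tmp);

fail:
    saved = errno;
    if (fd >= 0)
        close(fd);
    unlink(tmp);          /* best-effort cleanup of the temp file */
    errno = saved;
    return -1;
}
```

Note that this sketch does not fsync() the containing directory, so the
new name itself may still be lost in a crash; only the "name exists but
data is missing" case is covered.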


Note that in my previous life I was a userspace programmer, and in my
current life I'm a sysadmin.  I'm speaking as an interested user of
Btrfs, not as a kernel programmer.


	/Thomas Bellman

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 14:36               ` Chris Mason
@ 2010-05-18 15:57                 ` Jakob Unterwurzacher
  2010-05-18 16:10                   ` Chris Mason
  0 siblings, 1 reply; 23+ messages in thread
From: Jakob Unterwurzacher @ 2010-05-18 15:57 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs

On 18/05/10 16:36, Chris Mason wrote:
>>
>> The idea would be to delay the rename hitting the disk until the data
>> has been written anyway.
>> The mv would return immediately, and someday, after the data has been
>> written to disk, the rename would be written to disk.
> 
> This is possible, but we have to choose between consuming unbounded
> resources while we queue up all the mvs or sometimes forcing the things
> to disk.  At the end of the day, disks are so slow that eventually you
> do end up waiting on them.
> 
> -chris
> 

I'm not sure how much memory a queued rename takes up, but the time that
would be spent flushing it to disk would then be spent flushing file
data, draining the write buffer and freeing memory, no?

That would be writing to disk

 [Data..................][Rename]  or
 [Rename][Data..................]

Whether you drain the file data queue or the rename queue first, in the
end you'd have to write it all....

I thought the problem of delaying the renames was complexity, well, at
least Ts'o said it was [1] - I'm not sure if this applies to btrfs as well.


Thanks,
Jakob



[1] https://bugzilla.kernel.org/show_bug.cgi?id=15910#c9

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 15:57                 ` Jakob Unterwurzacher
@ 2010-05-18 16:10                   ` Chris Mason
  2010-05-18 18:01                     ` Goffredo Baroncelli
  2010-05-18 18:24                     ` Jakob Unterwurzacher
  0 siblings, 2 replies; 23+ messages in thread
From: Chris Mason @ 2010-05-18 16:10 UTC (permalink / raw)
  To: Jakob Unterwurzacher; +Cc: linux-btrfs

On Tue, May 18, 2010 at 05:57:49PM +0200, Jakob Unterwurzacher wrote:
> On 18/05/10 16:36, Chris Mason wrote:
> >>
> >> The idea would be to delay the rename hitting the disk until the data
> >> has been written anyway.
> >> The mv would return immediately, and someday, after the data has been
> >> written to disk, the rename would be written to disk.
> > 
> > This is possible, but we have to choose between consuming unbounded
> > resources while we queue up all the mvs or sometimes forcing the things
> > to disk.  At the end of the day, disks are so slow that eventually you
> > do end up waiting on them.
> > 
> > -chris
> > 
> 
> I'm not sure how much memory a queued rename takes up, but the time that
> would be spent flushing it to disk would then be spent flushing file
> data, draining the write buffer and freeing memory, no?
> 
> That would be writing to disk
> 
>  [Data..................][Rename]  or
>  [Rename][Data..................]

Actually it is:

[Data..................][allow the transaction commit to complete]  or
[allow the transaction commit to complete][Data..................]

The problem is that people think of the rename as a tiny thing, but it
is really bundled in with all of the other metadata operations that were
done in the current transaction.   The space that was allocated to hold
the new file name, the space that was freed to remove the old file name,
the directory entries, the directory inode etc etc.

This means that holding back that one rename requires holding back every
operation done to the filesystem.

In btrfs, we're still able to do fsyncs quickly in this case
because we have a dedicated log for that.  But there are a few different
types of operations (like disk management) that require us to wait for
the transaction to complete even when we use the dedicated log.

> 
> Whether you drain the file data queue or the rename queue first, in the
> end you'd have to write it all....

It's about latency.  The latency required to write the entire file is
unbounded (the size of the file is unbounded).  The latency required to
commit the transaction without the file data is bounded because we are
able to control the amount of metadata in each transaction.

See the firefox vs ext3 wars for an example of all of this, it's the
latency the firefox people were (rightly) complaining about.

> 
> I thought the problem of delaying the renames was complexity, well, at
> least Ts'o said it was [1] - I'm not sure if this applies to btrfs as well.

I'm afraid there are lots and lots of different issues at play.  The
most important way to look at it is that forcing data to disk is very
slow, which is why we try to avoid it whenever we can.

Applications can request that the data go to disk via lots of different
ways.  Rename was never ever meant to be one of them, but it really does
make sense to provide atomic replacement of old good data with new good
data, so we've implemented that extra syncing.

Implementing syncing when userland doesn't expect extra syncing usually
just makes userland very unhappy.  It's not that we can't do it; it's that
doing it has implications for every application that uses rename.

-chris


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 16:10                   ` Chris Mason
@ 2010-05-18 18:01                     ` Goffredo Baroncelli
  2010-05-18 18:24                     ` Jakob Unterwurzacher
  1 sibling, 0 replies; 23+ messages in thread
From: Goffredo Baroncelli @ 2010-05-18 18:01 UTC (permalink / raw)
  To: linux-btrfs

On Tuesday, May 18, 2010, Chris Mason wrote:
> On Tue, May 18, 2010 at 05:57:49PM +0200, Jakob Unterwurzacher wrote:
> > On 18/05/10 16:36, Chris Mason wrote:
[...]
> > 
> > I thought the problem of delaying the renames was complexity, well, at
> > least Ts'o said it was [1] - I'm not sure if this applies to btrfs as well.
> 
> I'm afraid there are lots and lots of different issues at play.  The
> most important way to look at it is that forcing data to disk is very
> slow, which is why we try to avoid it whenever we can.
> 
> Applications can request that the data go to disk via lots of different
> ways.  Rename was never ever meant to be one of them, but it really does
> make sense to provide atomic replacement of old good data with new good
> data, so we've implemented that extra syncing.
> 
> Implementing syncing when userland doesn't expect extra syncing usually
> just makes userland very unhappy.  It's not that we can't do it; it's that
> doing it has implications for every application that uses rename.
> 
> -chris


Funny, the first thing that comes to my mind reading this thread is that this
kind of complaint is raised about a file system which is able to support a
full rollback via snapshots.

I think that a "right" solution would be to integrate the package manager
with the btrfs snapshot capability (as Nexenta does [1]). But it is clear that
this is a long-term solution (IIRC Fedora is working on this).

In the meantime, what would be the "right" way to solve the dpkg
problem (and, more generally, the package manager problem) with btrfs?


[1] http://www.nexenta.org/os/TransactionalZFSUpgrades

-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijackATinwind.it>
Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 16:10                   ` Chris Mason
  2010-05-18 18:01                     ` Goffredo Baroncelli
@ 2010-05-18 18:24                     ` Jakob Unterwurzacher
  1 sibling, 0 replies; 23+ messages in thread
From: Jakob Unterwurzacher @ 2010-05-18 18:24 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs

On 18/05/10 18:10, Chris Mason wrote:
>>
>> I'm not sure how much memory a queued rename takes up, but the time that
>> would be spent flushing it to disk would then be spent flushing file
>> data, draining the write buffer and freeing memory, no?
>>
>> That would be writing to disk
>>
>>  [Data..................][Rename]  or
>>  [Rename][Data..................]
> 
> Actually it is:
> 
> [Data..................][allow the transaction commit to complete]  or
> [allow the transaction commit to complete][Data..................]
> 
> The problem is that people think of the rename as a tiny thing, but it
> is really bundled in with all of the other metadata operations that were
> done in the current transaction.   The space that was allocated to hold
> the new file name, the space that was freed to remove the old file name,
> the directory entries, the directory inode etc etc.
> 
> This means that holding back that one rename requires holding back every
> operation done to the filesystem.
> 
> In btrfs, we're still able to do fsyncs quickly in this case
> because we have a dedicated log for that.  But there are a few different
> types of operations (like disk management) that require us to wait for
> the transaction to complete even when we use the dedicated log.
> 
>>
>> Whether you drain the file data queue or the rename queue first, in the
>> end you'd have to write it all....
> 
> It's about latency.  The latency required to write the entire file is
> unbounded (the size of the file is unbounded).  The latency required to
> commit the transaction without the file data is bounded because we are
> able to control the amount of metadata in each transaction.
> 
> See the firefox vs ext3 wars for an example of all of this, it's the
> latency the firefox people were (rightly) complaining about.
> 
>>
>> I thought the problem of delaying the renames was complexity, well, at
>> least Ts'o said it was [1] - I'm not sure if this applies to btrfs as well.
> 
> I'm afraid there are lots and lots of different issues at play.  The
> most important way to look at it is that forcing data to disk is very
> slow, which is why we try to avoid it whenever we can.
> 
> Applications can request that the data go to disk via lots of different
> ways.  Rename was never ever meant to be one of them, but it really does
> make sense to provide atomic replacement of old good data with new good
> data, so we've implemented that extra syncing.
> 
> Implementing syncing when userland doesn't expect extra syncing usually
> just makes userland very unhappy.  It's not that we can't do it; it's that
> doing it has implications for every application that uses rename.
> 
> -chris

Thanks for all the insight.

I will update the wiki FAQ to make clear what "data=ordered" in btrfs
means, what not, and why (or something like that).


Jakob

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 13:13           ` Chris Mason
                               ` (2 preceding siblings ...)
  2010-05-18 14:06             ` Jakob Unterwurzacher
@ 2010-05-18 23:00             ` Ric Wheeler
  2010-05-19  1:05               ` Bruce Guenter
  2010-05-19  1:34             ` Andy Lutomirski
  4 siblings, 1 reply; 23+ messages in thread
From: Ric Wheeler @ 2010-05-18 23:00 UTC (permalink / raw)
  To: Chris Mason, Jakob Unterwurzacher, linux-btrfs

On 05/18/2010 09:13 AM, Chris Mason wrote:
> On Tue, May 18, 2010 at 02:03:49PM +0200, Jakob Unterwurzacher wrote:
>    
>> On 18/05/10 02:59, Chris Mason wrote:
>>      
>>>>> Ok, I upgraded to 2.6.34 final and switched to defconfig.
>>>>> I only did the rename test ( i.e. no overwrite ), the window is now
>>>>> 1.1s, both with vanilla and with the patch.
>>>>>            
>>>> Thanks, so much for the easy fix.  I'll take a look.
>>>>          
>>> Ohhhhh, I read your initial email wrong, I'm sorry.  The test we're
>>> failing, the rentest, doesn't overwrite one file with another.  It is
>>> just creating a file and then renaming it.
>>>        
>> Yes, the overwrite test goes perfectly fine.
>>
>>      
>>> Btrfs is explicitly choosing not to sync the file in this case because
>>> the rename isn't replacing good old data with new unwritten data.  The
>>> rename is taking new unwritten data and giving it a different name.
>>>
>>> Are there applications that rely on this?
>>>
>>> -chris
>>>        
>> Well, dpkg (the Debian/Ubuntu package manager) did. Then ext4 became the
>> default fs in Ubuntu and massive breakage was reported [1]. Now dpkg is
>> fsync()ing everything and is about 2x slower than it was with ext3 [2].
>>
>> Btrfs is so close to getting it "right" that I wondered whether the new
>> file name hitting the disk could be delayed that one second for the data
>> to make it to disk first.
>>
>>      
> The thing is that different apps have a different version of 'right'.  Rename
> is atomically replacing one file with another, and I completely agree
> that when we have an established file on disk, we shouldn't replace it
> with something that is potentially garbage.
>
> But for the zeros case we have a file that isn't on disk and we're just
> giving it a new name.  I can see a different class of applications
> getting upset about renames slowing the system down dramatically because
> they suddenly imply a lot of IO.
>
> I'm more than open to discussion on this one, but I don't see how:
>
> rm -f foo2
> dd if=/dev/zero of=foo bs=1M count=1000
> mv foo foo2
>
> Should be expected to write 1GB of data.
>
> -chris
>    

Just to weigh in here, I think that you have the right behaviour 
already. If an application wants to force this to sync the data to disk, 
it should use fsync() after the rename.

Having applications depend on semantics that only ext3 provided is not an
excuse for making a rename take multiple seconds....

Thanks!

Ric


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 23:00             ` Ric Wheeler
@ 2010-05-19  1:05               ` Bruce Guenter
  0 siblings, 0 replies; 23+ messages in thread
From: Bruce Guenter @ 2010-05-19  1:05 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 791 bytes --]

On Tue, May 18, 2010 at 07:00:57PM -0400, Ric Wheeler wrote:
> Just to weigh in here, I think that you have the right behaviour 
> already. If an application wants to force this to sync the data to disk, 
> it should use fsync() after the rename.

Actually, it pretty much has to fsync before the rename (to ensure the
contents are on disk) and possibly fsync the directory after to ensure
the rename hits the disk.  If you fsync after the rename, there is still
no guarantee that a crash won't cause partial data on disk with the new
filename, unless you assume the filesystem orders the writes so the
rename happens after the data hits the disk.  AFAIK most filesystems
make no such guarantee.
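
A minimal sketch of that ordering, assuming only POSIX calls, might look
like this.  The function name publish_durably and its arguments are
hypothetical and not taken from any existing tool: the file is fsync()ed
before the rename, and the directory is fsync()ed after it.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write buf to dir/tmp_name, then rename it to
 * dir/final_name so that after a crash final_name is either absent/old
 * or holds the complete new contents.
 * Returns 0 on success, -1 on error.
 */
int publish_durably(const char *dir, const char *tmp_name,
                    const char *final_name, const void *buf, size_t len)
{
    const char *p = buf;
    int fd;
    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);

    if (dirfd < 0)
        return -1;

    fd = openat(dirfd, tmp_name, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) {
        close(dirfd);
        return -1;
    }

    /* write() may be short, so loop until everything is written. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0)
            goto fail;
        p += n;
        len -= (size_t)n;
    }

    /* Step 1: file contents reach the disk before the rename. */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;

    /* Step 2: atomically give the file its final name. */
    if (renameat(dirfd, tmp_name, dirfd, final_name) < 0)
        goto fail;

    /* Step 3: make the rename itself durable. */
    if (fsync(dirfd) < 0) {
        close(dirfd);
        return -1;
    }
    close(dirfd);
    return 0;

fail:
    if (fd >= 0)
        close(fd);
    unlinkat(dirfd, tmp_name, 0);   /* best-effort cleanup */
    close(dirfd);
    return -1;
}
```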

-- 
Bruce Guenter <bruce@untroubled.org>                http://untroubled.org/

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Rename+crash behaviour of btrfs - nearly ext3!
  2010-05-18 13:13           ` Chris Mason
                               ` (3 preceding siblings ...)
  2010-05-18 23:00             ` Ric Wheeler
@ 2010-05-19  1:34             ` Andy Lutomirski
  4 siblings, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2010-05-19  1:34 UTC (permalink / raw)
  To: Chris Mason, Jakob Unterwurzacher, linux-btrfs

Chris Mason wrote:
> On Tue, May 18, 2010 at 02:03:49PM +0200, Jakob Unterwurzacher wrote:
>> On 18/05/10 02:59, Chris Mason wrote:
>>>>> Ok, I upgraded to 2.6.34 final and switched to defconfig.
>>>>> I only did the rename test ( i.e. no overwrite ), the window is now
>>>>> 1.1s, both with vanilla and with the patch.
>>>> Thanks, so much for the easy fix.  I'll take a look.
>>> Ohhhhh, I read your initial email wrong, I'm sorry.  The test we're
>>> failing, the rentest, doesn't overwrite one file with another.  It is
>>> just creating a file and then renaming it.
>> Yes, the overwrite test goes perfectly fine.
>>
>>> Btrfs is explicitly choosing not to sync the file in this case because
>>> the rename isn't replacing good old data with new unwritten data.  The
>>> rename is taking new unwritten data and giving it a different name.
>>>
>>> Are there applications that rely on this? 
>>>
>>> -chris
>> Well, dpkg (the Debian/Ubuntu package manager) did. Then ext4 became the
>> default fs in Ubuntu and massive breakage was reported [1]. Now dpkg is
>> fsync()ing everything and is about 2x slower than it was with ext3 [2].
>>
>> Btrfs is so close to getting it "right" that I wondered whether the new
>> file name hitting the disk could be delayed that one second for the data
>> to make it to disk first.
>>
> 
> The thing is that different apps have a different version of 'right'.  Rename
> is atomically replacing one file with another, and I completely agree
> that when we have an established file on disk, we shouldn't replace it
> with something that is potentially garbage.
> 
> But for the zeros case we have a file that isn't on disk and we're just
> giving it a new name.  I can see a different class of applications
> getting upset about renames slowing the system down dramatically because
> they suddenly imply a lot of IO.
> 
> I'm more than open to discussion on this one, but I don't see how:
> 
> rm -f foo2
> dd if=/dev/zero of=foo bs=1M count=1000
> mv foo foo2
> 
> Should be expected to write 1GB of data.

[disclaimer: I don't know much about btrfs internals]

foo2 being gone after a crash is, of course, fine.  But, depending on 
the programmer, there are a few answers:

1. I want foo2 to either not exist or to contain the data I just wrote. 
  So please wait for it to hit disk.

2. I want foo2 to either not exist or to contain the data I just wrote. 
  So, btrfs, please learn how to make sure that the metadata doesn't get 
written until the data gets written.  Presumably this means that the 
rename needs to go into a log somewhere (in memory) but not become a 
part of the current transaction to avoid all kinds of latency.

3. I want speed.  Do whatever's fastest.

Of course, there's a harder case:

dd if=/dev/zero of=foo bs=1M count=1000
mv foo foo2
dd if=<something else> of=foo2 bs=1k count=1

Now what?


A lot of application programmers probably want the metadata to happen 
after the data, but they don't want to use fsync because they don't want 
to wait for anything to hit disk.  It would be nice to ask the FS for 
help, but that might be distinctly nontrivial.

--Andy

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2010-05-19  1:34 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-17 18:04 Rename+crash behaviour of btrfs - nearly ext3! Jakob Unterwurzacher
2010-05-17 19:12 ` Ric Wheeler
2010-05-17 19:25 ` Josef Bacik
2010-05-17 20:09   ` Chris Mason
2010-05-17 20:30     ` Jakob Unterwurzacher
2010-05-17 19:36 ` Chris Mason
2010-05-18  0:14   ` Jakob Unterwurzacher
2010-05-18  0:30     ` Chris Mason
2010-05-18  0:59       ` Chris Mason
2010-05-18 12:03         ` Jakob Unterwurzacher
2010-05-18 13:13           ` Chris Mason
2010-05-18 13:28             ` Oystein Viggen
2010-05-18 14:47               ` Thomas Bellman
2010-05-18 13:39             ` Aidan Van Dyk
2010-05-18 14:06             ` Jakob Unterwurzacher
2010-05-18 14:36               ` Chris Mason
2010-05-18 15:57                 ` Jakob Unterwurzacher
2010-05-18 16:10                   ` Chris Mason
2010-05-18 18:01                     ` Goffredo Baroncelli
2010-05-18 18:24                     ` Jakob Unterwurzacher
2010-05-18 23:00             ` Ric Wheeler
2010-05-19  1:05               ` Bruce Guenter
2010-05-19  1:34             ` Andy Lutomirski
