linux-kernel.vger.kernel.org archive mirror
* [BK][PATCH] Reiser4, will double Linux FS performance, please apply
@ 2002-10-31 21:23 Hans Reiser
  2002-10-31 22:34 ` Dieter Nützel
  0 siblings, 1 reply; 38+ messages in thread
From: Hans Reiser @ 2002-10-31 21:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, Reiserfs-List

[-- Attachment #1: Type: text/plain, Size: 3503 bytes --]

Scary costume sent separately in case your spam filters reject it.

Reasons to apply:

* will more than double Linux filesystem performance (see 
http://www.namesys.com/v4/fast_reiser4.html); this was measured for 
reading and writing the Linux source code tree

* applying will allow vm and vfs changes to be tested and benchmarked on 
what will be the fastest Linux fs by a factor of two when 2.6.0 ships

* performs all fs operations as an atomic transaction, so that, for 
instance, write() and truncate() system calls either happen entirely or 
not at all

* creates necessary infrastructure for an atomic fs transaction API (not 
yet included in Reiser4).  

* scalable by design to arbitrarily large numbers of CPUs (uses per-node 
locks rather than system-wide locks)

* eliminates the fixed-size journal area

* creates a complete plugin-based infrastructure.  This will allow 
folding in such features as constraints and inheritance as easily coded 
plugins.  It will make it possible to implement new security attributes 
as just files with new plugins.

* First installment of an effective competitor to the Microsoft OFS 
project.  No other Linux FS is even trying to provide an alternative to OFS.

We are quite excited over having combined such dramatic performance 
increases with atomic transactions, even better packing of small files, 
and a plugin infrastructure.  This is functionality that has killed 
performance in other filesystems.  (BeFS, for instance, was forced to 
abandon important parts of its original vision for performance reasons.)

You once told me that you agreed that filesystems should have until 6 
weeks after VM/VFS stabilizes.   I regret that I need to remind 
you of that.  Reiser4 could not be ready earlier.  The changes we need 
in the core code are all fairly trivial (exporting functions and the 
like); I'll let you read the details yourself.   I hope that my fellow 
tribesmen will look at the wooly mammoth on my shoulders as I come back 
from the hunt, forgive me for being late for dinner, think a thought for 
the poor hungry MS tribe, and help me make a roasting spit.;-)

We circulated all of the changes we needed in the core something like 
two weeks ago, nobody objected, and Andrew Morton actually read through 
them and ok'd them.  Viro and Hellwig of course didn't read them on the 
first posting, and then waited until today to find something to object 
to, and complained there wasn't enough time left in today.  (Being just 
as helpful to our integration as with V3....)  We will be happy to fix 
things in the manner the discussion leads to as soon as the discussion 
resolves; it seems to be still in progress as I write.

Reiser4 is clearly labeled as EXPERIMENTAL with notes that it should 
only be used by developers, benchmarkers, and testers for now.  It 
passes fsx and dbench and passes mongo.pl on UP, but it crashes under 
mongo.pl on SMP.  We expect it to be suitable for removal of the 
EXPERIMENTAL label before 2.6.0 ships (when it is suitable to remove it 
from the rest of the kernel. ;-) )  

I'd like to offer you a seminar on Reiser4 if you have time.   I am in 
the US/bayarea for Halloween and next month.  (My kids get to try their 
first Halloween today.  I hope your kids have fun too.)

I won't send you Nikita's other patch emails, as I see you are 
already reading them.  Please consider Nikita to be authorized as the 
official maintainer of Reiser4 for the next month (until my return to 
Moscow).

Best,

Hans


[-- Attachment #2: [reiserfs-list] [PATCH]: reiser4 [0/8] overview --]
[-- Type: message/rfc822, Size: 1847 bytes --]

From: Nikita Danilov <Nikita@Namesys.COM>
To: Linus Torvalds <Torvalds@Transmeta.COM>
Cc: Linux Kernel Mailing List <Linux-Kernel@Vger.Kernel.ORG>, Reiserfs mail-list <Reiserfs-List@Namesys.COM>
Subject: [reiserfs-list] [PATCH]: reiser4 [0/8] overview
Date: Thu, 31 Oct 2002 19:02:49 +0300
Message-ID: <15809.21545.509551.601735@laputa.namesys.com>

Hello, Linus,

This message starts a set of 8 patches against your current BK tree to
include reiser4.

Changes to the core code are fairly small and trivial: mostly function
exports, plus one patch to share the ->journal_info pointer with Ext3.

All patches are available at http://namesys.com/snapshots/2002.10.31/;
they can be applied in any order.

Utilities, including mkfs.reiser4, are available at
http://namesys.com/snapshots/2002.10.31/reiser4progs-0.1.0.tar.gz

Nikita.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-10-31 21:23 [BK][PATCH] Reiser4, will double Linux FS performance, please apply Hans Reiser
@ 2002-10-31 22:34 ` Dieter Nützel
  2002-10-31 22:47   ` Hans Reiser
  0 siblings, 1 reply; 38+ messages in thread
From: Dieter Nützel @ 2002-10-31 22:34 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Hans Reiser, Linux Kernel, Reiserfs-List

On Thursday, 31 October 2002 22:05, Jeff Garzik wrote:
> Hans Reiser wrote:
>
> > If you want to talk about 2.6 then you should talk about reiser4 not 
> > reiserfs v3, and reiser4 is 7.6 times the write performance of ext3 
> > for 30 copies of the linux kernel source code using modern IDE drives 
> > and modern processors on a dual-CPU box, so I don't think any amount 
> > of improved scalability will make ext3 competitive with reiser4 for 
> > performance usages. 
>
> What is the read performance like?

From the paper he mentioned, http://www.namesys.com/v4/fast_reiser4.html, it is 
more than doubled compared to ext3 and ReiserFS v3.

To be fair, he should explain whether it was compared against the latest ext3 
(htree) code or not.

It looks truly impressive.

Regards,
	Dieter
-- 
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
@home: Dieter.Nuetzel at hamburg.de (replace at with @)


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-10-31 22:34 ` Dieter Nützel
@ 2002-10-31 22:47   ` Hans Reiser
  2002-11-01  1:17     ` [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: Hans Reiser @ 2002-10-31 22:47 UTC (permalink / raw)
  To: Dieter Nützel
  Cc: Jeff Garzik, Linux Kernel, Reiserfs-List, Oleg Drokin, zam, umka

Dieter Nützel wrote:

>On Thursday, 31 October 2002 22:05, Jeff Garzik wrote:
>  
>
>>Hans Reiser wrote:
>>
>>    
>>
>>>If you want to talk about 2.6 then you should talk about reiser4 not 
>>>reiserfs v3, and reiser4 is 7.6 times the write performance of ext3 
>>>for 30 copies of the linux kernel source code using modern IDE drives 
>>>and modern processors on a dual-CPU box, so I don't think any amount 
>>>of improved scalability will make ext3 competitive with reiser4 for 
>>>performance usages. 
>>>      
>>>
>>What is the read performance like?
>>    
>>
>
>From the paper he mentioned, http://www.namesys.com/v4/fast_reiser4.html, it is 
>more than doubled compared to ext3 and ReiserFS v3.
>
>To be fair, he should explain whether it was compared against the latest ext3 
>(htree) code or not.
>
>It looks truly impressive.
>
>Regards,
>	Dieter
>  
>
Unfortunately that was an older version of reiser4, and we are still 
analyzing why it has higher read performance than what we are shipping 
today.  Give me a week, and I'll have a better answer for you.  What we 
shipped has higher read performance than ext3, but something is not what 
it should be and needs fixing.

Green and Zam and Umka, on Monday please start work on seriously 
analyzing how the block allocation differs between the new and the old 
kernel, now that you can finally reproduce the benchmark on the old kernel.

-- 
Hans




* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-10-31 22:47   ` Hans Reiser
@ 2002-11-01  1:17     ` Andrew Morton
  2002-11-01  1:27       ` Andrew Morton
  2002-11-01  1:27       ` Hans Reiser
  0 siblings, 2 replies; 38+ messages in thread
From: Andrew Morton @ 2002-11-01  1:17 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Dieter Nützel, Jeff Garzik, Linux Kernel, Reiserfs-List,
	Oleg Drokin, zam, umka

Hans Reiser wrote:
> 
> Green and Zam and Umka, on Monday please start work on seriously
> analyzing how the block allocation differs between the new and the old
> kernel, now that you can finally reproduce the benchmark on the old kernel.

I just sent the Orlov allocator patch to Linus.  It will double or
triple ext2 performance in that test, so please make sure you compare
against the latest.  There's a copy at
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.45/shpte-stuff/broken-out/orlov-allocator.patch

We can expect similar gains for ext3, when that's done.

(The 2x-3x is on an 8meg filesystem.  Larger filesystems should
gain more)


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-01  1:17     ` [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply Andrew Morton
@ 2002-11-01  1:27       ` Andrew Morton
  2002-11-01  1:27       ` Hans Reiser
  1 sibling, 0 replies; 38+ messages in thread
From: Andrew Morton @ 2002-11-01  1:27 UTC (permalink / raw)
  To: Hans Reiser, Dieter Nützel, Jeff Garzik, Linux Kernel,
	Reiserfs-List, Oleg Drokin, zam, umka

Andrew Morton wrote:
> 
> ...
> (The 2x-3x is on an 8meg filesystem.  Larger filesystems should
> gain more)

s/meg/gig/


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-01  1:17     ` [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply Andrew Morton
  2002-11-01  1:27       ` Andrew Morton
@ 2002-11-01  1:27       ` Hans Reiser
  2002-11-01  1:33         ` Andrew Morton
  1 sibling, 1 reply; 38+ messages in thread
From: Hans Reiser @ 2002-11-01  1:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dieter Nützel, Jeff Garzik, Linux Kernel, Reiserfs-List,
	Oleg Drokin, zam, umka

Andrew Morton wrote:

>Hans Reiser wrote:
>  
>
>>Green and Zam and Umka, on Monday please start work on seriously
>>analyzing how the block allocation differs between the new and the old
>>kernel, now that you can finally reproduce the benchmark on the old kernel.
>>    
>>
>
>I just sent the Orlov allocator patch to Linus.  It will double or
>triple ext2 performance in that test, so please make sure you compare
>against the latest.  There's a copy at
>http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.45/shpte-stuff/broken-out/orlov-allocator.patch
>
>We can expect similar gains for ext3, when that's done.
>
>(The 2x-3x is on an 8meg filesystem.  Larger filesystems should
>gain more)
>
>
>  
>
Well, if we are only 2.5 times as fast for writes as ext3 after your 
patch is applied, I'll still feel good.;-)  

Better benchmarks will be conducted during the next 3 months, the ones 
we have are still a bit raw....

-- 
Hans




* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-01  1:27       ` Hans Reiser
@ 2002-11-01  1:33         ` Andrew Morton
  2002-11-01  1:44           ` Dieter Nützel
                             ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Andrew Morton @ 2002-11-01  1:33 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Dieter Nützel, Jeff Garzik, Linux Kernel, Oleg Drokin, zam, umka

Hans Reiser wrote:
> 
> Well, if we are only 2.5 times as fast for writes as ext3 after your
> patch is applied, I'll still feel good.;-)
> 

whupping ext3's butt on write performance isn't very hard, really ;)

But it should be done based on "feature equivalency".  By default,
ext3 uses ordered data writes.  Data is written to disk before
the metadata to which that data refers is committed to journal.

It would be questionable to compare a metadata-only journalling
approach to ext3 with data=journal or data=ordered.


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-01  1:33         ` Andrew Morton
@ 2002-11-01  1:44           ` Dieter Nützel
  2002-11-01  1:55           ` Hans Reiser
  2002-11-01  4:36           ` Linus Torvalds
  2 siblings, 0 replies; 38+ messages in thread
From: Dieter Nützel @ 2002-11-01  1:44 UTC (permalink / raw)
  To: Andrew Morton, Hans Reiser
  Cc: Jeff Garzik, Linux Kernel, Oleg Drokin, zam, umka

On Friday, 1 November 2002 02:33, Andrew Morton wrote:
> Hans Reiser wrote:
> > Well, if we are only 2.5 times as fast for writes as ext3 after your
> > patch is applied, I'll still feel good.;-)
>
> whupping ext3's butt on write performance isn't very hard, really ;)
>
> But it should be done based on "feature equivalency".  By default,
> ext3 uses ordered data writes.  Data is written to disk before
> the metadata to which that data refers is committed to journal.
>
> It would be questionable to compare a metadata-only journalling
> approach to ext3 with data=journal or data=ordered.

As I understand it, Reiser4 will have that from the beginning.
It is an entirely new design, unlike ReiserFS v3, which only gets this with 
Chris's data-logging patches, delayed to 2.4.21/2.5.45+.

Plugins for encryption, ACLs, etc. are in the works.

-Dieter


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-01  1:33         ` Andrew Morton
  2002-11-01  1:44           ` Dieter Nützel
@ 2002-11-01  1:55           ` Hans Reiser
  2002-11-01 10:23             ` Tomas Szepe
  2002-11-01  4:36           ` Linus Torvalds
  2 siblings, 1 reply; 38+ messages in thread
From: Hans Reiser @ 2002-11-01  1:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dieter Nützel, Jeff Garzik, Linux Kernel, Oleg Drokin, zam, umka

Andrew Morton wrote:

>Hans Reiser wrote:
>  
>
>>Well, if we are only 2.5 times as fast for writes as ext3 after your
>>patch is applied, I'll still feel good.;-)
>>
>>    
>>
>
>whupping ext3's butt on write performance isn't very hard, really ;)
>
>But it should be done based on "feature equivalency".  By default,
>ext3 uses ordered data writes.  Data is written to disk before
>the metadata to which that data refers is committed to journal.
>
>It would be questionable to compare a metadata-only journalling
>approach to ext3 with data=journal or data=ordered.
>
>
>
>  
>
The atomic transactions that reiser4 offers are a much higher level of 
data security than data journaling.  Really, you should read the 17 page 
papers I send you URLs to;-).....
(www.namesys.com/v4/fast_reiser4.html).

-- 
Hans




* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-01  1:33         ` Andrew Morton
  2002-11-01  1:44           ` Dieter Nützel
  2002-11-01  1:55           ` Hans Reiser
@ 2002-11-01  4:36           ` Linus Torvalds
  2002-11-01 10:59             ` Nikita Danilov
  2 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2002-11-01  4:36 UTC (permalink / raw)
  To: linux-kernel

In article <3DC1D9D0.684326AC@digeo.com>,
Andrew Morton  <akpm@digeo.com> wrote:
>
>But it should be done based on "feature equivalency".  By default,
>ext3 uses ordered data writes.  Data is written to disk before
>the metadata to which that data refers is committed to journal.

Andrew, that's not necessarily a _good_ feature. 

Journaling is _not_ a great idea.  There are other approaches to
handling atomicity than journaling, like phase trees, that give
equivalent atomicity guarantees without having to write out extra stuff,
or even impose a very strict ordering between data and meta-data.

I didn't read the reiser papers yet, but from Hans' description it
sounds like reiser4 gives all the guarantees ext3 does with ordered
writes, _and_ they get good performance. 

(In fact, from the description it sounds like it gives _more_ guarantees
than even ext3 with ordered writes, in that it gives transactional
behaviour for arbitrary writes. Maybe I should read the paper).

		Linus


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-01  1:55           ` Hans Reiser
@ 2002-11-01 10:23             ` Tomas Szepe
  2002-11-01 17:19               ` Alexander Zarochentcev
  0 siblings, 1 reply; 38+ messages in thread
From: Tomas Szepe @ 2002-11-01 10:23 UTC (permalink / raw)
  To: Hans Reiser; +Cc: lkml, Oleg Drokin, zam, umka

> The atomic transactions that reiser4 offers are a much higher level of 
> data security than data journaling.  Really, you should read the 17 page 
> papers I send you URLs to;-).....
> (www.namesys.com/v4/fast_reiser4.html).

Am I to assume the following is expected behavior then?

# mkfs.reiser4 /dev/sda2
mkfs.reiser4, 0.1.0
Information: Reiser4 is going to be created on /dev/sda2.
(Yes/No): y
Creating reiser4 on /dev/sda2 with default40 profile...done
Synchronizing /dev/sda2...done
# mount /dev/sda2 /ap
# df /ap
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda2              1490332       136   1490196   1% /ap
# (cd /ap && tar xzf /usr/src/linux-2.5.45.tgz)
# df /ap
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda2              1490332    200508   1289824  14% /ap
# sync
# df /ap
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda2              1490332    200468   1289864  14% /ap
# rm -rf /ap/linux-2.5.45
# df /ap
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda2              1490332    255436   1234896  18% /ap
# # wtf is going on here?
# sync
# df /ap
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda2              1490332     85848   1404484   6% /ap
# umount /ap
# mount /dev/sda2 /ap
# df /ap
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda2              1490332     54532   1435800   4% /ap
# # and here?

T.


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-01  4:36           ` Linus Torvalds
@ 2002-11-01 10:59             ` Nikita Danilov
  0 siblings, 0 replies; 38+ messages in thread
From: Nikita Danilov @ 2002-11-01 10:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds writes:
 > In article <3DC1D9D0.684326AC@digeo.com>,
 > Andrew Morton  <akpm@digeo.com> wrote:
 > >
 > >But it should be done based on "feature equivalency".  By default,
 > >ext3 uses ordered data writes.  Data is written to disk before
 > >the metadata to which that data refers is committed to journal.
 > 
 > Andrew, that's not necessarily a _good_ feature. 
 > 
 > Journaling is _not_ a great idea.  There are other approaches to
 > handling atomicity than journaling, like phase trees, that give
 > equivalent atomicity guarantees without having to write out extra stuff,
 > or even impose a very strict ordering between data and meta-data.
 > 
 > I didn't read the reiser papers yet, but from Hans' description it
 > sounds like reiser4 gives all the guarantees ext3 does with ordered
 > writes, _and_ they get good performance. 

Reiser4 uses "wandered logs" that are similar to phase trees or to the things
that are called "shadows" or "side files" in the database world.

The idea is that most blocks with file system data (and meta-data) are
accessed by first reading their block number from some other "parent"
block (like an indirect block in ext2). Now, if a block is modified during
a transaction *and* its parent block is also dirty, one can avoid writing
a copy of the block into the journal by:

 - allocating a new block number ("wandered block")

 - storing the modified content in the newly allocated wandered block

 - updating the parent block to point to the new location

The old block is now unreachable from the parent, and if its block number is
stored somewhere in the journal, one can use it for recovery.

The Reiser4 balanced tree lends itself nicely to this model, of course.

The usual problem with such techniques is that they tend to destroy packing
due to frequent relocations. But in reality this can be used exactly for
the purpose of improving packing, if allocation of wandered blocks is
delayed for a sufficiently long time (like until transaction commit).
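
The relocation steps listed above can be sketched in C roughly as follows.
This is only a toy model under stated assumptions: the "disk" is an in-memory
array, and toy_disk, parent_block, wander_record, alloc_block() and
wander_block() are hypothetical names made up for illustration, not reiser4
identifiers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16
#define NR_BLOCKS  8
#define NO_BLOCK   UINT32_MAX

/* Hypothetical in-memory stand-in for a block device. */
struct toy_disk {
	char data[NR_BLOCKS][BLOCK_SIZE];
	int  in_use[NR_BLOCKS];
};

/* Hypothetical parent block: it stores the child's block number. */
struct parent_block {
	uint32_t child;
	int      dirty;
};

/* What the journal would need to remember for recovery. */
struct wander_record {
	uint32_t old_blocknr;
	uint32_t new_blocknr;
};

/* Return the first free block number, or NO_BLOCK if the disk is full. */
static uint32_t alloc_block(struct toy_disk *d)
{
	for (uint32_t i = 0; i < NR_BLOCKS; i++) {
		if (!d->in_use[i]) {
			d->in_use[i] = 1;
			return i;
		}
	}
	return NO_BLOCK;
}

/*
 * Relocate a modified child instead of journalling a copy of it.
 * Returns 0 on success, -1 if the parent is clean (relocation would
 * force an extra parent write) or no free block is available.
 */
static int wander_block(struct toy_disk *d, struct parent_block *parent,
			const char content[BLOCK_SIZE],
			struct wander_record *rec)
{
	uint32_t new_nr;

	assert(d && parent && content && rec);
	if (!parent->dirty)
		return -1;

	new_nr = alloc_block(d);                      /* step 1: new block number */
	if (new_nr == NO_BLOCK)
		return -1;

	memcpy(d->data[new_nr], content, BLOCK_SIZE); /* step 2: store new content */

	rec->old_blocknr = parent->child;             /* remembered for recovery */
	rec->new_blocknr = new_nr;
	parent->child = new_nr;                       /* step 3: repoint the parent */
	return 0;
}
```

Note that the old block is left untouched and still marked in use here; in
this sketch it would only be released once the transaction has committed.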

 > 
 > (In fact, from the description it sounds like it gives _more_ guarantees
 > than even ext3 with ordered writes, in that it gives transactional
 > behaviour for arbitrary writes. Maybe I should read the paper).
 > 
 > 		Linus

Nikita.

 > -


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-01 10:23             ` Tomas Szepe
@ 2002-11-01 17:19               ` Alexander Zarochentcev
  2002-11-02 13:24                 ` Tomas Szepe
  2002-11-02 13:38                 ` Tomas Szepe
  0 siblings, 2 replies; 38+ messages in thread
From: Alexander Zarochentcev @ 2002-11-01 17:19 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: Hans Reiser, lkml, Oleg Drokin, umka

Tomas Szepe writes:
 > > The atomic transactions that reiser4 offers are a much higher level of 
 > > data security than data journaling.  Really, you should read the 17 page 
 > > papers I send you URLs to;-).....
 > > (www.namesys.com/v4/fast_reiser4.html).
 > 
 > Am I to assume the following is expected behavior then?
 > 
 > # mkfs.reiser4 /dev/sda2
 > mkfs.reiser4, 0.1.0
 > Information: Reiser4 is going to be created on /dev/sda2.
 > (Yes/No): y
 > Creating reiser4 on /dev/sda2 with default40 profile...done
 > Synchronizing /dev/sda2...done
 > # mount /dev/sda2 /ap
 > # df /ap
 > Filesystem           1k-blocks      Used Available Use% Mounted on
 > /dev/sda2              1490332       136   1490196   1% /ap
 > # (cd /ap && tar xzf /usr/src/linux-2.5.45.tgz)
 > # df /ap
 > Filesystem           1k-blocks      Used Available Use% Mounted on
 > /dev/sda2              1490332    200508   1289824  14% /ap
 > # sync
 > # df /ap
 > Filesystem           1k-blocks      Used Available Use% Mounted on
 > /dev/sda2              1490332    200468   1289864  14% /ap
 > # rm -rf /ap/linux-2.5.45
 > # df /ap
 > Filesystem           1k-blocks      Used Available Use% Mounted on
 > /dev/sda2              1490332    255436   1234896  18% /ap
 > # # wtf is going on here?
 > # sync
 > # df /ap
 > Filesystem           1k-blocks      Used Available Use% Mounted on
 > /dev/sda2              1490332     85848   1404484   6% /ap
 > # umount /ap
 > # mount /dev/sda2 /ap
 > # df /ap
 > Filesystem           1k-blocks      Used Available Use% Mounted on
 > /dev/sda2              1490332     54532   1435800   4% /ap
 > # # and here?

This should help:

diff -Nru a/txnmgr.c b/txnmgr.c
--- a/txnmgr.c	Wed Oct 30 18:58:09 2002
+++ b/txnmgr.c	Fri Nov  1 20:13:27 2002
@@ -1917,7 +1917,7 @@
 		return;
 	}
 
-	if (!jnode_is_unformatted) {
+	if (jnode_is_znode(node)) {
 		if ( /**jnode_get_block(node) &&*/
 			   !blocknr_is_fake(jnode_get_block(node))) {
 			/* jnode has assigned real disk block. Put it into

 > 
 > T.

Thank you for the report.
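
For readers wondering why the one-line change in the diff above matters, here
is a small standalone C sketch. It assumes that jnode_is_unformatted is a
function (the patch itself does not show its definition), and the struct jnode
below is a hypothetical stand-in, not the reiser4 type. Under that assumption,
the bare name in a condition is the function's address, which is never null,
so the negated test is always false and the branch is never taken.

```c
#include <assert.h>

/* Hypothetical stand-in for the real jnode; only one flag is modelled. */
struct jnode {
	int is_znode;
};

static int jnode_is_unformatted(const struct jnode *node)
{
	return !node->is_znode;
}

static int jnode_is_znode(const struct jnode *node)
{
	return node->is_znode;
}

/*
 * Old form: the function is named but never called. The name decays to
 * a non-null function pointer, so the condition is false for every node.
 * (Compilers may warn that the address always evaluates to true.)
 */
static int takes_branch_buggy(const struct jnode *node)
{
	(void)node;
	if (!jnode_is_unformatted)
		return 1;
	return 0;
}

/* Patched form: the predicate is actually called on the node. */
static int takes_branch_fixed(const struct jnode *node)
{
	if (jnode_is_znode(node))
		return 1;
	return 0;
}
```

With the old form the block that handles formatted nodes is skipped for every
node, which would be consistent with the space leak reported above.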

-- 
Alex.


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-01 17:19               ` Alexander Zarochentcev
@ 2002-11-02 13:24                 ` Tomas Szepe
  2002-11-04 11:00                   ` Nikita Danilov
  2002-11-02 13:38                 ` Tomas Szepe
  1 sibling, 1 reply; 38+ messages in thread
From: Tomas Szepe @ 2002-11-02 13:24 UTC (permalink / raw)
  To: Alexander Zarochentcev; +Cc: Hans Reiser, lkml, Oleg Drokin, umka

> This should help:
> 
> diff -Nru a/txnmgr.c b/txnmgr.c
> --- a/txnmgr.c	Wed Oct 30 18:58:09 2002
> +++ b/txnmgr.c	Fri Nov  1 20:13:27 2002
> @@ -1917,7 +1917,7 @@
>  		return;
>  	}
>  
> -	if (!jnode_is_unformatted) {
> +	if (jnode_is_znode(node)) {
>  		if ( /**jnode_get_block(node) &&*/
>  			   !blocknr_is_fake(jnode_get_block(node))) {
>  			/* jnode has assigned real disk block. Put it into


Jup, this fixes the leak, but free space still isn't reported accurately
until after sync gets called, which I believe is a bug too.

Compare:
[reiser3]
$ pwd
/tmp
$ dd if=/dev/zero of=testfile bs=16k count=64
64+0 records in
64+0 records out
$ df /
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda1               526296    330696    195600  63% /
$ rm testfile
$ df /
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda1               526296    329672    196624  63% /
$ sync
$ df /
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda1               526296    329672    196624  63% /

[reiser4]
$ pwd
/ap/tmp
$ dd if=/dev/zero of=testfile bs=16k count=64
64+0 records in
64+0 records out
$ df /ap
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda2              1490332      1152   1489180   1% /ap
$ rm testfile
$ df /ap
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda2              1490332      1160   1489172   1% /ap
$ sync
$ df /ap
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda2              1490332       128   1490204   1% /ap

T.


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-01 17:19               ` Alexander Zarochentcev
  2002-11-02 13:24                 ` Tomas Szepe
@ 2002-11-02 13:38                 ` Tomas Szepe
  2002-11-04 12:02                   ` Nikita Danilov
  1 sibling, 1 reply; 38+ messages in thread
From: Tomas Szepe @ 2002-11-02 13:38 UTC (permalink / raw)
  To: Alexander Zarochentcev; +Cc: Hans Reiser, lkml, Oleg Drokin, umka

Hi,

Another one: trying to build 2.5.45 off a reiser4 mountpoint, I get:

reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 128
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 256
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 512
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 1024
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 2048
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 4096
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 8192
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 16384
reiser4[fixdep(952)]: extent2tail (fs/reiser4/plugin/file/tail_conversion.c:476)[nikita-2282]:
WARNING: Partial conversion of 105116: 1 of 2
reiser4[cc1(957)]: extent2tail (fs/reiser4/plugin/file/tail_conversion.c:476)[nikita-2282]:
WARNING: Partial conversion of 105116: 0 of 2
[snip]

... after which r4 crashes completely:
it starts to hog all CPU time, and umount() never goes through.

T.


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-02 13:24                 ` Tomas Szepe
@ 2002-11-04 11:00                   ` Nikita Danilov
  2002-11-04 19:56                     ` Andreas Dilger
  2002-11-05  7:30                     ` reiser
  0 siblings, 2 replies; 38+ messages in thread
From: Nikita Danilov @ 2002-11-04 11:00 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: Alexander Zarochentcev, Hans Reiser, lkml, Oleg Drokin, umka

Tomas Szepe writes:
 > > This should help:
 > > 
 > > diff -Nru a/txnmgr.c b/txnmgr.c
 > > --- a/txnmgr.c	Wed Oct 30 18:58:09 2002
 > > +++ b/txnmgr.c	Fri Nov  1 20:13:27 2002
 > > @@ -1917,7 +1917,7 @@
 > >  		return;
 > >  	}
 > >  
 > > -	if (!jnode_is_unformatted) {
 > > +	if (jnode_is_znode(node)) {
 > >  		if ( /**jnode_get_block(node) &&*/
 > >  			   !blocknr_is_fake(jnode_get_block(node))) {
 > >  			/* jnode has assigned real disk block. Put it into
 > 
 > 
 > Jup, this fixes the leak, but free space still isn't reported accurately
 > until after sync gets called, which I believe is a bug too.

In reiser4, allocation of disk space is delayed until transaction commit. It
is not possible to estimate precisely the amount of disk space that will be
allocated during commit, and hence statfs(2) results are not updated
until one does sync(2) (forcing a commit) or the transaction is committed due
to age (10 minutes by default).
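
A program that needs an up-to-date figure on such a filesystem can therefore
force a commit before asking. A minimal sketch using the standard sync(2) and
statvfs(3) calls follows; the helper names used_kib() and
used_kib_after_sync() are made up for this example.

```c
#include <assert.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Used space in 1 KiB units, the quantity df shows in its "Used" column. */
static unsigned long long used_kib(const struct statvfs *s)
{
	unsigned long long used_blocks = s->f_blocks - s->f_bfree;

	return used_blocks * s->f_frsize / 1024;
}

/*
 * Flush pending transactions first, then query the filesystem.
 * Returns 0 and stores the result in *out, or -1 if statvfs() fails.
 */
static int used_kib_after_sync(const char *mountpoint,
			       unsigned long long *out)
{
	struct statvfs s;

	sync();		/* forces the delayed allocation to happen */
	if (statvfs(mountpoint, &s) != 0)
		return -1;
	*out = used_kib(&s);
	return 0;
}
```

Calling used_kib_after_sync("/ap", &kib) corresponds to running "sync" and
then "df /ap" in the transcripts in this thread.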

 > 
 > Compare:
 > [reiser3]
 > $ pwd
 > /tmp
 > $ dd if=/dev/zero of=testfile bs=16k count=64
 > 64+0 records in
 > 64+0 records out
 > $ df /
 > Filesystem           1k-blocks      Used Available Use% Mounted on
 > /dev/sda1               526296    330696    195600  63% /
 > $ rm testfile
 > $ df /
 > Filesystem           1k-blocks      Used Available Use% Mounted on
 > /dev/sda1               526296    329672    196624  63% /
 > $ sync
 > $ df /

[...]

Nikita.


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-02 13:38                 ` Tomas Szepe
@ 2002-11-04 12:02                   ` Nikita Danilov
  2002-11-04 17:10                     ` Tomas Szepe
  0 siblings, 1 reply; 38+ messages in thread
From: Nikita Danilov @ 2002-11-04 12:02 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: Alexander Zarochentcev, Hans Reiser, lkml, Oleg Drokin, umka

Tomas Szepe writes:
 > Hi,
 > 
 > Another one: trying to build 2.5.45 off a reiser4 mountpoint, I get:
 > 
 > reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
 > WARNING: Flush raced against extent->tail
 > reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
 > WARNING: flush failed: -11
 > jnode_flush failed with err = -11

Can you please try the following patch to fs/reiser4/flush.c:
----------------------------------------------------------------------
--- /tmp/flush.c	Mon Nov  4 14:32:21 2002
+++ flush.c	Mon Nov  4 14:32:32 2002
@@ -3149,7 +3149,8 @@ flush_scan_extent(flush_scan * scan, int
 				   only. Will be removed. */
 				warning("nikita-2732", 
 					"Flush raced against extent->tail");
-				ret = -EAGAIN;
+				scan->stop = 1;
+				ret = 0;
 				goto exit;
 			}
 			assert("jmacd-1230", item_is_extent(&scan->parent_coord));
----------------------------------------------------------------------

 > reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
 > WARNING: Flush raced against extent->tail

[...]

 > WARNING: Too many iterations: 8192
 > reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
 > WARNING: Too many iterations: 16384
 > reiser4[fixdep(952)]: extent2tail (fs/reiser4/plugin/file/tail_conversion.c:476)[nikita-2282]:
 > WARNING: Partial conversion of 105116: 1 of 2
 > reiser4[cc1(957)]: extent2tail (fs/reiser4/plugin/file/tail_conversion.c:476)[nikita-2282]:
 > WARNING: Partial conversion of 105116: 0 of 2
 > [snip]
 > 
 > ... after which r4 crashes completely --
 > Starts to hog all cpu time and umount() never goes through.

Try to wait a bit more and check whether any more "WARNING: Too many
iterations" appear, OK?

 > 
 > T.

Nikita.



* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-04 12:02                   ` Nikita Danilov
@ 2002-11-04 17:10                     ` Tomas Szepe
  2002-11-04 17:53                       ` Nikita Danilov
  0 siblings, 1 reply; 38+ messages in thread
From: Tomas Szepe @ 2002-11-04 17:10 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Alexander Zarochentcev, Hans Reiser, lkml, Oleg Drokin, umka

>  > Hi,
>  > 
>  > Another one: trying to build 2.5.45 off a reiser4 mountpoint, I get:
>  > 
>  > reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
>  > WARNING: Flush raced against extent->tail
>  > reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
>  > WARNING: flush failed: -11
>  > jnode_flush failed with err = -11
> 
> Can you please try the following patch to the fs/reiser4/flush.c:
> ----------------------------------------------------------------------
> --- /tmp/flush.c	Mon Nov  4 14:32:21 2002
> +++ flush.c	Mon Nov  4 14:32:32 2002
> @@ -3149,7 +3149,8 @@ flush_scan_extent(flush_scan * scan, int
>  				   only. Will be removed. */
>  				warning("nikita-2732", 
>  					"Flush raced against extent->tail");
> -				ret = -EAGAIN;
> +				scan->stop = 1;
> +				ret = 0;
>  				goto exit;
>  			}
>  			assert("jmacd-1230", item_is_extent(&scan->parent_coord));

Seems to fix the flush errors, however, I can still see the race warnings.
Worse though, at one point I stumbled upon the following:

$ df /ap
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda2              1490332 -73786976294838198272   1498808 101% /ap

This was right after I hit the reset button while compiling the kernel
off a reiser4 mountpoint, went on to finish the build after reboot and
then "rm -rf"'d the whole source tree (i.e. there was nothing on the
filesystem again).

reiser4.o is 20021031 plus the rmdir leak fix from this thread plus
your patch above.
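
One way to arrive at exactly the "Used" figure shown by df above, offered as a guess rather than a diagnosis: if statfs(2) reported 2048 more free 4 KiB blocks than total blocks, an unsigned 64-bit "used = total - free" would wrap around to 2^64 - 2048, and scaling that by 4 to get 1k-blocks gives 73786976294838198272. The 4 KiB block size, the surplus of 2048 blocks and the helper name used_blocks are assumptions; only the printed value comes from the report.

```c
#include <stdint.h>

/* Used blocks the way a naive df would compute them. With unsigned
 * arithmetic the result wraps modulo 2^64 whenever free_blocks is
 * larger than total_blocks. */
static uint64_t used_blocks(uint64_t total_blocks, uint64_t free_blocks)
{
	return total_blocks - free_blocks;
}
```

With 372583 total blocks (1490332 1k-blocks divided by 4), used_blocks(372583, 372583 + 2048) wraps to 2^64 - 2048, and four times that number is the magnitude df printed.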

>  > ... after which r4 crashes completely --
>  > Starts to hog all cpu time and umount() never goes through.
> 
> Try to wait a bit more and check whether any more "WARNING: Too many
> iterations" appear, OK?

Jup, now all I get is the race warnings.

-- 
tomas szepe <szepe@pinerecords.com>


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-04 17:10                     ` Tomas Szepe
@ 2002-11-04 17:53                       ` Nikita Danilov
  2002-11-04 18:10                         ` Tomas Szepe
  0 siblings, 1 reply; 38+ messages in thread
From: Nikita Danilov @ 2002-11-04 17:53 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: Alexander Zarochentcev, Hans Reiser, lkml, Oleg Drokin, umka

Tomas Szepe writes:
 > >  > Hi,
 > >  > 
 > >  > Another one: trying to build 2.5.45 off a reiser4 mountpoint, I get:
 > >  > 
 > >  > reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
 > >  > WARNING: Flush raced against extent->tail
 > >  > reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
 > >  > WARNING: flush failed: -11
 > >  > jnode_flush failed with err = -11
 > > 
 > > Can you please try the following patch to the fs/reiser4/flush.c:
 > > ----------------------------------------------------------------------
 > > --- /tmp/flush.c	Mon Nov  4 14:32:21 2002
 > > +++ flush.c	Mon Nov  4 14:32:32 2002
 > > @@ -3149,7 +3149,8 @@ flush_scan_extent(flush_scan * scan, int
 > >  				   only. Will be removed. */
 > >  				warning("nikita-2732", 
 > >  					"Flush raced against extent->tail");
 > > -				ret = -EAGAIN;
 > > +				scan->stop = 1;
 > > +				ret = 0;
 > >  				goto exit;
 > >  			}
 > >  			assert("jmacd-1230", item_is_extent(&scan->parent_coord));
 > 
 > Seems to fix the flush errors, however, I can still see the race warnings.

Good. The warning was left there for debugging. I shall remove it.

 > Worse though, at one point I stumbled upon the following:
 > 
 > $ df /ap
 > Filesystem           1k-blocks      Used Available Use% Mounted on
 > /dev/sda2              1490332 -73786976294838198272   1498808 101% /ap
 > 
 > This was right after I hit the reset button while compiling the kernel
 > off a reiser4 mountpoint, went on to finish the build after reboot and
 > then "rm -rf"'d the whole source tree (i.e. there was nothing on the
 > filesystem again).
 > 
 > reiser4.o is 20021031 plus the rmdir leak fix from this thread plus
 > your patch above.

Do you have debugging on?

 > 
 > >  > ... after which r4 crashes completely --
 > >  > Starts to hog all cpu time and umount() never goes through.
 > > 
 > > Try to wait a bit more and check whether any more "WARNING: Too many
 > > iterations" appear, OK?
 > 
 > Jup, now all I get is the race warnings.
 > 
 > -- 
 > tomas szepe <szepe@pinerecords.com>

Nikita.


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-04 17:53                       ` Nikita Danilov
@ 2002-11-04 18:10                         ` Tomas Szepe
  0 siblings, 0 replies; 38+ messages in thread
From: Tomas Szepe @ 2002-11-04 18:10 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Alexander Zarochentcev, Hans Reiser, lkml, Oleg Drokin, umka

>  > Worse though, at one point I stumbled upon the following:
>  > 
>  > $ df /ap
>  > Filesystem           1k-blocks      Used Available Use% Mounted on
>  > /dev/sda2              1490332 -73786976294838198272   1498808 101% /ap
>  > 
>  > This was right after I hit the reset button while compiling the kernel
>  > off a reiser4 mountpoint, went on to finish the build after reboot and
>  > then "rm -rf"'d the whole source tree (i.e. there was nothing on the
>  > filesystem again).
>  > 
>  > reiser4.o is 20021031 plus the rmdir leak fix from this thread plus
>  > your patch above.
> 
> Do you have debugging on?

Nop.

-- 
tomas szepe <szepe@pinerecords.com>


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-04 11:00                   ` Nikita Danilov
@ 2002-11-04 19:56                     ` Andreas Dilger
  2002-11-05  7:30                     ` reiser
  1 sibling, 0 replies; 38+ messages in thread
From: Andreas Dilger @ 2002-11-04 19:56 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Tomas Szepe, Alexander Zarochentcev, Hans Reiser, lkml,
	Oleg Drokin, umka

On Nov 04, 2002  14:00 +0300, Nikita Danilov wrote:
>  > Jup, this fixes the leak, but free space still isn't reported accurately
>  > until after sync gets called, which I believe is a bug too.
> 
> In reiser4 allocation of disk space is delayed to transaction commit. It
> is not possible to estimate precisely amount of disk space that will be
> allocated during commit, and hence statfs(2) results are not updated
> until one does sync(2) (forcing commit) or transaction is committed due
> to age (10 minutes by default).

I find this more than a bit frightening, and it could obviously be a
huge source of reiser4's dramatic performance improvements - nothing is
being written to disk until long after a benchmark is complete (provided
you have enough RAM) if it isn't explicitly syncing before completing
the test (benchmarks like dbench and iozone don't necessarily sync).

Even more importantly, people losing 10 minutes of work is pretty
unacceptable, IMHO.  The default flush interval is 30 seconds for a
reason, and in realistic scenarios files don't grow over a 10 minute
period, and even if they do you would want to start flushing that to
disk long before you have a few GB of outstanding changes.  Also, this
would be a real source of problems (as I previously read was hinted at
in another reiser4 email) with filesystem full conditions.

At the very least, you need to reserve blocks in the filesystem for writes
that are under delayed allocation.  Overestimating space requirements
(i.e. reserving a full block for each file, regardless of whether it will
be packed in the future or not) is far preferable to underestimating and
having a write which already "completed" suddenly find itself out of
space.  If you get close to filling the filesystem, then you can always
flush the transaction to disk to "solidify your estimates" before
returning a needless ENOSPC.  This will also make your "statfs" space
reporting fairly consistent, because you will return the "reserved"
stats even if they are only slightly off.
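
The reservation scheme suggested above can be sketched as follows. This is a toy model under stated assumptions, not reiser4's allocator: struct space and the functions commit, reserve and reported_free are hypothetical, and the "actual" argument stands in for the packing result that a real filesystem would only learn at commit time.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical space accounting, in blocks. */
struct space {
	uint64_t free;      /* blocks not allocated on disk */
	uint64_t reserved;  /* worst-case blocks promised to uncommitted writes */
	uint64_t packed;    /* blocks those writes will really need at commit */
};

/* Commit: reservations turn into real allocations of the actual size. */
static void commit(struct space *s)
{
	assert(s->packed <= s->reserved && s->reserved <= s->free);
	s->free -= s->packed;
	s->reserved = 0;
	s->packed = 0;
}

/* Reserve worst_case blocks for a write that will really use actual
 * blocks. Returns 0 on success, or -ENOSPC only if space is still
 * short after forcing a commit. */
static int reserve(struct space *s, uint64_t worst_case, uint64_t actual)
{
	assert(actual <= worst_case);
	if (s->free - s->reserved < worst_case)
		commit(s);      /* "solidify the estimates" first */
	if (s->free - s->reserved < worst_case)
		return -ENOSPC;
	s->reserved += worst_case;
	s->packed += actual;
	return 0;
}

/* What statfs would report as free: reserved space counts as used. */
static uint64_t reported_free(const struct space *s)
{
	return s->free - s->reserved;
}
```

The design choice is that overestimates cost only an early commit, while ENOSPC is returned before a write is accepted rather than after it has "completed".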

Cheers, Andreas
--
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-04 11:00                   ` Nikita Danilov
  2002-11-04 19:56                     ` Andreas Dilger
@ 2002-11-05  7:30                     ` reiser
  2002-11-05  8:28                       ` Alexander Zarochentcev
                                         ` (2 more replies)
  1 sibling, 3 replies; 38+ messages in thread
From: reiser @ 2002-11-05  7:30 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Tomas Szepe, Alexander Zarochentcev, lkml, Oleg Drokin, umka

Nikita Danilov wrote:

>Tomas Szepe writes:
> > > This should help:
> > > 
> > > diff -Nru a/txnmgr.c b/txnmgr.c
> > > --- a/txnmgr.c	Wed Oct 30 18:58:09 2002
> > > +++ b/txnmgr.c	Fri Nov  1 20:13:27 2002
> > > @@ -1917,7 +1917,7 @@
> > >  		return;
> > >  	}
> > >  
> > > -	if (!jnode_is_unformatted) {
> > > +	if (jnode_is_znode(node)) {
> > >  		if ( /**jnode_get_block(node) &&*/
> > >  			   !blocknr_is_fake(jnode_get_block(node))) {
> > >  			/* jnode has assigned real disk block. Put it into
> > 
> > 
> > Jup, this fixes the leak, but free space still isn't reported accurately
> > until after sync gets called, which I believe is a bug too.
>
>In reiser4 allocation of disk space is delayed to transaction commit. It
>is not possible to estimate precisely amount of disk space that will be
>allocated during commit, and hence statfs(2) results are not updated
>until one does sync(2) (forcing commit) or transaction is committed due
>to age (10 minutes by default).
>
>  
>
The above is badly phrased, and the behavior complained of is indeed a 
bug, not a feature.  Please fix.  

statfs should be updated immediately in accordance with estimates used 
by the space reservation code, and then adjusted at commit time in 
accordance with actual usage.
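
A minimal sketch of that two-step update, assuming hypothetical counters rather than reiser4's real ones (struct fs_counters, charge_estimate and settle_at_commit are illustrative names):

```c
#include <stdint.h>

/* Hypothetical per-filesystem counters, in blocks. */
struct fs_counters {
	uint64_t free_blocks;  /* value statfs reports as free */
	uint64_t estimated;    /* blocks charged by estimate, not yet committed */
};

/* On write: charge the reservation estimate to statfs at once.
 * Assumes the caller has already checked estimate <= free_blocks. */
static void charge_estimate(struct fs_counters *c, uint64_t estimate)
{
	c->free_blocks -= estimate;
	c->estimated += estimate;
}

/* On commit: replace the estimate by the number of blocks actually used.
 * Assumes actual does not exceed the space that is really free. */
static void settle_at_commit(struct fs_counters *c, uint64_t actual)
{
	c->free_blocks += c->estimated;  /* undo the estimate */
	c->free_blocks -= actual;        /* apply real usage */
	c->estimated = 0;
}
```

Between a write and the next commit, statfs is then off by at most the estimation error instead of ignoring the pending data altogether.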

Andreas, the performance advantage is achieved using much more than the 
amount of RAM available on the computer, and is therefore mostly 
independent of max transaction age.  The appropriate setting of 
transaction max age depends on the user.  The setting we chose is 
appropriate for software developers doing compiles.  It is not clear to 
me yet what the right setting is.  Perhaps 3 minutes is more 
appropriate.  I was probably overly influenced by Drew Roselli's 
statistics on how long the cycle is between rewrites.  Her statistics are 
probably skewed by having lots of CS students using the machines she got 
her data from.  5 seconds is too short to perform good layout 
optimization for subsequent reads.

Hans



* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-05  7:30                     ` reiser
@ 2002-11-05  8:28                       ` Alexander Zarochentcev
  2002-11-05  8:44                         ` reiser
  2002-11-05  9:29                       ` Andreas Dilger
  2002-11-05  9:59                       ` Tomas Szepe
  2 siblings, 1 reply; 38+ messages in thread
From: Alexander Zarochentcev @ 2002-11-05  8:28 UTC (permalink / raw)
  To: reiser; +Cc: Nikita Danilov, Tomas Szepe, lkml, Oleg Drokin, umka

reiser writes:
 > Nikita Danilov wrote:
 > 
 > >Tomas Szepe writes:
 > > > > This should help:
 > > > > 
 > > > > diff -Nru a/txnmgr.c b/txnmgr.c
 > > > > --- a/txnmgr.c	Wed Oct 30 18:58:09 2002
 > > > > +++ b/txnmgr.c	Fri Nov  1 20:13:27 2002
 > > > > @@ -1917,7 +1917,7 @@
 > > > >  		return;
 > > > >  	}
 > > > >  
 > > > > -	if (!jnode_is_unformatted) {
 > > > > +	if (jnode_is_znode(node)) {
 > > > >  		if ( /**jnode_get_block(node) &&*/
 > > > >  			   !blocknr_is_fake(jnode_get_block(node))) {
 > > > >  			/* jnode has assigned real disk block. Put it into
 > > > 
 > > > 
 > > > Jup, this fixes the leak, but free space still isn't reported accurately
 > > > until after sync gets called, which I believe is a bug too.
 > >
 > >In reiser4 allocation of disk space is delayed to transaction commit. It
 > >is not possible to estimate precisely amount of disk space that will be
 > >allocated during commit, and hence statfs(2) results are not updated
 > >until one does sync(2) (forcing commit) or transaction is committed due
 > >to age (10 minutes by default).
 > >
 > >  
 > >
 > The above is badly phrased, and the behavior complained of is indeed a 
 > bug not a feature.  Please fix.  
 > 
 > statfs should be updated immediately in accordance with estimates used 
 > by the space reservation code, and then adjusted at commit time in 
 > accordance with actual usage.

We should not do that unless we implement forcing of commits when we run
out of free space.

 > 
 > Andreas, the performance advantage is achieved using much more than the 
 > amount of RAM available on the computer, and is therefore mostly 
 > independent of max transaction age.  The appropriate setting of 
 > transaction max age depends on the user.  The setting we chose is 
 > appropriate for software developers doing compiles.  It is not clear to 
 > me yet what the right setting is.  Perhaps 3 minutes is more 
 > appropriate.  I was probably overly influenced by Drew Roselli's 
 > statistics on how long the cyle is between rewrites.  Her statistics are 
 > probably skewed by having lots of CS students using the machines she got 
 > her data from.  5 seconds is too short to perform good layout 
 > optimization for subsequent reads.
 > 
 > Hans
 > 

-- 
Alex.


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-05  8:28                       ` Alexander Zarochentcev
@ 2002-11-05  8:44                         ` reiser
  2002-11-05  8:49                           ` Alexander Zarochentcev
  0 siblings, 1 reply; 38+ messages in thread
From: reiser @ 2002-11-05  8:44 UTC (permalink / raw)
  To: Alexander Zarochentcev
  Cc: Nikita Danilov, Tomas Szepe, lkml, Oleg Drokin, umka

Alexander Zarochentcev wrote:

> > >
> > >In reiser4 allocation of disk space is delayed to transaction commit. It
> > >is not possible to estimate precisely amount of disk space that will be
> > >allocated during commit, and hence statfs(2) results are not updated
> > >until one does sync(2) (forcing commit) or transaction is committed due
> > >to age (10 minutes by default).
> > >
> > >  
> > >
> > The above is badly phrased, and the behavior complained of is indeed a 
> > bug not a feature.  Please fix.  
> > 
> > statfs should be updated immediately in accordance with estimates used 
> > by the space reservation code, and then adjusted at commit time in 
> > accordance with actual usage.
>
>We should not do that unless we implement forcing of commits at out of free
>space situation.
>
I thought we had agreed quite some time ago to force commits when we run 
out of free space?  In any event, we should force commits when we run out 
of free space.  Yes?



* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-05  8:44                         ` reiser
@ 2002-11-05  8:49                           ` Alexander Zarochentcev
  2002-11-05 21:08                             ` reiser
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Zarochentcev @ 2002-11-05  8:49 UTC (permalink / raw)
  To: reiser; +Cc: Nikita Danilov, Tomas Szepe, lkml, Oleg Drokin, umka

reiser writes:
 > Alexander Zarochentcev wrote:
 > 
 > > > >
 > > > >In reiser4 allocation of disk space is delayed to transaction commit. It
 > > > >is not possible to estimate precisely amount of disk space that will be
 > > > >allocated during commit, and hence statfs(2) results are not updated
 > > > >until one does sync(2) (forcing commit) or transaction is committed due
 > > > >to age (10 minutes by default).
 > > > >
 > > > >  
 > > > >
 > > > The above is badly phrased, and the behavior complained of is indeed a 
 > > > bug not a feature.  Please fix.  
 > > > 
 > > > statfs should be updated immediately in accordance with estimates used 
 > > > by the space reservation code, and then adjusted at commit time in 
 > > > accordance with actual usage.
 > >
 > >We should not do that unless we implement forcing of commits at out of free
 > >space situation.
 > >
 > I thought we had agreed to do forcing of commits at out of free space 
 > quite some time ago?  In any event, we should do forcing of commits at 
 > out of free space.  Yes?

We will control this with a block allocator flag, which we set when we can
close the current transaction. I think it will be set in most cases.
 

-- 
Alex.


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-05  7:30                     ` reiser
  2002-11-05  8:28                       ` Alexander Zarochentcev
@ 2002-11-05  9:29                       ` Andreas Dilger
  2002-11-05 21:39                         ` reiser
  2002-11-05  9:59                       ` Tomas Szepe
  2 siblings, 1 reply; 38+ messages in thread
From: Andreas Dilger @ 2002-11-05  9:29 UTC (permalink / raw)
  To: reiser
  Cc: Nikita Danilov, Tomas Szepe, Alexander Zarochentcev, lkml,
	Oleg Drokin, umka

On Nov 04, 2002  23:30 -0800, reiser wrote:
> The appropriate setting of 
> transaction max age depends on the user.  The setting we chose is 
> appropriate for software developers doing compiles.  It is not clear to 
> me yet what the right setting is.  Perhaps 3 minutes is more 
> appropriate.  I was probably overly influenced by Drew Roselli's 
> statistics on how long the cyle is between rewrites.  Her statistics are 
> probably skewed by having lots of CS students using the machines she got 
> her data from.  5 seconds is too short to perform good layout 
> optimization for subsequent reads.

I think the bdflush defaults are (were?) something like 5 seconds for
metadata, and 30 seconds for file data. reiser4 should (if it doesn't
already) use the parameters set by sys_bdflush() to tune the writeout
intervals.

I would think that either:
a) A file was completely written in under 30 seconds (e.g. untar or gcc
   or whatever else you are doing), so deferring allocation and writing
   to disk does not help you at all.
b) A file is continuing to be written for more than 30 seconds that
   has a very large amount of outstanding data which can be committed
   to disk with (probably) the same read optimization quality as any
   larger amount of data.
c) A file is continuing to be written for more than 30 seconds that
   is growing slowly and no matter how long you defer the write you
   will only get an incremental read layout.  Presumably you could do
   something to pre-allocate/reserve a bunch of space at the end of this
   file as it continues to grow.

So, except for the very unusual case of files with lifespans between 30
seconds and 600 seconds (the 10 minute default), or files that are written
to between those intervals, I would guess that you are not gaining much
extra benefit by deferring the writes another 570 seconds.

Cheers, Andreas
--
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-05  7:30                     ` reiser
  2002-11-05  8:28                       ` Alexander Zarochentcev
  2002-11-05  9:29                       ` Andreas Dilger
@ 2002-11-05  9:59                       ` Tomas Szepe
  2002-11-05 10:08                         ` Alexander Zarochentcev
  2002-11-05 10:46                         ` Nikita Danilov
  2 siblings, 2 replies; 38+ messages in thread
From: Tomas Szepe @ 2002-11-05  9:59 UTC (permalink / raw)
  To: reiser; +Cc: Nikita Danilov, Alexander Zarochentcev, lkml, Oleg Drokin, umka

> >> > This should help:
> >> > 
> >> > diff -Nru a/txnmgr.c b/txnmgr.c
> >> > --- a/txnmgr.c	Wed Oct 30 18:58:09 2002
> >> > +++ b/txnmgr.c	Fri Nov  1 20:13:27 2002
> >> > @@ -1917,7 +1917,7 @@
> >> >  		return;
> >> >  	}
> >> >  
> >> > -	if (!jnode_is_unformatted) {
> >> > +	if (jnode_is_znode(node)) {
> >> >  		if ( /**jnode_get_block(node) &&*/
> >> >  			   !blocknr_is_fake(jnode_get_block(node))) {
> >> >  			/* jnode has assigned real disk block. Put it into
> >> 
> >> 
> >> Jup, this fixes the leak, but free space still isn't reported accurately
> >> until after sync gets called, which I believe is a bug too.
> >
> >In reiser4 allocation of disk space is delayed to transaction commit. It
> >is not possible to estimate precisely amount of disk space that will be
> >allocated during commit, and hence statfs(2) results are not updated
> >until one does sync(2) (forcing commit) or transaction is committed due
> >to age (10 minutes by default).
> >
> The above is badly phrased, and the behavior complained of is indeed
> a bug not a feature.  Please fix.

I just noticed the file
http://thebsh.namesys.com/snapshots/2002.10.31/reiser4.diff
had changed, the difference from the original 20021031 snapshot being:

--- fs_reiser4.diff.old 2002-10-31 14:11:50.000000000 +0100
+++ fs_reiser4.diff.new 2002-11-04 16:57:46.000000000 +0100
@@ -46903,7 +46903,7 @@
 +#if REISER4_USER_LEVEL_SIMULATION
 +#    define check_spin_is_locked(s)     spin_is_locked(s)
 +#    define check_spin_is_not_locked(s) spin_is_not_locked(s)
-+#elif defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
++#elif 0 && defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
 +#    define check_spin_is_not_locked(s) ( ( s ) -> owner != get_current() )
 +#    define spin_is_not_locked(s)       ( ( s ) -> owner == NULL )
 +#    define check_spin_is_locked(s)     ( ( s ) -> owner == get_current() )

So either someone is messing about with your webserver or you want multiple
versions of the supposedly same diff floating around (not exactly suitable
for gathering bug reports, is it?).  If you're short on disk space, how about
gzipping the fs diff?  It squeezes down to ~500k from almost 2MB.

-- 
Tomas Szepe <szepe@pinerecords.com>


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-05  9:59                       ` Tomas Szepe
@ 2002-11-05 10:08                         ` Alexander Zarochentcev
  2002-11-05 10:23                           ` Tomas Szepe
  2002-11-05 10:46                         ` Nikita Danilov
  1 sibling, 1 reply; 38+ messages in thread
From: Alexander Zarochentcev @ 2002-11-05 10:08 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: reiser, Nikita Danilov, lkml, Oleg Drokin, umka

Tomas Szepe writes:
 > > >> > This should help:
 > > >> > 
 > > >> > diff -Nru a/txnmgr.c b/txnmgr.c
 > > >> > --- a/txnmgr.c	Wed Oct 30 18:58:09 2002
 > > >> > +++ b/txnmgr.c	Fri Nov  1 20:13:27 2002
 > > >> > @@ -1917,7 +1917,7 @@
 > > >> >  		return;
 > > >> >  	}
 > > >> >  
 > > >> > -	if (!jnode_is_unformatted) {
 > > >> > +	if (jnode_is_znode(node)) {
 > > >> >  		if ( /**jnode_get_block(node) &&*/
 > > >> >  			   !blocknr_is_fake(jnode_get_block(node))) {
 > > >> >  			/* jnode has assigned real disk block. Put it into
 > > >> 
 > > >> 
 > > >> Jup, this fixes the leak, but free space still isn't reported accurately
 > > >> until after sync gets called, which I believe is a bug too.
 > > >
 > > >In reiser4 allocation of disk space is delayed to transaction commit. It
 > > >is not possible to estimate precisely amount of disk space that will be
 > > >allocated during commit, and hence statfs(2) results are not updated
 > > >until one does sync(2) (forcing commit) or transaction is committed due
 > > >to age (10 minutes by default).
 > > >
 > > The above is badly phrased, and the behavior complained of is indeed
 > > a bug not a feature.  Please fix.
 > 
 > I just noticed the file
 > http://thebsh.namesys.com/snapshots/2002.10.31/reiser4.diff
 > had changed, the difference from the original 20021031 snapshot being:
 > 
 > --- fs_reiser4.diff.old 2002-10-31 14:11:50.000000000 +0100
 > +++ fs_reiser4.diff.new 2002-11-04 16:57:46.000000000 +0100
 > @@ -46903,7 +46903,7 @@
 >  +#if REISER4_USER_LEVEL_SIMULATION
 >  +#    define check_spin_is_locked(s)     spin_is_locked(s)
 >  +#    define check_spin_is_not_locked(s) spin_is_not_locked(s)
 > -+#elif defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
 > ++#elif 0 && defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
 >  +#    define check_spin_is_not_locked(s) ( ( s ) -> owner != get_current() )
 >  +#    define spin_is_not_locked(s)       ( ( s ) -> owner == NULL )
 >  +#    define check_spin_is_locked(s)     ( ( s ) -> owner == get_current() )
 > 
 > So either someone is messing about with your webserver or you want multiple
 > versions of the supposedly same diff floating around (not exactly suitable
 > for gathering bugreports, is it?).  If you're short on disk space, how about
 > gzipping the fs diff?  Squeezes down to ~500k from almost 2MB.

Done for the 2002.10.31 snapshot.

 > 
 > -- 
 > Tomas Szepe <szepe@pinerecords.com>

-- 
Alex.


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-05 10:08                         ` Alexander Zarochentcev
@ 2002-11-05 10:23                           ` Tomas Szepe
  0 siblings, 0 replies; 38+ messages in thread
From: Tomas Szepe @ 2002-11-05 10:23 UTC (permalink / raw)
  To: Alexander Zarochentcev; +Cc: reiser, Nikita Danilov, lkml, Oleg Drokin, umka

>  > I just noticed the file
>  > http://thebsh.namesys.com/snapshots/2002.10.31/reiser4.diff
>  > had changed, the difference from the original 20021031 snapshot being:
>  > 
>  > --- fs_reiser4.diff.old 2002-10-31 14:11:50.000000000 +0100
>  > +++ fs_reiser4.diff.new 2002-11-04 16:57:46.000000000 +0100
>  > @@ -46903,7 +46903,7 @@
>  >  +#if REISER4_USER_LEVEL_SIMULATION
>  >  +#    define check_spin_is_locked(s)     spin_is_locked(s)
>  >  +#    define check_spin_is_not_locked(s) spin_is_not_locked(s)
>  > -+#elif defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
>  > ++#elif 0 && defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
>  >  +#    define check_spin_is_not_locked(s) ( ( s ) -> owner != get_current() )
>  >  +#    define spin_is_not_locked(s)       ( ( s ) -> owner == NULL )
>  >  +#    define check_spin_is_locked(s)     ( ( s ) -> owner == get_current() )
>  > 
>  > So either someone is messing about with your webserver or you want multiple
>  > versions of the supposedly same diff floating around (not exactly suitable
>  > for gathering bugreports, is it?).  If you're short on disk space, how about
>  > gzipping the fs diff?  Squeezes down to ~500k from almost 2MB.
> 
> done for 2002.10.31 snapshot.

Well, the point is -- could you create a new dir each time you update
the current snapshot?

Here's export-pagevec_deactivate_inactive.diff for 2.5.46:

diff -urN linux-2.5.46/mm/Makefile linux-2.5.46.1/mm/Makefile
--- linux-2.5.46/mm/Makefile	2002-11-05 11:07:21.000000000 +0100
+++ linux-2.5.46.1/mm/Makefile	2002-11-05 11:13:11.000000000 +0100
@@ -2,7 +2,7 @@
 # Makefile for the linux memory manager.
 #
 
-export-objs := shmem.o filemap.o mempool.o page_alloc.o page-writeback.o
+export-objs := shmem.o filemap.o mempool.o page_alloc.o page-writeback.o swap.o
 
 obj-y	 := memory.o mmap.o filemap.o fremap.o mprotect.o mlock.o mremap.o \
 	    vmalloc.o slab.o bootmem.o swap.o vmscan.o page_alloc.o \
diff -urN linux-2.5.46/mm/swap.c linux-2.5.46.1/mm/swap.c
--- linux-2.5.46/mm/swap.c	2002-11-05 11:07:21.000000000 +0100
+++ linux-2.5.46.1/mm/swap.c	2002-11-05 11:13:35.000000000 +0100
@@ -23,6 +23,7 @@
 #include <linux/buffer_head.h>
 #include <linux/prefetch.h>
 #include <linux/percpu.h>
+#include <linux/module.h>
 
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
@@ -227,6 +228,7 @@
 		spin_unlock_irq(&zone->lru_lock);
 	__pagevec_release(pvec);
 }
+EXPORT_SYMBOL(pagevec_deactivate_inactive);
 
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount


* Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply
  2002-11-05  9:59                       ` Tomas Szepe
  2002-11-05 10:08                         ` Alexander Zarochentcev
@ 2002-11-05 10:46                         ` Nikita Danilov
  1 sibling, 0 replies; 38+ messages in thread
From: Nikita Danilov @ 2002-11-05 10:46 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: reiser, Alexander Zarochentcev, lkml, Oleg Drokin, umka

Tomas Szepe writes:
 > > >> > This should help:
 > > >> > 
 > > >> > diff -Nru a/txnmgr.c b/txnmgr.c
 > > >> > --- a/txnmgr.c	Wed Oct 30 18:58:09 2002
 > > >> > +++ b/txnmgr.c	Fri Nov  1 20:13:27 2002
 > > >> > @@ -1917,7 +1917,7 @@
 > > >> >  		return;
 > > >> >  	}
 > > >> >  
 > > >> > -	if (!jnode_is_unformatted) {
 > > >> > +	if (jnode_is_znode(node)) {
 > > >> >  		if ( /**jnode_get_block(node) &&*/
 > > >> >  			   !blocknr_is_fake(jnode_get_block(node))) {
 > > >> >  			/* jnode has assigned real disk block. Put it into
 > > >> 
 > > >> 
 > > >> Jup, this fixes the leak, but free space still isn't reported accurately
 > > >> until after sync gets called, which I believe is a bug too.
 > > >
 > > >In reiser4 allocation of disk space is delayed to transaction commit. It
 > > >is not possible to estimate precisely amount of disk space that will be
 > > >allocated during commit, and hence statfs(2) results are not updated
 > > >until one does sync(2) (forcing commit) or transaction is committed due
 > > >to age (10 minutes by default).
 > > >
 > > The above is badly phrased, and the behavior complained of is indeed
 > > a bug not a feature.  Please fix.
 > 
 > I just noticed the file
 > http://thebsh.namesys.com/snapshots/2002.10.31/reiser4.diff
 > had changed, the difference from the original 20021031 snapshot being:
 > 
 > --- fs_reiser4.diff.old 2002-10-31 14:11:50.000000000 +0100
 > +++ fs_reiser4.diff.new 2002-11-04 16:57:46.000000000 +0100
 > @@ -46903,7 +46903,7 @@
 >  +#if REISER4_USER_LEVEL_SIMULATION
 >  +#    define check_spin_is_locked(s)     spin_is_locked(s)
 >  +#    define check_spin_is_not_locked(s) spin_is_not_locked(s)
 > -+#elif defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
 > ++#elif 0 && defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
 >  +#    define check_spin_is_not_locked(s) ( ( s ) -> owner != get_current() )
 >  +#    define spin_is_not_locked(s)       ( ( s ) -> owner == NULL )
 >  +#    define check_spin_is_locked(s)     ( ( s ) -> owner == get_current() )
 > 
 > So either someone is messing about with your webserver or you want multiple
 > versions of the supposedly same diff floating around (not exactly suitable

Looks like you managed to download an early, buggy version of the diff
that only existed on the server for a short time and was later
overwritten in place (yes, a silly thing to do).

 > for gathering bugreports, is it?).  If you're short on disk space, how about
 > gzipping the fs diff?  Squeezes down to ~500k from almost 2MB.

OK.

 > 

Nikita.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-05  8:49                           ` Alexander Zarochentcev
@ 2002-11-05 21:08                             ` reiser
  0 siblings, 0 replies; 38+ messages in thread
From: reiser @ 2002-11-05 21:08 UTC (permalink / raw)
  To: Alexander Zarochentcev
  Cc: Nikita Danilov, Tomas Szepe, lkml, Oleg Drokin, umka

Alexander Zarochentcev wrote:

>reiser writes:
> > Alexander Zarochentcev wrote:
> > 
> > > > >
> > > > >In reiser4 allocation of disk space is delayed to transaction commit. It
> > > > >is not possible to estimate precisely amount of disk space that will be
> > > > >allocated during commit, and hence statfs(2) results are not updated
> > > > >until one does sync(2) (forcing commit) or transaction is committed due
> > > > >to age (10 minutes by default).
> > > > >
> > > > >  
> > > > >
> > > > The above is badly phrased, and the behavior complained of is indeed a 
> > > > bug not a feature.  Please fix.  
> > > > 
> > > > statfs should be updated immediately in accordance with estimates used 
> > > > by the space reservation code, and then adjusted at commit time in 
> > > > accordance with actual usage.
> > >
> > >We should not do that unless we implement forcing of commits at out of free
> > >space situation.
> > >
> > I thought we had agreed to do forcing of commits at out of free space 
> > quite some time ago?  In any event, we should do forcing of commits at 
> > out of free space.  Yes?
>
>we will control this by a block allocator flag, we set it when we can close
>current transaction. I think for most cases it will be set.
> 
>
>  
>
ok
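
The accounting agreed on above (report free space from the reservation
estimates right away, correct it at commit time, and force a commit when
the estimate says the disk is full) could be sketched roughly as below.
This is not reiser4 code: struct space_acct and every function name here
are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters; none of these names come from the reiser4 source. */
struct space_acct {
	uint64_t free_committed; /* free blocks as of the last commit */
	uint64_t reserved;       /* estimated blocks held by open transactions */
};

/* What a statfs-like call would report: committed free space minus the
 * current reservation estimate, clamped at zero. */
uint64_t reported_free(const struct space_acct *a)
{
	if (a->reserved > a->free_committed)
		return 0;
	return a->free_committed - a->reserved;
}

/* Reserve an estimated number of blocks for a pending operation.
 * Returns 0 on success, or -1 when the estimate does not fit, in which
 * case the caller would force a commit and retry. */
int reserve_blocks(struct space_acct *a, uint64_t estimate)
{
	if (estimate > reported_free(a))
		return -1;
	a->reserved += estimate;
	return 0;
}

/* At commit time, drop the estimate and charge the actual usage. */
void commit_blocks(struct space_acct *a, uint64_t estimate, uint64_t actual)
{
	assert(estimate <= a->reserved);
	assert(actual <= a->free_committed);
	a->reserved -= estimate;
	a->free_committed -= actual;
}
```

With this scheme the reported figure is pessimistic between commits and
only becomes exact once commit_blocks() has run, which is the trade-off
discussed above.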


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-05  9:29                       ` Andreas Dilger
@ 2002-11-05 21:39                         ` reiser
  0 siblings, 0 replies; 38+ messages in thread
From: reiser @ 2002-11-05 21:39 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Nikita Danilov, Tomas Szepe, Alexander Zarochentcev, lkml,
	Oleg Drokin, umka

Drew Roselli did traces of overwrite patterns, and the typical time to 
overwrite was about 6 minutes, so if you want the write cache to be 
effective you want it to last for more than 6 minutes.  I encourage you 
to read the PhD thesis she wrote and argue with it and me on it, I am 
far from dogmatically certain that 10 minutes is the right amount of 
time.  60 seconds is the most I would want for my Dell laptop (laptops 
are crash prone).  10 minutes for a non-mobile computer with a UPS, or 
in an area with a competent electric utility company, is quite 
reasonable though.  10 minutes is clearly the right amount of time for, 
say, a user space programmer, and probably too risky for a kernel 
programmer.  Probably kernel programmers are outnumbered 10 to 1 by user 
space programmers?  ( I don't really know.)

There simply is not enough empirical data for what we argue about, 
unfortunately.  Drew Roselli's thesis is the only one, and there is a 
need for 5 such theses before one can consider the topic reasonably 
understandable by the discerning.  I worry a lot that her samples are 
distorted by site specific usage patterns that might not resemble those 
of the usual linux user.

I wish I personally had a better understanding of what the usual linux 
user does in the way of IO.....

Hans

Andreas Dilger wrote:

>On Nov 04, 2002  23:30 -0800, reiser wrote:
>  
>
>>The appropriate setting of 
>>transaction max age depends on the user.  The setting we chose is 
>>appropriate for software developers doing compiles.  It is not clear to 
>>me yet what the right setting is.  Perhaps 3 minutes is more 
>>appropriate.  I was probably overly influenced by Drew Roselli's 
>>statistics on how long the cycle is between rewrites.  Her statistics are 
>>probably skewed by having lots of CS students using the machines she got 
>>her data from.  5 seconds is too short to perform good layout 
>>optimization for subsequent reads.
>>    
>>
>
>I think the bdflush defaults are (were?) something like 5 seconds for
>metadata, and 30 seconds for file data. reiser4 should (if it doesn't
>already) use the parameters set by sys_bdflush() to tune the writeout
>intervals.
>
>I would think that either:
>a) A file was completely written in under 30 seconds (e.g. untar or gcc
>   or whatever else you are doing), so deferring allocation and writing
>   to disk does not help you at all.
>b) A file is continuing to be written for more than 30 seconds that
>   has a very large amount of outstanding data which can be committed
>   to disk with (probably) the same read optimization quality as any
>   larger amount of data.
>c) A file is continuing to be written for more than 30 seconds that
>   is growing slowly and no matter how long you defer the write you
>   will only get an incremental read layout.  Presumably you could do
>   something to pre-allocate/reserve a bunch of space at the end of this
>   file as it continues to grow.
>
>So, except for the very unusual case of files with lifespans between 30
>seconds and 300 seconds, or files that are written to between those
>intervals, I would guess that you are not gaining much extra benefit by
>deferring the writes another 270 seconds.
>

>
>Cheers, Andreas
>--
>Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
>                 \  would they cancel out, leaving him still hungry?"
>http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert
>
>
>  
>



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-06 14:25     ` Daniel Egger
@ 2002-11-07 17:19       ` Pavel Machek
  0 siblings, 0 replies; 38+ messages in thread
From: Pavel Machek @ 2002-11-07 17:19 UTC (permalink / raw)
  To: Daniel Egger; +Cc: reiser, lkml

Hi!

> > There is also a longer PhD thesis by her.  10 minutes is about as much 
> > work as I personally am willing to lose and try to remember.  Avoiding 
> > 75% of writes instead of 20% is a substantial performance gain worth 
> > paying a cost for.  Unfortunately it is not easy to say if it is worth 
> > that much cost, but I suspect it is.  An approach we are exploring is 
> > for blocks to reach disk earlier than that if the device is not 
> > congested, on the grounds that if not much IO is occurring, then 
> > performance is not important.
> 
> Assuming your 10 minutes are just a default and tunable by sysctl I
> hardly can see any problems at all. Paranoid people can set it to 
> make any tradeoff between performance and speed they'd like including
> setting it to 0, no?

It has traditionally been 30 seconds, so I'd suggest the default stays.

								Pavel
-- 
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-06  1:33   ` reiser
  2002-11-06 14:25     ` Daniel Egger
@ 2002-11-07 16:58     ` Bill Davidsen
  1 sibling, 0 replies; 38+ messages in thread
From: Bill Davidsen @ 2002-11-07 16:58 UTC (permalink / raw)
  To: reiser
  Cc: Peter Chubb, Andreas Dilger, Nikita Danilov, Tomas Szepe,
	Alexander Zarochentcev, lkml, Oleg Drokin, umka

On Tue, 5 Nov 2002, reiser wrote:

> There is also a longer PhD thesis by her.  10 minutes is about as much 
> work as I personally am willing to lose and try to remember.  Avoiding 
> 75% of writes instead of 20% is a substantial performance gain worth 
> paying a cost for.  Unfortunately it is not easy to say if it is worth 
> that much cost, but I suspect it is.  An approach we are exploring is 
> for blocks to reach disk earlier than that if the device is not 
> congested, on the grounds that if not much IO is occurring, then 
> performance is not important.

  I would certainly like to see that; for many people, lost data in case
of problems is more of a concern than performance. 

  Particularly if (a) there is an idle CPU, (b) there are no blocks queued
for write to the device, and (c) there are dirty blocks to go to the
device, it would be good to ignore the age of the block or use a fairly low
minimum age. If we dropped a few blocks onto the drive each time the
conditions were met, I suspect that with many systems that would result in
a lot more free write space in memory. The total blocks written to the
drive would go up, but it shouldn't hurt performance. 

  My first thought is that the check would be done after finding no
runnable normal processes and before running batch priority processes. If
only a few blocks were written each time oldest first it shouldn't even
hurt the batch processes. 
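
  The check described above might look roughly like this. It is only a
sketch under assumptions: struct wb_state, IDLE_BATCH and
idle_writeback_count() are made-up names, not existing kernel
interfaces, and the batch size of 8 is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of system state; field names are illustrative. */
struct wb_state {
	bool     cpu_idle;      /* (a) no runnable normal-priority process */
	unsigned queued_writes; /* (b) writes already queued to the device */
	size_t   dirty_blocks;  /* (c) dirty blocks waiting in memory */
};

/* "A few blocks" per idle pass; the value is an arbitrary placeholder. */
#define IDLE_BATCH ((size_t)8)

/* Number of dirty blocks (oldest first) to push out on this idle pass.
 * Block age is deliberately ignored: only conditions (a)-(c) matter. */
size_t idle_writeback_count(const struct wb_state *s)
{
	assert(s != NULL);
	if (!s->cpu_idle || s->queued_writes != 0 || s->dirty_blocks == 0)
		return 0;
	return s->dirty_blocks < IDLE_BATCH ? s->dirty_blocks : IDLE_BATCH;
}
```

  Capping each pass at a small batch is what keeps the extra writes from
getting in the way of batch priority processes.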

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
@ 2002-11-06 18:37 Tom Reinhart
  0 siblings, 0 replies; 38+ messages in thread
From: Tom Reinhart @ 2002-11-06 18:37 UTC (permalink / raw)
  To: linux-kernel

I would think the default should be a lot more conservative than that, 
probably closer to 30 seconds.  Much better to default to safety, and allow 
knowledgeable users to trade safety for performance if they can live with 
the risks.

Tom


>>There is also a longer PhD thesis by her. 10 minutes is about as much
>>work as I personally am willing to lose and try to remember. Avoiding
>>75% of writes instead of 20% is a substantial performance gain worth
>>paying a cost for. Unfortunately it is not easy to say if it is worth
>>that much cost, but I suspect it is. An approach we are exploring is
>>for blocks to reach disk earlier than that if the device is not
>>congested, on the grounds that if not much IO is occurring, then
>>performance is not important.
>
>Assuming your 10 minutes are just a default and tunable by sysctl I
>hardly can see any problems at all. Paranoid people can set it to
>make any tradeoff between performance and speed they'd like including
>setting it to 0, no?






^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-06  1:33   ` reiser
@ 2002-11-06 14:25     ` Daniel Egger
  2002-11-07 17:19       ` Pavel Machek
  2002-11-07 16:58     ` Bill Davidsen
  1 sibling, 1 reply; 38+ messages in thread
From: Daniel Egger @ 2002-11-06 14:25 UTC (permalink / raw)
  To: reiser; +Cc: lkml

[-- Attachment #1: Type: text/plain, Size: 852 bytes --]

On Wed, 2002-11-06 at 02:33, reiser wrote:

> There is also a longer PhD thesis by her.  10 minutes is about as much 
> work as I personally am willing to lose and try to remember.  Avoiding 
> 75% of writes instead of 20% is a substantial performance gain worth 
> paying a cost for.  Unfortunately it is not easy to say if it is worth 
> that much cost, but I suspect it is.  An approach we are exploring is 
> for blocks to reach disk earlier than that if the device is not 
> congested, on the grounds that if not much IO is occurring, then 
> performance is not important.

Assuming your 10 minutes are just a default and tunable by sysctl I
hardly can see any problems at all. Paranoid people can set it to 
make any tradeoff between performance and speed they'd like including
setting it to 0, no?
 
-- 
Servus,
       Daniel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
  2002-11-05 23:09 ` Peter Chubb
@ 2002-11-06  1:33   ` reiser
  2002-11-06 14:25     ` Daniel Egger
  2002-11-07 16:58     ` Bill Davidsen
  0 siblings, 2 replies; 38+ messages in thread
From: reiser @ 2002-11-06  1:33 UTC (permalink / raw)
  To: Peter Chubb
  Cc: Andreas Dilger, Nikita Danilov, Tomas Szepe,
	Alexander Zarochentcev, lkml, Oleg Drokin, umka

Peter Chubb wrote:

>
>Some benchmarking done at Berkeley showed that for development loads,
>30 seconds was too short to avoid excessive writes.
>
>See Roselli, Lorch and Anderson, `A Comparison of File System
>Workloads' in Usenix 2000.
>
>http://research.microsoft.com/~lorch/papers/fs-workloads/fs-workloads.html
>
>Their observations (summarised) were that most blocks die because of
>overwriting, not because of file deletes.  Their workloads show that
>for NT, the write timeout needed to avoid committing blocks that will
>soon become dead is around a day; for typical Unix loads (web
>serving, research, software development), an hour is enough.  To catch
>75% of the rewrites, a timeout of around 11 minutes is needed.
>30 seconds worked only for webserving and undergraduate teaching
>workloads, and caught around 40% for those workloads; for a research
>workload and NT fileserving, 30 seconds catches only 10-20% of the rewrites.
>
>See especially figure 3 in that paper.
>
>  
>
There is also a longer PhD thesis by her.  10 minutes is about as much 
work as I personally am willing to lose and try to remember.  Avoiding 
75% of writes instead of 20% is a substantial performance gain worth 
paying a cost for.  Unfortunately it is not easy to say if it is worth 
that much cost, but I suspect it is.  An approach we are exploring is 
for blocks to reach disk earlier than that if the device is not 
congested, on the grounds that if not much IO is occurring, then 
performance is not important.

Hans


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply
       [not found] <877555917@toto.iv>
@ 2002-11-05 23:09 ` Peter Chubb
  2002-11-06  1:33   ` reiser
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Chubb @ 2002-11-05 23:09 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: reiser, Nikita Danilov, Tomas Szepe, Alexander Zarochentcev,
	lkml, Oleg Drokin, umka

>>>>> "Andreas" == Andreas Dilger <adilger@clusterfs.com> writes:


Andreas> I think the bdflush defaults are (were?) something like 5
Andreas> seconds for metadata, and 30 seconds for file data. reiser4
Andreas> should (if it doesn't already) use the parameters set by
Andreas> sys_bdflush() to tune the writeout intervals.

...

Andreas> So, except for the very unusual case of files with lifespans
Andreas> between 30 seconds and 300 seconds, or files that are written
Andreas> to between those intervals, I would guess that you are not
Andreas> gaining much extra benefit by deferring the writes another
Andreas> 270 seconds.


Some benchmarking done at Berkeley showed that for development loads,
30 seconds was too short to avoid excessive writes.

See Roselli, Lorch and Anderson, `A Comparison of File System
Workloads' in Usenix 2000.

http://research.microsoft.com/~lorch/papers/fs-workloads/fs-workloads.html

Their observations (summarised) were that most blocks die because of
overwriting, not because of file deletes.  Their workloads show that
for NT, the write timeout needed to avoid committing blocks that will
soon become dead is around a day; for typical Unix loads (web
serving, research, software development), an hour is enough.  To catch
75% of the rewrites, a timeout of around 11 minutes is needed.
30 seconds worked only for webserving and undergraduate teaching
workloads, and caught around 40% for those workloads; for a research
workload and NT fileserving, 30 seconds catches only 10-20% of the rewrites.

See especially figure 3 in that paper.
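
To make the "catches N% of the rewrites" figures concrete: given a trace
of block lifetimes (the time from a block being written to it being
overwritten or deleted), the share of writes a given timeout absorbs is
just the share of lifetimes shorter than that timeout. The function
below is a sketch of that calculation only; absorbed_fraction() is a
made-up name and nothing here is taken from the paper's tooling or data.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of blocks that die (are overwritten or deleted) before the
 * write-back timeout expires, and so never need to reach the disk.
 * lifetimes[i] and timeout are in seconds. */
double absorbed_fraction(const double *lifetimes, size_t n, double timeout)
{
	size_t dead = 0;

	assert(lifetimes != NULL && n > 0);
	for (size_t i = 0; i < n; i++) {
		if (lifetimes[i] < timeout)
			dead++;
	}
	return (double)dead / (double)n;
}
```

Running this over a real trace for several timeouts would give the same
kind of curve as the figure referred to above.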

--
Dr Peter Chubb				    peterc@gelato.unsw.edu.au
You are lost in a maze of BitKeeper repositories, all almost the same.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2002-11-10 11:43 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-10-31 21:23 [BK][PATCH] Reiser4, will double Linux FS performance, please apply Hans Reiser
2002-10-31 22:34 ` Dieter Nützel
2002-10-31 22:47   ` Hans Reiser
2002-11-01  1:17     ` [BK][PATCH] Reiser4, will double Linux FS performance, pleaseapply Andrew Morton
2002-11-01  1:27       ` Andrew Morton
2002-11-01  1:27       ` Hans Reiser
2002-11-01  1:33         ` Andrew Morton
2002-11-01  1:44           ` Dieter Nützel
2002-11-01  1:55           ` Hans Reiser
2002-11-01 10:23             ` Tomas Szepe
2002-11-01 17:19               ` Alexander Zarochentcev
2002-11-02 13:24                 ` Tomas Szepe
2002-11-04 11:00                   ` Nikita Danilov
2002-11-04 19:56                     ` Andreas Dilger
2002-11-05  7:30                     ` reiser
2002-11-05  8:28                       ` Alexander Zarochentcev
2002-11-05  8:44                         ` reiser
2002-11-05  8:49                           ` Alexander Zarochentcev
2002-11-05 21:08                             ` reiser
2002-11-05  9:29                       ` Andreas Dilger
2002-11-05 21:39                         ` reiser
2002-11-05  9:59                       ` Tomas Szepe
2002-11-05 10:08                         ` Alexander Zarochentcev
2002-11-05 10:23                           ` Tomas Szepe
2002-11-05 10:46                         ` Nikita Danilov
2002-11-02 13:38                 ` Tomas Szepe
2002-11-04 12:02                   ` Nikita Danilov
2002-11-04 17:10                     ` Tomas Szepe
2002-11-04 17:53                       ` Nikita Danilov
2002-11-04 18:10                         ` Tomas Szepe
2002-11-01  4:36           ` Linus Torvalds
2002-11-01 10:59             ` Nikita Danilov
     [not found] <877555917@toto.iv>
2002-11-05 23:09 ` Peter Chubb
2002-11-06  1:33   ` reiser
2002-11-06 14:25     ` Daniel Egger
2002-11-07 17:19       ` Pavel Machek
2002-11-07 16:58     ` Bill Davidsen
2002-11-06 18:37 Tom Reinhart
