* fs test suite
@ 2003-10-07  4:11 Pat LaVarre
  2003-10-07 14:17 ` Randy.Dunlap
  0 siblings, 1 reply; 63+ messages in thread
From: Pat LaVarre @ 2003-10-07  4:11 UTC (permalink / raw)
  To: linux-fsdevel

Anybody got an fs test suite posted that I could easily apply to the udf
of linux-2.6.0-test6?

I ask because I'm seeing ext3 work fine on top of loop devices, not so
udf.

Pat LaVarre



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: fs test suite
  2003-10-07  4:11 fs test suite Pat LaVarre
@ 2003-10-07 14:17 ` Randy.Dunlap
  2003-10-07 14:59   ` Zachary Peterson
  0 siblings, 1 reply; 63+ messages in thread
From: Randy.Dunlap @ 2003-10-07 14:17 UTC (permalink / raw)
  To: Pat LaVarre; +Cc: linux-fsdevel

On 06 Oct 2003 22:11:28 -0600 Pat LaVarre <p.lavarre@ieee.org> wrote:

| Anybody got an fs test suite posted that I could easily apply to the udf
| of linux-2.6.0-test6?
| 
| I ask because I'm seeing ext3 work fine on top of loop devices, not so
| udf.

Are you looking for a test (suite) that tests fs metadata moreso
than fs IO?  People have asked for that a few times, but I don't
know of one that is made for that.

Here are some possibilities:

iozone - all sorts of read/write testing, little metadata
		http://www.iozone.org
postmark - email-like tester, mostly small files, with file
	create/delete
		[google for it]
fsx - tester that has stressed (and busted) extN and nfs several
	times  [I would start here.]
	from:  http://www.codemonkey.org.uk/cruft/
	or:    http://www.zip.com.au/~akpm/linux/patches/stuff/
		(these might be different versions of the same prog.)

--
~Randy

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: fs test suite
  2003-10-07 14:17 ` Randy.Dunlap
@ 2003-10-07 14:59   ` Zachary Peterson
  2003-10-07 17:16     ` Randy.Dunlap
  2003-10-20  9:12     ` srfs - a new file system Nir Tzachar
  0 siblings, 2 replies; 63+ messages in thread
From: Zachary Peterson @ 2003-10-07 14:59 UTC (permalink / raw)
  To: Randy.Dunlap; +Cc: Pat LaVarre, linux-fsdevel


Also try Connectathon, which runs a series of individual system call
tests, that look for correctness and deliver performance metrics.

http://www.connectathon.org/

It's not great, but free.

Zachary


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Zachary Peterson       zachary@cse.ucsc.edu
                       http://znjp.com

856D 29FA E1F7 DB5E 9215  C68D 5F0F 3929 C929 9A72

On Tue, 7 Oct 2003, Randy.Dunlap wrote:

>On 06 Oct 2003 22:11:28 -0600 Pat LaVarre <p.lavarre@ieee.org> wrote:
>
>| Anybody got an fs test suite posted that I could easily apply to the udf
>| of linux-2.6.0-test6?
>|
>| I ask because I'm seeing ext3 work fine on top of loop devices, not so
>| udf.
>
>Are you looking for a test (suite) that tests fs metadata moreso
>than fs IO?  People have asked for that a few times, but I don't
>know of one that is made for that.
>
>Here are some possibilities:
>
>iozone - all sorts of read/write testing, little metadata
>		http://www.iozone.org
>postmark - email-like tester, mostly small files, with file
>	create/delete
>		[google for it]
>fsx - tester that has stressed (and busted) extN and nfs several
>	times  [I would start here.]
>	from:  http://www.codemonkey.org.uk/cruft/
>	or:    http://www.zip.com.au/~akpm/linux/patches/stuff/
>		(these might be different versions of the same prog.)
>
>--
>~Randy
>-
>To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: fs test suite
  2003-10-07 14:59   ` Zachary Peterson
@ 2003-10-07 17:16     ` Randy.Dunlap
  2003-10-07 18:54       ` Pat LaVarre
  2003-10-20  9:12     ` srfs - a new file system Nir Tzachar
  1 sibling, 1 reply; 63+ messages in thread
From: Randy.Dunlap @ 2003-10-07 17:16 UTC (permalink / raw)
  To: Zachary Peterson; +Cc: p.lavarre, linux-fsdevel


In that vein, there's also the Linux Test Project (LTP),
  http://ltp.sourceforge.net/

--
~Randy


On Tue, 7 Oct 2003 07:59:59 -0700 (PDT) Zachary Peterson <zachary@cse.ucsc.edu> wrote:

| 
| Also try Connectathon, which runs a series of individual system call
| tests, that look for correctness and deliver performance metrics.
| 
| http://www.connectathon.org/
| 
| It's not great, but free.
| 
| Zachary
| 
| 
| =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
| Zachary Peterson       zachary@cse.ucsc.edu
|                        http://znjp.com
| 
| 856D 29FA E1F7 DB5E 9215  C68D 5F0F 3929 C929 9A72
| 
| On Tue, 7 Oct 2003, Randy.Dunlap wrote:
| 
| >On 06 Oct 2003 22:11:28 -0600 Pat LaVarre <p.lavarre@ieee.org> wrote:
| >
| >| Anybody got an fs test suite posted that I could easily apply to the udf
| >| of linux-2.6.0-test6?
| >|
| >| I ask because I'm seeing ext3 work fine on top of loop devices, not so
| >| udf.
| >
| >Are you looking for a test (suite) that tests fs metadata moreso
| >than fs IO?  People have asked for that a few times, but I don't
| >know of one that is made for that.
| >
| >Here are some possibilities:
| >
| >iozone - all sorts of read/write testing, little metadata
| >		http://www.iozone.org
| >postmark - email-like tester, mostly small files, with file
| >	create/delete
| >		[google for it]
| >fsx - tester that has stressed (and busted) extN and nfs several
| >	times  [I would start here.]
| >	from:  http://www.codemonkey.org.uk/cruft/
| >	or:    http://www.zip.com.au/~akpm/linux/patches/stuff/
| >		(these might be different versions of the same prog.)
| >
| >--

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: fs test suite
  2003-10-07 17:16     ` Randy.Dunlap
@ 2003-10-07 18:54       ` Pat LaVarre
  2003-10-07 18:58         ` Randy.Dunlap
  0 siblings, 1 reply; 63+ messages in thread
From: Pat LaVarre @ 2003-10-07 18:54 UTC (permalink / raw)
  To: rddunlap; +Cc: zachary, linux-fsdevel

> Are you looking for a test (suite) that tests
> fs metadata moreso than fs IO?  People have
> asked for that a few times, but I don't know
> of one that is made for that.

May I ask you to elaborate?  I'm not yet confident I understand the
question.  I mean to ask how do I increase my confidence that 2.4.x and
2.6.x udf.ko will read back to me what I wrote thru it.  I figure that
mixes together metadata and data, since the metadata tells me how much
and from where I read back my data.

I see my semi-private linux_udf@hpesjro.fc.hp.com thread titled "zeroes
read back more often than appended" says a write-read-compare test as
trivial as fopen-fwrite-fclose doesn't yet work.

I blame that either on my own bonehead newbie errors i.e. illegit test
setup, else low-hanging bugs.  I'm hear wondering, can I easily look for
other low-hanging fruit.
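
(For concreteness, the kind of trivial fopen-fwrite-fclose write-read-compare
check mentioned above could be as small as the sketch below; the mount point
/mnt/udf and the buffer size are made up, and this is only an illustration,
not the actual program from that thread.)

#include <stdio.h>
#include <string.h>

/* Write a known pattern through the filesystem under test, read it back,
 * and compare.  /mnt/udf is an assumed mount point for the loop-mounted
 * image under test. */
int main(void)
{
	const char *path = "/mnt/udf/scratch.bin";
	char wbuf[4096], rbuf[4096];
	size_t i, n = sizeof(wbuf);
	FILE *f;

	for (i = 0; i < n; i++)
		wbuf[i] = (char)(i & 0xff);

	f = fopen(path, "wb");
	if (!f || fwrite(wbuf, 1, n, f) != n || fclose(f) != 0) {
		perror("write side");
		return 1;
	}

	f = fopen(path, "rb");
	if (!f || fread(rbuf, 1, n, f) != n || fclose(f) != 0) {
		perror("read side");
		return 1;
	}

	if (memcmp(wbuf, rbuf, n) != 0) {
		fprintf(stderr, "MISMATCH: read back differs from what was written\n");
		return 1;
	}
	puts("ok");
	return 0;
}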

> http...

Thank you!  I will pursue:

http://www.codemonkey.org.uk/cruft/
http://www.zip.com.au/~akpm/linux/patches/stuff/
http://ltp.sourceforge.net/
http://www.iozone.org
http://www.google.com/search?q=postmark+linux
http://www.connectathon.org/
http://www.google.com/search?q=bonnie+linux

Pat LaVarre



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: fs test suite
  2003-10-07 18:54       ` Pat LaVarre
@ 2003-10-07 18:58         ` Randy.Dunlap
  2003-10-07 19:26           ` Pat LaVarre
  0 siblings, 1 reply; 63+ messages in thread
From: Randy.Dunlap @ 2003-10-07 18:58 UTC (permalink / raw)
  To: Pat LaVarre; +Cc: zachary, linux-fsdevel

On 07 Oct 2003 12:54:57 -0600 Pat LaVarre <p.lavarre@ieee.org> wrote:

| > Are you looking for a test (suite) that tests
| > fs metadata moreso than fs IO?  People have
| > asked for that a few times, but I don't know
| > of one that is made for that.
| 
| May I ask you to elaborate?  I'm not yet confident I understand the
| question.  I mean to ask how do I increase my confidence that 2.4.x and
| 2.6.x udf.ko will read back to me what I wrote thru it.  I figure that
| mixes together metadata and data, since the metadata tells me how much
| and from where I read back my data.

Sure, they are usually mixed, but some tests emphasize (or stress)
file data IO vs. metadata more than others do.
And sometimes people ask for a metadata stress test, which would
focus on mv, ln, stat, etc., more than reading/writing file data.
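
(For illustration, a metadata-only stress loop - a hypothetical toy, not one
of the suites mentioned earlier in the thread - could be as simple as this,
run from inside the filesystem under test:)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/* Hammer namespace operations only: create, link, stat, rename, unlink.
 * No file data is ever written, so the load is almost pure metadata. */
int main(void)
{
	char a[64], b[64];
	struct stat st;
	int i, fd;

	for (i = 0; i < 10000; i++) {
		snprintf(a, sizeof(a), "meta.%d", i);
		snprintf(b, sizeof(b), "meta.%d.renamed", i);

		fd = open(a, O_CREAT | O_WRONLY, 0644);	/* create an empty file */
		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);

		if (link(a, b) || stat(b, &st) || unlink(a) ||
		    rename(b, a) || unlink(a)) {
			perror("metadata op");
			return 1;
		}
	}
	puts("metadata loop done");
	return 0;
}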

| I see my semi-private linux_udf@hpesjro.fc.hp.com thread titled "zeroes
| read back more often than appended" says a write-read-compare test as
| trivial as fopen-fwrite-fclose doesn't yet work.

That sounds like a filesystem IO test more than a metadata test,
though the problem could be in either area.

| I blame that either on my own bonehead newbie errors i.e. illegit test
| setup, else low-hanging bugs.  I'm hear wondering, can I easily look for
| other low-hanging fruit.

Hear?

--
~Randy

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: fs test suite
  2003-10-07 18:58         ` Randy.Dunlap
@ 2003-10-07 19:26           ` Pat LaVarre
  0 siblings, 0 replies; 63+ messages in thread
From: Pat LaVarre @ 2003-10-07 19:26 UTC (permalink / raw)
  To: rddunlap; +Cc: zachary, linux-fsdevel

> some tests emphasize ...
> mv, ln, stat, etc.,
> more than reading/writing file

Clear now, thank you.  Sorry to hear no one much exercises the many ways
of writing only metadata.  Immediately I think to include `ls` and
`touch` in your etc.; I also see `head` and `tail` and `tail -f` as halfway
back towards stressing data.

> > I'm hear wondering,
> > can I easily look for other low-hanging fruit.

You know, “with enough eyeballs, all bugs are shallow”.

> > ... I'm hear wondering,
> > can I easily look for other low-hanging fruit.
>
> Hear?

I meant "here", I'm not sure if you understood me or not, sorry if not,
grin if yes.  Me, I learned American English as a phonetic foreign
language.  For example, I didn't discover people who pronounced "herb"
with an h until I discovered English English, and I prefer to spell
"façade" with a soft 'ç', and I still think "aisle" and "isle" ought to
sound more like "ayzel" and less like "island", and ...

Pat LaVarre



^ permalink raw reply	[flat|nested] 63+ messages in thread

* srfs - a new file system.
  2003-10-07 14:59   ` Zachary Peterson
  2003-10-07 17:16     ` Randy.Dunlap
@ 2003-10-20  9:12     ` Nir Tzachar
  2003-10-20 21:00       ` Eric Sandall
                         ` (2 more replies)
  1 sibling, 3 replies; 63+ messages in thread
From: Nir Tzachar @ 2003-10-20  9:12 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

hello all.

We're proud to announce the availability of a _proof of concept_ file
system called srfs ( http://www.cs.bgu.ac.il/~srfs/ ).
A quick overview [from the home page]:
srfs is a global file system designed to be distributed geographically over
multiple locations and provide a consistent, highly available and durable
infrastructure for information.

Started as a research project into file systems and self-stabilization at
the Ben Gurion University of the Negev Department of Computer Science, the
project aims to integrate self-stabilization methods and algorithms into
file (and operating) systems to provide a system with the desired
behavior in the presence of transient faults.

Based on layered self-stabilizing algorithms, srfs provides a tree replication
structure built on auto-discovery of servers using local and global IP
multicasting. The tree structure provides the command and timing
infrastructure required for a distributed file system.

The project is basically divided into two components:
1) a kernel module, which provides the low level functionality and
   disk management.
2) a user space caching daemon, which provides the stabilization and
   replication properties of the file system.
These two components communicate via a character device.

More info on the system architecture can be found on the web page, and
here: http://www.cs.bgu.ac.il/~tzachar/srfs.pdf

We hope some will find this interesting enough to take it for a test drive
and won't mind the latencies (currently, the caching daemon is a bit slow;
hopefully, we will improve it in the future).
Anyway, please keep in mind this is a very early version that merely works
and keeps the stabilization properties. No POSIX compliance whatsoever...

The code contains several hacks and design flaws that we're aware of,
and probably many that we're not... so please be gentle ;)

If you find this interesting, please contact us with your insights.
cheers,
the srfs team.

P.S. I would like to thank all members of this mailing list (fsdevel) for
your continual help with problems we encountered during the development.
Thanks guys (and girls???).

========================================================================
nir.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-20  9:12     ` srfs - a new file system Nir Tzachar
@ 2003-10-20 21:00       ` Eric Sandall
  2003-10-21 12:07         ` Nir Tzachar
  2003-10-23 17:46         ` Daniel Egger
  2003-10-22  4:57       ` Erik Andersen
  2003-10-25  9:27       ` Implementing writepage Charles Manning
  2 siblings, 2 replies; 63+ messages in thread
From: Eric Sandall @ 2003-10-20 21:00 UTC (permalink / raw)
  To: Nir Tzachar; +Cc: linux-fsdevel, linux-kernel

Quoting Nir Tzachar <tzachar@cs.bgu.ac.il>:
> hello all.
> 
> We're proud to announce the availability of a _proof of concept_ file
> system, called srfs. ( http://www.cs.bgu.ac.il/~srfs/ ).
> a quick overview: [from the home page]
> srfs is a global file system designed to be distributed geographicly over
> multiple locations and provide a consistent, high available and durable
> infrastructure for information.
> 
> Started as a research project into file systems and self-stabilization in
> Ben Gurion University of the Negev Department of Computer Science, the
> project aims to integrate self-stabilization methods and algorithms into
> the file (and operation) systems to provide a system with a desired
> behavior in the presence of transient faults.
> 
> Based on layered self-stabilizing algorithms, provide a tree replication
> structure based on auto-discovery of servers using local and global IP
> multicasting. The tree structure is providing the command and timing
> infrastructure required for a distributed file system.
> 
> The project is basically divided into two components:
> 1) a kernel module, which provides the low level functionality, and
>    disk management.
> 2) a user space caching daemon, which provide the stabilization and
>    replication properties of the file system.
> these two components communicate via a character device.
> 
> more info on the system architecture can be find on the web page, and
> here: http://www.cs.bgu.ac.il/~tzachar/srfs.pdf
> 
> We hope some will find this interesting enough to take for a test drive,
> and wont mind the latencies ( currently, the caching daemon is a bit slow.
> hopefully, we will improve it in the future. )
> anyway, please keep in mind this is a very early version that only works,
> and keeps the stabilization properties. no posix compliance whatsoever...
> 
> the code contains several hacks and design flaws that we're aware of,
> and probably many that we're not... so please be gentle ;)
> 
> if someone found this interesting, please contact us with ur insights.
> cheers,
> the srfs team.
> 
> p.s I would like to thank all members of this mailing list (fsdevel), for
> ur continual help with problems we encountered during the development.
> thanks guys (and girls???).
> 
> ========================================================================
> nir.

This sounds fairly similar to Coda[0], which is already in development and use.

-sandalle

[0] http://www.coda.cs.cmu.edu/

-- 
PGP Key Fingerprint:  FCFF 26A1 BE21 08F4 BB91  FAED 1D7B 7D74 A8EF DD61
http://search.keyserver.net:11371/pks/lookup?op=get&search=0xA8EFDD61

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCS/E/IT$ d-- s++:+>: a-- C++(+++) BL++++VIS>$ P+(++) L+++ E-(---) W++ N+@ o?
K? w++++>-- O M-@ V-- PS+(+++) PE(-) Y++(+) PGP++(+) t+() 5++ X(+) R+(++)
tv(--)b++(+++) DI+@ D++(+++) G>+++ e>+++ h---(++) r++ y+
------END GEEK CODE BLOCK------

Eric Sandall                     |  Source Mage GNU/Linux Developer
eric@sandall.us                  |  http://www.sourcemage.org/
http://eric.sandall.us/          |  SysAdmin @ Inst. Shock Physics @ WSU
http://counter.li.org/  #196285  |  http://www.shock.wsu.edu/

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-20 21:00       ` Eric Sandall
@ 2003-10-21 12:07         ` Nir Tzachar
  2003-10-21 14:29           ` Brian Beattie
                             ` (2 more replies)
  2003-10-23 17:46         ` Daniel Egger
  1 sibling, 3 replies; 63+ messages in thread
From: Nir Tzachar @ 2003-10-21 12:07 UTC (permalink / raw)
  To: Eric Sandall; +Cc: linux-fsdevel, linux-kernel

>
> This sounds fairly similar to Coda[0], which is already in development and use.
>

not at all.

coda is not self stabilizing at all.
srfs is also a totally distributed file system -> see the doc.
bye
========================================================================
nir.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-21 12:07         ` Nir Tzachar
@ 2003-10-21 14:29           ` Brian Beattie
  2003-10-21 16:59           ` Jan Harkes
  2003-10-23 13:58           ` Pavel Machek
  2 siblings, 0 replies; 63+ messages in thread
From: Brian Beattie @ 2003-10-21 14:29 UTC (permalink / raw)
  To: Nir Tzachar; +Cc: Eric Sandall, linux-fsdevel, linux-kernel

On Tue, 2003-10-21 at 08:07, Nir Tzachar wrote:
> >
> > This sounds fairly similar to Coda[0], which is already in development and use.
> >
> 
> not at all.
> 
> coda is not self stabilizing at all.
> srfs is also a totally distributed file system -> see the doc.

what does "self stabilizing" mean in this context?

> bye
bye bye
-- 
Brian Beattie            | Experienced kernel hacker/embedded systems
beattie@beattie-home.net | programmer, direct or contract, short or
www.beattie-home.net     | long term, available immediately.

"Honor isn't about making the right choices.
It's about dealing with the consequences." -- Midori Koto


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-21 12:07         ` Nir Tzachar
  2003-10-21 14:29           ` Brian Beattie
@ 2003-10-21 16:59           ` Jan Harkes
  2003-10-23 13:58           ` Pavel Machek
  2 siblings, 0 replies; 63+ messages in thread
From: Jan Harkes @ 2003-10-21 16:59 UTC (permalink / raw)
  To: Nir Tzachar; +Cc: linux-fsdevel

On Tue, Oct 21, 2003 at 02:07:04PM +0200, Nir Tzachar wrote:
> not at all.
> 
> coda is not self stabilizing at all.
> srfs is also a totally distributed file system -> see the doc.

In what way do you think that Coda isn't distributed?

Also Coda does have 'self-stabilizing' properties, but probably in a
different way compared to how you think about self stabilization.

When a server becomes loaded (too many clients, heavy CPU/memory usage
by other processes, network trouble) its responses slow down and
clients will automatically switch to some more lightly loaded replica that
stores the same data. We work based on an estimate of the available
bandwidth on a per-client basis, and the switch is performed in a
non-deterministic fashion, i.e. we don't pick the 'fastest' machine, but
decide to switch to a random machine when we're talking to the 'slowest'
one. As a result this works very well at balancing the load across all
available replicas. This adaptation mostly affects read-oriented data
traffic.
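
(A toy illustration of that switching idea - not Coda's actual client code;
the replica count and bandwidth estimates are invented:)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NREPLICAS 4

/* If the replica we are currently talking to looks like the slowest one,
 * jump to a randomly chosen other replica rather than greedily picking
 * "the fastest"; the randomness keeps clients from stampeding one host. */
static int maybe_switch(int current, const double bw[NREPLICAS])
{
	int i, slowest = 0;

	for (i = 1; i < NREPLICAS; i++)
		if (bw[i] < bw[slowest])
			slowest = i;

	if (current != slowest)
		return current;			/* good enough, stay put */

	i = rand() % (NREPLICAS - 1);		/* any replica except the current one */
	return i >= current ? i + 1 : i;
}

int main(void)
{
	double bw[NREPLICAS] = { 0.3, 2.1, 1.7, 1.2 };	/* estimated MB/s per replica */

	srand((unsigned)time(NULL));
	printf("client on replica 0 moves to replica %d\n", maybe_switch(0, bw));
	return 0;
}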

Similarly, when a client happens to be sending modifications (writes) to
an overloaded server, it will at some point switch to writeback caching
(write-disconnected operation), in this state it keeps track of
modifications without writing them back to the server immediately.
During this time it can optimize away some operations (intermediate
files created during a compilation) and once the local data has 'aged'
enough to be considered stable, it reintegrates the modifications in
batches of multiple operations at a time. When several operations arrive
in a batch, the server only needs to commit a single transaction for up
to 100 operations at a time, which results in a far more efficient use
of the CPU and disk IO resources on the server. The trade-off is
of course a weaker consistency model.

So there is definitely a self-stabilizing mechanism present in Coda.

Jan


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-20  9:12     ` srfs - a new file system Nir Tzachar
  2003-10-20 21:00       ` Eric Sandall
@ 2003-10-22  4:57       ` Erik Andersen
  2003-10-22 10:16         ` Nir Tzachar
                           ` (2 more replies)
  2003-10-25  9:27       ` Implementing writepage Charles Manning
  2 siblings, 3 replies; 63+ messages in thread
From: Erik Andersen @ 2003-10-22  4:57 UTC (permalink / raw)
  To: Nir Tzachar; +Cc: linux-fsdevel, linux-kernel

On Mon Oct 20, 2003 at 11:12:07AM +0200, Nir Tzachar wrote:
> more info on the system architecture can be find on the web page, and
> here: http://www.cs.bgu.ac.il/~tzachar/srfs.pdf

Suppose I install srfs on both my laptop and my server.  I then
move the CVS repository for my pet project onto the new srfs
filesystem and I take off for the weekend with my laptop.   Over
the weekend I commit several changes to file X.  Over the weekend
my friend also commits several changes to file X.

When I get home and plug in my laptop, presumably the caching
daemon will try to stabilize the system by deciding which version
of file X was changed last and replicating that latest version.

Whose work will the caching daemon overwrite?  My work, or my
friend's work?

Of course, this need not involve anything so extreme as days of
disconnected independent operation.  A rebooting router between
two previously synced srfs peers seems sufficient to trigger this
kind of data loss, unless you make the logging daemon fail all
writes when disconnected.

 -Erik

--
Erik B. Andersen             http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-22  4:57       ` Erik Andersen
@ 2003-10-22 10:16         ` Nir Tzachar
  2003-10-22 14:22           ` Jan Harkes
  2003-10-22 10:21         ` Nir Tzachar
  2003-10-22 16:05         ` Valdis.Kletnieks
  2 siblings, 1 reply; 63+ messages in thread
From: Nir Tzachar @ 2003-10-22 10:16 UTC (permalink / raw)
  To: Erik Andersen, jaharkes, eric; +Cc: linux-fsdevel, linux-kernel

First, I'd like to thank you guys for playing around with the idea.

Now, I want to apologize if my explanation was not clear enough:
self-stabilization (original idea by Dijkstra) - a self-stabilizing system
is a system that can automatically recover following the occurrence of
(transient) faults. The idea is to design a system which can be started
in an arbitrary state and still converge to a desired behavior.

Our file system behaves like this:
let's say you have several servers, with different file system trees on
them. If (and when...) you connect these file systems with an srfs
framework, all servers will display the same file system tree, which is
somewhat of a union between them all.
If you wish to talk in Coda terms, you can say all servers operated
disconnectedly and then were connected at the same time. The conflict
resolution mechanism we use is by majority.

We differ from Coda in the sense that we don't have a main server which pushes
volumes to sub-servers (I'm not sure what the Coda terminology is...) and
serves data in a load-balanced way. In srfs, all the data resides on
all servers (hosts) and is replicated between them.
Replication takes place at two levels: tree view (plus metadata) and the
actual data.
tree view - the tree view on all hosts is the same; an `ls` on a dir
            on any host will produce the same output.
data - data will be replicated to all hosts upon a successful write,
       and upon each access to a dirty file on each host.

All replication is lazy, and happens only on access to dirs / files
(and on successful writes - when the file is being closed).

Thus, the following behavior can be achieved:
let's say you have 2N+1 hosts, all with coherent file system trees.
Now, take N of them offline, change the tree, put those N back online,
and their tree will be the same as that of the other N+1 hosts.

The main goal of the file system is self-stabilization, over long periods
of time and long distances. You can use it as a SAN, or as a data farm,
using a system like LinuxVirtualServer to balance the load between nodes.
cheers.

========================================================================
nir.





^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-22  4:57       ` Erik Andersen
  2003-10-22 10:16         ` Nir Tzachar
@ 2003-10-22 10:21         ` Nir Tzachar
  2003-10-22 16:05         ` Valdis.Kletnieks
  2 siblings, 0 replies; 63+ messages in thread
From: Nir Tzachar @ 2003-10-22 10:21 UTC (permalink / raw)
  To: Erik Andersen; +Cc: linux-fsdevel, linux-kernel

> Who's work will the caching daemon overwrite?My work, or my
> friends work?

Well, in our system, unless you break the symmetry, the daemon will
pick a random file; since no majority can be found, this is the default.
But let's say your friend was connected to a third server, and his work
was saved there also. When you connect your laptop, all of your work will
be lost, and what you'll see is only his work ;)


========================================================================
nir.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-22 10:16         ` Nir Tzachar
@ 2003-10-22 14:22           ` Jan Harkes
  2003-10-23  7:50             ` Nir Tzachar
  0 siblings, 1 reply; 63+ messages in thread
From: Jan Harkes @ 2003-10-22 14:22 UTC (permalink / raw)
  To: Nir Tzachar; +Cc: linux-fsdevel

On Wed, Oct 22, 2003 at 12:16:02PM +0200, Nir Tzachar wrote:
> if you wish to talk in coda terms, you can say all servers operated
> disconnectedly, and then were connected at the same time. the conflict
> resolving mechanism we use, is by majority.

That's annoying when >50% of your servers were unavailable for a period
of time, because all recent changes will be lost when connectivity is
restored.

> We differ from coda in the sense we don't have a main server, which pushes
> Volumes to sub-servers (im not sure what the coda terminology is... ), and

Where in the world did you get the idea that Coda has a main server that
pushes out modifications? That is so wrong, I don't even know where to
begin.

> data is served in a load-balanced way. In Srfs, all the data resides on
> all servers (hosts) and is replicated between them.
> replication takes place at two levels: tree view (plus meta data) and the
> actual data.
> tree view - the tree view on all hosts is the same. an `ls` on a dir
>             on any host will produce the same output.
> data - data will be replicated to all hosts upon a successful write,
>        and upon each access to a dirty file on each host.

Coda also uses a global namespace that's pretty normal for distributed
filesystems (AFS/DFS).

So the only differences really are that Coda uses a version-vector
based mechanism to detect and resolve version conflicts instead of
majority voting, i.e. even when only a single server is accessible for a
period of time, the committed updates will eventually propagate to the
others. And we don't throw away a file just because two out of three
servers happen to have an old copy and vote against it.

And Coda gives an administrator the ability to use different replication
groups within his servers for different types of data based on for
instance expected access patterns. Temporary objects or files that are
rarely used could only have a single replica. Mail folders would have 2
replicas (as only one user would read it, so the replication is only
needed to protect against occasional server outage), and data shared by
many users (binaries) but rarely updated could be available from many
replicas.

> all replication is lazy, and happens only on access to dirs / files
> (and on successful writes - when the file is being closed.)

Did you read _any_ of the Coda papers that were written during the past
16 years?

Well, this one is pretty recent and nicely summarizes the history of
Coda, and provides an overview of what Coda actually does.

    M. Satyanarayanan, 'The Evolution of Coda'
    ACM Transactions on Computer Systems (TOCS)
    Volume 20, Issue 2 (May 2002)
    Pages: 85 - 124  

    http://portal.acm.org/citation.cfm?id=507052.507053&dl=GUIDE&dl=GUIDE&idx=J774&part=periodical&WantType=periodical&title=ACM%20Transactions%20on%20Computer%20Systems%20(TOCS)

Jan


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-22  4:57       ` Erik Andersen
  2003-10-22 10:16         ` Nir Tzachar
  2003-10-22 10:21         ` Nir Tzachar
@ 2003-10-22 16:05         ` Valdis.Kletnieks
  2003-10-22 19:38           ` Erik Andersen
  2 siblings, 1 reply; 63+ messages in thread
From: Valdis.Kletnieks @ 2003-10-22 16:05 UTC (permalink / raw)
  To: andersen; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 625 bytes --]

On Tue, 21 Oct 2003 22:57:09 MDT, Erik Andersen said:

> Suppose I install srfs on both my laptop and my server.  I then
> move the CVS repository for my pet project onto the new srfs
> filesystem and I take off for the weekend with my laptop.   Over
> the weekend I commit several changes to file X.  Over the weekend
> my friend also commits several changes to file X.
> 
> When I get home and plug in my laptop, presumably the caching
> daemon will try to stabalize the system by deciding which version
> of file X was changed last and replicating that latest version.  

Hey Larry - potential BitKeeper customer here. :)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-22 16:05         ` Valdis.Kletnieks
@ 2003-10-22 19:38           ` Erik Andersen
  2003-10-23  5:20             ` Miles Bader
  0 siblings, 1 reply; 63+ messages in thread
From: Erik Andersen @ 2003-10-22 19:38 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1069 bytes --]

On Wed Oct 22, 2003 at 12:05:46PM -0400, Valdis.Kletnieks@vt.edu wrote:
> On Tue, 21 Oct 2003 22:57:09 MDT, Erik Andersen said:
> 
> > Suppose I install srfs on both my laptop and my server.  I then
> > move the CVS repository for my pet project onto the new srfs
> > filesystem and I take off for the weekend with my laptop.   Over
> > the weekend I commit several changes to file X.  Over the weekend
> > my friend also commits several changes to file X.
> > 
> > When I get home and plug in my laptop, presumably the caching
> > daemon will try to stabalize the system by deciding which version
> > of file X was changed last and replicating that latest version.  
> 
> Hey Larry - potential BitKeeper customer here. :)

Not so much a potential BitKeeper customer, as pointing out that
the distributed filesystems people are attacking the same
fundamental problem as the distributed version control folks.

 -Erik

--
Erik B. Andersen             http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-22 19:38           ` Erik Andersen
@ 2003-10-23  5:20             ` Miles Bader
  2003-10-23  5:37               ` Valdis.Kletnieks
  0 siblings, 1 reply; 63+ messages in thread
From: Miles Bader @ 2003-10-23  5:20 UTC (permalink / raw)
  To: andersen; +Cc: Valdis.Kletnieks, linux-kernel

Erik Andersen <andersen@codepoet.org> writes:
> Not so much a potential BitKeeper customer, as pointing out that
> the distributed filesystems prople are attacking the same
> fundamental problem as the distributed version control folks.

It may be the same at some level, but there's an important difference:
distributed filesystems are usually (AFAIK) attempting to maintain the
illusion of a single global filesystem that looks more or less to the
users like a local filesystem, and usually just an average unixy
filesystem.  This is very, very, hard...

Distributed version control systems, OTOH, because they're at a somewhat
higher level, have the huge advantage of distinct operational boundaries
which are exposed to the user and can be used to manage the distribution.
Since users are used to these boundaries, and they usually occur at
fairly obvious and reasonable places, this isn't such a burden on the
users.

-Miles
-- 
97% of everything is grunge

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-23  5:20             ` Miles Bader
@ 2003-10-23  5:37               ` Valdis.Kletnieks
  0 siblings, 0 replies; 63+ messages in thread
From: Valdis.Kletnieks @ 2003-10-23  5:37 UTC (permalink / raw)
  To: Miles Bader; +Cc: andersen, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 752 bytes --]

On Thu, 23 Oct 2003 14:20:21 +0900, Miles Bader said:

> Distributed version control systems, OTOH, because they're at a somewhat
> higher level, have the huge advantage of distinct operational boundaries
> which are exposed the user and can be used to manage the distribution.
> Since users are used to these boundaries, and they usually occur at
> fairly obvious and reasonable places, this isn't such a burden on the
> users.

On the flip side, a filesystem only has to worry about who wrote which blocks
in what order.  I suspect if you tried to push the idea of a filesystem that did
the sort of intuiting of intent that BitKeeper has to do on a merge, it would
quickly get shouted down.

Unless of course somebody does BK as a Reiser4 module. :)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-22 14:22           ` Jan Harkes
@ 2003-10-23  7:50             ` Nir Tzachar
  2003-10-23 12:33               ` Jan Hudec
  0 siblings, 1 reply; 63+ messages in thread
From: Nir Tzachar @ 2003-10-23  7:50 UTC (permalink / raw)
  To: Jan Harkes; +Cc: linux-fsdevel

hi there.

> That's annoying when >50% of your servers were unavailable for a period
> of time, because all recent changes will be lost when connectivity is
> restored.

Well, if you want a _full_ self-stabilizing file system, you cannot behave
any other way. When you have a self-stabilizing algorithm, you __have__ to
operate under the assumption that transient errors can and will happen.
So, a cosmic ray can hit N out of your 2N+1 hosts and corrupt the data
they hold. The chance is very slim, but you have to take these kinds of
errors into account to prove the correctness of the algorithm.

> Where in the world did you get the idea that Coda has a main server that
> pushes out modifications? That is so wrong, I don't even know where to
> begin.
You're right, what I described was more like AFS, but you got my point...

> Coda also uses a global namespace that's pretty normal for distributed
> filesystems (AFS/DFS).
Well, I was not talking about a global name space. Surely you must have one,
otherwise things will get tough...
What I meant is, data is replicated at two levels: first, the metadata
(file attributes) is replicated, and the actual data will only get
replicated upon access.

To summarize:
at no point do we claim srfs to be "better" than Coda.

Our design was aimed to be "close" to Coda, but our emphasis was
self-stabilization, minimal dependency on a single point of failure, and
trust no one (neither a central manager nor stored information).
All of our design follows from this.

cheers.
-- 
========================================================================
nir.




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-23  7:50             ` Nir Tzachar
@ 2003-10-23 12:33               ` Jan Hudec
  2003-10-23 20:12                 ` Pat LaVarre
  0 siblings, 1 reply; 63+ messages in thread
From: Jan Hudec @ 2003-10-23 12:33 UTC (permalink / raw)
  To: Nir Tzachar; +Cc: Jan Harkes, linux-fsdevel

On Thu, Oct 23, 2003 at 09:50:01 +0200, Nir Tzachar wrote:
> hi there.
> 
> > That's annoying when >50% of your servers were unavailable for a period
> > of time, because all recent changes will be lost when connectivity is
> > restored.
> 
> well, if u want a _full_ self stabilizing file system, you cannot behave 
> any other way. When you have a self stabilizing algorithm, you __have__ to 
> operate under the assumption that transient errors can and will happen.
> so, a cosmic ray can hit N out of your 2N+1 hosts, and corrupt the data 
> they hold. its very slim, but you have to take these kind of errors into 
> account to prove the correctness of the algorithm.

But the vector time approach solves this too and does so a lot better.

Let us return to the example with the notebook. Assume there is a computer
lab with 20 computers and all have replicas of some file. Assume that
I take a laptop, connect it to the system, replicate the file and
disconnect.  Then I work on it while disconnected and then reconnect
again.

With vector time, the system decides that the copy on all 20 computers
is an ancestor of my copy and replaces everything with my copy. With
majority vote, my copy loses 1:20 and is lost.

Imagine further that my friend does the same.

Now, with vector time, the system decides that
  * all copies in the lab are old, and invalidates them;
  * our copies conflict. It does not blindly choose one; rather, it asks
    for assistance.
While with majority vote, both our copies lose 1:20 and are discarded.
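
(For reference, the version-vector comparison behind this example can be
sketched as below - a generic illustration, not Coda's or srfs' code; the
replica count and counter values are invented:)

#include <stdio.h>

#define NREPLICAS 3

/* Compare two version vectors.  Returns 1 if a dominates b (a is newer),
 * -1 if b dominates a, 0 if they are equal, and 2 if they conflict
 * (concurrent updates). */
static int vv_compare(const int a[NREPLICAS], const int b[NREPLICAS])
{
	int i, a_gt = 0, b_gt = 0;

	for (i = 0; i < NREPLICAS; i++) {
		if (a[i] > b[i])
			a_gt = 1;
		if (b[i] > a[i])
			b_gt = 1;
	}
	if (a_gt && b_gt)
		return 2;	/* concurrent updates: a real conflict */
	if (a_gt)
		return 1;
	if (b_gt)
		return -1;
	return 0;
}

int main(void)
{
	int lab[NREPLICAS]    = { 4, 4, 4 };	/* copy left behind in the lab */
	int mine[NREPLICAS]   = { 5, 4, 4 };	/* my disconnected updates     */
	int theirs[NREPLICAS] = { 4, 4, 6 };	/* my friend's updates         */

	printf("mine vs lab:    %d\n", vv_compare(mine, lab));	   /* 1: lab copy is stale      */
	printf("mine vs theirs: %d\n", vv_compare(mine, theirs)); /* 2: concurrent, needs help */
	return 0;
}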

-------------------------------------------------------------------------------
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-21 12:07         ` Nir Tzachar
  2003-10-21 14:29           ` Brian Beattie
  2003-10-21 16:59           ` Jan Harkes
@ 2003-10-23 13:58           ` Pavel Machek
  2003-10-24  9:28             ` Nir Tzachar
  2 siblings, 1 reply; 63+ messages in thread
From: Pavel Machek @ 2003-10-23 13:58 UTC (permalink / raw)
  To: Nir Tzachar; +Cc: Eric Sandall, linux-fsdevel, linux-kernel

Hi!

> > This sounds fairly similar to Coda[0], which is already in development and use.
> >
> 
> not at all.
> 
> coda is not self stabilizing at all.
> srfs is also a totally distributed file system -> see the doc.
> bye

Yes, but perhaps the differences can be localized to the userspace daemon,
having the same kernel part for Coda and srfs?
That would be *good*.

				Pavel
-- 
				Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-20 21:00       ` Eric Sandall
  2003-10-21 12:07         ` Nir Tzachar
@ 2003-10-23 17:46         ` Daniel Egger
  2003-10-23 18:47           ` Eric Sandall
  2003-10-24  9:26           ` srfs - a new file system Nir Tzachar
  1 sibling, 2 replies; 63+ messages in thread
From: Daniel Egger @ 2003-10-23 17:46 UTC (permalink / raw)
  To: Eric Sandall; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 554 bytes --]

Am Mon, den 20.10.2003 schrieb Eric Sandall um 23:00:

> This sounds fairly similar to Coda[0], which is already in development and use.

The last time I looked, Coda was a horrible mess of code, nearly
impossible to get to compile, let alone configure, and it seems to have
the same interoperability problems as InterMezzo, i.e. it didn't work
between i386<->powerpc. I haven't looked at Lustre light or srfs yet, but
I certainly welcome any fresh projects in the area of distributed or
replicating filesystems.

-- 
Servus,
       Daniel

[-- Attachment #2: Dies ist ein digital signierter Nachrichtenteil --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-23 17:46         ` Daniel Egger
@ 2003-10-23 18:47           ` Eric Sandall
  2003-10-23 23:15             ` Daniel Egger
  2003-10-24 15:45             ` srfs - a new file system.--OT Gadgeteer
  2003-10-24  9:26           ` srfs - a new file system Nir Tzachar
  1 sibling, 2 replies; 63+ messages in thread
From: Eric Sandall @ 2003-10-23 18:47 UTC (permalink / raw)
  To: Daniel Egger; +Cc: linux-kernel

Quoting Daniel Egger <degger@fhm.edu>:
> The last time I looked Coda was a horrible mess of a code, closely
> impossible to get it compile let alone configure and it seems to have
> the same interoperability problems like intermezzo i.e. it didn't work
> between i386<->powerpc. I haven't looked at Lustre light or srfs yet but
> I certainly welcome any fresh projects in the area of distributed or
> replicating filesystems.
> 
> -- 
> Servus,
>        Daniel

Agreed, more DFSs are always good.  As for Coda, it has compiled fine for me for
the last year (with some bison patches), but I have not actually tried it yet.
NFS may be slow, but at least it works and I haven't lost any files due to
using it.

-sandalle

-- 
PGP Key Fingerprint:  FCFF 26A1 BE21 08F4 BB91  FAED 1D7B 7D74 A8EF DD61
http://search.keyserver.net:11371/pks/lookup?op=get&search=0xA8EFDD61

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCS/E/IT$ d-- s++:+>: a-- C++(+++) BL++++VIS>$ P+(++) L+++ E-(---) W++ N+@ o?
K? w++++>-- O M-@ V-- PS+(+++) PE(-) Y++(+) PGP++(+) t+() 5++ X(+) R+(++)
tv(--)b++(+++) DI+@ D++(+++) G>+++ e>+++ h---(++) r++ y+
------END GEEK CODE BLOCK------

Eric Sandall                     |  Source Mage GNU/Linux Developer
eric@sandall.us                  |  http://www.sourcemage.org/
http://eric.sandall.us/          |  SysAdmin @ Inst. Shock Physics @ WSU
http://counter.li.org/  #196285  |  http://www.shock.wsu.edu/

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-23 12:33               ` Jan Hudec
@ 2003-10-23 20:12                 ` Pat LaVarre
  2003-10-24  9:21                   ` Nir Tzachar
  0 siblings, 1 reply; 63+ messages in thread
From: Pat LaVarre @ 2003-10-23 20:12 UTC (permalink / raw)
  To: bulb; +Cc: tzachar, jaharkes, linux-fsdevel

> > transient errors can and will happen. so, a cosmic ray can hit N out
> > of your 2N+1 hosts, and corrupt the data  they hold. its very slim,
> > but you have to take these kind of errors into 
> > account to prove the correctness of the algorithm.
> 
> But the vector time approach solves this too and does so a lot better.
> 
> If we return to the example with notebook ...
> Now, with vector time, the system decides, that
>   * All copies in the lab are old and invalidates them.
>   * Our copies conflict. It does not blindly choose one,
> rather it asks for assistance.

We regard getting everyone to agree about what the time is as a solved
problem?

I'm inspired to ask because I'm posting this query from behind an
employer-owned firewall thru which I have not yet punched time service,
cvs, etc.  There was a day, not so long ago, when I couldn't punch thru
ftp and streaming media ...

Pat LaVarre



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-23 18:47           ` Eric Sandall
@ 2003-10-23 23:15             ` Daniel Egger
  2003-10-24 15:45             ` srfs - a new file system.--OT Gadgeteer
  1 sibling, 0 replies; 63+ messages in thread
From: Daniel Egger @ 2003-10-23 23:15 UTC (permalink / raw)
  To: Eric Sandall; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 603 bytes --]

Am Don, den 23.10.2003 schrieb Eric Sandall um 20:47:

> Agreed, more DFS' are always good.  As for Coda, it has compiled fine for me for
> the last year (with some bison patches), but I have not actually tried it yet. 
> NFS may be slow, but at least it works and I haven't lost any files due to
> using it.

The slowness of NFS is not an issue for me (actually I'm getting quite
good performance over a switched 100Mbit network). However, it doesn't
replicate, which is annoying when I'm on the road a lot but normally
also use lots of computers in the lab.

-- 
Servus,
       Daniel

[-- Attachment #2: Dies ist ein digital signierter Nachrichtenteil --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-23 20:12                 ` Pat LaVarre
@ 2003-10-24  9:21                   ` Nir Tzachar
  2003-10-24 12:08                     ` Matthew Wilcox
  2003-10-24 14:38                     ` Jan Harkes
  0 siblings, 2 replies; 63+ messages in thread
From: Nir Tzachar @ 2003-10-24  9:21 UTC (permalink / raw)
  To: Pat LaVarre; +Cc: bulb, jaharkes, linux-fsdevel

> > If we return to the example with notebook ...
> > Now, with vector time, the system decides, that
> >   * All copies in the lab are old and invalidates them.
> >   * Our copies conflict. It does not blindly choose one,
> > rather it asks for assistance.
> 
> We regard getting everyone to agree about what the time is as a solved
> problem?
> 
> I'm inspired to ask because I'm posting this query from behind an
> employer-owned firewall thru which I have not yet punched time service,
> cvs, etc.  There was a day, not so long ago, when I couldn't punch thru
> ftp and streaming media ...

I think you're right, but even when you do succeed, let's take a more Byzantine
approach: what if the time service is down? Sabotaged? Damaged? Maybe it
lies through its teeth (ports)?? Maybe your vector time is corrupted? Maybe
a user deliberately changed his vector time - he will bring havoc upon
your system.

So, srfs takes the approach of 'trust no one, not even myself'.
A bit paranoid, but very useful (although the cost is very high...)


-- 
========================================================================
nir.




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-23 17:46         ` Daniel Egger
  2003-10-23 18:47           ` Eric Sandall
@ 2003-10-24  9:26           ` Nir Tzachar
  1 sibling, 0 replies; 63+ messages in thread
From: Nir Tzachar @ 2003-10-24  9:26 UTC (permalink / raw)
  To: Daniel Egger; +Cc: Eric Sandall, linux-kernel

> The last time I looked Coda was a horrible mess of a code, closely
> impossible to get it compile let alone configure and it seems to have
> the same interoperability problems like intermezzo i.e. it didn't work
> between i386<->powerpc. I haven't looked at Lustre light or srfs yet but
> I certainly welcome any fresh projects in the area of distributed or
> replicating filesystems.

Well, I hope you'll take a look at our file system, but keep in mind:
a) the code is in an early beta state.
b) srfs _cheats_ a bit, by enslaving a local file system.
c) the code is not as neat as I'd like it to be.
d) we haven't checked interoperability, since we don't have another platform,
   although srfs should work on any platform running Java.

cheers.
-- 
========================================================================
nir.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-23 13:58           ` Pavel Machek
@ 2003-10-24  9:28             ` Nir Tzachar
  0 siblings, 0 replies; 63+ messages in thread
From: Nir Tzachar @ 2003-10-24  9:28 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Eric Sandall, linux-fsdevel, linux-kernel

hi

> Yes, but perhaps differences can be localized to userspace daemon,
> having same kernel part for coda and srfs?
> That would be *good*.
> 

In essence, you're correct. We would have taken that approach if we were not
aiming at building a file system on top of an object storage. This
approach simplifies things a bit, and the kernel part is reduced.

-- 
========================================================================
nir.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24  9:21                   ` Nir Tzachar
@ 2003-10-24 12:08                     ` Matthew Wilcox
  2003-10-24 19:14                       ` Nir Tzachar
  2003-10-24 14:38                     ` Jan Harkes
  1 sibling, 1 reply; 63+ messages in thread
From: Matthew Wilcox @ 2003-10-24 12:08 UTC (permalink / raw)
  To: Nir Tzachar; +Cc: Pat LaVarre, bulb, jaharkes, linux-fsdevel

On Fri, Oct 24, 2003 at 11:21:38AM +0200, Nir Tzachar wrote:
> i think ur right, but even when u do succeed, lets take a more byzantine 
> approach: what if the time service is down? sabotaged? damaged? maybe it 
> lies through its teeth(ports) ?? maybe ur vector time is corrupted? maybe 
> a user deliberately changed his vector time - he will bring havoc upon 
> ur system .

uh, *vector time*, not real time.  Think CVS branches.

And if your server allows clients to corrupt it, then it's broken.  I doubt
Coda does that.

-- 
"It's not Hollywood.  War is real, war is primarily not about defeat or
victory, it is about death.  I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24  9:21                   ` Nir Tzachar
  2003-10-24 12:08                     ` Matthew Wilcox
@ 2003-10-24 14:38                     ` Jan Harkes
  2003-10-24 19:16                       ` Nir Tzachar
  1 sibling, 1 reply; 63+ messages in thread
From: Jan Harkes @ 2003-10-24 14:38 UTC (permalink / raw)
  To: Nir Tzachar; +Cc: Pat LaVarre, bulb, jaharkes, linux-fsdevel

On Fri, Oct 24, 2003 at 11:21:38AM +0200, Nir Tzachar wrote:
> i think ur right, but even when u do succeed, lets take a more byzantine 
> approach: what if the time service is down? sabotaged? damaged? maybe it 
> lies through its teeth(ports) ?? maybe ur vector time is corrupted? maybe 
> a user deliberately changed his vector time - he will bring havoc upon 
> ur system .

Ever heard of Lamport clocks?

A version vector is incremented on updates. So it doesn't matter whether
I changed the file on Saturday and you changed it on Sunday; when we
both return on Monday, the system _will_ detect that both of us have a
new version of the same original file and will consider it a conflict. It is
just another way of detecting version differences.

If a server has been off-line for a while, the versions on its files are
lower than those of files that were updated on the on-line servers. So
we see that it simply has an older version and we can (trivially)
resolve the conflict by forcing the new versions to the restored server.
This even works if the broken server had to be rebuilt from scratch and
has no data (i.e. all 'version-vectors' are all zeros).

But we don't need to have a majority of the servers available to perform
successful writes. It is just a different solution from yours, with
its own unique limitations (the limited length of the version vector limits
the maximal replication factor).

Jan


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.--OT
  2003-10-23 18:47           ` Eric Sandall
  2003-10-23 23:15             ` Daniel Egger
@ 2003-10-24 15:45             ` Gadgeteer
  1 sibling, 0 replies; 63+ messages in thread
From: Gadgeteer @ 2003-10-24 15:45 UTC (permalink / raw)
  To: linux-kernel

On Thursday 23 October 2003 12:47, Eric Sandall wrote:
> > I certainly welcome any fresh projects in the area of distributed or
> > replicating filesystems.

Just to interject here: the November issue of Linux Magazine has an article on
DRBD - Distributed Replicated Block Device (www.drbd.org).  I would be very
interested in thoughts/comments regarding this project.

thanks in advance,
ken

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 12:08                     ` Matthew Wilcox
@ 2003-10-24 19:14                       ` Nir Tzachar
  0 siblings, 0 replies; 63+ messages in thread
From: Nir Tzachar @ 2003-10-24 19:14 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Pat LaVarre, bulb, jaharkes, linux-fsdevel

> uh, *vector time*, not real time.  Think CVS branches.
> 
> And if your server allows clients to corrupt it, then it's broken.  I doubt
> Coda does that.

Maybe not intentionally, but as Murphy put it:

Anything that can go wrong will go wrong. 
If there is a possibility of several things going wrong, the one that will 
cause the most damage will be the one to go wrong. Corollary: If there is 
a worse time for something to go wrong, it will happen then. 
If anything simply cannot go wrong, it will anyway. 
If you perceive that there are four possible ways in which a procedure can 
go wrong, and circumvent these, then a fifth way, unprepared for, will 
promptly develop. 
Left to themselves, things tend to go from bad to worse. 
If everything seems to be going well, you have obviously overlooked 
something. 
Nature always sides with the hidden flaw. 
Mother nature is a bitch. 
It is impossible to make anything foolproof because fools are so ingenious

[ taken from http://dmawww.epfl.ch/roso.mosaic/dm/murphy.html ]

So, here you go... ;)

-- 
========================================================================
nir.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 14:38                     ` Jan Harkes
@ 2003-10-24 19:16                       ` Nir Tzachar
  2003-10-24 20:11                         ` Andreas Dilger
  2003-10-25  8:01                         ` Jan Hudec
  0 siblings, 2 replies; 63+ messages in thread
From: Nir Tzachar @ 2003-10-24 19:16 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Pat LaVarre, bulb, linux-fsdevel

> Ever heard of lamport clocks?
I know them as vector clocks, but yes.

> A version vector is incremented on updates. So it doesn't matter whether
> I changed the file on Saturday and you changed it on Sunday, when we
> both return on Monday the system _will_ detect that both of us have a
> new version of the same original file and considers it a conflict. It is
> just another way of detecting version differences.

I know what you're talking about, but the model Lamport uses does not fit
ours. Lamport's vector clocks are not self-stabilizing, and a corrupted
vector (intentionally or unintentionally) can break your system.
[Vector clocks are also not space bounded, although there are some
solutions to this problem.]

Let me give you an example:
let's say you connected your laptop to the Coda pool, worked on a file
locally, and then disconnected from the pool.
Now, before reconnecting your laptop to the pool, you accidentally dropped
it, and your hard disk got banged. As a result, some random bits on it
changed, and the saved vector clock, as well as the local counter and the
file's content, got corrupted.
Upon reconnection, the system will rightly decide your version is the
correct one, and will push your corrupted replica to all other servers.
You have broken system integrity.

There is a one in 10^gazillion chance of this happening, but this is
exactly what we're aiming at: don't take any chances (or prisoners).
If we don't agree on a file, nuke it. Let's stay on the safe side.




-- 
========================================================================
nir.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 19:16                       ` Nir Tzachar
@ 2003-10-24 20:11                         ` Andreas Dilger
  2003-10-24 20:24                           ` Pat LaVarre
  2003-10-24 20:53                           ` Nir Tzachar
  2003-10-25  8:01                         ` Jan Hudec
  1 sibling, 2 replies; 63+ messages in thread
From: Andreas Dilger @ 2003-10-24 20:11 UTC (permalink / raw)
  To: Nir Tzachar; +Cc: Jan Harkes, Pat LaVarre, bulb, linux-fsdevel

On Oct 24, 2003  21:16 +0200, Nir Tzachar wrote:
> there is a one to 10^gazillion chance of this happening, but this is 
> exactly what we're aiming at: dont take any chances (or prisoners). 
> if we dont agree on a file, nuke it. lets stay on the safe side. 

So, system stabilizes when there are no files left ;-).

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 20:11                         ` Andreas Dilger
@ 2003-10-24 20:24                           ` Pat LaVarre
  2003-10-24 20:38                             ` Andreas Dilger
  2003-10-24 20:53                           ` Nir Tzachar
  1 sibling, 1 reply; 63+ messages in thread
From: Pat LaVarre @ 2003-10-24 20:24 UTC (permalink / raw)
  To: adilger; +Cc: tzachar, jaharkes, bulb, linux-fsdevel

> >  a one to 10^gazillion chance of this happening,
> 
> So, system stabilizes when there are no files left ;-).

Anyone have a measure of how often these events actually do occur?

What little I've seen people say of how they design file and RAID
systems speaks as if HDD's reliably chose either to read back what you
wrote to them or else reported an error.

What about when the HDD actually reads back something else?

How can we know how commonly that occurs in practice, so that we can
know how often we're wrong to believe such things as our locally
recorded vector time?

Pat LaVarre



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 20:24                           ` Pat LaVarre
@ 2003-10-24 20:38                             ` Andreas Dilger
  2003-10-24 20:52                               ` Pat LaVarre
  0 siblings, 1 reply; 63+ messages in thread
From: Andreas Dilger @ 2003-10-24 20:38 UTC (permalink / raw)
  To: Pat LaVarre; +Cc: tzachar, jaharkes, bulb, linux-fsdevel

On Oct 24, 2003  14:24 -0600, Pat LaVarre wrote:
> > >  a one to 10^gazillion chance of this happening,
> > 
> > So, system stabilizes when there are no files left ;-).
> 
> Anyone have a measure of how often these events actually do occur?
> 
> What little I've seen people say of how they design file and RAID
> systems speaks as if HDD's reliably chose either to read back what you
> wrote to them or else reported an error.
> 
> What about when the HDD actually reads back something else?
> 
> How can we know how commonly that occurs in practice, so that we can
> know how often we're wrong to believe such things as our locally
> recorded vector time?

There are lots of ways to read back garbage from a disk unrelated to
physical HDD errors: memory errors, bad cables, software errors (driver,
fs, vm, etc), bad IDE DMA settings, power failures during write, etc...

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 20:38                             ` Andreas Dilger
@ 2003-10-24 20:52                               ` Pat LaVarre
  2003-10-24 21:00                                 ` Nir Tzachar
  2003-10-24 21:15                                 ` Andreas Dilger
  0 siblings, 2 replies; 63+ messages in thread
From: Pat LaVarre @ 2003-10-24 20:52 UTC (permalink / raw)
  To: adilger; +Cc: tzachar, jaharkes, bulb, linux-fsdevel

> There are lots of ways to read back garbage from a disk unrelated to
> physical HDD errors: memory errors, bad cables, software errors (driver,
> fs, vm, etc), bad IDE DMA settings, power failures during write, etc...

Yes, thank you for finding words to express that fact so much more
clearly than I did.

> There are lots of ways to read back garbage from a disk unrelated to
> physical HDD errors: memory errors, bad cables, software errors (driver,
> fs, vm, etc), bad IDE DMA settings, power failures during write, etc...

In particular, I see HDD's vary in their opinion of which cabling and
configuration and protocol is bad.  Therefore I ask:

"How can we know how commonly that occurs in practice, so that we can
know how often we're wrong to believe such things as our locally
recorded vector time?"

Is there, as yet, no linux filesystem that preserves the integrity of
the data and metadata despite such failures?  If I could write such an
fs on to a single directly-attached local drive, then I could measure
how often I myself experience such failures.

I'm confident I actually do experience these failures because often I
work in comparative, raid-like measures.  When I see one drive and
another disagree about what I wrote, then whenever I trust my write and
diff and read tools I must conclude one or both of the HDD's is wrong.

Pat LaVarre



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 20:11                         ` Andreas Dilger
  2003-10-24 20:24                           ` Pat LaVarre
@ 2003-10-24 20:53                           ` Nir Tzachar
  1 sibling, 0 replies; 63+ messages in thread
From: Nir Tzachar @ 2003-10-24 20:53 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Jan Harkes, Pat LaVarre, bulb, linux-fsdevel

> > exactly what we're aiming at: dont take any chances (or prisoners). 
> > if we dont agree on a file, nuke it. lets stay on the safe side. 
> 
> So, system stabilizes when there are no files left ;-).

You know, as a sysadmin I always told my boss that getting rid of all of 
our users would solve all of our problems <@-;)

-- 
========================================================================
nir.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 20:52                               ` Pat LaVarre
@ 2003-10-24 21:00                                 ` Nir Tzachar
  2003-10-24 21:22                                   ` Pat LaVarre
  2003-10-25  0:23                                   ` Bryan Henderson
  2003-10-24 21:15                                 ` Andreas Dilger
  1 sibling, 2 replies; 63+ messages in thread
From: Nir Tzachar @ 2003-10-24 21:00 UTC (permalink / raw)
  To: Pat LaVarre; +Cc: adilger, jaharkes, bulb, linux-fsdevel

> I'm confident I actually do experience these failures because often I
> work in comparative, raid-like measures.  When I see one drive and
> another disagree about what I wrote, then whenever I trust my write and
> diff and read tools I must conclude one or both of the HDD's is wrong.

So, won't a f/s that can guarantee (and prove) its stability be nice?
That's what we're aiming at ;)

Since these kind of errors are transient (meaning, in an infinite 
execution time only a finite number of errors occur), srfs should be 
capable of dealing with them.

 -- 
========================================================================
nir.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 20:52                               ` Pat LaVarre
  2003-10-24 21:00                                 ` Nir Tzachar
@ 2003-10-24 21:15                                 ` Andreas Dilger
  1 sibling, 0 replies; 63+ messages in thread
From: Andreas Dilger @ 2003-10-24 21:15 UTC (permalink / raw)
  To: Pat LaVarre; +Cc: tzachar, jaharkes, bulb, linux-fsdevel

On Oct 24, 2003  14:52 -0600, Pat LaVarre wrote:
> "How can we know how commonly that occurs in practice, so that we can
> know how often we're wrong to believe such things as our locally
> recorded vector time?"
> 
> Is there, as yet, no linux filesystem that preserves the integrity of
> the data and metadata despite such failures?  If I could write such an
> fs on to a single directly-attached local drive, then I could measure
> how often I myself experience such failures.
> 
> I'm confident I actually do experience these failures because often I
> work in comparative, raid-like measures.  When I see one drive and
> another disagree about what I wrote, then whenever I trust my write and
> diff and read tools I must conclude one or both of the HDD's is wrong.

I recall there being a loopback driver that will write a checksum for each
block written to the device into a separate block device (probably just
another loop device on a separate filesystem) so you could use that to
verify your data on each read.
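
As a userspace sketch of the same idea (keep a checksum per block on the
side and verify it on every read), something like the following would do.
The checksum choice and block size are arbitrary here, and this is not the
loopback driver mentioned above.

/* Per-block checksum sketch (userspace, illustrative only). */
#include <stddef.h>
#include <stdint.h>

#define BLKSIZE 4096

/* Adler-32 over one block; a real setup might prefer CRC32 or a
 * cryptographic hash. */
static uint32_t block_checksum(const unsigned char *blk)
{
	uint32_t a = 1, b = 0;
	size_t i;

	for (i = 0; i < BLKSIZE; i++) {
		a = (a + blk[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}

/* Store block_checksum() per block number on write; recompute and
 * compare before trusting the data on read. */
static int block_is_intact(const unsigned char *blk, uint32_t stored)
{
	return block_checksum(blk) == stored;
}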

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 21:00                                 ` Nir Tzachar
@ 2003-10-24 21:22                                   ` Pat LaVarre
  2003-10-24 23:03                                     ` Nir Tzachar
  2003-10-25  0:23                                   ` Bryan Henderson
  1 sibling, 1 reply; 63+ messages in thread
From: Pat LaVarre @ 2003-10-24 21:22 UTC (permalink / raw)
  To: tzachar; +Cc: adilger, jaharkes, bulb, linux-fsdevel

> > I'm confident I actually do experience these failures because often I
> > work in comparative, raid-like measures.  When I see one drive and
> > another disagree about what I wrote, then whenever I trust my write and
> > diff and read tools I must conclude one or both of the HDD's is wrong.
> 
> so, wont a f/s that can guarantee (and prove) its stability be nice?

Yes.

> thats what we're aiming at ;)
> 
> since these kind of errors are transient (meaning, in an infinite 
> execution time only a finite number of errors occur), srfs should be 
> capable to deal with'em.

Good.

Help the mass market more accurately measure how often the millions of
commodity HDD's actually do fail to read back what was written, and
you'll get noticed, I think.

We can't know until after we run this experiment?

We might actually discover that in fact quantifying the real experience
of HDD failure does give us numbers roughly equal to the more easily
repeated, carefully controlled, and therefore useless to me, laboratory
results that some folk prefer to publish.

Pat LaVarre



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 21:22                                   ` Pat LaVarre
@ 2003-10-24 23:03                                     ` Nir Tzachar
  0 siblings, 0 replies; 63+ messages in thread
From: Nir Tzachar @ 2003-10-24 23:03 UTC (permalink / raw)
  To: Pat LaVarre; +Cc: adilger, jaharkes, bulb, linux-fsdevel

> Help the mass market more accurately measure how often the millions of
> commodity HDD's actually do fail to read back what was written, 
the numbers are probably _very_ low.

> and you'll get noticed, I think.
I think I know where you're going, but I disagree.
Say you have a system, and you wish it to be operational with zero maintenance.
How else can this be achieved?

-- 
========================================================================
nir.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 21:00                                 ` Nir Tzachar
  2003-10-24 21:22                                   ` Pat LaVarre
@ 2003-10-25  0:23                                   ` Bryan Henderson
  2003-10-25 10:37                                     ` Nir Tzachar
  1 sibling, 1 reply; 63+ messages in thread
From: Bryan Henderson @ 2003-10-25  0:23 UTC (permalink / raw)
  To: Nir Tzachar; +Cc: adilger, bulb, jaharkes, linux-fsdevel, Pat LaVarre

>these kind of errors are transient (meaning, in an infinite 
>execution time only a finite number of errors occur)

You lost me here.  First, why is the number of errors finite when the 
execution time is infinite?  Second, how does that mean the errors are 
transient?  I'd think the relationship is exactly the opposite:  If the 
errors are permanent, then there's a limit to how many can occur 
regardless of time.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-24 19:16                       ` Nir Tzachar
  2003-10-24 20:11                         ` Andreas Dilger
@ 2003-10-25  8:01                         ` Jan Hudec
  1 sibling, 0 replies; 63+ messages in thread
From: Jan Hudec @ 2003-10-25  8:01 UTC (permalink / raw)
  To: Nir Tzachar; +Cc: Jan Harkes, Pat LaVarre, linux-fsdevel

On Fri, Oct 24, 2003 at 21:16:49 +0200, Nir Tzachar wrote:
> > Ever heard of lamport clocks?
> i know them as vector clocks, but yes.

So I'd expect you would get the point when I used the term "vector time"
;-)

-------------------------------------------------------------------------------
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Implementing writepage
  2003-10-20  9:12     ` srfs - a new file system Nir Tzachar
  2003-10-20 21:00       ` Eric Sandall
  2003-10-22  4:57       ` Erik Andersen
@ 2003-10-25  9:27       ` Charles Manning
  2003-10-25 16:18         ` David Woodhouse
  2 siblings, 1 reply; 63+ messages in thread
From: Charles Manning @ 2003-10-25  9:27 UTC (permalink / raw)
  To: linux-fsdevel

Hi

I'm the maintainer for YAFFS, the NAND-flash file system.

I've had readpage implemented for a long while to support read memory mapping 
(eg to execute a program).

I've also had prepare_write and commit_write implemented for a while, 
thinking this was sufficient to support write mmapping. Someone found that 
this is not the case.

I need to therefore implement writepage and have a few questions:

1) Is there a generic writepage lurking somewhere that will use 
prepare/commit_write instead?
2) What fiddling is required with kmap & page flags within writepage or is 
this all handled by the caller?

Any help appreciated.

Thanx

-- Charles

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: srfs - a new file system.
  2003-10-25  0:23                                   ` Bryan Henderson
@ 2003-10-25 10:37                                     ` Nir Tzachar
  0 siblings, 0 replies; 63+ messages in thread
From: Nir Tzachar @ 2003-10-25 10:37 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: adilger, bulb, jaharkes, linux-fsdevel, Pat LaVarre

> You lost me here.  First, why is the number of errors finite when the 
> execution time is infinite?  Second, how does that mean the errors are 
> transient?  I'd think the relationship is exactly the opposite:  If the 
> errors are permanent, then there's a limit to how many can occur 
> regardless of time.

First, transient means short-lived, not permanent (from the dictionary). 
Now, you can describe a model of a file system as an infinite execution of 
file operations. [We use infinite sequences to prove correctness.]
I'm not talking about execution time, but about an infinite number of 
separate operations.

When striving to achieve self-stabilization, you need to prove that as 
long as at some point you get no more errors (hence, transient 
errors), the system will stabilize and keep on working correctly.
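
Stated in the usual formal terms (this is the standard Dijkstra-style
definition of self-stabilization, not anything specific to srfs):

A system $S$ with a set of legal states $L$ is self-stabilizing iff
(i) \textit{convergence}: every execution that suffers only finitely many
transient faults reaches a state in $L$ within finitely many steps after
the last fault, and
(ii) \textit{closure}: every fault-free execution that starts in $L$
remains in $L$.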

-- 
========================================================================
nir.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-25  9:27       ` Implementing writepage Charles Manning
@ 2003-10-25 16:18         ` David Woodhouse
  2003-10-25 22:40           ` Charles Manning
  2003-10-27  8:34           ` Nikita Danilov
  0 siblings, 2 replies; 63+ messages in thread
From: David Woodhouse @ 2003-10-25 16:18 UTC (permalink / raw)
  To: manningc2; +Cc: linux-fsdevel

On Sat, 2003-10-25 at 22:27 +0000, Charles Manning wrote:
> I need to therefore implement writepage and have a few questions:

No you don't. This is flash -- people don't really need shared writable
mmap; if they think they do, they need educating not pandering to.

> 1) Is there a generic writepage lurking somewhere that will use 
> prepare/commit_write instead?

Don't think so. Offhand I don't see why it couldn't be done, but it's
not what most file systems would want.

> 2) What fiddling is required with kmap & page flags within writepage or is 
> this all handled by the caller?

You'll need to kmap the page since you actually want to touch the data
with the CPU. You probably also need to mark the page uptodate when
you're done. Take a look at the generic block writepage. 

> Any help appreciated.

Ensure you have no memory allocations in the writepage() code path,
unless they're done with (IIRC) GFP_NOIO.
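
A minimal sketch of that shape, for a filesystem that writes straight to
its own medium: example_write_chunk() is a made-up placeholder for
whatever actually pushes the data out (it is not a real YAFFS function),
and error handling and writeback accounting are elided.

#include <linux/fs.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

static int example_writepage(struct page *page, struct writeback_control *wbc)
{
	struct inode *inode = page->mapping->host;
	void *kaddr;
	int err;

	kaddr = kmap(page);		/* touch the data with the CPU */
	err = example_write_chunk(inode, page->index, kaddr, PAGE_CACHE_SIZE);
	kunmap(page);

	if (!err)
		SetPageUptodate(page);	/* mark the page uptodate when done */
	unlock_page(page);		/* writepage is entered with the page locked */
	return err;
}

Note that there are no memory allocations in this path, per the advice
above; a real implementation would also need proper error propagation.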

-- 
dwmw2


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-25 16:18         ` David Woodhouse
@ 2003-10-25 22:40           ` Charles Manning
  2003-10-26 10:25             ` David Woodhouse
  2003-10-27  8:34           ` Nikita Danilov
  1 sibling, 1 reply; 63+ messages in thread
From: Charles Manning @ 2003-10-25 22:40 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-fsdevel

On Sunday 26 October 2003 05:18, David Woodhouse wrote:
> On Sat, 2003-10-25 at 22:27 +0000, Charles Manning wrote:
> > I need to therefore implement writepage and have a few questions:
>
> No you don't. This is flash -- people don't really need shared writable
> mmap; if they think they do, they need educating not pandering to.

It's a matter of what software they're using. Debian apt-get is the current 
woesome application. Some people have vast YAFFS fs (512Mbytes or more) and 
are starting to think of it as a real disk equivalent rather than just a 
little pokey place to store a few configs. This means they want to run 
regular software.

>
> > 1) Is there a generic writepage lurking somewhere that will use
> > prepare/commit_write instead?
>
> Don't think so. Offhand I don't see why it couldn't be done, but it's
> not what most file systems would want.

It seems weird to me that generic_file_write uses the address_space ops 
prepare/commit_write via the page cache, yet mmap does not have a generic 
function to do the same, even though this might not be the best approach for 
efficiency.

>
> > 2) What fiddling is required with kmap & page flags within writepage or
> > is this all handled by the caller?
>
> You'll need to kmap the page since you actually want to touch the data
> with the CPU. You probably also need to mark the page uptodate when
> you're done. Take a look at the generic block writepage.
>
> > Any help appreciated.
>
> Ensure you have no memory allocations in the writepage() code path,
> unless they're done with (IIRC) GFP_NOIO.

Thanx

-- Charles

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-25 22:40           ` Charles Manning
@ 2003-10-26 10:25             ` David Woodhouse
  2003-10-26 15:28               ` Matthew Wilcox
  2003-10-26 20:54               ` Charles Manning
  0 siblings, 2 replies; 63+ messages in thread
From: David Woodhouse @ 2003-10-26 10:25 UTC (permalink / raw)
  To: manningc2; +Cc: linux-fsdevel

On Sun, 2003-10-26 at 11:40 +1300, Charles Manning wrote:
> It's a matter of what software they're using. Debian apt-get is the current 
> woesome application. Some people have vast YAFFS fs (512Mbytes or more) and 
> are starting to think of it as a real disk equivalent rather than just a 
> little pokey place to store a few configs. This means they want to run 
> regular software.

While obviously YAFFS exists because you don't always make the same
choices as me, I'd encourage you to resist calls to implement shared
writable mmap on flash, or at least to make it an optional feature which
is omitted by default. Otherwise, people might actually use it :)

The lifetime of flash is limited, and users should endeavour to reduce
the number of writes even if the media become big enough that the rest
of the pain of using flash is alleviated.

Shared writable mmap can cause a rewrite of a whole page every time a
single byte is changed; explicit writes are almost always going to be a
more efficient use of the flash.

Your file system is not broken; your application author is. Mend that
instead. :)

-- 
dwmw2



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-26 10:25             ` David Woodhouse
@ 2003-10-26 15:28               ` Matthew Wilcox
  2003-10-26 18:47                 ` Mark B
  2003-10-26 20:54               ` Charles Manning
  1 sibling, 1 reply; 63+ messages in thread
From: Matthew Wilcox @ 2003-10-26 15:28 UTC (permalink / raw)
  To: David Woodhouse; +Cc: manningc2, linux-fsdevel

On Sun, Oct 26, 2003 at 10:25:49AM +0000, David Woodhouse wrote:
> Your file system is not broken; your application author is. Mend that
> instead. :)

I don't see why every application should be rewritten for the needs of
the current generation of flash.  If this is such a problem for flash,
then maybe the filesystem should implement its own caching strategy
for these pages.  (Again, expecting the page cache to understand about
flash's special requirements is unreasonable.)

Your argument makes sense for the embedded market, but not for mainstream.

-- 
"It's not Hollywood.  War is real, war is primarily not about defeat or
victory, it is about death.  I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-26 15:28               ` Matthew Wilcox
@ 2003-10-26 18:47                 ` Mark B
  2003-10-26 20:40                   ` Charles Manning
  0 siblings, 1 reply; 63+ messages in thread
From: Mark B @ 2003-10-26 18:47 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel

On Sunday 26 October 2003 16:28, Matthew Wilcox wrote:
> On Sun, Oct 26, 2003 at 10:25:49AM +0000, David Woodhouse wrote:
> > Your file system is not broken; your application author is. Mend that
> > instead. :)
>
> I don't see why every application should be rewritten for the needs of
> the current generation of flash.  If this is such a problem for flash,
> then maybe the filesystem should implement its own caching strategy
> for these pages.  (Again, expecting the page cache to understand about
> flash's special requirements is unreasonable.)
>
> Your argument makes sense for the embedded market, but not for mainstream.

I agree,
and someone may even care to use expensive flash to keep the data 
secure, because in some cases the data is more valuable than the drive, but 
there is not enough data to justify a RAID or something; it's like using a 
cannon to kill a mosquito.
I'm saying this from experience, since I'm currently developing a filesystem 
for such a purpose, minimising writes and aligning them to the pages of 
the media + supporting small transactions across files (via ioctl hints by 
the app) + the usual journalling.

-- 
Mark Burazin 
mark@lemna.hr
---<>---<>---<>---<>---<>---<>---<>---<>---<>
Lemna d.o.o.
http://www.lemna.biz - info@lemna.hr
<>---<>---<>---<>---<>---<>---<>---<>---<>---



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-26 18:47                 ` Mark B
@ 2003-10-26 20:40                   ` Charles Manning
  2003-10-26 21:04                     ` David Woodhouse
  0 siblings, 1 reply; 63+ messages in thread
From: Charles Manning @ 2003-10-26 20:40 UTC (permalink / raw)
  To: Mark B, Matthew Wilcox; +Cc: linux-fsdevel

On Monday 27 October 2003 07:47, Mark B wrote:
> On Sunday 26 October 2003 16:28, Matthew Wilcox wrote:
> > On Sun, Oct 26, 2003 at 10:25:49AM +0000, David Woodhouse wrote:
> > > Your file system is not broken; your application author is. Mend that
> > > instead. :)
> >
> > I don't see why every application should be rewritten for the needs of
> > the current generation of flash.  If this is such a problem for flash,
> > then maybe the filesystem should implement its own caching strategy
> > for these pages.  (Again, expecting the page cache to understand about
> > flash's special requirements is unreasonable.)
> >
> > Your argument makes sense for the embedded market, but not for
> > mainstream.
>
> I agree,
> and someone maybe even cares to use the expensive flashs to keep the data
> secure, because in some cases the data is more valuable then the drive, but
> there are not so much data for use a RAID or something, it's like using a
> cannon to kill a mosquito.
> I'm telling this from expirience, since I'm currently developing a
> filesystem for such a purpose, with minimising writes and aligning them to
> the pages of the media + supporting small transactions across files (via
> ioctl hints by the app) + usual journalling.

OK fellas, we've all had a fine time disagreeing with David Woodhouse's take 
on this; now can someone tell me how to do it? 

Thanx

-- Charles


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-26 10:25             ` David Woodhouse
  2003-10-26 15:28               ` Matthew Wilcox
@ 2003-10-26 20:54               ` Charles Manning
  1 sibling, 0 replies; 63+ messages in thread
From: Charles Manning @ 2003-10-26 20:54 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-fsdevel

On Sunday 26 October 2003 23:25, David Woodhouse wrote:
> On Sun, 2003-10-26 at 11:40 +1300, Charles Manning wrote:
> > It's a matter of what software they're using. Debian apt-get is the
> > current woesome application. Some people have vast YAFFS fs (512Mbytes or
> > more) and are starting to think of it as a real disk equivalent rather
> > than just a little pokey place to store a few configs. This means they
> > want to run regular software.
>
> While obviously YAFFS exists because you don't always make the same
> choices as me, I'd encourage you to resist calls to implement shared
> writable mmap on flash, or at least to make it an optional feature which
> is omitted by default. Otherwise, people might actually use it :)
>
> The life time of flash is limited, and users should endeavour to reduce
> the number of writes even if the media become big enough that the rest
> of the pain of using flash is alleviated.
>
> Shared writable mmap can cause a rewrite of a whole page every time a
> single byte is changed; explicit writes are almost always going to be a
> more efficient use of the flash.
>
> Your file system is not broken; your application author is. Mend that
> instead. :)

I disagree, David, to an extent. Sure, a change of a single byte can cause the 
rewrite of a whole page, but you can also cause the same thing with a poorly 
written application using write(); mmap is not inherently at fault. I do, 
however, agree that mmap is, in general, not the best way to construct apps 
for flash friendliness.

I think too that the attitude to tell people to go rewrite apps for flash is 
not really GoodForm(tm). Doing limited mmap with a few apps here and there is 
not going to hurt YAFFS.  Why embed Linux if you're going to have to scratch 
through all utils to pull out the mmaps?

Yes, the lifetime of flash is limited, but I've done accelerated lifetime 
tests (30GB or so of writes) and others have done way more than this (twenty 
or 30 times).  NAND flash, with YAFFS, is unlikely to wear out in most 
embedded usage scenarios, by an order of magnitude or so. Sure, you could 
craft an atypical "killer app".

I think it appropriate that YAFFS supports mmap. Now what I want to know is 
how to implement it.

BTW: In case others on this list think I'm slagging off at David, you're much 
mistaken. I hold him and his work in high regard. It would be a boring world 
if we all agreed.

-- Charles



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-26 20:40                   ` Charles Manning
@ 2003-10-26 21:04                     ` David Woodhouse
  0 siblings, 0 replies; 63+ messages in thread
From: David Woodhouse @ 2003-10-26 21:04 UTC (permalink / raw)
  To: manningc2; +Cc: Mark B, Matthew Wilcox, linux-fsdevel

On Mon, 2003-10-27 at 09:40 +1300, Charles Manning wrote:
> Ok fellas, we've all had a fine time disagreeing with David Woodhouse's take 
> on this, now can someome tell me how to do it? 

Sorry, I thought we'd already done that. Make your writepage write out
the page, without allocating memory (at least with GFP_KERNEL), then
mark the offending page uptodate and unlock it.

You said in private mail you were looking at smbfs. That looks like it's
perfectly sufficient to let you implement your own. 
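
In terms of wiring, the new method just sits alongside the existing pair in
the address_space_operations; the handler names below are placeholders for
whatever the filesystem already calls its own implementations.

static struct address_space_operations example_aops = {
	.readpage	= example_readpage,
	.writepage	= example_writepage,	/* shared writable mmap + VM writeback */
	.prepare_write	= example_prepare_write,
	.commit_write	= example_commit_write,
};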

-- 
dwmw2



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-25 16:18         ` David Woodhouse
  2003-10-25 22:40           ` Charles Manning
@ 2003-10-27  8:34           ` Nikita Danilov
  2003-10-27  8:39             ` David Woodhouse
  1 sibling, 1 reply; 63+ messages in thread
From: Nikita Danilov @ 2003-10-27  8:34 UTC (permalink / raw)
  To: David Woodhouse; +Cc: manningc2, linux-fsdevel

David Woodhouse writes:
 > On Sat, 2003-10-25 at 22:27 +0000, Charles Manning wrote:
 > > I need to therefore implement writepage and have a few questions:
 > 
 > No you don't. This is flash -- people don't really need shared writable
 > mmap; if they think they do, they need educating not pandering to.

Note that ->writepage() is used not only by mmap() (actually, it is only
used by mmap() if the file system doesn't provide its own
->writepages()). ->writepage() is used by the VM to write pages in response
to memory pressure (see mm/vmscan.c:shrink_list()). Every
well-behaving file system has to provide ->writepage() for this purpose.

 > 

[...]

 > 
 > -- 
 > dwmw2
 > 

Nikita.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-27  8:34           ` Nikita Danilov
@ 2003-10-27  8:39             ` David Woodhouse
  2003-10-27  8:43               ` Nikita Danilov
  0 siblings, 1 reply; 63+ messages in thread
From: David Woodhouse @ 2003-10-27  8:39 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: manningc2, linux-fsdevel

On Mon, 2003-10-27 at 11:34 +0300, Nikita Danilov wrote:
> Note that ->writepage() is used not only by mmap() (actually, it is only
> used by mmap() if the file system doesn't provide its own
> ->writepages()). ->writepage() is used by the VM to write pages in response
> to memory pressure (see mm/vmscan.c:shrink_list()). Every
> well-behaving file system has to provide ->writepage() for this purpose.

How do you get dirty but still file-backed pages if you don't get them
by shared writable mmap?

-- 
dwmw2


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-27  8:39             ` David Woodhouse
@ 2003-10-27  8:43               ` Nikita Danilov
  2003-10-27  8:46                 ` David Woodhouse
  0 siblings, 1 reply; 63+ messages in thread
From: Nikita Danilov @ 2003-10-27  8:43 UTC (permalink / raw)
  To: David Woodhouse; +Cc: manningc2, linux-fsdevel

David Woodhouse writes:
 > On Mon, 2003-10-27 at 11:34 +0300, Nikita Danilov wrote:
 > > Note that ->writepage() is used not only by mmap() (actually, it is only
 > > used by mmap() if the file system doesn't provide its own
 > > ->writepages()). ->writepage() is used by the VM to write pages in response
 > > to memory pressure (see mm/vmscan.c:shrink_list()). Every
 > > well-behaving file system has to provide ->writepage() for this purpose.
 > 
 > How do you get dirty but still file-backed pages if you don't get them
 > by shared writable mmap?

By write(2)? Maybe I am missing something in this discussion, though.

 > 
 > -- 
 > dwmw2
 > 

Nikita.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-27  8:43               ` Nikita Danilov
@ 2003-10-27  8:46                 ` David Woodhouse
  2003-10-27  8:52                   ` Nikita Danilov
  0 siblings, 1 reply; 63+ messages in thread
From: David Woodhouse @ 2003-10-27  8:46 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: manningc2, linux-fsdevel

On Mon, 2003-10-27 at 11:43 +0300, Nikita Danilov wrote:
> By write(2)? May be I am missing something in this discussion, though.

Either your commit_write() is synchronous or it does what writepage()
would do anyway... which is to start the I/O but not wait for it.

-- 
dwmw2


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-27  8:46                 ` David Woodhouse
@ 2003-10-27  8:52                   ` Nikita Danilov
  2003-10-27  9:06                     ` David Woodhouse
  0 siblings, 1 reply; 63+ messages in thread
From: Nikita Danilov @ 2003-10-27  8:52 UTC (permalink / raw)
  To: David Woodhouse; +Cc: manningc2, linux-fsdevel

David Woodhouse writes:
 > On Mon, 2003-10-27 at 11:43 +0300, Nikita Danilov wrote:
 > > By write(2)? May be I am missing something in this discussion, though.
 > 
 > Either your commit_write() is synchronous or it does what writepage()
 > would do anyway... which is to start the I/O but not wait for it.

I don't quite follow why. generic_commit_write() only marks buffers
dirty. Actual IO is started by either ->writepages() called from within
balance_dirty_pages() (or pdflush) or by ->writepage() called by VM
scanner.

 > 
 > -- 
 > dwmw2
 > 

Nikita.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-27  8:52                   ` Nikita Danilov
@ 2003-10-27  9:06                     ` David Woodhouse
  2003-10-27  9:08                       ` David Woodhouse
  0 siblings, 1 reply; 63+ messages in thread
From: David Woodhouse @ 2003-10-27  9:06 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: manningc2, linux-fsdevel

On Mon, 2003-10-27 at 11:52 +0300, Nikita Danilov wrote:
> I don't quite follow why. generic_commit_write() only marks buffers
> dirty. Actual IO is started by either ->writepages() called from within
> balance_dirty_pages() (or pdflush) or by ->writepage() called by VM
> scanner.

Charles isn't using generic_commit_write(); this is not a traditional
block-device-backed file system.

I strongly suspect his commit_write() is in fact synchronous.
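
For contrast with generic_commit_write(), a synchronous commit_write for a
filesystem with its own backing store might look roughly like this;
example_write_range() is a made-up placeholder and error handling is
elided, so treat it as a guess at the shape rather than anyone's real code.

#include <linux/fs.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>

static int example_commit_write(struct file *file, struct page *page,
				unsigned from, unsigned to)
{
	struct inode *inode = page->mapping->host;
	void *kaddr = kmap(page);
	int err;

	/* Write the dirtied byte range straight through to the medium
	 * instead of just marking buffers dirty for later writeback. */
	err = example_write_range(inode, page->index, kaddr + from, to - from);
	kunmap(page);
	return err;
}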

-- 
dwmw2


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Implementing writepage
  2003-10-27  9:06                     ` David Woodhouse
@ 2003-10-27  9:08                       ` David Woodhouse
  0 siblings, 0 replies; 63+ messages in thread
From: David Woodhouse @ 2003-10-27  9:08 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: manningc2, linux-fsdevel

Sorry, I should be clearer...

On Mon, 2003-10-27 at 09:06 +0000, David Woodhouse wrote:
> On Mon, 2003-10-27 at 11:52 +0300, Nikita Danilov wrote:
> > I don't quite follow why. generic_commit_write() only marks buffers
> > dirty. Actual IO is started by either ->writepages() called from within
> > balance_dirty_pages() (or pdflush) or by ->writepage() called by VM
> > scanner.

Ah yes, you are probably right in the case of block device file systems;
I missed that. But...

> Charles isn't using generic_commit_write(); this is not a traditional
> block-device-backed file system.
> 
> I strongly suspect his commit_write() is in fact synchronous.

-- 
dwmw2


^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2003-10-27  9:08 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-10-07  4:11 fs test suite Pat LaVarre
2003-10-07 14:17 ` Randy.Dunlap
2003-10-07 14:59   ` Zachary Peterson
2003-10-07 17:16     ` Randy.Dunlap
2003-10-07 18:54       ` Pat LaVarre
2003-10-07 18:58         ` Randy.Dunlap
2003-10-07 19:26           ` Pat LaVarre
2003-10-20  9:12     ` srfs - a new file system Nir Tzachar
2003-10-20 21:00       ` Eric Sandall
2003-10-21 12:07         ` Nir Tzachar
2003-10-21 14:29           ` Brian Beattie
2003-10-21 16:59           ` Jan Harkes
2003-10-23 13:58           ` Pavel Machek
2003-10-24  9:28             ` Nir Tzachar
2003-10-23 17:46         ` Daniel Egger
2003-10-23 18:47           ` Eric Sandall
2003-10-23 23:15             ` Daniel Egger
2003-10-24 15:45             ` srfs - a new file system.--OT Gadgeteer
2003-10-24  9:26           ` srfs - a new file system Nir Tzachar
2003-10-22  4:57       ` Erik Andersen
2003-10-22 10:16         ` Nir Tzachar
2003-10-22 14:22           ` Jan Harkes
2003-10-23  7:50             ` Nir Tzachar
2003-10-23 12:33               ` Jan Hudec
2003-10-23 20:12                 ` Pat LaVarre
2003-10-24  9:21                   ` Nir Tzachar
2003-10-24 12:08                     ` Matthew Wilcox
2003-10-24 19:14                       ` Nir Tzachar
2003-10-24 14:38                     ` Jan Harkes
2003-10-24 19:16                       ` Nir Tzachar
2003-10-24 20:11                         ` Andreas Dilger
2003-10-24 20:24                           ` Pat LaVarre
2003-10-24 20:38                             ` Andreas Dilger
2003-10-24 20:52                               ` Pat LaVarre
2003-10-24 21:00                                 ` Nir Tzachar
2003-10-24 21:22                                   ` Pat LaVarre
2003-10-24 23:03                                     ` Nir Tzachar
2003-10-25  0:23                                   ` Bryan Henderson
2003-10-25 10:37                                     ` Nir Tzachar
2003-10-24 21:15                                 ` Andreas Dilger
2003-10-24 20:53                           ` Nir Tzachar
2003-10-25  8:01                         ` Jan Hudec
2003-10-22 10:21         ` Nir Tzachar
2003-10-22 16:05         ` Valdis.Kletnieks
2003-10-22 19:38           ` Erik Andersen
2003-10-23  5:20             ` Miles Bader
2003-10-23  5:37               ` Valdis.Kletnieks
2003-10-25  9:27       ` Implementing writepage Charles Manning
2003-10-25 16:18         ` David Woodhouse
2003-10-25 22:40           ` Charles Manning
2003-10-26 10:25             ` David Woodhouse
2003-10-26 15:28               ` Matthew Wilcox
2003-10-26 18:47                 ` Mark B
2003-10-26 20:40                   ` Charles Manning
2003-10-26 21:04                     ` David Woodhouse
2003-10-26 20:54               ` Charles Manning
2003-10-27  8:34           ` Nikita Danilov
2003-10-27  8:39             ` David Woodhouse
2003-10-27  8:43               ` Nikita Danilov
2003-10-27  8:46                 ` David Woodhouse
2003-10-27  8:52                   ` Nikita Danilov
2003-10-27  9:06                     ` David Woodhouse
2003-10-27  9:08                       ` David Woodhouse
