linux-kernel.vger.kernel.org archive mirror
* what is our answer to ZFS?
@ 2005-11-21  9:28 Alfred Brons
  2005-11-21  9:44 ` Paulo Jorge Matos
  0 siblings, 1 reply; 74+ messages in thread
From: Alfred Brons @ 2005-11-21  9:28 UTC (permalink / raw)
  To: linux-kernel

Hi All,

I just noticed in the news this link:

http://www.opensolaris.org/os/community/zfs/demos/basics

I wonder what our response to this beast would be?

btw, you can try it live using the Nexenta
GNU/Solaris LiveCD at
http://www.gnusolaris.org/gswiki/Download which is
an Ubuntu-based OpenSolaris
distribution.

So what is ZFS?

ZFS is a new kind of filesystem that provides simple
administration, transactional semantics, end-to-end
data integrity, and immense scalability. ZFS is not an
incremental improvement to existing technology; it is
a fundamentally new approach to data management. We've
blown away 20 years of obsolete assumptions,
eliminated complexity at the source, and created a
storage system that's actually a pleasure to use.

ZFS presents a pooled storage model that completely
eliminates the concept of volumes and the associated
problems of partitions, provisioning, wasted bandwidth
and stranded storage. Thousands of filesystems can
draw from a common storage pool, each one consuming
only as much space as it actually needs.
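
For illustration, creating a pool and carving filesystems out of it
looks roughly like this (the pool and filesystem names below are made
up; the commands follow Sun's ZFS documentation as I understand it):

    # pool/dataset names are examples only
    zpool create tank mirror c0t0d0 c1t0d0
    zfs create tank/home
    zfs create tank/home/alfred

Every filesystem in the pool draws on the pool's free space; there is
no per-filesystem partitioning or sizing step.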

All operations are copy-on-write transactions, so the
on-disk state is always valid. There is no need to
fsck(1M) a ZFS filesystem, ever. Every block is
checksummed to prevent silent data corruption, and the
data is self-healing in replicated (mirrored or RAID)
configurations.

ZFS provides unlimited constant-time snapshots and
clones. A snapshot is a read-only point-in-time copy
of a filesystem, while a clone is a writable copy of a
snapshot. Clones provide an extremely space-efficient
way to store many copies of mostly-shared data such as
workspaces, software installations, and diskless
clients.
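
Roughly, a snapshot and a writable clone of it are taken like this
(dataset and snapshot names invented):

    # dataset/snapshot names are examples only
    zfs snapshot tank/home/alfred@monday
    zfs clone tank/home/alfred@monday tank/home/alfred-work

The clone initially shares all of its blocks with the snapshot and
only consumes space for blocks that are later modified.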

ZFS administration is both simple and powerful. The
tools are designed from the ground up to eliminate all
the traditional headaches relating to managing
filesystems. Storage can be added, disks replaced, and
data scrubbed with straightforward commands.
Filesystems can be created instantaneously, snapshots
and clones taken, native backups made, and a
simplified property mechanism allows for setting of
quotas, reservations, compression, and more.
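
Some examples of the commands being referred to (pool and filesystem
names are again invented; syntax per Sun's documentation):

    # names are examples only
    zpool add tank mirror c2t0d0 c3t0d0    # add storage to the pool
    zpool replace tank c0t0d0 c4t0d0       # replace a disk
    zpool scrub tank                       # verify every checksum
    zfs set quota=10G tank/home/alfred     # per-filesystem quota
    zfs set compression=on tank/home       # enable compression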

Alfred


		

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21  9:28 what is our answer to ZFS? Alfred Brons
@ 2005-11-21  9:44 ` Paulo Jorge Matos
  2005-11-21  9:59   ` Alfred Brons
  0 siblings, 1 reply; 74+ messages in thread
From: Paulo Jorge Matos @ 2005-11-21  9:44 UTC (permalink / raw)
  To: Alfred Brons; +Cc: linux-kernel

Check Tarkan's "Sun ZFS and Linux" topic from 18 Nov on this mailing list:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113235728212352&w=2

Cheers,

Paulo Matos

On 21/11/05, Alfred Brons <alfredbrons@yahoo.com> wrote:
> Hi All,
>
> I just noticed in the news this link:
>
> http://www.opensolaris.org/os/community/zfs/demos/basics
>
> I wonder what our response to this beast would be?
>
> btw, you can try it live using the Nexenta
> GNU/Solaris LiveCD at
> http://www.gnusolaris.org/gswiki/Download which is
> an Ubuntu-based OpenSolaris
> distribution.
>
> So what is ZFS?
>
> ZFS is a new kind of filesystem that provides simple
> administration, transactional semantics, end-to-end
> data integrity, and immense scalability. ZFS is not an
> incremental improvement to existing technology; it is
> a fundamentally new approach to data management. We've
> blown away 20 years of obsolete assumptions,
> eliminated complexity at the source, and created a
> storage system that's actually a pleasure to use.
>
> ZFS presents a pooled storage model that completely
> eliminates the concept of volumes and the associated
> problems of partitions, provisioning, wasted bandwidth
> and stranded storage. Thousands of filesystems can
> draw from a common storage pool, each one consuming
> only as much space as it actually needs.
>
> All operations are copy-on-write transactions, so the
> on-disk state is always valid. There is no need to
> fsck(1M) a ZFS filesystem, ever. Every block is
> checksummed to prevent silent data corruption, and the
> data is self-healing in replicated (mirrored or RAID)
> configurations.
>
> ZFS provides unlimited constant-time snapshots and
> clones. A snapshot is a read-only point-in-time copy
> of a filesystem, while a clone is a writable copy of a
> snapshot. Clones provide an extremely space-efficient
> way to store many copies of mostly-shared data such as
> workspaces, software installations, and diskless
> clients.
>
> ZFS administration is both simple and powerful. The
> tools are designed from the ground up to eliminate all
> the traditional headaches relating to managing
> filesystems. Storage can be added, disks replaced, and
> data scrubbed with straightforward commands.
> Filesystems can be created instantaneously, snapshots
> and clones taken, native backups made, and a
> simplified property mechanism allows for setting of
> quotas, reservations, compression, and more.
>
> Alfred
>
>
>
> __________________________________
> Yahoo! FareChase: Search multiple travel sites in one click.
> http://farechase.yahoo.com
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


--
Paulo Jorge Matos - pocm at sat inesc-id pt
Web: http://sat.inesc-id.pt/~pocm
Computer and Software Engineering
INESC-ID - SAT Group

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21  9:44 ` Paulo Jorge Matos
@ 2005-11-21  9:59   ` Alfred Brons
  2005-11-21 10:08     ` Bernd Petrovitsch
                       ` (3 more replies)
  0 siblings, 4 replies; 74+ messages in thread
From: Alfred Brons @ 2005-11-21  9:59 UTC (permalink / raw)
  To: pocm; +Cc: linux-kernel

Thanks Paulo!
I wasn't aware of this thread.

But my question was: do we have similar functionality
in the Linux kernel?

Taking into account that ZFS is available as 100% open
source, I'm starting to think about migrating some of
my servers to Nexenta OS just because of this feature...

Alfred

--- Paulo Jorge Matos <pocmatos@gmail.com> wrote:

> Check Tarkan "Sun ZFS and Linux" topic on 18th Nov, 
> on this mailing list.
>
http://marc.theaimsgroup.com/?l=linux-kernel&m=113235728212352&w=2
> 
> Cheers,
> 
> Paulo Matos
> 
> On 21/11/05, Alfred Brons <alfredbrons@yahoo.com>
> wrote:
> > Hi All,
> >
> > I just noticed in the news this link:
> >
> >
>
http://www.opensolaris.org/os/community/zfs/demos/basics
> >
> > I wonder what our response to this beast would be?
> >
> > btw, you can try it live using the Nexenta
> > GNU/Solaris LiveCD at
> > http://www.gnusolaris.org/gswiki/Download which is
> > an Ubuntu-based OpenSolaris
> > distribution.
> >
> > So what is ZFS?
> >
> > ZFS is a new kind of filesystem that provides
> simple
> > administration, transactional semantics,
> end-to-end
> > data integrity, and immense scalability. ZFS is
> not an
> > incremental improvement to existing technology; it
> is
> > a fundamentally new approach to data management.
> We've
> > blown away 20 years of obsolete assumptions,
> > eliminated complexity at the source, and created a
> > storage system that's actually a pleasure to use.
> >
> > ZFS presents a pooled storage model that
> completely
> > eliminates the concept of volumes and the
> associated
> > problems of partitions, provisioning, wasted
> bandwidth
> > and stranded storage. Thousands of filesystems can
> > draw from a common storage pool, each one
> consuming
> > only as much space as it actually needs.
> >
> > All operations are copy-on-write transactions, so
> the
> > on-disk state is always valid. There is no need to
> > fsck(1M) a ZFS filesystem, ever. Every block is
> > checksummed to prevent silent data corruption, and
> the
> > data is self-healing in replicated (mirrored or
> RAID)
> > configurations.
> >
> > ZFS provides unlimited constant-time snapshots and
> > clones. A snapshot is a read-only point-in-time
> copy
> > of a filesystem, while a clone is a writable copy
> of a
> > snapshot. Clones provide an extremely
> space-efficient
> > way to store many copies of mostly-shared data
> such as
> > workspaces, software installations, and diskless
> > clients.
> >
> > ZFS administration is both simple and powerful.
> The
> > tools are designed from the ground up to eliminate
> all
> > the traditional headaches relating to managing
> > filesystems. Storage can be added, disks replaced,
> and
> > data scrubbed with straightforward commands.
> > Filesystems can be created instantaneously,
> snapshots
> > and clones taken, native backups made, and a
> > simplified property mechanism allows for setting
> of
> > quotas, reservations, compression, and more.
> >
> > Alfred
> >
> >
> >
> > __________________________________
> > Yahoo! FareChase: Search multiple travel sites in
> one click.
> > http://farechase.yahoo.com
> > -
> > To unsubscribe from this list: send the line
> "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at 
> http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
> 
> 
> --
> Paulo Jorge Matos - pocm at sat inesc-id pt
> Web: http://sat.inesc-id.pt/~pocm
> Computer and Software Engineering
> INESC-ID - SAT Group
> -
> To unsubscribe from this list: send the line
> "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at 
> http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 



	
		

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21  9:59   ` Alfred Brons
@ 2005-11-21 10:08     ` Bernd Petrovitsch
  2005-11-21 10:16     ` Andreas Happe
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 74+ messages in thread
From: Bernd Petrovitsch @ 2005-11-21 10:08 UTC (permalink / raw)
  To: Alfred Brons; +Cc: pocm, linux-kernel

On Mon, 2005-11-21 at 01:59 -0800, Alfred Brons wrote:
[...]
> But my question was: do we have similar functionality
> in Linux kernel?

From reading over the marketing stuff, it seems that they now have LVM2
+ a journalling filesystem + some more nice-to-have features developed
(more or less) from scratch (or not?).
The Reiser folks would have to comment on how this differs (or not) from
an LVM2 + reiser4 combination.
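
For comparison, the rough Linux equivalent of "one pool, many
filesystems" today would be something like this (device and volume
names invented), with the difference that each logical volume is
statically sized and has to be grown together with the filesystem:

    # device/volume names are examples only
    pvcreate /dev/sdb1 /dev/sdc1
    vgcreate vg0 /dev/sdb1 /dev/sdc1
    lvcreate -L 100G -n data vg0
    mkfs.ext3 /dev/vg0/data
    mount /dev/vg0/data /srv/data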

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21  9:59   ` Alfred Brons
  2005-11-21 10:08     ` Bernd Petrovitsch
@ 2005-11-21 10:16     ` Andreas Happe
  2005-11-21 11:30       ` Anton Altaparmakov
  2005-11-21 10:19     ` Jörn Engel
  2005-11-21 11:45     ` Diego Calleja
  3 siblings, 1 reply; 74+ messages in thread
From: Andreas Happe @ 2005-11-21 10:16 UTC (permalink / raw)
  To: linux-kernel

On 2005-11-21, Alfred Brons <alfredbrons@yahoo.com> wrote:
> Thanks Paulo!
> I wasn't aware of this thread.
>
> But my question was: do we have similar functionality
> in Linux kernel?

>>> Every block is checksummed to prevent silent data corruption,
>>> and the data is self-healing in replicated (mirrored or RAID)
>>> configurations.

should not be filesystem specific.

>>> ZFS provides unlimited constant-time snapshots and clones. A
>>> snapshot is a read-only point-in-time copy of a filesystem, while a
>>> clone is a writable copy of a snapshot. Clones provide an extremely
>>> space-efficient way to store many copies of mostly-shared data such
>>> as workspaces, software installations, and diskless clients.

LVM2 can do those too (with any filesystem that supports resizing).
Clones would correspond to the snapshot functionality of LVM2.
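
For instance (volume names invented), a writable LVM2 snapshot of an
existing volume is created with something like:

    # volume names are examples only
    lvcreate --size 1G --snapshot --name homesnap /dev/vg0/home

The snapshot only consumes space for blocks that change afterwards,
which is roughly what the clone description above is selling.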

>>> ZFS administration is both simple and powerful.  The tools are
>>> designed from the ground up to eliminate all the traditional
>>> headaches relating to managing filesystems. Storage can be added,
>>> disks replaced, and data scrubbed with straightforward commands.

lvm2.

>>> Filesystems can be created instantaneously, snapshots and clones
>>> taken, native backups made, and a simplified property mechanism
>>> allows for setting of quotas, reservations, compression, and more.

Except for per-file compression, all of that should be doable with the
normal in-kernel filesystems. Per-file compression may be doable with
ext2 plus special patches, an overlay filesystem, or reiser4.

Andreas


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21  9:59   ` Alfred Brons
  2005-11-21 10:08     ` Bernd Petrovitsch
  2005-11-21 10:16     ` Andreas Happe
@ 2005-11-21 10:19     ` Jörn Engel
  2005-11-21 11:46       ` Matthias Andree
                         ` (3 more replies)
  2005-11-21 11:45     ` Diego Calleja
  3 siblings, 4 replies; 74+ messages in thread
From: Jörn Engel @ 2005-11-21 10:19 UTC (permalink / raw)
  To: Alfred Brons; +Cc: pocm, linux-kernel

On Mon, 21 November 2005 01:59:15 -0800, Alfred Brons wrote:
> 
> I wasn't aware of this thread.
> 
> But my question was: do we have similar functionality
> in Linux kernel?

If you have a simple, technical list of the functionality, your
question will be easily answered.  I still haven't found the time to
dig for all the information underneath the marketing blurb.

o Checksums for data blocks
  Done by jffs2, not done by any hard disk filesystems I'm aware of.

o Snapshots
  Use device mapper.
  Some log structured filesystems are also under development.  For
  them, snapshots will be trivial to add.  But they don't really exist
  yet.  (I barely consider reiser4 to exist.  Any filesystem that is
  not considered good enough for kernel inclusion is effectively still
  in development phase.)

o Merge of LVM and filesystem layer
  Not done.  This has some advantages, but also more complexity than
  separate LVM and filesystem layers.  Might be considered "not worth
  it" for some years.

o 128 bit
  On 32bit machines, you can't even fully utilize a 64bit filesystem
  without VFS changes.  Have you ever noticed?  Thought so.

o other
  Dunno what else they do.  There are the official marketing feature
  lists, but those are rather useless for comparisons.

Jörn

-- 
Measure. Don't tune for speed until you've measured, and even then
don't unless one part of the code overwhelms the rest.
-- Rob Pike

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 10:16     ` Andreas Happe
@ 2005-11-21 11:30       ` Anton Altaparmakov
  0 siblings, 0 replies; 74+ messages in thread
From: Anton Altaparmakov @ 2005-11-21 11:30 UTC (permalink / raw)
  To: Andreas Happe; +Cc: linux-kernel

On Mon, 21 Nov 2005, Andreas Happe wrote:
> On 2005-11-21, Alfred Brons <alfredbrons@yahoo.com> wrote:
> > Thanks Paulo!
> > I wasn't aware of this thread.
> >
> > But my question was: do we have similar functionality
> > in Linux kernel?
[snip]
> >>> Filesystems can be created instantaneously, snapshots and clones
> >>> taken, native backups made, and a simplified property mechanism
> >>> allows for setting of quotas, reservations, compression, and more.
> 
> Except for per-file compression, all of that should be doable with the
> normal in-kernel filesystems. Per-file compression may be doable with
> ext2 plus special patches, an overlay filesystem, or reiser4.

NTFS has per-file compression although I admit that in Linux this is 
read-only at present (mostly because it is low priority).

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21  9:59   ` Alfred Brons
                       ` (2 preceding siblings ...)
  2005-11-21 10:19     ` Jörn Engel
@ 2005-11-21 11:45     ` Diego Calleja
  2005-11-21 14:19       ` Tarkan Erimer
  2005-11-21 18:17       ` Rob Landley
  3 siblings, 2 replies; 74+ messages in thread
From: Diego Calleja @ 2005-11-21 11:45 UTC (permalink / raw)
  To: Alfred Brons; +Cc: pocm, linux-kernel

On Mon, 21 Nov 2005 01:59:15 -0800 (PST),
Alfred Brons <alfredbrons@yahoo.com> wrote:

> Thanks Paulo!
> I wasn't aware of this thread.
> 
> But my question was: do we have similar functionality
> in Linux kernel?
> 
> Taking into account that ZFS is available as 100% open
> source, I'm starting to think about migrating some of
> my servers to Nexenta OS just because of this feature...



There are some rumors saying that Sun might be considering a Linux port.

http://www.sun.com/emrkt/campaign_docs/expertexchange/knowledge/solaris_zfs_gen.html#10

Q: Any thoughts on porting ZFS to Linux, AIX, or HPUX?
A: No plans of porting to AIX and HPUX. Porting to Linux is currently
being investigated. 

(personally I doubt it, that FAQ was written some time ago and Sun's
executives change their opinion more often than Linus does ;)

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 10:19     ` Jörn Engel
@ 2005-11-21 11:46       ` Matthias Andree
  2005-11-21 12:07         ` Kasper Sandberg
  2005-11-21 11:59       ` Diego Calleja
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 74+ messages in thread
From: Matthias Andree @ 2005-11-21 11:46 UTC (permalink / raw)
  To: linux-kernel

On Mon, 21 Nov 2005, Jörn Engel wrote:

> o Checksums for data blocks
>   Done by jffs2, not done by any hard disk filesystems I'm aware of.

Then allow me to point you to the Amiga file systems. The variants
commonly dubbed "Old File System" use only 448 (IIRC) out of 512 bytes
in a data block for payload and put their block chaining information,
checksum and other "interesting" things into the blocks. This helps
recoverability a lot but kills performance, so many people (used to) use
the "Fast File System" that uses the full 512 bytes for data blocks.

Whether the Amiga FFS, even with multi-user and directory index updates,
has a lot of importance today, is a different question that you didn't
pose :-)

>   yet.  (I barely consider reiser4 to exist.  Any filesystem that is
>   not considered good enough for kernel inclusion is effectively still
>   in development phase.)

What the heck is reiserfs? I faintly recall some weirdo crap that broke
NFS throughout the better part of 2.2 and 2.4 and would slowly write
junk into its structures that reiserfsck could only fix months later.

ReiserFS 3.6 still doesn't work right (you cannot create an arbitrary
number of arbitrary filenames in any one directory even if there's
sufficient space), and after a while in production there are still
random flaws in the file systems that then require a rebuild-tree that
only works halfway. No thanks.

Why would ReiserFS 4 be any different? IMO reiserfs4 should be blocked
from kernel baseline until:

- reiserfs 3.6 is fully fixed up

- reiserfs 4 has been debugged in production outside the kernel for at
  least 24 months with a reasonable installed base, by for instance a
  large distro using it for the root fs

- there are guarantees that reiserfs 4 will be maintained until the EOL
  of the kernel branch it is included into, rather than the current "oh
  we have a new toy and don't give a shit about 3.6" behavior.

Harsh words, I know, but either version of reiserfs is totally out of
the game while I have the systems administrator hat on, and the recent
fuss between Namesys and Christoph Hellwig certainly doesn't raise my
trust in reiserfs.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 10:19     ` Jörn Engel
  2005-11-21 11:46       ` Matthias Andree
@ 2005-11-21 11:59       ` Diego Calleja
  2005-11-22  7:51       ` Christoph Hellwig
  2005-11-28 12:53       ` Lars Marowsky-Bree
  3 siblings, 0 replies; 74+ messages in thread
From: Diego Calleja @ 2005-11-21 11:59 UTC (permalink / raw)
  To: Jörn Engel; +Cc: alfredbrons, pocm, linux-kernel

On Mon, 21 Nov 2005 11:19:59 +0100,
Jörn Engel <joern@wohnheim.fh-wedel.de> wrote:

> question will be easily answered.  I still haven't found the time to
> dig for all the information underneath the marketing blurb.

Me neither, but now that we are talking about marketing impact, have
people run benchmarks on it? (I'd do it myself but downloading an ISO
over a dialup link takes some time 8)

I've found numbers against other kernels:
http://mail-index.netbsd.org/tech-perform/2005/11/18/0000.html
http://blogs.sun.com/roller/page/roch?entry=zfs_to_ufs_performance_comparison
http://blogs.sun.com/roller/page/erickustarz?entry=fs_perf_201_postmark
http://blogs.sun.com/roller/page/erickustarz?entry=fs_perf_102_filesystem_bw

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 11:46       ` Matthias Andree
@ 2005-11-21 12:07         ` Kasper Sandberg
  2005-11-21 13:18           ` Matthias Andree
  0 siblings, 1 reply; 74+ messages in thread
From: Kasper Sandberg @ 2005-11-21 12:07 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

On Mon, 2005-11-21 at 12:46 +0100, Matthias Andree wrote:
> On Mon, 21 Nov 2005, Jörn Engel wrote:
> 
> > o Checksums for data blocks
> >   Done by jffs2, not done by any hard disk filesystems I'm aware of.
> 
> Then allow me to point you to the Amiga file systems. The variants
> commonly dubbed "Old File System" use only 448 (IIRC) out of 512 bytes
> in a data block for payload and put their block chaining information,
> checksum and other "interesting" things into the blocks. This helps
> recoverability a lot but kills performance, so many people (used to) use
> the "Fast File System" that uses the full 512 bytes for data blocks.
> 
> Whether the Amiga FFS, even with multi-user and directory index updates,
> has a lot of importance today, is a different question that you didn't
> pose :-)
> 
> >   yet.  (I barely consider reiser4 to exist.  Any filesystem that is
> >   not considered good enough for kernel inclusion is effectively still
> >   in development phase.)
That isn't true. Just because it isn't following the kernel coding style
and therefore has to be changed does not make it any bit more unstable.


> 
> What the heck is reiserfs? I faintly recall some weirdo crap that broke
> NFS throughout the better parts of 2.2 and 2.4, would slowly write junk
> into its structures that reiserfsck could only fix months later.
Well... I remember that Linux 2.6.0 had a lot of bugs; is 2.6.14 still
crap because those particular bugs are fixed now?

> 
> ReiserFS 3.6 still doesn't work right (you cannot create an arbitrary
> amount of arbitrary filenames in any one directory even if there's
> sufficient space), after a while in production, still random flaws in
> the file systems that then require rebuild-tree that works only halfway.
> No thanks.
I have used reiserfs for a long time and have never had a problem that
required me to use rebuild-tree, nor have issues requiring other actions
come up, unless I had been hard rebooting/shutting down, in which case
the journal simply replayed a few transactions.

> 
> Why would ReiserFS 4 be any different? IMO reiserfs4 should be blocked
> from kernel baseline until:

You seem to believe that reiser4 (note, reiser4, NOT reiserfs4) is just
some simple new revision of reiserfs. Well, guess what: it's an entirely
different filesystem, which, before they began the changes to have it
merged, was completely stable, and I have confidence that it will be
just as stable again soon.

> 
> - reiserfs 3.6 is fully fixed up
> 
So you are saying that if for some reason the VIA IDE driver for old
chipsets is broken, we can't merge a VIA IDE driver for new IDE
controllers?

> - reiserfs 4 has been debugged in production outside the kernel for at
>   least 24 months with a reasonable installed base, by for instance a
>   large distro using it for the root fs
No distro will ever use it (except perhaps Linspire) before it's
included in the kernel.
> 
> - there are guarantees that reiserfs 4 will be maintained until the EOL
>   of the kernel branch it is included into, rather than the current "oh
>   we have a new toy and don't give a shit about 3.6" behavior.
Why do you think that reiser4 will not be maintained? If there are bugs
in 3.6, Hans is still interested, but really, do you expect him to still
spend all his time trying to find bugs in 3.6, when people don't seem to
have issues, and while he in fact has created an entirely new
filesystem?
> 
> Harsh words, I know, but either version of reiserfs is totally out of
> the game while I have the systems administrator hat on, and the recent
> fuss between Namesys and Christoph Hellwig certainly doesn't raise my
> trust in reiserfs.
So you are saying that if two people don't get along, the product the
one person creates somehow falls in quality?
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 12:07         ` Kasper Sandberg
@ 2005-11-21 13:18           ` Matthias Andree
  2005-11-21 14:18             ` Kasper Sandberg
  2005-11-21 20:48             ` jdow
  0 siblings, 2 replies; 74+ messages in thread
From: Matthias Andree @ 2005-11-21 13:18 UTC (permalink / raw)
  To: linux-kernel

On Mon, 21 Nov 2005, Kasper Sandberg wrote:

> On Mon, 2005-11-21 at 12:46 +0100, Matthias Andree wrote:
> > On Mon, 21 Nov 2005, Jörn Engel wrote:
> > 
> > > o Checksums for data blocks
> > >   Done by jffs2, not done by any hard disk filesystems I'm aware of.
> > 
> > Then allow me to point you to the Amiga file systems. The variants
> > commonly dubbed "Old File System" use only 448 (IIRC) out of 512 bytes

Make that 488. Amiga's traditional file system loses 6 longs (at 32 bit
each) according to Ralph Babel's "The Amiga Guru Book".

> > in a data block for payload and put their block chaining information,
> > checksum and other "interesting" things into the blocks. This helps
> > recoverability a lot but kills performance, so many people (used to) use
> > the "Fast File System" that uses the full 512 bytes for data blocks.
> > 
> > Whether the Amiga FFS, even with multi-user and directory index updates,
> > has a lot of importance today, is a different question that you didn't
> > pose :-)
> > 
> > >   yet.  (I barely consider reiser4 to exist.  Any filesystem that is
> > >   not considered good enough for kernel inclusion is effectively still
> > >   in development phase.)

> that isnt true, just because it isnt following the kernel coding style
> and therefore has to be changed, does not make it any bit more unstable.

If the precondition is "adhere to CodingStyle or you don't get it in",
and the CodingStyle has been established for years, I have zero sympathy
with the maintainer if he's told "no, you didn't follow that well-known
style".

> > What the heck is reiserfs? I faintly recall some weirdo crap that broke
> > NFS throughout the better parts of 2.2 and 2.4, would slowly write junk
> > into its structures that reiserfsck could only fix months later.
> well.. i remember that linux 2.6.0 had alot of bugs, is 2.6.14 still
> crap because those particular bugs are fixed now?

Of course not. The point is, it will take many months to shake out the
bugs that are still in it, and they will only be revealed as it is
tested in more diverse configurations.

> > ReiserFS 3.6 still doesn't work right (you cannot create an arbitrary
> > amount of arbitrary filenames in any one directory even if there's
> > sufficient space), after a while in production, still random flaws in
> > the file systems that then require rebuild-tree that works only halfway.
> > No thanks.

> i have used reiserfs for a long time, and have never had the problem
> that i was required to use rebuild-tree, not have issues requiring other
> actions come, unless i have been hard rebooting/shutting down, in which
> case the journal simply replayed a few transactions.

I have had, without hard shutdowns, problems with reiserfs, and
occasionally problems that couldn't be fixed easily. I have never had
such with ext3 on the same hardware.

> > Why would ReiserFS 4 be any different? IMO reiserfs4 should be blocked
> > from kernel baseline until:
> 
> you seem to believe that reiser4 (note, reiser4, NOT reiserfs4) is just
> some simple new revision of reiserfs. well guess what, its an entirely
> different filesystem, which before they began the changes to have it
> merged, was completely stable, and i have confidence that it will be
> just as stable again soon.

I don't care what its name is. I am aware it is a rewrite, and that is
reason to be all the more chary about adopting it early. People believed
3.5 to be stable, too, before someone tried NFS...

Historical fact is, ext3fs was very usable already in the later 0.0.2x
versions, and pretty stable in 0.0.7x, where x is some letter. All that
happened was applying some polish to make it shine, and that it does.

reiserfs was declared stable and then the problems only began. Certainly
merging kernel-space NFS was an additional obstacle at that time, so we
may speak in favor of Namesys because reiserfs was merging into a moving
target.

However, as reiser4 is a major (or full) rewrite, I won't consider it
for anything except perhaps /var/cache before 2H2007.

> > - reiserfs 3.6 is fully fixed up
> 
> so you are saying that if for some reason the via ide driver for old
> chipsets are broken, we cant merge a via ide driver for new ide
> controllers?

More generally, quality should be the prime directive. And before the
reiser4 guys focus on getting their gear merged and then the many bugs
shaken out (there will be bugs found), they should have a chance to
reschedule their internal work to get 3.6 fixed. If they can't, well,
time to mark it DEPRECATED before the new work is merged, and the new
stuff should be marked EXPERIMENTAL for a year.

> > - reiserfs 4 has been debugged in production outside the kernel for at
> >   least 24 months with a reasonable installed base, by for instance a
> >   large distro using it for the root fs
> no dist will ever use (except perhaps linspire) before its included in
> the kernel.

So you think? I beg to differ. SUSE have adopted reiserfs pretty early,
and it has never shown the promised speed advantages over ext[23]fs in
my testing. SUSE have adopted submount, which also still lives outside
the kernel AFAIK.

> > - there are guarantees that reiserfs 4 will be maintained until the EOL
> >   of the kernel branch it is included into, rather than the current "oh
> >   we have a new toy and don't give a shit about 3.6" behavior.
> why do you think that reiser4 will not be maintained? if there are bugs
> in 3.6 hans is still interrested, but really, do you expect him to still
> spend all the time trying to find bugs in 3.6, when people dont seem to

I do expect Namesys to fix the *known* bugs, such as hash table overflow
preventing creation of new files. See above about DEPRECATED.

As long as reiserfs 3.6 and/or reiser 4 are standalone projects that
live outside the kernel, nobody cares, but I think pushing forward to
adoption into the kernel baseline constitutes a commitment to maintaining
the code.

> have issues, and while he in fact has created an entirely new
> filesystem.

Yup. So the test and fix cycles that were needed for reiserfs 3.5 and
3.6 will start all over. I hope the Namesys guys were so clueful as to
run all their reiserfs 3.X regression tests against 4.X with all
plugins and switches, too.

> > Harsh words, I know, but either version of reiserfs is totally out of
> > the game while I have the systems administrator hat on, and the recent
> > fuss between Namesys and Christoph Hellwig certainly doesn't raise my
> > trust in reiserfs.
> so you are saying that if two people doesent get along the product the
> one person creates somehow falls in quality?

I wrote "trust", not "quality".

Part of my aversion against stuff that bears "reiser" in its name is the
way it is supposed to be merged upstream, and there Namesys is a bit
lacking. After all, they want their pet in the kernel; it's not that the
kernel wants reiser4.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 13:18           ` Matthias Andree
@ 2005-11-21 14:18             ` Kasper Sandberg
  2005-11-21 14:41               ` Matthias Andree
  2005-11-21 22:41               ` Bill Davidsen
  2005-11-21 20:48             ` jdow
  1 sibling, 2 replies; 74+ messages in thread
From: Kasper Sandberg @ 2005-11-21 14:18 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

On Mon, 2005-11-21 at 14:18 +0100, Matthias Andree wrote:
> On Mon, 21 Nov 2005, Kasper Sandberg wrote:
> 
> > On Mon, 2005-11-21 at 12:46 +0100, Matthias Andree wrote:
> > > On Mon, 21 Nov 2005, Jörn Engel wrote:
> > > 
> > > > o Checksums for data blocks
> > > >   Done by jffs2, not done by any hard disk filesystems I'm aware of.
> > > 
> > > Then allow me to point you to the Amiga file systems. The variants
> > > commonly dubbed "Old File System" use only 448 (IIRC) out of 512 bytes
> 
> Make that 488. Amiga's traditional file system loses 6 longs (at 32 bit
> each) according to Ralph Babel's "The Amiga Guru Book".
> 
> > > in a data block for payload and put their block chaining information,
> > > checksum and other "interesting" things into the blocks. This helps
> > > recoverability a lot but kills performance, so many people (used to) use
> > > the "Fast File System" that uses the full 512 bytes for data blocks.
> > > 
> > > Whether the Amiga FFS, even with multi-user and directory index updates,
> > > has a lot of importance today, is a different question that you didn't
> > > pose :-)
> > > 
> > > >   yet.  (I barely consider reiser4 to exist.  Any filesystem that is
> > > >   not considered good enough for kernel inclusion is effectively still
> > > >   in development phase.)
> 
> > that isnt true, just because it isnt following the kernel coding style
> > and therefore has to be changed, does not make it any bit more unstable.
> 
> If the precondition is "adhere to CodingStyle or you don't get it in",
> and the CodingStyle has been established for years, I have zero sympathy
> with the maintainer if he's told "no, you didn't follow that well-known
> style".

That was not the question; the question is whether the code is in the
development phase or not (being stable or not). Agreed, it's their own
fault for not writing code which matches the kernel coding style, but
that doesn't make it the least bit more unstable.

> 
> > > What the heck is reiserfs? I faintly recall some weirdo crap that broke
> > > NFS throughout the better parts of 2.2 and 2.4, would slowly write junk
> > > into its structures that reiserfsck could only fix months later.
> > well.. i remember that linux 2.6.0 had alot of bugs, is 2.6.14 still
> > crap because those particular bugs are fixed now?
> 
> Of course not. The point is, it will take many months to shake the bugs
> out that are still in it and will only be revealed as it is tested in
> more diverse configurations.

> > > ReiserFS 3.6 still doesn't work right (you cannot create an arbitrary
> > > amount of arbitrary filenames in any one directory even if there's
> > > sufficient space), after a while in production, still random flaws in
> > > the file systems that then require rebuild-tree that works only halfway.
> > > No thanks.
> 
> > i have used reiserfs for a long time, and have never had the problem
> > that i was required to use rebuild-tree, not have issues requiring other
> > actions come, unless i have been hard rebooting/shutting down, in which
> > case the journal simply replayed a few transactions.
> 
> I have had, without hard shutdowns, problems with reiserfs, and
> occasionally problems that couldn't be fixed easily. I have never had
> such with ext3 on the same hardware.
> 
You wouldn't want to know what ext3 did to me, which reiserfs AND reiser4
never did.

> > > Why would ReiserFS 4 be any different? IMO reiserfs4 should be blocked
> > > from kernel baseline until:
> > 
> > you seem to believe that reiser4 (note, reiser4, NOT reiserfs4) is just
> > some simple new revision of reiserfs. well guess what, its an entirely
> > different filesystem, which before they began the changes to have it
> > merged, was completely stable, and i have confidence that it will be
> > just as stable again soon.
> 
> I don't care what its name is. I am aware it is a rewrite, and that is
> reason to be all the more chary about adopting it early. People believed
> 3.5 to be stable, too, before someone tried NFS...
NFS works fine with reiser4. You are judging reiser4 by the problems
reiserfs had.
> 
> Historical fact is, ext3fs was very usable already in the later 0.0.2x
> versions, and pretty stable in 0.0.7x, where x is some letter. All that
> happened was applying some polish to make it shine, and that it does.
> 
> reiserfs was declared stable and then the problems only began. Certainly
> merging kernel-space NFS was an additional obstacle at that time, so we
> may speak in favor of Namesys because reiserfs was into a merging
> target.
> 
> However, as reiser4 is a major (or full) rewrite, I won't consider it
> for anything except perhaps /var/cache before 2H2007.
> 
I have had less trouble using the reiser4 patches, before even Hans
considered them stable, than I had using ext3.

> > > - reiserfs 3.6 is fully fixed up
> > 
> > so you are saying that if for some reason the via ide driver for old
> > chipsets are broken, we cant merge a via ide driver for new ide
> > controllers?
> 
> More generally, quality should be the prime directive. And before the
> reiser4 guys focus on getting their gear merged and then the many bugs
> shaken out (there will be bugs found), they should have a chance to
> reschedule their internal work to get 3.6 fixed. If they can't, well,
> time to mark it DEPRECATED before the new work is merged, and the new
> stuff should be marked EXPERIMENTAL for a year.
So then mark reiser4 experimental, as Namesys themselves wanted.

> 
> > > - reiserfs 4 has been debugged in production outside the kernel for at
> > >   least 24 months with a reasonable installed base, by for instance a
> > >   large distro using it for the root fs
> > no dist will ever use (except perhaps linspire) before its included in
> > the kernel.
> 
> So you think? I beg to differ. SUSE have adopted reiserfs pretty early,
> and it has never shown the promised speed advantages over ext[23]fs in
> my testing. SUSE have adopted submount, which also still lives outside
> the kernel AFAIK.
There is quite a big difference between stuff like submount and the
filesystem itself... and as you pointed out, reiserfs in the beginning
was a disappointment; do you seriously think they are willing to take
the chance again?

> 
> > > - there are guarantees that reiserfs 4 will be maintained until the EOL
> > >   of the kernel branch it is included into, rather than the current "oh
> > >   we have a new toy and don't give a shit about 3.6" behavior.
> > why do you think that reiser4 will not be maintained? if there are bugs
> > in 3.6 hans is still interrested, but really, do you expect him to still
> > spend all the time trying to find bugs in 3.6, when people dont seem to
> 
> I do expect Namesys to fix the *known* bugs, such as hash table overflow
> preventing creation of new files. See above about DEPRECATED.
> 
reiser4 is meant to be better than reiserfs, which is perhaps also one
reason he wants it merged? But agreed, known bugs should be fixed.

> As long as reiserfs 3.6 and/or reiser 4 are standalone projects that
> live outside the kernel, nobody cares, but I think pushing forward to
> adoption into kernel baseline consistutes a commitment to maintaining
> the code.
> 
> > have issues, and while he in fact has created an entirely new
> > filesystem.
> 
> Yup. So the test and fix cycles that were needed for reiserfs 3.5 and
> 3.6 will start all over. I hope the Namesys guys were to clueful as to
> run all  their reiserfs 3.X regression tests against 4.X with all
> plugins and switches, too.
You will find that reiser4 is actually very, very good.
> 
> > > Harsh words, I know, but either version of reiserfs is totally out of
> > > the game while I have the systems administrator hat on, and the recent
> > > fuss between Namesys and Christoph Hellwig certainly doesn't raise my
> > > trust in reiserfs.
> > so you are saying that if two people doesent get along the product the
> > one person creates somehow falls in quality?
> 
> I wrote "trust", not "quality".
my bad.
> 
> Part of my aversion against stuff that bears "reiser" in its name is the
> way how it is supposed to be merged upstream, and there Namesys is a bit
> lacking. After all, they want their pet in the kernel, not the kernel
> wants reiser4.
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 11:45     ` Diego Calleja
@ 2005-11-21 14:19       ` Tarkan Erimer
  2005-11-21 18:52         ` Rob Landley
  2005-11-21 18:17       ` Rob Landley
  1 sibling, 1 reply; 74+ messages in thread
From: Tarkan Erimer @ 2005-11-21 14:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: Diego Calleja

On 11/21/05, Diego Calleja <diegocg@gmail.com> wrote:
>
> There're some rumors saying that sun might be considering a linux port.
>
> http://www.sun.com/emrkt/campaign_docs/expertexchange/knowledge/solaris_zfs_gen.html#10
>
> Q: Any thoughts on porting ZFS to Linux, AIX, or HPUX?
> A: No plans of porting to AIX and HPUX. Porting to Linux is currently
> being investigated.
>
> (personally I doubt it, that FAQ was written some time ago and Sun's
> executives change their opinion more often than Linus does ;)

If it happened, Sun or someone else would port it to Linux.
We will need some VFS changes to handle a 128-bit FS, as Jörn Engel
mentioned in a previous mail in this thread. Is there any plan or action
to make the VFS handle 128-bit file systems like ZFS or future 128-bit
file systems? Could any VFS people reply to this, please?

Regards

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 14:18             ` Kasper Sandberg
@ 2005-11-21 14:41               ` Matthias Andree
  2005-11-21 15:08                 ` Kasper Sandberg
  2005-11-21 22:41               ` Bill Davidsen
  1 sibling, 1 reply; 74+ messages in thread
From: Matthias Andree @ 2005-11-21 14:41 UTC (permalink / raw)
  To: linux-kernel

On Mon, 21 Nov 2005, Kasper Sandberg wrote:

> > If the precondition is "adhere to CodingStyle or you don't get it in",
> > and the CodingStyle has been established for years, I have zero sympathy
> > with the maintainer if he's told "no, you didn't follow that well-known
> > style".
> 
> that was not the question, the question is if the code is in development
> phase or not (being stable or not), where agreed, its their own fault
> for not writing code which matches the kernel in coding style, however
> that doesent make it the least bit more unstable.

As mentioned, a file system cannot possibly be stable right after a
merge. Having to change formatting is a sweeping change, and it
certainly is a barrier across which auditing becomes all the more
difficult.

> > I have had, without hard shutdowns, problems with reiserfs, and
> > occasionally problems that couldn't be fixed easily. I have never had
> > such with ext3 on the same hardware.
> > 
> you wouldnt want to know what ext3 did to me, which reiserfs AND reiser4
> never did

OK, we have diametrically opposed experiences, and I'm not asking,
since I trust you that I don't want to know either :) Let's leave it
at that.

> > I don't care what its name is. I am aware it is a rewrite, and that is
> > reason to be all the more chary about adopting it early. People believed
> > 3.5 to be stable, too, before someone tried NFS...

> nfs works fine with reiser4. you are judging reiser4 by the problems
> reiserfs had.

Of course I do, same project lead, and probably many of the same
developers. While they may (and probably will) learn from mistakes,
changing style is more difficult - and that resulted in one of the major
non-acceptance reasons reiser4 suffered.

I won't subscribe to reiser4 specific topics before I've tried it, so
I'll quit. Same about ZFS by the way, it'll be fun some day to try on a
machine that it can trash at will, but for production, it will have to
prove itself first. After all, Sun are still fixing ufs and/or logging
bugs in Solaris 8. (And that's good, they still fix things, and it also
shows how long it takes to really get a file system stable.)

> i have had less trouble by using the reiser4 patches before even hans
> considered it stable than i had by using ext3.

Lucky you. I haven't dared try it yet for lack of a test computer to
trash.

> there is quite a big difference between stuff like submount and the
> filesystem itself.. and as you pointed out, reiserfs in the beginning
> was a disappointment, do you seriously think they are willing to take
> the chance again?

I think naught about what they're going to put at stake. reiserfs 3 was
an utter failure for me. It was raved about, hyped, and the bottom line
was wasted time and a major disappointment.

> > Yup. So the test and fix cycles that were needed for reiserfs 3.5 and
> > 3.6 will start all over. I hope the Namesys guys were to clueful as to
> > run all  their reiserfs 3.X regression tests against 4.X with all
> > plugins and switches, too.
> you will find that reiser4 is actually very very good.

I haven't asked what I'd find, because I'm not searching. And I might
find something other than you did - perhaps because you'll have picked
up all the good things already by the time I finally go there ;-)

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 14:41               ` Matthias Andree
@ 2005-11-21 15:08                 ` Kasper Sandberg
  2005-11-22  8:52                   ` Matthias Andree
  0 siblings, 1 reply; 74+ messages in thread
From: Kasper Sandberg @ 2005-11-21 15:08 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

On Mon, 2005-11-21 at 15:41 +0100, Matthias Andree wrote:
> On Mon, 21 Nov 2005, Kasper Sandberg wrote:
> 
> > > If the precondition is "adhere to CodingStyle or you don't get it in",
> > > and the CodingStyle has been established for years, I have zero sympathy
> > > with the maintainer if he's told "no, you didn't follow that well-known
> > > style".
> > 
> > that was not the question, the question is if the code is in development
> > phase or not (being stable or not), where agreed, its their own fault
> > for not writing code which matches the kernel in coding style, however
> > that doesent make it the least bit more unstable.
> 
> As mentioned, a file system cannot possibly be stable right after merge.
> Having to change formatting is a sweeping change and certainly is a
> barrier across which to look for auditing is all the more difficult.
Before reiser4 was changed a lot to match the CodingStyle (agreed, they
have to abide by the kernel's CodingStyle), it was stable, so had it
been merged then, it wouldn't have been any less stable.

> 
> > > I have had, without hard shutdowns, problems with reiserfs, and
> > > occasionally problems that couldn't be fixed easily. I have never had
> > > such with ext3 on the same hardware.
> > > 
> > you wouldnt want to know what ext3 did to me, which reiserfs AND reiser4
> > never did
> 
> OK, we have diametral experiences, and I'm not asking since I trust you
> that I don't want to know, too :) Let's leave it at that.
> 
> > > I don't care what its name is. I am aware it is a rewrite, and that is
> > > reason to be all the more chary about adopting it early. People believed
> > > 3.5 to be stable, too, before someone tried NFS...
> 
> > nfs works fine with reiser4. you are judging reiser4 by the problems
> > reiserfs had.
> 
> Of course I do, same project lead, and probably many of the same
> developers. While they may (and probably will) learn from mistakes,
> changing style is more difficult - and that resulted in one of the major
> non-acceptance reasons reiser4 suffered.
> 
> I won't subscribe to reiser4 specific topics before I've tried it, so
> I'll quit. Same about ZFS by the way, it'll be fun some day to try on a
> machine that it can trash at will, but for production, it will have to
> prove itself first. After all, Sun are still fixing ufs and/or logging
> bugs in Solaris 8. (And that's good, they still fix things, and it also
> shows how long it takes to really get a file system stable.)
> 
> > i have had less trouble by using the reiser4 patches before even hans
> > considered it stable than i had by using ext3.
> 
> Lucky you. I haven't dared try it yet for lack of a test computer to
> trash.
I too was reluctant; I ended up using it for the things I REALLY don't
want to lose.
> 
> > there is quite a big difference between stuff like submount and the
> > filesystem itself.. and as you pointed out, reiserfs in the beginning
> > was a disappointment, do you seriously think they are willing to take
> > the chance again?
> 
> I thing naught about what they're going to put at stake. reiserfs 3 was
> an utter failure for me. It was raved about, hyped, and the bottom line
> was wasted time and a major disappointment.
> 
> > > Yup. So the test and fix cycles that were needed for reiserfs 3.5 and
> > > 3.6 will start all over. I hope the Namesys guys were to clueful as to
> > > run all  their reiserfs 3.X regression tests against 4.X with all
> > > plugins and switches, too.
> > you will find that reiser4 is actually very very good.
> 
> I haven't asked what I'd find, because I'm not searching. And I might
> find something else than you did - perhaps because you've picked up all
> the good things already when I'll finally go there ;-)
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 11:45     ` Diego Calleja
  2005-11-21 14:19       ` Tarkan Erimer
@ 2005-11-21 18:17       ` Rob Landley
  1 sibling, 0 replies; 74+ messages in thread
From: Rob Landley @ 2005-11-21 18:17 UTC (permalink / raw)
  To: Diego Calleja; +Cc: Alfred Brons, pocm, linux-kernel

On Monday 21 November 2005 05:45, Diego Calleja wrote:
> On Mon, 21 Nov 2005 01:59:15 -0800 (PST),
> There are some rumors saying that Sun might be considering a Linux port.
>
> http://www.sun.com/emrkt/campaign_docs/expertexchange/knowledge/solaris_zfs
>_gen.html#10
>
> Q: Any thoughts on porting ZFS to Linux, AIX, or HPUX?
> A: No plans of porting to AIX and HPUX. Porting to Linux is currently
> being investigated.

Translation: We'd like to dangle a carrot in front of Linux users in hopes 
they'll try out this feature and possibly get interested in switching to 
Solaris because of it.  Don't hold your breath on us actually shipping 
anything.  But we didn't open source Solaris due to competitive pressure from 
AIX or HPUX users, so they don't even get the carrot.

Rob

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 14:19       ` Tarkan Erimer
@ 2005-11-21 18:52         ` Rob Landley
  2005-11-21 19:28           ` Diego Calleja
                             ` (5 more replies)
  0 siblings, 6 replies; 74+ messages in thread
From: Rob Landley @ 2005-11-21 18:52 UTC (permalink / raw)
  To: Tarkan Erimer; +Cc: linux-kernel, Diego Calleja

On Monday 21 November 2005 08:19, Tarkan Erimer wrote:
> On 11/21/05, Diego Calleja <diegocg@gmail.com> wrote:
> If It happenned, Sun or someone has port it to linux.
> We will need some VFS changes to handle 128 bit FS as "Jörn ENGEL"
> mentionned previous mail in this thread. Is there any plan or action
> to make VFS handle 128 bit File Sytems like ZFS or future 128 bit
> File Systems ? Any VFS people reply to this, please?

I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.  Python says 
2**64 is 18446744073709551616, and that's roughly:
18,446,744,073,709,551,616 bytes
18,446,744,073,709 megs
18,446,744,073 gigs
18,446,744 terabytes
18,446 ...  what are those, petabytes?
18 Really big lumps of data we won't be using for a while yet.

And that's just 64 bits.  Keep in mind it took us around fifty years to burn 
through the _first_ thirty two (which makes sense, since Moore's Law says we 
need 1 more bit every 18 months).  We may go through it faster than we went 
through the first 32 bits, but it'll last us a couple decades at least.

Now I'm not saying we won't exhaust 64 bits eventually.  Back to chemistry, it 
takes 6.02*10^23 protons to weigh 1 gram, and that's just about 2^79, so it's 
feasible that someday we might be able to store more than 2^64 bits of data per
gram, let alone in big room-sized clusters.   But it's not going to be for 
years and years, and that's a design problem for Sun.

Sun is proposing it can predict what storage layout will be efficient for as 
yet unheard of quantities of data, with unknown access patterns, at least a 
couple decades from now.  It's also proposing that data compression and 
checksumming are the filesystem's job.  Hands up anybody who spots 
conflicting trends here already?  Who thinks the 128 bit requirement came 
from marketing rather than the engineers?

If you're worried about being able to access your data 2 or 3 decades from 
now, you should _not_ be worried about choice of filesystem.  You should be 
worried about making it _independent_ of what filesystem it's on.  For 
example, none of the current journaling filesystems in Linux were available 
20 years ago, because fsck didn't emerge as a bottleneck until filesystem 
sizes got really big.

Rob

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 18:52         ` Rob Landley
@ 2005-11-21 19:28           ` Diego Calleja
  2005-11-21 20:02           ` Bernd Petrovitsch
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 74+ messages in thread
From: Diego Calleja @ 2005-11-21 19:28 UTC (permalink / raw)
  To: Rob Landley; +Cc: tarkane, linux-kernel

El Mon, 21 Nov 2005 12:52:04 -0600,
Rob Landley <rob@landley.net> escribió:

> If you're worried about being able to access your data 2 or 3 decades from 
> now, you should _not_ be worried about choice of filesystem.  You should be

Sun has invested $4.1bn in buying StorageTek and more money in buying other
small storage companies (i.e., they're focusing a lot on "storage"). ZFS
fits perfectly there.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 18:52         ` Rob Landley
  2005-11-21 19:28           ` Diego Calleja
@ 2005-11-21 20:02           ` Bernd Petrovitsch
  2005-11-22  5:42             ` Rob Landley
  2005-11-21 23:05           ` Bill Davidsen
                             ` (3 subsequent siblings)
  5 siblings, 1 reply; 74+ messages in thread
From: Bernd Petrovitsch @ 2005-11-21 20:02 UTC (permalink / raw)
  To: Rob Landley; +Cc: Tarkan Erimer, linux-kernel, Diego Calleja

On Mon, 2005-11-21 at 12:52 -0600, Rob Landley wrote:
[...]
> couple decades from now.  It's also proposing that data compression and 
> checksumming are the filesystem's job.  Hands up anybody who spots 
> conflicting trends here already?  Who thinks the 128 bit requirement came 
> from marketing rather than the engineers?

Without compressing you probably need 256 bits.

SCNR,
	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 13:18           ` Matthias Andree
  2005-11-21 14:18             ` Kasper Sandberg
@ 2005-11-21 20:48             ` jdow
  2005-11-22 11:17               ` Jörn Engel
  1 sibling, 1 reply; 74+ messages in thread
From: jdow @ 2005-11-21 20:48 UTC (permalink / raw)
  To: Matthias Andree, linux-kernel

From: "Matthias Andree" <matthias.andree@gmx.de>

> On Mon, 21 Nov 2005, Kasper Sandberg wrote:
>
>> On Mon, 2005-11-21 at 12:46 +0100, Matthias Andree wrote:
>> > On Mon, 21 Nov 2005, Jörn Engel wrote:
>> >
>> > > o Checksums for data blocks
>> > >   Done by jffs2, not done my any hard disk filesystems I'm aware of.
>> >
>> > Then allow me to point you to the Amiga file systems. The variants
>> > commonly dubbed "Old File System" use only 448 (IIRC) out of 512 bytes
>
> Make that 488. Amiga's traditional file system loses 6 longs (at 32 bit
> each) according to Ralph Babel's "The Amiga Guru Book".

FYI it was not used very often on hard disk file systems. The effect on
performance was "remarkable". Each disk block contained a simple ulong
checksum, a pointer to the next block in the file, and a pointer to the
previous block in the file. The entire file system was built of doubly
linked lists. It was possible to effect remarkable levels of "unerase"
and recover from disk corruption better than most other filesystems I
have seen. But it made watching glass flow seem fast when you tried to
use it. So as soon as the Amiga Fast File System, FFS, was developed,
OFS became a floppy-only tool. That lasted until FFS was enabled for
floppies, months later. Then OFS became a legacy compatibility feature
that was seldom if ever used by real people. I am not sure how I would
apply a checksum to each block of a file and still maintain reasonable
access speeds. It would be entertaining to see what the ZFS file system
does in this regard so that it doesn't slow down to essentially
single-block-per-transaction disk reads or huge RAM buffer areas such as
had to be used with OFS.

>> > in a data block for payload and put their block chaining information,
>> > checksum and other "interesting" things into the blocks. This helps
>> > recoverability a lot but kills performance, so many people (used to) use
>> > the "Fast File System" that uses the full 512 bytes for data blocks.
>> >
>> > Whether the Amiga FFS, even with multi-user and directory index updates,
>> > has a lot of importance today, is a different question that you didn't
>> > pose :-)

Amiga FFS has some application today, generally for archival data
recovery. I am quite happy that potential is available. The Amiga FFS
and OFS had some features mildly incompatible with 'ix-type filesystems,
and these features were used frequently. So it is easier to perpetuate
the old Amiga FFS images than to copy them over in many cases.

>> that isnt true, just because it isnt following the kernel coding style
>> and therefore has to be changed, does not make it any bit more unstable.
>
> If the precondition is "adhere to CodingStyle or you don't get it in",
> and the CodingStyle has been established for years, I have zero sympathy
> with the maintainer if he's told "no, you didn't follow that well-known
> style".

Personally I am not a fan of the Linux coding style. However, if I am
going to commit a patch or a large block of Linux only code then its
style will match my understanding of the Linux coding style. This is
merely a show of professionalism on the part of the person creating the
code or patch. A brand new religious war over the issue is the mark of
a stupid boor at this time. It is best to go with the flow. The worst
code to maintain is code that contains eleven thousand eleven hundred
eleven individual idiosyncratic coding styles.

> Matthias Andree

{^_^}   Joanne Dow, who pretty much knows Amiga filesystems inside and
        out if I feel a need to refresh my working memory on the subject.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 14:18             ` Kasper Sandberg
  2005-11-21 14:41               ` Matthias Andree
@ 2005-11-21 22:41               ` Bill Davidsen
  1 sibling, 0 replies; 74+ messages in thread
From: Bill Davidsen @ 2005-11-21 22:41 UTC (permalink / raw)
  To: Kasper Sandberg; +Cc: linux-kernel

Kasper Sandberg wrote:
> On Mon, 2005-11-21 at 14:18 +0100, Matthias Andree wrote:

>>I don't care what its name is. I am aware it is a rewrite, and that is
>>reason to be all the more chary about adopting it early. People believed
>>3.5 to be stable, too, before someone tried NFS...
> 
> nfs works fine with reiser4. you are judging reiser4 by the problems
> reiserfs had.

reiser4 will have far more problems than 3.5 without a doubt. The NFS 
problem was because it was a use which had not been properly tested, and 
that was because it had not been envisioned. You test for the cases you 
can envision, the "this is how people will use it" cases. He is judging 
by the problems of any increasingly complex software.

reiser4 has a ton of new features not found in other filesystems, and 
the developers can't begin to guess how people will use them because 
people never had these features before. When files were read, write, 
create, delete, permissions and seek, you could think of the ways people 
would use them because there were so few things you could do. Then came 
attrs, ACLs, etc, etc. All of a sudden people were doing things they 
never did before, and there were unforeseen, unintended, unsupported 
interactions which went off on code paths which reminded people of "the 
less traveled way" in the poem. Developers looked at bug reports and 
asked why anyone would ever do THAT? But the bugs got fixed and ext3 
became stable.

People are going to do things the reiser4 developers didn't envision, 
they are going to run it over LVM on top of multilevel RAID using nbd as 
part of the array, on real-time, preemptable, NUMA-enabled kernels, on 
hardware platforms at best lightly tested... and reiser4 will regularly 
lose bladder control because someone has just found another "can't 
happen" or "no one would do that" path.

This isn't a criticism of reiser4; Matthias and others are just pointing 
out that once any complex capability is added, people will use it in 
unexpected ways and it will fail. So don't bother to even think that it 
matters that it's been stable for you, because you haven't begun to 
drive the wheels of it, no one person can.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22  0:15           ` Bernd Eckenfels
@ 2005-11-21 22:59             ` Jeff V. Merkey
  2005-11-22  7:45               ` Christoph Hellwig
  2005-11-22 16:00               ` Bill Davidsen
  2005-11-22  7:15             ` Rob Landley
  1 sibling, 2 replies; 74+ messages in thread
From: Jeff V. Merkey @ 2005-11-21 22:59 UTC (permalink / raw)
  To: Bernd Eckenfels; +Cc: linux-kernel

Bernd Eckenfels wrote:

>In article <200511211252.04217.rob@landley.net> you wrote:
>  
>
>>I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.  Python says 
>>2**64 is 18446744073709551616, and that's roughly:
>>18,446,744,073,709,551,616 bytes
>>18,446,744,073,709 megs
>>18,446,744,073 gigs
>>18,446,744 terabytes
>>18,446 ...  what are those, pedabytes (petabytes?)
>>18          zetabytes
>>
There you go.  I deal with this a lot, so those are the names.

Linux is currently limited to 16 TB per VFS mount point, it's all mute, unless VFS gets fixed.
mmap won't go above this at present.

Jeff




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 18:52         ` Rob Landley
  2005-11-21 19:28           ` Diego Calleja
  2005-11-21 20:02           ` Bernd Petrovitsch
@ 2005-11-21 23:05           ` Bill Davidsen
  2005-11-22  0:15           ` Bernd Eckenfels
                             ` (2 subsequent siblings)
  5 siblings, 0 replies; 74+ messages in thread
From: Bill Davidsen @ 2005-11-21 23:05 UTC (permalink / raw)
  To: Rob Landley; +Cc: linux-kernel, Diego Calleja

Rob Landley wrote:
> On Monday 21 November 2005 08:19, Tarkan Erimer wrote:
> 
>>On 11/21/05, Diego Calleja <diegocg@gmail.com> wrote:
>>If It happenned, Sun or someone has port it to linux.
>>We will need some VFS changes to handle 128 bit FS as "Jörn ENGEL"
>>mentionned previous mail in this thread. Is there any plan or action
>>to make VFS handle 128 bit File Sytems like ZFS or future 128 bit
>>File Systems ? Any VFS people reply to this, please?
> 
> 
> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.  Python says 
> 2**64 is 18446744073709551616, and that's roughly:
> 18,446,744,073,709,551,616 bytes
> 18,446,744,073,709 megs
> 18,446,744,073 gigs
> 18,446,744 terabytes
> 18,446 ...  what are those, petabytes?
> 18 Really big lumps of data we won't be using for a while yet.
> 
> And that's just 64 bits.  Keep in mind it took us around fifty years to burn 
> through the _first_ thirty two (which makes sense, since Moore's Law says we 
> need 1 more bit every 18 months).  We may go through it faster than we went 
> through the first 32 bits, but it'll last us a couple decades at least.
> 
> Now I'm not saying we won't exhaust 64 bits eventually.  Back to chemistry, it 
> takes 6.02*10^23 protons to weigh 1 gram, and that's just about 2^79, so it's 
> feasible that someday we might be able to store more than 64 bits of data per 
> gram, let alone in big room-sized clusters.   But it's not going to be for 
> years and years, and that's a design problem for Sun.

There's a more limiting problem: energy. Assume that the energy to set 
one bit is the energy to reverse the spin of an electron, and call that s. 
If each value of a 128 bit address holds a single byte, then
    T = s * 8 * 2^128   and   T > B
where T is the total energy to low-level format the storage, and B is 
the energy to boil all the oceans of Earth. That was in one of the 
physics magazines earlier this year. There just isn't enough usable 
energy to write that much data.
> 
> Sun is proposing it can predict what storage layout will be efficient for as 
> yet unheard of quantities of data, with unknown access patterns, at least a 
> couple decades from now.  It's also proposing that data compression and 
> checksumming are the filesystem's job.  Hands up anybody who spots 
> conflicting trends here already?  Who thinks the 128 bit requirement came 
> from marketing rather than the engineers?

Not me. If you are going larger than 64 bits you have no good reason not 
to double the size; it avoids some problems by fitting nicely in two 64 bit 
registers without truncation or extension. And we will never need more 
than 128 bits, so the addressing problems are solved.
> 
> If you're worried about being able to access your data 2 or 3 decades from 
> now, you should _not_ be worried about choice of filesystem.  You should be 
> worried about making it _independent_ of what filesystem it's on.  For 
> example, none of the current journaling filesystems in Linux were available 
> 20 years ago, because fsck didn't emerge as a bottleneck until filesystem 
> sizes got really big.

I'm gradually copying backups from the 90's off DC600 tapes to CDs, 
knowing that they will require at least one more copy in my lifetime 
(hopefully).
-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 18:52         ` Rob Landley
                             ` (2 preceding siblings ...)
  2005-11-21 23:05           ` Bill Davidsen
@ 2005-11-22  0:15           ` Bernd Eckenfels
  2005-11-21 22:59             ` Jeff V. Merkey
  2005-11-22  7:15             ` Rob Landley
  2005-11-22  0:45           ` Pavel Machek
  2005-11-22  9:20           ` Matthias Andree
  5 siblings, 2 replies; 74+ messages in thread
From: Bernd Eckenfels @ 2005-11-22  0:15 UTC (permalink / raw)
  To: linux-kernel

In article <200511211252.04217.rob@landley.net> you wrote:
> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.  Python says 
> 2**64 is 18446744073709551616, and that's roughly:
> 18,446,744,073,709,551,616 bytes
> 18,446,744,073,709 megs
> 18,446,744,073 gigs
> 18,446,744 terabytes
> 18,446 ...  what are those, petabytes?
> 18 Really big lumps of data we won't be using for a while yet.

The problem is not about file size. It is about, for example, unique inode
numbers. If you have a file system which spans multiple volumes and maybe
nodes, you need more unique ways of addressing the files and blocks.
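
One illustrative sketch in Python (a made-up layout, not how ZFS or any
other filesystem actually does it): pack a node id, a volume id and a
per-volume inode number into one identifier, and 64 bits already feels tight:

    def make_file_id(node_id, volume_id, local_ino):
        # hypothetical split: 32-bit node, 32-bit volume, 64-bit local inode,
        # which needs a 128-bit value to stay unique across the whole system
        return (node_id << 96) | (volume_id << 64) | local_ino

    print(hex(make_file_id(5, 17, 123456789)))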

Gruss
Bernd

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 18:52         ` Rob Landley
                             ` (3 preceding siblings ...)
  2005-11-22  0:15           ` Bernd Eckenfels
@ 2005-11-22  0:45           ` Pavel Machek
  2005-11-22  6:34             ` Rob Landley
  2005-11-22  9:20           ` Matthias Andree
  5 siblings, 1 reply; 74+ messages in thread
From: Pavel Machek @ 2005-11-22  0:45 UTC (permalink / raw)
  To: Rob Landley; +Cc: Tarkan Erimer, linux-kernel, Diego Calleja

Hi!

> > If It happenned, Sun or someone has port it to linux.
> > We will need some VFS changes to handle 128 bit FS as "Jörn ENGEL"
> > mentionned previous mail in this thread. Is there any plan or action
> > to make VFS handle 128 bit File Sytems like ZFS or future 128 bit
> > File Systems ? Any VFS people reply to this, please?
> 
> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.  Python says 
> 2**64 is 18446744073709551616, and that's roughly:
> 18,446,744,073,709,551,616 bytes
> 18,446,744,073,709 megs
> 18,446,744,073 gigs
> 18,446,744 terabytes
> 18,446 ...  what are those, petabytes?
> 18 Really big lumps of data we won't be using for a while yet.
> 
> And that's just 64 bits.  Keep in mind it took us around fifty years to burn 
> through the _first_ thirty two (which makes sense, since Moore's Law says we 
> need 1 more bit every 18 months).  We may go through it faster than we went 
> through the first 32 bits, but it'll last us a couple decades at least.
> 
> Now I'm not saying we won't exhaust 64 bits eventually.  Back to chemistry, it 
> takes 6.02*10^23 protons to weigh 1 gram, and that's just about 2^79, so it's 
> feasible that someday we might be able to store more than 64 bits of data per 
> gram, let alone in big room-sized clusters.   But it's not going to be for 
> years and years, and that's a design problem for Sun.
> 
> Sun is proposing it can predict what storage layout will be efficient for as 
> yet unheard of quantities of data, with unknown access patterns, at least a 
> couple decades from now.  It's also proposing that data compression and 
> checksumming are the filesystem's job.  Hands up anybody who spots 
> conflicting trends here already?  Who thinks the 128 bit requirement came 
> from marketing rather than the engineers?

Actually, if you are storing information in single protons, I'd say
you _need_ checksumming :-).

[I actually agree with Sun here, not trusting the disk is a good idea. At
least you know a kernel panic/oops/etc can't be caused by bit corruption on
the disk.]

								Pavel
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 20:02           ` Bernd Petrovitsch
@ 2005-11-22  5:42             ` Rob Landley
  2005-11-22  9:25               ` Matthias Andree
  0 siblings, 1 reply; 74+ messages in thread
From: Rob Landley @ 2005-11-22  5:42 UTC (permalink / raw)
  To: Bernd Petrovitsch; +Cc: Tarkan Erimer, linux-kernel, Diego Calleja

On Monday 21 November 2005 14:02, Bernd Petrovitsch wrote:
> On Mon, 2005-11-21 at 12:52 -0600, Rob Landley wrote:
> [...]
>
> > couple decades from now.  It's also proposing that data compression and
> > checksumming are the filesystem's job.  Hands up anybody who spots
> > conflicting trends here already?  Who thinks the 128 bit requirement came
> > from marketing rather than the engineers?
>
> Without compressing you probably need 256 bits.

I assume this is sarcasm.  Once again assuming you can someday manage to store 
1 bit per electron, 2^256 bits would have a corresponding 2^256 protons*, which 
would weigh (in grams):

> print 2**256/(6.02*(10**23))
1.92345663185e+53

Google for the weight of the earth:
http://www.ecology.com/earth-at-a-glance/earth-at-a-glance-feature/
Earth's Weight (Mass): 5.972 sextillion (1,000 trillion) metric tons.
Yeah, alright, mass...  So that's 5.972*10^21 metric tons, and a metric ton is 
a million grams, so 5.972*10^27 grams...

Google for the mass of the sun says that's 2*10^33 grams.  Still nowhere 
close.

Basically, as far as I can tell, any device capable of storing 2^256 bits 
would collapse into a black hole under its own weight.

By the way, 2^128/avogadro gives 5.65253101198e+14 grams, or 565 million metric 
tons.  For comparison, the Empire State Building: 
http://www.newyorktransportation.com/info/empirefact2.html
is 365,000 tons.  (Probably not metric, but you get the idea.)  Assuming I 
haven't screwed up the math, an object capable of storing anywhere near 2^128 
bits (constructed as a single giant molecule) would probably be in the size 
ballpark of New York, London, or Tokyo.
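
A quick check of that last figure, same assumptions as above:

    avogadro = 6.02e23
    grams = 2**128 / avogadro     # grams of protons matching 2^128 electrons
    print(grams / 1e6)            # ~5.65e8 metric tons, i.e. about 565 million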

2^64 we may actually live to see the end of someday, but it's not guaranteed.  
2^128 becoming relevant in our lifetimes is a touch unlikely.

Rob

* Yeah, I'm glossing over neutrons.  I'm also glossing over the possibility of 
storing more than one bit per electron and other quantum strangeness.  I 
have no idea how you'd _build_ one of these suckers.  Nobody does yet.  
They're working on it...

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22  0:45           ` Pavel Machek
@ 2005-11-22  6:34             ` Rob Landley
  2005-11-22 19:05               ` Pavel Machek
  0 siblings, 1 reply; 74+ messages in thread
From: Rob Landley @ 2005-11-22  6:34 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Tarkan Erimer, linux-kernel, Diego Calleja

On Monday 21 November 2005 18:45, Pavel Machek wrote:
> Hi!
> > Sun is proposing it can predict what storage layout will be efficient for
> > as yet unheard of quantities of data, with unknown access patterns, at
> > least a couple decades from now.  It's also proposing that data
> > compression and checksumming are the filesystem's job.  Hands up anybody
> > who spots conflicting trends here already?  Who thinks the 128 bit
> > requirement came from marketing rather than the engineers?
>
> Actually, if you are storing information in single protons, I'd say
> you _need_ checksumming :-).

You need error correcting codes at the media level.  A molecular storage 
system like this would probably look a lot more like flash or dram than it 
would magnetic media.  (For one thing, I/O bandwidth and seek times become a 
serious bottleneck with high density single point of access systems.)

> [I actually agree with Sun here, not trusting disk is good idea. At
> least you know kernel panic/oops/etc can't be caused by bit corruption on
> the disk.]

But who said the filesystem was the right level to do this at?

Rob

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22  0:15           ` Bernd Eckenfels
  2005-11-21 22:59             ` Jeff V. Merkey
@ 2005-11-22  7:15             ` Rob Landley
  2005-11-22  8:16               ` Bernd Eckenfels
  1 sibling, 1 reply; 74+ messages in thread
From: Rob Landley @ 2005-11-22  7:15 UTC (permalink / raw)
  To: Bernd Eckenfels; +Cc: linux-kernel

On Monday 21 November 2005 18:15, Bernd Eckenfels wrote:
> In article <200511211252.04217.rob@landley.net> you wrote:
> > I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.  Python
> > says 2**64 is 18446744073709551616, and that's roughly:
> > 18,446,744,073,709,551,616 bytes
> > 18,446,744,073,709 megs
> > 18,446,744,073 gigs
> > 18,446,744 terabytes
> > 18,446 ...  what are those, petabytes?
> > 18 Really big lumps of data we won't be using for a while yet.
>
> The prolem is not about file size. It is about for example unique inode
> numbers. If you have a file system which spans multiple volumnes and maybe
> nodes, you need more unqiue methods of addressing the files and blocks.

18 quintillion inodes are enough to give every ipv4 address on the internet 4 
billion unique inodes.  I take it this is not enough space for Sun to work 
out a reasonable allocation strategy in?
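
(The division, for the record:

    print(2**64 // 2**32)   # 4294967296 inodes for each of the 2^32 IPv4 addresses

about four billion per address, as claimed.)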

Rob



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 22:59             ` Jeff V. Merkey
@ 2005-11-22  7:45               ` Christoph Hellwig
  2005-11-22  9:19                 ` Jeff V. Merkey
  2005-11-22 16:00               ` Bill Davidsen
  1 sibling, 1 reply; 74+ messages in thread
From: Christoph Hellwig @ 2005-11-22  7:45 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: Bernd Eckenfels, linux-kernel

On Mon, Nov 21, 2005 at 03:59:52PM -0700, Jeff V. Merkey wrote:
> Linux is currently limited to 16 TB per VFS mount point, it's all mute, 
> unless VFS gets fixed.
> mmap won't go above this at present.

You're thinking of 32bit architectures.  There is no such limit for
64 bit architectures.  There are XFS volumes in the 100TB range in production
use.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 10:19     ` Jörn Engel
  2005-11-21 11:46       ` Matthias Andree
  2005-11-21 11:59       ` Diego Calleja
@ 2005-11-22  7:51       ` Christoph Hellwig
  2005-11-22 10:28         ` Jörn Engel
  2005-11-22 14:50         ` Theodore Ts'o
  2005-11-28 12:53       ` Lars Marowsky-Bree
  3 siblings, 2 replies; 74+ messages in thread
From: Christoph Hellwig @ 2005-11-22  7:51 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Alfred Brons, pocm, linux-kernel

> o 128 bit
>   On 32bit machines, you can't even fully utilize a 64bit filesystem
>   without VFS changes.  Have you ever noticed?  Thought so.

What is a '128 bit' or '64 bit' filesystem anyway?  This description doesn't
make any sense, as there are many different things that can be
addressed in filesystems, and those can be addressed in different ways.
I guess from the marketing documents that they do 128 bit _byte_ addressing
for diskspace.  All the interesting Linux filesystems do _block_ addressing
though, and 64 bits addressing large enough blocks is quite huge.
128 bit inodes again are something we couldn't easily implement; it would
mean a non-scalar ino_t type, which is guaranteed to break userspace.  128 bit
i_size?  Again that would totally break userspace because it expects off_t
to be a scalar, so every single file must fit into 64 bit _byte_ addressing.
If the surrounding environment changes (e.g. we get a 128 bit scalar type
on 64 bit architectures) that could change pretty easily, similarly to how
ext2 got a 64bit i_size during the 2.3.x LFS work.
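
To put a rough number on "quite huge", a minimal sketch assuming 4 KiB
blocks (the block size here is just an assumption for illustration):

    block_size = 4096                    # assumed block size in bytes
    max_bytes = 2**64 * block_size       # 64-bit block numbers
    print(max_bytes)                     # 2^76 bytes
    print(max_bytes / float(2**70))      # about 64 ZiB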

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22  7:15             ` Rob Landley
@ 2005-11-22  8:16               ` Bernd Eckenfels
  0 siblings, 0 replies; 74+ messages in thread
From: Bernd Eckenfels @ 2005-11-22  8:16 UTC (permalink / raw)
  To: linux-kernel

In article <200511220115.17450.rob@landley.net> you wrote:
> 18 quintillion inodes are enough to give every ipv4 address on the internet 4 
> billion unique inodes.  I take it this is not enough space for Sun to work 
> out a reasonable allocation strategy in?

Yes, I think that's why they did it. However, with IPv6 it becomes 1 inode/node.

Gruss
Bernd

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 15:08                 ` Kasper Sandberg
@ 2005-11-22  8:52                   ` Matthias Andree
  0 siblings, 0 replies; 74+ messages in thread
From: Matthias Andree @ 2005-11-22  8:52 UTC (permalink / raw)
  To: linux-kernel

On Mon, 21 Nov 2005, Kasper Sandberg wrote:

> > As mentioned, a file system cannot possibly be stable right after merge.
> > Having to change formatting is a sweeping change and certainly is a
> > barrier across which to look for auditing is all the more difficult.
> before reiser4 was changed alot, to match the codingstyle (agreed, they
> have to obey by the kernels codingstyle), it was stable, so had it been
> merged there it wouldnt have been any less stable.

Code reformatting, unless 100% automatic with a 100% proven and C99
aware formatting tool, also introduces instability.

> > Lucky you. I haven't dared try it yet for lack of a test computer to
> > trash.
> i too was reluctant, i ended up using it for the things i REALLY dont
> want to loose.

So did many when reiser 3 was fresh; there was much raving about its speed,
stability, its alleged recoverability and recovery speed, and then
people started sending full filesystem dumps on tape and other media to
Namesys...

It's impossible to fully test nontrivial code; every option and every
possible state exponentiates the number of cases you have to test to
claim 100% coverage.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22  7:45               ` Christoph Hellwig
@ 2005-11-22  9:19                 ` Jeff V. Merkey
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff V. Merkey @ 2005-11-22  9:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Bernd Eckenfels, linux-kernel

Christoph Hellwig wrote:

>On Mon, Nov 21, 2005 at 03:59:52PM -0700, Jeff V. Merkey wrote:
>  
>
>>Linux is currently limited to 16 TB per VFS mount point, it's all mute, 
>>unless VFS gets fixed.
>>mmap won't go above this at present.
>>    
>>
>
>You're thinking of 32bit architectures.  There is no such limit for
>64 bit architectures.  There are XFS volumes in the 100TB range in production
>use.
>
>
>  
>
I have 128 TB volumes in production use on 32 bit processors.

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 18:52         ` Rob Landley
                             ` (4 preceding siblings ...)
  2005-11-22  0:45           ` Pavel Machek
@ 2005-11-22  9:20           ` Matthias Andree
  2005-11-22 10:00             ` Tarkan Erimer
  5 siblings, 1 reply; 74+ messages in thread
From: Matthias Andree @ 2005-11-22  9:20 UTC (permalink / raw)
  To: linux-kernel

On Mon, 21 Nov 2005, Rob Landley wrote:

> On Monday 21 November 2005 08:19, Tarkan Erimer wrote:
> > On 11/21/05, Diego Calleja <diegocg@gmail.com> wrote:
> > If It happenned, Sun or someone has port it to linux.
> > We will need some VFS changes to handle 128 bit FS as "Jörn ENGEL"
> > mentionned previous mail in this thread. Is there any plan or action
> > to make VFS handle 128 bit File Sytems like ZFS or future 128 bit
> > File Systems ? Any VFS people reply to this, please?
> 
> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.  Python says 
> 2**64 is 18446744073709551616, and that's roughly:
> 18,446,744,073,709,551,616 bytes
> 18,446,744,073,709 megs

  18,446,744,073,710 Mbytes (round up)

> 18,446,744,073 gigs
> 18,446,744 terabytes
> 18,446 ...  what are those, petabytes?

  18,447 Pbytes, right.

> 18 Really big lumps of data we won't be using for a while yet.

18 Exabytes, indeed.

Sun decided it will not have to think about sizing again for a long while,
and looking at how long UFS has been around, Sun may have the better laugh
in the end.

> Sun is proposing it can predict what storage layout will be efficient for as 
> yet unheard of quantities of data, with unknown access patterns, at least a 
> couple decades from now.  It's also proposing that data compression and 
> checksumming are the filesystem's job.  Hands up anybody who spots 
> conflicting trends here already?  Who thinks the 128 bit requirement came 
> from marketing rather than the engineers?

Is that important? Who says Sun isn't going to put checksumming and
compression hardware into its machines, and tell ZFS and their hardware
drivers to use it? Keep ZFS tuned for new requirements as they emerge?

AFAIK, no-one has suggested ZFS yet for floppies (including LS120, ZIP
and that stuff - it was also a major hype, now with DVD-RAM, DVD+RW and
DVD-RW few people talk about LS120 or ZIP any more).

What if some breakthrough in storage gives us vastly larger (larger than
predicted harddisk storage density increases) storage densities in 10
years for the same price of a 200 or 300 GB disk drive now?

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22  5:42             ` Rob Landley
@ 2005-11-22  9:25               ` Matthias Andree
  0 siblings, 0 replies; 74+ messages in thread
From: Matthias Andree @ 2005-11-22  9:25 UTC (permalink / raw)
  To: linux-kernel

On Mon, 21 Nov 2005, Rob Landley wrote:

> 2^64 we may actually live to see the end of someday, but it's not guaranteed.  
> 2^128 becoming relevant in our lifetimes is a touch unlikely.

Some people suggested we don't know usage and organization patterns yet;
perhaps something that is very sparse can benefit from linear addressing
in a huge (not to say vastly oversized) address space. Perhaps not.

One real-world example is that we've been doing RAM overcommit for a
long time to account for but not actually perform memory allocations,
and on 32-bit machines, 1 GB of RAM already required highmem until
recently. So here, 64-bit address space comes as an advantage.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22  9:20           ` Matthias Andree
@ 2005-11-22 10:00             ` Tarkan Erimer
  2005-11-22 15:46               ` Jan Dittmer
  2005-11-22 16:27               ` Bill Davidsen
  0 siblings, 2 replies; 74+ messages in thread
From: Tarkan Erimer @ 2005-11-22 10:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: matthias.andree

On 11/22/05, Matthias Andree <matthias.andree@gmx.de> wrote:
> What if some breakthrough in storage gives us vastly larger (larger than
> predicted harddisk storage density increases) storage densities in 10
> years for the same price of a 200 or 300 GB disk drive now?

If all the speculations about AtomChip Corp.'s
(http://www.atomchip.com) optical technology are true, we will begin to
use really large RAM and storage much earlier than we expected.
Their prototypes already begin at 1 TB (both for RAM and storage).
It's not hard to imagine that, a few years later, we could use 100-200
TB and up of storage and RAM.

Regards

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22  7:51       ` Christoph Hellwig
@ 2005-11-22 10:28         ` Jörn Engel
  2005-11-22 14:50         ` Theodore Ts'o
  1 sibling, 0 replies; 74+ messages in thread
From: Jörn Engel @ 2005-11-22 10:28 UTC (permalink / raw)
  To: Christoph Hellwig, Alfred Brons, pocm, linux-kernel

On Tue, 22 November 2005 07:51:48 +0000, Christoph Hellwig wrote:
> 
> > o 128 bit
> >   On 32bit machines, you can't even fully utilize a 64bit filesystem
> >   without VFS changes.  Have you ever noticed?  Thought so.
> 
> What is a '128 bit' or '64 bit' filesystem anyway?  This description doesn't
> make any sense,  as there are many different things that can be
> addresses in filesystems, and those can be addressed in different ways.
> I guess from the marketing documents that they do 128 bit _byte_ addressing
> for diskspace.  All the interesting Linux filesystems do _block_ addressing
> though, and 64bits addressing large enough blocks is quite huge.
> 128bit inodes again is something could couldn't easily implement, it would
> mean a non-scalar ino_t type which guarantees to break userspace.  128
> i_size?  Again that would totally break userspace because it expects off_t
> to be a scalar, so every single file must fit into 64bit _byte_ addressing.
> If the surrounding enviroment changes (e.g. we get a 128bit scalar type
> on 64bit architectures) that could change pretty easily, similarly to how
> ext2 got a 64bit i_size during the 2.3.x LFS work.

...once the need arises.  Even with byte addressing, 64 bits are enough
to handle roughly 46116860 of the biggest hard disks currently
available.  Looks like we still have a bit of time to think about the
problem before action is required.
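
That 46116860 works out if you assume drives of roughly 400 GB, about
the biggest on the shelf right now:

    drive_bytes = 400 * 10**9      # assumed capacity per drive
    print(2**64 // drive_bytes)    # 46116860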

Jörn

-- 
Victory in war is not repetitious.
-- Sun Tzu

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 20:48             ` jdow
@ 2005-11-22 11:17               ` Jörn Engel
  0 siblings, 0 replies; 74+ messages in thread
From: Jörn Engel @ 2005-11-22 11:17 UTC (permalink / raw)
  To: jdow; +Cc: Matthias Andree, linux-kernel

On Mon, 21 November 2005 12:48:44 -0800, jdow wrote:
>
> that was seldom if ever used by real people. I am not sure how I would
> apply a checksum to each block of a file and still maintain reasonable
> access speeds. It would be entertaining to see what the ZFS file system
> does in this regard so that it doesn't slow down to essentially single
> block per transaction disk reads or huge RAM buffer areas such as had
> to be used with OFS.

Design should be just as ZFS allegedly does it.  Store the checksum
near the indirect block pointers.  Seeks for checksums basically don't
exist, as you need to seek for the indirect block pointers anyway.
The only drawback is the effective growth of the area for the
pointers+checksum blocks, which has a small impact on your caches.
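
A toy sketch of the idea in Python (purely illustrative; ZFS's actual
on-disk format is its own thing): the indirect block carries
(pointer, checksum) pairs, so verifying a block costs no seek beyond
the one you already pay to read the pointer:

    import zlib

    class IndirectBlock(object):
        def __init__(self):
            self.entries = []      # list of (block_number, checksum) pairs

        def add(self, block_number, data):
            self.entries.append((block_number, zlib.crc32(data) & 0xffffffff))

        def verify(self, index, data):
            block_number, stored = self.entries[index]
            return (zlib.crc32(data) & 0xffffffff) == stored

    ib = IndirectBlock()
    ib.add(1234, b"some block payload")
    print(ib.verify(0, b"some block payload"))   # True
    print(ib.verify(0, b"corrupted payload"))    # False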

Jörn

-- 
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22  7:51       ` Christoph Hellwig
  2005-11-22 10:28         ` Jörn Engel
@ 2005-11-22 14:50         ` Theodore Ts'o
  2005-11-22 15:25           ` Jan Harkes
  1 sibling, 1 reply; 74+ messages in thread
From: Theodore Ts'o @ 2005-11-22 14:50 UTC (permalink / raw)
  To: Christoph Hellwig, Jörn Engel, Alfred Brons, pocm, linux-kernel

On Tue, Nov 22, 2005 at 07:51:48AM +0000, Christoph Hellwig wrote:
> 
> What is a '128 bit' or '64 bit' filesystem anyway?  This description doesn't
> make any sense,  as there are many different things that can be
> addresses in filesystems, and those can be addressed in different ways.
> I guess from the marketing documents that they do 128 bit _byte_ addressing
> for diskspace.  All the interesting Linux filesystems do _block_ addressing
> though, and 64bits addressing large enough blocks is quite huge.
> 128bit inodes again is something could couldn't easily implement, it would
> mean a non-scalar ino_t type which guarantees to break userspace.  128
> i_size?  Again that would totally break userspace because it expects off_t
> to be a scalar, so every single file must fit into 64bit _byte_ addressing.
> If the surrounding enviroment changes (e.g. we get a 128bit scalar type
> on 64bit architectures) that could change pretty easily, similarly to how
> ext2 got a 64bit i_size during the 2.3.x LFS work.

I will note though that there are people who are asking for 64-bit
inode numbers on 32-bit platforms, since 2**32 inodes are not enough
for certain distributed/clustered filesystems.  And this is something
we don't yet support today, and probably will need to think about much
sooner than 128-bit filesystems....


						- Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 14:50         ` Theodore Ts'o
@ 2005-11-22 15:25           ` Jan Harkes
  2005-11-22 16:17             ` Chris Adams
  2005-11-22 16:28             ` what is our " Theodore Ts'o
  0 siblings, 2 replies; 74+ messages in thread
From: Jan Harkes @ 2005-11-22 15:25 UTC (permalink / raw)
  To: Theodore Ts'o, Christoph Hellwig, Jörn Engel, Alfred Brons,
	pocm, linux-kernel

On Tue, Nov 22, 2005 at 09:50:47AM -0500, Theodore Ts'o wrote:
> I will note though that there are people who are asking for 64-bit
> inode numbers on 32-bit platforms, since 2**32 inodes are not enough
> for certain distributed/clustered filesystems.  And this is something
> we don't yet support today, and probably will need to think about much
> sooner than 128-bit filesystems....

As far as the kernel is concerned this hasn't been a problem in a while
(2.4.early). The iget4 operation that was introduced by reiserfs (now
iget5) pretty much makes it possible for a filesystem to use anything to
identify its inodes. The 32-bit inode numbers are simply used as a hash
index.

The only things that tend to break are userspace archiving tools like
tar, which assume that 2 objects with the same 32-bit st_ino value are
identical. I think that by now several actually double check that the
inode linkcount is larger than 1.
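
A rough sketch of the sort of check such a tool does (illustrative
Python, not actual tar code):

    import os

    def already_archived(path, seen):
        st = os.stat(path)
        if st.st_nlink > 1:                 # only multi-link files can be hard links
            key = (st.st_dev, st.st_ino)    # the 32-bit st_ino is the weak point
            if key in seen:
                return seen[key]            # archive as a link to the earlier path
            seen[key] = path
        return None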

Jan

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 10:00             ` Tarkan Erimer
@ 2005-11-22 15:46               ` Jan Dittmer
  2005-11-22 16:27               ` Bill Davidsen
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Dittmer @ 2005-11-22 15:46 UTC (permalink / raw)
  To: Tarkan Erimer; +Cc: linux-kernel, matthias.andree

Tarkan Erimer wrote:
> On 11/22/05, Matthias Andree <matthias.andree@gmx.de> wrote:
> 
>>What if some breakthrough in storage gives us vastly larger (larger than
>>predicted harddisk storage density increases) storage densities in 10
>>years for the same price of a 200 or 300 GB disk drive now?
> 
> 
> If all the speculations are true for AtomChip Corp.'s
> (http://www.atomchip.com) Optical Technology. We wil begin to use
> really large RAMs and Storages very early than we expected.
> Their prototypes already begin with 1 TB (both for RAM and Storage).
> It's not hard to imagine, a few years later, we can use 100-200 and up
> TB Storages and RAMs.

http://www.portablegadgets.net/article/59/atomchip-is-a-hoax

Jan

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 22:59             ` Jeff V. Merkey
  2005-11-22  7:45               ` Christoph Hellwig
@ 2005-11-22 16:00               ` Bill Davidsen
  2005-11-22 16:09                 ` Jeff V. Merkey
  2005-11-22 16:14                 ` Randy.Dunlap
  1 sibling, 2 replies; 74+ messages in thread
From: Bill Davidsen @ 2005-11-22 16:00 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: linux-kernel

Jeff V. Merkey wrote:
> Bernd Eckenfels wrote:
> 
>> In article <200511211252.04217.rob@landley.net> you wrote:
>>  
>>
>>> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.  
>>> Python says 2**64 is 18446744073709551616, and that's roughly:
>>> 18,446,744,073,709,551,616 bytes
>>> 18,446,744,073,709 megs
>>> 18,446,744,073 gigs
>>> 18,446,744 terabytes
>>> 18,446 ...  what are those, pedabytes (petabytes?)
>>> 18          zetabytes
>>>
> There you go.  I deal with this a lot so, those are the names.
> 
> Linux is currently limited to 16 TB per VFS mount point, it's all mute, 
> unless VFS gets fixed.
> mmap won't go above this at present.
> 
What does "it's all mute" mean?

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 16:00               ` Bill Davidsen
@ 2005-11-22 16:09                 ` Jeff V. Merkey
  2005-11-22 20:16                   ` Bill Davidsen
  2005-11-22 16:14                 ` Randy.Dunlap
  1 sibling, 1 reply; 74+ messages in thread
From: Jeff V. Merkey @ 2005-11-22 16:09 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

Bill Davidsen wrote:

> Jeff V. Merkey wrote:
>
>> Bernd Eckenfels wrote:
>>
>>> In article <200511211252.04217.rob@landley.net> you wrote:
>>>
>>>
>>>> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS. 
>>>> Python says 2**64 is 18446744073709551616, and that's roughly:
>>>> 18,446,744,073,709,551,616 bytes
>>>> 18,446,744,073,709 megs
>>>> 18,446,744,073 gigs
>>>> 18,446,744 terabytes
>>>> 18,446 ... what are those, pedabytes (petabytes?)
>>>> 18 zetabytes
>>>>
>> There you go. I deal with this a lot so, those are the names.
>>
>> Linux is currently limited to 16 TB per VFS mount point, it's all 
>> mute, unless VFS gets fixed.
>> mmap won't go above this at present.
>>
> What does "it's all mute" mean?
>
Should be spelled "moot". It's a legal term that means "it doesn't matter".

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 16:00               ` Bill Davidsen
  2005-11-22 16:09                 ` Jeff V. Merkey
@ 2005-11-22 16:14                 ` Randy.Dunlap
  2005-11-22 16:38                   ` Steve Flynn
  1 sibling, 1 reply; 74+ messages in thread
From: Randy.Dunlap @ 2005-11-22 16:14 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Jeff V. Merkey, linux-kernel

On Tue, 22 Nov 2005, Bill Davidsen wrote:

> Jeff V. Merkey wrote:
> > Bernd Eckenfels wrote:
> >
> >> In article <200511211252.04217.rob@landley.net> you wrote:
> >>
> >>
> >>> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.
> >>> Python says 2**64 is 18446744073709551616, and that's roughly:
> >>> 18,446,744,073,709,551,616 bytes
> >>> 18,446,744,073,709 megs
> >>> 18,446,744,073 gigs
> >>> 18,446,744 terabytes
> >>> 18,446 ...  what are those, pedabytes (petabytes?)
> >>> 18          zetabytes
> >>>
> > There you go.  I deal with this a lot so, those are the names.
> >
> > Linux is currently limited to 16 TB per VFS mount point, it's all mute,
> > unless VFS gets fixed.
> > mmap won't go above this at present.
> >
> What does "it's all mute" mean?

It means "it's all moot."

-- 
~Randy

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 15:25           ` Jan Harkes
@ 2005-11-22 16:17             ` Chris Adams
  2005-11-22 16:55               ` Anton Altaparmakov
  2005-11-22 20:19               ` Alan Cox
  2005-11-22 16:28             ` what is our " Theodore Ts'o
  1 sibling, 2 replies; 74+ messages in thread
From: Chris Adams @ 2005-11-22 16:17 UTC (permalink / raw)
  To: linux-kernel

Once upon a time, Jan Harkes <jaharkes@cs.cmu.edu> said:
>The only thing that tends to break are userspace archiving tools like
>tar, which assume that 2 objects with the same 32-bit st_ino value are
>identical.

That assumption is probably made because that's what POSIX and the Single
Unix Specification define: "The st_ino and st_dev fields taken together
uniquely identify the file within the system."  Don't blame code that
follows the standards for breaking.

>I think that by now several actually double check that the inode
>linkcount is larger than 1.

That is not a good check.  I could have two separate files that have
multiple links; if st_ino is the same, how can tar make sense of it?
-- 
Chris Adams <cmadams@hiwaay.net>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 10:00             ` Tarkan Erimer
  2005-11-22 15:46               ` Jan Dittmer
@ 2005-11-22 16:27               ` Bill Davidsen
  1 sibling, 0 replies; 74+ messages in thread
From: Bill Davidsen @ 2005-11-22 16:27 UTC (permalink / raw)
  To: Tarkan Erimer; +Cc: Linux Kernel Mailing List

Tarkan Erimer wrote:
> On 11/22/05, Matthias Andree <matthias.andree@gmx.de> wrote:
> 
>>What if some breakthrough in storage gives us vastly larger (larger than
>>predicted harddisk storage density increases) storage densities in 10
>>years for the same price of a 200 or 300 GB disk drive now?
> 
> 
> If all the speculations are true for AtomChip Corp.'s
> (http://www.atomchip.com) Optical Technology. We wil begin to use
> really large RAMs and Storages very early than we expected.
> Their prototypes already begin with 1 TB (both for RAM and Storage).
> It's not hard to imagine, a few years later, we can use 100-200 and up
> TB Storages and RAMs.
> 
Amazing technology: run XP on a 256-bit 6.8 GHz proprietary quantum CPU 
by breaking the words into 64-bit pieces and passing them to XP via a 
"RAM packet counter" device.

And they run four copies of XP at once, too, and you don't need to boot 
them, they run instantly because... the web page says so?

I assume this is a joke, a scam would have prices ;-)
-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 15:25           ` Jan Harkes
  2005-11-22 16:17             ` Chris Adams
@ 2005-11-22 16:28             ` Theodore Ts'o
  2005-11-22 17:37               ` Jan Harkes
  1 sibling, 1 reply; 74+ messages in thread
From: Theodore Ts'o @ 2005-11-22 16:28 UTC (permalink / raw)
  To: Christoph Hellwig, Jörn Engel, Alfred Brons, pocm, linux-kernel

On Tue, Nov 22, 2005 at 10:25:31AM -0500, Jan Harkes wrote:
> On Tue, Nov 22, 2005 at 09:50:47AM -0500, Theodore Ts'o wrote:
> > I will note though that there are people who are asking for 64-bit
> > inode numbers on 32-bit platforms, since 2**32 inodes are not enough
> > for certain distributed/clustered filesystems.  And this is something
> > we don't yet support today, and probably will need to think about much
> > sooner than 128-bit filesystems....
> 
> As far as the kernel is concerned this hasn't been a problem in a while
> (2.4.early). The iget4 operation that was introduced by reiserfs (now
> iget5) pretty much makes it possible for a filesystem to use anything to
> identify it's inodes. The 32-bit inode numbers are simply used as a hash
> index.

iget4 wasn't even strictly necessary, unless you want to use the inode
cache (which has always been strictly optional for filesystems, even
inode-based ones) --- Linux's VFS is dentry-based, not inode-based, so
we don't use inode numbers to index much of anything inside the
kernel, other than the aforementioned optional inode cache.

The main issue is the lack of a 64-bit interface to extract inode
numbers, which is needed as you point out for userspace archiving
tools like tar. There are also other programs or protocols that in the
past have broken as a result of inode number collisions.

As another example, a quick Google search indicates that some mail
programs can use inode numbers as part of a technique to create
unique filenames in maildir directories.  One could easily also
imagine using inode numbers as part of creating unique ids returned by
an IMAP server --- not something I would recommend, but it's an
example of what some people might have done, since everybody _knows_
they can count on inode numbers on Unix systems, right?  POSIX
promises that they won't break!

> The only thing that tends to break are userspace archiving tools like
> tar, which assume that 2 objects with the same 32-bit st_ino value are
> identical. I think that by now several actually double check that the
> inode linkcount is larger than 1.

Um, that's not good enough to avoid failure modes; consider what might
happen if you have two inodes that have hardlinks, so that st_nlink >
1, but whose inode numbers are the same if you only look at the low 32
bits?  Oops.

It's not a bad heuristic if you don't have that many hard-linked
files on your system, but if you have a huge number of hard-linked
trees (such as you might find on a kernel developer's machine with tons
of hard-linked trees), I wouldn't want to count on this always working.
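
Concretely, the failure is two distinct 64-bit inode numbers that agree
in their low 32 bits:

    ino_a = 0x0000000100000007
    ino_b = 0x0000000900000007
    print(ino_a == ino_b)                                 # False: different inodes
    print((ino_a & 0xffffffff) == (ino_b & 0xffffffff))   # True: they collide at 32 bits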

						- Ted








> 
> Jan

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 17:37               ` Jan Harkes
@ 2005-11-22 16:36                 ` Jeff V. Merkey
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff V. Merkey @ 2005-11-22 16:36 UTC (permalink / raw)
  To: Jan Harkes
  Cc: Theodore Ts'o, Christoph Hellwig, Jörn Engel, Alfred Brons,
	pocm, linux-kernel

Jan Harkes wrote:

>On Tue, Nov 22, 2005 at 11:28:36AM -0500, Theodore Ts'o wrote:
>  
>
>>On Tue, Nov 22, 2005 at 10:25:31AM -0500, Jan Harkes wrote:
>>    
>>
>>>On Tue, Nov 22, 2005 at 09:50:47AM -0500, Theodore Ts'o wrote:
>>>      
>>>
>>>>I will note though that there are people who are asking for 64-bit
>>>>inode numbers on 32-bit platforms, since 2**32 inodes are not enough
>>>>for certain distributed/clustered filesystems.  And this is something
>>>>we don't yet support today, and probably will need to think about much
>>>>sooner than 128-bit filesystems....
>>>>        
>>>>
>>>As far as the kernel is concerned this hasn't been a problem in a while
>>>(2.4.early). The iget4 operation that was introduced by reiserfs (now
>>>iget5) pretty much makes it possible for a filesystem to use anything to
>>>identify it's inodes. The 32-bit inode numbers are simply used as a hash
>>>index.
>>>      
>>>
>>iget4 wasn't even strictly necessary, unless you want to use the inode
>>cache (which has always been strictly optional for filesystems, even
>>inode-based ones) --- Linux's VFS is dentry-based, not inode-based, so
>>we don't use inode numbers to index much of anything inside the
>>kernel, other than the aforementioned optional inode cache.
>>    
>>
>
>Ah yes, you're right.
>
>  
>
>>The main issue is the lack of a 64-bit interface to extract inode
>>numbers, which is needed as you point out for userspace archiving
>>tools like tar. There are also other programs or protocols that in the
>>past have broken as a result of inode number collisions.
>>    
>>
>
>64-bit? Coda has been using 128-bit file identifiers for a while now.
>And I can imagine someone trying to plug something like git into the VFS
>might want to use 168-bits. Or even more for a CAS-based storage that
>identifies objects by their SHA256 or SHA512 checksum.
>
>On the other hand, any large scale distributed/cluster based file system
>probably will have some sort of snapshot based backup strategy as part
>of the file system design. Using tar to back up a couple of tera/peta
>bytes just seems like asking for trouble, even keeping track of the
>possible hardlinks by remembering previously seen inode numbers over
>vast amounts of files will become difficult at some point.
>
>  
>
>>As another example, a quick google search indicates that the some mail
>>programs can use inode numbers as a part of a technique to create
>>unique filenames in maildir directories.  One could easily also
>>    
>>
>
>Hopefully it is only part of the technique. Like combining it with
>grabbing a timestamp, the hostname/MAC address where the operation
>occurred, etc.
>
>  
>
>>imagine using inode numbers as part of creating unique ids returned by
>>an IMAP server --- not something I would recommend, but it's an
>>example of what some people might have done, since everybody _knows_
>>they can count on inode numbers on Unix systems, right?  POSIX
>>promises that they won't break!
>>    
>>
>
>Under limited conditions. Not sure how stable/unique 32-bit inode
>numbers are on NFS clients, taking into account client-reboots, failing
>disks that are restored from tape, or when the file system reuses inode
>numbers of recently deleted files, etc. It doesn't matter how much
>stability and uniqueness POSIX demands, I simply can't see how it can be
>guaranteed in all cases.
>
>  
>
>>>The only thing that tends to break are userspace archiving tools like
>>>tar, which assume that 2 objects with the same 32-bit st_ino value are
>>>identical. I think that by now several actually double check that the
>>>inode linkcount is larger than 1.
>>>      
>>>
>>Um, that's not good enough to avoid failure modes; consider what might
>>happen if you have two inodes that have hardlinks, so that st_nlink >
>>1, but whose inode numbers are the same if you only look at the low 32
>>bits?  Oops.
>>
>>It's not a bad hueristic, if you don't have that many hard-linked
>>files on your system, but if you have a huge number of hard-linked
>>trees (such as you might find on a kernel developer with tons of
>>hard-linked trees), I wouldn't want to count on this always working.
>>    
>>
>
>Yeah, bad example for the typical case. But there must be some check to
>at least avoid problems when files are removed/created and the inode
>numbers are reused during a backup run.
>
>Jan
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/
>
>  
>
Someone needs to fix the mmap problems with some clever translation for
supporting huge files and filesystems beyond 16 TB. Increasing block
sizes will help (increase to 64K in the buffer cache). I have a lot of
input here and I am supporting huge data storage volumes at present with
the 32 bit version, but I have had to insert my own 64K management layer
to interface with the VFS and I have also had to put some restrictions
on file sizes. Packet-capture-based FS's can generate more data than any
of these traditional FS's do.

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 16:14                 ` Randy.Dunlap
@ 2005-11-22 16:38                   ` Steve Flynn
  0 siblings, 0 replies; 74+ messages in thread
From: Steve Flynn @ 2005-11-22 16:38 UTC (permalink / raw)
  Cc: linux-kernel

On 22/11/05, Randy.Dunlap <rdunlap@xenotime.net> wrote:
> On Tue, 22 Nov 2005, Bill Davidsen wrote:
> > Jeff V. Merkey wrote:
> > > Bernd Eckenfels wrote:
> > > Linux is currently limited to 16 TB per VFS mount point, it's all mute,
> > > unless VFS gets fixed.
> > > mmap won't go above this at present.
> > >
> > What does "it's all mute" mean?
>
> It means "it's all moot."

On the contrary, "all mute" is correct - indicating that it doesn't
really matter. All moot means it's open to debate, which is the
opposite of what Bernd meant.

I'll get back to lurking and being boggled by the stuff on the AtomChip website.
--
Steve
Despair - It's always darkest just before it goes pitch black...

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 16:17             ` Chris Adams
@ 2005-11-22 16:55               ` Anton Altaparmakov
  2005-11-22 17:18                 ` Theodore Ts'o
  2005-11-22 21:06                 ` Bill Davidsen
  2005-11-22 20:19               ` Alan Cox
  1 sibling, 2 replies; 74+ messages in thread
From: Anton Altaparmakov @ 2005-11-22 16:55 UTC (permalink / raw)
  To: Chris Adams; +Cc: linux-kernel

On Tue, 22 Nov 2005, Chris Adams wrote:
> Once upon a time, Jan Harkes <jaharkes@cs.cmu.edu> said:
> >The only thing that tends to break are userspace archiving tools like
> >tar, which assume that 2 objects with the same 32-bit st_ino value are
> >identical.
> 
> That assumption is probably made because that's what POSIX and Single
> Unix Specification define: "The st_ino and st_dev fields taken together
> uniquely identify the file within the system."  Don't blame code that
> follows standards for breaking.

The standards are insufficient, however.  For example, named streams or 
extended attributes, if exposed as "normal files", would naturally have 
the same st_ino (given they are the same inode as the normal file data) 
and st_dev fields.

> >I think that by now several actually double check that theinode
> >linkcount is larger than 1.
> 
> That is not a good check.  I could have two separate files that have
> multiple links; if st_ino is the same, how can tar make sense of it?

Now that is true.  In addition to checking that the link count is larger than 
1, they should check the file size and, if that matches, compute the SHA-1 
digest of the data (or the MD5 sum or whatever), and probably should also 
check the various stat fields for equality before bothering with the 
checksum of the file contents.
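
Something along these lines, say (a sketch only, with SHA-1 as the
example digest):

    import hashlib, os

    def _digest(path):
        with open(path, 'rb') as f:
            return hashlib.sha1(f.read()).hexdigest()

    def probably_same_file(a, b):
        sa, sb = os.stat(a), os.stat(b)
        cheap = (sa.st_size, sa.st_mode, sa.st_mtime) == (sb.st_size, sb.st_mode, sb.st_mtime)
        return cheap and _digest(a) == _digest(b)   # hash only if the cheap checks pass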

Or Linux just needs a backup API that programs like this can use to 
save/restore files.  (Analogous to the MS Backup API but hopefully 
less horrid...)

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 16:55               ` Anton Altaparmakov
@ 2005-11-22 17:18                 ` Theodore Ts'o
  2005-11-22 19:25                   ` Anton Altaparmakov
  2005-11-22 21:06                 ` Bill Davidsen
  1 sibling, 1 reply; 74+ messages in thread
From: Theodore Ts'o @ 2005-11-22 17:18 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Chris Adams, linux-kernel

On Tue, Nov 22, 2005 at 04:55:08PM +0000, Anton Altaparmakov wrote:
> > That assumption is probably made because that's what POSIX and Single
> > Unix Specification define: "The st_ino and st_dev fields taken together
> > uniquely identify the file within the system."  Don't blame code that
> > follows standards for breaking.
> 
> The standards are insufficient however.  For example dealing with named 
> streams or extended attributes if exposed as "normal files" would 
> naturally have the same st_ino (given they are the same inode as the 
> normal file data) and st_dev fields.

Um, but that's why even Solaris's openat(2) proposal doesn't expose
streams or extended attributes as "normal files".  The answer is that
you can't just expose named streams or extended attributes as "normal
files" without screwing yourself.

Also, I haven't checked to see what Solaris does, but technically
their UFS implementation does actually use separate inodes for their
named streams, so stat(2) could return separate inode numbers for the
named streams.  (In fact, if you take a Solaris UFS filesystem with
extended attributes, and run a Solaris 8 fsck on it, the directory
containing named streams/extended attributes will show up in
lost+found.)

						- Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 16:28             ` what is our " Theodore Ts'o
@ 2005-11-22 17:37               ` Jan Harkes
  2005-11-22 16:36                 ` Jeff V. Merkey
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Harkes @ 2005-11-22 17:37 UTC (permalink / raw)
  To: Theodore Ts'o, Christoph Hellwig, Jörn Engel, Alfred Brons,
	pocm, linux-kernel

On Tue, Nov 22, 2005 at 11:28:36AM -0500, Theodore Ts'o wrote:
> On Tue, Nov 22, 2005 at 10:25:31AM -0500, Jan Harkes wrote:
> > On Tue, Nov 22, 2005 at 09:50:47AM -0500, Theodore Ts'o wrote:
> > > I will note though that there are people who are asking for 64-bit
> > > inode numbers on 32-bit platforms, since 2**32 inodes are not enough
> > > for certain distributed/clustered filesystems.  And this is something
> > > we don't yet support today, and probably will need to think about much
> > > sooner than 128-bit filesystems....
> > 
> > As far as the kernel is concerned this hasn't been a problem in a while
> > (2.4.early). The iget4 operation that was introduced by reiserfs (now
> > iget5) pretty much makes it possible for a filesystem to use anything to
> > identify it's inodes. The 32-bit inode numbers are simply used as a hash
> > index.
> 
> iget4 wasn't even strictly necessary, unless you want to use the inode
> cache (which has always been strictly optional for filesystems, even
> inode-based ones) --- Linux's VFS is dentry-based, not inode-based, so
> we don't use inode numbers to index much of anything inside the
> kernel, other than the aforementioned optional inode cache.

Ah yes, you're right.

> The main issue is the lack of a 64-bit interface to extract inode
> numbers, which is needed as you point out for userspace archiving
> tools like tar. There are also other programs or protocols that in the
> past have broken as a result of inode number collisions.

64-bit? Coda has been using 128-bit file identifiers for a while now.
And I can imagine someone trying to plug something like git into the VFS
might want to use 160 bits. Or even more for CAS-based storage that
identifies objects by their SHA-256 or SHA-512 checksum.

On the other hand, any large-scale distributed/cluster-based file system
probably will have some sort of snapshot-based backup strategy as part
of the file system design. Using tar to back up a couple of tera/peta
bytes just seems like asking for trouble; even keeping track of the
possible hardlinks by remembering previously seen inode numbers over
vast numbers of files will become difficult at some point.

> As another example, a quick google search indicates that some mail
> programs can use inode numbers as a part of a technique to create
> unique filenames in maildir directories.  One could easily also

Hopefully it is only part of the technique, e.g. combining it with a
timestamp, the hostname/MAC address where the operation occurred, etc.
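
(As a toy illustration only -- not the exact maildir naming rules --
something along these lines, where no single source such as an inode
number has to be unique on its own:)

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

/* Build a "unique enough" delivery filename from several independent
 * sources: seconds, microseconds, pid and hostname. */
int make_unique_name(char *buf, size_t len)
{
	struct timeval tv;
	char host[256];

	if (gettimeofday(&tv, NULL) != 0 ||
	    gethostname(host, sizeof(host)) != 0)
		return -1;
	host[sizeof(host) - 1] = '\0';
	return snprintf(buf, len, "%ld.M%06ldP%ld.%s",
			(long)tv.tv_sec, (long)tv.tv_usec,
			(long)getpid(), host);
}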

> imagine using inode numbers as part of creating unique ids returned by
> an IMAP server --- not something I would recommend, but it's an
> example of what some people might have done, since everybody _knows_
> they can count on inode numbers on Unix systems, right?  POSIX
> promises that they won't break!

Under limited conditions. I am not sure how stable/unique 32-bit inode
numbers are on NFS clients, taking into account client reboots, failing
disks that are restored from tape, the file system reusing inode
numbers of recently deleted files, etc. It doesn't matter how much
stability and uniqueness POSIX demands; I simply can't see how it can be
guaranteed in all cases.

> > The only thing that tends to break are userspace archiving tools like
> > tar, which assume that 2 objects with the same 32-bit st_ino value are
> > identical. I think that by now several actually double check that the
> > inode linkcount is larger than 1.
> 
> Um, that's not good enough to avoid failure modes; consider what might
> happen if you have two inodes that have hardlinks, so that st_nlink >
> 1, but whose inode numbers are the same if you only look at the low 32
> bits?  Oops.
>
> It's not a bad heuristic, if you don't have that many hard-linked
> files on your system, but if you have a huge number of hard-linked
> trees (such as you might find on a kernel developer with tons of
> hard-linked trees), I wouldn't want to count on this always working.

Yeah, bad example for the typical case. But there must be some check to
at least avoid problems when files are removed/created and the inode
numbers are reused during a backup run.

Jan

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22  6:34             ` Rob Landley
@ 2005-11-22 19:05               ` Pavel Machek
  0 siblings, 0 replies; 74+ messages in thread
From: Pavel Machek @ 2005-11-22 19:05 UTC (permalink / raw)
  To: Rob Landley; +Cc: Tarkan Erimer, linux-kernel, Diego Calleja

Hi!

> > > Sun is proposing it can predict what storage layout will be efficient for
> > > as yet unheard of quantities of data, with unknown access patterns, at
> > > least a couple decades from now.  It's also proposing that data
> > > compression and checksumming are the filesystem's job.  Hands up anybody
> > > who spots conflicting trends here already?  Who thinks the 128 bit
> > > requirement came from marketing rather than the engineers?
> >
> > Actually, if you are storing information in single protons, I'd say
> > you _need_ checksumming :-).
> 
> You need error correcting codes at the media level.  A molecular storage 
> system like this would probably look a lot more like flash or dram than it 
> would magnetic media.  (For one thing, I/O bandwidth and seek times become a 
> serious bottleneck with high density single point of access systems.)
> 
> > [I actually agree with Sun here, not trusting disk is good idea. At
> > least you know kernel panic/oops/etc can't be caused by bit corruption on
> > the disk.]
> 
> But who said the filesystem was the right level to do this at?

Filesystem level may not be the best level to do it at, but doing it
at all is still better than the current state of the art. Doing it at
the media level is not enough, because then you still get corruption
introduced on the IDE cable or by driver bugs, etc.

The DM layer might be a better place to do checksums, but perhaps the
filesystem can do it more efficiently (it knows its own access
patterns), and it is definitely easier to set up for the end user.

If you want compression anyway (and you do want it -- for performance
reasons, if you are working with big texts or geographical data),
doing checksums at the same level just makes sense.
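
(A toy illustration of the idea -- one checksum per block, stored out of
band and verified on read; just a sketch, nothing to do with any
existing DM target or filesystem:)

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Trivial Adler/Fletcher-style checksum -- a stand-in for a real CRC. */
static uint32_t block_csum(const uint8_t *data, size_t len)
{
	uint32_t a = 1, b = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		a = (a + data[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}

/* On write: store the checksum alongside (not inside) the block. */
void write_block(uint8_t *disk_block, uint32_t *csum_slot,
		 const uint8_t *data)
{
	memcpy(disk_block, data, BLOCK_SIZE);
	*csum_slot = block_csum(data, BLOCK_SIZE);
}

/* On read: recompute and compare; a mismatch means the block was
 * corrupted somewhere between here and the media. */
int read_block(const uint8_t *disk_block, uint32_t csum_slot,
	       uint8_t *data)
{
	memcpy(data, disk_block, BLOCK_SIZE);
	return block_csum(data, BLOCK_SIZE) == csum_slot ? 0 : -1;
}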
								Pavel
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 17:18                 ` Theodore Ts'o
@ 2005-11-22 19:25                   ` Anton Altaparmakov
  2005-11-22 19:52                     ` Theodore Ts'o
  0 siblings, 1 reply; 74+ messages in thread
From: Anton Altaparmakov @ 2005-11-22 19:25 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Chris Adams, linux-kernel

On Tue, 22 Nov 2005, Theodore Ts'o wrote:
> On Tue, Nov 22, 2005 at 04:55:08PM +0000, Anton Altaparmakov wrote:
> > > That assumption is probably made because that's what POSIX and Single
> > > Unix Specification define: "The st_ino and st_dev fields taken together
> > > uniquely identify the file within the system."  Don't blame code that
> > > follows standards for breaking.
> > 
> > The standards are insufficient however.  For example dealing with named 
> > streams or extended attributes if exposed as "normal files" would 
> > naturally have the same st_ino (given they are the same inode as the 
> > normal file data) and st_dev fields.
> 
> Um, but that's why even Solaris's openat(2) proposal doesn't expose
> streams or extended attributes as "normal files".  The answer is that
> you can't just expose named streams or extended attributes as "normal
> files" without screwing yourself.

Reiser4 does I believe...

> Also, I haven't checked to see what Solaris does, but technically
> their UFS implementation does actually use separate inodes for their
> named streams, so stat(2) could return separate inode numbers for the
> named streams.  (In fact, if you take a Solaris UFS filesystem with
> extended attributes, and run a Solaris 8 fsck on it, the directory
> containing named streams/extended attributes will show up in
> lost+found.)

I was not talking about Solaris/UFS.  NTFS has named streams and extended 
attributes and both are stored as separate attribute records inside the 
same inode as the data attribute.  (A bit simplified as multiple inodes 
can be in use for one "file" when an inode's attributes become large than 
an inode - in that case attributes are either moved whole to a new inode 
and/or are chopped up in bits and each bit goes to a different inode.)

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 19:25                   ` Anton Altaparmakov
@ 2005-11-22 19:52                     ` Theodore Ts'o
  2005-11-22 20:00                       ` Anton Altaparmakov
  2005-11-22 21:14                       ` Bill Davidsen
  0 siblings, 2 replies; 74+ messages in thread
From: Theodore Ts'o @ 2005-11-22 19:52 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Chris Adams, linux-kernel

On Tue, Nov 22, 2005 at 07:25:20PM +0000, Anton Altaparmakov wrote:
> > > The standards are insufficient however.  For example dealing with named 
> > > streams or extended attributes if exposed as "normal files" would 
> > > naturally have the same st_ino (given they are the same inode as the 
> > > normal file data) and st_dev fields.
> > 
> > Um, but that's why even Solaris's openat(2) proposal doesn't expose
> > streams or extended attributes as "normal files".  The answer is that
> > you can't just expose named streams or extended attributes as "normal
> > files" without screwing yourself.
> 
> Reiser4 does I believe...

Reiser4 violates POSIX.  News at 11....

> I was not talking about Solaris/UFS.  NTFS has named streams and extended 
> attributes and both are stored as separate attribute records inside the 
> same inode as the data attribute.  (A bit simplified as multiple inodes 
> can be in use for one "file" when an inode's attributes become larger than 
> an inode - in that case attributes are either moved whole to a new inode 
> and/or are chopped up in bits and each bit goes to a different inode.)

NTFS violates POSIX.  News at 11....

							- Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 20:19               ` Alan Cox
@ 2005-11-22 19:56                 ` Chris Adams
  2005-11-22 21:19                   ` Bill Davidsen
  2005-11-23 19:20                   ` Generation numbers in stat was Re: what is slashdot's " Andi Kleen
  0 siblings, 2 replies; 74+ messages in thread
From: Chris Adams @ 2005-11-22 19:56 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

Once upon a time, Alan Cox <alan@lxorguk.ukuu.org.uk> said:
> It was a nice try but there is a giant gotcha most people forget. It's
> only safe to make this assumption while you have all of the
> files/directories in question open.

Tru64 adds a "st_gen" field to struct stat.  It is an unsigned int that
is a "generation" counter for a particular inode.  To get a collision
while creating and removing files, you'd have to remove and create a
file with the same inode 2^32 times while tar (or whatever) is running.
Here's what stat(2) says:

  Two structure members in <sys/stat.h> uniquely identify a file in a file
  system: st_ino, the file serial number, and st_dev, the device id for the
  directory that contains the file.

  [Tru64 UNIX]  However, in the rare case when a user application has been
  deleting open files, and a file serial number is reused, a third structure
  member in <sys/stat.h>, the file generation number, is needed to uniquely
  identify a file. This member, st_gen, is used in addition to st_ino and
  st_dev.
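
(A sketch of what that buys an archiver on a system whose struct stat
carries st_gen -- Tru64/BSD-style, not Linux's struct stat; illustration
only:)

#include <stdbool.h>
#include <sys/stat.h>

/* File identity check using the generation number; survives the case
 * where an inode number is freed and reused during the backup run. */
bool same_file(const struct stat *a, const struct stat *b)
{
	return a->st_dev == b->st_dev &&
	       a->st_ino == b->st_ino &&
	       a->st_gen == b->st_gen;
}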

-- 
Chris Adams <cmadams@hiwaay.net>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 19:52                     ` Theodore Ts'o
@ 2005-11-22 20:00                       ` Anton Altaparmakov
  2005-11-22 23:02                         ` Theodore Ts'o
  2005-11-22 21:14                       ` Bill Davidsen
  1 sibling, 1 reply; 74+ messages in thread
From: Anton Altaparmakov @ 2005-11-22 20:00 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Chris Adams, linux-kernel

On Tue, 22 Nov 2005, Theodore Ts'o wrote:
> On Tue, Nov 22, 2005 at 07:25:20PM +0000, Anton Altaparmakov wrote:
> > > > The standards are insufficient however.  For example dealing with named 
> > > > streams or extended attributes if exposed as "normal files" would 
> > > > naturally have the same st_ino (given they are the same inode as the 
> > > > normal file data) and st_dev fields.
> > > 
> > > Um, but that's why even Solaris's openat(2) proposal doesn't expose
> > > streams or extended attributes as "normal files".  The answer is that
> > > you can't just expose named streams or extended attributes as "normal
> > > files" without screwing yourself.
> > 
> > Reiser4 does I believe...
> 
> Reiser4 violates POSIX.  News at 11....
> 
> > I was not talking about Solaris/UFS.  NTFS has named streams and extended 
> > attributes and both are stored as separate attribute records inside the 
> > same inode as the data attribute.  (A bit simplified as multiple inodes 
> > can be in use for one "file" when an inode's attributes become larger than 
> > an inode - in that case attributes are either moved whole to a new inode 
> > and/or are chopped up in bits and each bit goes to a different inode.)
> 
> NTFS violates POSIX.  News at 11....

What is your point?  I personally couldn't care less about POSIX (or any 
other similarly old-fashioned standards for that matter).  What counts is 
reality and having a working system that does what I want/need it to do.  
If that means violating POSIX, so be it.  I am not going to bury my head 
in the sand just because POSIX says "you can't do that".  Utilities can be 
taught to work with the system instead of blindly following standards.  

And anyway the Linux kernel defies POSIX left, right, and centre, so if you 
care that much you ought to be off fixing all those violations...  (-;

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 16:09                 ` Jeff V. Merkey
@ 2005-11-22 20:16                   ` Bill Davidsen
  0 siblings, 0 replies; 74+ messages in thread
From: Bill Davidsen @ 2005-11-22 20:16 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: linux-kernel

Jeff V. Merkey wrote:
> Bill Davidsen wrote:
> 
>> Jeff V. Merkey wrote:
>>
>>> Bernd Eckenfels wrote:
>>>
>>>> In article <200511211252.04217.rob@landley.net> you wrote:
>>>>
>>>>
>>>>> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS. 
>>>>> Python says 2**64 is 18446744073709551616, and that's roughly:
>>>>> 18,446,744,073,709,551,616 bytes
>>>>> 18,446,744,073,709 megs
>>>>> 18,446,744,073 gigs
>>>>> 18,446,744 terabytes
>>>>> 18,446 ... what are those, pedabytes (petabytes?)
>>>>> 18 zetabytes
>>>>>
>>> There you go. I deal with this a lot so, those are the names.
>>>
>>> Linux is currently limited to 16 TB per VFS mount point, it's all 
>>> mute, unless VFS gets fixed.
>>> mmap won't go above this at present.
>>>
>> What does "it's all mute" mean?
>>
> Should be spelled "moot". It's a legal term that means "it doesn't matter".

Yes, I am well aware of what moot means, had you used that.


-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 16:17             ` Chris Adams
  2005-11-22 16:55               ` Anton Altaparmakov
@ 2005-11-22 20:19               ` Alan Cox
  2005-11-22 19:56                 ` Chris Adams
  1 sibling, 1 reply; 74+ messages in thread
From: Alan Cox @ 2005-11-22 20:19 UTC (permalink / raw)
  To: Chris Adams; +Cc: linux-kernel

On Maw, 2005-11-22 at 10:17 -0600, Chris Adams wrote:
> That assumption is probably made because that's what POSIX and Single
> Unix Specification define: "The st_ino and st_dev fields taken together
> uniquely identify the file within the system."  Don't blame code that
> follows standards for breaking.

It was a nice try but there is a giant gotcha most people forget. It's
only safe to make this assumption while you have all of the
files/directories in question open.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 16:55               ` Anton Altaparmakov
  2005-11-22 17:18                 ` Theodore Ts'o
@ 2005-11-22 21:06                 ` Bill Davidsen
  1 sibling, 0 replies; 74+ messages in thread
From: Bill Davidsen @ 2005-11-22 21:06 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: linux-kernel

Anton Altaparmakov wrote:
> On Tue, 22 Nov 2005, Chris Adams wrote:
> 
>>Once upon a time, Jan Harkes <jaharkes@cs.cmu.edu> said:
>>
>>>The only thing that tends to break are userspace archiving tools like
>>>tar, which assume that 2 objects with the same 32-bit st_ino value are
>>>identical.
>>
>>That assumption is probably made because that's what POSIX and Single
>>Unix Specification define: "The st_ino and st_dev fields taken together
>>uniquely identify the file within the system."  Don't blame code that
>>follows standards for breaking.
> 
> 
> The standards are insufficient however.  For example dealing with named 
> streams or extended attributes if exposed as "normal files" would 
> naturally have the same st_ino (given they are the same inode as the 
> normal file data) and st_dev fields.
> 
> 
>>>I think that by now several actually double check that the inode
>>>linkcount is larger than 1.
>>
>>That is not a good check.  I could have two separate files that have
>>multiple links; if st_ino is the same, how can tar make sense of it?
> 
> 
> Now that is true.  In addition to checking that the link count is larger 
> than 1, they should check the file size and, if that matches, compute the 
> SHA-1 digest of the data (or the MD5 sum or whatever); they should probably 
> also check the various stat fields for equality before bothering with the 
> checksum of the file contents.
> 
> Or Linux just needs a backup API that programs like this can use to 
> save/restore files.  (Analogous to the MS Backup API but hopefully 
> less horrid...)
> 
In order to prevent the problems mentioned AND satisfy SuS, I would 
think that the st_dev field is the value which should be unique, 
which is not always the case currently. The st_ino is a file id within 
st_dev, and it would be less confusing if the inode numbers on each 
st_dev were unique. Not to mention that some backup programs do look at 
st_dev and could be mightily confused if its meaning is not deterministic.

Historical application usage assumes that it is invariant; many 
applications were written before pluggable devices and network mounts. 
In a perfect world where nothing broke when things were changed, a 
filesystem would carry some UUID so that it looks the same whether 
mounted over the network, by direct mount, loopback mount, etc., and 
there would be no confusion.

A backup API would really be nice if it could somehow provide some 
unique ID, such that a network or direct backup of the same data would 
have the same IDs.
-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 19:52                     ` Theodore Ts'o
  2005-11-22 20:00                       ` Anton Altaparmakov
@ 2005-11-22 21:14                       ` Bill Davidsen
  1 sibling, 0 replies; 74+ messages in thread
From: Bill Davidsen @ 2005-11-22 21:14 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Chris Adams, linux-kernel

Theodore Ts'o wrote:
> On Tue, Nov 22, 2005 at 07:25:20PM +0000, Anton Altaparmakov wrote:
> 
>>>>The standards are insufficient however.  For example dealing with named 
>>>>streams or extended attributes if exposed as "normal files" would 
>>>>naturally have the same st_ino (given they are the same inode as the 
>>>>normal file data) and st_dev fields.
>>>
>>>Um, but that's why even Solaris's openat(2) proposal doesn't expose
>>>streams or extended attributes as "normal files".  The answer is that
>>>you can't just expose named streams or extended attributes as "normal
>>>files" without screwing yourself.
>>
>>Reiser4 does I believe...
> 
> 
> Reiser4 violates POSIX.  News at 11....
> 
> 
>>I was not talking about Solaris/UFS.  NTFS has named streams and extended 
>>attributes and both are stored as separate attribute records inside the 
>>same inode as the data attribute.  (A bit simplified as multiple inodes 
>>can be in use for one "file" when an inode's attributes become larger than 
>>an inode - in that case attributes are either moved whole to a new inode 
>>and/or are chopped up in bits and each bit goes to a different inode.)
> 
> 
> NTFS violates POSIX.  News at 11....
> 
True, but perhaps in this case it's time for POSIX to move; the things 
stored in filesystems, and the things used as filesystems, have changed a 
bunch.

It would be nice to have a neutral standard rather than adopting 
existing extended implementations, simply because the politics are 
that everyone but MS would hate NTFS, MS would hate any of the existing 
others, and a new standard would have the same impact on everyone and 
therefore might be viable. Not a quick fix, however; standards take a 
LONG time.
-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 19:56                 ` Chris Adams
@ 2005-11-22 21:19                   ` Bill Davidsen
  2005-11-23 19:20                   ` Generation numbers in stat was Re: what is slashdot's " Andi Kleen
  1 sibling, 0 replies; 74+ messages in thread
From: Bill Davidsen @ 2005-11-22 21:19 UTC (permalink / raw)
  To: Chris Adams; +Cc: linux-kernel, alan

Chris Adams wrote:
> Once upon a time, Alan Cox <alan@lxorguk.ukuu.org.uk> said:
> 
>>It was a nice try but there is a giant gotcha most people forget. It's
>>only safe to make this assumption while you have all of the
>>files/directories in question open.
> 

Right; at the time the structures were created, removable (in any sense) 
media usually meant 1/2 inch mag tape, not block storage. The inode was 
pretty well set by SysIII, IIRC.
> 
> Tru64 adds a "st_gen" field to struct stat.  It is an unsigned int that
> is a "generation" counter for a particular inode.  To get a collision
> while creating and removing files, you'd have to remove and create a
> file with the same inode 2^32 times while tar (or whatever) is running.
> Here's what stat(2) says:
> 
>   Two structure members in <sys/stat.h> uniquely identify a file in a file
>   system: st_ino, the file serial number, and st_dev, the device id for the
>   directory that contains the file.
> 
>   [Tru64 UNIX]  However, in the rare case when a user application has been
>   deleting open files, and a file serial number is reused, a third structure
>   member in <sys/stat.h>, the file generation number, is needed to uniquely
>   identify a file. This member, st_gen, is used in addition to st_ino and
>   st_dev.
> 
Shades of VMS! Of course that idea is not unique; I believe iso9660 (CD) 
has versioning, which is almost never used.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-22 20:00                       ` Anton Altaparmakov
@ 2005-11-22 23:02                         ` Theodore Ts'o
  0 siblings, 0 replies; 74+ messages in thread
From: Theodore Ts'o @ 2005-11-22 23:02 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Chris Adams, linux-kernel

On Tue, Nov 22, 2005 at 08:00:58PM +0000, Anton Altaparmakov wrote:
> 
> What is your point?  I personally couldn't care less about POSIX (or any 
> other simillarly old-fashioned standards for that matter).  What counts is 
> reality and having a working system that does what I want/need it to do.  
> If that means violating POSIX, so be it.  I am not going to burry my head 
> in the sand just because POSIX says "you can't do that".  Utilities can be 
> taught to work with the system instead of blindly following standards.  

Finding all of the utilities and userspace applications that depend on
some specific POSIX behavior is hard, and convincing them to change,
instead of fixing the buggy OS, is even harder.  But that's OK; no one
has to use your filesystem (or operating system) if it deviates from the
standards enough that your applications start breaking.

> And anyway the Linux kernel defies POSIX left, right, and centre so if you 
> care that much you ought to be off fixing all those violations...  (-;

Um, where?  Actually, we're pretty close, and we often spend quite a
bit of time fixing places where we don't conform to the standards
correctly.  Look at all of the work that's gone into the kernel to
make Linux's threads support POSIX compliant, for example.  We did
*not* tell everyone to go rewrite their applications to use
LinuxThreads, even if certain aspects of POSIX threads are a little
brain-damaged.  

					- Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Generation numbers in stat was Re: what is slashdot's answer to ZFS?
  2005-11-22 19:56                 ` Chris Adams
  2005-11-22 21:19                   ` Bill Davidsen
@ 2005-11-23 19:20                   ` Andi Kleen
  2005-11-24  5:15                     ` Chris Adams
  1 sibling, 1 reply; 74+ messages in thread
From: Andi Kleen @ 2005-11-23 19:20 UTC (permalink / raw)
  To: Chris Adams; +Cc: linux-kernel

Chris Adams <cmadams@hiwaay.net> writes:
> 
>   [Tru64 UNIX]  However, in the rare case when a user application has been
>   deleting open files, and a file serial number is reused, a third structure
>   member in <sys/stat.h>, the file generation number, is needed to uniquely
>   identify a file. This member, st_gen, is used in addition to st_ino and
>   st_dev.

Sounds like a cool idea. Many filesystems already maintain this information
in the kernel. We still have some unused pad space in the struct stat,
so it could be implemented without any compatibility issues
(e.g. in place of __pad0). On old kernels it would always be 0.
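
Roughly like this (a sketch only -- the remaining fields are elided and
the exact layout is from memory, so don't treat it as the verbatim
header):

/* Repurpose existing padding in the x86-64 struct stat for a
 * generation number; illustrative layout, not the real asm header. */
struct stat {
	unsigned long	st_dev;
	unsigned long	st_ino;
	unsigned long	st_nlink;

	unsigned int	st_mode;
	unsigned int	st_uid;
	unsigned int	st_gid;
	unsigned int	st_gen;	/* was: unsigned int __pad0; always 0 on old kernels */
	/* ... remaining fields unchanged ... */
};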

-Andi

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Generation numbers in stat was Re: what is slashdot's answer to ZFS?
  2005-11-23 19:20                   ` Generation numbers in stat was Re: what is slashdot's " Andi Kleen
@ 2005-11-24  5:15                     ` Chris Adams
  2005-11-24  8:47                       ` Andi Kleen
  0 siblings, 1 reply; 74+ messages in thread
From: Chris Adams @ 2005-11-24  5:15 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

Once upon a time, Andi Kleen <ak@suse.de> said:
> Chris Adams <cmadams@hiwaay.net> writes:
> >   [Tru64 UNIX]  However, in the rare case when a user application has been
> >   deleting open files, and a file serial number is reused, a third structure
> >   member in <sys/stat.h>, the file generation number, is needed to uniquely
> >   identify a file. This member, st_gen, is used in addition to st_ino and
> >   st_dev.
> 
> Sounds like a cool idea. Many fs already maintain this information
> in the kernel. We still had some unused pad space in the struct stat
> so it could be implemented without any compatibility issues 
> (e.g. in place of __pad0). On old kernels it would be always 0.

Searching around some, I see that OS X has st_gen, but the man page I
found says it is only available to the super-user.  It also appears that
AIX and at least some of the BSDs have it (which would make sense I
guess, as Tru64, OS X, and IIRC AIX are all BSD-derived).

Also, I see someone pitched it to linux-kernel several years ago but it
didn't appear to go anywhere.  Maybe time to rethink that?
-- 
Chris Adams <cmadams@hiwaay.net>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Generation numbers in stat was Re: what is slashdot's answer to ZFS?
  2005-11-24  5:15                     ` Chris Adams
@ 2005-11-24  8:47                       ` Andi Kleen
  0 siblings, 0 replies; 74+ messages in thread
From: Andi Kleen @ 2005-11-24  8:47 UTC (permalink / raw)
  To: Chris Adams; +Cc: Andi Kleen, linux-kernel

> Also, I see someone pitched it to linux-kernel several years ago but it
> didn't appear to go anywhere.  Maybe time to rethink that?

It just needs someone to post a patch.

-Andi

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-21 10:19     ` Jörn Engel
                         ` (2 preceding siblings ...)
  2005-11-22  7:51       ` Christoph Hellwig
@ 2005-11-28 12:53       ` Lars Marowsky-Bree
  2005-11-29  5:04         ` Theodore Ts'o
  3 siblings, 1 reply; 74+ messages in thread
From: Lars Marowsky-Bree @ 2005-11-28 12:53 UTC (permalink / raw)
  To: linux-kernel

On 2005-11-21T11:19:59, Jörn Engel <joern@wohnheim.fh-wedel.de> wrote:

> o Merge of LVM and filesystem layer
>   Not done.  This has some advantages, but also more complexity than
>   separate LVM and filesystem layers.  Might be considered "not worth
>   it" for some years.

This is one of the cooler ideas IMHO. In effect, LVM is just a special
case of a filesystem - huge blocksizes, few files, mostly no directories,
and it exports block instead of character/stream "files".

Why do we need to implement a clustered LVM as well as a clustered
filesystem? Because we can't reuse code across this boundary and can't
stack "real" filesystems, so we need a pseudo-layer we call volume
management. And then, if by accident we need a block device from a
filesystem again, we get to use loop devices. Does that make sense? Not
really.

(Same as the distinction between character and block devices in the
kernel.)

Look at how people want to use Xen: host the images on OCFS2/GFS backing
stores. In effect, this uses the CFS as a cluster-enabled volume
manager.

If they were better integrated (i.e. able to stack filesystems), we
could snapshot/RAID single files (or ultimately, even directory trees)
just like today we can snapshot whole block devices.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-28 12:53       ` Lars Marowsky-Bree
@ 2005-11-29  5:04         ` Theodore Ts'o
  2005-11-29  5:57           ` Willy Tarreau
                             ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Theodore Ts'o @ 2005-11-29  5:04 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: linux-kernel

On Mon, Nov 28, 2005 at 01:53:51PM +0100, Lars Marowsky-Bree wrote:
> On 2005-11-21T11:19:59, Jörn Engel <joern@wohnheim.fh-wedel.de> wrote:
> 
> > o Merge of LVM and filesystem layer
> >   Not done.  This has some advantages, but also more complexity than
> >   separate LVM and filesystem layers.  Might be considered "not worth
> >   it" for some years.
> 
> This is one of the cooler ideas IMHO. In effect, LVM is just a special
> case filesystem - huge blocksizes, few files, mostly no directories,
> exports block instead of character/streams "files".

This isn't actually a new idea, BTW.  Digital's advfs had storage
pools and the ability to have a single advfs filesystem span multiple
volumes, and to have multiple advfs filesystems share a storage pool,
something like ten years ago.  Something to keep in mind for those
people looking for prior art for any potential Sun patents covering
ZFS.... (not that I am giving legal advice, of course!)

						- Ted


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-29  5:04         ` Theodore Ts'o
@ 2005-11-29  5:57           ` Willy Tarreau
  2005-11-29 14:42             ` John Stoffel
  2005-11-29 13:58           ` Andi Kleen
  2005-11-29 16:03           ` Chris Adams
  2 siblings, 1 reply; 74+ messages in thread
From: Willy Tarreau @ 2005-11-29  5:57 UTC (permalink / raw)
  To: Theodore Ts'o, Lars Marowsky-Bree, linux-kernel

On Tue, Nov 29, 2005 at 12:04:39AM -0500, Theodore Ts'o wrote:
> On Mon, Nov 28, 2005 at 01:53:51PM +0100, Lars Marowsky-Bree wrote:
> > On 2005-11-21T11:19:59, Jörn Engel <joern@wohnheim.fh-wedel.de> wrote:
> > 
> > > o Merge of LVM and filesystem layer
> > >   Not done.  This has some advantages, but also more complexity than
> > >   separate LVM and filesystem layers.  Might be considered "not worth
> > >   it" for some years.
> > 
> > This is one of the cooler ideas IMHO. In effect, LVM is just a special
> > case filesystem - huge blocksizes, few files, mostly no directories,
> > exports block instead of character/streams "files".
> 
> This isn't actually a new idea, BTW.  Digital's advfs had storage
> pools and the ability to have a single advfs filesystem span multiple
> volumes, and to have multiple advfs filesystems share a storage pool,
> something like ten years ago.  Something to keep in mind for those
> people looking for prior art for any potential Sun patents covering
> ZFS.... (not that I am giving legal advice, of course!)
> 
> 						- Ted

Having played a few months with a machine installed with advfs, I
can say that I *loved* this FS. It could be hot-resized, mounted
into several places at once (a bit like we can do now with --bind),
and best of all, it was by far the fastest FS I had ever seen. I
think that the 512 MB cache for the metadata helped a lot ;-)

Regards,
Willy


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-29  5:04         ` Theodore Ts'o
  2005-11-29  5:57           ` Willy Tarreau
@ 2005-11-29 13:58           ` Andi Kleen
  2005-11-29 16:03           ` Chris Adams
  2 siblings, 0 replies; 74+ messages in thread
From: Andi Kleen @ 2005-11-29 13:58 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel

Theodore Ts'o <tytso@mit.edu> writes:

> On Mon, Nov 28, 2005 at 01:53:51PM +0100, Lars Marowsky-Bree wrote:
> > On 2005-11-21T11:19:59, Jörn Engel <joern@wohnheim.fh-wedel.de> wrote:
> > 
> > > o Merge of LVM and filesystem layer
> > >   Not done.  This has some advantages, but also more complexity than
> > >   separate LVM and filesystem layers.  Might be considered "not worth
> > >   it" for some years.
> > 
> > This is one of the cooler ideas IMHO. In effect, LVM is just a special
> > case filesystem - huge blocksizes, few files, mostly no directories,
> > exports block instead of character/streams "files".
> 
> This isn't actually a new idea, BTW.  Digital's advfs had storage
> pools and the ability to have a single advfs filesystem span multiple
> volumes, and to have multiple advfs filesystems share a storage pool,
> something like ten years ago.

The old JFS code base had something similar before it got ported
to Linux (I believe it came from OS/2). But it was removed.
And Miguel did a prototype of it with ext2 at some point long ago.

But to me it's unclear whether it's a really good idea. Having at least
the option to control where physical storage is placed is nice, especially
if you cannot mirror everything (ZFS seems to assume everything is
mirrored), and separate devices and LVM make that easier.

-Andi

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-29  5:57           ` Willy Tarreau
@ 2005-11-29 14:42             ` John Stoffel
  0 siblings, 0 replies; 74+ messages in thread
From: John Stoffel @ 2005-11-29 14:42 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Theodore Ts'o, Lars Marowsky-Bree, linux-kernel

>>>>> "Willy" == Willy Tarreau <willy@w.ods.org> writes:

Willy> Having played a few months with a machine installed with advfs,
Willy> I can say that I *loved* this FS. It could be hot-resized,
Willy> mounted into several places at once (a bit like we can do now
Willy> with --bind), and best of all, it was by far the fastest FS I
Willy> had ever seen. I think that the 512 MB cache for the metadata
Willy> helped a lot ;-)

It was a wonderful FS, but if you used a PrestoServer NFS accelerator
board on the system with 4 MB of RAM and forgot to actually enable the
battery, bad things happened when the system crashed... you got a nice
4 MB hole in the filesystem which caused wonderfully obtuse panics.
All the while the hardware kept insisting that the battery on the
NVRAM board was just fine... it turned out to be a hardware bug on the
NVRAM board, which screwed us completely.

Once that was solved, back in the Oct '93 time frame as I recall, the AdvFS
filesystem just ran and ran and ran.  Too bad DEC/Compaq/HP won't
release it nowadays....

John

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
  2005-11-29  5:04         ` Theodore Ts'o
  2005-11-29  5:57           ` Willy Tarreau
  2005-11-29 13:58           ` Andi Kleen
@ 2005-11-29 16:03           ` Chris Adams
  2 siblings, 0 replies; 74+ messages in thread
From: Chris Adams @ 2005-11-29 16:03 UTC (permalink / raw)
  To: linux-kernel

Once upon a time, Theodore Ts'o <tytso@mit.edu> said:
>This isn't actually a new idea, BTW.  Digital's advfs had storage
>pools and the ability to have a single advfs filesystem span multiple
>volumes, and to have multiple advfs filesystems share a storage pool,
>something like ten years ago.

A really nice feature of AdvFS is fileset-level snapshots.  For my Alpha
servers, I don't have to allocate disk space to snapshot storage; the
fileset uses free space within the fileset for changes while a snapshot
is active.  For my Linux servers using LVM, I have to leave a chunk of
space free in the volume group, make sure it is big enough, make sure
only one snapshot exists at a time (or make sure there's enough free
space for multiple snapshots), etc.

AdvFS is also fully integrated with TruCluster; when I started
clustering, I didn't have to change anything for most of my storage.

I will miss AdvFS when we turn off our Alphas for the last time (which
won't be far off I guess; final order date for an HP Alpha system is
less than a year away now).
-- 
Chris Adams <cmadams@hiwaay.net>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: what is our answer to ZFS?
@ 2005-11-24  1:52 art
  0 siblings, 0 replies; 74+ messages in thread
From: art @ 2005-11-24  1:52 UTC (permalink / raw)
  To: linux-kernel; +Cc: alan

Check this, but remember that developers can be contaminated
and sued by Sun over patent stuff:

http://www.opensolaris.org/os/community/zfs/source/

xboom


^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2005-11-29 16:03 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-11-21  9:28 what is our answer to ZFS? Alfred Brons
2005-11-21  9:44 ` Paulo Jorge Matos
2005-11-21  9:59   ` Alfred Brons
2005-11-21 10:08     ` Bernd Petrovitsch
2005-11-21 10:16     ` Andreas Happe
2005-11-21 11:30       ` Anton Altaparmakov
2005-11-21 10:19     ` Jörn Engel
2005-11-21 11:46       ` Matthias Andree
2005-11-21 12:07         ` Kasper Sandberg
2005-11-21 13:18           ` Matthias Andree
2005-11-21 14:18             ` Kasper Sandberg
2005-11-21 14:41               ` Matthias Andree
2005-11-21 15:08                 ` Kasper Sandberg
2005-11-22  8:52                   ` Matthias Andree
2005-11-21 22:41               ` Bill Davidsen
2005-11-21 20:48             ` jdow
2005-11-22 11:17               ` Jörn Engel
2005-11-21 11:59       ` Diego Calleja
2005-11-22  7:51       ` Christoph Hellwig
2005-11-22 10:28         ` Jörn Engel
2005-11-22 14:50         ` Theodore Ts'o
2005-11-22 15:25           ` Jan Harkes
2005-11-22 16:17             ` Chris Adams
2005-11-22 16:55               ` Anton Altaparmakov
2005-11-22 17:18                 ` Theodore Ts'o
2005-11-22 19:25                   ` Anton Altaparmakov
2005-11-22 19:52                     ` Theodore Ts'o
2005-11-22 20:00                       ` Anton Altaparmakov
2005-11-22 23:02                         ` Theodore Ts'o
2005-11-22 21:14                       ` Bill Davidsen
2005-11-22 21:06                 ` Bill Davidsen
2005-11-22 20:19               ` Alan Cox
2005-11-22 19:56                 ` Chris Adams
2005-11-22 21:19                   ` Bill Davidsen
2005-11-23 19:20                   ` Generation numbers in stat was Re: what is slashdot's " Andi Kleen
2005-11-24  5:15                     ` Chris Adams
2005-11-24  8:47                       ` Andi Kleen
2005-11-22 16:28             ` what is our " Theodore Ts'o
2005-11-22 17:37               ` Jan Harkes
2005-11-22 16:36                 ` Jeff V. Merkey
2005-11-28 12:53       ` Lars Marowsky-Bree
2005-11-29  5:04         ` Theodore Ts'o
2005-11-29  5:57           ` Willy Tarreau
2005-11-29 14:42             ` John Stoffel
2005-11-29 13:58           ` Andi Kleen
2005-11-29 16:03           ` Chris Adams
2005-11-21 11:45     ` Diego Calleja
2005-11-21 14:19       ` Tarkan Erimer
2005-11-21 18:52         ` Rob Landley
2005-11-21 19:28           ` Diego Calleja
2005-11-21 20:02           ` Bernd Petrovitsch
2005-11-22  5:42             ` Rob Landley
2005-11-22  9:25               ` Matthias Andree
2005-11-21 23:05           ` Bill Davidsen
2005-11-22  0:15           ` Bernd Eckenfels
2005-11-21 22:59             ` Jeff V. Merkey
2005-11-22  7:45               ` Christoph Hellwig
2005-11-22  9:19                 ` Jeff V. Merkey
2005-11-22 16:00               ` Bill Davidsen
2005-11-22 16:09                 ` Jeff V. Merkey
2005-11-22 20:16                   ` Bill Davidsen
2005-11-22 16:14                 ` Randy.Dunlap
2005-11-22 16:38                   ` Steve Flynn
2005-11-22  7:15             ` Rob Landley
2005-11-22  8:16               ` Bernd Eckenfels
2005-11-22  0:45           ` Pavel Machek
2005-11-22  6:34             ` Rob Landley
2005-11-22 19:05               ` Pavel Machek
2005-11-22  9:20           ` Matthias Andree
2005-11-22 10:00             ` Tarkan Erimer
2005-11-22 15:46               ` Jan Dittmer
2005-11-22 16:27               ` Bill Davidsen
2005-11-21 18:17       ` Rob Landley
2005-11-24  1:52 art

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).