linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jack Stone <jack@hawkeye.stone.uk.eu.org>
To: Kyle Moffett <mrmacman_g4@mac.com>
Cc: Bryan Henderson <hbryan@us.ibm.com>,
	akpm@linux-foundation.org, alan <alan@clueserver.org>,
	hpa@zytor.com, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, viro@zeniv.linux.org.uk,
	git@vger.kernel.org
Subject: Re: Versioning file system
Date: Tue, 19 Jun 2007 08:49:14 +0100	[thread overview]
Message-ID: <46778A7A.1080403@hawkeye.stone.uk.eu.org> (raw)
In-Reply-To: <6E9A6F9E-8948-40F2-9129-1F1491D49D83@mac.com>

Kyle Moffett wrote:
> On Jun 18, 2007, at 13:56:05, Bryan Henderson wrote:
>>> The question remains is where to implement versioning: directly in
>>> individual filesystems or in the vfs code so all filesystems can use it?
>>
>> Or not in the kernel at all.  I've been doing versioning of the types
>> I described for years with user space code and I don't remember
>> feeling that I compromised in order not to involve the kernel.
>>
>> Of course, if you want to do it with snapshots and COW, you'll have to
>> ask where in the kernel to put that, but that's not a file versioning
>> question; it's the larger snapshot question.
> 
> What I think would be particularly interesting in this domain is
> something similar in concept to GIT, except in a file-system:
>   1) Redundancy is easy, you just ensure that you have at least "N"
> distributed copies of each object, where "N" is some function of the
> object itself.
>   2) Network replication is easy, you look up objects based on the SHA-1
> stored in the parent directory entry and cache them where needed (IE:
> make the "N" function above dynamic based on frequency of access on a
> given computer).
>   3) Snapshots are easy and cheap; an RO snapshot is a tag and an RW
> snapshot is a branch.  These can be easily converted between.
>   4) Compression is easy; you can compress objects based on any
> arbitrary configurable criteria and the filesystem will record whether
> or not an object is compressed.  You can also compress differently when
> archiving objects to secondary storage.
>   5) Streaming fsck-like verification is easy; ensure the hash name
> field matches the actual hash of the object.
>   6) Fsck is easy since rollback is trivial, you can always revert to an
> older tree to boot and start up services before attempting resurrection
> of lost objects and trees in the background.
>   7) Multiple-drive or multiple-host storage pools are easy:  Think the
> git "alternates" file.
>   8) Network filesystem load-balancing is easy; SHA-1s are essentially
> random so you can just assign SHA-1 prefixes to different systems for
> data storage and your data is automatically split up.
> 
> 
> Other issues:
> 
> Q. How do you deal with block allocation?
> A. Same way other filesystems deal with block allocation. 
> Reference-counting gets tricky, especially across a network, but it's
> easy to play it safe with simple cross-network refcount-journalling. 
> Since the _only_ thing that needs journalling is the refcounts and
> block-free data, you need at most a megabyte or two of journal.  If in
> doubt, it's easy to play it safe and keep an extra refcount around for
> an in-the-background consistency check later on.  When networked-gitfs
> systems crash, you just assume they still have all the refcounts they
> had at the moment they died, and compare notes when they start back up
> again.  If a node has a cached copy of data on its local disk then it
> can just nonatomically increment the refcount for that object in its own
> RAM (ordered with respect to disk-flushes, of course) and tell its peers
> at some point.  A node should probably cache most of its working set on
> local disk for efficiency; it's trivially verified against updates from
> other nodes and provides an easy way to keep refcounts for such data. 
> If a node increments the refcount on such data and dies before getting
> that info out to its peers, then when it starts up again its peers will
> just be told that it has a "new" node with insufficient replication and
> they will clone it out again properly.  For networked
> refcount-increments you can do one of 2 things: (1) Tell at least X many
> peers and wait for them to sync the update out to disk, or (2) Get the
> object from any peer (at least one of whom hopefully has it in RAM) and
> save it to local disk with an increased refcount.
> 
> Q. How do you actually delete things?
> A. Just replace all the to-be-erased tree and commit objects before a
> specified point with "History erased" objects with their SHA-1's
> magically set to that of the erased objects.  If you want you may delete
> only the "tree" objects and leave the commits intact.  If you delete a
> whole linear segment of history then you can just use a single "History
> erased" commit object with its parent pointed to the object before the
> erased segment.  Probably needs some form of back-reference storage to
> make it efficient; not sure how expensive that would be.  This would
> allow making a bunch of snapshots and purging them logarithmically based
> on passage of time.  For instance, you might have snapshots of every 5
> minutes for the last hour, every 30 minutes for the last day, every 4
> hours for the last week, every day for the last month, once per week for
> the last year, once per month for the last 5 years, and once per year
> beyond that.
> 
> That's pretty impressive data-recovery resolution, and it accounts for
> only 200 unique commits after it's been running for 10 years.
> 
> Q. How do you archive data?
> A. Same as deleting, except instead of a "History erased" object you
> would use a "History archived" object with a little bit of string data
> to indicate which volume it's stored on (and where on the volume).  When
> you stick that volume into the system you could easily tell the kernel
> to use it as an alternate for the given storage group.
> 
> Q. What enforces data integrity?
> A. Ensure that a new tree object and its associated sub objects are on
> disk before you delete the old one.  Doesn't need any actual full syncs
> at all, just barriers.  If you replace the tree object before write-out
> is complete then just skip writing the old one and write the new one in
> its place.
> 
> Q. What consists of a "commit"?
> A. Anything the administrator wants to define it as.  Useful algorithms
> include: "Once per x Mbyte of page dirtying", "Once per 5 min", "Only
> when sync() or fsync() are called", "Only when gitfs-commit is called". 
> You could even combine them:  "Every x Mbyte of page dirtying or every 5
> minutes, whichever is shorter (or longer, depending on admin
> requirements)".  There would also be appropriate syscalls to trigger
> appropriate git-like behavior.  Network-accessible gitfs would want to
> have mechanisms to trigger commits based on activity on other systems
> (needs more thought).
> 
> Q. How do you access old versions?
> A. Mount another instance of the filesystem with an SHA-1 ID, a
> tag-name, or a branch-name in a special mount option.  Should be user
> accessible with some restrictions (needs more thought).
> 
> Q. How do you deal with conflicts on networked filesystems.
> A. Once again, however the administrator wants to deal with them.  Options:
>    1)  Forcibly create a new branch for the conflicted tree.
>    2)  Attempt to merge changes using the standard git-merge semantics
>    3)  Merge independent changes to different files and pick one for
> changes to the same file
>    4)  Your Algorithm Here(TM).  GIT makes it easy to extend
> conflict-resolution.
> 
> Q. How do you deal with little scattered changes in big (or sparse) files?
> A. Two questions, two answers:  For sparse files, git would need
> extending to understand (and hash) the nature of the sparse-ness.  For
> big files, you should be able to introduce a "compound-file" datatype
> and configure git to deal with specific X-Mbyte chunks of it
> independently.  This might not be a bad idea for native git as well. 
> Would need system-specific configuration.
> 
> Q. How do you prevent massive data consumption by spurious tiny changes
> A. You have a few options:
>    1)  Configure your commit algorithm as above to not commit so often
>    2)  Configure a stepped commit-discard algorithm as described above
> in the "How do you delete things" question
>    3)  Archive unused data to secondary storage more often
> 
> Q. What about all the unanswered questions?
> A. These are all the ones I could think of off the top of my head but
> there are at least a hundred more.  I'm pretty sure these are some of
> the most significant ones.
> 
> Q. That's a great idea and I'll implement it right away!
> A. Yay!  (but that's not a question :-D)  Good luck and happy hacking.
> 
> Q. That's a stupid idea and would never ever work!
> A. Thanks for your useful input! (but that's not a question either)  I'm
> sure anybody who takes up a project like this will consider such opinions.
> 
> Q. *flamage*
> A. I'm glad you have such strong opinions, feel free to to continue to
> spam my /dev/null device (and that's also not a question).
> 
> All opinions and comments welcomed.
> 
> Cheers,
> Kyle Moffett
> 
> 

It sounds brilliant and I'd love to have a got at implementing it but I
don't know enough (yet :-D) about how git works, a little research is
called for I think.

Jack

  reply	other threads:[~2007-06-19  7:55 UTC|newest]

Thread overview: 107+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <OFC119F96F.A721E445-ON882572FE.0055140A-882572FE.00571482@us.ibm.com>
2007-06-18 16:37 ` Versioning file system Jack Stone
2007-06-18 16:56   ` H. Peter Anvin
2007-06-18 17:56   ` Bryan Henderson
2007-06-19  3:10     ` Kyle Moffett
2007-06-19  7:49       ` Jack Stone [this message]
2007-06-19  7:58       ` Bron Gondwana
2007-06-20  2:43         ` Kyle Moffett
2007-06-19  9:09       ` Martin Langhoff
2007-06-19 16:52       ` Jakub Narebski
     [not found] <8wst3-3kh-31@gated-at.bofh.it>
     [not found] ` <8wsCC-3wf-21@gated-at.bofh.it>
     [not found]   ` <8wsW4-3UY-3@gated-at.bofh.it>
     [not found]     ` <8wJal-3KA-1@gated-at.bofh.it>
     [not found]       ` <8xm22-4Ql-1@gated-at.bofh.it>
     [not found]         ` <8xq5G-32l-7@gated-at.bofh.it>
     [not found]           ` <8xs7w-69W-21@gated-at.bofh.it>
2007-06-18 20:54             ` Bodo Eggert
2007-06-18 21:08               ` alan
2007-06-18 21:31                 ` H. Peter Anvin
2007-06-18 21:34                   ` Jeremy Allison
2007-06-18 22:10                   ` Theodore Tso
2007-06-18 22:26                     ` Jörn Engel
2007-06-18 21:24                       ` Brad Boyer
2007-06-19  3:15                         ` Kyle Moffett
2007-06-18 22:34                       ` Jeremy Allison
2007-06-18 22:56                       ` alan
2007-06-19  7:01                       ` Theodore Tso
2007-06-18 22:48                     ` Jeremy Allison
2007-06-18 23:00                       ` alan
2007-06-19  7:05                       ` Theodore Tso
2007-06-19 16:52                         ` Jeremy Allison
2007-06-19 16:56                           ` H. Peter Anvin
2007-06-18 22:47                   ` alan
2007-06-15 22:23 Jack Stone
2007-06-15 22:38 ` H. Peter Anvin
2007-06-15 22:51   ` alan
2007-06-15 22:59     ` H. Peter Anvin
2007-06-15 23:06       ` alan
2007-06-16  8:11     ` Jack Stone
2007-06-16  9:46       ` Jeffrey V. Merkey
2007-06-16 10:12         ` Jeffrey V. Merkey
2007-06-16 13:15           ` Mark Williamson
2007-06-16 19:57             ` Jeffrey V. Merkey
2007-06-16 16:49           ` Jan Harkes
2007-06-16 20:03             ` Jeffrey V. Merkey
2007-06-16 19:38               ` Jack Stone
2007-06-16 20:08               ` Alan Cox
2007-06-16 21:25                 ` Jeffrey V. Merkey
2007-06-16 20:39               ` Jan Harkes
2007-06-16 20:43                 ` Jack Stone
2007-06-16 22:17                 ` Alan Cox
2007-06-17  2:18                   ` Jeffrey V. Merkey
2007-06-17  2:39                     ` Jeffrey V. Merkey
2007-06-17 22:11                   ` Dale Amon
2007-06-16 21:06               ` Dale Amon
2007-06-16 11:42         ` Graham Murray
2007-06-16 14:53     ` Jörn Engel
2007-06-18  9:45       ` Andreas Dilger
2007-06-18  9:54         ` Jack Stone
2007-06-18 10:13         ` Jörn Engel
2007-06-18 14:01         ` Theodore Tso
2007-06-18 16:16           ` alan
2007-06-18 17:29             ` Theodore Tso
2007-06-18 17:33               ` Jeremy Allison
2007-06-18 20:30                 ` Theodore Tso
2007-06-18 20:50                   ` J. Bruce Fields
2007-06-18 17:46               ` H. Peter Anvin
2007-07-04 17:32               ` Erik Mouw
2007-07-04 20:47                 ` Theodore Tso
2007-07-05 17:55                   ` Erik Mouw
2007-07-05 13:57                 ` John Stoffel
2007-07-05 14:23                   ` Chris Mason
2007-07-05 17:57                   ` Erik Mouw
2007-06-18 15:32         ` Chris Mason
2007-06-18 23:18           ` Bron Gondwana
2007-09-29 17:44         ` Sorin Faibish
2007-06-15 22:52 ` Chris Snook
2007-06-16  8:25   ` Jack Stone
2007-06-19 18:03     ` Chris Snook
2007-06-19 19:06       ` Jack Stone
2007-06-19 20:03         ` Chris Snook
2007-06-19 20:08           ` Jack Stone
2007-06-19 20:15             ` Chris Snook
2007-06-19 20:27               ` Jack Stone
2007-06-19 20:34             ` John Stoffel
2007-06-19 20:38               ` Jack Stone
2007-06-19 20:38               ` Matthew Wilcox
2007-06-19 21:02                 ` John Stoffel
2007-06-19 19:08       ` H. Peter Anvin
2007-06-19 19:12         ` Jack Stone
2007-06-19 19:15           ` H. Peter Anvin
2007-06-19 19:22             ` Jack Stone
2007-06-19 20:10           ` Chris Snook
2007-06-19 20:14             ` Jack Stone
2007-06-19 20:31               ` Chris Snook
2007-06-20  8:34           ` Bernd Petrovitsch
2007-06-19 21:50         ` Alan Cox
2007-06-19 22:07           ` H. Peter Anvin
2007-06-20  8:05             ` Ph. Marek
2007-06-19 20:43       ` Lennart Sorensen
2007-06-19 22:07         ` david
2007-06-19 22:13           ` H. Peter Anvin
2007-06-19 23:07             ` Jan Harkes
2007-06-19 23:12               ` H. Peter Anvin
2007-06-19 22:21           ` Lennart Sorensen
2007-06-19 23:35         ` Bryan Henderson
2007-06-20  0:27           ` Trond Myklebust
2007-06-20  5:00             ` H. Peter Anvin
2007-06-20 17:04             ` Bryan Henderson
2007-06-20 17:10               ` H. Peter Anvin
2007-06-20 17:33               ` Chris Snook
2007-06-15 22:57 ` Kok, Auke
2007-06-15 23:01   ` alan
2007-06-16 11:20     ` Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=46778A7A.1080403@hawkeye.stone.uk.eu.org \
    --to=jack@hawkeye.stone.uk.eu.org \
    --cc=akpm@linux-foundation.org \
    --cc=alan@clueserver.org \
    --cc=git@vger.kernel.org \
    --cc=hbryan@us.ibm.com \
    --cc=hpa@zytor.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mrmacman_g4@mac.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).