linux-btrfs.vger.kernel.org archive mirror
* BTRFS backup questions
From: James Pharaoh @ 2014-09-27 15:39 UTC
  To: linux-btrfs

Hi,

I'm trying to build a backup solution for a highly virtualized server 
environment, based on BTRFS. I have a lot of questions which I can't 
find the answers to, and have included some of the most important ones here.

1. Simultaneous snapshots

I would really like to snapshot multiple subvolumes at the same time, so 
I can get a consistent view of my system. It seems like BTRFS should be 
able to provide this, given its data model, but I can't see any way to 
do so. Can anyone suggest how I can do this, or confirm that it is not 
possible and perhaps enlighten me as to why?

2. Duplicating NOCOW files

This is obviously possible, since it takes place when you make a 
snapshot. So why can't I create a clone of a snapshot of a NOCOW file? I 
am hoping the answer to this is that it is possible but not implemented 
yet...

I also have a question about the implementation of this. It would make 
sense, to me, to fragment the snapshot instead of the file itself. This 
is especially true in my case, where I am taking a snapshot which I am 
going to discard later.

Can someone confirm what happens in this case? Basically I want to know 
if access to the original file will continue to be performant after lots 
of snapshots have been taken.

3. Performance penalty of fragmentation on SSD systems with lots of memory

I see a lot of discussion of the performance issues running databases, 
and similar, on top of BTRFS without NOCOW. I suspect that this is not a 
huge issue if using SSD, and with a lot of memory, since things will 
generally be in memory anyway.

Can anyone confirm if this is true? Obviously it makes sense to use a 
database's native replication if possible but I am trying to come up 
with a general purpose hosting platform and so I am very interested in 
the performance when this kind of optimization hasn't taken place.

4. Generations and tree structures

I am planning to use lots more clever tricks which I think should be 
available in BTRFS, but I can't see much documentation. Can anyone point 
out any good examples or documentation of how to access the tree 
structures directly? I'm particularly interested in finding changed 
files and portions of files using the generations and the tree search.

Even better, would anyone be able to help me with this?

5. Project

I've looked around for existing projects, but can't find anything apart 
from some basic scripts. Please let me know if there are any good 
projects I should be aware of.

In the meantime, I've created my own project in Haskell and shared it 
on GitHub.

https://github.com/wellbehavedsoftware/wbs-backup

Some of the goals here are:

- Take advantage of deduplication, both in the running system and in the 
backups

- Work seamlessly and efficiently with a large number of snapshots.

- Efficiently take backups at a high frequency and send them to a remote 
system

- Backups should serve for disaster recovery, for undoing mistakes, and 
for tracking changes

- Provide a means to verify the backup via a completely independent code 
path, and to do so efficiently.

I am developing this for a direct business need, but I think this kind 
of functionality should be open source, and that it will be more useful 
to me with community support. If anyone is interested in participating, 
or even just using it, please let me know.

Thanks to everyone who has worked on BTRFS so far ;-)

James


* Re: BTRFS backup questions
From: Hugo Mills @ 2014-09-27 16:17 UTC
  To: James Pharaoh; +Cc: linux-btrfs


On Sat, Sep 27, 2014 at 05:39:07PM +0200, James Pharaoh wrote:
> Hi,
> 
> I'm trying to build a backup solution for a highly virtualized server
> environment, based on BTRFS. I have a lot of questions which I can't find
> the answers to, and have included some of the most important ones here.
> 
> 1. Simultaneous snapshots
> 
> I would really like to snapshot multiple subvolumes at the same time, so I
> can get a consistent view of my system. It seems like BTRFS should be able
> to provide this, given its data model, but I can't see any way to do so. Can
> anyone suggest how I can do this, or confirm that it is not possible and
> perhaps enlighten me as to why?

   It's not currently possible. I'm not sure if there are any plans to
allow it.
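
   The closest approximation is to quiesce your writers (e.g. flush
and lock the databases) and then take the snapshots back-to-back, so
the window of inconsistency stays small. Untested sketch, with made-up
paths:

for sub in vm1 vm2 vm3; do
    btrfs subvolume snapshot -r /srv/$sub /srv/snapshots/$sub.$(date +%s)
done

   Each individual snapshot is atomic, but there's no atomicity
across the set.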

> 2. Duplicating NOCOW files
> 
> This is obviously possible, since it takes place when you make a snapshot.
> So why can't I create a clone of a snapshot of a NOCOW file? I am hoping the
> answer to this is that it is possible but not implemented yet...

   Umm... you should be able to, I think.

> I also have a question about the implementation of this. It would make
> sense, to me, to fragment the snapshot instead of the file itself. This is
> especially true in my case, where I am taking a snapshot which I am going to
> discard later.

   Fragmenting the snapshot would require true copy-on-write, which
doubles the amount of writes made to the media. Btrfs's CoW
implementation is actually redirect-on-write, which puts the
newly-written data somewhere else. This implies that the copy being
written to gets the fragmentation.
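
   You can watch this happen with filefrag (rough demo, untested
here; exact extent counts will vary):

btrfs sub create demo
dd if=/dev/zero of=demo/data bs=1M count=16
sync
btrfs sub snap -r demo demo-snap
# overwrite 4k in the middle of the live copy
dd if=/dev/urandom of=demo/data bs=4k count=1 seek=2048 conv=notrunc
sync
filefrag demo/data        # the live copy gains extents
filefrag demo-snap/data   # the snapshot keeps the original layout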

> Can someone confirm what happens in this case? Basically I want to know if
> access to the original file will continue to be performant after lots of
> snapshots have been taken.
> 
> 3. Performance penalty of fragmentation on SSD systems with lots of memory
> 
> I see a lot of discussion of the performance issues running databases, and
> similar, on top of BTRFS without NOCOW. I suspect that this is not a huge
> issue if using SSD, and with a lot of memory, since things will generally be
> in memory anyway.
> 
> Can anyone confirm if this is true? Obviously it makes sense to use a
> database's native replication if possible but I am trying to come up with a
> general purpose hosting platform and so I am very interested in the
> performance when this kind of optimization hasn't taken place.

   There are two performance problems with fragmentation -- seek time
to find the fragments (which affects only rotational media), and the
amount of time taken to manage the fragments. As the number of
fragments increases, so does the number of extents that the FS has to
keep track of. Ultimately, with very fragmented files, this will have
an effect, as the metadata size will increase hugely.
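
   You can watch that metadata growth directly (untested sketch):

btrfs filesystem df /mnt    # the Metadata line grows as fragmentation does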

> 4. Generations and tree structures
> 
> I am planning to use lots more clever tricks which I think should be
> available in BTRFS, but I can't see much documentation. Can anyone point out
> any good examples or documentation of how to access the tree structures
> directly? I'm particularly interested in finding changed files and portions
> of files using the generations and the tree search.

   You need the TREE SEARCH ioctl -- that gives you direct access to
all the internal trees of the FS. There's some documentation on the
wiki about how these fit together:

https://btrfs.wiki.kernel.org/index.php/Data_Structures
https://btrfs.wiki.kernel.org/index.php/Trees
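
   If all you want is "which files changed since generation N", note
there's also a ready-made wrapper around the tree search, btrfs
subvolume find-new. From memory (check your btrfs-progs version):

# a huge generation argument just prints "transid marker was <N>"
btrfs subvolume find-new /mnt/subvol 9999999
# ...later, list the file extents changed since generation <N>
btrfs subvolume find-new /mnt/subvol <N>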

   What "tricks" are you thinking of, exactly?

> Even better, would anyone be able to help me with this?
> 
> 5. Project
> 
> I've looked around for existing projects, but can't find anything apart from
> some basic scripts. Please let me know if there are any good projects I
> should be aware of.

   There's a few of them out there. Mine, in a pretty rough state, but
functional on a single machine at the moment, is:

http://git.darksatanic.net/cgi/gitweb.cgi?p=carfax-backups.git;a=summary

> In the meantime, I've created my own project in Haskell and shared it on
> GitHub.
> 
> https://github.com/wellbehavedsoftware/wbs-backup
> 
> Some of the goals here are:
> 
> - Take advantage of deduplication, both in the running system and in the
> backups
> 
> - Work seamlessly and efficiently with a large number of snapshots.
> 
> - Efficiently take backups at a high frequency and send them to a remote
> system
> 
> - Backups should serve for disaster recovery, for undoing mistakes, and for
> tracking changes

   Are you aware of btrfs send/receive? It should allow you to do all
of this. The main part of the code then comes down to managing the
send/receive, and all the distributed error handling. Then the only
direct access to the internal metadata you need is being able to read
UUIDs to work out what you have on each side -- which can also be done
by "btrfs sub list".
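
   For example (untested; the exact field layout depends on your
btrfs-progs version):

btrfs subvolume list -u -q /mnt
# ID 257 gen 4711 top level 5 parent_uuid <uuid> uuid <uuid> path snaps/vm1.1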

   Hugo.

> - Provide a means to verify the backup via a completely independent code path,
> and to do so efficiently.
> 
> I am developing this for a direct business need, but I think this kind of
> functionality should be open source, and that it will be more useful to me
> with community support. If anyone is interested in participating, or even
> just using it, please let me know.
> 
> Thanks to everyone who has worked on BTRFS so far ;-)
> 
> James

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- How do you become King?  You stand in the marketplace and ---    
          announce you're going to tax everyone. If you get out          
                           alive, you're King.                           



* Re: BTRFS backup questions
From: James Pharaoh @ 2014-09-27 16:33 UTC
  To: Hugo Mills, linux-btrfs

On 27/09/14 18:17, Hugo Mills wrote:
> On Sat, Sep 27, 2014 at 05:39:07PM +0200, James Pharaoh wrote:

>> 2. Duplicating NOCOW files
>>
>> This is obviously possible, since it takes place when you make a snapshot.
>> So why can't I create a clone of a snapshot of a NOCOW file? I am hoping the
>> answer to this is that it is possible but not implemented yet...
>
>     Umm... you should be able to, I think.

Well, I've tried with the Haskell btrfs library, using clone, and also 
using cp --reflink=auto. Here's an example using cp:

root@host:/btrfs# btrfs subvolume snapshot -r src dest
Create a readonly snapshot of 'src' in './dest'
root@host:/btrfs# cp --reflink dest/test test
cp: failed to clone 'test' from 'dest/test': Invalid argument

>> I also have a question about the implementation of this. It would make
>> sense, to me, to fragment the snapshot instead of the file itself. This is
>> especially true in my case, where I am taking a snapshot which I am going to
>> discard later.
>
>     Fragmenting the snapshot would require true copy-on-write, which
> doubles the amount of writes made to the media. Btrfs's CoW
> implementation is actually redirect-on-write, which puts the
> newly-written data somewhere else. This implies that the copy being
> written to gets the fragmentation.

Yeah, OK. I think I'll just have to live with this one for the time 
being. Thanks ;)

>> 3. Performance penalty of fragmentation on SSD systems with lots of memory
>>
>     There are two performance problems with fragmentation -- seek time
> to find the fragments (which affects only rotational media), and the
> amount of time taken to manage the fragments. As the number of
> fragments increases, so does the number of extents that the FS has to
> keep track of. Ultimately, with very fragmented files, this will have
> an effect, as the metadata size will increase hugely.

OK, so this sounds like the answer I wanted to hear ;-) Presumably, so 
long as the load is not too great and I run the occasional defrag, this 
shouldn't be much to worry about?

>> 4. Generations and tree structures
>>
>> I am planning to use lots more clever tricks which I think should be
>> available in BTRFS, but I can't see much documentation. Can anyone point out
>> any good examples or documentation of how to access the tree structures
>> directly? I'm particularly interested in finding changed files and portions
>> of files using the generations and the tree search.
>
>     You need the TREE SEARCH ioctl -- that gives you direct access to
> all the internal trees of the FS. There's some documentation on the
> wiki about how these fit together:
>
> https://btrfs.wiki.kernel.org/index.php/Data_Structures
> https://btrfs.wiki.kernel.org/index.php/Trees
>
>     What "tricks" are you thinking of, exactly?

Principally I want to be able to detect exactly what has changed, so 
that I can perform backups very quickly. I want to be able to update a 
small portion of a large file and then identify exactly which parts 
changed and only back those up, for example.

>> 5. Project
>>
>> I've looked around for existing projects, but can't find anything apart from
>> some basic scripts. Please let me know if there are any good projects I
>> should be aware of.
>
>     There's a few of them out there. Mine, in a pretty rough state, but
> functional on a single machine at the moment, is:
>
> http://git.darksatanic.net/cgi/gitweb.cgi?p=carfax-backups.git;a=summary

Thanks, I'll take a look at that one.

>     Are you aware of btrfs send/receive? It should allow you to do all
> of this. The main part of the code then comes down to managing the
> send/receive, and all the distributed error handling. Then the only
> direct access to the internal metadata you need is being able to read
> UUIDs to work out what you have on each side -- which can also be done
> by "btrfs sub list".

Yes, this is one of my main inspirations. The problem is that I am 
pretty sure it won't handle deduplication of the data.

I'm planning to have a LOT of containers running the same stuff, on fast 
(expensive) SSD media, and deduplication is essential to make that work 
properly. I can already see huge savings from this.

As far as I can tell, btrfs send/receive operates on a subvolume basis, 
and any shared data between those subvolumes is duplicated if you copy 
them separately.

I'll be very happy if this is already possible, or if there is some 
simple way around this!

My current solution, which I have already implemented in the project I 
shared, is to first snapshot all the subvolumes into an identical tree, 
then to reflink copy (or normal(ish) copy for nocow) all of the files 
over to another subvolume, which I am planning to then send/receive as a 
single entity.

I believe this will allow the deduplication to be transferred over to 
the receiving machine, and that this won't take place if I transfer the 
subvolumes separately.
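
In outline it looks like this (paths are made up, and the staging tree 
has to end up in a read-only snapshot before it can be sent):

btrfs sub create /btrfs/staging
for sub in vm1 vm2; do
    mkdir /btrfs/staging/$sub
    cp -a --reflink=auto /btrfs/snapshots/$sub/. /btrfs/staging/$sub/
done
btrfs sub snap -r /btrfs/staging /btrfs/staging.ro
btrfs send /btrfs/staging.ro | ...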

Thanks,
James


* Re: BTRFS backup questions
From: Hugo Mills @ 2014-09-27 16:59 UTC
  To: James Pharaoh; +Cc: linux-btrfs


On Sat, Sep 27, 2014 at 06:33:58PM +0200, James Pharaoh wrote:
> On 27/09/14 18:17, Hugo Mills wrote:
> >On Sat, Sep 27, 2014 at 05:39:07PM +0200, James Pharaoh wrote:
> 
> >>2. Duplicating NOCOW files
> >>
> >>This is obviously possible, since it takes place when you make a snapshot.
> >>So why can't I create a clone of a snapshot of a NOCOW file? I am hoping the
> >>answer to this is that it is possible but not implemented yet...
> >
> >    Umm... you should be able to, I think.
> 
> Well, I've tried with the Haskell btrfs library, using clone, and also using
> cp --reflink=auto. Here's an example using cp:
> 
> root@host:/btrfs# btrfs subvolume snapshot -r src dest
> Create a readonly snapshot of 'src' in './dest'
> root@host:/btrfs# cp --reflink dest/test test
> cp: failed to clone 'test' from 'dest/test': Invalid argument

   Are you trying to cross a mount-point with that? It works for me:

hrm@amelia:/media/btrfs/amelia/test $ sudo btrfs sub create bar
Create subvolume './bar'
hrm@amelia:/media/btrfs/amelia/test $ sudo dd if=/dev/zero of=bar/data bs=1024 count=500
500+0 records in
500+0 records out
512000 bytes (512 kB) copied, 0.0047491 s, 108 MB/s
hrm@amelia:/media/btrfs/amelia/test $ sudo btrfs sub snap -r bar foo
Create a readonly snapshot of 'bar' in './foo'
hrm@amelia:/media/btrfs/amelia/test $ sudo cp --reflink=always bar/data bar-data
hrm@amelia:/media/btrfs/amelia/test $ sudo cp --reflink=always foo/data foo-data
hrm@amelia:/media/btrfs/amelia/test $ ls -l
total 1000
drwxr-xr-x 1 root root      8 Sep 27 17:55 bar
-rw-r--r-- 1 root root 512000 Sep 27 17:57 bar-data
drwxr-xr-x 1 root root      8 Sep 27 17:55 foo
-rw-r--r-- 1 root root 512000 Sep 27 17:57 foo-data

[snip]
> >>3. Performance penalty of fragmentation on SSD systems with lots of memory
> >>
> >    There are two performance problems with fragmentation -- seek time
> >to find the fragments (which affects only rotational media), and the
> >amount of time taken to manage the fragments. As the number of
> >fragments increases, so does the number of extents that the FS has to
> >keep track of. Ultimately, with very fragmented files, this will have
> >an effect, as the metadata size will increase hugely.
> 
> OK, so this sounds like the answer I wanted to hear ;-) Presumably, so long
> as the load is not too great and I run the occasional defrag, this
> shouldn't be much to worry about?

   Be aware that the current implementation of (manual) defrag will
separate the shared extents, so you no longer get the deduplication
effect. There was a snapshot-aware defrag implementation, but it
caused filesystem corruption, and has been removed for now until a
working version can be written. I think Josef was working on this.
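
   If you do defragment, point it at just the files that aren't
sharing extents anyway (your nocow database files, say), rather than
running it recursively over snapshotted trees. Made-up path:

btrfs filesystem defragment /srv/db/ibdata1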

> >>4. Generations and tree structures
> >>
> >>I am planning to use lots more clever tricks which I think should be
> >>available in BTRFS, but I can't see much documentation. Can anyone point out
> >>any good examples or documentation of how to access the tree structures
> >>directly? I'm particularly interested in finding changed files and portions
> >>of files using the generations and the tree search.
> >
> >    You need the TREE SEARCH ioctl -- that gives you direct access to
> >all the internal trees of the FS. There's some documentation on the
> >wiki about how these fit together:
> >
> >https://btrfs.wiki.kernel.org/index.php/Data_Structures
> >https://btrfs.wiki.kernel.org/index.php/Trees
> >
> >    What "tricks" are you thinking of, exactly?
> 
> Principally I want to be able to detect exactly what has changed, so that I
> can perform backups very quickly. I want to be able to update a small
> portion of a large file and then identify exactly which parts changed and
> only back those up, for example.

   send/receive does this.
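
   A quick way to convince yourself (untested sketch): take two
snapshots either side of a small change and measure the incremental
stream:

btrfs send -p backups/subvolA.1 backups/subvolA.2 | wc -c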

[snip]
> >    Are you aware of btrfs send/receive? It should allow you to do all
> >of this. The main part of the code then comes down to managing the
> >send/receive, and all the distributed error handling. Then the only
> >direct access to the internal metadata you need is being able to read
> >UUIDs to work out what you have on each side -- which can also be done
> >by "btrfs sub list".
> 
> Yes, this is one of my main inspirations. The problem is that I am pretty
> sure it won't handle deduplication of the data.

   It does. That's one of the things it's explicitly designed to do.

> I'm planning to have a LOT of containers running the same stuff, on fast
> (expensive) SSD media, and deduplication is essential to make that work
> properly. I can already see huge savings from this.
> 
> As far as I can tell, btrfs send/receive operates on a subvolume basis, and
> any shared data between those subvolumes is duplicated if you copy them
> separately.

   Not so.

   You can tell send that there are subvolumes with known IDs on the
receive side, using the -c option (arbitrarily many subvols). If the
subvol you are sending (on the send side) shares extents with any of
those, then the data is not sent -- just a reference to it. On the
receive side, if that happens, the shared extents are reconstructed.
It will also do this with the -p option.

> I'll be very happy if this is already possible, or if there is some simple
> way around this!
> 
> My current solution, which I have already implemented in the project I
> shared, is to first snapshot all the subvolumes into an identical tree, then
> to reflink copy (or normal(ish) copy for nocow) all of the files over to
> another subvolume, which I am planning to then send/receive as a single
> entity.
> 
> I believe this will allow the deduplication to be transferred over to the
> receiving machine, and that this won't take place if I transfer the
> subvolumes separately.

   You send each one in turn, and add the -c option for the ones
you've already sent:

for n in A B C D etc; do
   btrfs sub snap -r live/subvol$n backups/subvol$n.1
done
btrfs send backups/subvolA.1 | ...
btrfs send -c backups/subvolA.1 backups/subvolB.1 | ...
btrfs send -c backups/subvolA.1 -c backups/subvolB.1 backups/subvolC.1 | ...
btrfs send -c backups/subvolA.1 -c backups/subvolB.1 -c backups/subvolC.1 backups/subvolD.1 | ...

   You can then use the same process to do incrementals against each
subvol, by keeping the last snapshot you sent and doing an incremental
against it:

for n in A B C D etc; do
   btrfs sub snap -r live/subvol$n backups/subvol$n.2
done
btrfs send -p backups/subvolA.1 backups/subvolA.2 | ...
btrfs send -c backups/subvolA.2 -p backups/subvolB.1 backups/subvolB.2 | ...
btrfs send -c backups/subvolA.2 -c backups/subvolB.2 -p backups/subvolC.1 backups/subvolC.2 | ...
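
   The "| ..." in each case is a receive, run locally or over ssh,
e.g. (hostname made up):

btrfs send -p backups/subvolA.1 backups/subvolA.2 | ssh backuphost btrfs receive /backups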

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- I am an opera lover from planet Zog. Take me to your lieder ---   



* Re: BTRFS backup questions
From: James Pharaoh @ 2014-09-29 11:02 UTC
  To: Hugo Mills, linux-btrfs

On 27/09/14 18:59, Hugo Mills wrote:
>>>> 2. Duplicating NOCOW files
>     Are you trying to cross a mount-point with that? It works for me:

Here's a script which replicates what I'm doing:

https://gist.github.com/jamespharaoh/d693067ffd203689ebea

And here's the output when I run it:

https://gist.github.com/jamespharaoh/75cb937fd73b05c9128d

>     Be aware that the current implementation of (manual) defrag will
> separate the shared extents, so you no longer get the deduplication
> effect. There was a snapshot-aware defrag implementation, but it
> caused filesystem corruption, and has been removed for now until a
> working version can be written. I think Josef was working on this.

Yeah, good to know, but it won't be a major problem. I'll probably leave 
cow on in almost all cases, even for database files; I'll defragment 
those files and deduplicate all the rest. In the rare case of very large 
sites, I'll use nocow for those files and provision replication or 
whatever.
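
For the deduplication side, one of the offline tools should do, e.g. 
(hypothetical invocation; duperemove uses the extent-same ioctl):

duperemove -dr /srv/containers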

I'll do some performance testing at some point and post some code and 
the results here ;-)

>> Yes, this is one of my main inspirations. The problem is that I am pretty
>> sure it won't handle deduplication of the data.
>     It does. That's one of the things it's explicitly designed to do.

OK, so I think I understand this now. I believe that the only type of 
object with a universal ID is a subvolume, so the receive function can't 
identify items which already exist by themselves, or at least it would 
be expensive to do so.

Providing a "parent" subvolume allows it to do that. So as long as the 
parent subvolume shares extents with the subvolume being sent, that 
sharing will be preserved after the receive takes place on the target.

I think the issue for me is the word "parent". These are really 
"reference" filesystems.

The subvolumes you've told me to list as the parents are not really 
parents of the one I'm sending at all, except of course for the 
previous version of the same subvolume.

Is that all correct?

James

