linux-btrfs.vger.kernel.org archive mirror
* What to do about snapshot-aware defrag
@ 2014-05-30 20:43 Josef Bacik
  2014-05-30 22:00 ` Martin
  0 siblings, 1 reply; 11+ messages in thread
From: Josef Bacik @ 2014-05-30 20:43 UTC (permalink / raw)
  To: linux-btrfs

Hello,

TL;DR: I want to only do snapshot-aware defrag on inodes in snapshots 
that haven't changed since the snapshot was taken.  Yay or nay (with a 
reason why for nay)

=== How snapshot aware defrag currently works ===

First the defrag stuff will go through and read in and mark dirty a big 
area where it finds a bunch of tiny extents.  We go to write this stuff 
out and when the write completes we notice this operation was for 
defrag.  If a snapshot was taken after the inode was created, we go 
through the entire range of existing extents for this inode and record 
all of the extents in this range.

Then we look up all of the references for each of the extents we found.  
So say we have 100 snapshots and we defragged 100 extents, we'll end 
up with 100 * 100 of these data structures to know what we have to 
stitch back together.
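
As a rough illustration of the arithmetic above, here is a minimal Python sketch. The function names and the 64-byte record size are assumptions made up for this example; they are not taken from the btrfs source and the real per-record cost is not stated in this thread.

```python
def backref_records(extents_defragged: int, snapshots: int) -> int:
    """Records held in memory if every old extent is tracked once per snapshot."""
    return extents_defragged * snapshots


def estimated_bytes(records: int, bytes_per_record: int) -> int:
    """Total memory for the records, given an assumed size per record."""
    return records * bytes_per_record


if __name__ == "__main__":
    records = backref_records(extents_defragged=100, snapshots=100)
    print(records)  # 10000
    # 64 bytes is a placeholder, not the size of any real kernel structure.
    print(estimated_bytes(records, bytes_per_record=64))  # 640000
```

The point is only that the count grows with the product of the two numbers, so doubling either the snapshots or the defragged extents doubles the records kept at once.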

Then we go through each of these things and do btrfs_drop_extents for 
the range and either add the new extent for this range, or merge it with 
the previous one if we've already done that.  So it looks like this

[----- New extent -----]
[old1][old2][old3][old4]

We will drop old1 and create new1 for the range of old1, so we have this

[new1][old2][old3][old4]

and then the next one we will drop old2 and merge it with new1 so we 
have this

[-- new1 --][old3][old4]

and so on and so forth.  We do this because some random extent within 
this range could have changed and we don't want to overwrite that bit.
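
The drop-and-merge walk described above can be modelled in a few lines of Python. This is only a toy sketch: `Extent` and `stitch` are hypothetical names, extents are reduced to (start, length, name), and nothing here calls or mirrors the real btrfs_drop_extents.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    start: int
    length: int
    name: str

    @property
    def end(self) -> int:
        return self.start + self.length


def stitch(snapshot_extents, recorded_old, new_name):
    """Replace each still-unchanged old extent with a piece of the new one.

    snapshot_extents must be sorted by start. Extents that no longer match
    what was recorded at defrag time are left alone.
    """
    recorded = {(e.start, e.length, e.name) for e in recorded_old}
    result = []
    for ext in snapshot_extents:
        if (ext.start, ext.length, ext.name) not in recorded:
            # This bit changed after the snapshot: do not overwrite it.
            result.append(Extent(ext.start, ext.length, ext.name))
            continue
        prev = result[-1] if result else None
        if prev is not None and prev.name == new_name and prev.end == ext.start:
            prev.length += ext.length  # merge with the previous new piece
        else:
            result.append(Extent(ext.start, ext.length, new_name))
    return result


if __name__ == "__main__":
    old = [Extent(0, 4, "old1"), Extent(4, 4, "old2"),
           Extent(8, 4, "old3"), Extent(12, 4, "old4")]
    print(stitch(old, old, "new1"))
```

With all four old extents unchanged the result is a single new1 extent covering the whole range; if one of them was rewritten in the snapshot, it stays in place and new1 ends up split around it.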

=== Problems that need to be fixed ===

1) The memory usage for this is astronomical.  Every extent for every 
snapshot we are replacing is recorded in memory while we do this, which 
is why this feature was disabled as people were constantly OOM'ing.

2) Currently this stitching operation is done in 
btrfs_finish_ordered_io, which means if anybody is waiting for the 
ordered extent or any ordered extent to be completed they are going to 
block longer waiting for the operation to complete, which could be very 
time consuming.

=== Solutions ===

1) Move the snapshot aware stitching part off into a different thread. 
We can make this block unmount until it completes if we want to make 
sure we always finish our job, or we can make it exit and then we lose 
some of the snapshot awareness.  Either way is fine with me, I figure 
we'd go with blocking first and change it if somebody complains too loudly.

2) Fix how we do the stitching.  This is where I need input.  What I 
want to do is just look up the first extent, which will give me all of 
the roots that share the extent range that we are defragging.  Then I 
want to just look up those inodes, do btrfs_drop_extents and add in the 
new extent, the same way btrfs_finish_ordered_io works.  This will make 
the stitching operation much simpler and less error prone, and much much 
faster.  The drawback is that we need extra checks to make sure the 
inode on the snapshots hasn't changed since we took the snapshot.  So if 
any of that file has been modified, even if none of the data has, we 
won't do anything because we won't be able to verify that it is the same.
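
A minimal sketch of the "has this inode changed since the snapshot" check, assuming each snapshot records the transaction id at which it was taken and each inode records the transaction id of its last change. The field names `created_at` and `last_changed` are hypothetical and chosen for this example only; the thread does not say which on-disk fields the check would use.

```python
from dataclasses import dataclass


@dataclass
class SnapshotInode:
    snapshot: str
    created_at: int    # hypothetical: transaction id when the snapshot was taken
    last_changed: int  # hypothetical: transaction id of the inode's last change


def relinkable(inodes):
    """Names of snapshots whose copy of the inode is untouched since creation."""
    return [i.snapshot for i in inodes if i.last_changed <= i.created_at]


if __name__ == "__main__":
    inodes = [
        SnapshotInode("snap-a", created_at=100, last_changed=90),
        # Any later change, data or metadata only, disqualifies the inode.
        SnapshotInode("snap-b", created_at=120, last_changed=125),
    ]
    print(relinkable(inodes))  # ['snap-a']
```

In this model a metadata-only update bumps `last_changed` just like a data write does, which is the drawback described above: the check cannot tell the two apart, so it skips the inode.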

This isn't the only way we can do it.  I can fix the stitching part to 
just do one extent at a time across all roots.  This way we're only 
allocating N number of snapshots worth of entries at a time so it keeps 
our allocation low.  The other option is to just do one root at a time 
and process each extent.  Either of these will be fine and reduce our 
memory usage but will be pretty disk io intensive.
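
The "one extent at a time across all roots" variant could look roughly like the sketch below. `lookup_refs` and `relink` are hypothetical callbacks standing in for the backref lookup and the relinking step; the function returns the largest number of entries held at once so the bound can be checked.

```python
def stitch_one_extent_at_a_time(extents, lookup_refs, relink):
    """Process extents sequentially, holding only one extent's refs at a time.

    lookup_refs(extent) returns the references to that extent across all
    roots; relink(ref) performs the replacement for one reference.
    Returns the peak number of references held in memory at once.
    """
    peak = 0
    for extent in extents:
        refs = lookup_refs(extent)  # at most one entry per snapshot
        peak = max(peak, len(refs))
        for ref in refs:
            relink(ref)
        # refs is dropped here, before the next extent is looked up
    return peak


if __name__ == "__main__":
    snapshots = range(100)
    done = []
    peak = stitch_one_extent_at_a_time(
        extents=range(100),
        lookup_refs=lambda e: [(e, s) for s in snapshots],
        relink=done.append,
    )
    print(peak, len(done))  # 100 10000
```

The same 10000 relinks still happen, but only 100 entries exist at any moment. The cost moves from memory to repeated lookups, which is the extra disk io mentioned above.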

=== Summary and what I need ===

Option 1: Only relink inodes that haven't changed since the snapshot was 
taken.

Pros:
-Faster
-Simpler
-Less duplicated code, uses existing functions for tricky operations so 
less likely to introduce weird bugs.

Cons:
-Could possibly lose some of the snapshot-awareness of the defrag.  If 
you just touch a file we would not do the relinking and you'd end up 
with twice the space usage.

Option 2: Process each root one extent at a time in whatever way results 
in less memory usage.

Pros:
-Maximizes the space reduction of the snapshot-aware defrag.  Every 
extent that is the same as the original will be replaced with the new 
extent and all will be well with the world.

Cons:
-Way slower.  We'll have to walk and check every extent to make sure we 
can actually replace it.  This is how it used to work so we'd be 
consistent with the 2 or 3 releases where we had snapshot-aware defrag 
enabled.

-More complicated.  We have to do a lot of extra checking and such, new 
code, possibility for bugs to show up.

So tell me which one you want and I'll do that one.  If you want Option 
2 please explain your use case so I can keep it in mind when deciding on 
how to go about making the memory usage suck less.  Thanks,

Josef


* Re: What to do about snapshot-aware defrag
  2014-05-30 20:43 What to do about snapshot-aware defrag Josef Bacik
@ 2014-05-30 22:00 ` Martin
  2014-05-31 23:51   ` Brendan Hide
  2014-06-02 13:22   ` Josef Bacik
  0 siblings, 2 replies; 11+ messages in thread
From: Martin @ 2014-05-30 22:00 UTC (permalink / raw)
  To: linux-btrfs

OK... I'll jump in...

On 30/05/14 21:43, Josef Bacik wrote:
> Hello,
> 
> TL;DR: I want to only do snapshot-aware defrag on inodes in snapshots
> that haven't changed since the snapshot was taken.  Yay or nay (with a
> reason why for nay)

[...]
> 
> === Summary and what I need ===
> 
> Option 1: Only relink inodes that haven't changed since the snapshot was
> taken.
> 
> Pros:
> -Faster
> -Simpler
> -Less duplicated code, uses existing functions for tricky operations so
> less likely to introduce weird bugs.
> 
> Cons:
> -Could possibly lose some of the snapshot-awareness of the defrag.  If
> you just touch a file we would not do the relinking and you'd end up
> with twice the space usage.
[...]


Obvious way to go for fast KISS.


One question:

Will option one mean that we always need to mount with noatime or
read-only to allow snapshot defragging to do anything?


Regards,
Martin



* Re: What to do about snapshot-aware defrag
  2014-05-30 22:00 ` Martin
@ 2014-05-31 23:51   ` Brendan Hide
  2014-06-01  1:52     ` Duncan
  2014-06-02  3:07     ` Mitch Harder
  2014-06-02 13:22   ` Josef Bacik
  1 sibling, 2 replies; 11+ messages in thread
From: Brendan Hide @ 2014-05-31 23:51 UTC (permalink / raw)
  To: Martin, linux-btrfs

On 2014/05/31 12:00 AM, Martin wrote:
> OK... I'll jump in...
>
> On 30/05/14 21:43, Josef Bacik wrote:
>> [snip]
>> Option 1: Only relink inodes that haven't changed since the snapshot was
>> taken.
>>
>> Pros:
>> -Faster
>> -Simpler
>> -Less duplicated code, uses existing functions for tricky operations so
>> less likely to introduce weird bugs.
>>
>> Cons:
>> -Could possibly lose some of the snapshot-awareness of the defrag.  If
>> you just touch a file we would not do the relinking and you'd end up
>> with twice the space usage.
> [...]
>
>
> Obvious way to go for fast KISS.

I second this - KISS is better.

Would in-band dedupe resolve the issue with losing the 
"snapshot-awareness of the defrag"? I figure that if someone absolutely 
wants everything deduped efficiently they'd put in the necessary 
resources (memory/dedicated SSD/etc) to have in-band dedupe work well.
> One question:
>
> Will option one mean that we always need to mount with noatime or
> read-only to allow snapshot defragging to do anything?

That is a very good question. I very rarely have mounts without noatime 
- and usually only because I hadn't thought of it.

> Regards,
> Martin

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



* Re: What to do about snapshot-aware defrag
  2014-05-31 23:51   ` Brendan Hide
@ 2014-06-01  1:52     ` Duncan
  2014-06-02  3:07     ` Mitch Harder
  1 sibling, 0 replies; 11+ messages in thread
From: Duncan @ 2014-06-01  1:52 UTC (permalink / raw)
  To: linux-btrfs

Brendan Hide posted on Sun, 01 Jun 2014 01:51:49 +0200 as excerpted:

>> Will option one mean that we always need to mount with noatime or
>> read-only to allow snapshot defragging to do anything?
> 
> That is a very good question. I very rarely have mounts without noatime
> - and usually only because I hadn't thought of it.

Heh, a couple months ago I got tired of having to add noatime to all my 
standard mounts, and decided to patch my kernel to noatime (instead of 
relatime) by default.  I can if I need to still use relatime or 
strictatime, but if they aren't listed, noatime gets added by default, 
now. =:^)

The only remaining problem is that it's not a full default, as noatime 
still shows up in /proc/self/mounts, but avoiding that would have 
complexified the patch, and not being an actual coder (and DEFINITELY not 
a kernel coder), I decided I best leave well enough alone, patching only 
what I had to to avoid having relatime if I omitted the parameter, and 
only what I was reasonably sure I could do without screwing things up, 
based on my limited sysadmin level reading of the sources.

It would sure be nice if there were a kernel config option for that 
default, which would naturally still default to relatime, which is I'm 
sure where most of the distros not already patching it to noatime would 
leave it, as well.

Meanwhile, agreed, a good question that does indeed remain! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: What to do about snapshot-aware defrag
  2014-05-31 23:51   ` Brendan Hide
  2014-06-01  1:52     ` Duncan
@ 2014-06-02  3:07     ` Mitch Harder
  2014-06-02 13:19       ` Josef Bacik
  1 sibling, 1 reply; 11+ messages in thread
From: Mitch Harder @ 2014-06-02  3:07 UTC (permalink / raw)
  To: Brendan Hide; +Cc: Martin, linux-btrfs

On Sat, May 31, 2014 at 6:51 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote:
> On 2014/05/31 12:00 AM, Martin wrote:
>>
>> OK... I'll jump in...
>>
>> On 30/05/14 21:43, Josef Bacik wrote:
>>>
>>> [snip]
>>>
>>> Option 1: Only relink inodes that haven't changed since the snapshot was
>>> taken.
>>>
>>> Pros:
>>> -Faster
>>> -Simpler
>>> -Less duplicated code, uses existing functions for tricky operations so
>>> less likely to introduce weird bugs.
>>>
>>> Cons:
>>> -Could possibly lose some of the snapshot-awareness of the defrag.  If
>>> you just touch a file we would not do the relinking and you'd end up
>>> with twice the space usage.
>>
>> [...]
>>
>>
>> Obvious way to go for fast KISS.
>
>
> I second this - KISS is better.
>
> Would in-band dedupe resolve the issue with losing the "snapshot-awareness
> of the defrag"? I figure that if someone absolutely wants everything deduped
> efficiently they'd put in the necessary resources (memory/dedicated SSD/etc)
> to have in-band dedupe work well.
>
>> One question:
>>
>> Will option one mean that we always need to mount with noatime or
>> read-only to allow snapshot defragging to do anything?
>
>

When snapshot-aware defrag first came out, I was convinced it was a
"must-have" capability for nearly everybody using btrfs.  But, the
more I look at my work load and common practices with btrfs, the more
I am wondering just how often snapshot-aware defrag was actually doing
something for me.

I use a lot of snapshots.  But for the most part, once I touch a file
in my current subvolume, the whole file needs to be COW-ed from its
previous version.

Now that we have a working sysfs, I wonder if we could implement some
counters to track how often snapshot-aware defrag would have run.  I
might be surprised at how much it was doing.

---
Regards,
Mitch Harder


* Re: What to do about snapshot-aware defrag
  2014-06-02  3:07     ` Mitch Harder
@ 2014-06-02 13:19       ` Josef Bacik
  0 siblings, 0 replies; 11+ messages in thread
From: Josef Bacik @ 2014-06-02 13:19 UTC (permalink / raw)
  To: Mitch Harder, Brendan Hide; +Cc: Martin, linux-btrfs

On 06/01/2014 11:07 PM, Mitch Harder wrote:
> On Sat, May 31, 2014 at 6:51 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote:
>> On 2014/05/31 12:00 AM, Martin wrote:
>>>
>>> OK... I'll jump in...
>>>
>>> On 30/05/14 21:43, Josef Bacik wrote:
>>>>
>>>> [snip]
>>>>
>>>> Option 1: Only relink inodes that haven't changed since the snapshot was
>>>> taken.
>>>>
>>>> Pros:
>>>> -Faster
>>>> -Simpler
>>>> -Less duplicated code, uses existing functions for tricky operations so
>>>> less likely to introduce weird bugs.
>>>>
>>>> Cons:
>>>> -Could possibly lose some of the snapshot-awareness of the defrag.  If
>>>> you just touch a file we would not do the relinking and you'd end up
>>>> with twice the space usage.
>>>
>>> [...]
>>>
>>>
>>> Obvious way to go for fast KISS.
>>
>>
>> I second this - KISS is better.
>>
>> Would in-band dedupe resolve the issue with losing the "snapshot-awareness
>> of the defrag"? I figure that if someone absolutely wants everything deduped
>> efficiently they'd put in the necessary resources (memory/dedicated SSD/etc)
>> to have in-band dedupe work well.
>>
>>> One question:
>>>
>>> Will option one mean that we always need to mount with noatime or
>>> read-only to allow snapshot defragging to do anything?
>>
>>
>
> When snapshot-aware defrag first came out, I was convinced it was a
> "must-have" capability for nearly everybody using btrfs.  But, the
> more I look at my work load and common practices with btrfs, the more
> I am wondering just how often snapshot-aware defrag was actually doing
> something for me.
>
> I use a lot of snapshots.  But for the most part, once I touch a file
> in my current subvolume, the whole file needs to be COW-ed from its
> previous version.
>

The whole file doesn't need to be cow'ed, just the part that you touch.  
So old snapshot-aware defrag probably would have saved you quite a bit 
of space.  Thanks,

Josef
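
To illustrate the point that only the touched part is COW'ed, here is a toy Python model. It treats a file as a list of block-to-extent references; `snapshot`, `cow_write` and `shared_blocks` are names invented for this sketch and do not correspond to btrfs functions.

```python
def snapshot(file_blocks):
    """A snapshot copies the references to the extents, not the data."""
    return list(file_blocks)


def cow_write(file_blocks, first_block, count, new_extent):
    """Return the file after a write: only the written blocks point elsewhere."""
    updated = list(file_blocks)
    for block in range(first_block, first_block + count):
        updated[block] = new_extent
    return updated


def shared_blocks(a, b):
    """Number of block positions where both versions reference the same extent."""
    return sum(1 for x, y in zip(a, b) if x == y)


if __name__ == "__main__":
    original = ["e1"] * 4 + ["e2"] * 4
    snap = snapshot(original)
    live = cow_write(original, first_block=2, count=2, new_extent="e3")
    print(shared_blocks(snap, live))  # 6
```

After a two-block write into an eight-block file, six blocks are still shared with the snapshot. Those shared ranges are what a snapshot-aware defrag would try to keep shared.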


* Re: What to do about snapshot-aware defrag
  2014-05-30 22:00 ` Martin
  2014-05-31 23:51   ` Brendan Hide
@ 2014-06-02 13:22   ` Josef Bacik
  2014-06-03 23:54     ` Martin
  1 sibling, 1 reply; 11+ messages in thread
From: Josef Bacik @ 2014-06-02 13:22 UTC (permalink / raw)
  To: Martin, linux-btrfs

On 05/30/2014 06:00 PM, Martin wrote:
> OK... I'll jump in...
>
> On 30/05/14 21:43, Josef Bacik wrote:
>> Hello,
>>
>> TL;DR: I want to only do snapshot-aware defrag on inodes in snapshots
>> that haven't changed since the snapshot was taken.  Yay or nay (with a
>> reason why for nay)
>
> [...]
>>
>> === Summary and what I need ===
>>
>> Option 1: Only relink inodes that haven't changed since the snapshot was
>> taken.
>>
>> Pros:
>> -Faster
>> -Simpler
>> -Less duplicated code, uses existing functions for tricky operations so
>> less likely to introduce weird bugs.
>>
>> Cons:
>> -Could possibly lose some of the snapshot-awareness of the defrag.  If
>> you just touch a file we would not do the relinking and you'd end up
>> with twice the space usage.
> [...]
>
>
> Obvious way to go for fast KISS.
>
>
> One question:
>
> Will option one mean that we always need to mount with noatime or
> read-only to allow snapshot defragging to do anything?
>

Yeah atime would screw this up, I hadn't thought of that.  With that 
being the case I think the only option is to keep the old behavior, we 
don't want to screw up stuff like this just because users used a backup 
program on their snapshot and didn't use noatime.  Thanks,

Josef



* Re: What to do about snapshot-aware defrag
  2014-06-02 13:22   ` Josef Bacik
@ 2014-06-03 23:54     ` Martin
  2014-06-04  9:19       ` Erkki Seppala
  0 siblings, 1 reply; 11+ messages in thread
From: Martin @ 2014-06-03 23:54 UTC (permalink / raw)
  To: linux-btrfs

On 02/06/14 14:22, Josef Bacik wrote:
> On 05/30/2014 06:00 PM, Martin wrote:
>> OK... I'll jump in...
>>
>> On 30/05/14 21:43, Josef Bacik wrote:
>>> Hello,
>>>
>>> TL;DR: I want to only do snapshot-aware defrag on inodes in snapshots
>>> that haven't changed since the snapshot was taken.  Yay or nay (with a
>>> reason why for nay)
>>
>> [...]
>>>
>>> === Summary and what I need ===
>>>
>>> Option 1: Only relink inodes that haven't changed since the snapshot was
>>> taken.
[...]
>> Obvious way to go for fast KISS.
>>
>>
>> One question:
>>
>> Will option one mean that we always need to mount with noatime or
>> read-only to allow snapshot defragging to do anything?
>>
> 
> Yeah atime would screw this up, I hadn't thought of that.  With that
> being the case I think the only option is to keep the old behavior, we
> don't want to screw up stuff like this just because users used a backup
> program on their snapshot and didn't use noatime.  Thanks,

Not so fast into non-KISS!


The *ONLY* application that I know of that uses atime is Mutt and then
*only* for mbox files!...

NOTHING else uses atime as far as I know.

We already have most distros enabling relatime by default as a
just-in-case...


Can we not have noatime as the default for btrfs? Also widely note that
default in the man page and wiki and with why?...

*And go KISS and move on faster* better?


Myself, I still use Mutt sometimes, but no mbox, and all my filesystems
have been noatime for many years now with good positive results. (Both
home and work servers.)

Regards,
Martin





* Re: What to do about snapshot-aware defrag
  2014-06-03 23:54     ` Martin
@ 2014-06-04  9:19       ` Erkki Seppala
  2014-06-04 13:15         ` Martin
  0 siblings, 1 reply; 11+ messages in thread
From: Erkki Seppala @ 2014-06-04  9:19 UTC (permalink / raw)
  To: linux-btrfs

Martin <m_btrfs@ml1.co.uk> writes:

> The *ONLY* application that I know of that uses atime is Mutt and then
> *only* for mbox files!...

However, users, such as myself :), can be interested in when a certain
file has been last accessed. With snapshots I can even get an idea of
all the times the file has been accessed.

> *And go KISS and move on faster* better?

Well, it is uncertain to me if it truly is better that btrfs would after
that point no longer truly even support atime, if using it results in
blowing up snapshot sizes. They might at that point even consider just
using LVM2 snapshots (shudder) ;).

-- 
  _____________________________________________________________________
     / __// /__ ____  __               http://www.modeemi.fi/~flux/\   \
    / /_ / // // /\ \/ /                                            \  /
   /_/  /_/ \___/ /_/\_\@modeemi.fi                                  \/



* Re: What to do about snapshot-aware defrag
  2014-06-04  9:19       ` Erkki Seppala
@ 2014-06-04 13:15         ` Martin
  2014-06-04 19:33           ` Chris Murphy
  0 siblings, 1 reply; 11+ messages in thread
From: Martin @ 2014-06-04 13:15 UTC (permalink / raw)
  To: linux-btrfs

On 04/06/14 10:19, Erkki Seppala wrote:
> Martin <m_btrfs@ml1.co.uk> writes:
> 
>> The *ONLY* application that I know of that uses atime is Mutt and then
>> *only* for mbox files!...
> 
> However, users, such as myself :), can be interested in when a certain
> file has been last accessed. With snapshots I can even get an idea of
> all the times the file has been accessed.
> 
>> *And go KISS and move on faster* better?
> 
> Well, it is uncertain to me if it truly is better that btrfs would after
> that point no longer truly even support atime, if using it results in
> blowing up snapshot sizes. They might at that point even consider just
> using LVM2 snapshots (shudder) ;).

Not quite... My emphasis is:


1:

Go KISS for the defrag and accept that any atime use will render the
defrag ineffective. Give a note that the noatime mount option should be
used.


2:

Consider using noatime as a /default/ being as there are no known
'must-use' use cases. Those users still wanting atime can add that as a
mount option with the note that atime use reduces the snapshot defrag
effectiveness.
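
For anyone wanting to check which btrfs mounts would be affected, a small Python sketch along these lines could scan the mount table. It assumes the usual /proc/self/mounts layout (device, mount point, filesystem type, options, ...) and, as a further assumption, treats a mount that lists neither noatime nor relatime as strictatime. The function names are made up for this example.

```python
def atime_mode(options: str) -> str:
    """Classify a comma-separated mount option string by its atime behaviour."""
    opts = options.split(",")
    if "noatime" in opts:
        return "noatime"
    if "relatime" in opts:
        return "relatime"
    # Assumption: no atime flag listed means strict atime updates.
    return "strictatime"


def btrfs_mounts_with_atime(mounts_text: str):
    """Return (mount point, mode) for btrfs mounts that are not noatime."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "btrfs":
            continue
        mode = atime_mode(fields[3])
        if mode != "noatime":
            result.append((fields[1], mode))
    return result


if __name__ == "__main__":
    with open("/proc/self/mounts") as f:
        for mount_point, mode in btrfs_mounts_with_atime(f.read()):
            print(mount_point, mode)
```

Read-only mounts are not filtered out here; they show up with whatever atime option they carry even though nothing is written to them.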


(The "for/against atime" is a good subject for another thread!)


Go fast KISS!

Regards,
Martin





* Re: What to do about snapshot-aware defrag
  2014-06-04 13:15         ` Martin
@ 2014-06-04 19:33           ` Chris Murphy
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2014-06-04 19:33 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jun 4, 2014, at 7:15 AM, Martin <m_btrfs@ml1.co.uk> wrote:
> 
> Consider using noatime as a /default/ being as there are no known
> 'must-use' use cases.

The quote I'm finding on the interwebs is that POSIX "requires that
operating systems maintain file system metadata that records when each
file was last accessed". I'm not sure if upstream kernel projects aim
for LSB (and thus POSIX) compliance by default and let distros opt out;
or the opposite.

> Those users still wanting atime can add that as a
> mount option with the note that atime use reduces the snapshot defrag
> effectiveness.

I can imagine some optimizations for Btrfs that are easier than other file systems, like a way to point metadata chunks to specific devices, for example metadata to persistent memory, while the data goes to conventional hard drives.


Chris Murphy

