* btrfs-cleaner / snapshot performance analysis
@ 2018-02-09 16:45 Ellis H. Wilson III
  2018-02-09 17:10 ` Peter Grandi
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Ellis H. Wilson III @ 2018-02-09 16:45 UTC (permalink / raw)
  To: linux-btrfs

Hi all,

I am trying to better understand how the cleaner kthread (btrfs-cleaner) 
impacts foreground performance, specifically during snapshot deletion. 
My experience so far has been that it can be dramatically disruptive to 
foreground I/O.

Looking through the wiki at kernel.org I have not yet stumbled onto any 
analysis that would shed light on this specific problem.  I have found 
numerous complaints about btrfs-cleaner online, especially relating to 
quotas being enabled.  This has proven thus far less than helpful, as 
the response tends to be "use less snapshots," or "disable quotas," both 
of which strike me as intellectually unsatisfying answers, especially 
the former in a filesystem where snapshots are supposed to be 
"first-class citizens."

The 2007 and 2013 Rodeh papers don't do the thorough practical snapshot 
performance analysis I would expect to see given the assertions in the 
latter that "BTRFS...supports efficient snapshots..."  The former is 
sufficiently pre-BTRFS that while it does performance analysis of btree 
clones, it's unclear (to me at least) if the results can be 
forward-propagated in some way to real-world performance expectations 
for BTRFS snapshot creation/deletion/modification.

Has this analysis been performed somewhere else and I'm just missing it? 
  Also, I'll be glad to comment on my specific setup, kernel version, 
etc, and discuss pragmatic work-arounds, but I'd like to better 
understand the high-level performance implications first.

Thanks in advance to anyone who can comment on this.  I am very inclined 
to read anything thrown at me, so if there is documentation I failed to 
read, please just send the link.

Best,

ellis


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-09 16:45 btrfs-cleaner / snapshot performance analysis Ellis H. Wilson III
@ 2018-02-09 17:10 ` Peter Grandi
  2018-02-09 20:36 ` Hans van Kranenburg
  2018-02-11  6:40 ` Qu Wenruo
  2 siblings, 0 replies; 22+ messages in thread
From: Peter Grandi @ 2018-02-09 17:10 UTC (permalink / raw)
  To: Linux fs Btrfs

> I am trying to better understand how the cleaner kthread
> (btrfs-cleaner) impacts foreground performance, specifically
> during snapshot deletion.  My experience so far has been that
> it can be dramatically disruptive to foreground I/O.

That's such a warmly innocent and optimistic question! This post
gives the answer, and to an even more general question:

  http://www.sabi.co.uk/blog/17-one.html?170610#170610

> the response tends to be "use less snapshots," or "disable
> quotas," both of which strike me as intellectually
> unsatisfying answers, especially the former in a filesystem
> where snapshots are supposed to be "first-class citizens."

They are "first class" but not "cost-free".
In particular every extent is linked in a forward map and a
reverse map, and deleting a snapshot involves materializing and
updating a join of the two, which seems to be done with a
classic nested-loop join strategy resulting in N^2 running
time. I suspect that quotas have a similar optimization.


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-09 16:45 btrfs-cleaner / snapshot performance analysis Ellis H. Wilson III
  2018-02-09 17:10 ` Peter Grandi
@ 2018-02-09 20:36 ` Hans van Kranenburg
  2018-02-10 18:29   ` Ellis H. Wilson III
  2018-02-11  6:40 ` Qu Wenruo
  2 siblings, 1 reply; 22+ messages in thread
From: Hans van Kranenburg @ 2018-02-09 20:36 UTC (permalink / raw)
  To: Ellis H. Wilson III, linux-btrfs

On 02/09/2018 05:45 PM, Ellis H. Wilson III wrote:
> 
> I am trying to better understand how the cleaner kthread (btrfs-cleaner)
> impacts foreground performance, specifically during snapshot deletion.
> My experience so far has been that it can be dramatically disruptive to
> foreground I/O.
> 
> Looking through the wiki at kernel.org I have not yet stumbled onto any
> analysis that would shed light on this specific problem.  I have found
> numerous complaints about btrfs-cleaner online, especially relating to
> quotas being enabled.  This has proven thus far less than helpful, as
> the response tends to be "use less snapshots," or "disable quotas," both
> of which strike me as intellectually unsatisfying answers

Well, sometimes those answers help. :) "Oh, yes, I disabled qgroups, I
didn't even realize I had those, and now the problem is gone."

>, especially
> the former in a filesystem where snapshots are supposed to be
> "first-class citizens."

Throwing complaints around is also not helpful.

> The 2007 and 2013 Rodeh papers don't do the thorough practical snapshot
> performance analysis I would expect to see given the assertions in the
> latter that "BTRFS...supports efficient snapshots..."  The former is
> sufficiently pre-BTRFS that while it does performance analysis of btree
> clones, it's unclear (to me at least) if the results can be
> forward-propagated in some way to real-world performance expectations
> for BTRFS snapshot creation/deletion/modification.

I don't really think they can.

> Has this analysis been performed somewhere else and I'm just missing it?
>  Also, I'll be glad to comment on my specific setup, kernel version,
> etc, and discuss pragmatic work-arounds, but I'd like to better
> understand the high-level performance implications first.

The "performance implications" are highly dependent on your specific
setup, kernel version, etc, so it really makes sense to share:

* kernel version
* mount options (from /proc/mounts|grep btrfs)
* is it ssd? hdd? iscsi lun?
* how big is the FS
* how many subvolumes/snapshots? (how many snapshots per subvolume)

And what's essential to look at is what your computer is doing while you
are throwing a list of subvolumes into the cleaner.

* is it using 100% cpu?
* is it showing 100% disk read I/O utilization?
* is it showing 100% disk write I/O utilization? (is it writing lots and
lots of data to disk?)

Since you could be looking at any combination of answers on all those
things, there's not much specific to tell.
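
As an example, something like this gathers most of it in one go (just a
sketch; it assumes the sysstat package for iostat and uses "/" as the
filesystem of interest, so adjust paths and durations to your setup):

  # kernel, mount options, filesystem layout
  uname -r
  grep btrfs /proc/mounts
  sudo btrfs filesystem usage /
  sudo btrfs subvolume list / | wc -l

  # then, while the cleaner is chewing on deleted subvolumes,
  # sample cpu and disk for a minute in parallel
  top -b -d 1 -n 60 | grep btrfs- &
  iostat -x 1 60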

-- 
Hans van Kranenburg


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-09 20:36 ` Hans van Kranenburg
@ 2018-02-10 18:29   ` Ellis H. Wilson III
  2018-02-10 22:05     ` Tomasz Pala
  2018-02-11  1:02     ` Hans van Kranenburg
  0 siblings, 2 replies; 22+ messages in thread
From: Ellis H. Wilson III @ 2018-02-10 18:29 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs

Thank you very much for your response Hans.  Comments in-line, but I did 
want to handle one miscommunication straight-away:

I'm a huge fan of BTRFS.  If I came off like I was complaining, my 
sincere apologies.   To be completely transparent we are using BTRFS in 
a very large project at my company, which I am lead architect on, and 
while I have read the academic papers, perused a subset of the source 
code, and been following its development in the background, I now need 
to deeply understand where there might be performance hiccups.  All of 
our foreground I/O testing with BTRFS in RAID0/RAID1/single across 
different SSDs and HDDs has been stellar, but we haven't dug too far 
into snapshot performance, balancing, and other more background-oriented 
performance.  Hence my interest in finding documentation and analysis I 
can read and grok myself on the implications of snapshot operations on 
foreground I/O if such exists.  More in-line below:

On 02/09/2018 03:36 PM, Hans van Kranenburg wrote:
>> This has proven thus far less than helpful, as
>> the response tends to be "use less snapshots," or "disable quotas," both
>> of which strike me as intellectually unsatisfying answers
> 
> Well, sometimes those answers help. :) "Oh, yes, I disabled qgroups, I
> didn't even realize I had those, and now the problem is gone."

I meant less than helpful for me, since for my project I need detailed 
and fairly accurate capacity information per sub-volume, and the 
relationship between qgroups and subvolume performance wasn't being 
spelled out in the responses.  Please correct me if I am wrong about 
needing qgroups enabled to see detailed capacity information 
per-subvolume (including snapshots).

>> the former in a filesystem where snapshots are supposed to be
>> "first-class citizens."
> 
> Throwing complaints around is also not helpful.

Sorry about this.  It wasn't directed in any way at BTRFS developers, 
but rather at some of the solutions suggested in random forums online.  
As mentioned, I'm a fan of BTRFS, especially as my 
project requires the snapshots to truly be first-class citizens in that 
they are writable and one can roll-back to them at-will, unlike in ZFS 
and other filesystems.  I was just saying it seemed backwards to suggest 
having less snapshots was a solution in a filesystem where the 
architecture appears to treat them as a core part of the design.

> The "performance implications" are highly dependent on your specific
> setup, kernel version, etc, so it really makes sense to share:
> 
> * kernel version
> * mount options (from /proc/mounts|grep btrfs)
> * is it ssd? hdd? iscsi lun?
> * how big is the FS
> * how many subvolumes/snapshots? (how many snapshots per subvolume)

I will answer the above, but would like to reiterate my previous comment 
that I still would like to understand the fundamental relationships here 
as in my project kernel version is very likely to change (to more 
recent), along with mount options and underlying device media.  Once 
this project hits the field I will additionally have limited control 
over how large the FS gets (until physical media space is exhausted of 
course) or how many subvolumes/snapshots there are.  If I know that 
above N snapshots per subvolume performance tanks by M%, I can apply 
limits on the use-case in the field, but I am not aware of those kinds 
of performance implications yet.

My present situation is the following:
- Fairly default opensuse 42.3.
- uname -a: Linux betty 4.4.104-39-default #1 SMP Thu Jan 4 08:11:03 UTC 
2018 (7db1912) x86_64 x86_64 x86_64 GNU/Linux
- /dev/sda6 / btrfs 
rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot 0 0
(I have about 10 other btrfs subvolumes, but this is the only one being 
snapshotted)
- At the time of my noticing the slow-down, I had about 24 snapshots, 10 
of which were in the process of being deleted
- Usage output:
~> sudo btrfs filesystem usage /
Overall:
     Device size:		  40.00GiB
     Device allocated:		  11.54GiB
     Device unallocated:		  28.46GiB
     Device missing:		     0.00B
     Used:			   7.57GiB
     Free (estimated):		  32.28GiB	(min: 32.28GiB)
     Data ratio:			      1.00
     Metadata ratio:		      1.00
     Global reserve:		  28.44MiB	(used: 0.00B)
Data,single: Size:11.01GiB, Used:7.19GiB
    /dev/sda6	  11.01GiB
Metadata,single: Size:512.00MiB, Used:395.91MiB
    /dev/sda6	 512.00MiB
System,single: Size:32.00MiB, Used:16.00KiB
    /dev/sda6	  32.00MiB
Unallocated:
    /dev/sda6	  28.46GiB

> And what's essential to look at is what your computer is doing while you
> are throwing a list of subvolumes into the cleaner.
> 
> * is it using 100% cpu?
> * is it showing 100% disk read I/O utilization?
> * is it showing 100% disk write I/O utilization? (is it writing lots and
> lots of data to disk?)

I noticed the problem when Thunderbird became completely unresponsive. 
I fired up top, and btrfs-cleaner was at the top, along with snapper. 
btrfs-cleaner was at 100% cpu (single-core) for the entirety of the 
time.  I knew I had about 24 snapshots prior to this, and after about 
60s when the pain subsided only about 14 remained, so I estimate 10 were 
deleted as part of snapper's cleaning algorithm.  I quickly also ran 
dstat during the slow-down, and after 5s it finally started and reported 
only about 3-6MB/s in terms of read and write to the drive in question.

I have since run top and dstat before running snapper cleaner manually, 
and the system lock-up does still occur, albeit for shorter times as 
I've only done it with a few snapshots and not much changed in each.
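
(For reference, roughly how I captured this -- just a sketch; "number"
is snapper's cleanup algorithm, so substitute whichever one your config
actually uses:)

  iostat -x 1 60 > iostat.log &
  top -b -d 1 -n 60 | grep -E 'btrfs-cleaner|snapper' > top.log &
  sudo snapper cleanup number   # hands the old snapshots to btrfs-cleaner
  wait                          # let the 60s of sampling finish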

Best,

ellis


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-10 18:29   ` Ellis H. Wilson III
@ 2018-02-10 22:05     ` Tomasz Pala
  2018-02-11 15:59       ` Ellis H. Wilson III
  2018-02-11  1:02     ` Hans van Kranenburg
  1 sibling, 1 reply; 22+ messages in thread
From: Tomasz Pala @ 2018-02-10 22:05 UTC (permalink / raw)
  To: Ellis H. Wilson III; +Cc: linux-btrfs

On Sat, Feb 10, 2018 at 13:29:15 -0500, Ellis H. Wilson III wrote:

>> Well, sometimes those answers help. :) "Oh, yes, I disabled qgroups, I
>> didn't even realize I had those, and now the problem is gone."
> 
> I meant less than helpful for me, since for my project I need detailed 
> and fairly accurate capacity information per sub-volume, and the 

You won't have anything close to "accurate" in btrfs - quotas don't
include space wasted by fragmentation, which can allocate from tens to
thousands of times (sic!) more space than the files themselves.
Not just in worst-case scenarios, but in real-life situations...
I got a 10 MB db-file which was eating 10 GB of space after a week of
regular updates - withOUT snapshotting it. All described here.

> relationship between qgroups and subvolume performance wasn't being 
> spelled out in the responses.  Please correct me if I am wrong about 
> needing qgroups enabled to see detailed capacity information 
> per-subvolume (including snapshots).

Yes, you need that. But while snapshots are in use, it's not
straightforward to interpret the values, especially with regard to
exclusive space (which is not a btrfs limitation, just a logical
consequence) - this was also described in my thread.

> course) or how many subvolumes/snapshots there are.  If I know that 
> above N snapshots per subvolume performance tanks by M%, I can apply 
> limits on the use-case in the field, but I am not aware of those kinds 
> of performance implications yet.

It doesn't work like that. It all depends on the data that is subject
to the snapshots, and especially on how it is updated - how exactly,
including the write patterns.

I think you expect answers that can't be formulated - with a filesystem
architecture as advanced as ZFS or btrfs, its behavior can't be reduced
to simple rules like 'keep fewer than N snapshots'.

If you want PRACTICAL rules, here is one that is not commonly known:
since defragmentation breaks CoW links (a btrfs limitation), so that
all your snapshots can grow to the size of regular copies, defragment
the data just before snapshotting it.

> I noticed the problem when Thunderbird became completely unresponsive. 

Is it using some database engine for storage? Mark the files with nocow.

This is one exception with an easy answer: btrfs doesn't handle
databases with CoW. Period. It doesn't matter whether they are
snapshotted or not; ANY database files (systemd-journal, PostgreSQL,
sqlite, db) are not handled well at all. They slow the entire system
down to the speed of a cheap SD card.

If you have btrfs on your home partition, make sure that AT LEAST all
$USER/.cache directories are chattr +C. The same applies to the entire
/var partition and dozens of other directories holding user databases
(~/.mozilla/firefox, ~/.ccache and many more application-specific ones).

In fact, if you want the quotas to be accurate, you NEED to mount every
volume with possibly hostile write patterns (like /home) as nocow.


Actually, if you do not use compression and don't need checksums of
data blocks, you may want to mount all your btrfs filesystems with
nodatacow by default. This way the quotas would be more accurate (no
fragmentation _between_ snapshots) and you'd get decent performance
with snapshots, if that is all you care about.
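
A minimal sketch of that (paths here are only examples; +C on a
directory only affects files created in it afterwards):

  # mark cache/db directories nodatacow; new files inherit +C
  chattr +C ~/.cache
  sudo chattr +C /var/lib/postgresql
  lsattr -d ~/.cache    # should now show the 'C' flag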

-- 
Tomasz Pala <gotar@pld-linux.org>


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-10 18:29   ` Ellis H. Wilson III
  2018-02-10 22:05     ` Tomasz Pala
@ 2018-02-11  1:02     ` Hans van Kranenburg
  2018-02-11  9:31       ` Andrei Borzenkov
  2018-02-11 16:15       ` Ellis H. Wilson III
  1 sibling, 2 replies; 22+ messages in thread
From: Hans van Kranenburg @ 2018-02-11  1:02 UTC (permalink / raw)
  To: Ellis H. Wilson III, linux-btrfs

Hey,

On 02/10/2018 07:29 PM, Ellis H. Wilson III wrote:
> Thank you very much for your response Hans.  Comments in-line, but I did
> want to handle one miscommunication straight-away:
> 
> I'm a huge fan of BTRFS.  If I came off like I was complaining, my
> sincere apologies.   To be completely transparent we are using BTRFS in
> a very large project at my company, which I am lead architect on, and
> while I have read the academic papers, perused a subset of the source
> code, and been following it's development in the background, I now need
> to deeply understand where there might be performance hiccups.

I'd suggest just trying to do what you want to do for real, finding out
what the problems are and then finding out what to do about them, but I
think that's already almost exactly what you've started doing now. :)

If you ask 100 different btrfs users about your specific situation, you
probably get 100 different answers. So, I'll just throw some of my own
thoughts in here, which may or may not make sense for you.

> All of
> our foreground I/O testing with BTRFS in RAID0/RAID1/single across
> different SSDs and HDDs has been stellar, but we haven't dug too far
> into snapshot performance, balancing, and other more background-oriented
> performance.  Hence my interest in finding documentation and analysis I
> can read and grok myself on the implications of snapshot operations on
> foreground I/O if such exists.

> More in-line below:
> 
> On 02/09/2018 03:36 PM, Hans van Kranenburg wrote:
>>> This has proven thus far less than helpful, as
>>> the response tends to be "use less snapshots," or "disable quotas," both
>>> of which strike me as intellectually unsatisfying answers
>>
>> Well, sometimes those answers help. :) "Oh, yes, I disabled qgroups, I
>> didn't even realize I had those, and now the problem is gone."
> 
> I meant less than helpful for me, since for my project I need detailed
> and fairly accurate capacity information per sub-volume, and the
> relationship between qgroups and subvolume performance wasn't being
> spelled out in the responses.  Please correct me if I am wrong about
> needing qgroups enabled to see detailed capacity information
> per-subvolume (including snapshots).

Aha, so you actually want to use qgroups.

>>> the former in a filesystem where snapshots are supposed to be
>>> "first-class citizens."

They are. But if you put extra optional features X, Y and Z on top
which kill your performance, then snapshots are still supposed to be
first-class citizens; it's just that features X, Y and Z start blurring
the picture a bit.

The problem is that qgroups, quota etc. are still in development, and if
you ask the developers, they will probably be honest about the fact that
you cannot just enable that part of the functionality without some
expected and unexpected performance side effects.

>> Throwing complaints around is also not helpful.
> 
> Sorry about this.  It wasn't directed in any way at BTRFS developers,
> but rather some of the suggestions for solution proposed in random
> forums online.
> As mentioned I'm a fan of BTRFS, especially as my
> project requires the snapshots to truly be first-class citizens in that
> they are writable and one can roll-back to them at-will, unlike in ZFS
> and other filesystems.  I was just saying it seemed backwards to suggest
> having less snapshots was a solution in a filesystem where the
> architecture appears to treat them as a core part of the design.

And I was just saying that subvolumes and snapshots are fine, and that
you shouldn't blame them while your problems might be more likely
qgroups/quota related.

>> The "performance implications" are highly dependent on your specific
>> setup, kernel version, etc, so it really makes sense to share:
>>
>> * kernel version
>> * mount options (from /proc/mounts|grep btrfs)
>> * is it ssd? hdd? iscsi lun?
>> * how big is the FS
>> * how many subvolumes/snapshots? (how many snapshots per subvolume)
> 
> I will answer the above, but would like to reiterate my previous comment
> that I still would like to understand the fundamental relationships here
> as in my project kernel version is very likely to change (to more
> recent), along with mount options and underlying device media.  Once
> this project hits the field I will additionally have limited control
> over how large the FS gets (until physical media space is exhausted of
> course) or how many subvolumes/snapshots there are.  If I know that
> above N snapshots per subvolume performance tanks by M%, I can apply
> limits on the use-case in the field, but I am not aware of those kinds
> of performance implications yet.
> 
> My present situation is the following:
> - Fairly default opensuse 42.3.
> - uname -a: Linux betty 4.4.104-39-default #1 SMP Thu Jan 4 08:11:03 UTC
> 2018 (7db1912) x86_64 x86_64 x86_64 GNU/Linux

You're ignoring 2 years of development and performance improvements. I'd
suggest jumping forward to 4.14 to see which of your problems disappear
right away.

> - /dev/sda6 / btrfs
> rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot
> 0 0

Note that atime updates cause writes to metadata, which means cowing
metadata blocks and unsharing them from a previous snapshot, merely by
using the filesystem, without even changing anything (!). I don't know
what the exact pattern of consequences for quota and subvolume removal
is, but I always mount with noatime to prevent unnecessary metadata
writes from happening when just accessing files.
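
For example (a sketch; also add noatime to the entry in /etc/fstab to
make it persistent):

  # switch the mounted filesystem to noatime on the fly
  sudo mount -o remount,noatime /
  grep ' / btrfs ' /proc/mounts   # verify noatime is now listed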

> (I have about 10 other btrfs subvolumes, but this is the only one being
> snapshotted)
> - At the time of my noticing the slow-down, I had about 24 snapshots, 10
> of which were in the process of being deleted
> - Usage output:
> ~> sudo btrfs filesystem usage /
> Overall:
>     Device size:          40.00GiB

Ok, so small filesystem.

>     Device allocated:          11.54GiB
>     Device unallocated:          28.46GiB
>     Device missing:             0.00B
>     Used:               7.57GiB
>     Free (estimated):          32.28GiB    (min: 32.28GiB)
>     Data ratio:                  1.00
>     Metadata ratio:              1.00
>     Global reserve:          28.44MiB    (used: 0.00B)
> Data,single: Size:11.01GiB, Used:7.19GiB
>    /dev/sda6      11.01GiB
> Metadata,single: Size:512.00MiB, Used:395.91MiB
>    /dev/sda6     512.00MiB
> System,single: Size:32.00MiB, Used:16.00KiB
>    /dev/sda6      32.00MiB
> Unallocated:
>    /dev/sda6      28.46GiB
> 
>> And what's essential to look at is what your computer is doing while you
>> are throwing a list of subvolumes into the cleaner.
>>
>> * is it using 100% cpu?
>> * is it showing 100% disk read I/O utilization?
>> * is it showing 100% disk write I/O utilization? (is it writing lots and
>> lots of data to disk?)
> 
> I noticed the problem when Thunderbird became completely unresponsive. I
> fired up top, and btrfs-cleaner was at the top, along with snapper.

Oh, snapper? Is there a specific reason why you want to use snapper as
the tool for whatever thing you're planning to do?

> btrfs-cleaner was at 100% cpu (single-core) for the entirety of the
> time.

Ok, so your problem is 100% cpu, not excessive disk I/O.

> I knew I had about 24 snapshots prior to this, and after about
> 60s when the pain subsided only about 14 remained, so I estimate 10 were
> deleted as part of snapper's cleaning algorithm.  I quickly also ran
> dstat during the slow-down, and after 5s it finally started and reported
> only about 3-6MB/s in terms of read and write to the drive in question.
> 
> I have since run top and dstat before running snapper cleaner manually,
> and the system lock-up does still occur, albeit for shorter times as
> I've only done it with a few snapshots and not much changed in each.

There certainly have been performance improvements in qgroups over the
last few years, so to repeat myself, please get a recent kernel first.

I don't use qgroups/quota myself, so I can't be of much help on a
detailed level.

-- 
Hans van Kranenburg


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-09 16:45 btrfs-cleaner / snapshot performance analysis Ellis H. Wilson III
  2018-02-09 17:10 ` Peter Grandi
  2018-02-09 20:36 ` Hans van Kranenburg
@ 2018-02-11  6:40 ` Qu Wenruo
  2018-02-14  1:14   ` Darrick J. Wong
  2 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2018-02-11  6:40 UTC (permalink / raw)
  To: Ellis H. Wilson III, linux-btrfs


On 2018-02-10 00:45, Ellis H. Wilson III wrote:
> Hi all,
> 
> I am trying to better understand how the cleaner kthread (btrfs-cleaner)
> impacts foreground performance, specifically during snapshot deletion.
> My experience so far has been that it can be dramatically disruptive to
> foreground I/O.
> 
> Looking through the wiki at kernel.org I have not yet stumbled onto any
> analysis that would shed light on this specific problem.  I have found
> numerous complaints about btrfs-cleaner online, especially relating to
> quotas being enabled.  This has proven thus far less than helpful, as
> the response tends to be "use less snapshots," or "disable quotas," both
> of which strike me as intellectually unsatisfying answers, especially
> the former in a filesystem where snapshots are supposed to be
> "first-class citizens."

Yes, btrfs snapshots really are "first-class citizens".
Tons of design decisions are biased toward snapshots.

But one should be clear about one thing:
snapshot creation and backref walking (used in qgroups, relocation and
extent deletion) are in fact two conflicting workloads.

Btrfs puts snapshot creation at a very high priority, at the cost of
greatly degrading the performance of backref walks (used in snapshot
deletion, relocation and the exclusive/shared extent calculation of
qgroups).

Let me explain this problem in detail.

Just as explained by Peter Grandi, any snapshot system (or any system
that supports reflinks) must have a reverse mapping tree, to tell which
extent is used by whom.

It's critical for determining whether an extent is shared, so we know
whether we need to do CoW.

There are several different ways to implement it, and this hugely
affects snapshot creation performance.

1) Direct mapping record
   Records exactly which extent is used by whom, directly.
   So when we need to check the owner, we just search the tree ONCE and
   we have it.

   This is simple, and it seems that the LVM thin-provisioning and
   traditional LVM targets both use it.
   (Maybe XFS also follows this approach?)

   Pros:
   *FAST* backref walk, which means quick extent deletion and CoW
   condition checks.


   Cons:
   *SLOW* snapshot creation.
   Each snapshot creation needs to insert new owner relationships into
   the tree, and this modification grows with the size of the snapshot
   source.

2) Indirect mapping record
   Records the upper-level referencer only.

   To get all direct owners of an extent, it needs multiple lookups in
   the reverse mapping tree.

   And obviously, btrfs uses this method.

   Pros:
   *FAST* owner inheritance, which means fast snapshot creation.
   (Well, the only advantage I can think of.)

   Cons:
   *VERY SLOW* backref walk, used by extent deletion, relocation, qgroups
   and the CoW condition check.
   (That may also be why btrfs defaults to CoW for data, so that it can
    skip the costly backref walk.)

And a more detailed example of the difference between them will be:

[Basic tree layout]
                             Tree X
                             node A
                           /        \
                        node B         node C
                        /     \       /      \
                     leaf D  leaf E  leaf F  leaf G

Use the above tree X as the snapshot source.

[Snapshot creation: Direct mapping]
With direct mapping records, if we are going to create snapshot Y, we
get:

            Tree X      Tree Y
            node A     <node H>
             |      \ /     |
             |       X      |
             |      / \     |
            node B      node C
         /      \          /     \
      leaf D  leaf E   leaf F   leaf G

We need to create the new node H, and update the owner for nodes/leaves
B/C/D/E/F/G.

That is to say, we need to create 1 new node and update 6 references to
existing nodes/leaves.
This grows rapidly if the tree is large, but it is still a linear
increase.


[Snapshot creation: Indirect mapping]
With an indirect mapping tree, the reverse mapping tree doesn't record
the exact owner of each leaf/node, but only records its parent(s).

So even when tree X exists alone, without snapshot Y, if we need to know
the owner of leaf D, we only know that its only parent is node B.
We then do the same query on node B, and so on, until we reach node A
and learn that it is owned by tree X.

                             Tree X         ^
                             node A         ^ Look upward until
                           /                | we reach tree root
                        node B              | to search the owner
                        /                   | of a leaf/node
                     leaf D                 |

So even in the best case, to look up the owner of leaf D we need to do
3 lookups: one for leaf D, one for node B, and one for node A (which is
the end).
Such lookups get more and more complex if there are extra branches in
the lookup chain.

But this more complicated design makes one thing easier: snapshot
creation:
            Tree X      Tree Y
            node A     <node H>
             |      \ /     |
             |       X      |
             |      / \     |
            node B      node C
         /      \          /     \
      leaf D  leaf E   leaf F   leaf G

Still the same tree Y, snapshotted from tree X.

Apart from the new node H, we only need to update the references for
nodes B and C.

So far so good: with indirect mapping we reduced the modifications to
the reverse mapping tree from 6 to 2, and the reduction becomes even
more obvious as the tree grows larger.

But the problem shows up at snapshot deletion:

[Snapshot deletion: Direct mapping]

To delete snapshot Y:

            Tree X      Tree Y
            node A     <node H>
             |      \ /     |
             |       X      |
             |      / \     |
            node B      node C
         /      \          /     \
      leaf D  leaf E   leaf F   leaf G

Quite straightforward: just check the owner of each node to see if we
can delete the node/leaf.

For direct mapping, we just do the owner lookup in the reverse mapping
tree, 7 times, and we find that node H can be deleted.

That's all. The same amount of work for snapshot creation and deletion.
Not bad.

[Snapshot deletion: Indirect mapping]
Here we still need to do the lookups, 7 of them.

But the difference is that each lookup can cause extra lookups.

For node H, just a single lookup, as it's the root.
But for leaf G, it takes 4 lookups.
            Tree X      Tree Y
            node A     <node H>
                    \       |
                     \      |
                      \     |
                        node C
                             |
                        leaf G

One for leaf G itself, one for node C, one for node A (parent of node C)
and one for node H (parent of node C again).

Summing up the lookups, indirect mapping needs:
1 for node H
3 each for nodes B and C
4 each for leaves D~G

23 lookup operations in total.

And it only gets worse with more snapshots, and it's not a linear
increase.


We could do some optimization though: for the above extent deletion, we
don't really care about all the owners of a node/leaf, only whether the
extent is shared.

In that case, if we find that node C is also shared by tree X, we don't
need to check node H.
With this optimization, the lookup count is reduced to 17.


But here come qgroups and balance, which can't use such an optimization,
as they need to update all owners to handle the ownership change (the
relocation tree for relocation, and the qgroup number changes for
quota).

That's why quota has an obvious impact on performance.


So, in short:
1) Snapshots are not one single easy workload.
   Creation and deletion are different workloads, at least for btrfs.

2) Snapshot deletion and qgroups are the biggest cost, by btrfs design.
   Either reduce the number of snapshots to reduce branches, or disable
   quota to optimize the lookup operations.
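
If you want to see this cost on your own system, a rough sketch (only on
a throw-away test filesystem; /mnt/test and snap_old are just
placeholders):

  # compare how long the cleaner takes with and without quota enabled
  sudo btrfs quota enable /mnt/test
  sudo btrfs subvolume delete /mnt/test/snap_old
  time sudo btrfs subvolume sync /mnt/test   # waits for the cleaner to
                                             # fully drop the subvolume
  # then repeat the same test after 'btrfs quota disable /mnt/test'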

Thanks,
Qu


> 
> The 2007 and 2013 Rodeh papers don't do the thorough practical snapshot
> performance analysis I would expect to see given the assertions in the
> latter that "BTRFS...supports efficient snapshots..."  The former is
> sufficiently pre-BTRFS that while it does performance analysis of btree
> clones, it's unclear (to me at least) if the results can be
> forward-propagated in some way to real-world performance expectations
> for BTRFS snapshot creation/deletion/modification.
> 
> Has this analysis been performed somewhere else and I'm just missing it?
>  Also, I'll be glad to comment on my specific setup, kernel version,
> etc, and discuss pragmatic work-arounds, but I'd like to better
> understand the high-level performance implications first.
> 
> Thanks in advance to anyone who can comment on this.  I am very inclined
> to read anything thrown at me, so if there is documentation I failed to
> read, please just send the link.
> 
> Best,
> 
> ellis




* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-11  1:02     ` Hans van Kranenburg
@ 2018-02-11  9:31       ` Andrei Borzenkov
  2018-02-11 17:25         ` Adam Borowski
  2018-02-11 16:15       ` Ellis H. Wilson III
  1 sibling, 1 reply; 22+ messages in thread
From: Andrei Borzenkov @ 2018-02-11  9:31 UTC (permalink / raw)
  To: Hans van Kranenburg, Ellis H. Wilson III, linux-btrfs

On 11.02.2018 04:02, Hans van Kranenburg wrote:
...
> 
>> - /dev/sda6 / btrfs
>> rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot
>> 0 0
> 
> Note that changes on atime cause writes to metadata, which means cowing
> metadata blocks and unsharing them from a previous snapshot, only when
> using the filesystem, not even when changing things (!).

With relatime, atime is updated only once after a file has been changed.
So your description is not entirely accurate, and things should not be
that dramatic unless files are continuously being changed.




* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-10 22:05     ` Tomasz Pala
@ 2018-02-11 15:59       ` Ellis H. Wilson III
  2018-02-11 18:24         ` Hans van Kranenburg
  0 siblings, 1 reply; 22+ messages in thread
From: Ellis H. Wilson III @ 2018-02-11 15:59 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: linux-btrfs

Thanks Tomasz,

Comments in-line:

On 02/10/2018 05:05 PM, Tomasz Pala wrote:
> You won't have anything close to "accurate" in btrfs - quotas don't
> include space wasted by fragmentation, which happens to allocate from tens
> to thousands times (sic!) more space than the files itself.
> Not in some worst-case scenarios, but in real life situations...
> I got 10 MB db-file which was eating 10 GB of space after a week of
> regular updates - withOUT snapshotting it. All described here.

The underlying filesystem this is replacing was an in-house developed 
COW filesystem, so we're aware of the difficulties of fragmentation. 
I'm more interested in an approximate space consumed across snapshots 
when considering CoW.  I realize it will be approximate.  Approximate is 
ok for us -- no accounting for snapshot space consumed is not.

Also, I don't see the thread you mentioned.  Perhaps you forgot to 
mention it, or an html link didn't come through properly?

>> course) or how many subvolumes/snapshots there are.  If I know that
>> above N snapshots per subvolume performance tanks by M%, I can apply
>> limits on the use-case in the field, but I am not aware of those kinds
>> of performance implications yet.
> 
> This doesn't work like this. It all depends on data that are subject of
> snapshots, especially how they are updated. How exactly, including write
> patterns.
> 
> I think you expect answers that can't be formulated - with fs architecture so
> advanced as ZFS or btrfs it's behavior can't be analyzed for simple
> answers like 'keep less than N snapshots'.

I was using an extremely simple heuristic to drive at what I was looking 
to get out of this.  I should have been more explicit that the example 
was not to be taken literally.

> This is an exception of easy-answer: btrfs doesn't handle databases with
> CoW. Period. Doesn't matter if snapshotted or not, ANY database files
> (systemd-journal, PostgreSQL, sqlite, db) are not handled at all. They
> slow down entire system to the speed of cheap SD card.

I will keep this in mind, thank you.  We do have a higher level above 
BTRFS that stages data.  I will consider implementing an algorithm to 
add the nocow flag to the file if it has been written to sufficiently to 
indicate it will be a bad fit for the BTRFS COW algorithm.

> Actually, if you do not use compression and don't need checksums of data
> blocks, you may want to mount all the btrfs with nocow by default.
> This way the quotas would be more accurate (no fragmentation _between_
> snapshots) and you'll have some decent performance with snapshots.
> If that is all you care.

CoW is still valuable for us as we're shooting to support on the order 
of hundreds of snapshots per subvolume, and without it (if BTRFS COW 
works the same as our old COW FS) that's going to be quite expensive to 
keep snapshots around.  So some hybrid solution is required here.

Best,

ellis


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-11  1:02     ` Hans van Kranenburg
  2018-02-11  9:31       ` Andrei Borzenkov
@ 2018-02-11 16:15       ` Ellis H. Wilson III
  2018-02-11 18:03         ` Hans van Kranenburg
  1 sibling, 1 reply; 22+ messages in thread
From: Ellis H. Wilson III @ 2018-02-11 16:15 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs

Thanks Hans.  Sorry for the top-post, but I'm boiling things down here 
so I don't have a clear line-item to respond to.  The take-aways I see 
here to my original queries are:

1. Nobody has done a thorough analysis of the impact of snapshot 
manipulation WITHOUT qgroups enabled on foreground I/O performance
2. Nobody has done a thorough analysis of the impact of snapshot 
manipulation WITH qgroups enabled on foreground I/O performance
3. I need to look at the code to understand the interplay between 
qgroups, snapshots, and foreground I/O performance as there isn't 
existing architecture documentation to point me to that covers this
4. I should be cautioned that CoW in BTRFS can exhibit pathological (if 
expected) capacity consumption for very random-write-oriented datasets 
with or without snapshots, and nocow (or in my case transparently 
absorbing and coalescing writes at a higher tier) is my friend.
5. I should be cautioned that CoW is broken across snapshots when 
defragmentation is run.

I will update a test system to the most recent kernel and will perform 
tests to answer #1 and #2.  I will plan to share it when I'm done.  If I 
have time to write-up my findings for #3 I will similarly share that.

Thanks to all for your input on this issue.

ellis


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-11  9:31       ` Andrei Borzenkov
@ 2018-02-11 17:25         ` Adam Borowski
  0 siblings, 0 replies; 22+ messages in thread
From: Adam Borowski @ 2018-02-11 17:25 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Hans van Kranenburg, Ellis H. Wilson III, linux-btrfs

On Sun, Feb 11, 2018 at 12:31:42PM +0300, Andrei Borzenkov wrote:
> On 11.02.2018 04:02, Hans van Kranenburg wrote:
> >> - /dev/sda6 / btrfs
> >> rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot
> >> 0 0
> > 
> > Note that changes on atime cause writes to metadata, which means cowing
> > metadata blocks and unsharing them from a previous snapshot, only when
> > using the filesystem, not even when changing things (!).
> 
> With relatime atime is updated only once after file was changed. So your
> description is not entirely accurate and things should not be that
> dramatic unless files are continuously being changed.

Alas, that's untrue.  relatime updates happen if:
* the file has been written after it was last read, or
* the previous atime was more than 24 hours old

Thus, you get at least one unshare per inode per day, which is also the
most common frequency for both snapshotting and cronjobs.

Fortunately, most uses of atime are gone, thus it's generally safe to
disable it completely.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄⠀⠀⠀⠀ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-11 16:15       ` Ellis H. Wilson III
@ 2018-02-11 18:03         ` Hans van Kranenburg
  2018-02-12 14:45           ` Ellis H. Wilson III
  0 siblings, 1 reply; 22+ messages in thread
From: Hans van Kranenburg @ 2018-02-11 18:03 UTC (permalink / raw)
  To: Ellis H. Wilson III, linux-btrfs

On 02/11/2018 05:15 PM, Ellis H. Wilson III wrote:
> Thanks Hans.  Sorry for the top-post, but I'm boiling things down here
> so I don't have a clear line-item to respond to.  The take-aways I see
> here to my original queries are:
> 
> 1. Nobody has done a thorough analysis of the impact of snapshot
> manipulation WITHOUT qgroups enabled on foreground I/O performance
> 2. Nobody has done a thorough analysis of the impact of snapshot
> manipulation WITH qgroups enabled on foreground I/O performance

It's more that there is no simple list of clear-cut answers that apply
to every possible situation and type/pattern of work that you can throw
at a btrfs filesystem.

> 3. I need to look at the code to understand the interplay between
> qgroups, snapshots, and foreground I/O performance as there isn't
> existing architecture documentation to point me to that covers this

Well, the excellent write-up of Qu this morning shows some explanation
from the design point of view.

> 4. I should be cautioned that CoW in BTRFS can exhibit pathological (if
> expected) capacity consumption for very random-write-oriented datasets
> with or without snapshots, and nocow (or in my case transparently
> absorbing and coalescing writes at a higher tier) is my friend.

nocow only keeps the cows at a distance as long as you don't start
snapshotting (or cp --reflink) those files... If you take a snapshot,
you force btrfs to keep around the data that is referenced by the
snapshot. That means that every next write will be cowed once again,
moo, so small writes will be redirected to a new location, causing
fragmentation again. The second and third writes can go into the same
(new) location as the first new write, but as soon as you snapshot
again, this happens all over.

> 5. I should be cautioned that CoW is broken across snapshots when
> defragmentation is run.
> 
> I will update a test system to the most recent kernel and will perform
> tests to answer #1 and #2.  I will plan to share it when I'm done.  If I
> have time to write-up my findings for #3 I will similarly share that.
> 
> Thanks to all for your input on this issue.

Have fun!

-- 
Hans van Kranenburg


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-11 15:59       ` Ellis H. Wilson III
@ 2018-02-11 18:24         ` Hans van Kranenburg
  2018-02-12 15:37           ` Ellis H. Wilson III
  0 siblings, 1 reply; 22+ messages in thread
From: Hans van Kranenburg @ 2018-02-11 18:24 UTC (permalink / raw)
  To: Ellis H. Wilson III, Tomasz Pala; +Cc: linux-btrfs

On 02/11/2018 04:59 PM, Ellis H. Wilson III wrote:
> Thanks Tomasz,
> 
> Comments in-line:
> 
> On 02/10/2018 05:05 PM, Tomasz Pala wrote:
>> You won't have anything close to "accurate" in btrfs - quotas don't
>> include space wasted by fragmentation, which happens to allocate from
>> tens
>> to thousands times (sic!) more space than the files itself.
>> Not in some worst-case scenarios, but in real life situations...
>> I got 10 MB db-file which was eating 10 GB of space after a week of
>> regular updates - withOUT snapshotting it. All described here.
> 
> The underlying filesystem this is replacing was an in-house developed
> COW filesystem, so we're aware of the difficulties of fragmentation. I'm
> more interested in an approximate space consumed across snapshots when
> considering CoW.  I realize it will be approximate.  Approximate is ok
> for us -- no accounting for snapshot space consumed is not.

If your goal is to have an approximate idea for accounting, and you
don't need to be able to actually enforce limits, and if the filesystems
that you are using are as small as the 40GiB example you gave...

Why not just run `btrfs fi du <subvol> <snap1> <snap2>` now and then and
update your accounting with the results, instead of putting the burden
of keeping track of all the accounting on every tiny change, all day
long?
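
For example (a sketch; paths are placeholders, and -s summarizes per
argument):

  # total / exclusive / shared bytes per subvolume and snapshot
  sudo btrfs filesystem du -s /mnt/subvol /mnt/subvol/.snapshots/*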

> Also, I don't see the thread you mentioned.  Perhaps you forgot to
> mention it, or an html link didn't come through properly?
> 
>>> course) or how many subvolumes/snapshots there are.  If I know that
>>> above N snapshots per subvolume performance tanks by M%, I can apply
>>> limits on the use-case in the field, but I am not aware of those kinds
>>> of performance implications yet.
>>
>> This doesn't work like this. It all depends on data that are subject of
>> snapshots, especially how they are updated. How exactly, including write
>> patterns.
>>
>> I think you expect answers that can't be formulated - with fs
>> architecture so
>> advanced as ZFS or btrfs it's behavior can't be analyzed for simple
>> answers like 'keep less than N snapshots'.
> 
> I was using an extremely simple heuristic to drive at what I was looking
> to get out of this.  I should have been more explicit that the example
> was not to be taken literally.
> 
>> This is an exception of easy-answer: btrfs doesn't handle databases with
>> CoW. Period. Doesn't matter if snapshotted or not, ANY database files
>> (systemd-journal, PostgreSQL, sqlite, db) are not handled at all. They
>> slow down entire system to the speed of cheap SD card.
> 
> I will keep this in mind, thank you.  We do have a higher level above
> BTRFS that stages data.  I will consider implementing an algorithm to
> add the nocow flag to the file if it has been written to sufficiently to
> indicate it will be a bad fit for the BTRFS COW algorithm.

Adding the nocow attribute to a file only works when it has just been
created and not yet written to, or when setting it on the containing
directory and letting new files inherit it. You can't just turn it on
for existing files that already have content.

https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F
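
The usual workaround for an existing file is to recreate it inside a +C
directory, for example (just a sketch with placeholder paths; do it
while nothing has the file open):

  mkdir /data/nocow && chattr +C /data/nocow
  cp --reflink=never /data/big.db /data/nocow/big.db  # full copy, new extents
  mv /data/nocow/big.db /data/big.db                  # rename keeps the +C inode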

>> Actually, if you do not use compression and don't need checksums of data
>> blocks, you may want to mount all the btrfs with nocow by default.
>> This way the quotas would be more accurate (no fragmentation _between_
>> snapshots) and you'll have some decent performance with snapshots.
>> If that is all you care.
> 
> CoW is still valuable for us as we're shooting to support on the order
> of hundreds of snapshots per subvolume,

Hundreds will get you into trouble even without qgroups.

> and without it (if BTRFS COW
> works the same as our old COW FS) that's going to be quite expensive to
> keep snapshots around.  So some hybrid solution is required here.

-- 
Hans van Kranenburg


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-11 18:03         ` Hans van Kranenburg
@ 2018-02-12 14:45           ` Ellis H. Wilson III
  2018-02-12 17:09             ` Hans van Kranenburg
  0 siblings, 1 reply; 22+ messages in thread
From: Ellis H. Wilson III @ 2018-02-12 14:45 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs

On 02/11/2018 01:03 PM, Hans van Kranenburg wrote:
>> 3. I need to look at the code to understand the interplay between
>> qgroups, snapshots, and foreground I/O performance as there isn't
>> existing architecture documentation to point me to that covers this
> 
> Well, the excellent write-up of Qu this morning shows some explanation
> from the design point of view.

Sorry, I may have missed this email.  Or perhaps you are referring to a 
wiki or blog post of some kind I'm not following actively?  Either way, 
if you can forward me the link, I'd greatly appreciate it.

> nocow only keeps the cows on a distance as long as you don't start
> snapshotting (or cp --reflink) those files... If you take a snapshot,
> then you force btrfs to keep the data around that is referenced by the
> snapshot. So, that means that every next write will be cowed once again,
> moo, so small writes will be redirected to a new location, causing
> fragmentation again. The second and third write can go in the same (new)
> location of the first new write, but as soon as you snapshot again, this
> happens again.

Ah, very interesting.  Thank you for clarifying!

Best,

ellis


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-11 18:24         ` Hans van Kranenburg
@ 2018-02-12 15:37           ` Ellis H. Wilson III
  2018-02-12 16:02             ` Austin S. Hemmelgarn
  2018-02-13 13:34             ` E V
  0 siblings, 2 replies; 22+ messages in thread
From: Ellis H. Wilson III @ 2018-02-12 15:37 UTC (permalink / raw)
  To: Hans van Kranenburg, Tomasz Pala; +Cc: linux-btrfs

On 02/11/2018 01:24 PM, Hans van Kranenburg wrote:
> Why not just use `btrfs fi du <subvol> <snap1> <snap2>` now and then and
> update your administration with the results? .. Instead of putting the
> burden of keeping track of all administration during every tiny change
> all day long?

I will look into that if using built-in group capacity functionality 
proves to be truly untenable.  Thanks!

>> CoW is still valuable for us as we're shooting to support on the order
>> of hundreds of snapshots per subvolume,
> 
> Hundreds will get you into trouble even without qgroups.

I should have been more specific.  We are looking to use up to a few 
dozen snapshots per subvolume, but will have many (tens to hundreds of) 
discrete subvolumes (each with up to a few dozen snapshots) in a BTRFS 
filesystem.  If I have it wrong and the scalability issues in BTRFS do 
not solely apply to subvolumes and their snapshot counts, please let me 
know.

I will note you focused on my tiny desktop filesystem when making some 
of your previous comments -- this is why I didn't want to share specific 
details.  Our filesystem will be RAID0 with six large HDDs (12TB each). 
Reliability concerns do not apply to our situation for technical 
reasons, but if there are capacity scaling issues in BTRFS I should be 
aware of, I'd be glad to hear them.  I have not seen such a limit in the 
technical documentation, and experiments so far on 6x6TB arrays have 
not shown any performance problems, so I'm inclined to believe the only 
scaling issue is with reflinks.  Correct me if I'm wrong.

Thanks,

ellis


* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-12 15:37           ` Ellis H. Wilson III
@ 2018-02-12 16:02             ` Austin S. Hemmelgarn
  2018-02-12 16:39               ` Ellis H. Wilson III
  2018-02-13 13:34             ` E V
  1 sibling, 1 reply; 22+ messages in thread
From: Austin S. Hemmelgarn @ 2018-02-12 16:02 UTC (permalink / raw)
  To: Ellis H. Wilson III, Hans van Kranenburg, Tomasz Pala; +Cc: linux-btrfs

On 2018-02-12 10:37, Ellis H. Wilson III wrote:
> On 02/11/2018 01:24 PM, Hans van Kranenburg wrote:
>> Why not just use `btrfs fi du <subvol> <snap1> <snap2>` now and then and
>> update your administration with the results? .. Instead of putting the
>> burden of keeping track of all administration during every tiny change
>> all day long?
> 
> I will look into that if using built-in group capacity functionality 
> proves to be truly untenable.  Thanks!
As a general rule, unless you really need to actively prevent a 
subvolume from exceeding its quota, this will generally be more 
reliable and have much less performance impact than using qgroups.
> 
>>> CoW is still valuable for us as we're shooting to support on the order
>>> of hundreds of snapshots per subvolume,
>>
>> Hundreds will get you into trouble even without qgroups.
> 
> I should have been more specific.  We are looking to use up to a few 
> dozen snapshots per subvolume, but will have many (tens to hundreds of) 
> discrete subvolumes (each with up to a few dozen snapshots) in a BTRFS 
> filesystem.  If I have it wrong and the scalability issues in BTRFS do 
> not solely apply to subvolumes and their snapshot counts, please let me 
> know.
The issue isn't so much total number of snapshots as it is how many 
snapshots are sharing data.  If each of your individual subvolumes 
shares no data with any of the others via reflinks (so no deduplication 
across subvolumes, and no copying files around using reflinks or the 
clone ioctl), then I would expect things will be just fine without 
qgroups provided that you're not deleting huge numbers of snapshots at 
the same time.

With qgroups involved, I really can't say for certain, as I've never 
done much with them myself, but based on my understanding of how it all 
works, I would expect multiple subvolumes with a small number of 
snapshots each to not have as many performance issues as a single 
subvolume with the same total number of snapshots.
> 
> I will note you focused on my tiny desktop filesystem when making some 
> of your previous comments -- this is why I didn't want to share specific 
> details.  Our filesystem will be RAID0 with six large HDDs (12TB each). 
> Reliability concerns do not apply to our situation for technical 
> reasons, but if there are capacity scaling issues with BTRFS I should be 
> made aware of, I'd be glad to hear them.  I have not seen any in 
> technical documentation of such a limit, and experiments so far on 6x6TB 
> arrays has not shown any performance problems, so I'm inclined to 
> believe the only scaling issue exists with reflinks.  Correct me if I'm 
> wrong.
BTRFS in general works fine at that scale, dependent of course on the 
level of concurrent access you need to support.  Each tree update needs 
to lock a bunch of things in the tree itself, and having large numbers 
of clients writing to the same set of files concurrently can cause lock 
contention issues because of this, especially if all of them are calling 
fsync() or fdatasync() regularly.  These issues can be mitigated by 
segregating workloads into their own subvolumes (each subvolume is a 
mostly independent filesystem tree), but it sounds like you're already 
doing that, so I don't think that would be an issue for you.

The only other possibility I can think of is that the performance hit 
from qgroups may scale not just based on the number of snapshots of a 
given subvolume, but also the total size of the subvolume (more data 
means more accounting work), though I'm not certain about that (it's 
just a hunch based on what I do know about qgroups).

Now, there are some other odd theoretical cases that may cause issues 
when dealing with really big filesystems, but they're either really 
specific edge cases (for example, starting with a really small 
filesystem and gradually scaling it up in size as it gets full) or 
happen at scales far larger than what you're talking about (on the order 
of at least double digit petabyte scale).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-12 16:02             ` Austin S. Hemmelgarn
@ 2018-02-12 16:39               ` Ellis H. Wilson III
  2018-02-12 18:07                 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 22+ messages in thread
From: Ellis H. Wilson III @ 2018-02-12 16:39 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Hans van Kranenburg, Tomasz Pala; +Cc: linux-btrfs

On 02/12/2018 11:02 AM, Austin S. Hemmelgarn wrote:
>> I will look into that if using built-in group capacity functionality 
>> proves to be truly untenable.  Thanks!
> As a general rule, unless you really need to actively prevent a 
> subvolume from exceeding its quota, this will be more reliable and 
> have much less performance impact than using qgroups.

Ok ok :).  I will plan to go this route, but since I'll want to 
benchmark it either way, I'll include qgroups enabled in the benchmark 
and will report back.
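
Roughly, the shape I have in mind for that benchmark (only a sketch;
the mount point and subvolume below are placeholders, and a real run
would modify data in the subvolume between snapshot and delete) is to
time snapshot deletion, including cleaner completion, with quota off
and then on:

#!/usr/bin/env python3
# Sketch: time snapshot deletion (including btrfs-cleaner completion)
# with and without qgroups.  Paths are placeholders; a realistic test
# would also rewrite part of SUBVOL between snapshot and delete so the
# snapshot actually shares and then diverges from its source.
import subprocess
import time

MNT = "/mnt/pool"
SUBVOL = "/mnt/pool/subvol1"       # assumed to already hold test data
SNAP = "/mnt/pool/subvol1-snap"

def timed_snapshot_delete():
    subprocess.run(["btrfs", "subvolume", "snapshot", SUBVOL, SNAP],
                   check=True)
    start = time.time()
    subprocess.run(["btrfs", "subvolume", "delete", SNAP], check=True)
    # Wait for the cleaner so its cost is included in the measurement.
    subprocess.run(["btrfs", "subvolume", "sync", MNT], check=True)
    return time.time() - start

for quota in ("disable", "enable"):
    subprocess.run(["btrfs", "quota", quota, MNT], check=True)
    if quota == "enable":
        # Enabling quota kicks off a rescan; wait for it before timing.
        subprocess.run(["btrfs", "quota", "rescan", "-w", MNT], check=True)
    print("quota %s: %.1f s" % (quota, timed_snapshot_delete()))
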

> With qgroups involved, I really can't say for certain, as I've never 
> done much with them myself, but based on my understanding of how it all 
> works, I would expect multiple subvolumes with a small number of 
> snapshots each to not have as many performance issues as a single 
> subvolume with the same total number of snapshots.

Glad to hear that.  That was my expectation as well.

> BTRFS in general works fine at that scale, dependent of course on the 
> level of concurrent access you need to support.  Each tree update needs 
> to lock a bunch of things in the tree itself, and having large numbers 
> of clients writing to the same set of files concurrently can cause lock 
> contention issues because of this, especially if all of them are calling 
> fsync() or fdatasync() regularly.  These issues can be mitigated by 
> segregating workloads into their own subvolumes (each subvolume is a 
> mostly independent filesystem tree), but it sounds like you're already 
> doing that, so I don't think that would be an issue for you.
Hmm...I'll think harder about this.  There is potential for us to 
artificially divide access to files across subvolumes automatically 
because of the way we are using BTRFS as a backing store for our 
parallel file system.  So far even with around 1000 threads across about 
10 machines accessing BTRFS via our parallel filesystem over the wire 
we've not seen issues, but if we do I have some ways out I've not 
explored yet.  Thanks!

> Now, there are some other odd theoretical cases that may cause issues 
> when dealing with really big filesystems, but they're either really 
> specific edge cases (for example, starting with a really small 
> filesystem and gradually scaling it up in size as it gets full) or 
> happen at scales far larger than what you're talking about (on the order 
> of at least double digit petabyte scale).

Yea, our use case will be in the tens of TB to hundreds of TB for the 
foreseeable future, so I'm glad to hear this is relatively standard. 
That was my read of the situation as well.

Thanks!

ellis

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-12 14:45           ` Ellis H. Wilson III
@ 2018-02-12 17:09             ` Hans van Kranenburg
  2018-02-12 17:38               ` Ellis H. Wilson III
  0 siblings, 1 reply; 22+ messages in thread
From: Hans van Kranenburg @ 2018-02-12 17:09 UTC (permalink / raw)
  To: Ellis H. Wilson III, linux-btrfs

On 02/12/2018 03:45 PM, Ellis H. Wilson III wrote:
> On 02/11/2018 01:03 PM, Hans van Kranenburg wrote:
>>> 3. I need to look at the code to understand the interplay between
>>> qgroups, snapshots, and foreground I/O performance as there isn't
>>> existing architecture documentation to point me to that covers this
>>
>> Well, the excellent write-up of Qu this morning shows some explanation
>> from the design point of view.
> 
> Sorry, I may have missed this email.  Or perhaps you are referring to a
> wiki or blog post of some kind I'm not following actively?  Either way,
> if you can forward me the link, I'd greatly appreciate it.

You are in the To: of it:

https://www.spinics.net/lists/linux-btrfs/msg74737.html

>> nocow only keeps the cows at a distance as long as you don't start
>> snapshotting (or cp --reflink) those files... If you take a snapshot,
>> then you force btrfs to keep the data around that is referenced by the
>> snapshot. So, that means that every next write will be cowed once again,
>> moo, so small writes will be redirected to a new location, causing
>> fragmentation again. The second and third write can go in the same (new)
>> location as the first new write, but as soon as you snapshot again, this
>> happens again.
> 
> Ah, very interesting.  Thank you for clarifying!

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-12 17:09             ` Hans van Kranenburg
@ 2018-02-12 17:38               ` Ellis H. Wilson III
  0 siblings, 0 replies; 22+ messages in thread
From: Ellis H. Wilson III @ 2018-02-12 17:38 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs

On 02/12/2018 12:09 PM, Hans van Kranenburg wrote:
> You are in the To: of it:
> 
> https://www.spinics.net/lists/linux-btrfs/msg74737.html

Apparently MS365 decided that my disabling of the junk/clutter filter 
rules a year or more ago wasn't wise and re-enabled them.  I had 
wondered why I wasn't seeing my own messages back from the list.  Qu's 
message, along with all of my responses, was in spam.  Go figure, MS 
marking kernel.org mail as spam...

This is exactly what I was looking for, and indeed it is a fantastic 
write-up that I'll need to read over a few times to really let it soak 
in.  Thank you very much, Qu!

Best,

ellis

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-12 16:39               ` Ellis H. Wilson III
@ 2018-02-12 18:07                 ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 22+ messages in thread
From: Austin S. Hemmelgarn @ 2018-02-12 18:07 UTC (permalink / raw)
  To: Ellis H. Wilson III, Hans van Kranenburg, Tomasz Pala; +Cc: linux-btrfs

On 2018-02-12 11:39, Ellis H. Wilson III wrote:
> On 02/12/2018 11:02 AM, Austin S. Hemmelgarn wrote:
>> BTRFS in general works fine at that scale, dependent of course on the 
>> level of concurrent access you need to support.  Each tree update 
>> needs to lock a bunch of things in the tree itself, and having large 
>> numbers of clients writing to the same set of files concurrently can 
>> cause lock contention issues because of this, especially if all of 
>> them are calling fsync() or fdatasync() regularly.  These issues can 
>> be mitigated by segregating workloads into their own subvolumes (each 
>> subvolume is a mostly independent filesystem tree), but it sounds like 
>> you're already doing that, so I don't think that would be an issue for 
>> you.
> Hmm...I'll think harder about this.  There is potential for us to 
> artificially divide access to files across subvolumes automatically 
> because of the way we are using BTRFS as a backing store for our 
> parallel file system.  So far even with around 1000 threads across about 
> 10 machines accessing BTRFS via our parallel filesystem over the wire 
> we've not seen issues, but if we do I have some ways out I've not 
> explored yet.  Thanks!
For what it's worth, most of the issues I've personally seen with 
parallel performance involved very heavy use of fsync(), or lots of 
parallel calls to stat() and statvfs() happening while files are also 
being written to, so it may just be that the way you happen to be 
doing things doesn't cause issues.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-12 15:37           ` Ellis H. Wilson III
  2018-02-12 16:02             ` Austin S. Hemmelgarn
@ 2018-02-13 13:34             ` E V
  1 sibling, 0 replies; 22+ messages in thread
From: E V @ 2018-02-13 13:34 UTC (permalink / raw)
  To: Ellis H. Wilson III, linux-btrfs

On Mon, Feb 12, 2018 at 10:37 AM, Ellis H. Wilson III
<ellisw@panasas.com> wrote:
> On 02/11/2018 01:24 PM, Hans van Kranenburg wrote:
>>
>> Why not just use `btrfs fi du <subvol> <snap1> <snap2>` now and then and
>> update your administration with the results? .. Instead of putting the
>> burden of keeping track of all administration during every tiny change
>> all day long?
>
>
> I will look into that if using built-in group capacity functionality proves
> to be truly untenable.  Thanks!
>
>>> CoW is still valuable for us as we're shooting to support on the order
>>> of hundreds of snapshots per subvolume,
>>
>>
>> Hundreds will get you into trouble even without qgroups.
>
>
> I should have been more specific.  We are looking to use up to a few dozen
> snapshots per subvolume, but will have many (tens to hundreds of) discrete
> subvolumes (each with up to a few dozen snapshots) in a BTRFS filesystem.
> If I have it wrong and the scalability issues in BTRFS do not solely apply
> to subvolumes and their snapshot counts, please let me know.
>
> I will note you focused on my tiny desktop filesystem when making some of
> your previous comments -- this is why I didn't want to share specific
> details.  Our filesystem will be RAID0 with six large HDDs (12TB each).
> Reliability concerns do not apply to our situation for technical reasons,
> but if there are capacity scaling issues with BTRFS I should be made aware
> of, I'd be glad to hear them.  I have not seen any in technical
> documentation of such a limit, and experiments so far on 6x6TB arrays have
> not shown any performance problems, so I'm inclined to believe the only
> scaling issue exists with reflinks.  Correct me if I'm wrong.
>
> Thanks,
>
> ellis
>

When testing btrfs on large volumes, especially with metadata-heavy
operations, I'd suggest you match the node size of your mkfs.btrfs
(-n) to the stripe size used in creating your RAID array. Also, use
the ssd_spread mount option as discussed in a previous thread. It makes
a big difference on arrays. It allocates much more space for metadata,
but it greatly reduces fragmentation over time.
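
For example (a sketch only; the member devices and the 64 KiB stripe
unit are made up, and data/metadata profiles are left at mkfs
defaults):

#!/usr/bin/env python3
# Sketch: create a btrfs filesystem whose node size matches a (made-up)
# 64 KiB RAID stripe unit, then mount it with ssd_spread.
import subprocess

DEVICES = ["/dev/sdb", "/dev/sdc"]     # hypothetical array members
STRIPE_SIZE = 64 * 1024                # match your array's stripe unit
MOUNTPOINT = "/mnt/pool"

subprocess.run(["mkfs.btrfs", "-f", "-n", str(STRIPE_SIZE)] + DEVICES,
               check=True)
subprocess.run(["mount", "-o", "ssd_spread", DEVICES[0], MOUNTPOINT],
               check=True)
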

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: btrfs-cleaner / snapshot performance analysis
  2018-02-11  6:40 ` Qu Wenruo
@ 2018-02-14  1:14   ` Darrick J. Wong
  0 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2018-02-14  1:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Ellis H. Wilson III, linux-btrfs

On Sun, Feb 11, 2018 at 02:40:16PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年02月10日 00:45, Ellis H. Wilson III wrote:
> > Hi all,
> > 
> > I am trying to better understand how the cleaner kthread (btrfs-cleaner)
> > impacts foreground performance, specifically during snapshot deletion.
> > My experience so far has been that it can be dramatically disruptive to
> > foreground I/O.
> > 
> > Looking through the wiki at kernel.org I have not yet stumbled onto any
> > analysis that would shed light on this specific problem.  I have found
> > numerous complaints about btrfs-cleaner online, especially relating to
> > quotas being enabled.  This has proven thus far less than helpful, as
> > the response tends to be "use less snapshots," or "disable quotas," both
> > of which strike me as intellectually unsatisfying answers, especially
> > the former in a filesystem where snapshots are supposed to be
> > "first-class citizens."
> 
> Yes, snapshots in btrfs really are "first-class citizens".
> Tons of design decisions are biased toward snapshots.
> 
> But one should be clear about one thing:
> Snapshot creation and backref walks (used in qgroups, relocation and
> extent deletion) are in fact two conflicting workloads.
> 
> Btrfs puts snapshot creation at a very high priority, which greatly
> degrades the performance of backref walks (used in snapshot deletion,
> relocation and the exclusive/shared extent calculation for qgroups).
> 
> Let me explain this problem in detail.
> 
> Just as explained by Peter Grandi, any snapshot system (or any
> system that supports reflinks) must have a reverse mapping tree to
> tell which extent is used by whom.
> 
> It's critical for determining whether an extent is shared, and thus
> whether we need to do CoW.
> 
> There are several different ways to implement it, and this hugely
> affects snapshot creation performance.
> 
> 1) Direct mapping record
>    Just records exactly which extent is used by whom, directly.
>    So when we need to check the owner, we just search the tree ONCE
>    and we have it.
> 
>    This is simple, and it seems that LVM thin provisioning and
>    traditional LVM targets both use it.
>    (Maybe XFS also follows this way?)

Yes, it does.

>    Pros:
>    *FAST* backref walk, which means quick extent deletion and CoW
>    condition check.
> 
> 
>    Cons:
>    *SLOW* snapshot creation.
>    Each snapshot creation needs to insert new owner relationships
>    into the tree, and this work grows with the size of the snapshot
>    source.

...of course xfs also doesn't support snapshots. :)

--D

> 2) Indirect mapping record
>    Records only the upper-level referencers.
> 
>    To get all the direct owners of an extent, multiple lookups in the
>    reverse mapping tree are needed.
> 
>    And obviously, btrfs uses this method.
> 
>    Pros:
>    *FAST* owner inheritance, which means fast snapshot creation.
>    (Well, the only advantage I can think of)
> 
>    Cons:
>    *VERY SLOW* backref walk, used by extent deletion, relocation,
>    qgroups and the CoW condition check.
>    (That may also be why btrfs defaults to CoW for data, so that it
>     can skip the costly backref walk.)
> 
> A more detailed example of the difference between them:
> 
> [Basic tree layout]
>                              Tree X
>                              node A
>                            /        \
>                         node B         node C
>                         /     \       /      \
>                      leaf D  leaf E  leaf F  leaf G
> 
> Use the tree X above as the snapshot source.
> 
> [Snapshot creation: Direct mapping]
> With a direct mapping record, if we are going to create snapshot Y,
> then we would get:
> 
>             Tree X      Tree Y
>             node A     <node H>
>              |      \ /     |
>              |       X      |
>              |      / \     |
>             node B      node C
>          /      \          /     \
>       leaf D  leaf E   leaf F   leaf G
> 
> We need to create the new node H and update the owner for nodes
> B/C/D/E/F/G.
> 
> That is to say, we need to create 1 new node and update 6 references
> to existing nodes/leaves.
> This grows with the size of the tree, but the increase is still
> linear.
> 
> 
> [Snapshot creation: Indirect mapping]
> With an indirect mapping tree, the reverse mapping tree doesn't
> record the exact owner of each leaf/node; it only records its
> parent(s).
> 
> So even when tree X exists alone, without snapshot Y, if we need to
> know the owner of leaf D we only learn that its only parent is node B.
> We then repeat the query on node B, and so on, until we reach node A
> and know it's owned by tree X.
> 
>                              Tree X         ^
>                              node A         ^ Look upward until
>                            /                | we reach tree root
>                         node B              | to search the owner
>                         /                   | of a leaf/node
>                      leaf D                 |
> 
> So even in the best case, to look up the owner of leaf D we need to
> do 3 lookups: one for leaf D, one for node B, and one for node A
> (which is the end).
> Such lookups get more and more complex if there are extra branches in
> the lookup chain.
> 
> But this complicated design makes one thing easier, namely snapshot
> creation:
>             Tree X      Tree Y
>             node A     <node H>
>              |      \ /     |
>              |       X      |
>              |      / \     |
>             node B      node C
>          /      \          /     \
>       leaf D  leaf E   leaf F   leaf G
> 
> Still the same tree Y, snapshotted from tree X.
> 
> Apart from the new node H, we only need to update the references for
> nodes B and C.
> 
> So far so good: with indirect mapping, we reduced the modifications to
> the reverse mapping tree from 6 to 2.
> And the reduction will be even more pronounced if the tree is larger.
> 
> But the problem is reserved for snapshot deletion:
> 
> [Snapshot deletion]
> 
> To delete snapshot Y:
> 
>             Tree X      Tree Y
>             node A     <node H>
>              |      \ /     |
>              |       X      |
>              |      / \     |
>             node B      node C
>          /      \          /     \
>       leaf D  leaf E   leaf F   leaf G
> 
> [Snapshot deletion: Direct mapping]
> Quite straightforward: just check the owner of each node to see if
> we can delete the node/leaf.
> 
> For direct mapping, we just do the owner lookup in the reverse
> mapping tree, 7 times, and we find that node H can be deleted.
> 
> That's all: about the same amount of work for snapshot creation and
> deletion.  Not bad.
> 
> [Snapshot deletion: Indirect mapping]
> Here we still need to do the lookup for the same 7 nodes/leaves.
> 
> But the difference is that each lookup can cause extra lookups.
> 
> For node H, just one single lookup, as it's the root.
> But for leaf G, 4 lookups are needed.
>             Tree X      Tree Y
>             node A     <node H>
>                     \       |
>                      \      |
>                       \     |
>                         node C
>                              |
>                         leaf G
> 
> One for leaf G itself, one for node C, one for node A (a parent of
> node C) and one for node H (the other parent of node C).
> 
> Summing up the lookups, indirect mapping needs:
> 1 for node H
> 3 each for nodes B and C
> 4 each for leaves D~G
> 
> 23 lookup operations in total.
> 
> And it only gets worse with more snapshots, and the increase is not
> linear.
> 
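
To make the arithmetic above easy to check, here is a small Python
sketch (a toy model of the example tree in this mail, not btrfs code)
that reproduces the numbers: 6 vs 2 reference updates to create
snapshot Y, and 7 vs 23 owner lookups to delete it (without the
sharing short-circuit mentioned just below):

#!/usr/bin/env python3
# Toy model: blocks A..G belong to tree X; snapshot Y adds a new root
# H that shares B and C (and everything below them) with X.
children = {"A": ["B", "C"], "H": ["B", "C"],
            "B": ["D", "E"], "C": ["F", "G"]}

def reachable(root):
    # All tree blocks reachable from a root, the root included.
    seen, todo = set(), [root]
    while todo:
        n = todo.pop()
        if n not in seen:
            seen.add(n)
            todo.extend(children.get(n, []))
    return seen

# Direct mapping: every block records all trees that own it.
shared = reachable("H") - {"H"}
print("direct   create:", len(shared), "owner updates")          # 6
print("direct   delete:", len(reachable("H")), "owner lookups")  # 7

# Indirect mapping: every block records only its parent block(s).
parents = {}
for p, kids in children.items():
    for k in kids:
        parents.setdefault(k, set()).add(p)

print("indirect create:", len(children["H"]), "backref updates")  # 2

def owner_lookups(block):
    # To learn a block's owners we walk all parents up to the roots;
    # every block touched on the way counts as one lookup.
    seen, todo = set(), [block]
    while todo:
        n = todo.pop()
        if n not in seen:
            seen.add(n)
            todo.extend(parents.get(n, ()))
    return len(seen)

total = sum(owner_lookups(b) for b in reachable("H"))
print("indirect delete:", total, "lookups")                       # 23
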
> 
> We could do some optimization, though.  For example, for the extent
> deletion above we don't really care about all the owners of the
> node/leaf, only whether the extent is shared.
> 
> In that case, if we find that node C is also shared by tree X, we
> don't need to check node H.
> With this optimization, the number of lookups is reduced to 17.
> 
> 
> But then come qgroups and balance, which can't use such an
> optimization, as they need to update all owners to handle the owner
> change (the relocation tree for relocation, and qgroup number changes
> for quota).
> 
> That's why quota has such an obvious impact on performance.
> 
> 
> So, in short:
> 1) Snapshots are not a single, uniform workload.
>    Creation and deletion are different workloads, at least for btrfs.
> 
> 2) Snapshot deletion and qgroups are the biggest costs, by btrfs
>    design.
>    Either reduce the number of snapshots to reduce branching, or
>    disable quota to allow the lookup optimization.
> 
> Thanks,
> Qu
> 
> 
> > 
> > The 2007 and 2013 Rodeh papers don't do the thorough practical snapshot
> > performance analysis I would expect to see given the assertions in the
> > latter that "BTRFS...supports efficient snapshots..."  The former is
> > sufficiently pre-BTRFS that while it does performance analysis of btree
> > clones, it's unclear (to me at least) if the results can be
> > forward-propagated in some way to real-world performance expectations
> > for BTRFS snapshot creation/deletion/modification.
> > 
> > Has this analysis been performed somewhere else and I'm just missing it?
> >  Also, I'll be glad to comment on my specific setup, kernel version,
> > etc, and discuss pragmatic work-arounds, but I'd like to better
> > understand the high-level performance implications first.
> > 
> > Thanks in advance to anyone who can comment on this.  I am very inclined
> > to read anything thrown at me, so if there is documentation I failed to
> > read, please just send the link.
> > 
> > Best,
> > 
> > ellis
> 




^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-02-14  1:17 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-09 16:45 btrfs-cleaner / snapshot performance analysis Ellis H. Wilson III
2018-02-09 17:10 ` Peter Grandi
2018-02-09 20:36 ` Hans van Kranenburg
2018-02-10 18:29   ` Ellis H. Wilson III
2018-02-10 22:05     ` Tomasz Pala
2018-02-11 15:59       ` Ellis H. Wilson III
2018-02-11 18:24         ` Hans van Kranenburg
2018-02-12 15:37           ` Ellis H. Wilson III
2018-02-12 16:02             ` Austin S. Hemmelgarn
2018-02-12 16:39               ` Ellis H. Wilson III
2018-02-12 18:07                 ` Austin S. Hemmelgarn
2018-02-13 13:34             ` E V
2018-02-11  1:02     ` Hans van Kranenburg
2018-02-11  9:31       ` Andrei Borzenkov
2018-02-11 17:25         ` Adam Borowski
2018-02-11 16:15       ` Ellis H. Wilson III
2018-02-11 18:03         ` Hans van Kranenburg
2018-02-12 14:45           ` Ellis H. Wilson III
2018-02-12 17:09             ` Hans van Kranenburg
2018-02-12 17:38               ` Ellis H. Wilson III
2018-02-11  6:40 ` Qu Wenruo
2018-02-14  1:14   ` Darrick J. Wong
