* BTRFS for OLTP Databases
@ 2017-02-07 13:53 Peter Zaitsev
  2017-02-07 14:00 ` Hugo Mills
                   ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Peter Zaitsev @ 2017-02-07 13:53 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I have tried BTRFS from Ubuntu 16.04 LTS for a write-intensive OLTP MySQL
workload.

It did not go very well, ranging from multi-second stalls where no
transactions are completed to, finally, a kernel OOPS with a "no space left
on device" error message and the filesystem going read-only.

I'm a complete newbie with BTRFS, so I assume I'm doing something wrong.

Do you have any advice on how BTRFS should be tuned for an OLTP workload
(large files with a lot of random writes)?  Or is this a case where one
should simply stay away from BTRFS and use something else?

One item recommended in some places is "nodatacow"; this, however, defeats
the main reason I'm looking at BTRFS: I am interested in "free" snapshots,
which look very attractive for database recovery scenarios, allowing
instant rollback to a previous state.

-- 
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 13:53 BTRFS for OLTP Databases Peter Zaitsev
@ 2017-02-07 14:00 ` Hugo Mills
  2017-02-07 14:13   ` Peter Zaitsev
                     ` (2 more replies)
  2017-02-07 14:47 ` Peter Grandi
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 42+ messages in thread
From: Hugo Mills @ 2017-02-07 14:00 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1989 bytes --]

On Tue, Feb 07, 2017 at 08:53:35AM -0500, Peter Zaitsev wrote:
> Hi,
> 
> I have tried BTRFS from Ubuntu 16.04 LTS   for write intensive OLTP MySQL
> Workload.
> 
> It did not go very well ranging from multi-seconds stalls where no
> transactions are completed to the finally kernel OOPS with "no space left
> on device" error message and filesystem going read only.
> 
> I'm complete newbie in BTRFS so  I assume  I'm doing something wrong.
> 
> Do you have any advice on how BTRFS should be tuned for OLTP workload
> (large files having a lot of random writes)  ?    Or is this the case where
> one should simply stay away from BTRFS and use something else ?
> 
> One item recommended in some places is "nodatacow"  this however defeats
> the main purpose I'm looking at BTRFS -  I am interested in "free"
> snapshots which look very attractive to use for database recovery scenarios
> allow instant rollback to the previous state.

   Well, nodatacow will still allow snapshots to work, but it also
allows the data to fragment. Each snapshot made will cause subsequent
writes to shared areas to be CoWed once (and then it reverts to
unshared and nodatacow again).

   There's another approach which might be worth testing, which is to
use autodefrag. This will increase data write I/O, because where you
have one or more small writes in a region, it will also read and write
the data in a small neighbourhood around those writes, so the
fragmentation is reduced. This will improve subsequent read
performance.

   I could also suggest getting the latest kernel you can -- 16.04 is
already getting on for a year old, and there may be performance
improvements in upstream kernels which affect your workload. There's
an Ubuntu kernel PPA you can use to get the new kernels without too
much pain.
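   For what it's worth, a minimal sketch of trying this; the mount
point below is hypothetical, and autodefrag can be enabled with a
remount without taking the filesystem offline:

```shell
# Enable autodefrag on an already-mounted btrfs filesystem
# (hypothetical mount point; adjust to your setup).
mount -o remount,autodefrag /var/lib/mysql

# To make it permanent, carry the option in /etc/fstab, e.g.:
# UUID=<fs-uuid>  /var/lib/mysql  btrfs  defaults,autodefrag  0 0
```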

   Hugo.

-- 
Hugo Mills             | I don't care about "it works on my machine". We are
hugo@... carfax.org.uk | not shipping your machine.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 14:00 ` Hugo Mills
@ 2017-02-07 14:13   ` Peter Zaitsev
  2017-02-07 15:00     ` Timofey Titovets
                       ` (3 more replies)
  2017-02-07 19:31   ` Peter Zaitsev
  2017-02-08  2:11   ` Peter Zaitsev
  2 siblings, 4 replies; 42+ messages in thread
From: Peter Zaitsev @ 2017-02-07 14:13 UTC (permalink / raw)
  To: Hugo Mills, Peter Zaitsev, linux-btrfs

Hi Hugo,

For the use case I'm looking at, I'm interested in having snapshot(s)
open at all times.  Imagine, for example, a snapshot being created every
hour and several of these snapshots kept at all times, providing quick
recovery points to the state of 1, 2, or 3 hours ago.  In such a case (as I
think you also describe), nodatacow does not provide any advantage.

I have not seen autodefrag helping much, but I will try again.  Is
there any autodefrag documentation available about how it is expected
to work and whether it can be tuned in any way?

I noticed that remounting an already fragmented filesystem with
autodefrag and running a workload which causes more fragmentation does
not seem to improve things over time.
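
To make the scheme concrete, here is a hedged sketch of such a
rotation, run hourly from cron; all paths are hypothetical:

```shell
#!/bin/sh
# Hypothetical hourly snapshot rotation keeping the last 3 read-only
# snapshots of a database subvolume as recovery points.
DB=/mnt/btrfs/mysql          # assumed database subvolume
SNAPS=/mnt/btrfs/snapshots   # assumed snapshot directory

btrfs subvolume snapshot -r "$DB" "$SNAPS/mysql-$(date +%Y%m%d-%H)"

# Drop everything but the 3 newest snapshots.
ls -1d "$SNAPS"/mysql-* | sort | head -n -3 |
while read -r old; do
    btrfs subvolume delete "$old"
done
```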



>    Well, nodatacow will still allow snapshots to work, but it also
> allows the data to fragment. Each snapshot made will cause subsequent
> writes to shared areas to be CoWed once (and then it reverts to
> unshared and nodatacow again).
>
>    There's another approach which might be worth testing, which is to
> use autodefrag. This will increase data write I/O, because where you
> have one or more small writes in a region, it will also read and write
> the data in a small neighbourhood around those writes, so the
> fragmentation is reduced. This will improve subsequent read
> performance.
>
>    I could also suggest getting the latest kernel you can -- 16.04 is
> already getting on for a year old, and there may be performance
> improvements in upstream kernels which affect your workload. There's
> an Ubuntu kernel PPA you can use to get the new kernels without too
> much pain.







-- 
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 13:53 BTRFS for OLTP Databases Peter Zaitsev
  2017-02-07 14:00 ` Hugo Mills
@ 2017-02-07 14:47 ` Peter Grandi
  2017-02-07 15:06 ` Austin S. Hemmelgarn
  2017-02-07 18:27 ` Jeff Mahoney
  3 siblings, 0 replies; 42+ messages in thread
From: Peter Grandi @ 2017-02-07 14:47 UTC (permalink / raw)
  To: linux-btrfs

> I have tried BTRFS from Ubuntu 16.04 LTS for write intensive
> OLTP MySQL Workload.

This has a lot of interesting and mostly agreeable information:

https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp

The main target of Btrfs is where one wants checksums and
occasional snapshot for backup (rather than rollback) and
applications do whole-file rewrites or appends.

> It did not go very well ranging from multi-seconds stalls
> where no transactions are completed

That usually is more because of the "clever" design and defaults
of the Linux page cache and block IO subsystem, which are
astutely pessimized for every workload, but especially for
read-modify-write ones, never mind for RMW workloads on
copy-on-write filesystems.

That most OS designs are pessimized for anything like a "write
intensive OLTP" workload is not new, M Stonebraker complained
about that 35 years ago, and nothing much has changed:

  http://www.sabi.co.uk/blog/anno05-4th.html?051012d#051012d

> to the finally kernel OOPS with "no space left on device"
> error message and filesystem going read only.

That's because Btrfs has a two-level allocator, where space is
allocated in 1GiB chunks (distinct as to data and metadata) and
then in 16KiB nodes, and this makes it far more likely for free
space fragmentation to occur. Therefore Btrfs has a free space
compactor ('btrfs balance') that must be run more often the more
updates happen.
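
A hedged example of invoking the compactor with usage filters; the
filter percentages are illustrative, not a recommendation:

```shell
# Rewrite only data/metadata chunks that are at most 50% full,
# which bounds how much I/O the balance generates.
btrfs balance start -dusage=50 -musage=50 /mnt/database

# Progress can be checked from another terminal:
btrfs balance status /mnt/database
```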

> interested in "free" snapshots which look very attractive

The general problem is that it is pretty much impossible to have
read-modify-write rollbacks for cheap, because the writes in
general are scattered (that is their time coherence is very
different from their spatial coherence). That means either heavy
spatial fragmentation or huge write amplification.

The 'snapshot' type of DM/LVM2 device delivers heavy spatial
fragmentation; Btrfs strikes a balance between the two. Another
commenter has mentioned the use of 'nodatacow' to prevent RMW from
resulting in huge write amplification.

> to use for database recovery scenarios allow instant rollback
> to the previous state.

You may be more interested in NILFS2 for that, but there are
significant tradeoffs there too, and NILFS2 requires a free
space compactor too, plus since NILFS2 gives up on short-term
spatial coherence, the compactor also needs to compact data
space.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 14:13   ` Peter Zaitsev
@ 2017-02-07 15:00     ` Timofey Titovets
  2017-02-07 15:09       ` Austin S. Hemmelgarn
  2017-02-07 16:22     ` Lionel Bouton
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Timofey Titovets @ 2017-02-07 15:00 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: Hugo Mills, linux-btrfs

2017-02-07 17:13 GMT+03:00 Peter Zaitsev <pz@percona.com>:
> Hi Hugo,
>
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.
>
> I have not seen autodefrag helping much but I will try again.     Is
> there any autodefrag documentation available about how is it expected
> to work and if it can be tuned in any way
>
> I noticed remounting already fragmented filesystem with autodefrag
> and putting workload  which does more fragmentation does not seem to
> improve over time
>
>
>
>>    Well, nodatacow will still allow snapshots to work, but it also
>> allows the data to fragment. Each snapshot made will cause subsequent
>> writes to shared areas to be CoWed once (and then it reverts to
>> unshared and nodatacow again).
>>
>>    There's another approach which might be worth testing, which is to
>> use autodefrag. This will increase data write I/O, because where you
>> have one or more small writes in a region, it will also read and write
>> the data in a small neighbourhood around those writes, so the
>> fragmentation is reduced. This will improve subsequent read
>> performance.
>>
>>    I could also suggest getting the latest kernel you can -- 16.04 is
>> already getting on for a year old, and there may be performance
>> improvements in upstream kernels which affect your workload. There's
>> an Ubuntu kernel PPA you can use to get the new kernels without too
>> much pain.
>
>
>
>
>
>
>
> --
> Peter Zaitsev, CEO, Percona
> Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

I think you have a problem with extent bookkeeping (if I understand
correctly how btrfs manages extents).
To deal with it, try enabling compression, as compression will force
all extents to be fragmented into pieces of ~128kb.

I had a similar problem with MySQL (Zabbix as the workload, i.e.
most of the load is random writes), and I fixed it by enabling
compression. (I use Debian with the latest kernel from backports.)
Now it just works, with stable speed under a stable load.

P.S.
(I also use your Percona MySQL from time to time; it's cool.)

-- 
Have a nice day,
Timofey.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 13:53 BTRFS for OLTP Databases Peter Zaitsev
  2017-02-07 14:00 ` Hugo Mills
  2017-02-07 14:47 ` Peter Grandi
@ 2017-02-07 15:06 ` Austin S. Hemmelgarn
  2017-02-07 19:39   ` Kai Krakow
  2017-02-07 18:27 ` Jeff Mahoney
  3 siblings, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-07 15:06 UTC (permalink / raw)
  To: Peter Zaitsev, linux-btrfs

On 2017-02-07 08:53, Peter Zaitsev wrote:
> Hi,
>
> I have tried BTRFS from Ubuntu 16.04 LTS   for write intensive OLTP MySQL
> Workload.
>
> It did not go very well ranging from multi-seconds stalls where no
> transactions are completed to the finally kernel OOPS with "no space left
> on device" error message and filesystem going read only.
How much spare space did you have allocated in the filesystem?  At a 
minimum, you want at least a few GB beyond what you expect to be the 
maximum size of your data set, multiplied by the number of snapshots you 
plan to keep around at any given time.
>
> I'm complete newbie in BTRFS so  I assume  I'm doing something wrong.
Not exactly wrong, but getting this to work efficiently is more art than 
engineering.
>
> Do you have any advice on how BTRFS should be tuned for OLTP workload
> (large files having a lot of random writes)  ?    Or is this the case where
> one should simply stay away from BTRFS and use something else ?
The general recommendation is usually to avoid BTRFS for such things. 
There are however a number of things you can do to improve performance:
1. Use a backing storage format that has the minimal amount of 
complexity.  The more data structures that get updated when a record 
changes, the worse the performance will be.  I don't have enough 
experience with MySQL to give a specific recommendation on what backing 
storage format to use, but someone else might.
2. Avoid large numbers of small transactions.  The smaller the 
transaction, the worse it will fragment things.
3. Use autodefrag.  This will increase write load on the storage device, 
but it should improve performance for reads.
4. Try using in-line compression.  This can actually significantly 
improve performance, especially if you have slow storage devices and a 
really nice CPU.
5. If you're running raid10 mode for BTRFS, run raid1 on top of two LVM 
or MD RAID0 devices instead.  This sounds stupid, but it actually will 
hugely improve both read and write performance without sacrificing any 
data safety.
6. Look at I/O scheduler tuning.  This can have a huge impact, 
especially considering that most of the defaults for the various 
schedulers are somewhat poor for most modern systems.  I won't go into 
the details here, since there are a huge number of online resources 
about this.
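For point 5, the layout might be sketched like this (device names are
placeholders, and this is untested as written):

```shell
# Two MD RAID0 stripes, each across two disks (placeholder devices).
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd

# BTRFS raid1 across the two stripes: every block is stored on both
# stripes, so a single-disk failure remains survivable.
mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
```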
>
> One item recommended in some places is "nodatacow"  this however defeats
> the main purpose I'm looking at BTRFS -  I am interested in "free"
> snapshots which look very attractive to use for database recovery scenarios
> allow instant rollback to the previous state.
Snapshots aren't free.  They are quick, but they aren't free by any 
means.  If you're going to be using snapshots, keep them to a minimum; 
performance scales inversely with the number of snapshots, and this has 
a much bigger impact the more you're trying to do on the 
filesystem.  Also, consider whether or not you _actually_ need 
filesystem level snapshots.  I don't know about your full software 
stack, but most good OLTP software supports rollback segments (or an 
equivalent with a different name), and those are probably what you want 
to use, not filesystem snapshots.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 15:00     ` Timofey Titovets
@ 2017-02-07 15:09       ` Austin S. Hemmelgarn
  2017-02-07 15:20         ` Timofey Titovets
  0 siblings, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-07 15:09 UTC (permalink / raw)
  To: Timofey Titovets, Peter Zaitsev; +Cc: Hugo Mills, linux-btrfs

On 2017-02-07 10:00, Timofey Titovets wrote:
> 2017-02-07 17:13 GMT+03:00 Peter Zaitsev <pz@percona.com>:
>> Hi Hugo,
>>
>> For the use case I'm looking for I'm interested in having snapshot(s)
>> open at all time.  Imagine  for example snapshot being created every
>> hour and several of these snapshots  kept at all time providing quick
>> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
>> think you also describe)  nodatacow  does not provide any advantage.
>>
>> I have not seen autodefrag helping much but I will try again.     Is
>> there any autodefrag documentation available about how is it expected
>> to work and if it can be tuned in any way
>>
>> I noticed remounting already fragmented filesystem with autodefrag
>> and putting workload  which does more fragmentation does not seem to
>> improve over time
>>
>>
>>
>>>    Well, nodatacow will still allow snapshots to work, but it also
>>> allows the data to fragment. Each snapshot made will cause subsequent
>>> writes to shared areas to be CoWed once (and then it reverts to
>>> unshared and nodatacow again).
>>>
>>>    There's another approach which might be worth testing, which is to
>>> use autodefrag. This will increase data write I/O, because where you
>>> have one or more small writes in a region, it will also read and write
>>> the data in a small neighbourhood around those writes, so the
>>> fragmentation is reduced. This will improve subsequent read
>>> performance.
>>>
>>>    I could also suggest getting the latest kernel you can -- 16.04 is
>>> already getting on for a year old, and there may be performance
>>> improvements in upstream kernels which affect your workload. There's
>>> an Ubuntu kernel PPA you can use to get the new kernels without too
>>> much pain.
>>
>>
>>
>>
>>
>>
>>
>> --
>> Peter Zaitsev, CEO, Percona
>> Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> I think that you have a problem with extent bookkeeping (if i
> understand how btrfs manage extents).
> So for deal with it, try enable compression, as compression will force
> all extents to be fragmented with size ~128kb.
No, it will compress everything in chunks of 128kB, but it will not 
fragment things any more than they already would have been (it may 
actually _reduce_ fragmentation because there is less data being stored 
on disk).  This representation is a bug in the FIEMAP ioctl, it doesn't 
understand the way BTRFS represents things properly.  IIRC, there was a 
patch to fix this, but I don't remember what happened with it.

That said, in-line compression can help significantly, especially if you 
have slow storage devices.
>
> I did have a similar problem with MySQL (Zabbix as a workload, i.e.
> most time load are random write), and i fix it, by enabling
> compression. (I use debian with latest kernel from backports)
> At now it just works with stable speed under stable load.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 15:09       ` Austin S. Hemmelgarn
@ 2017-02-07 15:20         ` Timofey Titovets
  2017-02-07 15:43           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 42+ messages in thread
From: Timofey Titovets @ 2017-02-07 15:20 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Peter Zaitsev, Hugo Mills, linux-btrfs

>> I think that you have a problem with extent bookkeeping (if i
>> understand how btrfs manage extents).
>> So for deal with it, try enable compression, as compression will force
>> all extents to be fragmented with size ~128kb.
>
> No, it will compress everything in chunks of 128kB, but it will not fragment
> things any more than they already would have been (it may actually _reduce_
> fragmentation because there is less data being stored on disk).  This
> representation is a bug in the FIEMAP ioctl, it doesn't understand the way
> BTRFS represents things properly.  IIRC, there was a patch to fix this, but
> I don't remember what happened with it.
>
> That said, in-line compression can help significantly, especially if you
> have slow storage devices.


What I mean is:
You have a 128MB extent and rewrite random 4k sectors; btrfs will not
split the 128MB extent and will not free up the data (I don't know the
internal algorithm, so I can't predict when this will happen). After
some time, btrfs will rebuild the extents and split the 128MB extent
into several smaller ones. But when you use compression, the allocator
rebuilds extents much earlier (I think it's because btrfs also operates
on it as 128kb extents, even if it's a contiguous 128MB chunk of data).

-- 
Have a nice day,
Timofey.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 15:20         ` Timofey Titovets
@ 2017-02-07 15:43           ` Austin S. Hemmelgarn
  2017-02-07 21:14             ` Kai Krakow
  0 siblings, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-07 15:43 UTC (permalink / raw)
  To: Timofey Titovets; +Cc: Peter Zaitsev, Hugo Mills, linux-btrfs

On 2017-02-07 10:20, Timofey Titovets wrote:
>>> I think that you have a problem with extent bookkeeping (if i
>>> understand how btrfs manage extents).
>>> So for deal with it, try enable compression, as compression will force
>>> all extents to be fragmented with size ~128kb.
>>
>> No, it will compress everything in chunks of 128kB, but it will not fragment
>> things any more than they already would have been (it may actually _reduce_
>> fragmentation because there is less data being stored on disk).  This
>> representation is a bug in the FIEMAP ioctl, it doesn't understand the way
>> BTRFS represents things properly.  IIRC, there was a patch to fix this, but
>> I don't remember what happened with it.
>>
>> That said, in-line compression can help significantly, especially if you
>> have slow storage devices.
>
>
> I mean that:
> You have a 128MB extent, you rewrite random 4k sectors, btrfs will not
> split 128MB extent, and not free up data, (i don't know internal algo,
> so i can't predict when this will hapen), and after some time, btrfs
> will rebuild extents, and split 128 MB exten to several more smaller.
> But when you use compression, allocator rebuilding extents much early
> (i think, it's because btrfs also operates with that like 128kb
> extent, even if it's a continuos 128MB chunk of data).
>
The allocator has absolutely nothing to do with this, it's a function of 
the COW operation.  Unless you're using nodatacow, that 128MB extent 
will get split the moment the data hits the storage device (either on 
the next commit cycle (at most 30 seconds with the default commit 
cycle), or when fdatasync is called, whichever is sooner).  In the case 
of compression, it's still one extent (although on disk it will be less 
than 128MB) and will be split at _exactly_ the same time under _exactly_ 
the same circumstances as an uncompressed extent.  IOW, it has 
absolutely nothing to do with the extent handling either.

The difference arises in that compressed data effectively has an on-media 
block size of 128k, not 16k (the current default block size) or 4k (the 
old default).  This means that the smallest fragment possible for a file 
with in-line compression enabled is 128k, while for a file without it 
it's equal to the filesystem block size.  A larger minimum fragment size 
means that the maximum number of fragments a given file can have is 
smaller (8 times smaller in fact than without compression when using the 
current default block size), which means that there will be less 
fragmentation.

Some rather complex and tedious math indicates that this is not the 
_only_ thing improving performance when using in-line compression, but 
it's probably the biggest thing doing so for the workload being discussed.
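
The arithmetic behind that factor of eight is easy to check; the
worst-case fragment count is just the file size divided by the minimum
fragment size:

```shell
# Worst-case fragment counts for a 1GiB file (all sizes in bytes).
FILE=$((1024 * 1024 * 1024))

FRAGS_16K=$((FILE / (16 * 1024)))    # current default block size
FRAGS_128K=$((FILE / (128 * 1024)))  # compressed minimum fragment

echo "16k blocks:  $FRAGS_16K fragments max"      # 65536
echo "128k blocks: $FRAGS_128K fragments max"     # 8192
echo "ratio:       $((FRAGS_16K / FRAGS_128K))x"  # 8x
```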

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 14:13   ` Peter Zaitsev
  2017-02-07 15:00     ` Timofey Titovets
@ 2017-02-07 16:22     ` Lionel Bouton
  2017-02-07 19:57     ` Roman Mamedov
  2017-02-07 20:36     ` Kai Krakow
  3 siblings, 0 replies; 42+ messages in thread
From: Lionel Bouton @ 2017-02-07 16:22 UTC (permalink / raw)
  To: Peter Zaitsev, Hugo Mills, linux-btrfs

Hi Peter,

Le 07/02/2017 à 15:13, Peter Zaitsev a écrit :
> Hi Hugo,
>
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.
>
> I have not seen autodefrag helping much but I will try again.     Is
> there any autodefrag documentation available about how is it expected
> to work and if it can be tuned in any way

There's not much that can be done if the same file is modified in 2
different subvolumes (typically the original and a R/W snapshot). You
either break the reflink around the modification to limit the amount of
fragmentation (which will use disk space and write I/O) or get
fragmentation on at least one subvolume (which will add seeks).
So the only options are either to flatten the files (which can be done
incrementally by defragmenting them on both sides when they change) or
only defragment the most used volume (especially if the other is a
relatively short-lived snapshot where performance won't degrade much
until it is removed and won't matter much).

I just modified our defragmenter scheduler to be aware of multiple
subvolumes and support ignoring some of them. The previous version (not
tagged, sorry) was battle tested on a Ceph cluster and was designed for
it. Autodefrag didn't work with Ceph with our workload (latency went
through the roof, OSDs were timing out requests, ...) and our scheduler
with some simple Ceph BTRFS related tunings gave us even better
performance than XFS (which is usually the recommended choice with
current Ceph versions).

The current version is probably still rough around the edges as it is
brand new (most of the work was done last Sunday) and is only running on a
backup server with a situation not much different from yours: a large
PostgreSQL slave (>50GB) which is snapshotted hourly and daily, with a
daily snapshot used to start a PostgreSQL instance for "tests on real
data" purposes, plus a copy of a <10TB NFS server with similar snapshots
in place. All of this is on a single RAID10 13-14TB BTRFS filesystem.
In our case using autodefrag on this slowly degraded performance to the
point where off-site backups became slow enough to warrant preventive
measures.
The current scheduler looks for the mountpoints of top BTRFS volumes (so
you have to mount the top volume somewhere), and defragments them avoiding :
- read-only snapshots,
- all data below configurable subdirs (including read-write subvolumes
even if they are mounted elsewhere), see README.md for instructions.

It slowly walks all files eligible for defragmentation and in parallel
detects writes to the same filesystem, including writes to read-write
subvolumes mounted elsewhere to trigger defragmentation. The scheduler
uses an estimated "cost" for each file to prioritize defragmentation
tasks and with default settings tries to keep I/O activity low enough
that it doesn't slow down other tasks too much. However it defragments
files whole, which might put some strain for huge ibdata* files if you
didn't switch to file per table. In our case defragmenting 1GB files is
OK and doesn't have a major impact.

We are already seeing better performance (our total daily backup time is
below worrying levels again) and the scheduler hasn't even finished
walking the whole filesystem (there are approximately 8 million files
and it is configured to evaluate them over a week). This is probably
because it follows the most write-active files (which are in the
PostgreSQL slave directory) and defragmented most of them early.

Note that it is tuned for filesystems using ~2TB 7200rpm drives (there
are some options that will adapt it to subsystems with more I/O
capacity). Using drives with different capacities shouldn't need tuning,
but it probably will not work well on SSD (it should be configured to
speed up significantly).

See https://github.com/jtek/ceph-utils; you want btrfs-defrag-scheduler.rb.

Some parameters are available (start it with --help). You should
probably start it with --verbose, at least until you are comfortable with
it, to get a list of which files are defragmented, along with many debug
messages you will probably want to ignore (or you'll have to read the
Ruby code to fully understand what they mean).
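
For example, a first run might look like this; the script name is taken
from the repository above, and --help will show the current options:

```shell
git clone https://github.com/jtek/ceph-utils.git
cd ceph-utils

# Run in the foreground with verbose output until you trust it.
ruby btrfs-defrag-scheduler.rb --verbose
```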

I don't provide any warranty for it but the worst I believe can happen
is no performance improvements or performance degradation until you stop
it. If you don't blacklist read-write snapshots with the .no-defrag file
(see README.md) defragmentation will probably eat more disk space than
usual. Space usage will go up rapidly during defragmentation if you have
snapshots, it is supposed to go down after all snapshots referring to
fragmented files are removed and replaced by new snapshots (where
fragmentation should be more stable).

Best regards,

Lionel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 13:53 BTRFS for OLTP Databases Peter Zaitsev
                   ` (2 preceding siblings ...)
  2017-02-07 15:06 ` Austin S. Hemmelgarn
@ 2017-02-07 18:27 ` Jeff Mahoney
  2017-02-07 18:59   ` Peter Zaitsev
  3 siblings, 1 reply; 42+ messages in thread
From: Jeff Mahoney @ 2017-02-07 18:27 UTC (permalink / raw)
  To: Peter Zaitsev, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 2026 bytes --]

On 2/7/17 8:53 AM, Peter Zaitsev wrote:
> Hi,
> 
> I have tried BTRFS from Ubuntu 16.04 LTS   for write intensive OLTP MySQL
> Workload.
> 
> It did not go very well ranging from multi-seconds stalls where no
> transactions are completed to the finally kernel OOPS with "no space left
> on device" error message and filesystem going read only.
> 
> I'm complete newbie in BTRFS so  I assume  I'm doing something wrong.
> 
> Do you have any advice on how BTRFS should be tuned for OLTP workload
> (large files having a lot of random writes)  ?    Or is this the case where
> one should simply stay away from BTRFS and use something else ?
> 
> One item recommended in some places is "nodatacow"  this however defeats
> the main purpose I'm looking at BTRFS -  I am interested in "free"
> snapshots which look very attractive to use for database recovery scenarios
> allow instant rollback to the previous state.
> 

Hi Peter -

There seems to be some misunderstanding around how nodatacow works.
Nodatacow doesn't prohibit snapshot use.  Snapshots are still allowed
and, of course, will cause CoW to happen when a write occurs, but only
on the first write.  Subsequent writes will not CoW again.  This does
mean you don't get CRC protection for data, though.  Since most
databases do this internally, that is probably no great loss.  You will
get fragmentation, but that's true of any random-write workload on btrfs.

Timofey's comment about how extents are accounted is more-or-less
correct.  The file extents in the file system trees reference data
extents in the extent tree.  When portions of the data extent are
unreferenced, they're not necessarily released.  A balance operation
will usually split the data extents so that the unused space is released.

As for the Oopses with ENOSPC, that's something we'd want to look into
if it can be reproduced with a more recent kernel.  We shouldn't be
getting ENOSPC anywhere sensitive anymore.
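
As a side note, nodatacow can also be applied per directory rather than
filesystem-wide via the NOCOW file attribute; a sketch, with a
hypothetical path (the attribute only takes effect for files created
after it is set):

```shell
# Mark the (empty) data directory NOCOW; files created inside it
# inherit the attribute. Existing files are not converted in place.
mkdir -p /srv/mysql-data
chattr +C /srv/mysql-data
lsattr -d /srv/mysql-data    # the 'C' flag should now be listed
```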

-Jeff

-- 
Jeff Mahoney
SUSE Labs


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 18:27 ` Jeff Mahoney
@ 2017-02-07 18:59   ` Peter Zaitsev
  2017-02-07 19:54     ` Austin S. Hemmelgarn
  2017-02-07 22:08     ` Hans van Kranenburg
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zaitsev @ 2017-02-07 18:59 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: linux-btrfs

Jeff,

Thank you very much for the explanations. Indeed, it was not clear in the
documentation - I read it simply as "if you have snapshots enabled,
nodatacow makes no difference".

I will rebuild the database in this mode from scratch and see how
performance changes.

So far the most frustrating thing for me has been the periodic stalls of
many seconds (running a sysbench workload).  What is most puzzling is
that I get these even if I run the workload at 50% or less of the full
load - i.e. the database can handle 1000 transactions/sec, I inject only
500/sec, and I still have those stalls.

It looks to me like some work is being delayed, and then a stall of a
few seconds is required to catch up.  I wonder if there are some
configuration options available to play with.

So far I have found BTRFS rather "zero configuration", which is great
when it works, but it is also good to have more levers to pull if you're
having trouble.


On Tue, Feb 7, 2017 at 1:27 PM, Jeff Mahoney <jeffm@suse.com> wrote:
> On 2/7/17 8:53 AM, Peter Zaitsev wrote:
>> Hi,
>>
>> I have tried BTRFS from Ubuntu 16.04 LTS   for write intensive OLTP MySQL
>> Workload.
>>
>> It did not go very well ranging from multi-seconds stalls where no
>> transactions are completed to the finally kernel OOPS with "no space left
>> on device" error message and filesystem going read only.
>>
>> I'm complete newbie in BTRFS so  I assume  I'm doing something wrong.
>>
>> Do you have any advice on how BTRFS should be tuned for OLTP workload
>> (large files having a lot of random writes)  ?    Or is this the case where
>> one should simply stay away from BTRFS and use something else ?
>>
>> One item recommended in some places is "nodatacow"  this however defeats
>> the main purpose I'm looking at BTRFS -  I am interested in "free"
>> snapshots which look very attractive to use for database recovery scenarios
>> allow instant rollback to the previous state.
>>
>
> Hi Peter -
>
> There seems to be some misunderstanding around how nodatacow works.
> Nodatacow doesn't prohibit snapshot use.  Snapshots are still allowed
> and, of course, will cause CoW to happen when a write occurs, but only
> on the first write.  Subsequent writes will not CoW again.  This does
> mean you don't get CRC protection for data, though.  Since most
> databases do this internally, that is probably no great loss.  You will
> get fragmentation, but that's true of any random-write workload on btrfs.
>
> Timothy's comment about how extents are accounted is more-or-less
> correct.  The file extents in the file system trees reference data
> extents in the extent tree.  When portions of the data extent are
> unreferenced, they're not necessarily released.  A balance operation
> will usually split the data extents so that the unused space is released.
>
> As for the Oopses with ENOSPC, that's something we'd want to look into
> if it can be reproduced with a more recent kernel.  We shouldn't be
> getting ENOSPC anywhere sensitive anymore.
>
> -Jeff
>
> --
> Jeff Mahoney
> SUSE Labs
>



-- 
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev


* Re: BTRFS for OLTP Databases
  2017-02-07 14:00 ` Hugo Mills
  2017-02-07 14:13   ` Peter Zaitsev
@ 2017-02-07 19:31   ` Peter Zaitsev
  2017-02-07 19:50     ` Austin S. Hemmelgarn
  2017-02-08  2:11   ` Peter Zaitsev
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Zaitsev @ 2017-02-07 19:31 UTC (permalink / raw)
  To: Hugo Mills, Peter Zaitsev, linux-btrfs

Hi Hugo,

As I re-read it closely (and also the other comments in the thread), I now
understand that there is a difference in how nodatacow works even when
snapshots are in place.

On autodefrag, I wonder if there is more detailed documentation about how
autodefrag works.

The manual  https://btrfs.wiki.kernel.org/index.php/Mount_options  has only
a very general statement.

What does "detect random IO" really mean?  It also talks about
defragmenting the file - is the whole file really queued for defrag, or is
the defrag local?  I.e. I could understand it if, as writes happen, a 1MB
block is checked and defragmented if it has more than X fragments, or
something like that.

Also, does autodefrag work with nodatacow (i.e. with snapshots), or are
these exclusive?


>
>    There's another approach which might be worth testing, which is to
> use autodefrag. This will increase data write I/O, because where you
> have one or more small writes in a region, it will also read and write
> the data in a small neghbourhood around those writes, so the
> fragmentation is reduced. This will improve subsequent read
> performance.
>
>    I could also suggest getting the latest kernel you can -- 16.04 is
> already getting on for a year old, and there may be performance
> improvements in upstream kernels which affect your workload. There's
> an Ubuntu kernel PPA you can use to get the new kernels without too
> much pain.
>
>
>


* Re: BTRFS for OLTP Databases
  2017-02-07 15:06 ` Austin S. Hemmelgarn
@ 2017-02-07 19:39   ` Kai Krakow
  2017-02-07 19:59     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 42+ messages in thread
From: Kai Krakow @ 2017-02-07 19:39 UTC (permalink / raw)
  To: linux-btrfs

Am Tue, 7 Feb 2017 10:06:34 -0500
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> 4. Try using in-line compression.  This can actually significantly 
> improve performance, especially if you have slow storage devices and
> a really nice CPU.

Just a side note: With nodatacow there'll be no compression, I think.
At least for files with "chattr +C" there'll be no compression. I thus
think "nodatacow" has the same effect.
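To illustrate the point above (paths are hypothetical; note that the +C attribute only takes effect on files that are still empty, which is why it is usually set on the directory so new files inherit it):

```shell
# Set No_COW on the (empty) data directory before the database creates
# its files; newly created files inherit the attribute, and compression
# is then skipped for them, as noted above.
mkdir -p /srv/mysql-data
chattr +C /srv/mysql-data
lsattr -d /srv/mysql-data   # the 'C' flag should be listed
```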

-- 
Regards,
Kai

Replies to list-only preferred.



* Re: BTRFS for OLTP Databases
  2017-02-07 19:31   ` Peter Zaitsev
@ 2017-02-07 19:50     ` Austin S. Hemmelgarn
  2017-02-07 20:19       ` Kai Krakow
  0 siblings, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-07 19:50 UTC (permalink / raw)
  To: Peter Zaitsev, Hugo Mills, linux-btrfs

On 2017-02-07 14:31, Peter Zaitsev wrote:
> Hi Hugo,
>
> As I re-read it closely (and also other comments in the thread) I know
> understand there is a difference how nodatacow works even if snapshot are
> in place.
>
> On autodefrag I wonder is there some more detailed documentation about how
> autodefrag works.
>
> The manual  https://btrfs.wiki.kernel.org/index.php/Mount_options    has
> very general statement.
>
> What does "detect random IO" really means  ? It also talks about
>  defragmenting the file - is i really about the whole file which is
> triggered for defrag or is defrag locally ?      Ie I would understand what
> as writes happen the  1MB block is checked and if it is more than X
> fragments it is defragmented or something like that.
I don't know the exact algorithm, but I'm pretty sure it's similar to 
what bcache uses to bypass the cache device for sequential I/O.  In 
essence, it's going to trigger for database usage.
>
> Also does autodefrag works with nodatacow (ie with snapshot)  or are these
> exclusive ?
I'm not sure about this one.  I would assume, based on the fact that many
other things don't work with nodatacow and that regular defrag doesn't
work on files which are currently mapped as executable code, that it does
not - but I could be completely wrong about this too.
>
>
>>
>>    There's another approach which might be worth testing, which is to
>> use autodefrag. This will increase data write I/O, because where you
>> have one or more small writes in a region, it will also read and write
>> the data in a small neghbourhood around those writes, so the
>> fragmentation is reduced. This will improve subsequent read
>> performance.
>>
>>    I could also suggest getting the latest kernel you can -- 16.04 is
>> already getting on for a year old, and there may be performance
>> improvements in upstream kernels which affect your workload. There's
>> an Ubuntu kernel PPA you can use to get the new kernels without too
>> much pain.
>>
>>
>>
>



* Re: BTRFS for OLTP Databases
  2017-02-07 18:59   ` Peter Zaitsev
@ 2017-02-07 19:54     ` Austin S. Hemmelgarn
  2017-02-07 20:40       ` Peter Zaitsev
  2017-02-07 22:08     ` Hans van Kranenburg
  1 sibling, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-07 19:54 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: Jeff Mahoney, linux-btrfs

On 2017-02-07 13:59, Peter Zaitsev wrote:
> Jeff,
>
> Thank you very much for explanations. Indeed it was not clear in the
> documentation - I read it simply as "if you have snapshots enabled
> nodatacow makes no difference"
>
> I will rebuild the database in this mode from scratch and see how
> performance changes.
>
> So far the most frustating for me was periodic stalls for many seconds
>  (running sysbench workload).  What was the most puzzling  I get this
> even if I run workload at the  50% or less of the full load  -  Ie
> database can handle 1000 transactions/sec and I only inject 500/sec
> and I still have those stalls.
>
> This is where it looks to me like some work is being delayed and when
> it requires stall for a few seconds to catch up.    I wonder  if there
> are some configuration options available to play with.
>
> So far I found BTRFS rather  "zero configuration" which is great if it
> works but it is also great to have more levers to pull if you're
> having some troubles.
It's worth keeping in mind that there is more to the storage stack than 
just the filesystem, and BTRFS tends to be more sensitive to the 
behavior of other components in the stack than most other filesystems 
are.  The stalls you're describing sound more like a symptom of the 
brain-dead writeback buffering defaults used by the VFS layer than an 
issue with BTRFS (although BTRFS tends to be a bit more heavily 
impacted by this than most other filesystems).  Try fiddling with the 
/proc/sys/vm/dirty_* sysctls (there is some pretty good documentation in 
Documentation/sysctl/vm.txt in the kernel source) and see if that helps. 
The defaults allow up to 20% of RAM to become dirty, which is an insane 
amount of data to buffer before starting writeback when you're talking 
about systems with 16GB of RAM.
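A minimal sketch of the tuning suggested above (the byte values are illustrative starting points, not tuned recommendations):

```shell
# Cap dirty data in absolute bytes rather than as a percentage of RAM.
# Setting the *_bytes sysctls automatically disables the *_ratio ones.
sysctl -w vm.dirty_background_bytes=268435456   # begin writeback at 256 MiB
sysctl -w vm.dirty_bytes=1073741824             # throttle writers beyond 1 GiB

# To persist across reboots (conventional drop-in location):
cat > /etc/sysctl.d/99-writeback.conf <<'EOF'
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 1073741824
EOF
```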
>
>
> On Tue, Feb 7, 2017 at 1:27 PM, Jeff Mahoney <jeffm@suse.com> wrote:
>> On 2/7/17 8:53 AM, Peter Zaitsev wrote:
>>> Hi,
>>>
>>> I have tried BTRFS from Ubuntu 16.04 LTS   for write intensive OLTP MySQL
>>> Workload.
>>>
>>> It did not go very well ranging from multi-seconds stalls where no
>>> transactions are completed to the finally kernel OOPS with "no space left
>>> on device" error message and filesystem going read only.
>>>
>>> I'm complete newbie in BTRFS so  I assume  I'm doing something wrong.
>>>
>>> Do you have any advice on how BTRFS should be tuned for OLTP workload
>>> (large files having a lot of random writes)  ?    Or is this the case where
>>> one should simply stay away from BTRFS and use something else ?
>>>
>>> One item recommended in some places is "nodatacow"  this however defeats
>>> the main purpose I'm looking at BTRFS -  I am interested in "free"
>>> snapshots which look very attractive to use for database recovery scenarios
>>> allow instant rollback to the previous state.
>>>
>>
>> Hi Peter -
>>
>> There seems to be some misunderstanding around how nodatacow works.
>> Nodatacow doesn't prohibit snapshot use.  Snapshots are still allowed
>> and, of course, will cause CoW to happen when a write occurs, but only
>> on the first write.  Subsequent writes will not CoW again.  This does
>> mean you don't get CRC protection for data, though.  Since most
>> databases do this internally, that is probably no great loss.  You will
>> get fragmentation, but that's true of any random-write workload on btrfs.
>>
>> Timothy's comment about how extents are accounted is more-or-less
>> correct.  The file extents in the file system trees reference data
>> extents in the extent tree.  When portions of the data extent are
>> unreferenced, they're not necessarily released.  A balance operation
>> will usually split the data extents so that the unused space is released.
>>
>> As for the Oopses with ENOSPC, that's something we'd want to look into
>> if it can be reproduced with a more recent kernel.  We shouldn't be
>> getting ENOSPC anywhere sensitive anymore.
>>
>> -Jeff
>>
>> --
>> Jeff Mahoney
>> SUSE Labs
>>
>
>
>



* Re: BTRFS for OLTP Databases
  2017-02-07 14:13   ` Peter Zaitsev
  2017-02-07 15:00     ` Timofey Titovets
  2017-02-07 16:22     ` Lionel Bouton
@ 2017-02-07 19:57     ` Roman Mamedov
  2017-02-07 20:36     ` Kai Krakow
  3 siblings, 0 replies; 42+ messages in thread
From: Roman Mamedov @ 2017-02-07 19:57 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: Hugo Mills, linux-btrfs

On Tue, 7 Feb 2017 09:13:25 -0500
Peter Zaitsev <pz@percona.com> wrote:

> Hi Hugo,
> 
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.

It still does provide some advantage, in that each write into a new area
since the last hourly snapshot is going to be CoW'ed only once, as opposed
to every write getting CoW'ed every time no matter what.

I'm not sold on autodefrag; what I'd suggest instead is to schedule a
regular defrag ("btrfs fi defrag") of the database files, e.g. daily.  This
may increase space usage temporarily, as it will partially unshare extents
previously shared across snapshots, but you won't get runaway
fragmentation anymore, as you would without nodatacow or with periodic
snapshotting.
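A sketch of such a scheduled defrag (the paths and target extent size are assumptions; `-t` sets the target extent size, `-f` flushes data before the command returns):

```shell
# Run daily from cron: defragment the InnoDB data file. On a snapshotted
# filesystem this unshares extents, as noted above, so expect a temporary
# increase in space usage.
btrfs filesystem defragment -f -t 32M /var/lib/mysql/ibdata1
# or recurse over the whole data directory:
btrfs filesystem defragment -r -f -t 32M /var/lib/mysql
```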

-- 
With respect,
Roman


* Re: BTRFS for OLTP Databases
  2017-02-07 19:39   ` Kai Krakow
@ 2017-02-07 19:59     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-07 19:59 UTC (permalink / raw)
  To: linux-btrfs

On 2017-02-07 14:39, Kai Krakow wrote:
> Am Tue, 7 Feb 2017 10:06:34 -0500
> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>
>> 4. Try using in-line compression.  This can actually significantly
>> improve performance, especially if you have slow storage devices and
>> a really nice CPU.
>
> Just a side note: With nodatacow there'll be no compression, I think.
> At least for files with "chattr +C" there'll be no compression. I thus
> think "nodatacow" has the same effect.
You're absolutely right, thanks for mentioning this, I completely forgot 
to point it out myself.



* Re: BTRFS for OLTP Databases
  2017-02-07 19:50     ` Austin S. Hemmelgarn
@ 2017-02-07 20:19       ` Kai Krakow
  2017-02-07 20:27         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 42+ messages in thread
From: Kai Krakow @ 2017-02-07 20:19 UTC (permalink / raw)
  To: linux-btrfs

Am Tue, 7 Feb 2017 14:50:04 -0500
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> > Also does autodefrag works with nodatacow (ie with snapshot)  or
> > are these exclusive ?  
> I'm not sure about this one.  I would assume based on the fact that
> many other things don't work with nodatacow and that regular defrag
> doesn't work on files which are currently mapped as executable code
> that it does not, but I could be completely wrong about this too.

Technically, there's nothing that prevents autodefrag from working on
nodatacow files.  The question is: is it really necessary?  Conventional
file systems also have no autodefrag; it's not an issue there because they
are essentially nodatacow.  Simply defrag the database file once and
you're done.  Transactional MySQL uses huge data files, probably
preallocated.  It should simply work with nodatacow.

On the other hand: Using snapshots clearly introduces fragmentation over
time. If autodefrag kicks in (given, it is supported for nodatacow), it
will slowly unshare all data over time. This somehow defeats the
purpose of having snapshots in the first place for this scenario.

In conclusion, I'd recommend running some maintenance scripts from time
to time: one to re-share identical blocks, and one to defragment the
current workspace.

The bees daemon comes to mind here... I haven't tried it, but it
sounds like it could fill that gap:

https://github.com/Zygo/bees

Another option comes to mind: XFS now supports shared-extent copies.  You
could simply make a cold copy of the database with this feature, with the
same effect as a snapshot, without the other performance problems of
btrfs.  Though the fragmentation issue would remain, and I think there's
no dedupe application for XFS yet.

-- 
Regards,
Kai

Replies to list-only preferred.



* Re: BTRFS for OLTP Databases
  2017-02-07 20:19       ` Kai Krakow
@ 2017-02-07 20:27         ` Austin S. Hemmelgarn
  2017-02-07 20:54           ` Kai Krakow
  0 siblings, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-07 20:27 UTC (permalink / raw)
  To: linux-btrfs

On 2017-02-07 15:19, Kai Krakow wrote:
> Am Tue, 7 Feb 2017 14:50:04 -0500
> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>
>>> Also does autodefrag works with nodatacow (ie with snapshot)  or
>>> are these exclusive ?
>> I'm not sure about this one.  I would assume based on the fact that
>> many other things don't work with nodatacow and that regular defrag
>> doesn't work on files which are currently mapped as executable code
>> that it does not, but I could be completely wrong about this too.
>
> Technically, there's nothing that prevents autodefrag to work for
> nodatacow files. The question is: is it really necessary? Standard file
> systems also have no autodefrag, it's not an issue there because they
> are essentially nodatacow. Simply defrag the database file once and
> you're done. Transactional MySQL uses huge data files, probably
> preallocated. It should simply work with nodatacow.
The thing is, I don't have enough knowledge of how defrag is implemented 
in BTRFS to say for certain that it doesn't use COW semantics somewhere 
(and I would actually expect it to, since that in theory makes 
many things _much_ easier to handle), and if it uses COW somewhere, then 
by definition it doesn't work on NOCOW files.
>
> On the other hand: Using snapshots clearly introduces fragmentation over
> time. If autodefrag kicks in (given, it is supported for nodatacow), it
> will slowly unshare all data over time. This somehow defeats the
> purpose of having snapshots in the first place for this scenario.
>
> In conclusion, I'd recommend to run some maintenance scripts from time
> to time, one to re-share identical blocks, and one to defragment the
> current workspace.
>
> The bees daemon comes into mind here... I haven't tried it but it
> sounds like it could fill a gap here:
>
> https://github.com/Zygo/bees
>
> Another option comes into mind: XFS now supports shared-extents
> copies. You could simply do a cold copy of the database with this
> feature resulting in the same effect as a snapshot, without seeing the
> other performance problems of btrfs. Tho, the fragmentation issue would
> remain, and I think there's no dedupe application for XFS yet.
There isn't, but cp --reflink=auto with a reasonably recent version of 
coreutils should be able to reflink the file properly.
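For example (with `--reflink=auto` the copy shares extents where the filesystem supports it and silently falls back to a plain copy where it doesn't):

```shell
# Create a demo file and reflink-copy it; on XFS (mkfs.xfs -m reflink=1)
# or btrfs the two files share extents until either one is modified.
printf 'demo data\n' > /tmp/src.bin
cp --reflink=auto /tmp/src.bin /tmp/dst.bin
cmp /tmp/src.bin /tmp/dst.bin && echo "copies match"
```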



* Re: BTRFS for OLTP Databases
  2017-02-07 14:13   ` Peter Zaitsev
                       ` (2 preceding siblings ...)
  2017-02-07 19:57     ` Roman Mamedov
@ 2017-02-07 20:36     ` Kai Krakow
  2017-02-07 20:44       ` Lionel Bouton
  2017-02-07 20:47       ` Austin S. Hemmelgarn
  3 siblings, 2 replies; 42+ messages in thread
From: Kai Krakow @ 2017-02-07 20:36 UTC (permalink / raw)
  To: linux-btrfs

Am Tue, 7 Feb 2017 09:13:25 -0500
schrieb Peter Zaitsev <pz@percona.com>:

> Hi Hugo,
> 
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.

Out of curiosity, I see one problem here:

If you're doing snapshots of the live database, each snapshot leaves
the database files as if the database had been killed in-flight - like
shutting the system down in the middle of writing data.

This is because, I think, there's no API for user space to subscribe to
events like a snapshot - unlike, e.g., the VSS API (Volume Shadow Copy
Service) in Windows.  You should put the database into a frozen state to
prepare it for a hot copy before creating the snapshot, then ensure all
data is flushed before continuing.
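A hedged sketch of that preparation for MySQL (the subvolume and snapshot paths are assumptions; the read lock is only held while the client session stays open, so the snapshot is taken from within the same session via the mysql client's "system" command):

```shell
# FLUSH TABLES WITH READ LOCK quiesces writes and flushes tables; the
# snapshot runs while the lock is still held, then the lock is released.
mysql <<'EOF'
FLUSH TABLES WITH READ LOCK;
system btrfs subvolume snapshot -r /var/lib/mysql /snapshots/mysql-frozen
UNLOCK TABLES;
EOF
```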

I think I've read that btrfs snapshots do not guarantee a single
point-in-time image - the snapshot may be smeared across a longer period
of time while the kernel is still writing data.  So some of your writes
may still end up in the snapshot after the snapshot command is issued,
instead of in the working copy as expected.

How is this going to be addressed?  Is there some snapshot-aware API to
let user space subscribe to such events and do proper preparation?  Is
this planned?  LVM could be a user of such an API, too.  I think this
could have nice enterprise-grade value for Linux.

XFS has xfs_freeze and xfs_thaw for this, to prepare for LVM snapshots.
But still, this needs to be integrated with MySQL to work properly.  I
once (years ago) researched this but gave up on my plans when I was
planning database backups for our web server infrastructure.  We moved to
creating SQL dumps instead, although there are binlogs which can be used
to recover to a clean and stable transactional state after taking
snapshots.  But I simply didn't want to fiddle around with properly
cleaning up binlogs, which accumulate a horrible amount of space over
time.  The cleanup process requires creating a cold copy or dump of the
complete database from time to time; only then is it safe to remove all
binlogs up to that point in time.

-- 
Regards,
Kai

Replies to list-only preferred.



* Re: BTRFS for OLTP Databases
  2017-02-07 19:54     ` Austin S. Hemmelgarn
@ 2017-02-07 20:40       ` Peter Zaitsev
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zaitsev @ 2017-02-07 20:40 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Jeff Mahoney, linux-btrfs

Austin,

I recognize there are other components too.  In this case I'm actually
comparing BTRFS to XFS and ext4, so I'm 100% sure it is filesystem
related.  Also, I'm using O_DIRECT asynchronous IO with MySQL, which
means there is no significant amount of dirty data at the filesystem
level.

I'll see if it helps, though.

Also, I assumed this is something well known, as it is documented in the Gotchas here:

https://btrfs.wiki.kernel.org/index.php/Gotchas

(Fragmentation section)




>
> It's worth keeping in mind that there is more to the storage stack than just
> the filesystem, and BTRFS tends to be more sensitive to the behavior of
> other components in the stack than most other filesystems are.  The stalls
> you're describing sound more like a symptom of the brain-dead writeback
> buffering defaults used by the VFS layer than they do an issue with BTRFS
> (although BTRFS tends to be a  bit more heavily impacted by this than most
> other filesystems).  Try fiddling with the /proc/sys/vm/dirty_* sysctls
> (there is some pretty good documentation in Documentation/sysctl/vm.txt in
> the kernel source) and see if that helps.  The default values it uses are at
> most 20% of RAM, which is an insane amount of data to buffer before starting
> writeback when you're talking about systems with 16GB of RAM.
>


-- 
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev


* Re: BTRFS for OLTP Databases
  2017-02-07 20:36     ` Kai Krakow
@ 2017-02-07 20:44       ` Lionel Bouton
  2017-02-07 20:47       ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 42+ messages in thread
From: Lionel Bouton @ 2017-02-07 20:44 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs

Le 07/02/2017 à 21:36, Kai Krakow a écrit :
> [...]
> I think I've read that btrfs snapshots do not guarantee single point in
> time snapshots - the snapshot may be smeared across a longer period of
> time while the kernel is still writing data. So parts of your writes
> may still end up in the snapshot after issuing the snapshot command,
> instead of in the working copy as expected.


I don't think so, for three reasons:
- it's so far from an admin's expectations that someone would have
documented it in "man btrfs-subvolume",
- the CoW nature of Btrfs makes this trivial: it only has to keep the old
versions of the data and the corresponding tree instead of unlinking them,
- the backup server I referred to has restarted a PostgreSQL system from
snapshots about one thousand times now without a single problem, while
being almost continuously updated by streaming replication.

Lionel


* Re: BTRFS for OLTP Databases
  2017-02-07 20:36     ` Kai Krakow
  2017-02-07 20:44       ` Lionel Bouton
@ 2017-02-07 20:47       ` Austin S. Hemmelgarn
  2017-02-07 21:25         ` Lionel Bouton
       [not found]         ` <b0de25a7-989e-d16a-2ce6-2b6c1edde08b@gmail.com>
  1 sibling, 2 replies; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-07 20:47 UTC (permalink / raw)
  To: linux-btrfs

On 2017-02-07 15:36, Kai Krakow wrote:
> Am Tue, 7 Feb 2017 09:13:25 -0500
> schrieb Peter Zaitsev <pz@percona.com>:
>
>> Hi Hugo,
>>
>> For the use case I'm looking for I'm interested in having snapshot(s)
>> open at all time.  Imagine  for example snapshot being created every
>> hour and several of these snapshots  kept at all time providing quick
>> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
>> think you also describe)  nodatacow  does not provide any advantage.
>
> Out of curiosity, I see one problem here:
>
> If you're doing snapshots of the live database, each snapshot leaves
> the database files like killing the database in-flight. Like shutting
> the system down in the middle of writing data.
>
> This is because I think there's no API for user space to subscribe to
> events like a snapshot - unlike e.g. the VSS API (volume snapshot
> service) in Windows. You should put the database into frozen state to
> prepare it for a hotcopy before creating the snapshot, then ensure all
> data is flushed before continuing.
Correct.
>
> I think I've read that btrfs snapshots do not guarantee single point in
> time snapshots - the snapshot may be smeared across a longer period of
> time while the kernel is still writing data. So parts of your writes
> may still end up in the snapshot after issuing the snapshot command,
> instead of in the working copy as expected.
Also correct AFAICT, and this needs to be better documented (for most 
people, the term snapshot implies atomicity of the operation).
>
> How is this going to be addressed? Is there some snapshot aware API to
> let user space subscribe to such events and do proper preparation? Is
> this planned? LVM could be a user of such an API, too. I think this
> could have nice enterprise-grade value for Linux.
Ideally, such an API should be in the VFS layer, not just BTRFS. 
Reflinking exists in other filesystems already, it's only a matter of 
time before they decide to do snapshotting too.
>
> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
> still, also this needs to be integrated with MySQL to properly work. I
> once (years ago) researched on this but gave up on my plans when I
> planned database backups for our web server infrastructure. We moved to
> creating SQL dumps instead, although there're binlogs which can be used
> to recover to a clean and stable transactional state after taking
> snapshots. But I simply didn't want to fiddle around with properly
> cleaning up binlogs which accumulate horribly much space usage over
> time. The cleanup process requires to create a cold copy or dump of the
> complete database from time to time, only then it's safe to remove all
> binlogs up to that point in time.
Sadly, freezefs (the generic interface based off of xfs_freeze) only 
works for block device snapshots.  Filesystem-level snapshots need the 
application software to sync all its data and then stop writing until 
the snapshot is complete.

As of right now, the sanest way I can come up with for a database server 
is to find a way to do a point-in-time SQL dump of the database (this 
also has the advantage that it works as a backup, and decouples you from 
the backing storage format).
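A common way to get such a point-in-time dump for InnoDB without blocking writers (the output path is illustrative; `--single-transaction` opens a consistent snapshot inside the server rather than locking tables):

```shell
# Consistent logical backup; also decouples the data from the on-disk
# format, as noted above.
mysqldump --single-transaction --routines --all-databases \
  | gzip > /backups/all-$(date +%F).sql.gz
```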



* Re: BTRFS for OLTP Databases
  2017-02-07 20:27         ` Austin S. Hemmelgarn
@ 2017-02-07 20:54           ` Kai Krakow
  2017-02-08 12:12             ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 42+ messages in thread
From: Kai Krakow @ 2017-02-07 20:54 UTC (permalink / raw)
  To: linux-btrfs

Am Tue, 7 Feb 2017 15:27:34 -0500
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> >> I'm not sure about this one.  I would assume based on the fact that
> >> many other things don't work with nodatacow and that regular defrag
> >> doesn't work on files which are currently mapped as executable code
> >> that it does not, but I could be completely wrong about this too.  
> >
> > Technically, there's nothing that prevents autodefrag to work for
> > nodatacow files. The question is: is it really necessary? Standard
> > file systems also have no autodefrag, it's not an issue there
> > because they are essentially nodatacow. Simply defrag the database
> > file once and you're done. Transactional MySQL uses huge data
> > files, probably preallocated. It should simply work with
> > nodatacow.  
> The thing is, I don't have enough knowledge of how defrag is
> implemented in BTRFS to say for certain that ti doesn't use COW
> semantics somewhere (and I would actually expect it to do so, since
> that in theory makes many things _much_ easier to handle), and if it
> uses COW somewhere, then it by definition doesn't work on NOCOW files.

A dev would be needed to confirm this.  But from a non-dev point of view,
the defrag operation itself is CoW: blocks are rewritten to another
location in contiguous order.  Only metadata CoW should be needed for
this operation.

It should be nothing other than writing to a nodatacow snapshot... just
that the snapshot is more or less implicit and temporary.

Hmm? *curious*

-- 
Regards,
Kai

Replies to list-only preferred.



* Re: BTRFS for OLTP Databases
  2017-02-07 15:43           ` Austin S. Hemmelgarn
@ 2017-02-07 21:14             ` Kai Krakow
  0 siblings, 0 replies; 42+ messages in thread
From: Kai Krakow @ 2017-02-07 21:14 UTC (permalink / raw)
  To: linux-btrfs

Am Tue, 7 Feb 2017 10:43:11 -0500
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> > I mean that:
> > You have a 128MB extent, you rewrite random 4k sectors, btrfs will
> > not split 128MB extent, and not free up data, (i don't know
> > internal algo, so i can't predict when this will happen), and after
> > some time, btrfs will rebuild extents, and split 128 MB extent to
> > several more smaller. But when you use compression, allocator
> > rebuilding extents much early (i think, it's because btrfs also
> > operates with that like 128kb extent, even if it's a continuous
> > 128MB chunk of data). 
> The allocator has absolutely nothing to do with this, it's a function
> of the COW operation.  Unless you're using nodatacow, that 128MB
> extent will get split the moment the data hits the storage device
> (either on the next commit cycle (at most 30 seconds with the default
> commit cycle), or when fdatasync is called, whichever is sooner).  In
> the case of compression, it's still one extent (although on disk it
> will be less than 128MB) and will be split at _exactly_ the same time
> under _exactly_ the same circumstances as an uncompressed extent.
> IOW, it has absolutely nothing to do with the extent handling either.

I don't think that btrfs splits extents which are part of the snapshot.
The extent in a snapshot will stay intact when writing to this extent
in another snapshot. Of course, in the just written snapshot, the
extent will be represented as a split extent mapping to the original
extents data blocks plus the new data in the middle (thus resulting in
three extents). This is also why small random writes without autodefrag
result in a vast amount of small extents bringing the fs performance to
a crawl.

Do that multiple times on multiple snapshots, delete some of the
original snapshots, and you're left with slack space: data blocks that
are inaccessible but won't be reclaimed into free space (because they
are still part of the original extent), and which can only be
reclaimed by a defrag operation - which of course unshares data.

Thus, if any of the above-mentioned small extents is still shared with
an originally much bigger extent, that extent will still occupy its
original space on the filesystem - even when its associated
snapshot/subvolume no longer exists. Only when the last remaining
tiny block of such an extent gets rewritten and the reference counter
drops to zero is the extent given up and freed.

To work around this, you can currently only unshare and recombine by
doing defrag and dedupe on all snapshots. This will reclaim space
sitting in parts of the original extents no longer referenced by a
snapshot visible from the VFS layer.

This is for performance reasons because btrfs is extent based.
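The accounting described above can be sketched as a toy model (hypothetical code, nothing like the real btrfs extent tree - `Extent` and `used_space` are made-up names for illustration): space is tracked per whole extent, and an extent is only freed when no snapshot references any slice of it.

```python
MB = 1024 * 1024

class Extent:
    """A contiguous on-disk allocation; freed only as a whole."""
    _next_id = 0
    def __init__(self, size):
        self.id = Extent._next_id
        Extent._next_id += 1
        self.size = size

def used_space(snapshots):
    # A whole extent stays allocated while any slice of it is referenced.
    live = {ext.id: ext.size
            for slices in snapshots.values()
            for (ext, _off, _length) in slices}
    return sum(live.values())

# One 128 MiB extent, fully referenced by a read-only snapshot ...
big = Extent(128 * MB)
snapshots = {"snap1": [(big, 0, 128 * MB)]}
# ... and shared with the writable subvolume.
snapshots["work"] = list(snapshots["snap1"])

# Rewriting 4 KiB in the middle COWs a tiny new extent and splits the
# *mapping* into three slices - the original extent is not split or freed.
new = Extent(4096)
snapshots["work"] = [(big, 0, 64 * MB),
                     (new, 0, 4096),
                     (big, 64 * MB + 4096, 64 * MB - 4096)]
assert used_space(snapshots) == 128 * MB + 4096

# Deleting the snapshot does NOT reclaim the overwritten 4 KiB: the
# working copy still references slices of the big extent, so all of it stays.
del snapshots["snap1"]
assert used_space(snapshots) == 128 * MB + 4096

# Only a defrag (rewriting the remaining slices into fresh extents) lets
# the last reference to the big extent drop, freeing it as a whole.
snapshots["work"] = [(Extent(128 * MB), 0, 128 * MB)]
assert used_space(snapshots) == 128 * MB
```

So in this model, as in the description above, the 4 KiB "hole" in the old extent only becomes free space once nothing references any part of that extent anymore.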

As far as I know, ZFS, on the other hand, works differently. It uses
block-based storage for the snapshot feature and can easily throw away
unused blocks. Only a second layer on top maps this back into extents.
The underlying infrastructure, however, is block-based storage, which
also enables the volume pool to create block devices on the fly out of
ZFS storage space.

PS: All above given the fact I understood it right. ;-)

-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 20:47       ` Austin S. Hemmelgarn
@ 2017-02-07 21:25         ` Lionel Bouton
  2017-02-07 21:35           ` Kai Krakow
       [not found]         ` <b0de25a7-989e-d16a-2ce6-2b6c1edde08b@gmail.com>
  1 sibling, 1 reply; 42+ messages in thread
From: Lionel Bouton @ 2017-02-07 21:25 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, linux-btrfs

Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit :
> On 2017-02-07 15:36, Kai Krakow wrote:
>> Am Tue, 7 Feb 2017 09:13:25 -0500
>> schrieb Peter Zaitsev <pz@percona.com>:
>>
>>> Hi Hugo,
>>>
>>> For the use case I'm looking for I'm interested in having snapshot(s)
>>> open at all time.  Imagine  for example snapshot being created every
>>> hour and several of these snapshots  kept at all time providing quick
>>> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
>>> think you also describe)  nodatacow  does not provide any advantage.
>>
>> Out of curiosity, I see one problem here:
>>
>> If you're doing snapshots of the live database, each snapshot leaves
>> the database files like killing the database in-flight. Like shutting
>> the system down in the middle of writing data.
>>
>> This is because I think there's no API for user space to subscribe to
>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>> service) in Windows. You should put the database into frozen state to
>> prepare it for a hotcopy before creating the snapshot, then ensure all
>> data is flushed before continuing.
> Correct.
>>
>> I think I've read that btrfs snapshots do not guarantee single point in
>> time snapshots - the snapshot may be smeared across a longer period of
>> time while the kernel is still writing data. So parts of your writes
>> may still end up in the snapshot after issuing the snapshot command,
>> instead of in the working copy as expected.
> Also correct AFAICT, and this needs to be better documented (for most
> people, the term snapshot implies atomicity of the operation).

Atomicity can be a relative term. If the snapshot atomicity is relative
to barriers but not to individual writes between barriers, then AFAICT
it's fine, because the filesystem doesn't make any promise it won't
keep, even in the context of its snapshots.
Consider a power loss: the filesystem's atomicity guarantees can't go
beyond what the hardware guarantees, which means not all in-flight
writes will reach the disk, and partial writes can happen. Modern
filesystems will remain consistent though, and if an application using
them makes use of f*sync it can provide its own guarantees too. The
same should apply to snapshots: any in-flight write may or may not
complete on disk before the snapshot; what matters is that both the
snapshot and these writes will be completed after the next barrier (and
any robust application will ignore any in-flight writes it finds in the
snapshot if they were part of a batch that should have been atomically
committed).

This is why AFAIK PostgreSQL or MySQL with their default ACID-compliant
configuration will recover from a BTRFS snapshot the same way they
recover from a power loss.

Lionel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 21:25         ` Lionel Bouton
@ 2017-02-07 21:35           ` Kai Krakow
  2017-02-07 22:27             ` Hans van Kranenburg
  2017-02-08 19:08             ` Goffredo Baroncelli
  0 siblings, 2 replies; 42+ messages in thread
From: Kai Krakow @ 2017-02-07 21:35 UTC (permalink / raw)
  To: linux-btrfs

Am Tue, 7 Feb 2017 22:25:29 +0100
schrieb Lionel Bouton <lionel-subscription@bouton.name>:

> Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit :
> > On 2017-02-07 15:36, Kai Krakow wrote:  
> >> Am Tue, 7 Feb 2017 09:13:25 -0500
> >> schrieb Peter Zaitsev <pz@percona.com>:
> >>  
>  [...]  
> >>
> >> Out of curiosity, I see one problem here:
> >>
> >> If you're doing snapshots of the live database, each snapshot
> >> leaves the database files like killing the database in-flight.
> >> Like shutting the system down in the middle of writing data.
> >>
> >> This is because I think there's no API for user space to subscribe
> >> to events like a snapshot - unlike e.g. the VSS API (volume
> >> snapshot service) in Windows. You should put the database into
> >> frozen state to prepare it for a hotcopy before creating the
> >> snapshot, then ensure all data is flushed before continuing.  
> > Correct.  
> >>
> >> I think I've read that btrfs snapshots do not guarantee single
> >> point in time snapshots - the snapshot may be smeared across a
> >> longer period of time while the kernel is still writing data. So
> >> parts of your writes may still end up in the snapshot after
> >> issuing the snapshot command, instead of in the working copy as
> >> expected.  
> > Also correct AFAICT, and this needs to be better documented (for
> > most people, the term snapshot implies atomicity of the
> > operation).  
> 
> Atomicity can be a relative term. If the snapshot atomicity is
> relative to barriers but not relative to individual writes between
> barriers then AFAICT it's fine because the filesystem doesn't make
> any promise it won't keep even in the context of its snapshots.
> Consider a power loss : the filesystems atomicity guarantees can't go
> beyond what the hardware guarantees which means not all current in fly
> write will reach the disk and partial writes can happen. Modern
> filesystems will remain consistent though and if an application using
> them makes uses of f*sync it can provide its own guarantees too. The
> same should apply to snapshots : all the writes in fly can complete or
> not on disk before the snapshot what matters is that both the snapshot
> and these writes will be completed after the next barrier (and any
> robust application will ignore all the in fly writes it finds in the
> snapshot if they were part of a batch that should be atomically
> commited).
> 
> This is why AFAIK PostgreSQL or MySQL with their default ACID
> compliant configuration will recover from a BTRFS snapshot in the
> same way they recover from a power loss.

This is what I meant in my other reply. But this is also why it should
be documented. Assuming that snapshots are single-point-in-time
snapshots is wrong, with possibly horrible side effects one wouldn't
expect.

Taking a snapshot is like a power loss - even though there is no actual
power loss. So the database has to be properly configured. It is simply
short-sighted not to think about this fact. The documentation should
really point that out.


-- 
Regards,
Kai

Replies to list-only preferred.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 18:59   ` Peter Zaitsev
  2017-02-07 19:54     ` Austin S. Hemmelgarn
@ 2017-02-07 22:08     ` Hans van Kranenburg
  1 sibling, 0 replies; 42+ messages in thread
From: Hans van Kranenburg @ 2017-02-07 22:08 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: linux-btrfs

On 02/07/2017 07:59 PM, Peter Zaitsev wrote:
> 
> So far the most frustrating for me was periodic stalls for many seconds
>  (running sysbench workload).  What was the most puzzling, I get this
> even if I run the workload at 50% or less of the full load  -  i.e. the
> database can handle 1000 transactions/sec and I only inject 500/sec
> and I still have those stalls.
> 
> This is where it looks to me like some work is being delayed and then
> it requires a stall for a few seconds to catch up.    I wonder if there
> are some configuration options available to play with.

What happens during these stalls? Do you mean a 'stall' like it seems
nothing is happening at all, or a 'stall' during which something is so
busy that something else cannot continue?

Is there some kernel thread doing a lot of cpu? What does the
/proc/<pid>/stack show?

Is it huge write spikes with not many writes in between, or do you
generate enough action to be writing to disk all the time?

If the stalls show the behaviour of huge disk-write spikes, during which
applications seem to be blocked from continuing to write more, and if
during that time you see btrfs-transaction active in the kernel, aaaand,
if your test is doing a lot of writes all over the place (not only
simply appending table files sequentially, but changing a lot and
touching a lot of metadata) and you're pushing it, it might be space
cache related.

I think the /proc/<pid>/stack of the btrfs-transaction will show you
something related to free space cache in this case.

In this case, it might be interesting to test the free space tree
(instead of the default free space cache):

http://events.linuxfoundation.org/sites/events/files/slides/vault2016_0.pdf

Using the free space tree helped me a lot on write-heavy filesystems
(like a backup server with concurrent rsync data streaming in, also
doing snapshotting), keeping incoming traffic from dropping to the
ground every time there was a transaction commit.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 21:35           ` Kai Krakow
@ 2017-02-07 22:27             ` Hans van Kranenburg
  2017-02-08 19:08             ` Goffredo Baroncelli
  1 sibling, 0 replies; 42+ messages in thread
From: Hans van Kranenburg @ 2017-02-07 22:27 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs

On 02/07/2017 10:35 PM, Kai Krakow wrote:
> Am Tue, 7 Feb 2017 22:25:29 +0100
> schrieb Lionel Bouton <lionel-subscription@bouton.name>:
> 
>> Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit :
>>> On 2017-02-07 15:36, Kai Krakow wrote:  
>>>> Am Tue, 7 Feb 2017 09:13:25 -0500
>>>> schrieb Peter Zaitsev <pz@percona.com>:
>>>>  
>>  [...]  
>>>>
>>>> Out of curiosity, I see one problem here:
>>>>
>>>> If you're doing snapshots of the live database, each snapshot
>>>> leaves the database files like killing the database in-flight.
>>>> Like shutting the system down in the middle of writing data.
>>>>
>>>> This is because I think there's no API for user space to subscribe
>>>> to events like a snapshot - unlike e.g. the VSS API (volume
>>>> snapshot service) in Windows. You should put the database into
>>>> frozen state to prepare it for a hotcopy before creating the
>>>> snapshot, then ensure all data is flushed before continuing.  
>>> Correct.  
>>>>
>>>> I think I've read that btrfs snapshots do not guarantee single
>>>> point in time snapshots - the snapshot may be smeared across a
>>>> longer period of time while the kernel is still writing data. So
>>>> parts of your writes may still end up in the snapshot after
>>>> issuing the snapshot command, instead of in the working copy as
>>>> expected.  
>>> Also correct AFAICT, and this needs to be better documented (for
>>> most people, the term snapshot implies atomicity of the
>>> operation).  
>>
>> Atomicity can be a relative term. If the snapshot atomicity is
>> relative to barriers but not relative to individual writes between
>> barriers then AFAICT it's fine because the filesystem doesn't make
>> any promise it won't keep even in the context of its snapshots.
>> Consider a power loss : the filesystems atomicity guarantees can't go
>> beyond what the hardware guarantees which means not all current in fly
>> write will reach the disk and partial writes can happen. Modern
>> filesystems will remain consistent though and if an application using
>> them makes uses of f*sync it can provide its own guarantees too. The
>> same should apply to snapshots : all the writes in fly can complete or
>> not on disk before the snapshot what matters is that both the snapshot
>> and these writes will be completed after the next barrier (and any
>> robust application will ignore all the in fly writes it finds in the
>> snapshot if they were part of a batch that should be atomically
>> commited).
>>
>> This is why AFAIK PostgreSQL or MySQL with their default ACID
>> compliant configuration will recover from a BTRFS snapshot in the
>> same way they recover from a power loss.
> 
> This is what I meant in my other reply. But this is also why it should
> be documented. Wrongly implying that snapshots are single point in time
> snapshots is a wrong assumption with possibly horrible side effects one
> wouldn't expect.

It depends on what the definition of time is. (whoa!!) A snapshot is
taken of a single point in the lifetime of a filesystem tree (a
generation, the point where a transaction commits)...?

> Taking a snapshot is like a power loss - even tho there is no power
> loss. So the database has to be properly configured. It is simply short
> sighted if you don't think about this fact. The documentation should
> really point that fact out.

I'd almost say that it would be short sighted to assume a btrfs snapshot
would *not* behave like a power loss. At least, to me (thinking as a
sysadmin) it feels really weird to think of it in any other way than that.

Oh wait, that's what you mean, or not? What is the thing that the
documentation should point out? I'm not trying to be trolling, the piled
up double negations make this discussion a bit hard to read.

Moo

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 14:00 ` Hugo Mills
  2017-02-07 14:13   ` Peter Zaitsev
  2017-02-07 19:31   ` Peter Zaitsev
@ 2017-02-08  2:11   ` Peter Zaitsev
  2017-02-08 12:14     ` Martin Raiber
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Zaitsev @ 2017-02-08  2:11 UTC (permalink / raw)
  To: linux-btrfs

Hi Kai,

I guess your message did not make it to me as I'm not subscribed to the list.

I totally understand that the snapshot is "crash consistent" -
consistent with the state of the disk you would find if you cut the
power with no notice. For many applications this is a problem; however,
it is fine for many databases, which already need to be able to recover
correctly from power loss.

For MySQL this works well with the InnoDB storage engine; it does not
work for MyISAM.

The great thing about such an "uncoordinated" snapshot is that it is
instant and has very little production impact - if you want to "freeze"
multiple filesystems, or even worse flush MyISAM tables, it can take a
lot of time and can be unacceptable for many 24/7 workloads.

Or are you saying BTRFS snapshots do not provide this kind of consistency ?

> Hi Hugo,
>
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.

Out of curiosity, I see one problem here:

If you're doing snapshots of the live database, each snapshot leaves
the database files like killing the database in-flight. Like shutting
the system down in the middle of writing data.

This is because I think there's no API for user space to subscribe to
events like a snapshot - unlike e.g. the VSS API (volume snapshot
service) in Windows. You should put the database into frozen state to
prepare it for a hotcopy before creating the snapshot, then ensure all
data is flushed before continuing.

I think I've read that btrfs snapshots do not guarantee single point in
time snapshots - the snapshot may be smeared across a longer period of
time while the kernel is still writing data. So parts of your writes
may still end up in the snapshot after issuing the snapshot command,
instead of in the working copy as expected.

How is this going to be addressed? Is there some snapshot aware API to
let user space subscribe to such events and do proper preparation? Is
this planned? LVM could be a user of such an API, too. I think this
could have nice enterprise-grade value for Linux.

XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
still, also this needs to be integrated with MySQL to properly work. I
once (years ago) researched on this but gave up on my plans when I
planned database backups for our web server infrastructure. We moved to
creating SQL dumps instead, although there're binlogs which can be used
to recover to a clean and stable transactional state after taking
snapshots. But I simply didn't want to fiddle around with properly
cleaning up binlogs which accumulate horribly much space usage over
time. The cleanup process requires to create a cold copy or dump of the
complete database from time to time, only then it's safe to remove all
binlogs up to that point in time.

-- 
Regards,
Kai

On Tue, Feb 7, 2017 at 9:00 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Tue, Feb 07, 2017 at 08:53:35AM -0500, Peter Zaitsev wrote:
>> Hi,
>>
>> I have tried BTRFS from Ubuntu 16.04 LTS   for write intensive OLTP MySQL
>> Workload.
>>
>> It did not go very well ranging from multi-seconds stalls where no
>> transactions are completed to the finally kernel OOPS with "no space left
>> on device" error message and filesystem going read only.
>>
>> I'm complete newbie in BTRFS so  I assume  I'm doing something wrong.
>>
>> Do you have any advice on how BTRFS should be tuned for OLTP workload
>> (large files having a lot of random writes)  ?    Or is this the case where
>> one should simply stay away from BTRFS and use something else ?
>>
>> One item recommended in some places is "nodatacow"  this however defeats
>> the main purpose I'm looking at BTRFS -  I am interested in "free"
>> snapshots which look very attractive to use for database recovery scenarios
>> allow instant rollback to the previous state.
>
>    Well, nodatacow will still allow snapshots to work, but it also
> allows the data to fragment. Each snapshot made will cause subsequent
> writes to shared areas to be CoWed once (and then it reverts to
> unshared and nodatacow again).
>
>    There's another approach which might be worth testing, which is to
> use autodefrag. This will increase data write I/O, because where you
> have one or more small writes in a region, it will also read and write
> the data in a small neighbourhood around those writes, so the
> fragmentation is reduced. This will improve subsequent read
> performance.
>
>    I could also suggest getting the latest kernel you can -- 16.04 is
> already getting on for a year old, and there may be performance
> improvements in upstream kernels which affect your workload. There's
> an Ubuntu kernel PPA you can use to get the new kernels without too
> much pain.
>
>    Hugo.
>
> --
> Hugo Mills             | I don't care about "it works on my machine". We are
> hugo@... carfax.org.uk | not shipping your machine.
> http://carfax.org.uk/  |
> PGP: E2AB1DE4          |



-- 
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-07 20:54           ` Kai Krakow
@ 2017-02-08 12:12             ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-08 12:12 UTC (permalink / raw)
  To: linux-btrfs

On 2017-02-07 15:54, Kai Krakow wrote:
> Am Tue, 7 Feb 2017 15:27:34 -0500
> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>
>>>> I'm not sure about this one.  I would assume based on the fact that
>>>> many other things don't work with nodatacow and that regular defrag
>>>> doesn't work on files which are currently mapped as executable code
>>>> that it does not, but I could be completely wrong about this too.
>>>
>>> Technically, there's nothing that prevents autodefrag from working
>>> for nodatacow files. The question is: is it really necessary? Standard
>>> file systems also have no autodefrag, it's not an issue there
>>> because they are essentially nodatacow. Simply defrag the database
>>> file once and you're done. Transactional MySQL uses huge data
>>> files, probably preallocated. It should simply work with
>>> nodatacow.
>> The thing is, I don't have enough knowledge of how defrag is
>> implemented in BTRFS to say for certain that it doesn't use COW
>> semantics somewhere (and I would actually expect it to do so, since
>> that in theory makes many things _much_ easier to handle), and if it
>> uses COW somewhere, then it by definition doesn't work on NOCOW files.
>
> A dev would be needed on this. But from a non-dev point of view, the
> defrag operation itself is CoW: Blocks are rewritten to another
> location in contiguous order. Only metadata CoW should be needed for
> this operation.
>
> It should be nothing else than writing to a nodatacow snapshot... Just
> that the snapshot is more or less implicit and temporary.
>
> Hmm? *curious*
>
The gimmicky part though is that the file has to remain accessible
throughout the entire operation, and the defrag can't lose changes that
occur while the file is being defragmented.  In many filesystems (NTFS
on Windows for example), a defrag functions similarly to a pvmove
operation in LVM: as each extent gets moved, writes to that region get
redirected to the new location, and areas that were written to are
treated as having been moved already.  The thing is, on BTRFS that would
result in extents getting split, which means COW is probably involved at
some level in the data path too.
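That pvmove-style indirection can be sketched as a toy model (hypothetical code, nothing to do with the real NTFS or LVM implementations): a cursor copies the extent unit by unit, and concurrent writes land on whichever side of the cursor their offset falls, so nothing is lost.

```python
# Toy defrag-with-indirection (illustration only): while an extent is
# copied to its new location, writes behind the copy cursor are redirected
# to the new location; writes ahead of it land in the old location and are
# carried over when the cursor reaches them.

def defrag_move(old, writes_during_move):
    new = bytearray(len(old))
    for cursor in range(len(old)):
        # Concurrent writes arriving while the cursor is at this offset.
        for off, value in writes_during_move.get(cursor, []):
            if off < cursor:
                new[off] = value      # already moved: redirect to new home
            else:
                old[off] = value      # not moved yet: lands in old copy
        new[cursor] = old[cursor]     # move one unit
    return bytes(new)

old = bytearray(b"aaaa")
# While the cursor is at offset 2, two concurrent writes arrive:
# one behind the cursor (offset 0) and one ahead of it (offset 3).
moved = defrag_move(old, {2: [(0, ord("X")), (3, ord("Y"))]})
assert moved == b"XaaY"   # both writes survived the move
```

On btrfs the "redirect" step would mean splitting the extent mapping, which is where the suspected COW involvement comes in.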

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-08  2:11   ` Peter Zaitsev
@ 2017-02-08 12:14     ` Martin Raiber
  2017-02-08 13:00       ` Adrian Brzezinski
  2017-02-08 13:08       ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 42+ messages in thread
From: Martin Raiber @ 2017-02-08 12:14 UTC (permalink / raw)
  To: Peter Zaitsev, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2780 bytes --]

Hi,

On 08.02.2017 03:11 Peter Zaitsev wrote:
> Out of curiosity, I see one problem here:
> If you're doing snapshots of the live database, each snapshot leaves
> the database files like killing the database in-flight. Like shutting
> the system down in the middle of writing data.
>
> This is because I think there's no API for user space to subscribe to
> events like a snapshot - unlike e.g. the VSS API (volume snapshot
> service) in Windows. You should put the database into frozen state to
> prepare it for a hotcopy before creating the snapshot, then ensure all
> data is flushed before continuing.
>
> I think I've read that btrfs snapshots do not guarantee single point in
> time snapshots - the snapshot may be smeared across a longer period of
> time while the kernel is still writing data. So parts of your writes
> may still end up in the snapshot after issuing the snapshot command,
> instead of in the working copy as expected.
>
> How is this going to be addressed? Is there some snapshot aware API to
> let user space subscribe to such events and do proper preparation? Is
> this planned? LVM could be a user of such an API, too. I think this
> could have nice enterprise-grade value for Linux.
>
> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
> still, also this needs to be integrated with MySQL to properly work. I
> once (years ago) researched on this but gave up on my plans when I
> planned database backups for our web server infrastructure. We moved to
> creating SQL dumps instead, although there're binlogs which can be used
> to recover to a clean and stable transactional state after taking
> snapshots. But I simply didn't want to fiddle around with properly
> cleaning up binlogs which accumulate horribly much space usage over
> time. The cleanup process requires to create a cold copy or dump of the
> complete database from time to time, only then it's safe to remove all
> binlogs up to that point in time.

A little bit off topic, but I for one would be on board with such an
effort. It "just" needs coordination between the backup
software/snapshot tools, the backed-up software, and the various
snapshot providers. If you look at the Windows VSS API, this would be a
relatively large undertaking if all the corner cases are taken into
account, like e.g. a database having its log on a separate volume from
the data, dependencies between different components, etc.

You'll know more about this than I do, but databases usually fsync
quite often in their default configuration, so btrfs snapshots
shouldn't be far behind a properly coordinated snapshot state; I see
the advantages more in usability and in taking care of corner cases
automatically.

Regards,
Martin Raiber


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-08 12:14     ` Martin Raiber
@ 2017-02-08 13:00       ` Adrian Brzezinski
  2017-02-08 13:08       ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 42+ messages in thread
From: Adrian Brzezinski @ 2017-02-08 13:00 UTC (permalink / raw)
  To: Martin Raiber, Peter Zaitsev, linux-btrfs

W dniu 2017-02-08 o 13:14 PM, Martin Raiber pisze:
> Hi,
>
> On 08.02.2017 03:11 Peter Zaitsev wrote:
>> Out of curiosity, I see one problem here:
>> If you're doing snapshots of the live database, each snapshot leaves
>> the database files like killing the database in-flight. Like shutting
>> the system down in the middle of writing data.
>>
>> This is because I think there's no API for user space to subscribe to
>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>> service) in Windows. You should put the database into frozen state to
>> prepare it for a hotcopy before creating the snapshot, then ensure all
>> data is flushed before continuing.
>>
>> I think I've read that btrfs snapshots do not guarantee single point in
>> time snapshots - the snapshot may be smeared across a longer period of
>> time while the kernel is still writing data. So parts of your writes
>> may still end up in the snapshot after issuing the snapshot command,
>> instead of in the working copy as expected.
>>
>> How is this going to be addressed? Is there some snapshot aware API to
>> let user space subscribe to such events and do proper preparation? Is
>> this planned? LVM could be a user of such an API, too. I think this
>> could have nice enterprise-grade value for Linux.
>>
>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
>> still, also this needs to be integrated with MySQL to properly work. I
>> once (years ago) researched on this but gave up on my plans when I
>> planned database backups for our web server infrastructure. We moved to
>> creating SQL dumps instead, although there're binlogs which can be used
>> to recover to a clean and stable transactional state after taking
>> snapshots. But I simply didn't want to fiddle around with properly
>> cleaning up binlogs which accumulate horribly much space usage over
>> time. The cleanup process requires to create a cold copy or dump of the
>> complete database from time to time, only then it's safe to remove all
>> binlogs up to that point in time.
> little bit off topic, but I for one would be on board with such an
> effort. It "just" needs coordination between the backup
> software/snapshot tools, the backed up software and the various snapshot
> providers. If you look at the Windows VSS API, this would be a
> relatively large undertaking if all the corner cases are taken into
> account, like e.g. a database having the database log on a separate
> volume from the data, dependencies between different components etc.
>
> You'll know more about this, but databases usually fsync quite often in
> their default configuration, so btrfs snapshots shouldn't be much behind
> the properly snapshotted state, so I see the advantages more with
> usability and taking care of corner cases automatically.
>
> Regards,
> Martin Raiber

xfs_freeze also works for BTRFS...


-- 

Adrian Brzeziński


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: BTRFS for OLTP Databases
  2017-02-08 12:14     ` Martin Raiber
  2017-02-08 13:00       ` Adrian Brzezinski
@ 2017-02-08 13:08       ` Austin S. Hemmelgarn
  2017-02-08 13:26         ` Martin Raiber
  1 sibling, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-08 13:08 UTC (permalink / raw)
  To: Martin Raiber, Peter Zaitsev, linux-btrfs

On 2017-02-08 07:14, Martin Raiber wrote:
> Hi,
>
> On 08.02.2017 03:11 Peter Zaitsev wrote:
>> Out of curiosity, I see one problem here:
>> If you're doing snapshots of the live database, each snapshot leaves
>> the database files like killing the database in-flight. Like shutting
>> the system down in the middle of writing data.
>>
>> This is because I think there's no API for user space to subscribe to
>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>> service) in Windows. You should put the database into frozen state to
>> prepare it for a hotcopy before creating the snapshot, then ensure all
>> data is flushed before continuing.
>>
>> I think I've read that btrfs snapshots do not guarantee single point in
>> time snapshots - the snapshot may be smeared across a longer period of
>> time while the kernel is still writing data. So parts of your writes
>> may still end up in the snapshot after issuing the snapshot command,
>> instead of in the working copy as expected.
>>
>> How is this going to be addressed? Is there some snapshot aware API to
>> let user space subscribe to such events and do proper preparation? Is
>> this planned? LVM could be a user of such an API, too. I think this
>> could have nice enterprise-grade value for Linux.
>>
>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
>> still, also this needs to be integrated with MySQL to properly work. I
>> once (years ago) researched on this but gave up on my plans when I
>> planned database backups for our web server infrastructure. We moved to
>> creating SQL dumps instead, although there're binlogs which can be used
>> to recover to a clean and stable transactional state after taking
>> snapshots. But I simply didn't want to fiddle around with properly
>> cleaning up binlogs which accumulate horribly much space usage over
>> time. The cleanup process requires to create a cold copy or dump of the
>> complete database from time to time, only then it's safe to remove all
>> binlogs up to that point in time.
>
> little bit off topic, but I for one would be on board with such an
> effort. It "just" needs coordination between the backup
> software/snapshot tools, the backed up software and the various snapshot
> providers. If you look at the Windows VSS API, this would be a
> relatively large undertaking if all the corner cases are taken into
> account, like e.g. a database having the database log on a separate
> volume from the data, dependencies between different components etc.
>
> You'll know more about this, but databases usually fsync quite often in
> their default configuration, so btrfs snapshots shouldn't be much behind
> the properly snapshotted state, so I see the advantages more with
> usability and taking care of corner cases automatically.
Just my perspective, but BTRFS (and XFS, and OCFS2) already provide 
reflinking to userspace, and therefore it's fully possible to implement 
this in userspace.  Having a version of the fsfreeze (the generic form 
of xfs_freeze) stuff that worked on individual sub-trees would be nice 
from a practical perspective, but implementing it would not be easy by 
any means, and would be essentially necessary for a VSS-like API.  In 
the meantime though, it is fully possible for the application software 
to implement this itself without needing anything more from the kernel.
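The reflink-based userspace approach described above can be sketched concretely. The following is a minimal illustration, not real backup-tool code: it clones a stand-in data file with `cp --reflink=auto`, which shares extents on reflink-capable filesystems (btrfs, XFS, OCFS2) and silently falls back to a plain copy elsewhere; the pause/resume steps are the coordination the application itself must provide.

```shell
set -e
# Work in a scratch directory so the sketch is self-contained.
dir=$(mktemp -d)
src="$dir/ibdata1"                     # stand-in for a database file
printf 'page1\npage2\n' > "$src"
# 1. The application flushes and pauses its writes here
#    (for MySQL, e.g. FLUSH TABLES WITH READ LOCK).
# 2. Clone the file: on a reflink-capable filesystem this shares extents
#    and completes almost instantly; --reflink=auto falls back to a
#    plain copy on other filesystems.
cp --reflink=auto -- "$src" "$dir/ibdata1.snap"
# 3. The application resumes; the clone is a stable point-in-time copy.
cmp -s "$src" "$dir/ibdata1.snap" && echo "clone OK"
```

Note that the clone is atomic only per file, so a database spread over several files still needs the application-level pause to get a mutually consistent set of clones.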



* Re: BTRFS for OLTP Databases
  2017-02-08 13:08       ` Austin S. Hemmelgarn
@ 2017-02-08 13:26         ` Martin Raiber
  2017-02-08 13:32           ` Austin S. Hemmelgarn
  2017-02-08 13:38           ` Peter Zaitsev
  0 siblings, 2 replies; 42+ messages in thread
From: Martin Raiber @ 2017-02-08 13:26 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Peter Zaitsev, linux-btrfs


On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
> On 2017-02-08 07:14, Martin Raiber wrote:
>> Hi,
>>
>> On 08.02.2017 03:11 Peter Zaitsev wrote:
>>> Out of curiosity, I see one problem here:
>>> If you're doing snapshots of the live database, each snapshot leaves
>>> the database files like killing the database in-flight. Like shutting
>>> the system down in the middle of writing data.
>>>
>>> This is because I think there's no API for user space to subscribe to
>>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>>> service) in Windows. You should put the database into frozen state to
>>> prepare it for a hotcopy before creating the snapshot, then ensure all
>>> data is flushed before continuing.
>>>
>>> I think I've read that btrfs snapshots do not guarantee single point in
>>> time snapshots - the snapshot may be smeared across a longer period of
>>> time while the kernel is still writing data. So parts of your writes
>>> may still end up in the snapshot after issuing the snapshot command,
>>> instead of in the working copy as expected.
>>>
>>> How is this going to be addressed? Is there some snapshot aware API to
>>> let user space subscribe to such events and do proper preparation? Is
>>> this planned? LVM could be a user of such an API, too. I think this
>>> could have nice enterprise-grade value for Linux.
>>>
>>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
>>> still, also this needs to be integrated with MySQL to properly work. I
>>> once (years ago) researched on this but gave up on my plans when I
>>> planned database backups for our web server infrastructure. We moved to
>>> creating SQL dumps instead, although there're binlogs which can be used
>>> to recover to a clean and stable transactional state after taking
>>> snapshots. But I simply didn't want to fiddle around with properly
>>> cleaning up binlogs which accumulate horribly much space usage over
>>> time. The cleanup process requires to create a cold copy or dump of the
>>> complete database from time to time, only then it's safe to remove all
>>> binlogs up to that point in time.
>>
>> little bit off topic, but I for one would be on board with such an
>> effort. It "just" needs coordination between the backup
>> software/snapshot tools, the backed up software and the various snapshot
>> providers. If you look at the Windows VSS API, this would be a
>> relatively large undertaking if all the corner cases are taken into
>> account, like e.g. a database having the database log on a separate
>> volume from the data, dependencies between different components etc.
>>
>> You'll know more about this, but databases usually fsync quite often in
>> their default configuration, so btrfs snapshots shouldn't be much behind
>> the properly snapshotted state, so I see the advantages more with
>> usability and taking care of corner cases automatically.
> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
> reflinking to userspace, and therefore it's fully possible to
> implement this in userspace.  Having a version of the fsfreeze (the
> generic form of xfs_freeze) stuff that worked on individual sub-trees
> would be nice from a practical perspective, but implementing it would
> not be easy by any means, and would be essentially necessary for a
> VSS-like API.  In the meantime though, it is fully possible for the
> application software to implement this itself without needing anything
> more from the kernel.

VSS snapshots whole volumes, not individual files (so it is comparable to
an LVM snapshot). A sub-folder freeze would be useful in some
situations, but duplicating the files+extents might also take too long
in a lot of situations. You are correct that the kernel features are
there; what is missing is a user-space daemon, plus a protocol that
facilitates/coordinates the backups/snapshots.

Sending a FIFREEZE ioctl, taking a snapshot and then thawing does not
really help in some situations, as e.g. MySQL InnoDB uses O_DIRECT and
manages its own buffer pool, which the FIFREEZE won't flush; but,
as said, the default configuration is to flush/fsync on every commit.






* Re: BTRFS for OLTP Databases
  2017-02-08 13:26         ` Martin Raiber
@ 2017-02-08 13:32           ` Austin S. Hemmelgarn
  2017-02-08 14:28             ` Adrian Brzezinski
  2017-02-08 13:38           ` Peter Zaitsev
  1 sibling, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-08 13:32 UTC (permalink / raw)
  To: Martin Raiber, Peter Zaitsev, linux-btrfs

On 2017-02-08 08:26, Martin Raiber wrote:
> On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
>> On 2017-02-08 07:14, Martin Raiber wrote:
>>> Hi,
>>>
>>> On 08.02.2017 03:11 Peter Zaitsev wrote:
>>>> Out of curiosity, I see one problem here:
>>>> If you're doing snapshots of the live database, each snapshot leaves
>>>> the database files like killing the database in-flight. Like shutting
>>>> the system down in the middle of writing data.
>>>>
>>>> This is because I think there's no API for user space to subscribe to
>>>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>>>> service) in Windows. You should put the database into frozen state to
>>>> prepare it for a hotcopy before creating the snapshot, then ensure all
>>>> data is flushed before continuing.
>>>>
>>>> I think I've read that btrfs snapshots do not guarantee single point in
>>>> time snapshots - the snapshot may be smeared across a longer period of
>>>> time while the kernel is still writing data. So parts of your writes
>>>> may still end up in the snapshot after issuing the snapshot command,
>>>> instead of in the working copy as expected.
>>>>
>>>> How is this going to be addressed? Is there some snapshot aware API to
>>>> let user space subscribe to such events and do proper preparation? Is
>>>> this planned? LVM could be a user of such an API, too. I think this
>>>> could have nice enterprise-grade value for Linux.
>>>>
>>>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
>>>> still, also this needs to be integrated with MySQL to properly work. I
>>>> once (years ago) researched on this but gave up on my plans when I
>>>> planned database backups for our web server infrastructure. We moved to
>>>> creating SQL dumps instead, although there're binlogs which can be used
>>>> to recover to a clean and stable transactional state after taking
>>>> snapshots. But I simply didn't want to fiddle around with properly
>>>> cleaning up binlogs which accumulate horribly much space usage over
>>>> time. The cleanup process requires to create a cold copy or dump of the
>>>> complete database from time to time, only then it's safe to remove all
>>>> binlogs up to that point in time.
>>>
>>> little bit off topic, but I for one would be on board with such an
>>> effort. It "just" needs coordination between the backup
>>> software/snapshot tools, the backed up software and the various snapshot
>>> providers. If you look at the Windows VSS API, this would be a
>>> relatively large undertaking if all the corner cases are taken into
>>> account, like e.g. a database having the database log on a separate
>>> volume from the data, dependencies between different components etc.
>>>
>>> You'll know more about this, but databases usually fsync quite often in
>>> their default configuration, so btrfs snapshots shouldn't be much behind
>>> the properly snapshotted state, so I see the advantages more with
>>> usability and taking care of corner cases automatically.
>> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
>> reflinking to userspace, and therefore it's fully possible to
>> implement this in userspace.  Having a version of the fsfreeze (the
>> generic form of xfs_freeze) stuff that worked on individual sub-trees
>> would be nice from a practical perspective, but implementing it would
>> not be easy by any means, and would be essentially necessary for a
>> VSS-like API.  In the meantime though, it is fully possible for the
>> application software to implement this itself without needing anything
>> more from the kernel.
>
> VSS snapshots whole volumes, not individual files (so comparable to an
> LVM snapshot). The sub-folder freeze would be something useful in some
> situations, but duplicating the files+extents might also take too long
> in a lot of situations. You are correct that the kernel features are
> there and what is missing is a user-space daemon, plus a protocol that
> facilitates/coordinates the backups/snapshots.
>
> Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does not
> really help in some situations as e.g. MySQL InnoDB uses O_DIRECT and
> manages its own buffer pool which won't get the FIFREEZE and flush, but
> as said, the default configuration is to flush/fsync on every commit.
OK, there's part of the misunderstanding.  You can't FIFREEZE a BTRFS 
filesystem and then take a snapshot in it, because the snapshot requires 
writing to the filesystem (which the FIFREEZE would prevent, so a script 
that tried to do this would deadlock).  A new version of the FIFREEZE 
ioctl would be needed that operates on subvolumes.


* Re: BTRFS for OLTP Databases
  2017-02-08 13:26         ` Martin Raiber
  2017-02-08 13:32           ` Austin S. Hemmelgarn
@ 2017-02-08 13:38           ` Peter Zaitsev
  1 sibling, 0 replies; 42+ messages in thread
From: Peter Zaitsev @ 2017-02-08 13:38 UTC (permalink / raw)
  To: Martin Raiber; +Cc: Austin S. Hemmelgarn, linux-btrfs

Hi,

When it comes to MySQL, I'm not really sure what you're trying to
achieve. Because MySQL manages its own cache, flushing the OS cache to
the disk and "freezing" the FS does not really do much - the database
will still need to do crash recovery when such a snapshot is restored.

The reason people would use xfs_freeze with MySQL is when the
database is spread across different filesystems - typically log files
placed on a different partition than the data, or databases placed on
different partitions. In this case you need a consistent single
point-in-time snapshot across the filesystems for the backup to be
recoverable.  The more common approach, though, is to keep it KISS
and have everything on a single filesystem.
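The crash-recovery property described above can be illustrated with a toy write-ahead log (a deliberately simplified Python sketch, not InnoDB's actual log format): each commit record is fsynced, so a snapshot - like a power loss - can at worst truncate the record being written, and replay simply discards the torn tail.

```python
import os
import struct
import tempfile
import zlib

def commit(log, payload: bytes):
    """Append one commit record (length, CRC32, payload) and fsync it."""
    rec = struct.pack("<II", len(payload), zlib.crc32(payload)) + payload
    log.write(rec)
    log.flush()
    os.fsync(log.fileno())        # the commit is durable once this returns

def recover(path):
    """Replay the log, ignoring a torn record at the tail."""
    out, buf, pos = [], open(path, "rb").read(), 0
    while pos + 8 <= len(buf):
        n, crc = struct.unpack_from("<II", buf, pos)
        payload = buf[pos + 8:pos + 8 + n]
        if len(payload) < n or zlib.crc32(payload) != crc:
            break                 # torn tail: snapshot caught a partial write
        out.append(payload)
        pos += 8 + n
    return out

path = tempfile.mktemp()
with open(path, "wb") as log:
    commit(log, b"txn-1")
    commit(log, b"txn-2")
    log.write(b"\x05\x00\x00\x00")  # simulate a snapshot taken mid-write

print(recover(path))                # -> [b'txn-1', b'txn-2']
```

Both committed transactions survive; the half-written header at the end is discarded on replay, which is exactly why an fsync-per-commit database recovers from a filesystem snapshot the same way it recovers from a power loss.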

On Wed, Feb 8, 2017 at 8:26 AM, Martin Raiber <martin@urbackup.org> wrote:
> On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
>> On 2017-02-08 07:14, Martin Raiber wrote:
>>> Hi,
>>>
>>> On 08.02.2017 03:11 Peter Zaitsev wrote:
>>>> Out of curiosity, I see one problem here:
>>>> If you're doing snapshots of the live database, each snapshot leaves
>>>> the database files like killing the database in-flight. Like shutting
>>>> the system down in the middle of writing data.
>>>>
>>>> This is because I think there's no API for user space to subscribe to
>>>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>>>> service) in Windows. You should put the database into frozen state to
>>>> prepare it for a hotcopy before creating the snapshot, then ensure all
>>>> data is flushed before continuing.
>>>>
>>>> I think I've read that btrfs snapshots do not guarantee single point in
>>>> time snapshots - the snapshot may be smeared across a longer period of
>>>> time while the kernel is still writing data. So parts of your writes
>>>> may still end up in the snapshot after issuing the snapshot command,
>>>> instead of in the working copy as expected.
>>>>
>>>> How is this going to be addressed? Is there some snapshot aware API to
>>>> let user space subscribe to such events and do proper preparation? Is
>>>> this planned? LVM could be a user of such an API, too. I think this
>>>> could have nice enterprise-grade value for Linux.
>>>>
>>>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
>>>> still, also this needs to be integrated with MySQL to properly work. I
>>>> once (years ago) researched on this but gave up on my plans when I
>>>> planned database backups for our web server infrastructure. We moved to
>>>> creating SQL dumps instead, although there're binlogs which can be used
>>>> to recover to a clean and stable transactional state after taking
>>>> snapshots. But I simply didn't want to fiddle around with properly
>>>> cleaning up binlogs which accumulate horribly much space usage over
>>>> time. The cleanup process requires to create a cold copy or dump of the
>>>> complete database from time to time, only then it's safe to remove all
>>>> binlogs up to that point in time.
>>>
>>> little bit off topic, but I for one would be on board with such an
>>> effort. It "just" needs coordination between the backup
>>> software/snapshot tools, the backed up software and the various snapshot
>>> providers. If you look at the Windows VSS API, this would be a
>>> relatively large undertaking if all the corner cases are taken into
>>> account, like e.g. a database having the database log on a separate
>>> volume from the data, dependencies between different components etc.
>>>
>>> You'll know more about this, but databases usually fsync quite often in
>>> their default configuration, so btrfs snapshots shouldn't be much behind
>>> the properly snapshotted state, so I see the advantages more with
>>> usability and taking care of corner cases automatically.
>> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
>> reflinking to userspace, and therefore it's fully possible to
>> implement this in userspace.  Having a version of the fsfreeze (the
>> generic form of xfs_freeze) stuff that worked on individual sub-trees
>> would be nice from a practical perspective, but implementing it would
>> not be easy by any means, and would be essentially necessary for a
>> VSS-like API.  In the meantime though, it is fully possible for the
>> application software to implement this itself without needing anything
>> more from the kernel.
>
> VSS snapshots whole volumes, not individual files (so comparable to an
> LVM snapshot). The sub-folder freeze would be something useful in some
>> situations, but duplicating the files+extents might also take too long
> in a lot of situations. You are correct that the kernel features are
> there and what is missing is a user-space daemon, plus a protocol that
> facilitates/coordinates the backups/snapshots.
>
> Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does not
> really help in some situations as e.g. MySQL InnoDB uses O_DIRECT and
>> manages its own buffer pool which won't get the FIFREEZE and flush, but
> as said, the default configuration is to flush/fsync on every commit.
>
>
>



-- 
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev


* Re: BTRFS for OLTP Databases
  2017-02-08 13:32           ` Austin S. Hemmelgarn
@ 2017-02-08 14:28             ` Adrian Brzezinski
  0 siblings, 0 replies; 42+ messages in thread
From: Adrian Brzezinski @ 2017-02-08 14:28 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Martin Raiber, Peter Zaitsev, linux-btrfs

On 2017-02-08 at 14:32, Austin S. Hemmelgarn wrote:
> On 2017-02-08 08:26, Martin Raiber wrote:
>> On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
>>> On 2017-02-08 07:14, Martin Raiber wrote:
>>>> Hi,
>>>>
>>>> On 08.02.2017 03:11 Peter Zaitsev wrote:
>>>>> Out of curiosity, I see one problem here:
>>>>> If you're doing snapshots of the live database, each snapshot leaves
>>>>> the database files like killing the database in-flight. Like shutting
>>>>> the system down in the middle of writing data.
>>>>>
>>>>> This is because I think there's no API for user space to subscribe to
>>>>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>>>>> service) in Windows. You should put the database into frozen state to
>>>>> prepare it for a hotcopy before creating the snapshot, then ensure
>>>>> all
>>>>> data is flushed before continuing.
>>>>>
>>>>> I think I've read that btrfs snapshots do not guarantee single
>>>>> point in
>>>>> time snapshots - the snapshot may be smeared across a longer
>>>>> period of
>>>>> time while the kernel is still writing data. So parts of your writes
>>>>> may still end up in the snapshot after issuing the snapshot command,
>>>>> instead of in the working copy as expected.
>>>>>
>>>>> How is this going to be addressed? Is there some snapshot aware
>>>>> API to
>>>>> let user space subscribe to such events and do proper preparation? Is
>>>>> this planned? LVM could be a user of such an API, too. I think this
>>>>> could have nice enterprise-grade value for Linux.
>>>>>
>>>>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM
>>>>> snapshots. But
>>>>> still, also this needs to be integrated with MySQL to properly
>>>>> work. I
>>>>> once (years ago) researched on this but gave up on my plans when I
>>>>> planned database backups for our web server infrastructure. We
>>>>> moved to
>>>>> creating SQL dumps instead, although there're binlogs which can be
>>>>> used
>>>>> to recover to a clean and stable transactional state after taking
>>>>> snapshots. But I simply didn't want to fiddle around with properly
>>>>> cleaning up binlogs which accumulate horribly much space usage over
>>>>> time. The cleanup process requires to create a cold copy or dump
>>>>> of the
>>>>> complete database from time to time, only then it's safe to remove
>>>>> all
>>>>> binlogs up to that point in time.
>>>>
>>>> little bit off topic, but I for one would be on board with such an
>>>> effort. It "just" needs coordination between the backup
>>>> software/snapshot tools, the backed up software and the various
>>>> snapshot
>>>> providers. If you look at the Windows VSS API, this would be a
>>>> relatively large undertaking if all the corner cases are taken into
>>>> account, like e.g. a database having the database log on a separate
>>>> volume from the data, dependencies between different components etc.
>>>>
>>>> You'll know more about this, but databases usually fsync quite
>>>> often in
>>>> their default configuration, so btrfs snapshots shouldn't be much
>>>> behind
>>>> the properly snapshotted state, so I see the advantages more with
>>>> usability and taking care of corner cases automatically.
>>> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
>>> reflinking to userspace, and therefore it's fully possible to
>>> implement this in userspace.  Having a version of the fsfreeze (the
>>> generic form of xfs_freeze) stuff that worked on individual sub-trees
>>> would be nice from a practical perspective, but implementing it would
>>> not be easy by any means, and would be essentially necessary for a
>>> VSS-like API.  In the meantime though, it is fully possible for the
>>> application software to implement this itself without needing anything
>>> more from the kernel.
>>
>> VSS snapshots whole volumes, not individual files (so comparable to an
>> LVM snapshot). The sub-folder freeze would be something useful in some
>> situations, but duplicating the files+extents might also take too long
>> in a lot of situations. You are correct that the kernel features are
>> there and what is missing is a user-space daemon, plus a protocol that
>> facilitates/coordinates the backups/snapshots.
>>
>> Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does not
>> really help in some situations as e.g. MySQL InnoDB uses O_DIRECT and
>> manages its own buffer pool which won't get the FIFREEZE and flush, but
>> as said, the default configuration is to flush/fsync on every commit.
> OK, there's part of the misunderstanding.  You can't FIFREEZE a BTRFS
> filesystem and then take a snapshot in it, because the snapshot
> requires writing to the filesystem (which the FIFREEZE would prevent,
> so a script that tried to do this would deadlock).  A new version of
> the FIFREEZE ioctl would be needed that operates on subvolumes.
You can also put your filesystem on LVM and take LVM snapshots.
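For reference, the LVM route looks roughly like this. It is an illustrative sketch of an administrative sequence, not something runnable as-is: it needs root, an LVM volume group with free extents, and the volume group name ("vg0"), logical volume name ("mysql"), and mount point are all made up.

```shell
#!/bin/sh
set -e
# 1. Freeze the filesystem: blocks new writes and flushes dirty data.
fsfreeze --freeze /var/lib/mysql
# 2. Take the block-level snapshot; with the FS frozen it is crash-consistent.
lvcreate --snapshot --size 10G --name mysql-snap /dev/vg0/mysql
# 3. Thaw immediately; the freeze window lasts only as long as the lvcreate.
fsfreeze --unfreeze /var/lib/mysql
# The snapshot can now be mounted read-only and backed up at leisure:
#   mount -o ro /dev/vg0/mysql-snap /mnt/backup
```

Note that, as discussed earlier in the thread, this gives a crash-consistent image: the database still performs crash recovery when the snapshot is restored.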


-- 
Adrian Brzeziński


* Re: BTRFS for OLTP Databases
  2017-02-07 21:35           ` Kai Krakow
  2017-02-07 22:27             ` Hans van Kranenburg
@ 2017-02-08 19:08             ` Goffredo Baroncelli
  1 sibling, 0 replies; 42+ messages in thread
From: Goffredo Baroncelli @ 2017-02-08 19:08 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs

On 2017-02-07 22:35, Kai Krakow wrote:
[...]
>>
>> Atomicity can be a relative term. If the snapshot atomicity is
>> relative to barriers but not relative to individual writes between
>> barriers then AFAICT it's fine because the filesystem doesn't make
>> any promise it won't keep even in the context of its snapshots.
>> Consider a power loss : the filesystems atomicity guarantees can't go
>> beyond what the hardware guarantees which means not all current in fly
>> write will reach the disk and partial writes can happen. Modern
>> filesystems will remain consistent though and if an application using
>> them makes uses of f*sync it can provide its own guarantees too. The
>> same should apply to snapshots : all the writes in fly can complete or
>> not on disk before the snapshot what matters is that both the snapshot
>> and these writes will be completed after the next barrier (and any
>> robust application will ignore all the in fly writes it finds in the
>> snapshot if they were part of a batch that should be atomically
>> commited).
>>
>> This is why AFAIK PostgreSQL or MySQL with their default ACID
>> compliant configuration will recover from a BTRFS snapshot in the
>> same way they recover from a power loss.
> 
> This is what I meant in my other reply. But this is also why it should
> be documented. Wrongly implying that snapshots are single point in time
> snapshots is a wrong assumption with possibly horrible side effects one
> wouldn't expect.

I don't understand what you are saying.
Until now, my understanding was that all the writes passed to btrfs before the snapshot time are in the snapshot, and the ones after are not.
Am I wrong? What are the other possible interpretations?


[..]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: BTRFS for OLTP Databases
       [not found]         ` <b0de25a7-989e-d16a-2ce6-2b6c1edde08b@gmail.com>
@ 2017-02-13 12:44           ` Austin S. Hemmelgarn
  2017-02-13 17:16             ` linux-btrfs
  0 siblings, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-13 12:44 UTC (permalink / raw)
  To: linux-btrfs

On 2017-02-09 22:58, Andrei Borzenkov wrote:
> On 07.02.2017 at 23:47, Austin S. Hemmelgarn wrote:
> ...
>> Sadly, freezefs (the generic interface based off of xfs_freeze) only
>> works for block device snapshots.  Filesystem level snapshots need the
>> application software to sync all its data and then stop writing until
>> the snapshot is complete.
>>
>
> I expect databases to be using directio, otherwise we have problems even
> without using snapshots. Is it still an issue with directio?
It is less of an issue, but it's still an issue because you can still 
call for snapshot creation in the middle of an application I/O request. 
In other words, the application wouldn't need to worry about syncing 
data, but it would need to worry about making sure it's not actually 
writing anything when the snapshot happens.



* Re: BTRFS for OLTP Databases
  2017-02-13 12:44           ` Austin S. Hemmelgarn
@ 2017-02-13 17:16             ` linux-btrfs
  0 siblings, 0 replies; 42+ messages in thread
From: linux-btrfs @ 2017-02-13 17:16 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, linux-btrfs

On 2017-02-13 at 13:44, Austin S. Hemmelgarn wrote:
> On 2017-02-09 22:58, Andrei Borzenkov wrote:
>> On 07.02.2017 at 23:47, Austin S. Hemmelgarn wrote:
>> ...
>>> Sadly, freezefs (the generic interface based off of xfs_freeze) only
>>> works for block device snapshots.  Filesystem level snapshots need the
>>> application software to sync all its data and then stop writing until
>>> the snapshot is complete.
>>>
>>
>> I expect databases to be using directio, otherwise we have problems even
>> without using snapshots. Is it still an issue with directio?
> It is less of an issue, but it's still an issue because you can still
> call for snapshot creation in the middle of an application I/O
> request. In other words, the application wouldn't need to worry about
> syncing data, but it would need to worry about making sure it's not
> actually writing anything when the snapshot happens.
>
I think this should work the other way around: a snapshot should wait
until all in-flight direct I/O writes are done, and new write requests
issued while the snapshot is being created should wait until the
snapshot is complete.


-- 

Adrian Brzeziński



end of thread, other threads:[~2017-02-13 17:16 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-07 13:53 BTRFS for OLTP Databases Peter Zaitsev
2017-02-07 14:00 ` Hugo Mills
2017-02-07 14:13   ` Peter Zaitsev
2017-02-07 15:00     ` Timofey Titovets
2017-02-07 15:09       ` Austin S. Hemmelgarn
2017-02-07 15:20         ` Timofey Titovets
2017-02-07 15:43           ` Austin S. Hemmelgarn
2017-02-07 21:14             ` Kai Krakow
2017-02-07 16:22     ` Lionel Bouton
2017-02-07 19:57     ` Roman Mamedov
2017-02-07 20:36     ` Kai Krakow
2017-02-07 20:44       ` Lionel Bouton
2017-02-07 20:47       ` Austin S. Hemmelgarn
2017-02-07 21:25         ` Lionel Bouton
2017-02-07 21:35           ` Kai Krakow
2017-02-07 22:27             ` Hans van Kranenburg
2017-02-08 19:08             ` Goffredo Baroncelli
     [not found]         ` <b0de25a7-989e-d16a-2ce6-2b6c1edde08b@gmail.com>
2017-02-13 12:44           ` Austin S. Hemmelgarn
2017-02-13 17:16             ` linux-btrfs
2017-02-07 19:31   ` Peter Zaitsev
2017-02-07 19:50     ` Austin S. Hemmelgarn
2017-02-07 20:19       ` Kai Krakow
2017-02-07 20:27         ` Austin S. Hemmelgarn
2017-02-07 20:54           ` Kai Krakow
2017-02-08 12:12             ` Austin S. Hemmelgarn
2017-02-08  2:11   ` Peter Zaitsev
2017-02-08 12:14     ` Martin Raiber
2017-02-08 13:00       ` Adrian Brzezinski
2017-02-08 13:08       ` Austin S. Hemmelgarn
2017-02-08 13:26         ` Martin Raiber
2017-02-08 13:32           ` Austin S. Hemmelgarn
2017-02-08 14:28             ` Adrian Brzezinski
2017-02-08 13:38           ` Peter Zaitsev
2017-02-07 14:47 ` Peter Grandi
2017-02-07 15:06 ` Austin S. Hemmelgarn
2017-02-07 19:39   ` Kai Krakow
2017-02-07 19:59     ` Austin S. Hemmelgarn
2017-02-07 18:27 ` Jeff Mahoney
2017-02-07 18:59   ` Peter Zaitsev
2017-02-07 19:54     ` Austin S. Hemmelgarn
2017-02-07 20:40       ` Peter Zaitsev
2017-02-07 22:08     ` Hans van Kranenburg
