* BTRFS for OLTP Databases
2017-02-07 13:53 UTC
From: Peter Zaitsev
To: linux-btrfs

Hi,

I have tried BTRFS from Ubuntu 16.04 LTS for a write-intensive OLTP MySQL workload.

It did not go very well, ranging from multi-second stalls where no transactions are completed to, finally, a kernel OOPS with a "no space left on device" error message and the filesystem going read-only.

I'm a complete newbie with BTRFS, so I assume I'm doing something wrong.

Do you have any advice on how BTRFS should be tuned for an OLTP workload (large files with a lot of random writes)? Or is this a case where one should simply stay away from BTRFS and use something else?

One item recommended in some places is "nodatacow"; this however defeats the main reason I'm looking at BTRFS - I am interested in "free" snapshots, which look very attractive for database recovery scenarios, allowing instant rollback to a previous state.

--
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype: peter_zaitsev
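The snapshot-based recovery-point workflow described here can be sketched as a small rotation script; the paths, subvolume layout, and retention count below are hypothetical:

```shell
#!/bin/sh
# Take an hourly read-only snapshot of the MySQL subvolume and keep
# only the newest $KEEP. All paths here are made up for illustration.
SRC=/data/mysql          # subvolume holding the database
DST=/data/.snapshots     # directory for snapshots
KEEP=3                   # recovery points: 1, 2 and 3 hours back

btrfs subvolume snapshot -r "$SRC" "$DST/mysql-$(date +%Y%m%d-%H%M)"

# Names sort chronologically, so everything except the last $KEEP goes.
ls -1d "$DST"/mysql-* | head -n -"$KEEP" | while read -r snap; do
    btrfs subvolume delete "$snap"
done
```

Rolling back is then a matter of stopping MySQL and replacing the subvolume with the snapshot of the desired hour.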
* Re: BTRFS for OLTP Databases
2017-02-07 14:00 UTC
From: Hugo Mills
To: Peter Zaitsev; Cc: linux-btrfs

On Tue, Feb 07, 2017 at 08:53:35AM -0500, Peter Zaitsev wrote:
> Hi,
>
> I have tried BTRFS from Ubuntu 16.04 LTS for a write-intensive OLTP
> MySQL workload.
>
> It did not go very well, ranging from multi-second stalls where no
> transactions are completed to, finally, a kernel OOPS with a "no space
> left on device" error message and the filesystem going read-only.
>
> I'm a complete newbie with BTRFS, so I assume I'm doing something wrong.
>
> Do you have any advice on how BTRFS should be tuned for an OLTP workload
> (large files with a lot of random writes)? Or is this a case where one
> should simply stay away from BTRFS and use something else?
>
> One item recommended in some places is "nodatacow"; this however defeats
> the main reason I'm looking at BTRFS - I am interested in "free"
> snapshots, which look very attractive for database recovery scenarios,
> allowing instant rollback to a previous state.

Well, nodatacow will still allow snapshots to work, but it also allows the data to fragment. Each snapshot made will cause subsequent writes to shared areas to be CoWed once (and then it reverts to unshared and nodatacow again).

There's another approach which might be worth testing, which is to use autodefrag. This will increase data write I/O, because where you have one or more small writes in a region, it will also read and write the data in a small neighbourhood around those writes, so the fragmentation is reduced. This will improve subsequent read performance.

I could also suggest getting the latest kernel you can -- 16.04 is already getting on for a year old, and there may be performance improvements in upstream kernels which affect your workload. There's an Ubuntu kernel PPA you can use to get the new kernels without too much pain.

   Hugo.

--
Hugo Mills             | I don't care about "it works on my machine".
hugo@... carfax.org.uk | We are not shipping your machine.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
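For reference, autodefrag is a mount option; a minimal sketch of trying the suggestion above (mount point hypothetical):

```shell
# Remount the filesystem holding the database with autodefrag enabled.
mount -o remount,autodefrag /var/lib/mysql

# To make it persistent, the fstab entry would carry the option too:
# UUID=<fs-uuid>  /var/lib/mysql  btrfs  defaults,autodefrag  0  0
```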
* Re: BTRFS for OLTP Databases
2017-02-07 14:13 UTC
From: Peter Zaitsev
To: Hugo Mills, linux-btrfs

Hi Hugo,

For the use case I'm looking at, I'm interested in having snapshot(s) open at all times. Imagine, for example, a snapshot being created every hour and several of these snapshots kept at all times, providing quick recovery points to the state of 1, 2, 3 hours ago. In such a case (as I think you also describe) nodatacow does not provide any advantage.

I have not seen autodefrag helping much, but I will try again. Is there any autodefrag documentation available about how it is expected to work and whether it can be tuned in any way?

I noticed that remounting an already fragmented filesystem with autodefrag, and then running a workload which causes more fragmentation, does not seem to improve things over time.

> Well, nodatacow will still allow snapshots to work, but it also
> allows the data to fragment. Each snapshot made will cause subsequent
> writes to shared areas to be CoWed once (and then it reverts to
> unshared and nodatacow again).
>
> There's another approach which might be worth testing, which is to
> use autodefrag. This will increase data write I/O, because where you
> have one or more small writes in a region, it will also read and write
> the data in a small neighbourhood around those writes, so the
> fragmentation is reduced. This will improve subsequent read
> performance.
>
> I could also suggest getting the latest kernel you can -- 16.04 is
> already getting on for a year old, and there may be performance
> improvements in upstream kernels which affect your workload. There's
> an Ubuntu kernel PPA you can use to get the new kernels without too
> much pain.

--
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype: peter_zaitsev
* Re: BTRFS for OLTP Databases
2017-02-07 15:00 UTC
From: Timofey Titovets
To: Peter Zaitsev; Cc: Hugo Mills, linux-btrfs

2017-02-07 17:13 GMT+03:00 Peter Zaitsev <pz@percona.com>:
> Hi Hugo,
>
> For the use case I'm looking at, I'm interested in having snapshot(s)
> open at all times. [...] In such a case (as I think you also describe)
> nodatacow does not provide any advantage.
>
> I have not seen autodefrag helping much, but I will try again. [...]
>
> I noticed that remounting an already fragmented filesystem with
> autodefrag, and then running a workload which causes more
> fragmentation, does not seem to improve things over time.

I think you have a problem with extent bookkeeping (if I understand correctly how btrfs manages extents). To deal with it, try enabling compression, as compression will force all extents to be fragmented at a size of ~128 KB.

I had a similar problem with MySQL (Zabbix as the workload, i.e. most of the load is random writes), and I fixed it by enabling compression. (I use Debian with the latest kernel from backports.) It now just works, with stable speed under stable load.

P.S. (I also use your Percona MySQL from time to time - it's cool.)

--
Have a nice day, Timofey.
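Compression, as suggested above, is also a mount option; a sketch assuming the database lives on its own mount (on 2017-era kernels the available algorithms are zlib and lzo):

```shell
# Enable transparent compression; lzo is the cheaper option CPU-wise.
mount -o remount,compress=lzo /var/lib/mysql

# Existing file data is only compressed when rewritten; a one-off
# defragment with -c compresses it in place (note: this unshares
# extents with any snapshots, costing space).
btrfs filesystem defragment -r -clzo /var/lib/mysql
```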
* Re: BTRFS for OLTP Databases
2017-02-07 15:09 UTC
From: Austin S. Hemmelgarn
To: Timofey Titovets, Peter Zaitsev; Cc: Hugo Mills, linux-btrfs

On 2017-02-07 10:00, Timofey Titovets wrote:
> I think you have a problem with extent bookkeeping (if I understand
> correctly how btrfs manages extents). To deal with it, try enabling
> compression, as compression will force all extents to be fragmented
> at a size of ~128 KB.

No, it will compress everything in chunks of 128kB, but it will not fragment things any more than they already would have been (it may actually _reduce_ fragmentation, because there is less data being stored on disk). This representation is a bug in the FIEMAP ioctl: it doesn't understand the way BTRFS represents things properly. IIRC, there was a patch to fix this, but I don't remember what happened with it.

That said, in-line compression can help significantly, especially if you have slow storage devices.

> I had a similar problem with MySQL (Zabbix as the workload, i.e. most
> of the load is random writes), and I fixed it by enabling compression.
> (I use Debian with the latest kernel from backports.)
> It now just works, with stable speed under stable load.
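The FIEMAP behaviour mentioned above can be observed with filefrag, which is built on that ioctl; the file path below is hypothetical:

```shell
# Count extents as FIEMAP reports them; on btrfs a compressed file
# shows one entry per 128K compressed chunk, inflating the number.
filefrag /var/lib/mysql/ibdata1

# Per-extent detail (offsets, lengths, flags) for a closer look.
filefrag -v /var/lib/mysql/ibdata1
```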
* Re: BTRFS for OLTP Databases
2017-02-07 15:20 UTC
From: Timofey Titovets
To: Austin S. Hemmelgarn; Cc: Peter Zaitsev, Hugo Mills, linux-btrfs

> No, it will compress everything in chunks of 128kB, but it will not
> fragment things any more than they already would have been (it may
> actually _reduce_ fragmentation, because there is less data being
> stored on disk). This representation is a bug in the FIEMAP ioctl:
> it doesn't understand the way BTRFS represents things properly. [...]
>
> That said, in-line compression can help significantly, especially if
> you have slow storage devices.

I mean this: you have a 128 MB extent and you rewrite random 4 KB sectors; btrfs will not split the 128 MB extent and will not free up the data (I don't know the internal algorithm, so I can't predict when this will happen), and after some time btrfs will rebuild the extents and split the 128 MB extent into several smaller ones. But when you use compression, the allocator rebuilds extents much earlier (I think it's because btrfs then also operates on the data as 128 KB extents, even if it's a contiguous 128 MB chunk).

--
Have a nice day, Timofey.
* Re: BTRFS for OLTP Databases
2017-02-07 15:43 UTC
From: Austin S. Hemmelgarn
To: Timofey Titovets; Cc: Peter Zaitsev, Hugo Mills, linux-btrfs

On 2017-02-07 10:20, Timofey Titovets wrote:
> I mean this: you have a 128 MB extent and you rewrite random 4 KB
> sectors; btrfs will not split the 128 MB extent and will not free up
> the data (I don't know the internal algorithm, so I can't predict when
> this will happen), and after some time btrfs will rebuild the extents
> and split the 128 MB extent into several smaller ones. But when you
> use compression, the allocator rebuilds extents much earlier [...]

The allocator has absolutely nothing to do with this; it's a function of the COW operation. Unless you're using nodatacow, that 128MB extent will get split the moment the data hits the storage device (either on the next commit cycle (at most 30 seconds with the default commit cycle), or when fdatasync is called, whichever is sooner).

In the case of compression, it's still one extent (although on disk it will be less than 128MB) and will be split at _exactly_ the same time under _exactly_ the same circumstances as an uncompressed extent. IOW, it has absolutely nothing to do with the extent handling either.

The difference arises in that compressed data effectively has an on-media block size of 128k, not 16k (the current default block size) or 4k (the old default). This means that the smallest fragment possible for a file with in-line compression enabled is 128k, while for a file without it, it's equal to the filesystem block size. A larger minimum fragment size means that the maximum number of fragments a given file can have is smaller (8 times smaller, in fact, than without compression when using the current default block size), which means that there will be less fragmentation. Some rather complex and tedious math indicates that this is not the _only_ thing improving performance when using in-line compression, but it's probably the biggest thing doing so for the workload being discussed.
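The eight-fold bound above can be spot-checked with shell arithmetic, using the 128k and 16k figures from the message (the 10 GiB file size is an arbitrary example):

```shell
# Maximum possible fragments = file size / smallest possible extent.
SIZE=$((10 * 1024 * 1024 * 1024))      # a 10 GiB database file

echo $((SIZE / (128 * 1024)))          # compressed (128k minimum): 81920
echo $((SIZE / (16 * 1024)))           # uncompressed (16k block): 655360
echo $(((SIZE / (16 * 1024)) / (SIZE / (128 * 1024))))   # ratio: 8
```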
* Re: BTRFS for OLTP Databases
2017-02-07 21:14 UTC
From: Kai Krakow
To: linux-btrfs

Am Tue, 7 Feb 2017 10:43:11 -0500 schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> The allocator has absolutely nothing to do with this; it's a function
> of the COW operation. Unless you're using nodatacow, that 128MB extent
> will get split the moment the data hits the storage device (either on
> the next commit cycle (at most 30 seconds with the default commit
> cycle), or when fdatasync is called, whichever is sooner). In the case
> of compression, it's still one extent (although on disk it will be
> less than 128MB) and will be split at _exactly_ the same time under
> _exactly_ the same circumstances as an uncompressed extent. IOW, it
> has absolutely nothing to do with the extent handling either.

I don't think that btrfs splits extents which are part of a snapshot. The extent in a snapshot will stay intact when writing to this extent in another snapshot. Of course, in the just-written snapshot, the extent will be represented as a split extent mapping to the original extent's data blocks plus the new data in the middle (thus resulting in three extents). This is also why small random writes without autodefrag result in a vast number of small extents, bringing the fs performance to a crawl.

Do that multiple times on multiple snapshots, delete some of the original snapshots, and you're left with slack space: data blocks that are inaccessible but won't be reclaimed into free space (because they are still part of the original extent), and which can only be reclaimed by a defrag operation - which, of course, unshares data.

Thus, if any of the above-mentioned small extents is still shared with an extent that was originally much bigger, it will still occupy its original space on the filesystem - even when its associated snapshot/subvolume no longer exists. Only when the last remaining tiny block of such an extent gets rewritten and the reference counter decreases to zero is the extent given up and freed.

To work around this, you can currently only unshare and recombine by doing defrag and dedupe on all snapshots. This will reclaim space sitting in parts of the original extents no longer referenced by a snapshot visible from the VFS layer.

This is for performance reasons, because btrfs is extent-based. As far as I know, ZFS works differently: it uses block-based storage for the snapshot feature and can easily throw away unused blocks. Only a second layer on top maps this back into extents. The underlying infrastructure, however, is block-based storage, which also enables the volume pool to create block devices on the fly out of ZFS storage space.

PS: All of the above given the fact I understood it right. ;-)

--
Regards, Kai
Replies to list-only preferred.
* Re: BTRFS for OLTP Databases
2017-02-07 16:22 UTC
From: Lionel Bouton
To: Peter Zaitsev, Hugo Mills, linux-btrfs

Hi Peter,

Le 07/02/2017 à 15:13, Peter Zaitsev a écrit :
> Hi Hugo,
>
> For the use case I'm looking at, I'm interested in having snapshot(s)
> open at all times. [...] In such a case (as I think you also describe)
> nodatacow does not provide any advantage.
>
> I have not seen autodefrag helping much, but I will try again. Is
> there any autodefrag documentation available about how it is expected
> to work and whether it can be tuned in any way?

There's not much that can be done if the same file is modified in 2 different subvolumes (typically the original and a R/W snapshot). You either break the reflink around the modification to limit the amount of fragmentation (which will use disk space and write I/O), or you get fragmentation on at least one subvolume (which will add seeks). So the only options are either to flatten the files (which can be done incrementally by defragmenting them on both sides when they change) or to defragment only the most-used volume (especially if the other is a relatively short-lived snapshot where performance won't degrade much until it is removed, and won't matter much).

I just modified our defragmentation scheduler to be aware of multiple subvolumes and to support ignoring some of them. The previous version (not tagged, sorry) was battle-tested on a Ceph cluster and was designed for it. Autodefrag didn't work with Ceph under our workload (latency went through the roof, OSDs were timing out requests, ...), and our scheduler, with some simple Ceph/BTRFS-related tunings, gave us even better performance than XFS (which is usually the recommended choice with current Ceph versions).

The current version is probably still rough around the edges, as it is brand new (most of the work was done last Sunday) and only running on a backup server with a situation not much different from yours: a large PostgreSQL slave (>50GB) which is snapshotted hourly and daily, with a daily snapshot used to start a PostgreSQL instance for "tests on real data" purposes, plus a copy of a <10TB NFS server with similar snapshots in place. All of this is on a single RAID10 13-14TB BTRFS. In our case, using autodefrag slowly degraded performance to the point where off-site backups became slow enough to warrant preventive measures.

The current scheduler looks for the mountpoints of top BTRFS volumes (so you have to mount the top volume somewhere), and defragments them, avoiding:
- read-only snapshots,
- all data below configurable subdirs (including read-write subvolumes, even if they are mounted elsewhere); see README.md for instructions.

It slowly walks all files eligible for defragmentation and, in parallel, detects writes to the same filesystem - including writes to read-write subvolumes mounted elsewhere - to trigger defragmentation. The scheduler uses an estimated "cost" for each file to prioritize defragmentation tasks, and with default settings it tries to keep I/O activity low enough that it doesn't slow down other tasks too much. However, it defragments files whole, which might put some strain on huge ibdata* files if you didn't switch to file-per-table. In our case, defragmenting 1GB files is OK and doesn't have a major impact.

We are already seeing better performance (our total daily backup time is below worrying levels again), and the scheduler hasn't even finished walking the whole filesystem (there are approximately 8 million files, and it is configured to evaluate them over a week). This is probably because it follows the most write-active files (which are in the PostgreSQL slave directory) and defragmented most of them early.

Note that it is tuned for filesystems using ~2TB 7200rpm drives (there are some options that will adapt it to subsystems with more I/O capacity). Using drives with different capacities shouldn't need tuning, but it probably will not work well on SSDs (it should be configured to speed up significantly).

See https://github.com/jtek/ceph-utils - you want btrfs-defrag-scheduler.rb

Some parameters are available (start it with --help). You should probably start it with --verbose, at least until you are comfortable with it, to get a list of which files are defragmented, along with many debug messages you probably want to ignore (or you'll probably have to read the Ruby code to fully understand what they mean).

I don't provide any warranty for it, but the worst I believe can happen is no performance improvement, or performance degradation until you stop it. If you don't blacklist read-write snapshots with the .no-defrag file (see README.md), defragmentation will probably eat more disk space than usual. Space usage will go up rapidly during defragmentation if you have snapshots; it is supposed to go down after all snapshots referring to fragmented files are removed and replaced by new snapshots (where fragmentation should be more stable).

Best regards,
Lionel
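Getting the scheduler running, per the pointers in the message (only --help and --verbose are named there; any other flags would need checking against the repository):

```shell
# Fetch the scheduler from the repository named above and inspect
# its options before letting it loose on a filesystem.
git clone https://github.com/jtek/ceph-utils
cd ceph-utils
ruby btrfs-defrag-scheduler.rb --help

# Run it verbosely until comfortable with what it chooses to defragment.
ruby btrfs-defrag-scheduler.rb --verbose
```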
* Re: BTRFS for OLTP Databases
2017-02-07 19:57 UTC
From: Roman Mamedov
To: Peter Zaitsev; Cc: Hugo Mills, linux-btrfs

On Tue, 7 Feb 2017 09:13:25 -0500 Peter Zaitsev <pz@percona.com> wrote:
> Hi Hugo,
>
> For the use case I'm looking at, I'm interested in having snapshot(s)
> open at all times. Imagine, for example, a snapshot being created every
> hour and several of these snapshots kept at all times, providing quick
> recovery points to the state of 1, 2, 3 hours ago. In such a case (as I
> think you also describe) nodatacow does not provide any advantage.

It still does provide some advantage, in that each write into a new area since the last hourly snapshot is going to be CoW'ed only once, as opposed to every new write getting CoW'ed every time no matter what.

I'm not sold on autodefrag; what I'd suggest instead is to schedule a regular defrag ("btrfs fi defrag") of the database files, e.g. daily. This may increase space usage temporarily, as it will partially unmerge extents previously shared across snapshots, but you won't get runaway fragmentation anymore, as you otherwise would without nodatacow or with periodic snapshotting.

--
With respect, Roman
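The scheduled-defrag suggestion above could look like this as a cron entry (path and time of day are hypothetical):

```shell
# /etc/cron.d/mysql-defrag: daily defragmentation of the database
# files, run off-peak; expect temporary extra space usage while
# previously snapshot-shared extents are unshared.
0 4 * * * root btrfs filesystem defragment -r /var/lib/mysql
```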
* Re: BTRFS for OLTP Databases
2017-02-07 20:36 UTC
From: Kai Krakow
To: linux-btrfs

Am Tue, 7 Feb 2017 09:13:25 -0500 schrieb Peter Zaitsev <pz@percona.com>:
> Hi Hugo,
>
> For the use case I'm looking at, I'm interested in having snapshot(s)
> open at all times. Imagine, for example, a snapshot being created every
> hour and several of these snapshots kept at all times, providing quick
> recovery points to the state of 1, 2, 3 hours ago. In such a case (as I
> think you also describe) nodatacow does not provide any advantage.

Out of curiosity, I see one problem here:

If you're doing snapshots of the live database, each snapshot leaves the database files as if the database had been killed in flight - like shutting the system down in the middle of writing data.

This is because I think there's no API for user space to subscribe to events like a snapshot - unlike, e.g., the VSS API (volume snapshot service) in Windows. You should put the database into a frozen state to prepare it for a hot copy before creating the snapshot, then ensure all data is flushed before continuing.

I think I've read that btrfs snapshots do not guarantee single-point-in-time snapshots - the snapshot may be smeared across a longer period of time while the kernel is still writing data. So parts of your writes may still end up in the snapshot after issuing the snapshot command, instead of in the working copy as expected.

How is this going to be addressed? Is there some snapshot-aware API to let user space subscribe to such events and do proper preparation? Is this planned? LVM could be a user of such an API, too. I think this could have nice enterprise-grade value for Linux.

XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But still, this also needs to be integrated with MySQL to work properly. I once (years ago) researched this, but gave up on my plans when I planned database backups for our web server infrastructure. We moved to creating SQL dumps instead, although there are binlogs which can be used to recover to a clean and stable transactional state after taking snapshots. But I simply didn't want to fiddle around with properly cleaning up binlogs, which accumulate horribly much space usage over time. The cleanup process requires creating a cold copy or dump of the complete database from time to time; only then is it safe to remove all binlogs up to that point in time.

--
Regards, Kai
Replies to list-only preferred.
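The "frozen state" hot copy described above could be approximated for MySQL by holding a global read lock across the snapshot; a hypothetical sketch (paths made up, and the lock only holds while the client session stays open, so all three steps must run in one session):

```shell
# FLUSH TABLES WITH READ LOCK quiesces writes and flushes tables;
# "system" is the mysql client's shell-escape command, used here so
# the snapshot is taken while the lock is still held.
mysql <<'EOF'
FLUSH TABLES WITH READ LOCK;
system btrfs subvolume snapshot -r /var/lib/mysql /snapshots/mysql-frozen
UNLOCK TABLES;
EOF
```

If the client's system command is unavailable in batch mode, the same sequence can be driven from an interactive session or a small wrapper that keeps the connection open.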
* Re: BTRFS for OLTP Databases
2017-02-07 20:44 UTC
From: Lionel Bouton
To: Kai Krakow, linux-btrfs

Le 07/02/2017 à 21:36, Kai Krakow a écrit :
> [...]
> I think I've read that btrfs snapshots do not guarantee single-point-in-time
> snapshots - the snapshot may be smeared across a longer period of
> time while the kernel is still writing data. So parts of your writes
> may still end up in the snapshot after issuing the snapshot command,
> instead of in the working copy as expected.

I don't think so, for three reasons:
- it's so far from an admin's expectations that someone would have documented this in "man btrfs-subvolume",
- the CoW nature of Btrfs makes this trivial: it only has to keep old versions of the data and the corresponding tree for it to work, instead of unlinking them,
- the backup server I referred to has restarted a PostgreSQL system from snapshots about one thousand times now without a single problem, while being almost continuously updated by streaming replication.

Lionel
* Re: BTRFS for OLTP Databases 2017-02-07 20:36 ` Kai Krakow 2017-02-07 20:44 ` Lionel Bouton @ 2017-02-07 20:47 ` Austin S. Hemmelgarn 2017-02-07 21:25 ` Lionel Bouton [not found] ` <b0de25a7-989e-d16a-2ce6-2b6c1edde08b@gmail.com> 1 sibling, 2 replies; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-02-07 20:47 UTC (permalink / raw) To: linux-btrfs On 2017-02-07 15:36, Kai Krakow wrote: > Am Tue, 7 Feb 2017 09:13:25 -0500 > schrieb Peter Zaitsev <pz@percona.com>: > >> Hi Hugo, >> >> For the use case I'm looking for I'm interested in having snapshot(s) >> open at all time. Imagine for example snapshot being created every >> hour and several of these snapshots kept at all time providing quick >> recovery points to the state of 1,2,3 hours ago. In such case (as I >> think you also describe) nodatacow does not provide any advantage. > > Out of curiosity, I see one problem here: > > If you're doing snapshots of the live database, each snapshot leaves > the database files like killing the database in-flight. Like shutting > the system down in the middle of writing data. > > This is because I think there's no API for user space to subscribe to > events like a snapshot - unlike e.g. the VSS API (volume snapshot > service) in Windows. You should put the database into frozen state to > prepare it for a hotcopy before creating the snapshot, then ensure all > data is flushed before continuing. Correct. > > I think I've read that btrfs snapshots do not guarantee single point in > time snapshots - the snapshot may be smeared across a longer period of > time while the kernel is still writing data. So parts of your writes > may still end up in the snapshot after issuing the snapshot command, > instead of in the working copy as expected. Also correct AFAICT, and this needs to be better documented (for most people, the term snapshot implies atomicity of the operation). > > How is this going to be addressed? 
Is there some snapshot aware API to > let user space subscribe to such events and do proper preparation? Is > this planned? LVM could be a user of such an API, too. I think this > could have nice enterprise-grade value for Linux.

Ideally, such an API should be in the VFS layer, not just BTRFS. Reflinking exists in other filesystems already, it's only a matter of time before they decide to do snapshotting too.

> > XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But > still, also this needs to be integrated with MySQL to properly work. I > once (years ago) researched on this but gave up on my plans when I > planned database backups for our web server infrastructure. We moved to > creating SQL dumps instead, although there're binlogs which can be used > to recover to a clean and stable transactional state after taking > snapshots. But I simply didn't want to fiddle around with properly > cleaning up binlogs which accumulate horribly much space usage over > time. The cleanup process requires to create a cold copy or dump of the > complete database from time to time, only then it's safe to remove all > binlogs up to that point in time.

Sadly, freezefs (the generic interface based off of xfs_freeze) only works for block device snapshots. Filesystem level snapshots need the application software to sync all its data and then stop writing until the snapshot is complete. As of right now, the sanest way I can come up with for a database server is to find a way to do a point-in-time SQL dump of the database (this also has the advantage that it works as a backup, and decouples you from the backing storage format). ^ permalink raw reply [flat|nested] 42+ messages in thread
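The freeze-then-snapshot sequence discussed above can be approximated today from user space. A minimal, hedged sketch only: it assumes MySQL with InnoDB, a btrfs subvolume at /var/lib/mysql, and a /snapshots directory — all assumptions for illustration, not details from the thread:

```shell
#!/bin/sh
# Crude "quiesce, snapshot, resume" sketch; requires root, a running
# mysqld, and /var/lib/mysql being its own btrfs subvolume.
set -e

# Take the global read lock and hold it by keeping the client session
# open in the background (the lock drops when the session ends).
mysql -e 'FLUSH TABLES WITH READ LOCK; SELECT SLEEP(60);' &
lock_session=$!
sleep 2   # crude; a production script should verify the lock is held

sync      # flush dirty pages before snapshotting
btrfs subvolume snapshot -r /var/lib/mysql \
    "/snapshots/mysql-$(date +%Y%m%d-%H%M%S)"

# End the background session, releasing the read lock.
kill "$lock_session"
```

This is not a substitute for the VSS-style coordination API being asked for, but it narrows the window in which writes and the snapshot can interleave.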
* Re: BTRFS for OLTP Databases 2017-02-07 20:47 ` Austin S. Hemmelgarn @ 2017-02-07 21:25 ` Lionel Bouton 2017-02-07 21:35 ` Kai Krakow [not found] ` <b0de25a7-989e-d16a-2ce6-2b6c1edde08b@gmail.com> 1 sibling, 1 reply; 42+ messages in thread From: Lionel Bouton @ 2017-02-07 21:25 UTC (permalink / raw) To: Austin S. Hemmelgarn, linux-btrfs Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit : > On 2017-02-07 15:36, Kai Krakow wrote: >> Am Tue, 7 Feb 2017 09:13:25 -0500 >> schrieb Peter Zaitsev <pz@percona.com>: >> >>> Hi Hugo, >>> >>> For the use case I'm looking for I'm interested in having snapshot(s) >>> open at all time. Imagine for example snapshot being created every >>> hour and several of these snapshots kept at all time providing quick >>> recovery points to the state of 1,2,3 hours ago. In such case (as I >>> think you also describe) nodatacow does not provide any advantage. >> >> Out of curiosity, I see one problem here: >> >> If you're doing snapshots of the live database, each snapshot leaves >> the database files like killing the database in-flight. Like shutting >> the system down in the middle of writing data. >> >> This is because I think there's no API for user space to subscribe to >> events like a snapshot - unlike e.g. the VSS API (volume snapshot >> service) in Windows. You should put the database into frozen state to >> prepare it for a hotcopy before creating the snapshot, then ensure all >> data is flushed before continuing. > Correct. >> >> I think I've read that btrfs snapshots do not guarantee single point in >> time snapshots - the snapshot may be smeared across a longer period of >> time while the kernel is still writing data. So parts of your writes >> may still end up in the snapshot after issuing the snapshot command, >> instead of in the working copy as expected. > Also correct AFAICT, and this needs to be better documented (for most > people, the term snapshot implies atomicity of the operation). Atomicity can be a relative term. 
If the snapshot atomicity is relative to barriers, but not relative to individual writes between barriers, then AFAICT it's fine, because the filesystem doesn't make any promise it won't keep, even in the context of its snapshots.

Consider a power loss: the filesystem's atomicity guarantees can't go beyond what the hardware guarantees, which means not all in-flight writes will reach the disk, and partial writes can happen. Modern filesystems will remain consistent though, and if an application using them makes use of f*sync it can provide its own guarantees too. The same should apply to snapshots: each in-flight write may or may not complete on disk before the snapshot; what matters is that both the snapshot and these writes will be completed after the next barrier (and any robust application will ignore any in-flight writes it finds in the snapshot if they were part of a batch that should have been atomically committed).

This is why, AFAIK, PostgreSQL or MySQL with their default ACID-compliant configuration will recover from a BTRFS snapshot the same way they recover from a power loss.

Lionel ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-07 21:25 ` Lionel Bouton @ 2017-02-07 21:35 ` Kai Krakow 2017-02-07 22:27 ` Hans van Kranenburg 2017-02-08 19:08 ` Goffredo Baroncelli 0 siblings, 2 replies; 42+ messages in thread From: Kai Krakow @ 2017-02-07 21:35 UTC (permalink / raw) To: linux-btrfs Am Tue, 7 Feb 2017 22:25:29 +0100 schrieb Lionel Bouton <lionel-subscription@bouton.name>: > Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit : > > On 2017-02-07 15:36, Kai Krakow wrote: > >> Am Tue, 7 Feb 2017 09:13:25 -0500 > >> schrieb Peter Zaitsev <pz@percona.com>: > >> > [...] > >> > >> Out of curiosity, I see one problem here: > >> > >> If you're doing snapshots of the live database, each snapshot > >> leaves the database files like killing the database in-flight. > >> Like shutting the system down in the middle of writing data. > >> > >> This is because I think there's no API for user space to subscribe > >> to events like a snapshot - unlike e.g. the VSS API (volume > >> snapshot service) in Windows. You should put the database into > >> frozen state to prepare it for a hotcopy before creating the > >> snapshot, then ensure all data is flushed before continuing. > > Correct. > >> > >> I think I've read that btrfs snapshots do not guarantee single > >> point in time snapshots - the snapshot may be smeared across a > >> longer period of time while the kernel is still writing data. So > >> parts of your writes may still end up in the snapshot after > >> issuing the snapshot command, instead of in the working copy as > >> expected. > > Also correct AFAICT, and this needs to be better documented (for > > most people, the term snapshot implies atomicity of the > > operation). > > Atomicity can be a relative term. If the snapshot atomicity is > relative to barriers but not relative to individual writes between > barriers then AFAICT it's fine because the filesystem doesn't make > any promise it won't keep even in the context of its snapshots. 
> Consider a power loss : the filesystems atomicity guarantees can't go > beyond what the hardware guarantees which means not all current in fly > write will reach the disk and partial writes can happen. Modern > filesystems will remain consistent though and if an application using > them makes uses of f*sync it can provide its own guarantees too. The > same should apply to snapshots : all the writes in fly can complete or > not on disk before the snapshot what matters is that both the snapshot > and these writes will be completed after the next barrier (and any > robust application will ignore all the in fly writes it finds in the > snapshot if they were part of a batch that should be atomically > commited). > > This is why AFAIK PostgreSQL or MySQL with their default ACID > compliant configuration will recover from a BTRFS snapshot in the > same way they recover from a power loss.

This is what I meant in my other reply. But this is also why it should be documented. Wrongly assuming that snapshots are single-point-in-time snapshots can have horrible side effects one wouldn't expect.

Taking a snapshot is like a power loss - even though there is no power loss. So the database has to be properly configured. It is simply short-sighted if you don't think about this fact. The documentation should really point that fact out.

-- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-07 21:35 ` Kai Krakow @ 2017-02-07 22:27 ` Hans van Kranenburg 2017-02-08 19:08 ` Goffredo Baroncelli 1 sibling, 0 replies; 42+ messages in thread From: Hans van Kranenburg @ 2017-02-07 22:27 UTC (permalink / raw) To: Kai Krakow, linux-btrfs On 02/07/2017 10:35 PM, Kai Krakow wrote: > Am Tue, 7 Feb 2017 22:25:29 +0100 > schrieb Lionel Bouton <lionel-subscription@bouton.name>: > >> Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit : >>> On 2017-02-07 15:36, Kai Krakow wrote: >>>> Am Tue, 7 Feb 2017 09:13:25 -0500 >>>> schrieb Peter Zaitsev <pz@percona.com>: >>>> >> [...] >>>> >>>> Out of curiosity, I see one problem here: >>>> >>>> If you're doing snapshots of the live database, each snapshot >>>> leaves the database files like killing the database in-flight. >>>> Like shutting the system down in the middle of writing data. >>>> >>>> This is because I think there's no API for user space to subscribe >>>> to events like a snapshot - unlike e.g. the VSS API (volume >>>> snapshot service) in Windows. You should put the database into >>>> frozen state to prepare it for a hotcopy before creating the >>>> snapshot, then ensure all data is flushed before continuing. >>> Correct. >>>> >>>> I think I've read that btrfs snapshots do not guarantee single >>>> point in time snapshots - the snapshot may be smeared across a >>>> longer period of time while the kernel is still writing data. So >>>> parts of your writes may still end up in the snapshot after >>>> issuing the snapshot command, instead of in the working copy as >>>> expected. >>> Also correct AFAICT, and this needs to be better documented (for >>> most people, the term snapshot implies atomicity of the >>> operation). >> >> Atomicity can be a relative term. 
If the snapshot atomicity is >> relative to barriers but not relative to individual writes between >> barriers then AFAICT it's fine because the filesystem doesn't make >> any promise it won't keep even in the context of its snapshots. >> Consider a power loss : the filesystems atomicity guarantees can't go >> beyond what the hardware guarantees which means not all current in fly >> write will reach the disk and partial writes can happen. Modern >> filesystems will remain consistent though and if an application using >> them makes uses of f*sync it can provide its own guarantees too. The >> same should apply to snapshots : all the writes in fly can complete or >> not on disk before the snapshot what matters is that both the snapshot >> and these writes will be completed after the next barrier (and any >> robust application will ignore all the in fly writes it finds in the >> snapshot if they were part of a batch that should be atomically >> commited). >> >> This is why AFAIK PostgreSQL or MySQL with their default ACID >> compliant configuration will recover from a BTRFS snapshot in the >> same way they recover from a power loss. > > This is what I meant in my other reply. But this is also why it should > be documented. Wrongly implying that snapshots are single point in time > snapshots is a wrong assumption with possibly horrible side effects one > wouldn't expect. It depends on what the definition of time is. (whoa!!) A snapshot is taken of a single point in the lifetime of a filesystem tree (a generation, the point where a transaction commits)...? > Taking a snapshot is like a power loss - even tho there is no power > loss. So the database has to be properly configured. It is simply short > sighted if you don't think about this fact. The documentation should > really point that fact out. I'd almost say that it would be short sighted to assume a btrfs snapshot would *not* behave like a power loss. 
At least, to me (thinking as a sysadmin) it feels really weird to think of it in any other way than that. Oh wait, that's what you mean, or not? What is the thing that the documentation should point out? I'm not trying to troll; the piled-up double negations make this discussion a bit hard to read. Moo -- Hans van Kranenburg ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-07 21:35 ` Kai Krakow 2017-02-07 22:27 ` Hans van Kranenburg @ 2017-02-08 19:08 ` Goffredo Baroncelli 1 sibling, 0 replies; 42+ messages in thread From: Goffredo Baroncelli @ 2017-02-08 19:08 UTC (permalink / raw) To: Kai Krakow, linux-btrfs On 2017-02-07 22:35, Kai Krakow wrote: [...] >> >> Atomicity can be a relative term. If the snapshot atomicity is >> relative to barriers but not relative to individual writes between >> barriers then AFAICT it's fine because the filesystem doesn't make >> any promise it won't keep even in the context of its snapshots. >> Consider a power loss : the filesystems atomicity guarantees can't go >> beyond what the hardware guarantees which means not all current in fly >> write will reach the disk and partial writes can happen. Modern >> filesystems will remain consistent though and if an application using >> them makes uses of f*sync it can provide its own guarantees too. The >> same should apply to snapshots : all the writes in fly can complete or >> not on disk before the snapshot what matters is that both the snapshot >> and these writes will be completed after the next barrier (and any >> robust application will ignore all the in fly writes it finds in the >> snapshot if they were part of a batch that should be atomically >> commited). >> >> This is why AFAIK PostgreSQL or MySQL with their default ACID >> compliant configuration will recover from a BTRFS snapshot in the >> same way they recover from a power loss. > > This is what I meant in my other reply. But this is also why it should > be documented. Wrongly implying that snapshots are single point in time > snapshots is a wrong assumption with possibly horrible side effects one > wouldn't expect.

I don't understand what you are saying. Until now, my understanding was that "all the writes which were passed to btrfs before the snapshot time are in the snapshot; the ones after are not". Am I wrong?
What are the other possible interpretations? [..] -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <b0de25a7-989e-d16a-2ce6-2b6c1edde08b@gmail.com>]
* Re: BTRFS for OLTP Databases [not found] ` <b0de25a7-989e-d16a-2ce6-2b6c1edde08b@gmail.com> @ 2017-02-13 12:44 ` Austin S. Hemmelgarn 2017-02-13 17:16 ` linux-btrfs 0 siblings, 1 reply; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-02-13 12:44 UTC (permalink / raw) To: linux-btrfs On 2017-02-09 22:58, Andrei Borzenkov wrote: > 07.02.2017 23:47, Austin S. Hemmelgarn пишет: > ... >> Sadly, freezefs (the generic interface based off of xfs_freeze) only >> works for block device snapshots. Filesystem level snapshots need the >> application software to sync all it's data and then stop writing until >> the snapshot is complete. >> > > I expect databases to be using directio, otherwise we have problems even > without using snapshots. Is it still an issue with directio? It is less of an issue, but it's still an issue because you can still call for snapshot creation in the middle of an application I/O request. In other words, the application wouldn't need to worry about syncing data, but it would need to worry about making sure it's not actually writing anything when the snapshot happens. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-13 12:44 ` Austin S. Hemmelgarn @ 2017-02-13 17:16 ` linux-btrfs 0 siblings, 0 replies; 42+ messages in thread From: linux-btrfs @ 2017-02-13 17:16 UTC (permalink / raw) To: Austin S. Hemmelgarn, linux-btrfs W dniu 2017-02-13 o 13:44 PM, Austin S. Hemmelgarn pisze: > On 2017-02-09 22:58, Andrei Borzenkov wrote: >> 07.02.2017 23:47, Austin S. Hemmelgarn пишет: >> ... >>> Sadly, freezefs (the generic interface based off of xfs_freeze) only >>> works for block device snapshots. Filesystem level snapshots need the >>> application software to sync all it's data and then stop writing until >>> the snapshot is complete. >>> >> >> I expect databases to be using directio, otherwise we have problems even >> without using snapshots. Is it still an issue with directio? > It is less of an issue, but it's still an issue because you can still > call for snapshot creation in the middle of an application I/O > request. In other words, the application wouldn't need to worry about > syncing data, but it would need to worry about making sure it's not > actually writing anything when the snapshot happens. >

I think this should work the other way around. The snapshot should wait until all in-progress directio writes are done, and new requests issued while the snapshot is being created should wait until the snapshot is done.

-- Adrian Brzeziński ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-07 14:00 ` Hugo Mills 2017-02-07 14:13 ` Peter Zaitsev @ 2017-02-07 19:31 ` Peter Zaitsev 2017-02-07 19:50 ` Austin S. Hemmelgarn 2017-02-08 2:11 ` Peter Zaitsev 2 siblings, 1 reply; 42+ messages in thread From: Peter Zaitsev @ 2017-02-07 19:31 UTC (permalink / raw) To: Hugo Mills, Peter Zaitsev, linux-btrfs Hi Hugo,

As I re-read it closely (and also other comments in the thread) I now understand there is a difference in how nodatacow works even when snapshots are in place.

On autodefrag, I wonder if there is some more detailed documentation about how autodefrag works. The manual https://btrfs.wiki.kernel.org/index.php/Mount_options has a very general statement. What does "detect random IO" really mean? It also talks about defragmenting the file - is it really the whole file that gets triggered for defrag, or is the defrag local? I.e., I would understand it if, as writes happen, the 1MB block is checked and, if it has more than X fragments, it is defragmented, or something like that.

Also, does autodefrag work with nodatacow (i.e. with snapshots), or are these exclusive?

> > There's another approach which might be worth testing, which is to > use autodefrag. This will increase data write I/O, because where you > have one or more small writes in a region, it will also read and write > the data in a small neghbourhood around those writes, so the > fragmentation is reduced. This will improve subsequent read > performance. > > I could also suggest getting the latest kernel you can -- 16.04 is > already getting on for a year old, and there may be performance > improvements in upstream kernels which affect your workload. There's > an Ubuntu kernel PPA you can use to get the new kernels without too > much pain. > > > ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-07 19:31 ` Peter Zaitsev @ 2017-02-07 19:50 ` Austin S. Hemmelgarn 2017-02-07 20:19 ` Kai Krakow 0 siblings, 1 reply; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-02-07 19:50 UTC (permalink / raw) To: Peter Zaitsev, Hugo Mills, linux-btrfs On 2017-02-07 14:31, Peter Zaitsev wrote: > Hi Hugo, > > As I re-read it closely (and also other comments in the thread) I know > understand there is a difference how nodatacow works even if snapshot are > in place. > > On autodefrag I wonder is there some more detailed documentation about how > autodefrag works. > > The manual https://btrfs.wiki.kernel.org/index.php/Mount_options has > very general statement. > > What does "detect random IO" really means ? It also talks about > defragmenting the file - is i really about the whole file which is > triggered for defrag or is defrag locally ? Ie I would understand what > as writes happen the 1MB block is checked and if it is more than X > fragments it is defragmented or something like that. I don't know the exact algorithm, but I'm pretty sure it's similar to what bcache uses to bypass the cache device for sequential I/O. In essence, it's going to trigger for database usage. > > Also does autodefrag works with nodatacow (ie with snapshot) or are these > exclusive ? I'm not sure about this one. I would assume based on the fact that many other things don't work with nodatacow and that regular defrag doesn't work on files which are currently mapped as executable code that it does not, but I could be completely wrong about this too. > > >> >> There's another approach which might be worth testing, which is to >> use autodefrag. This will increase data write I/O, because where you >> have one or more small writes in a region, it will also read and write >> the data in a small neghbourhood around those writes, so the >> fragmentation is reduced. This will improve subsequent read >> performance. 
>> >> I could also suggest getting the latest kernel you can -- 16.04 is >> already getting on for a year old, and there may be performance >> improvements in upstream kernels which affect your workload. There's >> an Ubuntu kernel PPA you can use to get the new kernels without too >> much pain. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-07 19:50 ` Austin S. Hemmelgarn @ 2017-02-07 20:19 ` Kai Krakow 2017-02-07 20:27 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 42+ messages in thread From: Kai Krakow @ 2017-02-07 20:19 UTC (permalink / raw) To: linux-btrfs Am Tue, 7 Feb 2017 14:50:04 -0500 schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>: > > Also does autodefrag works with nodatacow (ie with snapshot) or > > are these exclusive ? > I'm not sure about this one. I would assume based on the fact that > many other things don't work with nodatacow and that regular defrag > doesn't work on files which are currently mapped as executable code > that it does not, but I could be completely wrong about this too.

Technically, there's nothing that prevents autodefrag from working for nodatacow files. The question is: is it really necessary? Standard file systems also have no autodefrag, it's not an issue there because they are essentially nodatacow. Simply defrag the database file once and you're done. Transactional MySQL uses huge data files, probably preallocated. It should simply work with nodatacow.

On the other hand: Using snapshots clearly introduces fragmentation over time. If autodefrag kicks in (assuming it is supported for nodatacow), it will slowly unshare all data over time. This somehow defeats the purpose of having snapshots in the first place for this scenario.

In conclusion, I'd recommend running some maintenance scripts from time to time, one to re-share identical blocks, and one to defragment the current workspace.

The bees daemon comes to mind here... I haven't tried it but it sounds like it could fill a gap here: https://github.com/Zygo/bees

Another option comes to mind: XFS now supports shared-extents copies. You could simply do a cold copy of the database with this feature resulting in the same effect as a snapshot, without seeing the other performance problems of btrfs. 
Though, the fragmentation issue would remain, and I think there's no dedupe application for XFS yet. -- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 42+ messages in thread
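The periodic maintenance suggested above could be scripted roughly as follows. A hedged sketch only: the paths are assumptions, and duperemove is named merely as one existing btrfs dedupe tool (bees, linked above, is another):

```shell
# Defragment the live working copy; note this may unshare extents
# currently shared with snapshots, temporarily growing space usage.
btrfs filesystem defragment -r -t 1M /var/lib/mysql

# Afterwards, re-share identical blocks between the working copy and
# the snapshots (duperemove's -d flag actually submits the dedupes).
duperemove -dr /var/lib/mysql /snapshots
```

Ordering matters here: defragmenting first and de-duplicating second keeps the two passes from permanently fighting each other within one maintenance window.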
* Re: BTRFS for OLTP Databases 2017-02-07 20:19 ` Kai Krakow @ 2017-02-07 20:27 ` Austin S. Hemmelgarn 2017-02-07 20:54 ` Kai Krakow 0 siblings, 1 reply; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-02-07 20:27 UTC (permalink / raw) To: linux-btrfs On 2017-02-07 15:19, Kai Krakow wrote: > Am Tue, 7 Feb 2017 14:50:04 -0500 > schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>: > >>> Also does autodefrag works with nodatacow (ie with snapshot) or >>> are these exclusive ? >> I'm not sure about this one. I would assume based on the fact that >> many other things don't work with nodatacow and that regular defrag >> doesn't work on files which are currently mapped as executable code >> that it does not, but I could be completely wrong about this too. > > Technically, there's nothing that prevents autodefrag to work for > nodatacow files. The question is: is it really necessary? Standard file > systems also have no autodefrag, it's not an issue there because they > are essentially nodatacow. Simply defrag the database file once and > you're done. Transactional MySQL uses huge data files, probably > preallocated. It should simply work with nodatacow.

The thing is, I don't have enough knowledge of how defrag is implemented in BTRFS to say for certain that it doesn't use COW semantics somewhere (and I would actually expect it to do so, since that in theory makes many things _much_ easier to handle), and if it uses COW somewhere, then it by definition doesn't work on NOCOW files. 
> > The bees daemon comes into mind here... I haven't tried it but it > sounds like it could fill a gap here: > > https://github.com/Zygo/bees > > Another option comes into mind: XFS now supports shared-extents > copies. You could simply do a cold copy of the database with this > feature resulting in the same effect as a snapshot, without seeing the > other performance problems of btrfs. Tho, the fragmentation issue would > remain, and I think there's no dedupe application for XFS yet. There isn't, but cp --reflink=auto with a reasonably recent version of coreutils should be able to reflink the file properly. ^ permalink raw reply [flat|nested] 42+ messages in thread
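The reflink copy mentioned above is easy to demonstrate. A small sketch: with `--reflink=auto`, cp makes a shared-extent copy where the filesystem supports it (btrfs, and now XFS) and silently falls back to an ordinary data copy elsewhere, so the command is safe to script unconditionally:

```shell
# Make a (possibly shared-extent) copy of a pretend database file.
workdir=$(mktemp -d)
printf 'pretend database pages\n' > "$workdir/db.ibd"
cp --reflink=auto "$workdir/db.ibd" "$workdir/db-copy.ibd"
cmp "$workdir/db.ibd" "$workdir/db-copy.ibd" && echo "copies match"
```

Use `--reflink=always` instead if you want the copy to fail loudly when the filesystem cannot actually share extents.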
* Re: BTRFS for OLTP Databases 2017-02-07 20:27 ` Austin S. Hemmelgarn @ 2017-02-07 20:54 ` Kai Krakow 2017-02-08 12:12 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 42+ messages in thread From: Kai Krakow @ 2017-02-07 20:54 UTC (permalink / raw) To: linux-btrfs Am Tue, 7 Feb 2017 15:27:34 -0500 schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>: > >> I'm not sure about this one. I would assume based on the fact that > >> many other things don't work with nodatacow and that regular defrag > >> doesn't work on files which are currently mapped as executable code > >> that it does not, but I could be completely wrong about this too. > > > > Technically, there's nothing that prevents autodefrag to work for > > nodatacow files. The question is: is it really necessary? Standard > > file systems also have no autodefrag, it's not an issue there > > because they are essentially nodatacow. Simply defrag the database > > file once and you're done. Transactional MySQL uses huge data > > files, probably preallocated. It should simply work with > > nodatacow. > The thing is, I don't have enough knowledge of how defrag is > implemented in BTRFS to say for certain that ti doesn't use COW > semantics somewhere (and I would actually expect it to do so, since > that in theory makes many things _much_ easier to handle), and if it > uses COW somewhere, then it by definition doesn't work on NOCOW files. A dev would be needed on this. But from a non-dev point of view, the defrag operation itself is CoW: Blocks are rewritten to another location in contiguous order. Only metadata CoW should be needed for this operation. It should be nothing else than writing to a nodatacow snapshot... Just that the snapshot is more or less implicit and temporary. Hmm? *curious* -- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-07 20:54 ` Kai Krakow @ 2017-02-08 12:12 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-02-08 12:12 UTC (permalink / raw) To: linux-btrfs On 2017-02-07 15:54, Kai Krakow wrote: > Am Tue, 7 Feb 2017 15:27:34 -0500 > schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>: > >>>> I'm not sure about this one. I would assume based on the fact that >>>> many other things don't work with nodatacow and that regular defrag >>>> doesn't work on files which are currently mapped as executable code >>>> that it does not, but I could be completely wrong about this too. >>> >>> Technically, there's nothing that prevents autodefrag to work for >>> nodatacow files. The question is: is it really necessary? Standard >>> file systems also have no autodefrag, it's not an issue there >>> because they are essentially nodatacow. Simply defrag the database >>> file once and you're done. Transactional MySQL uses huge data >>> files, probably preallocated. It should simply work with >>> nodatacow. >> The thing is, I don't have enough knowledge of how defrag is >> implemented in BTRFS to say for certain that ti doesn't use COW >> semantics somewhere (and I would actually expect it to do so, since >> that in theory makes many things _much_ easier to handle), and if it >> uses COW somewhere, then it by definition doesn't work on NOCOW files. > > A dev would be needed on this. But from a non-dev point of view, the > defrag operation itself is CoW: Blocks are rewritten to another > location in contiguous order. Only metadata CoW should be needed for > this operation. > > It should be nothing else than writing to a nodatacow snapshot... Just > that the snapshot is more or less implicit and temporary. > > Hmm? *curious* > The gimmicky part though is that the file has to remain accessible throughout the entire operation, and the defrag can't lose changes that occur while the file is being defragmented. 
In many filesystems (NTFS on Windows for example), a defrag functions similarly to a pvmove operation in LVM, as each extent gets moved, writes to that region get indirected to the new location and treat the areas that were written to as having been moved already. The thing is, on BTRFS that would result in extents getting split, which means COW is probably involved at some level in the data path too. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-07 14:00 ` Hugo Mills 2017-02-07 14:13 ` Peter Zaitsev 2017-02-07 19:31 ` Peter Zaitsev @ 2017-02-08 2:11 ` Peter Zaitsev 2017-02-08 12:14 ` Martin Raiber 2 siblings, 1 reply; 42+ messages in thread From: Peter Zaitsev @ 2017-02-08 2:11 UTC (permalink / raw) To: linux-btrfs Hi Kai,

I guess your message did not make it to me, as I'm not subscribed to the list.

I totally understand that the snapshot is "crash consistent" - consistent with the state of the disk you would find if the power were cut with no notice. For many applications this is a problem; however, it is fine for many databases, which already need to be able to recover correctly from power loss. For MySQL this works well for the InnoDB storage engine; it does not work for MyISAM.

The great thing about such an "uncoordinated" snapshot is that it is instant and has very little production impact - if you want to "freeze" multiple filesystems, or even worse flush MyISAM tables, it can take a lot of time and can be unacceptable for many 24/7 workloads.

Or are you saying BTRFS snapshots do not provide this kind of consistency?

> Hi Hugo, > > For the use case I'm looking for I'm interested in having snapshot(s) > open at all time. Imagine for example snapshot being created every > hour and several of these snapshots kept at all time providing quick > recovery points to the state of 1,2,3 hours ago. In such case (as I > think you also describe) nodatacow does not provide any advantage. Out of curiosity, I see one problem here: If you're doing snapshots of the live database, each snapshot leaves the database files like killing the database in-flight. Like shutting the system down in the middle of writing data. This is because I think there's no API for user space to subscribe to events like a snapshot - unlike e.g. the VSS API (volume snapshot service) in Windows. 
You should put the database into frozen state to prepare it for a hotcopy before creating the snapshot, then ensure all data is flushed before continuing. I think I've read that btrfs snapshots do not guarantee single point in time snapshots - the snapshot may be smeared across a longer period of time while the kernel is still writing data. So parts of your writes may still end up in the snapshot after issuing the snapshot command, instead of in the working copy as expected. How is this going to be addressed? Is there some snapshot aware API to let user space subscribe to such events and do proper preparation? Is this planned? LVM could be a user of such an API, too. I think this could have nice enterprise-grade value for Linux. XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But still, also this needs to be integrated with MySQL to properly work. I once (years ago) researched on this but gave up on my plans when I planned database backups for our web server infrastructure. We moved to creating SQL dumps instead, although there're binlogs which can be used to recover to a clean and stable transactional state after taking snapshots. But I simply didn't want to fiddle around with properly cleaning up binlogs which accumulate horribly much space usage over time. The cleanup process requires to create a cold copy or dump of the complete database from time to time, only then it's safe to remove all binlogs up to that point in time. -- Regards, Kai On Tue, Feb 7, 2017 at 9:00 AM, Hugo Mills <hugo@carfax.org.uk> wrote: > On Tue, Feb 07, 2017 at 08:53:35AM -0500, Peter Zaitsev wrote: >> Hi, >> >> I have tried BTRFS from Ubuntu 16.04 LTS for write intensive OLTP MySQL >> Workload. >> >> It did not go very well ranging from multi-seconds stalls where no >> transactions are completed to the finally kernel OOPS with "no space left >> on device" error message and filesystem going read only. 
>> >> I'm complete newbie in BTRFS so I assume I'm doing something wrong. >> >> Do you have any advice on how BTRFS should be tuned for OLTP workload >> (large files having a lot of random writes) ? Or is this the case where >> one should simply stay away from BTRFS and use something else ? >> >> One item recommended in some places is "nodatacow" this however defeats >> the main purpose I'm looking at BTRFS - I am interested in "free" >> snapshots which look very attractive to use for database recovery scenarios >> allow instant rollback to the previous state. > > Well, nodatacow will still allow snapshots to work, but it also > allows the data to fragment. Each snapshot made will cause subsequent > writes to shared areas to be CoWed once (and then it reverts to > unshared and nodatacow again). > > There's another approach which might be worth testing, which is to > use autodefrag. This will increase data write I/O, because where you > have one or more small writes in a region, it will also read and write > the data in a small neghbourhood around those writes, so the > fragmentation is reduced. This will improve subsequent read > performance. > > I could also suggest getting the latest kernel you can -- 16.04 is > already getting on for a year old, and there may be performance > improvements in upstream kernels which affect your workload. There's > an Ubuntu kernel PPA you can use to get the new kernels without too > much pain. > > Hugo. > > -- > Hugo Mills | I don't care about "it works on my machine". We are > hugo@... carfax.org.uk | not shipping your machine. > http://carfax.org.uk/ | > PGP: E2AB1DE4 | -- Peter Zaitsev, CEO, Percona Tel: +1 888 401 3401 ext 7360 Skype: peter_zaitsev ^ permalink raw reply [flat|nested] 42+ messages in thread
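The uncoordinated, crash-consistent snapshot Peter argues for is a single command; a minimal sketch, assuming the MySQL datadir lives on its own btrfs subvolume (all paths and naming are assumptions):

```shell
#!/bin/sh
# Sketch: take an instant, read-only btrfs snapshot of a running MySQL
# datadir. On restore, InnoDB performs its normal crash recovery; MyISAM
# tables are NOT safe this way. All paths are illustrative assumptions.
set -e
SUBVOL=/var/lib/mysql        # assumed to be a dedicated btrfs subvolume
SNAPDIR=/snapshots
TS=$(date +%Y%m%d-%H%M%S)

btrfs subvolume snapshot -r "$SUBVOL" "$SNAPDIR/mysql-$TS"
echo "snapshot created: $SNAPDIR/mysql-$TS"
```

Keeping several of these around gives the hourly rollback points described earlier, at the cost of CoW fragmentation on the live copy.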
* Re: BTRFS for OLTP Databases 2017-02-08 2:11 ` Peter Zaitsev @ 2017-02-08 12:14 ` Martin Raiber 2017-02-08 13:00 ` Adrian Brzezinski 2017-02-08 13:08 ` Austin S. Hemmelgarn 0 siblings, 2 replies; 42+ messages in thread From: Martin Raiber @ 2017-02-08 12:14 UTC (permalink / raw) To: Peter Zaitsev, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2780 bytes --] Hi, On 08.02.2017 03:11 Peter Zaitsev wrote: > Out of curiosity, I see one problem here: > If you're doing snapshots of the live database, each snapshot leaves > the database files like killing the database in-flight. Like shutting > the system down in the middle of writing data. > > This is because I think there's no API for user space to subscribe to > events like a snapshot - unlike e.g. the VSS API (volume snapshot > service) in Windows. You should put the database into frozen state to > prepare it for a hotcopy before creating the snapshot, then ensure all > data is flushed before continuing. > > I think I've read that btrfs snapshots do not guarantee single point in > time snapshots - the snapshot may be smeared across a longer period of > time while the kernel is still writing data. So parts of your writes > may still end up in the snapshot after issuing the snapshot command, > instead of in the working copy as expected. > > How is this going to be addressed? Is there some snapshot aware API to > let user space subscribe to such events and do proper preparation? Is > this planned? LVM could be a user of such an API, too. I think this > could have nice enterprise-grade value for Linux. > > XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But > still, also this needs to be integrated with MySQL to properly work. I > once (years ago) researched on this but gave up on my plans when I > planned database backups for our web server infrastructure. 
We moved to > creating SQL dumps instead, although there're binlogs which can be used > to recover to a clean and stable transactional state after taking > snapshots. But I simply didn't want to fiddle around with properly > cleaning up binlogs which accumulate horribly much space usage over > time. The cleanup process requires to create a cold copy or dump of the > complete database from time to time, only then it's safe to remove all > binlogs up to that point in time. little bit off topic, but I for one would be on board with such an effort. It "just" needs coordination between the backup software/snapshot tools, the backed up software and the various snapshot providers. If you look at the Windows VSS API, this would be a relatively large undertaking if all the corner cases are taken into account, like e.g. a database having the database log on a separate volume from the data, dependencies between different components etc. You'll know more about this, but databases usually fsync quite often in their default configuration, so btrfs snapshots shouldn't be much behind the properly snapshotted state, so I see the advantages more with usability and taking care of corner cases automatically. Regards, Martin Raiber [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3826 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-08 12:14 ` Martin Raiber @ 2017-02-08 13:00 ` Adrian Brzezinski 2017-02-08 13:08 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 42+ messages in thread From: Adrian Brzezinski @ 2017-02-08 13:00 UTC (permalink / raw) To: Martin Raiber, Peter Zaitsev, linux-btrfs W dniu 2017-02-08 o 13:14 PM, Martin Raiber pisze: > Hi, > > On 08.02.2017 03:11 Peter Zaitsev wrote: >> Out of curiosity, I see one problem here: >> If you're doing snapshots of the live database, each snapshot leaves >> the database files like killing the database in-flight. Like shutting >> the system down in the middle of writing data. >> >> This is because I think there's no API for user space to subscribe to >> events like a snapshot - unlike e.g. the VSS API (volume snapshot >> service) in Windows. You should put the database into frozen state to >> prepare it for a hotcopy before creating the snapshot, then ensure all >> data is flushed before continuing. >> >> I think I've read that btrfs snapshots do not guarantee single point in >> time snapshots - the snapshot may be smeared across a longer period of >> time while the kernel is still writing data. So parts of your writes >> may still end up in the snapshot after issuing the snapshot command, >> instead of in the working copy as expected. >> >> How is this going to be addressed? Is there some snapshot aware API to >> let user space subscribe to such events and do proper preparation? Is >> this planned? LVM could be a user of such an API, too. I think this >> could have nice enterprise-grade value for Linux. >> >> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But >> still, also this needs to be integrated with MySQL to properly work. I >> once (years ago) researched on this but gave up on my plans when I >> planned database backups for our web server infrastructure. 
We moved to >> creating SQL dumps instead, although there're binlogs which can be used >> to recover to a clean and stable transactional state after taking >> snapshots. But I simply didn't want to fiddle around with properly >> cleaning up binlogs which accumulate horribly much space usage over >> time. The cleanup process requires to create a cold copy or dump of the >> complete database from time to time, only then it's safe to remove all >> binlogs up to that point in time. > little bit off topic, but I for one would be on board with such an > effort. It "just" needs coordination between the backup > software/snapshot tools, the backed up software and the various snapshot > providers. If you look at the Windows VSS API, this would be a > relatively large undertaking if all the corner cases are taken into > account, like e.g. a database having the database log on a separate > volume from the data, dependencies between different components etc. > > You'll know more about this, but databases usually fsync quite often in > their default configuration, so btrfs snapshots shouldn't be much behind > the properly snapshotted state, so I see the advantages more with > usability and taking care of corner cases automatically. > > Regards, > Martin Raiber xfs_freeze works also for BTRFS... -- Adrian Brzeziński ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-08 12:14 ` Martin Raiber 2017-02-08 13:00 ` Adrian Brzezinski @ 2017-02-08 13:08 ` Austin S. Hemmelgarn 2017-02-08 13:26 ` Martin Raiber 1 sibling, 1 reply; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-02-08 13:08 UTC (permalink / raw) To: Martin Raiber, Peter Zaitsev, linux-btrfs On 2017-02-08 07:14, Martin Raiber wrote: > Hi, > > On 08.02.2017 03:11 Peter Zaitsev wrote: >> Out of curiosity, I see one problem here: >> If you're doing snapshots of the live database, each snapshot leaves >> the database files like killing the database in-flight. Like shutting >> the system down in the middle of writing data. >> >> This is because I think there's no API for user space to subscribe to >> events like a snapshot - unlike e.g. the VSS API (volume snapshot >> service) in Windows. You should put the database into frozen state to >> prepare it for a hotcopy before creating the snapshot, then ensure all >> data is flushed before continuing. >> >> I think I've read that btrfs snapshots do not guarantee single point in >> time snapshots - the snapshot may be smeared across a longer period of >> time while the kernel is still writing data. So parts of your writes >> may still end up in the snapshot after issuing the snapshot command, >> instead of in the working copy as expected. >> >> How is this going to be addressed? Is there some snapshot aware API to >> let user space subscribe to such events and do proper preparation? Is >> this planned? LVM could be a user of such an API, too. I think this >> could have nice enterprise-grade value for Linux. >> >> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But >> still, also this needs to be integrated with MySQL to properly work. I >> once (years ago) researched on this but gave up on my plans when I >> planned database backups for our web server infrastructure. 
We moved to >> creating SQL dumps instead, although there're binlogs which can be used >> to recover to a clean and stable transactional state after taking >> snapshots. But I simply didn't want to fiddle around with properly >> cleaning up binlogs which accumulate horribly much space usage over >> time. The cleanup process requires to create a cold copy or dump of the >> complete database from time to time, only then it's safe to remove all >> binlogs up to that point in time. > > little bit off topic, but I for one would be on board with such an > effort. It "just" needs coordination between the backup > software/snapshot tools, the backed up software and the various snapshot > providers. If you look at the Windows VSS API, this would be a > relatively large undertaking if all the corner cases are taken into > account, like e.g. a database having the database log on a separate > volume from the data, dependencies between different components etc. > > You'll know more about this, but databases usually fsync quite often in > their default configuration, so btrfs snapshots shouldn't be much behind > the properly snapshotted state, so I see the advantages more with > usability and taking care of corner cases automatically. Just my perspective, but BTRFS (and XFS, and OCFS2) already provide reflinking to userspace, and therefore it's fully possible to implement this in userspace. Having a version of the fsfreeze (the generic form of xfs_freeze) stuff that worked on individual sub-trees would be nice from a practical perspective, but implementing it would not be easy by any means, and would be essentially necessary for a VSS-like API. In the meantime though, it is fully possible for the application software to implement this itself without needing anything more from the kernel. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-08 13:08 ` Austin S. Hemmelgarn @ 2017-02-08 13:26 ` Martin Raiber 2017-02-08 13:32 ` Austin S. Hemmelgarn 2017-02-08 13:38 ` Peter Zaitsev 0 siblings, 2 replies; 42+ messages in thread From: Martin Raiber @ 2017-02-08 13:26 UTC (permalink / raw) To: Austin S. Hemmelgarn, Peter Zaitsev, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 4253 bytes --] On 08.02.2017 14:08 Austin S. Hemmelgarn wrote: > On 2017-02-08 07:14, Martin Raiber wrote: >> Hi, >> >> On 08.02.2017 03:11 Peter Zaitsev wrote: >>> Out of curiosity, I see one problem here: >>> If you're doing snapshots of the live database, each snapshot leaves >>> the database files like killing the database in-flight. Like shutting >>> the system down in the middle of writing data. >>> >>> This is because I think there's no API for user space to subscribe to >>> events like a snapshot - unlike e.g. the VSS API (volume snapshot >>> service) in Windows. You should put the database into frozen state to >>> prepare it for a hotcopy before creating the snapshot, then ensure all >>> data is flushed before continuing. >>> >>> I think I've read that btrfs snapshots do not guarantee single point in >>> time snapshots - the snapshot may be smeared across a longer period of >>> time while the kernel is still writing data. So parts of your writes >>> may still end up in the snapshot after issuing the snapshot command, >>> instead of in the working copy as expected. >>> >>> How is this going to be addressed? Is there some snapshot aware API to >>> let user space subscribe to such events and do proper preparation? Is >>> this planned? LVM could be a user of such an API, too. I think this >>> could have nice enterprise-grade value for Linux. >>> >>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But >>> still, also this needs to be integrated with MySQL to properly work. 
I >>> once (years ago) researched on this but gave up on my plans when I >>> planned database backups for our web server infrastructure. We moved to >>> creating SQL dumps instead, although there're binlogs which can be used >>> to recover to a clean and stable transactional state after taking >>> snapshots. But I simply didn't want to fiddle around with properly >>> cleaning up binlogs which accumulate horribly much space usage over >>> time. The cleanup process requires to create a cold copy or dump of the >>> complete database from time to time, only then it's safe to remove all >>> binlogs up to that point in time. >> >> little bit off topic, but I for one would be on board with such an >> effort. It "just" needs coordination between the backup >> software/snapshot tools, the backed up software and the various snapshot >> providers. If you look at the Windows VSS API, this would be a >> relatively large undertaking if all the corner cases are taken into >> account, like e.g. a database having the database log on a separate >> volume from the data, dependencies between different components etc. >> >> You'll know more about this, but databases usually fsync quite often in >> their default configuration, so btrfs snapshots shouldn't be much behind >> the properly snapshotted state, so I see the advantages more with >> usability and taking care of corner cases automatically. > Just my perspective, but BTRFS (and XFS, and OCFS2) already provide > reflinking to userspace, and therefore it's fully possible to > implement this in userspace. Having a version of the fsfreeze (the > generic form of xfs_freeze) stuff that worked on individual sub-trees > would be nice from a practical perspective, but implementing it would > not be easy by any means, and would be essentially necessary for a > VSS-like API. In the meantime though, it is fully possible for the > application software to implement this itself without needing anything > more from the kernel. 
VSS snapshots whole volumes, not individual files (so comparable to an LVM snapshot). The sub-folder freeze would be something useful in some situations, but duplicating the files+extents might also take too long in a lot of situations. You are correct that the kernel features are there and what is missing is a user-space daemon, plus a protocol that facilitates/coordinates the backups/snapshots. Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does not really help in some situations as e.g. MySQL InnoDB uses O_DIRECT and manages its own buffer pool which won't get the FIFREEZE and flush, but as said, the default configuration is to flush/fsync on every commit. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3826 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
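Since FIFREEZE cannot reach InnoDB's O_DIRECT buffer pool, the database-level equivalent of a freeze is a global read lock held while the snapshot is cut. A sketch of that coordination, assuming the mysql client's `system` command (which runs a shell command without closing the session, so the lock stays held) and illustrative paths:

```shell
#!/bin/sh
# Sketch: quiesce MySQL with a global read lock, snapshot, release.
# FLUSH TABLES WITH READ LOCK is dropped as soon as the session ends,
# so the snapshot must be taken from inside the same client session.
# All paths are illustrative assumptions.
mysql <<'EOF'
FLUSH TABLES WITH READ LOCK;
system btrfs subvolume snapshot -r /var/lib/mysql /snapshots/mysql-locked
UNLOCK TABLES;
EOF
```

This is essentially what a Linux VSS-like daemon would automate, including the corner cases (logs on a separate volume, multiple dependent services) discussed above.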
* Re: BTRFS for OLTP Databases 2017-02-08 13:26 ` Martin Raiber @ 2017-02-08 13:32 ` Austin S. Hemmelgarn 2017-02-08 14:28 ` Adrian Brzezinski 2017-02-08 13:38 ` Peter Zaitsev 1 sibling, 1 reply; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-02-08 13:32 UTC (permalink / raw) To: Martin Raiber, Peter Zaitsev, linux-btrfs On 2017-02-08 08:26, Martin Raiber wrote: > On 08.02.2017 14:08 Austin S. Hemmelgarn wrote: >> On 2017-02-08 07:14, Martin Raiber wrote: >>> Hi, >>> >>> On 08.02.2017 03:11 Peter Zaitsev wrote: >>>> Out of curiosity, I see one problem here: >>>> If you're doing snapshots of the live database, each snapshot leaves >>>> the database files like killing the database in-flight. Like shutting >>>> the system down in the middle of writing data. >>>> >>>> This is because I think there's no API for user space to subscribe to >>>> events like a snapshot - unlike e.g. the VSS API (volume snapshot >>>> service) in Windows. You should put the database into frozen state to >>>> prepare it for a hotcopy before creating the snapshot, then ensure all >>>> data is flushed before continuing. >>>> >>>> I think I've read that btrfs snapshots do not guarantee single point in >>>> time snapshots - the snapshot may be smeared across a longer period of >>>> time while the kernel is still writing data. So parts of your writes >>>> may still end up in the snapshot after issuing the snapshot command, >>>> instead of in the working copy as expected. >>>> >>>> How is this going to be addressed? Is there some snapshot aware API to >>>> let user space subscribe to such events and do proper preparation? Is >>>> this planned? LVM could be a user of such an API, too. I think this >>>> could have nice enterprise-grade value for Linux. >>>> >>>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But >>>> still, also this needs to be integrated with MySQL to properly work. 
I >>>> once (years ago) researched on this but gave up on my plans when I >>>> planned database backups for our web server infrastructure. We moved to >>>> creating SQL dumps instead, although there're binlogs which can be used >>>> to recover to a clean and stable transactional state after taking >>>> snapshots. But I simply didn't want to fiddle around with properly >>>> cleaning up binlogs which accumulate horribly much space usage over >>>> time. The cleanup process requires to create a cold copy or dump of the >>>> complete database from time to time, only then it's safe to remove all >>>> binlogs up to that point in time. >>> >>> little bit off topic, but I for one would be on board with such an >>> effort. It "just" needs coordination between the backup >>> software/snapshot tools, the backed up software and the various snapshot >>> providers. If you look at the Windows VSS API, this would be a >>> relatively large undertaking if all the corner cases are taken into >>> account, like e.g. a database having the database log on a separate >>> volume from the data, dependencies between different components etc. >>> >>> You'll know more about this, but databases usually fsync quite often in >>> their default configuration, so btrfs snapshots shouldn't be much behind >>> the properly snapshotted state, so I see the advantages more with >>> usability and taking care of corner cases automatically. >> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide >> reflinking to userspace, and therefore it's fully possible to >> implement this in userspace. Having a version of the fsfreeze (the >> generic form of xfs_freeze) stuff that worked on individual sub-trees >> would be nice from a practical perspective, but implementing it would >> not be easy by any means, and would be essentially necessary for a >> VSS-like API. 
In the meantime though, it is fully possible for the >> application software to implement this itself without needing anything >> more from the kernel. > > VSS snapshots whole volumes, not individual files (so comparable to an > LVM snapshot). The sub-folder freeze would be something useful in some > situations, but duplicating the files+extends might also take too long > in a lot of situations. You are correct that the kernel features are > there and what is missing is a user-space daemon, plus a protocol that > facilitates/coordinates the backups/snapshots. > > Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does not > really help in some situations as e.g. MySQL InnoDB uses O_DIRECT and > manages its on buffer pool which won't get the FIFREEZE and flush, but > as said, the default configuration is to flush/fsync on every commit. OK, there's part of the misunderstanding. You can't FIFREEZE a BTRFS filesystem and then take a snapshot in it, because the snapshot requires writing to the filesystem (which the FIFREEZE would prevent, so a script that tried to do this would deadlock). A new version of the FIFREEZE ioctl would be needed that operates on subvolumes. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-08 13:32 ` Austin S. Hemmelgarn @ 2017-02-08 14:28 ` Adrian Brzezinski 0 siblings, 0 replies; 42+ messages in thread From: Adrian Brzezinski @ 2017-02-08 14:28 UTC (permalink / raw) To: Austin S. Hemmelgarn, Martin Raiber, Peter Zaitsev, linux-btrfs W dniu 2017-02-08 o 14:32 PM, Austin S. Hemmelgarn pisze: > On 2017-02-08 08:26, Martin Raiber wrote: >> On 08.02.2017 14:08 Austin S. Hemmelgarn wrote: >>> On 2017-02-08 07:14, Martin Raiber wrote: >>>> Hi, >>>> >>>> On 08.02.2017 03:11 Peter Zaitsev wrote: >>>>> Out of curiosity, I see one problem here: >>>>> If you're doing snapshots of the live database, each snapshot leaves >>>>> the database files like killing the database in-flight. Like shutting >>>>> the system down in the middle of writing data. >>>>> >>>>> This is because I think there's no API for user space to subscribe to >>>>> events like a snapshot - unlike e.g. the VSS API (volume snapshot >>>>> service) in Windows. You should put the database into frozen state to >>>>> prepare it for a hotcopy before creating the snapshot, then ensure >>>>> all >>>>> data is flushed before continuing. >>>>> >>>>> I think I've read that btrfs snapshots do not guarantee single >>>>> point in >>>>> time snapshots - the snapshot may be smeared across a longer >>>>> period of >>>>> time while the kernel is still writing data. So parts of your writes >>>>> may still end up in the snapshot after issuing the snapshot command, >>>>> instead of in the working copy as expected. >>>>> >>>>> How is this going to be addressed? Is there some snapshot aware >>>>> API to >>>>> let user space subscribe to such events and do proper preparation? Is >>>>> this planned? LVM could be a user of such an API, too. I think this >>>>> could have nice enterprise-grade value for Linux. >>>>> >>>>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM >>>>> snapshots. But >>>>> still, also this needs to be integrated with MySQL to properly >>>>> work. 
I >>>>> once (years ago) researched on this but gave up on my plans when I >>>>> planned database backups for our web server infrastructure. We >>>>> moved to >>>>> creating SQL dumps instead, although there're binlogs which can be >>>>> used >>>>> to recover to a clean and stable transactional state after taking >>>>> snapshots. But I simply didn't want to fiddle around with properly >>>>> cleaning up binlogs which accumulate horribly much space usage over >>>>> time. The cleanup process requires to create a cold copy or dump >>>>> of the >>>>> complete database from time to time, only then it's safe to remove >>>>> all >>>>> binlogs up to that point in time. >>>> >>>> little bit off topic, but I for one would be on board with such an >>>> effort. It "just" needs coordination between the backup >>>> software/snapshot tools, the backed up software and the various >>>> snapshot >>>> providers. If you look at the Windows VSS API, this would be a >>>> relatively large undertaking if all the corner cases are taken into >>>> account, like e.g. a database having the database log on a separate >>>> volume from the data, dependencies between different components etc. >>>> >>>> You'll know more about this, but databases usually fsync quite >>>> often in >>>> their default configuration, so btrfs snapshots shouldn't be much >>>> behind >>>> the properly snapshotted state, so I see the advantages more with >>>> usability and taking care of corner cases automatically. >>> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide >>> reflinking to userspace, and therefore it's fully possible to >>> implement this in userspace. Having a version of the fsfreeze (the >>> generic form of xfs_freeze) stuff that worked on individual sub-trees >>> would be nice from a practical perspective, but implementing it would >>> not be easy by any means, and would be essentially necessary for a >>> VSS-like API. 
In the meantime though, it is fully possible for the >>> application software to implement this itself without needing anything >>> more from the kernel. >> >> VSS snapshots whole volumes, not individual files (so comparable to an >> LVM snapshot). The sub-folder freeze would be something useful in some >> situations, but duplicating the files+extends might also take too long >> in a lot of situations. You are correct that the kernel features are >> there and what is missing is a user-space daemon, plus a protocol that >> facilitates/coordinates the backups/snapshots. >> >> Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does not >> really help in some situations as e.g. MySQL InnoDB uses O_DIRECT and >> manages its on buffer pool which won't get the FIFREEZE and flush, but >> as said, the default configuration is to flush/fsync on every commit. > OK, there's part of the misunderstanding. You can't FIFREEZE a BTRFS > filesystem and then take a snapshot in it, because the snapshot > requires writing to the filesystem (which the FIFREEZE would prevent, > so a script that tried to do this would deadlock). A new version of > the FIFREEZE ioctl would be needed that operates on subvolumes. You can also put your filesystem on LVM, and take LVM snapshots. -- Adrian Brzeziński ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-08 13:26 ` Martin Raiber 2017-02-08 13:32 ` Austin S. Hemmelgarn @ 2017-02-08 13:38 ` Peter Zaitsev 1 sibling, 0 replies; 42+ messages in thread From: Peter Zaitsev @ 2017-02-08 13:38 UTC (permalink / raw) To: Martin Raiber; +Cc: Austin S. Hemmelgarn, linux-btrfs Hi, When it comes to MySQL I'm not really sure what you're trying to achieve. Because MySQL manages its own cache, flushing the OS cache to disk and "freezing" the FS does not really do much - the database will still need to do crash recovery when such a snapshot is restored. The reason people would use xfs_freeze with MySQL is when the database is spread across different filesystems - typically log files placed on a different partition than the data, or databases placed on different partitions. In this case you need a consistent single-point-in-time snapshot across the filesystems for the backup to be recoverable. The more common approach, though, is to keep it KISS and have everything on a single filesystem. On Wed, Feb 8, 2017 at 8:26 AM, Martin Raiber <martin@urbackup.org> wrote: > On 08.02.2017 14:08 Austin S. Hemmelgarn wrote: >> On 2017-02-08 07:14, Martin Raiber wrote: >>> Hi, >>> >>> On 08.02.2017 03:11 Peter Zaitsev wrote: >>>> Out of curiosity, I see one problem here: >>>> If you're doing snapshots of the live database, each snapshot leaves >>>> the database files like killing the database in-flight. Like shutting >>>> the system down in the middle of writing data. >>>> >>>> This is because I think there's no API for user space to subscribe to >>>> events like a snapshot - unlike e.g. the VSS API (volume snapshot >>>> service) in Windows. You should put the database into frozen state to >>>> prepare it for a hotcopy before creating the snapshot, then ensure all >>>> data is flushed before continuing. 
>>>> >>>> I think I've read that btrfs snapshots do not guarantee single point in >>>> time snapshots - the snapshot may be smeared across a longer period of >>>> time while the kernel is still writing data. So parts of your writes >>>> may still end up in the snapshot after issuing the snapshot command, >>>> instead of in the working copy as expected. >>>> >>>> How is this going to be addressed? Is there some snapshot aware API to >>>> let user space subscribe to such events and do proper preparation? Is >>>> this planned? LVM could be a user of such an API, too. I think this >>>> could have nice enterprise-grade value for Linux. >>>> >>>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But >>>> still, also this needs to be integrated with MySQL to properly work. I >>>> once (years ago) researched on this but gave up on my plans when I >>>> planned database backups for our web server infrastructure. We moved to >>>> creating SQL dumps instead, although there're binlogs which can be used >>>> to recover to a clean and stable transactional state after taking >>>> snapshots. But I simply didn't want to fiddle around with properly >>>> cleaning up binlogs which accumulate horribly much space usage over >>>> time. The cleanup process requires to create a cold copy or dump of the >>>> complete database from time to time, only then it's safe to remove all >>>> binlogs up to that point in time. >>> >>> little bit off topic, but I for one would be on board with such an >>> effort. It "just" needs coordination between the backup >>> software/snapshot tools, the backed up software and the various snapshot >>> providers. If you look at the Windows VSS API, this would be a >>> relatively large undertaking if all the corner cases are taken into >>> account, like e.g. a database having the database log on a separate >>> volume from the data, dependencies between different components etc. 
>>> >>> You'll know more about this, but databases usually fsync quite often in >>> their default configuration, so btrfs snapshots shouldn't be much behind >>> the properly snapshotted state, so I see the advantages more with >>> usability and taking care of corner cases automatically. >> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide >> reflinking to userspace, and therefore it's fully possible to >> implement this in userspace. Having a version of the fsfreeze (the >> generic form of xfs_freeze) stuff that worked on individual sub-trees >> would be nice from a practical perspective, but implementing it would >> not be easy by any means, and would be essentially necessary for a >> VSS-like API. In the meantime though, it is fully possible for the >> application software to implement this itself without needing anything >> more from the kernel. > > VSS snapshots whole volumes, not individual files (so comparable to an > LVM snapshot). The sub-folder freeze would be something useful in some > situations, but duplicating the files+extents might also take too long > in a lot of situations. You are correct that the kernel features are > there and what is missing is a user-space daemon, plus a protocol that > facilitates/coordinates the backups/snapshots. > > Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does not > really help in some situations as e.g. MySQL InnoDB uses O_DIRECT and > manages its own buffer pool which won't get the FIFREEZE and flush, but > as said, the default configuration is to flush/fsync on every commit. > > > -- Peter Zaitsev, CEO, Percona Tel: +1 888 401 3401 ext 7360 Skype: peter_zaitsev ^ permalink raw reply [flat|nested] 42+ messages in thread
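[Editor's sketch] Absent a VSS-style kernel API, the userspace coordination discussed above is usually scripted directly against the database. A minimal sketch, assuming /var/lib/mysql sits on a btrfs subvolume, the mysql client can connect as root without a password, and its `\!` (system) command is honored in batch input; paths and the snapshot directory are hypothetical:

```shell
#!/bin/sh
# Hold FLUSH TABLES WITH READ LOCK while the snapshot is taken. The
# mysql client's \! command runs a shell command from inside the same
# session, so the global read lock is still held when the snapshot is
# created; UNLOCK TABLES (or client exit) releases it.
set -e
SNAP="/var/lib/mysql-snapshots/$(date +%Y%m%d-%H%M%S)"

mysql -u root <<EOF
FLUSH TABLES WITH READ LOCK;
\! btrfs subvolume snapshot -r /var/lib/mysql "$SNAP"
UNLOCK TABLES;
EOF
```

Even with the lock held, restoring such a snapshot still goes through InnoDB crash recovery, as Peter notes; the lock mainly pins a consistent binlog position and any non-transactional tables.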
* Re: BTRFS for OLTP Databases 2017-02-07 13:53 BTRFS for OLTP Databases Peter Zaitsev 2017-02-07 14:00 ` Hugo Mills @ 2017-02-07 14:47 ` Peter Grandi 2017-02-07 15:06 ` Austin S. Hemmelgarn 2017-02-07 18:27 ` Jeff Mahoney 3 siblings, 0 replies; 42+ messages in thread From: Peter Grandi @ 2017-02-07 14:47 UTC (permalink / raw) To: linux-btrfs > I have tried BTRFS from Ubuntu 16.04 LTS for write intensive > OLTP MySQL Workload. This has a lot of interesting and mostly agreeable information: https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp The main target of Btrfs is where one wants checksums and occasional snapshots for backup (rather than rollback) and applications do whole-file rewrites or appends. > It did not go very well ranging from multi-seconds stalls > where no transactions are completed That usually is more because of the "clever" design and defaults of the Linux page cache and block IO subsystem, which are astutely pessimized for every workload, but especially for read-modify-write ones, never mind for RMW workloads on copy-on-write filesystems. That most OS designs are pessimized for anything like a "write intensive OLTP" workload is not new, M Stonebraker complained about that 35 years ago, and nothing much has changed: http://www.sabi.co.uk/blog/anno05-4th.html?051012d#051012d > to the finally kernel OOPS with "no space left on device" > error message and filesystem going read only. That's because Btrfs has a two-level allocator, where space is allocated in 1GiB chunks (distinct as to data and metadata) and then in 16KiB nodes, and this makes it far more likely for free space fragmentation to occur. Therefore Btrfs has a free space compactor ('btrfs balance') that must be used the more often the more updates happen. 
> interested in "free" snapshots which look very attractive The general problem is that it is pretty much impossible to have read-modify-write rollbacks for cheap, because the writes in general are scattered (that is their time coherence is very different from their spatial coherence). That means either heavy spatial fragmentation or huge write amplification. The 'snapshot' type of DM/LVM2 device delivers heavy spatial fragmentation, Btrfs does a balance of both. Another commenter has mentioned the use of 'nodatacow' to prevent RMW resulting in huge write-amplification. > to use for database recovery scenarios allow instant rollback > to the previous state. You may be more interested in NILFS2 for that, but there are significant tradeoffs there too, and NILFS2 requires a free space compactor too, plus since NILFS2 gives up on short-term spatial coherence, the compactor also needs to compact data space. ^ permalink raw reply [flat|nested] 42+ messages in thread
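[Editor's sketch] The free space compactor mentioned above is the 'btrfs balance' command; on an update-heavy filesystem it is usually run periodically with usage filters so that only mostly-empty chunks get rewritten. Mountpoint and thresholds below are illustrative, not prescriptive:

```shell
# Inspect chunk-level vs extent-level usage; a large gap between
# "total=" and "used=" on the Data line signals free-space fragmentation.
btrfs filesystem df /var/lib/mysql

# Compact only data chunks under 50% full and metadata chunks under 30%
# full. Much cheaper than an unfiltered (full) balance.
btrfs balance start -dusage=50 -musage=30 /var/lib/mysql

# On long-running balances, check progress from another shell.
btrfs balance status /var/lib/mysql
```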
* Re: BTRFS for OLTP Databases 2017-02-07 13:53 BTRFS for OLTP Databases Peter Zaitsev 2017-02-07 14:00 ` Hugo Mills 2017-02-07 14:47 ` Peter Grandi @ 2017-02-07 15:06 ` Austin S. Hemmelgarn 2017-02-07 19:39 ` Kai Krakow 2017-02-07 18:27 ` Jeff Mahoney 3 siblings, 1 reply; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-02-07 15:06 UTC (permalink / raw) To: Peter Zaitsev, linux-btrfs On 2017-02-07 08:53, Peter Zaitsev wrote: > Hi, > > I have tried BTRFS from Ubuntu 16.04 LTS for write intensive OLTP MySQL > Workload. > > It did not go very well ranging from multi-seconds stalls where no > transactions are completed to the finally kernel OOPS with "no space left > on device" error message and filesystem going read only. How much spare space did you have allocated in the filesystem? At a minimum, you want at least a few GB beyond what you expect to be the maximum size of your data-set times the number of snapshots you plan to keep around at any given time. > > I'm complete newbie in BTRFS so I assume I'm doing something wrong. Not exactly wrong, but getting this to work efficiently is more art than engineering. > > Do you have any advice on how BTRFS should be tuned for OLTP workload > (large files having a lot of random writes) ? Or is this the case where > one should simply stay away from BTRFS and use something else ? The general recommendation is usually to avoid BTRFS for such things. There are however a number of things you can do to improve performance: 1. Use a backing storage format that has the minimal amount of complexity. The more data structures that get updated when a record changes, the worse the performance will be. I don't have enough experience with MySQL to give a specific recommendation on what backing storage format to use, but someone else might. 2. Avoid large numbers of small transactions. The smaller the transaction, the worse it will fragment things. 3. Use autodefrag. 
This will increase write load on the storage device, but it should improve performance for reads. 4. Try using in-line compression. This can actually significantly improve performance, especially if you have slow storage devices and a really nice CPU. 5. If you're running raid10 mode for BTRFS, run raid1 on top of two LVM or MD RAID0 devices instead. This sounds stupid, but it actually will hugely improve both read and write performance without sacrificing any data safety. 6. Look at I/O scheduler tuning. This can have a huge impact, especially considering that most of the defaults for the various schedulers are somewhat poor for most modern systems. I won't go into the details here, since there are a huge number of online resources about this. > > One item recommended in some places is "nodatacow" this however defeats > the main purpose I'm looking at BTRFS - I am interested in "free" > snapshots which look very attractive to use for database recovery scenarios > allow instant rollback to the previous state. Snapshots aren't free. They are quick, but they aren't free by any means. If you're going to be using snapshots, keep them to a minimum: performance scales inversely with the number of snapshots, and this has a much bigger impact the more you're trying to do on the filesystem. Also, consider whether or not you _actually_ need filesystem level snapshots. I don't know about your full software stack, but most good OLTP software supports rollback segments (or an equivalent with a different name), and those are probably what you want to use, not filesystem snapshots. ^ permalink raw reply [flat|nested] 42+ messages in thread
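[Editor's sketch] A concrete starting point for the numbered tuning list above, with the caveat that the mount options, compression algorithm, and device names are workload-dependent choices rather than recommendations:

```shell
# fstab sketch: autodefrag (item 3) plus lzo compression (item 4).
# zlib compresses better; lzo costs less CPU.
# /dev/sdb1  /var/lib/mysql  btrfs  noatime,autodefrag,compress=lzo  0 0

# Scope nodatacow to the datadir instead of the whole mount: +C on a
# directory applies only to files created in it afterwards, so set it
# while the directory is still empty.
chattr +C /var/lib/mysql/data
lsattr -d /var/lib/mysql/data        # should show the C attribute

# Item 6: the I/O scheduler is set per block device at runtime.
cat /sys/block/sdb/queue/scheduler   # current choice shown in brackets
echo deadline > /sys/block/sdb/queue/scheduler
```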
* Re: BTRFS for OLTP Databases 2017-02-07 15:06 ` Austin S. Hemmelgarn @ 2017-02-07 19:39 ` Kai Krakow 2017-02-07 19:59 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 42+ messages in thread From: Kai Krakow @ 2017-02-07 19:39 UTC (permalink / raw) To: linux-btrfs Am Tue, 7 Feb 2017 10:06:34 -0500 schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>: > 4. Try using in-line compression. This can actually significantly > improve performance, especially if you have slow storage devices and > a really nice CPU. Just a side note: With nodatacow there'll be no compression, I think. At least for files with "chattr +C" there'll be no compression. I thus think "nodatacow" has the same effect. -- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-07 19:39 ` Kai Krakow @ 2017-02-07 19:59 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-02-07 19:59 UTC (permalink / raw) To: linux-btrfs On 2017-02-07 14:39, Kai Krakow wrote: > Am Tue, 7 Feb 2017 10:06:34 -0500 > schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>: > >> 4. Try using in-line compression. This can actually significantly >> improve performance, especially if you have slow storage devices and >> a really nice CPU. > > Just a side note: With nodatacow there'll be no compression, I think. > At least for files with "chattr +C" there'll be no compression. I thus > think "nodatacow" has the same effect. You're absolutely right, thanks for mentioning this, I completely forgot to point it out myself. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-07 13:53 BTRFS for OLTP Databases Peter Zaitsev ` (2 preceding siblings ...) 2017-02-07 15:06 ` Austin S. Hemmelgarn @ 2017-02-07 18:27 ` Jeff Mahoney 2017-02-07 18:59 ` Peter Zaitsev 3 siblings, 1 reply; 42+ messages in thread From: Jeff Mahoney @ 2017-02-07 18:27 UTC (permalink / raw) To: Peter Zaitsev, linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 2026 bytes --] On 2/7/17 8:53 AM, Peter Zaitsev wrote: > Hi, > > I have tried BTRFS from Ubuntu 16.04 LTS for write intensive OLTP MySQL > Workload. > > It did not go very well ranging from multi-seconds stalls where no > transactions are completed to the finally kernel OOPS with "no space left > on device" error message and filesystem going read only. > > I'm complete newbie in BTRFS so I assume I'm doing something wrong. > > Do you have any advice on how BTRFS should be tuned for OLTP workload > (large files having a lot of random writes) ? Or is this the case where > one should simply stay away from BTRFS and use something else ? > > One item recommended in some places is "nodatacow" this however defeats > the main purpose I'm looking at BTRFS - I am interested in "free" > snapshots which look very attractive to use for database recovery scenarios > allow instant rollback to the previous state. > Hi Peter - There seems to be some misunderstanding around how nodatacow works. Nodatacow doesn't prohibit snapshot use. Snapshots are still allowed and, of course, will cause CoW to happen when a write occurs, but only on the first write. Subsequent writes will not CoW again. This does mean you don't get CRC protection for data, though. Since most databases do this internally, that is probably no great loss. You will get fragmentation, but that's true of any random-write workload on btrfs. Timothy's comment about how extents are accounted is more-or-less correct. The file extents in the file system trees reference data extents in the extent tree. 
When portions of the data extent are unreferenced, they're not necessarily released. A balance operation will usually split the data extents so that the unused space is released. As for the Oopses with ENOSPC, that's something we'd want to look into if it can be reproduced with a more recent kernel. We shouldn't be getting ENOSPC anywhere sensitive anymore. -Jeff -- Jeff Mahoney SUSE Labs [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
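[Editor's sketch] Jeff's point — that nodatacow files are still CoWed exactly once per snapshot — is easy to observe on a scratch filesystem. A throwaway loop-device setup for experimentation (run as root; all paths hypothetical, not production guidance):

```shell
# Build a disposable btrfs on a loop device.
truncate -s 2G /tmp/btrfs-play.img
mkfs.btrfs -q /tmp/btrfs-play.img
mkdir -p /mnt/play
mount -o loop /tmp/btrfs-play.img /mnt/play

btrfs subvolume create /mnt/play/db
chattr +C /mnt/play/db                 # files created below are nodatacow
dd if=/dev/zero of=/mnt/play/db/table.ibd bs=1M count=64 status=none

btrfs subvolume snapshot /mnt/play/db /mnt/play/db-snap

# The first overwrite after the snapshot is CoWed (the extent is shared
# with db-snap); repeating the same overwrite then happens in place.
dd if=/dev/urandom of=/mnt/play/db/table.ibd bs=16k count=1 conv=notrunc status=none
filefrag /mnt/play/db/table.ibd        # watch the extent count
```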
* Re: BTRFS for OLTP Databases 2017-02-07 18:27 ` Jeff Mahoney @ 2017-02-07 18:59 ` Peter Zaitsev 2017-02-07 19:54 ` Austin S. Hemmelgarn 2017-02-07 22:08 ` Hans van Kranenburg 0 siblings, 2 replies; 42+ messages in thread From: Peter Zaitsev @ 2017-02-07 18:59 UTC (permalink / raw) To: Jeff Mahoney; +Cc: linux-btrfs Jeff, Thank you very much for the explanations. Indeed it was not clear in the documentation - I read it simply as "if you have snapshots enabled nodatacow makes no difference" I will rebuild the database in this mode from scratch and see how performance changes. So far the most frustrating for me was periodic stalls for many seconds (running sysbench workload). What was most puzzling, I get this even if I run the workload at 50% or less of the full load - i.e. the database can handle 1000 transactions/sec and I only inject 500/sec, and I still have those stalls. This is where it looks to me like some work is being delayed and then it requires a stall of a few seconds to catch up. I wonder if there are some configuration options available to play with. So far I found BTRFS rather "zero configuration" which is great if it works but it is also great to have more levers to pull if you're having some troubles. On Tue, Feb 7, 2017 at 1:27 PM, Jeff Mahoney <jeffm@suse.com> wrote: > On 2/7/17 8:53 AM, Peter Zaitsev wrote: >> Hi, >> >> I have tried BTRFS from Ubuntu 16.04 LTS for write intensive OLTP MySQL >> Workload. >> >> It did not go very well ranging from multi-seconds stalls where no >> transactions are completed to the finally kernel OOPS with "no space left >> on device" error message and filesystem going read only. >> >> I'm complete newbie in BTRFS so I assume I'm doing something wrong. >> >> Do you have any advice on how BTRFS should be tuned for OLTP workload >> (large files having a lot of random writes) ? Or is this the case where >> one should simply stay away from BTRFS and use something else ? 
>> >> One item recommended in some places is "nodatacow" this however defeats >> the main purpose I'm looking at BTRFS - I am interested in "free" >> snapshots which look very attractive to use for database recovery scenarios >> allow instant rollback to the previous state. >> > > Hi Peter - > > There seems to be some misunderstanding around how nodatacow works. > Nodatacow doesn't prohibit snapshot use. Snapshots are still allowed > and, of course, will cause CoW to happen when a write occurs, but only > on the first write. Subsequent writes will not CoW again. This does > mean you don't get CRC protection for data, though. Since most > databases do this internally, that is probably no great loss. You will > get fragmentation, but that's true of any random-write workload on btrfs. > > Timothy's comment about how extents are accounted is more-or-less > correct. The file extents in the file system trees reference data > extents in the extent tree. When portions of the data extent are > unreferenced, they're not necessarily released. A balance operation > will usually split the data extents so that the unused space is released. > > As for the Oopses with ENOSPC, that's something we'd want to look into > if it can be reproduced with a more recent kernel. We shouldn't be > getting ENOSPC anywhere sensitive anymore. > > -Jeff > > -- > Jeff Mahoney > SUSE Labs > -- Peter Zaitsev, CEO, Percona Tel: +1 888 401 3401 ext 7360 Skype: peter_zaitsev ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-07 18:59 ` Peter Zaitsev @ 2017-02-07 19:54 ` Austin S. Hemmelgarn 2017-02-07 20:40 ` Peter Zaitsev 2017-02-07 22:08 ` Hans van Kranenburg 1 sibling, 1 reply; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-02-07 19:54 UTC (permalink / raw) To: Peter Zaitsev; +Cc: Jeff Mahoney, linux-btrfs On 2017-02-07 13:59, Peter Zaitsev wrote: > Jeff, > > Thank you very much for explanations. Indeed it was not clear in the > documentation - I read it simply as "if you have snapshots enabled > nodatacow makes no difference" > > I will rebuild the database in this mode from scratch and see how > performance changes. > > So far the most frustating for me was periodic stalls for many seconds > (running sysbench workload). What was the most puzzling I get this > even if I run workload at the 50% or less of the full load - Ie > database can handle 1000 transactions/sec and I only inject 500/sec > and I still have those stalls. > > This is where it looks to me like some work is being delayed and when > it requires stall for a few seconds to catch up. I wonder if there > are some configuration options available to play with. > > So far I found BTRFS rather "zero configuration" which is great if it > works but it is also great to have more levers to pull if you're > having some troubles. It's worth keeping in mind that there is more to the storage stack than just the filesystem, and BTRFS tends to be more sensitive to the behavior of other components in the stack than most other filesystems are. The stalls you're describing sound more like a symptom of the brain-dead writeback buffering defaults used by the VFS layer than they do an issue with BTRFS (although BTRFS tends to be a bit more heavily impacted by this than most other filesystems). Try fiddling with the /proc/sys/vm/dirty_* sysctls (there is some pretty good documentation in Documentation/sysctl/vm.txt in the kernel source) and see if that helps. 
The default values it uses are at most 20% of RAM, which is an insane amount of data to buffer before starting writeback when you're talking about systems with 16GB of RAM. > > > On Tue, Feb 7, 2017 at 1:27 PM, Jeff Mahoney <jeffm@suse.com> wrote: >> On 2/7/17 8:53 AM, Peter Zaitsev wrote: >>> Hi, >>> >>> I have tried BTRFS from Ubuntu 16.04 LTS for write intensive OLTP MySQL >>> Workload. >>> >>> It did not go very well ranging from multi-seconds stalls where no >>> transactions are completed to the finally kernel OOPS with "no space left >>> on device" error message and filesystem going read only. >>> >>> I'm complete newbie in BTRFS so I assume I'm doing something wrong. >>> >>> Do you have any advice on how BTRFS should be tuned for OLTP workload >>> (large files having a lot of random writes) ? Or is this the case where >>> one should simply stay away from BTRFS and use something else ? >>> >>> One item recommended in some places is "nodatacow" this however defeats >>> the main purpose I'm looking at BTRFS - I am interested in "free" >>> snapshots which look very attractive to use for database recovery scenarios >>> allow instant rollback to the previous state. >>> >> >> Hi Peter - >> >> There seems to be some misunderstanding around how nodatacow works. >> Nodatacow doesn't prohibit snapshot use. Snapshots are still allowed >> and, of course, will cause CoW to happen when a write occurs, but only >> on the first write. Subsequent writes will not CoW again. This does >> mean you don't get CRC protection for data, though. Since most >> databases do this internally, that is probably no great loss. You will >> get fragmentation, but that's true of any random-write workload on btrfs. >> >> Timothy's comment about how extents are accounted is more-or-less >> correct. The file extents in the file system trees reference data >> extents in the extent tree. When portions of the data extent are >> unreferenced, they're not necessarily released. 
A balance operation >> will usually split the data extents so that the unused space is released. >> >> As for the Oopses with ENOSPC, that's something we'd want to look into >> if it can be reproduced with a more recent kernel. We shouldn't be >> getting ENOSPC anywhere sensitive anymore. >> >> -Jeff >> >> -- >> Jeff Mahoney >> SUSE Labs >> > > > ^ permalink raw reply [flat|nested] 42+ messages in thread
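[Editor's sketch] The sysctl fiddling Austin suggests amounts to replacing the percentage-based writeback limits with absolute byte limits, so a big-RAM machine does not accumulate gigabytes of dirty pages before writeback starts. The byte values below are illustrative starting points only; Documentation/sysctl/vm.txt describes each knob:

```shell
# Current defaults: percentages of reclaimable memory.
sysctl vm.dirty_ratio vm.dirty_background_ratio

# Cap total dirty data at ~512 MiB and start background writeback at
# ~128 MiB. Writing the *_bytes forms zeroes the matching *_ratio forms.
sysctl -w vm.dirty_bytes=$((512 * 1024 * 1024))
sysctl -w vm.dirty_background_bytes=$((128 * 1024 * 1024))

# To persist, e.g. in /etc/sysctl.d/99-writeback.conf:
#   vm.dirty_bytes = 536870912
#   vm.dirty_background_bytes = 134217728
```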
* Re: BTRFS for OLTP Databases 2017-02-07 19:54 ` Austin S. Hemmelgarn @ 2017-02-07 20:40 ` Peter Zaitsev 0 siblings, 0 replies; 42+ messages in thread From: Peter Zaitsev @ 2017-02-07 20:40 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Jeff Mahoney, linux-btrfs Austin, I recognize there are other components too. In this case I'm actually comparing BTRFS to XFS and EXT4, so I'm 100% sure it is file system related. Also, I'm using O_DIRECT asynchronous IO with MySQL, which means there is no significant amount of dirty data at the file system level. I'll see if it helps, though. Also, I assumed this is something well known, as it is documented in Gotchas here: https://btrfs.wiki.kernel.org/index.php/Gotchas (Fragmentation section) > > It's worth keeping in mind that there is more to the storage stack than just > the filesystem, and BTRFS tends to be more sensitive to the behavior of > other components in the stack than most other filesystems are. The stalls > you're describing sound more like a symptom of the brain-dead writeback > buffering defaults used by the VFS layer than they do an issue with BTRFS > (although BTRFS tends to be a bit more heavily impacted by this than most > other filesystems). Try fiddling with the /proc/sys/vm/dirty_* sysctls > (there is some pretty good documentation in Documentation/sysctl/vm.txt in > the kernel source) and see if that helps. The default values it uses are at > most 20% of RAM, which is an insane amount of data to buffer before starting > writeback when you're talking about systems with 16GB of RAM. > -- Peter Zaitsev, CEO, Percona Tel: +1 888 401 3401 ext 7360 Skype: peter_zaitsev ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: BTRFS for OLTP Databases 2017-02-07 18:59 ` Peter Zaitsev 2017-02-07 19:54 ` Austin S. Hemmelgarn @ 2017-02-07 22:08 ` Hans van Kranenburg 1 sibling, 0 replies; 42+ messages in thread From: Hans van Kranenburg @ 2017-02-07 22:08 UTC (permalink / raw) To: Peter Zaitsev; +Cc: linux-btrfs On 02/07/2017 07:59 PM, Peter Zaitsev wrote: > > So far the most frustating for me was periodic stalls for many seconds > (running sysbench workload). What was the most puzzling I get this > even if I run workload at the 50% or less of the full load - Ie > database can handle 1000 transactions/sec and I only inject 500/sec > and I still have those stalls. > > This is where it looks to me like some work is being delayed and when > it requires stall for a few seconds to catch up. I wonder if there > are some configuration options available to play with. What happens during these stalls? Do you mean a 'stall' like it seems nothing is happening at all, or a 'stall' during which something is so busy that something else cannot continue? Is there some kernel thread doing a lot of cpu? What does the /proc/<pid>/stack show? Is it huge write spikes with not many writes in between, or do you generate enough action to be writing to disk all the time? If the stalls show the behaviour of huge disk-write spikes, during which applications seem to be blocked from continuing to write more, and if during that time you see btrfs-transaction active in the kernel, aaaand, if your test is doing a lot of writes all over the place (not only simply appending table files sequentially, but changing a lot and touching a lot of metadata) and you're pushing it, it might be space cache related. I think the /proc/<pid>/stack of the btrfs-transaction will show you something related to free space cache in this case. 
In this case, it might be interesting to test the free space tree (instead of the default free space cache): http://events.linuxfoundation.org/sites/events/files/slides/vault2016_0.pdf Using free space tree helped me a lot on write-heavy filesystems (like a backup server with concurrent rsync data streaming in, also doing snapshotting) from having incoming traffic drop to the ground every time there was a transaction commit. -- Hans van Kranenburg ^ permalink raw reply [flat|nested] 42+ messages in thread
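[Editor's sketch] Enabling the free space tree Hans mentions is a one-time mount option (kernel 4.5+ at the time of this thread; note that btrfs-progs of that era could not yet check or repair a filesystem with it enabled):

```shell
# The first mount with space_cache=v2 builds the free space tree; later
# mounts use it automatically without the option.
mount -o space_cache=v2 /dev/sdb1 /var/lib/mysql
grep /var/lib/mysql /proc/mounts        # should list space_cache=v2

# Reverting requires clearing the tree; newer btrfs-progs can do it
# offline with: btrfs check --clear-space-cache v2 /dev/sdb1
```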
end of thread, other threads:[~2017-02-13 17:16 UTC | newest] Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-02-07 13:53 BTRFS for OLTP Databases Peter Zaitsev 2017-02-07 14:00 ` Hugo Mills 2017-02-07 14:13 ` Peter Zaitsev 2017-02-07 15:00 ` Timofey Titovets 2017-02-07 15:09 ` Austin S. Hemmelgarn 2017-02-07 15:20 ` Timofey Titovets 2017-02-07 15:43 ` Austin S. Hemmelgarn 2017-02-07 21:14 ` Kai Krakow 2017-02-07 16:22 ` Lionel Bouton 2017-02-07 19:57 ` Roman Mamedov 2017-02-07 20:36 ` Kai Krakow 2017-02-07 20:44 ` Lionel Bouton 2017-02-07 20:47 ` Austin S. Hemmelgarn 2017-02-07 21:25 ` Lionel Bouton 2017-02-07 21:35 ` Kai Krakow 2017-02-07 22:27 ` Hans van Kranenburg 2017-02-08 19:08 ` Goffredo Baroncelli [not found] ` <b0de25a7-989e-d16a-2ce6-2b6c1edde08b@gmail.com> 2017-02-13 12:44 ` Austin S. Hemmelgarn 2017-02-13 17:16 ` linux-btrfs 2017-02-07 19:31 ` Peter Zaitsev 2017-02-07 19:50 ` Austin S. Hemmelgarn 2017-02-07 20:19 ` Kai Krakow 2017-02-07 20:27 ` Austin S. Hemmelgarn 2017-02-07 20:54 ` Kai Krakow 2017-02-08 12:12 ` Austin S. Hemmelgarn 2017-02-08 2:11 ` Peter Zaitsev 2017-02-08 12:14 ` Martin Raiber 2017-02-08 13:00 ` Adrian Brzezinski 2017-02-08 13:08 ` Austin S. Hemmelgarn 2017-02-08 13:26 ` Martin Raiber 2017-02-08 13:32 ` Austin S. Hemmelgarn 2017-02-08 14:28 ` Adrian Brzezinski 2017-02-08 13:38 ` Peter Zaitsev 2017-02-07 14:47 ` Peter Grandi 2017-02-07 15:06 ` Austin S. Hemmelgarn 2017-02-07 19:39 ` Kai Krakow 2017-02-07 19:59 ` Austin S. Hemmelgarn 2017-02-07 18:27 ` Jeff Mahoney 2017-02-07 18:59 ` Peter Zaitsev 2017-02-07 19:54 ` Austin S. Hemmelgarn 2017-02-07 20:40 ` Peter Zaitsev 2017-02-07 22:08 ` Hans van Kranenburg