From: Alex Adriaanse <alex@oseberg.io>
To: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Ongoing Btrfs stability issues
Date: Thu, 15 Feb 2018 16:18:22 +0000
Message-ID: <SN2PR03MB22697EDC5BC991C819353117A9F40@SN2PR03MB2269.namprd03.prod.outlook.com>

We've been using Btrfs in production on AWS EC2 with EBS devices for over 2 years. There is so much I love about Btrfs: CoW snapshots, compression, subvolumes, flexibility, the tools, etc. However, lack of stability has been a serious ongoing issue for us, and we're getting to the point that it's becoming hard to justify continuing to use it unless we make some changes that will get it stable. The instability manifests itself mostly in the form of the VM completely crashing, I/O operations freezing, or the filesystem going into readonly mode. We've spent an enormous amount of time trying to recover corrupted filesystems, and the time that servers were down as a result of Btrfs instability has accumulated to many days.

We've made many changes to try to improve Btrfs stability: upgrading to newer kernels, setting up nightly balances, setting up monitoring to ensure our filesystems stay under 70% utilization, etc. This has definitely helped quite a bit, but even with these things in place it's still unstable. Take https://bugzilla.kernel.org/show_bug.cgi?id=198787 for example, which I created yesterday: we've had 4 VMs (out of 20) go down over the past week alone because of Btrfs errors. Thankfully, no data was lost, but I did have to copy everything over to a new filesystem.
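
(As an aside on the monitoring mentioned above: one way to implement the 70% check is to compare allocated vs. total device bytes. Below is a minimal sketch under that assumption; the mount point is a placeholder and the parsing of "btrfs filesystem usage -b" output may need adjusting for your btrfs-progs version.)

    #!/bin/sh
    # Sketch of a utilization check: warn when chunk allocation exceeds 70%.
    # Assumes the "Device size:" / "Device allocated:" lines printed by
    # `btrfs filesystem usage -b`; adjust the parsing for your btrfs-progs version.
    FS=/mnt/data        # placeholder mount point
    THRESHOLD=70

    size=$(btrfs filesystem usage -b "$FS" | awk '/Device size:/ {print $3; exit}')
    alloc=$(btrfs filesystem usage -b "$FS" | awk '/Device allocated:/ {print $3; exit}')
    pct=$(( alloc * 100 / size ))

    if [ "$pct" -ge "$THRESHOLD" ]; then
        # Hook into whatever alerting you use; this sketch just prints a warning.
        echo "WARNING: $FS is ${pct}% allocated (threshold ${THRESHOLD}%)" >&2
        exit 1
    fi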

Many of our VMs that run Btrfs have a high rate of I/O (both reads and writes; I/O utilization is often pegged at 100%). The filesystems that get little I/O seem pretty stable, but the ones that undergo a lot of I/O activity are the ones that suffer from the most instability problems. We run the following balances on every filesystem every night:

    btrfs balance start -dusage=10 <fs>
    btrfs balance start -dusage=20 <fs>
    btrfs balance start -dusage=40,limit=100 <fs>
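
(In practice these are simply scheduled nightly; a minimal wrapper sketch, with the mount point as a placeholder:)

    #!/bin/sh
    # Nightly balance wrapper (sketch). The -dusage=N filter only rewrites data
    # block groups that are less than N% full, so the cheap passes run first;
    # limit=100 caps the heaviest pass at 100 block groups per night.
    FS=/mnt/data    # placeholder mount point

    btrfs balance start -dusage=10 "$FS"
    btrfs balance start -dusage=20 "$FS"
    btrfs balance start -dusage=40,limit=100 "$FS"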

We also use the following btrfs-snap cronjobs to implement rotating snapshots, with short-term snapshots taking place every 15 minutes and less frequent ones being retained for up to 3 days:

    0 1-23 * * * /opt/btrfs-snap/btrfs-snap -r <fs> 23
    15,30,45 * * * * /opt/btrfs-snap/btrfs-snap -r <fs> 15m 3
    0 0 * * * /opt/btrfs-snap/btrfs-snap -r <fs> daily 3
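
(To see what this schedule leaves on disk at any point, the snapshot subvolumes can be listed directly; a quick illustration, with the mount point as a placeholder:)

    # Lists only the snapshot subvolumes on a filesystem, i.e. the hourly,
    # 15-minute, and daily snapshots that btrfs-snap maintains under the
    # schedule above.
    btrfs subvolume list -s /mnt/data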

Our filesystems are mounted with the "compress=lzo" option.
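
(For completeness, an illustrative fstab entry with that option; the UUID and mount point are placeholders:)

    # /etc/fstab (illustrative entry)
    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/data  btrfs  defaults,compress=lzo  0  0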

Are we doing something wrong? Are there things we should change to improve stability? I wouldn't be surprised if eliminating snapshots would stabilize things, but if we do that we might as well be using a filesystem like XFS. Are there fixes queued up that will solve the problems listed in the Bugzilla ticket referenced above? Or is our I/O-intensive workload just not a good fit for Btrfs?

Thanks,

Alex
