* Ongoing Btrfs stability issues
@ 2018-02-15 16:18 Alex Adriaanse
  2018-02-15 18:00 ` Nikolay Borisov
  2018-02-16 19:44 ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 31+ messages in thread
From: Alex Adriaanse @ 2018-02-15 16:18 UTC (permalink / raw)
  To: linux-btrfs

We've been using Btrfs in production on AWS EC2 with EBS devices for over 2 years. There is so much I love about Btrfs: CoW snapshots, compression, subvolumes, flexibility, the tools, etc. However, lack of stability has been a serious ongoing issue for us, and we're getting to the point that it's becoming hard to justify continuing to use it unless we make some changes that will get it stable. The instability manifests itself mostly in the form of the VM completely crashing, I/O operations freezing, or the filesystem going into readonly mode. We've spent an enormous amount of time trying to recover corrupted filesystems, and the time that servers were down as a result of Btrfs instability has accumulated to many days.

We've made many changes to try to improve Btrfs stability: upgrading to newer kernels, setting up nightly balances, setting up monitoring to ensure our filesystems stay under 70% utilization, etc. This has definitely helped quite a bit, but even with these things in place it's still unstable. Take https://bugzilla.kernel.org/show_bug.cgi?id=198787 for example, which I created yesterday: we've had 4 VMs (out of 20) go down over the past week alone because of Btrfs errors. Thankfully, no data was lost, but I did have to copy everything over to a new filesystem.

Many of our VMs that run Btrfs have a high rate of I/O (both read/write; I/O utilization is often pegged at 100%). The filesystems that get little I/O seem pretty stable, but the ones that undergo a lot of I/O activity are the ones that suffer from the most instability problems. We run the following balances on every filesystem every night:

    btrfs balance start -dusage=10 <fs>
    btrfs balance start -dusage=20 <fs>
    btrfs balance start -dusage=40,limit=100 <fs>

We also use the following btrfs-snap cronjobs to implement rotating snapshots, with short-term snapshots taking place every 15 minutes and less frequent ones being retained for up to 3 days:

    0 1-23 * * * /opt/btrfs-snap/btrfs-snap -r <fs> 23
    15,30,45 * * * * /opt/btrfs-snap/btrfs-snap -r <fs> 15m 3
    0 0 * * * /opt/btrfs-snap/btrfs-snap -r <fs> daily 3

Our filesystems are mounted with the "compress=lzo" option.
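For reference, the fstab entries look roughly like this (device and
mount point are placeholders):

    /dev/xvdc  /data  btrfs  compress=lzo  0  0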

Are we doing something wrong? Are there things we should change to improve stability? I wouldn't be surprised if eliminating snapshots would stabilize things, but if we do that we might as well be using a filesystem like XFS. Are there fixes queued up that will solve the problems listed in the Bugzilla ticket referenced above? Or is our I/O-intensive workload just not a good fit for Btrfs?

Thanks,

Alex


* Re: Ongoing Btrfs stability issues
  2018-02-15 16:18 Ongoing Btrfs stability issues Alex Adriaanse
@ 2018-02-15 18:00 ` Nikolay Borisov
  2018-02-15 19:41   ` Alex Adriaanse
  2018-02-16 19:44 ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 31+ messages in thread
From: Nikolay Borisov @ 2018-02-15 18:00 UTC (permalink / raw)
  To: Alex Adriaanse, linux-btrfs



On 15.02.2018 18:18, Alex Adriaanse wrote:
> We've been using Btrfs in production on AWS EC2 with EBS devices for over 2 years. There is so much I love about Btrfs: CoW snapshots, compression, subvolumes, flexibility, the tools, etc. However, lack of stability has been a serious ongoing issue for us, and we're getting to the point that it's becoming hard to justify continuing to use it unless we make some changes that will get it stable. The instability manifests itself mostly in the form of the VM completely crashing, I/O operations freezing, or the filesystem going into readonly mode. We've spent an enormous amount of time trying to recover corrupted filesystems, and the time that servers were down as a result of Btrfs instability has accumulated to many days.
> 
> We've made many changes to try to improve Btrfs stability: upgrading to newer kernels, setting up nightly balances, setting up monitoring to ensure our filesystems stay under 70% utilization, etc. This has definitely helped quite a bit, but even with these things in place it's still unstable. Take https://bugzilla.kernel.org/show_bug.cgi?id=198787 for example, which I created yesterday: we've had 4 VMs (out of 20) go down over the past week alone because of Btrfs errors. Thankfully, no data was lost, but I did have to copy everything over to a new filesystem.

So in all of the cases you are hitting some form of premature enospc.
There was a fix that landed in 4.15 that should have fixed a rather
long-standing issue with the way metadata reservations are satisfied,
namely:

996478ca9c46 ("btrfs: change how we decide to commit transactions during
flushing").

That commit was introduced in the 4.14.3 stable kernel. Since you are not
using an upstream kernel, I'd advise you to check whether the respective
commit is contained in the kernel versions you are using.
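For example, assuming you have a clone of the stable tree handy,
something along these lines should tell you whether the backport is
present (the stable backport carries a different sha1 than the mainline
commit, so match on the subject):

    git log --oneline v4.14..v4.14.13 \
        --grep='change how we decide to commit transactions during flushing'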

Other than that, in the reports you mentioned there is one crash in
__del_reloc_root which looks rather interesting; at the very least it
shouldn't crash...

> Many of our VMs that run Btrfs have a high rate of I/O (both read/write; I/O utilization is often pegged at 100%). The filesystems that get little I/O seem pretty stable, but the ones that undergo a lot of I/O activity are the ones that suffer from the most instability problems. We run the following balances on every filesystem every night:
> 
>     btrfs balance start -dusage=10 <fs>
>     btrfs balance start -dusage=20 <fs>
>     btrfs balance start -dusage=40,limit=100 <fs>
> 
> We also use the following btrfs-snap cronjobs to implement rotating snapshots, with short-term snapshots taking place every 15 minutes and less frequent ones being retained for up to 3 days:
> 
>     0 1-23 * * * /opt/btrfs-snap/btrfs-snap -r <fs> 23
>     15,30,45 * * * * /opt/btrfs-snap/btrfs-snap -r <fs> 15m 3
>     0 0 * * * /opt/btrfs-snap/btrfs-snap -r <fs> daily 3
> 
> Our filesystems are mounted with the "compress=lzo" option.
> 
> Are we doing something wrong? Are there things we should change to improve stability? I wouldn't be surprised if eliminating snapshots would stabilize things, but if we do that we might as well be using a filesystem like XFS. Are there fixes queued up that will solve the problems listed in the Bugzilla ticket referenced above? Or is our I/O-intensive workload just not a good fit for Btrfs?
> 
> Thanks,
> 
> Alex
> 


* Re: Ongoing Btrfs stability issues
  2018-02-15 18:00 ` Nikolay Borisov
@ 2018-02-15 19:41   ` Alex Adriaanse
  2018-02-15 20:42     ` Nikolay Borisov
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Adriaanse @ 2018-02-15 19:41 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs


> On Feb 15, 2018, at 12:00 PM, Nikolay Borisov <nborisov@suse.com> wrote:
> 
> So in all of the cases you are hitting some form of premature enospc.
> There was a fix that landed in 4.15 that should have fixed a rather
> long-standing issue with the way metadata reservations are satisfied,
> namely:
> 
> 996478ca9c46 ("btrfs: change how we decide to commit transactions during
> flushing").
> 
> That commit was introduced in 4.14.3 stable kernel. Since you are not
> using upstream kernel I'd advise you check whether the respective commit
> is contained in the kernel versions you are using.
> 
> Other than that in the reports you mentioned there is one crash in
> __del_reloc_root which looks rather interesting, at the very least it
> shouldn't crash...

I checked the Debian source code that's used for building the kernels that we run, and can confirm that both 4.14.7-1~bpo9+1 and 4.14.13-1~bpo9+1 contain the changes associated with the commit you referenced. So crash instances #2, #3, and #4 at https://bugzilla.kernel.org/show_bug.cgi?id=198787 were all running kernels that contain this fix already.

Could it be that some on-disk data structures got (silently) corrupted while we were running pre-4.14.7 kernels, and the aforementioned fix doesn't address anything relating to damage that has already been done? If so, is there a way to detect and/or repair this for existing filesystems other than running a "btrfs check --repair" or rebuilding filesystems (both of which require a significant amount of downtime)?

Thanks,

Alex


* Re: Ongoing Btrfs stability issues
  2018-02-15 19:41   ` Alex Adriaanse
@ 2018-02-15 20:42     ` Nikolay Borisov
  2018-02-16  4:54       ` Alex Adriaanse
  0 siblings, 1 reply; 31+ messages in thread
From: Nikolay Borisov @ 2018-02-15 20:42 UTC (permalink / raw)
  To: Alex Adriaanse; +Cc: linux-btrfs



On 15.02.2018 21:41, Alex Adriaanse wrote:
> 
>> On Feb 15, 2018, at 12:00 PM, Nikolay Borisov <nborisov@suse.com> wrote:
>>
>> So in all of the cases you are hitting some form of premature enospc.
>> There was a fix that landed in 4.15 that should have fixed a rather
>> long-standing issue with the way metadata reservations are satisfied,
>> namely:
>>
>> 996478ca9c46 ("btrfs: change how we decide to commit transactions during
>> flushing").
>>
>> That commit was introduced in 4.14.3 stable kernel. Since you are not
>> using upstream kernel I'd advise you check whether the respective commit
>> is contained in the kernel versions you are using.
>>
>> Other than that in the reports you mentioned there is one crash in
>> __del_reloc_root which looks rather interesting, at the very least it
>> shouldn't crash...
> 
> I checked the Debian source code that's used for building the kernels that we run, and can confirm that both 4.14.7-1~bpo9+1 and 4.14.13-1~bpo9+1 contain the changes associated with the commit you referenced. So crash instances #2, #3, and #4 at https://bugzilla.kernel.org/show_bug.cgi?id=198787 were all running kernels that contain this fix already.
> 
> Could it be that some on-disk data structures got (silently) corrupted while we were running pre-4.14.7 kernels, and the aforementioned fix doesn't address anything relating to damage that has already been done? If so, is there a way to detect and/or repair this for existing filesystems other than running a "btrfs check --repair" or rebuilding filesystems (both of which require a significant amount of downtime)?

From the logs provided I can see only a single crash; the others are
just ENOSPC, which can cause corruption due to delayed refs (in the
majority of cases) not finishing. Is btrfs hosted on the EBS volume or
on the ephemeral storage of the instance? Is the EBS volume an SSD? If
it's an SSD, are you using an I/O scheduler for those EBS devices? You
can check what the I/O scheduler for a device is by reading the
following sysfs file:

/sys/block/<disk device>/queue/scheduler
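
For example, to dump it for all of your devices at once (assuming they
all show up as xvd*):

    for f in /sys/block/xvd*/queue/scheduler; do echo "$f: $(cat $f)"; done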


> 
> Thanks,
> 
> Alex
> 


* Re: Ongoing Btrfs stability issues
  2018-02-15 20:42     ` Nikolay Borisov
@ 2018-02-16  4:54       ` Alex Adriaanse
  2018-02-16  7:40         ` Nikolay Borisov
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Adriaanse @ 2018-02-16  4:54 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs


> On Feb 15, 2018, at 2:42 PM, Nikolay Borisov <nborisov@suse.com> wrote:
> 
> On 15.02.2018 21:41, Alex Adriaanse wrote:
>> 
>>> On Feb 15, 2018, at 12:00 PM, Nikolay Borisov <nborisov@suse.com> wrote:
>>> 
>>> So in all of the cases you are hitting some form of premature enospc.
>>> There was a fix that landed in 4.15 that should have fixed a rather
>>> long-standing issue with the way metadata reservations are satisfied,
>>> namely:
>>> 
>>> 996478ca9c46 ("btrfs: change how we decide to commit transactions during
>>> flushing").
>>> 
>>> That commit was introduced in 4.14.3 stable kernel. Since you are not
>>> using upstream kernel I'd advise you check whether the respective commit
>>> is contained in the kernel versions you are using.
>>> 
>>> Other than that in the reports you mentioned there is one crash in
>>> __del_reloc_root which looks rather interesting, at the very least it
>>> shouldn't crash...
>> 
>> I checked the Debian source code that's used for building the kernels that we run, and can confirm that both 4.14.7-1~bpo9+1 and 4.14.13-1~bpo9+1 contain the changes associated with the commit you referenced. So crash instances #2, #3, and #4 at https://bugzilla.kernel.org/show_bug.cgi?id=198787 were all running kernels that contain this fix already.
>> 
>> Could it be that some on-disk data structures got (silently) corrupted while we were running pre-4.14.7 kernels, and the aforementioned fix doesn't address anything relating to damage that has already been done? If so, is there a way to detect and/or repair this for existing filesystems other than running a "btrfs check --repair" or rebuilding filesystems (both of which require a significant amount of downtime)?
> 
> From the logs provided I can see only a single crash, the others are
> just ENOSPC which can cause corruption due to delayed refs (in majority
> of examples) not finishing. Is btrfs hosted on the EBS volume or on the
> ephemeral storage of the instance? Is the EBS an ssd? If it's ssd are
> you using an io scheduler for those ebs devices? You ca check what the
> io scheduler for a device is by reading the following sysfs file:
> 
> /sys/block/<disk device>/queue/scheduler

It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS volumes are all SSD. We didn't change the default schedulers on the VMs and it looks like it's using mq-deadline:

$ cat /sys/block/xvdc/queue/scheduler
[mq-deadline] none

Alex


* Re: Ongoing Btrfs stability issues
  2018-02-16  4:54       ` Alex Adriaanse
@ 2018-02-16  7:40         ` Nikolay Borisov
  0 siblings, 0 replies; 31+ messages in thread
From: Nikolay Borisov @ 2018-02-16  7:40 UTC (permalink / raw)
  To: Alex Adriaanse; +Cc: linux-btrfs



On 16.02.2018 06:54, Alex Adriaanse wrote:
> 
>> On Feb 15, 2018, at 2:42 PM, Nikolay Borisov <nborisov@suse.com> wrote:
>>
>> On 15.02.2018 21:41, Alex Adriaanse wrote:
>>>
>>>> On Feb 15, 2018, at 12:00 PM, Nikolay Borisov <nborisov@suse.com> wrote:
>>>>
>>>> So in all of the cases you are hitting some form of premature enospc.
>>>> There was a fix that landed in 4.15 that should have fixed a rather
>>>> long-standing issue with the way metadata reservations are satisfied,
>>>> namely:
>>>>
>>>> 996478ca9c46 ("btrfs: change how we decide to commit transactions during
>>>> flushing").
>>>>
>>>> That commit was introduced in 4.14.3 stable kernel. Since you are not
>>>> using upstream kernel I'd advise you check whether the respective commit
>>>> is contained in the kernel versions you are using.
>>>>
>>>> Other than that in the reports you mentioned there is one crash in
>>>> __del_reloc_root which looks rather interesting, at the very least it
>>>> shouldn't crash...
>>>
>>> I checked the Debian source code that's used for building the kernels that we run, and can confirm that both 4.14.7-1~bpo9+1 and 4.14.13-1~bpo9+1 contain the changes associated with the commit you referenced. So crash instances #2, #3, and #4 at https://bugzilla.kernel.org/show_bug.cgi?id=198787 were all running kernels that contain this fix already.
>>>
>>> Could it be that some on-disk data structures got (silently) corrupted while we were running pre-4.14.7 kernels, and the aforementioned fix doesn't address anything relating to damage that has already been done? If so, is there a way to detect and/or repair this for existing filesystems other than running a "btrfs check --repair" or rebuilding filesystems (both of which require a significant amount of downtime)?
>>
>> From the logs provided I can see only a single crash, the others are
>> just ENOSPC which can cause corruption due to delayed refs (in majority
>> of examples) not finishing. Is btrfs hosted on the EBS volume or on the
>> ephemeral storage of the instance? Is the EBS an ssd? If it's ssd are
>> you using an io scheduler for those ebs devices? You ca check what the
>> io scheduler for a device is by reading the following sysfs file:
>>
>> /sys/block/<disk device>/queue/scheduler
> 
> It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS volumes are all SSD. We didn't change the default schedulers on the VMs and it looks like it's using mq-deadline:
> 
> $ cat /sys/block/xvdc/queue/scheduler
> [mq-deadline] none

So one thing I can advise you to test is setting the scheduler for that
xvdc device to none. Next, I'd advise you to backport the following
patch to your kernel:

https://github.com/kdave/btrfs-devel/commit/1b816c23e91f70603c532af52cccf17e68393682

then mount the filesystem with -o enospc_debug. The next time an ENOSPC
occurs, additional info should be printed in dmesg with the state of the
space_info structure.
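
Something along these lines (xvdc and the mount point are placeholders;
enospc_debug can also be enabled via a remount):

    echo none > /sys/block/xvdc/queue/scheduler
    mount -o remount,enospc_debug /path/to/fs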

> 
> Alex
> 


* Re: Ongoing Btrfs stability issues
  2018-02-15 16:18 Ongoing Btrfs stability issues Alex Adriaanse
  2018-02-15 18:00 ` Nikolay Borisov
@ 2018-02-16 19:44 ` Austin S. Hemmelgarn
  2018-02-17  3:03   ` Duncan
  2018-03-01 19:04   ` Alex Adriaanse
  1 sibling, 2 replies; 31+ messages in thread
From: Austin S. Hemmelgarn @ 2018-02-16 19:44 UTC (permalink / raw)
  To: Alex Adriaanse, linux-btrfs

On 2018-02-15 11:18, Alex Adriaanse wrote:
> We've been using Btrfs in production on AWS EC2 with EBS devices for over 2 years. There is so much I love about Btrfs: CoW snapshots, compression, subvolumes, flexibility, the tools, etc. However, lack of stability has been a serious ongoing issue for us, and we're getting to the point that it's becoming hard to justify continuing to use it unless we make some changes that will get it stable. The instability manifests itself mostly in the form of the VM completely crashing, I/O operations freezing, or the filesystem going into readonly mode. We've spent an enormous amount of time trying to recover corrupted filesystems, and the time that servers were down as a result of Btrfs instability has accumulated to many days.
> 
> We've made many changes to try to improve Btrfs stability: upgrading to newer kernels, setting up nightly balances, setting up monitoring to ensure our filesystems stay under 70% utilization, etc. This has definitely helped quite a bit, but even with these things in place it's still unstable. Take https://bugzilla.kernel.org/show_bug.cgi?id=198787 for example, which I created yesterday: we've had 4 VMs (out of 20) go down over the past week alone because of Btrfs errors. Thankfully, no data was lost, but I did have to copy everything over to a new filesystem.
> 
> Many of our VMs that run Btrfs have a high rate of I/O (both read/write; I/O utilization is often pegged at 100%). The filesystems that get little I/O seem pretty stable, but the ones that undergo a lot of I/O activity are the ones that suffer from the most instability problems. We run the following balances on every filesystem every night:
> 
>      btrfs balance start -dusage=10 <fs>
>      btrfs balance start -dusage=20 <fs>
>      btrfs balance start -dusage=40,limit=100 <fs>
I would suggest changing this to eliminate the balance with '-dusage=10' 
(it's redundant with the '-dusage=20' one unless your filesystem is in 
pathologically bad shape), and adding equivalent filters for balancing 
metadata (which generally goes pretty fast).

Unless you've got a huge filesystem, you can also cut down on that limit 
filter.  100 data chunks that are 40% full is up to 40GB of data to move 
on a normally sized filesystem, or potentially up to 200GB if you've got 
a really big filesystem (I forget what point BTRFS starts scaling up 
chunk sizes at, but I'm pretty sure it's in the TB range).
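
In other words, something along these lines for the nightly run (the
exact thresholds are a judgment call, not gospel):

    btrfs balance start -dusage=20 <fs>
    btrfs balance start -dusage=40,limit=10 <fs>
    btrfs balance start -musage=20 <fs>
    btrfs balance start -musage=40,limit=10 <fs>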
> 
> We also use the following btrfs-snap cronjobs to implement rotating snapshots, with short-term snapshots taking place every 15 minutes and less frequent ones being retained for up to 3 days:
> 
>      0 1-23 * * * /opt/btrfs-snap/btrfs-snap -r <fs> 23
>      15,30,45 * * * * /opt/btrfs-snap/btrfs-snap -r <fs> 15m 3
>      0 0 * * * /opt/btrfs-snap/btrfs-snap -r <fs> daily 3
> 
> Our filesystems are mounted with the "compress=lzo" option.
> 
> Are we doing something wrong? Are there things we should change to improve stability? I wouldn't be surprised if eliminating snapshots would stabilize things, but if we do that we might as well be using a filesystem like XFS. Are there fixes queued up that will solve the problems listed in the Bugzilla ticket referenced above? Or is our I/O-intensive workload just not a good fit for Btrfs?

This will probably sound like an odd question, but does BTRFS think your 
storage devices are SSDs or not?  Based on what you're saying, it 
sounds like you're running into issues resulting from the 
over-aggressive SSD 'optimizations' that were done by BTRFS until very 
recently.

You can verify if this is what's causing your problems or not by either 
upgrading to a recent mainline kernel version (I know the changes are in 
4.15, I don't remember for certain if they're in 4.14 or not, but I 
think they are), or by adding 'nossd' to your mount options, and then 
seeing if you still have the problems or not (I suspect this is only 
part of it, and thus changing this will reduce the issues, but not 
completely eliminate them).  Make sure to run a full balance after 
changing either item, as the aforementioned 'optimizations' have an 
impact on how data is organized on-disk (which is ultimately what causes 
the issues), so they will have a lingering effect if you don't balance 
everything.

'autodefrag' is the other mount option that I would try toggling (turn 
it off if you've got it on, or on if you've got it off).  I doubt it 
will have much impact, but it does change how things end up on disk.

In addition to all that, make sure your monitoring isn't just looking 
at the regular `df` command's output; it's woefully insufficient for 
monitoring space usage on BTRFS.  If you want to check things properly, 
you want to be looking at the data in /sys/fs/btrfs/<UUID>/allocation, 
more specifically checking the following percentages:

1. The sum of the values in /sys/fs/btrfs/<UUID>/allocation/*/disk_total 
relative to the sum total of the size of the block devices for the 
filesystem.
2. The ratio of /sys/fs/btrfs/<UUID>/allocation/data/bytes_used to 
/sys/fs/btrfs/<UUID>/allocation/data/total_bytes.
3. The ratio of /sys/fs/btrfs/<UUID>/allocation/metadata/bytes_used to 
/sys/fs/btrfs/<UUID>/allocation/metadata/total_bytes.
4. The ratio of /sys/fs/btrfs/<UUID>/allocation/system/bytes_used to 
/sys/fs/btrfs/<UUID>/allocation/system/total_bytes.

Regular `df` effectively reports the total of items 2, 3, and 4, but 
those don't really matter unless item 1 is also close to 100%.

Of those, you ideally want the first percentage (which is how much space 
BTRFS has allocated at the upper level in its two-level allocator) to 
be less than 80 or 90 percent (unless you have a very small filesystem, 
in which case you should try to keep about 2.5GB free, which is enough 
for two data allocations and two metadata allocations), and in most 
cases you want the other three to be as close to 100% as possible 
(higher values there mean less space wasted at the higher allocation 
level).  If that first percentage gets up to 100%, and one of the others 
hits 100%, you'll get -ENOSPC for certain system calls (which ones 
depends on which of the others is at 100%), but traditional `df` can 
still show much less than 100% utilization.
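
As a rough sketch of the kind of check I mean (a single-device
filesystem is assumed, the UUID and device are placeholders, and you'd
obviously want to feed this into whatever monitoring you already use):

    #!/bin/sh
    # Report BTRFS two-level allocation usage from sysfs.
    FSID="<UUID>"      # placeholder: the filesystem UUID
    DEV=/dev/xvdc      # placeholder: the backing block device
    ALLOC=/sys/fs/btrfs/$FSID/allocation

    # 1. chunk-level allocation relative to the raw device size
    devsize=$(blockdev --getsize64 "$DEV")
    allocated=$(cat "$ALLOC"/*/disk_total | awk '{ s += $1 } END { print s }')
    echo "chunks allocated: $((allocated * 100 / devsize))% of device"

    # 2-4. usage within each allocated chunk type
    for t in data metadata system; do
        used=$(cat "$ALLOC/$t/bytes_used")
        total=$(cat "$ALLOC/$t/total_bytes")
        echo "$t: $((used * 100 / total))% of allocated chunks used"
    done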


* Re: Ongoing Btrfs stability issues
  2018-02-16 19:44 ` Austin S. Hemmelgarn
@ 2018-02-17  3:03   ` Duncan
  2018-02-17  4:34     ` Shehbaz Jaffer
  2018-03-01 19:04   ` Alex Adriaanse
  1 sibling, 1 reply; 31+ messages in thread
From: Duncan @ 2018-02-17  3:03 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
excerpted:

> This will probably sound like an odd question, but does BTRFS think your
> storage devices are SSD's or not?  Based on what you're saying, it
> sounds like you're running into issues resulting from the
> over-aggressive SSD 'optimizations' that were done by BTRFS until very
> recently.
> 
> You can verify if this is what's causing your problems or not by either
> upgrading to a recent mainline kernel version (I know the changes are in
> 4.15, I don't remember for certain if they're in 4.14 or not, but I
> think they are), or by adding 'nossd' to your mount options, and then
> seeing if you still have the problems or not (I suspect this is only
> part of it, and thus changing this will reduce the issues, but not
> completely eliminate them).  Make sure and run a full balance after
> changing either item, as the aforementioned 'optimizations' have an
> impact on how data is organized on-disk (which is ultimately what causes
> the issues), so they will have a lingering effect if you don't balance
> everything.

According to the wiki, 4.14 does indeed have the ssd changes.

According to the bug, he's running 4.13.x on one server and 4.14.x on 
two.  So upgrading the one to 4.14.x should mean all will have that fix.

However, without a full balance it /will/ take some time to settle down 
(again, assuming btrfs was using ssd mode), so the lingering effect could 
still be creating problems on the 4.14 kernel servers for the moment.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Ongoing Btrfs stability issues
  2018-02-17  3:03   ` Duncan
@ 2018-02-17  4:34     ` Shehbaz Jaffer
  2018-02-17 15:18       ` Hans van Kranenburg
  0 siblings, 1 reply; 31+ messages in thread
From: Shehbaz Jaffer @ 2018-02-17  4:34 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

> It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS volumes are all SSD.

I have recently done some SSD corruption experiments on a small set of
workloads, so I thought I would share my experience.

When creating btrfs with the mkfs.btrfs command on SSDs, metadata
duplication is disabled by default. This renders btrfs scrubbing
ineffective, as there is no redundant copy of the metadata to restore
corrupted metadata from.
So if there are any errors during a read operation on an SSD, unlike on
an HDD where the corruption would be repaired on the fly when the
checksum error is detected, on an SSD the read would fail as an
uncorrectable error.

Could you confirm whether metadata DUP is enabled on your system by
running the following command:

$ btrfs fi df /mnt    # /mnt is the mount point
Data, single: total=8.00MiB, used=64.00KiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=168.00MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

If metadata is single in your case as well (and not DUP), that may be
why btrfs scrub is not working effectively on the fly (mid-stream
bit-rot correction), causing reliability issues. A couple of such bugs
observed specifically on SSDs are reported here:

https://bugzilla.kernel.org/show_bug.cgi?id=198463
https://bugzilla.kernel.org/show_bug.cgi?id=198807

These do not occur on HDDs, and I believe they should not occur when
the filesystem is mounted in nossd mode.
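
If it does turn out to be single, one way to get redundant metadata
(assuming you have enough free space) would be something like:

    # convert metadata of an existing, mounted filesystem to DUP
    btrfs balance start -mconvert=dup <fs>

    # or force it at filesystem creation time
    mkfs.btrfs -m dup <device>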

On Fri, Feb 16, 2018 at 10:03 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
> excerpted:
>
>> This will probably sound like an odd question, but does BTRFS think your
>> storage devices are SSD's or not?  Based on what you're saying, it
>> sounds like you're running into issues resulting from the
>> over-aggressive SSD 'optimizations' that were done by BTRFS until very
>> recently.
>>
>> You can verify if this is what's causing your problems or not by either
>> upgrading to a recent mainline kernel version (I know the changes are in
>> 4.15, I don't remember for certain if they're in 4.14 or not, but I
>> think they are), or by adding 'nossd' to your mount options, and then
>> seeing if you still have the problems or not (I suspect this is only
>> part of it, and thus changing this will reduce the issues, but not
>> completely eliminate them).  Make sure and run a full balance after
>> changing either item, as the aforementioned 'optimizations' have an
>> impact on how data is organized on-disk (which is ultimately what causes
>> the issues), so they will have a lingering effect if you don't balance
>> everything.
>
> According to the wiki, 4.14 does indeed have the ssd changes.
>
> According to the bug, he's running 4.13.x on one server and 4.14.x on
> two.  So upgrading the one to 4.14.x should mean all will have that fix.
>
> However, without a full balance it /will/ take some time to settle down
> (again, assuming btrfs was using ssd mode), so the lingering effect could
> still be creating problems on the 4.14 kernel servers for the moment.
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>



-- 
Shehbaz Jaffer


* Re: Ongoing Btrfs stability issues
  2018-02-17  4:34     ` Shehbaz Jaffer
@ 2018-02-17 15:18       ` Hans van Kranenburg
  2018-02-17 16:42         ` Shehbaz Jaffer
  0 siblings, 1 reply; 31+ messages in thread
From: Hans van Kranenburg @ 2018-02-17 15:18 UTC (permalink / raw)
  To: Shehbaz Jaffer, Duncan; +Cc: linux-btrfs

On 02/17/2018 05:34 AM, Shehbaz Jaffer wrote:
>> It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS volumes are all SSD
> 
> I have recently done some SSD corruption experiments on small set of
> workloads, so I thought I would share my experience.
> 
> While creating btrfs using mkfs.btrfs command for SSDs, by default the
> metadata duplication option is disabled. this renders btrfs-scrubbing
> ineffective, as there are no redundant metadata to restore corrupted
> metadata from.
> So if there are any errors during read operation on SSD, unlike HDD
> where the corruptions would be handled by btrfs scrub on the fly while
> detecting checksum error, for SSD the read would fail as uncorrectable
> error.

First of all, the ssd mount option does not have anything to do with
having single or DUP metadata.

Well, both the things that happen by default (mkfs using single, mount
enabling the ssd option) are happening because of the lookup result on
the rotational flag, but that's all.
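
You can see what the kernel reports for that flag with something like
the following (xvdc being whichever device backs the filesystem); 0
means the device is treated as non-rotational:

    cat /sys/block/xvdc/queue/rotational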

> Could you confirm if metadata DUP is enabled for your system by
> running the following cmd:
> 
> $btrfs fi df /mnt # mount is the mount point
> Data, single: total=8.00MiB, used=64.00KiB
> System, single: total=4.00MiB, used=16.00KiB
> Metadata, single: total=168.00MiB, used=112.00KiB
> GlobalReserve, single: total=16.00MiB, used=0.00B
> 
> If metadata is single in your case as well (and not DUP), that may be
> the problem for btrfs-scrub not working effectively on the fly
> (mid-stream bit-rot correction), causing reliability issues. A couple
> of such bugs that are observed specifically for SSDs is reported here:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=198463
> https://bugzilla.kernel.org/show_bug.cgi?id=198807

Here you show that when you have 'single' metadata, there's no copy to
recover from. This is expected.

Also, instead of physically damaging flash cells inside your SSD, you
are writing data to a perfectly working one. This is a different failure
scenario.

One of the reasons to turn off DUP for metadata by default on SSD is
(from man mkfs.btrfs):

    "The controllers may put data written in a short timespan into the
same physical storage unit (cell, block etc). In case this unit dies,
both copies are lost. BTRFS does not add any artificial delay between
metadata writes." .. "The traditional rotational hard drives usually
fail at the sector level."

And, of course, in case of EBS, you don't have any idea at all where the
data actually ends up, since it's talking to a black box service, and
not an SSD.

In any case, using DUP instead of single obviously increases the chance
of recovery from failures that corrupt one copy of the data while it's
travelling between system memory and disk, even though you're sending
the two copies right after each other, so you're totally right that it's
better to enable it.

> These do not occur for HDD, and I believe should not occur when
> filesystem is mounted with nossd mode.

So to reiterate, mounting nossd does not make your metadata writes DUP.

> On Fri, Feb 16, 2018 at 10:03 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
>> excerpted:
>>
>>> This will probably sound like an odd question, but does BTRFS think your
>>> storage devices are SSD's or not?  Based on what you're saying, it
>>> sounds like you're running into issues resulting from the
>>> over-aggressive SSD 'optimizations' that were done by BTRFS until very
>>> recently.
>>>
>>> You can verify if this is what's causing your problems or not by either
>>> upgrading to a recent mainline kernel version (I know the changes are in
>>> 4.15, I don't remember for certain if they're in 4.14 or not, but I
>>> think they are), or by adding 'nossd' to your mount options, and then
>>> seeing if you still have the problems or not (I suspect this is only
>>> part of it, and thus changing this will reduce the issues, but not
>>> completely eliminate them).  Make sure and run a full balance after
>>> changing either item, as the aforementioned 'optimizations' have an
>>> impact on how data is organized on-disk (which is ultimately what causes
>>> the issues), so they will have a lingering effect if you don't balance
>>> everything.
>>
>> According to the wiki, 4.14 does indeed have the ssd changes.
>>
>> According to the bug, he's running 4.13.x on one server and 4.14.x on
>> two.  So upgrading the one to 4.14.x should mean all will have that fix.
>>
>> However, without a full balance it /will/ take some time to settle down
>> (again, assuming btrfs was using ssd mode), so the lingering effect could
>> still be creating problems on the 4.14 kernel servers for the moment.

-- 
Hans van Kranenburg


* Re: Ongoing Btrfs stability issues
  2018-02-17 15:18       ` Hans van Kranenburg
@ 2018-02-17 16:42         ` Shehbaz Jaffer
  0 siblings, 0 replies; 31+ messages in thread
From: Shehbaz Jaffer @ 2018-02-17 16:42 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Duncan, linux-btrfs

> First of all, the ssd mount option does not have anything to do with
> having single or DUP metadata.

Sorry about that, I agree with you. nossd would not help in increasing
reliability in any way. One alternative would be to force duplication of
metadata when creating the filesystem on the SSD, but again, as you
described, there is the likelihood of consecutive writes of the original
and the copy of the metadata going to the same cell, which may not end
up giving us good reliability.

> Also, instead of physically damaging flash cells inside your SSD, you
> are writing data to a perfectly working one. This is a different failure
> scenario.

By writing data to a working SSD, I am emulating byte and block
corruptions, which is a valid failure scenario. In this case, the read
operation on the SSD completes successfully (no EIO from the device),
but internally the blocks read back are corrupted. Here, btrfs detects
the checksum failures and tries to correct them using the scrubber, but
fails to do so on the SSD.

For the scenario of physically damaged flash cells that you mentioned,
I am currently performing experiments where I inject -EIO at places
where btrfs tries reading or writing a block. This is to see how btrfs
handles a failed block access to a damaged cell. Would that cover the
failure scenario you described? If not, could you elaborate on other
alternatives to emulate physically damaged flash cells?

> In any case, using DUP instead of single obviously increases the chance
> of recovery in case of failures that corrupt one copy of the data when
> it's travelling between system memory and disk, while you're sending two
> of them right after each other, so you're totally right that it's better
> to enable.

Yes, DUP is better than single; however, as you correctly pointed out,
it may not be the perfect solution to the problem.


On Sat, Feb 17, 2018 at 10:18 AM, Hans van Kranenburg
<hans.van.kranenburg@mendix.com> wrote:
> On 02/17/2018 05:34 AM, Shehbaz Jaffer wrote:
>>> It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS volumes are all SSD
>>
>> I have recently done some SSD corruption experiments on small set of
>> workloads, so I thought I would share my experience.
>>
>> While creating btrfs using mkfs.btrfs command for SSDs, by default the
>> metadata duplication option is disabled. this renders btrfs-scrubbing
>> ineffective, as there are no redundant metadata to restore corrupted
>> metadata from.
>> So if there are any errors during read operation on SSD, unlike HDD
>> where the corruptions would be handled by btrfs scrub on the fly while
>> detecting checksum error, for SSD the read would fail as uncorrectable
>> error.
>
> First of all, the ssd mount option does not have anything to do with
> having single or DUP metadata.
>
> Well, both the things that happen by default (mkfs using single, mount
> enabling the ssd option) are happening because of the lookup result on
> the rotational flag, but that's all.
>
>> Could you confirm if metadata DUP is enabled for your system by
>> running the following cmd:
>>
>> $btrfs fi df /mnt # mount is the mount point
>> Data, single: total=8.00MiB, used=64.00KiB
>> System, single: total=4.00MiB, used=16.00KiB
>> Metadata, single: total=168.00MiB, used=112.00KiB
>> GlobalReserve, single: total=16.00MiB, used=0.00B
>>
>> If metadata is single in your case as well (and not DUP), that may be
>> the problem for btrfs-scrub not working effectively on the fly
>> (mid-stream bit-rot correction), causing reliability issues. A couple
>> of such bugs that are observed specifically for SSDs is reported here:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=198463
>> https://bugzilla.kernel.org/show_bug.cgi?id=198807
>
> Here you show that when you have 'single' metadata, there's no copy to
> recover from. This is expected.
>
> Also, instead of physically damaging flash cells inside your SSD, you
> are writing data to a perfectly working one. This is a different failure
> scenario.
>
> One of the reasons to turn off DUP for metadata by default on SSD is
> (from man mkfs.btrfs):
>
>     "The controllers may put data written in a short timespan into the
> same physical storage unit (cell, block etc). In case this unit dies,
> both copies are lost. BTRFS does not add any artificial delay between
> metadata writes." .. "The traditional rotational hard drives usually
> fail at the sector level."
>
> And, of course, in case of EBS, you don't have any idea at all where the
> data actually ends up, since it's talking to a black box service, and
> not an SSD.
>
> In any case, using DUP instead of single obviously increases the chance
> of recovery in case of failures that corrupt one copy of the data when
> it's travelling between system memory and disk, while you're sending two
> of them right after each other, so you're totally right that it's better
> to enable.
>
>> These do not occur for HDD, and I believe should not occur when
>> filesystem is mounted with nossd mode.
>
> So to reiterate, mounting nossd does not make your metadata writes DUP.
>
>> On Fri, Feb 16, 2018 at 10:03 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>>> Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
>>> excerpted:
>>>
>>>> This will probably sound like an odd question, but does BTRFS think your
>>>> storage devices are SSD's or not?  Based on what you're saying, it
>>>> sounds like you're running into issues resulting from the
>>>> over-aggressive SSD 'optimizations' that were done by BTRFS until very
>>>> recently.
>>>>
>>>> You can verify if this is what's causing your problems or not by either
>>>> upgrading to a recent mainline kernel version (I know the changes are in
>>>> 4.15, I don't remember for certain if they're in 4.14 or not, but I
>>>> think they are), or by adding 'nossd' to your mount options, and then
>>>> seeing if you still have the problems or not (I suspect this is only
>>>> part of it, and thus changing this will reduce the issues, but not
>>>> completely eliminate them).  Make sure and run a full balance after
>>>> changing either item, as the aforementioned 'optimizations' have an
>>>> impact on how data is organized on-disk (which is ultimately what causes
>>>> the issues), so they will have a lingering effect if you don't balance
>>>> everything.
>>>
>>> According to the wiki, 4.14 does indeed have the ssd changes.
>>>
>>> According to the bug, he's running 4.13.x on one server and 4.14.x on
>>> two.  So upgrading the one to 4.14.x should mean all will have that fix.
>>>
>>> However, without a full balance it /will/ take some time to settle down
>>> (again, assuming btrfs was using ssd mode), so the lingering effect could
>>> still be creating problems on the 4.14 kernel servers for the moment.
>
> --
> Hans van Kranenburg



-- 
Shehbaz Jaffer


* Re: Ongoing Btrfs stability issues
  2018-02-16 19:44 ` Austin S. Hemmelgarn
  2018-02-17  3:03   ` Duncan
@ 2018-03-01 19:04   ` Alex Adriaanse
  2018-03-01 19:40     ` Nikolay Borisov
  2018-03-02  4:02     ` Qu Wenruo
  1 sibling, 2 replies; 31+ messages in thread
From: Alex Adriaanse @ 2018-03-01 19:04 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
> I would suggest changing this to eliminate the balance with '-dusage=10' (it's redundant with the '-dusage=20' one unless your filesystem is in pathologically bad shape), and adding equivalent filters for balancing metadata (which generally goes pretty fast).
> 
> Unless you've got a huge filesystem, you can also cut down on that limit filter.  100 data chunks that are 40% full is up to 40GB of data to move on a normally sized filesystem, or potentially up to 200GB if you've got a really big filesystem (I forget what point BTRFS starts scaling up chunk sizes at, but I'm pretty sure it's in the TB range).

Thanks so much for the suggestions so far, everyone. I wanted to report back on this. Last Friday I made the following changes per suggestions from this thread:

1. Change the nightly balance to the following:

    btrfs balance start -dusage=20 <fs>
    btrfs balance start -dusage=40,limit=10 <fs>
    btrfs balance start -musage=30 <fs>

2. Upgrade kernels for all VMs to 4.14.13-1~bpo9+1, which contains the SSD space allocation fix.

3. Boot Linux with the elevator=noop option

4. Change /sys/block/xvd*/queue/scheduler to "none"

5. Mount all our Btrfs filesystems with the "enospc_debug" option.

6. I did NOT add the "nossd" flag because I didn't think it'd make much of a difference after that SSD space allocation fix.

7. After applying the above changes, I ran a full balance on all the Btrfs filesystems. I also have not experimented with autodefrag yet.


Despite the changes above, we just experienced another crash this morning. Kernel message (with enospc_debug turned on for the given mountpoint):

[496003.170278] use_block_rsv: 46 callbacks suppressed
[496003.170279] BTRFS: block rsv returned -28
[496003.173875] ------------[ cut here ]------------
[496003.177186] WARNING: CPU: 2 PID: 362 at /build/linux-3RM5ap/linux-4.14.13/fs/btrfs/extent-tree.c:8458 btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[496003.185369] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
[496003.218484] CPU: 2 PID: 362 Comm: btrfs-transacti Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
[496003.224618] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[496003.228702] task: ffff8fc0fb6bd0c0 task.stack: ffff9e81c3ac0000
[496003.233081] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[496003.237220] RSP: 0018:ffff9e81c3ac3958 EFLAGS: 00010282
[496003.241404] RAX: 000000000000001d RBX: ffff8fc0fbeac128 RCX: 0000000000000000
[496003.248004] RDX: 0000000000000000 RSI: ffff8fc100a966f8 RDI: ffff8fc100a966f8
[496003.253896] RBP: 0000000000004000 R08: 0000000000000001 R09: 000000000001667b
[496003.258508] R10: 0000000000000001 R11: 000000000001667b R12: ffff8fc0fbeac000
[496003.264759] R13: ffff8fc0fac22800 R14: 0000000000000001 R15: 00000000ffffffe4
[496003.271203] FS:  0000000000000000(0000) GS:ffff8fc100a80000(0000) knlGS:0000000000000000
[496003.278169] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[496003.283917] CR2: 00007efe00f36000 CR3: 0000000102a0a001 CR4: 00000000001606e0
[496003.290309] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[496003.296985] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[496003.303335] Call Trace:
[496003.307113]  ? __pagevec_lru_add_fn+0x270/0x270
[496003.312126]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
[496003.316995]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
[496003.321568]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
[496003.326684]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
[496003.332724]  ? set_extent_bit+0x19/0x20 [btrfs]
[496003.337991]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
[496003.343436]  ? btrfs_merge_delayed_refs+0x8f/0x560 [btrfs]
[496003.349322]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
[496003.355157]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
[496003.360707]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[496003.366022]  ? remove_wait_queue+0x60/0x60
[496003.370898]  transaction_kthread+0x195/0x1b0 [btrfs]
[496003.376411]  kthread+0xfc/0x130
[496003.380741]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[496003.386404]  ? kthread_create_on_node+0x70/0x70
[496003.391287]  ? do_group_exit+0x3a/0xa0
[496003.396201]  ret_from_fork+0x1f/0x30
[496003.400779] Code: ff 48 c7 c6 28 b7 4c c0 48 c7 c7 a0 11 52 c0 e8 2c b0 43 c2 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 38 29 4d c0 e8 90 0f ea c1 <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 6e eb
[496003.416648] ---[ end trace 6f05416539a50c4c ]---
[496003.422202] BTRFS: Transaction aborted (error -28)
[496003.422227] ------------[ cut here ]------------
[496003.427612] WARNING: CPU: 2 PID: 362 at /build/linux-3RM5ap/linux-4.14.13/fs/btrfs/extent-tree.c:7076 __btrfs_free_extent.isra.61+0xaed/0xd30 [btrfs]
[496003.440365] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
[496003.492382] CPU: 2 PID: 362 Comm: btrfs-transacti Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
[496003.502085] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[496003.508078] task: ffff8fc0fb6bd0c0 task.stack: ffff9e81c3ac0000
[496003.514294] RIP: 0010:__btrfs_free_extent.isra.61+0xaed/0xd30 [btrfs]
[496003.520807] RSP: 0018:ffff9e81c3ac3c30 EFLAGS: 00010286
[496003.525146] RAX: 0000000000000026 RBX: 000000d9dc6cc000 RCX: 0000000000000000
[496003.530148] RDX: 0000000000000000 RSI: ffff8fc100a966f8 RDI: ffff8fc100a966f8
[496003.535193] RBP: 00000000ffffffe4 R08: 0000000000000001 R09: 00000000000166a3
[496003.540516] R10: 0000000000000001 R11: 00000000000166a3 R12: ffff8fc0fbeac000
[496003.545968] R13: ffff8fba3bedd930 R14: 0000000000000000 R15: 0000000000000002
[496003.551066] FS:  0000000000000000(0000) GS:ffff8fc100a80000(0000) knlGS:0000000000000000
[496003.557073] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[496003.561445] CR2: 00007efe00f36000 CR3: 0000000102a0a001 CR4: 00000000001606e0
[496003.566625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[496003.571756] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[496003.576948] Call Trace:
[496003.579607]  ? btrfs_merge_delayed_refs+0x8f/0x560 [btrfs]
[496003.583885]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
[496003.588099]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
[496003.592474]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[496003.596766]  ? remove_wait_queue+0x60/0x60
[496003.600288]  transaction_kthread+0x195/0x1b0 [btrfs]
[496003.605278]  kthread+0xfc/0x130
[496003.608768]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[496003.614107]  ? kthread_create_on_node+0x70/0x70
[496003.617867]  ? do_group_exit+0x3a/0xa0
[496003.621231]  ret_from_fork+0x1f/0x30
[496003.624472] Code: 13 1b 0a 00 e9 39 f9 ff ff 8b 74 24 30 48 c7 c7 a8 24 4d c0 e8 a0 b7 ea c1 0f ff eb cc 89 ee 48 c7 c7 a8 24 4d c0 e8 8e b7 ea c1 <0f> ff e9 f9 f8 ff ff 8b 94 24 c0 00 00 00 48 89 c1 49 89 d8 48
[496003.637764] ---[ end trace 6f05416539a50c4d ]---
[496003.641729] BTRFS: error (device xvdc) in __btrfs_free_extent:7076: errno=-28 No space left
[496003.641994] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9332: errno=-28 No space left
[496003.641996] BTRFS info (device xvdc): forced readonly
[496003.641998] BTRFS: error (device xvdc) in merge_reloc_roots:2470: errno=-28 No space left
[496003.642060] BUG: unable to handle kernel NULL pointer dereference at           (null)
[496003.642086] IP: __del_reloc_root+0x3c/0x100 [btrfs]
[496003.642087] PGD 80000005fe08c067 P4D 80000005fe08c067 PUD 3bd2f4067 PMD 0
[496003.642091] Oops: 0000 [#1] SMP PTI
[496003.642093] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
[496003.642128] CPU: 1 PID: 25327 Comm: btrfs Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
[496003.642129] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[496003.642130] task: ffff8fbffb8dd080 task.stack: ffff9e81c7b8c000
[496003.642149] RIP: 0010:__del_reloc_root+0x3c/0x100 [btrfs]
[496003.642151] RSP: 0018:ffff9e81c7b8fab0 EFLAGS: 00010286
[496003.642153] RAX: 0000000000000000 RBX: ffff8fb90a10a3c0 RCX: ffffca5d1fda5a5f
[496003.642154] RDX: 0000000000000001 RSI: ffff8fc05eae62c0 RDI: ffff8fbc4fd87d70
[496003.642154] RBP: ffff8fbbb5139000 R08: 0000000000000000 R09: 0000000000000000
[496003.642155] R10: ffff8fc05eae62c0 R11: 00000000000001bc R12: ffff8fc0fbeac000
[496003.642156] R13: ffff8fbc4fd87d70 R14: ffff8fbc4fd87800 R15: 00000000ffffffe4
[496003.642157] FS:  00007f64196708c0(0000) GS:ffff8fc100a40000(0000) knlGS:0000000000000000
[496003.642159] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[496003.642160] CR2: 0000000000000000 CR3: 000000069b972004 CR4: 00000000001606e0
[496003.642162] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[496003.642163] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[496003.642164] Call Trace:
[496003.642185]  free_reloc_roots+0x22/0x60 [btrfs]
[496003.642202]  merge_reloc_roots+0x184/0x260 [btrfs]
[496003.642217]  relocate_block_group+0x29a/0x610 [btrfs]
[496003.642232]  btrfs_relocate_block_group+0x17b/0x230 [btrfs]
[496003.642254]  btrfs_relocate_chunk+0x38/0xb0 [btrfs]
[496003.642272]  btrfs_balance+0xa15/0x1250 [btrfs]
[496003.642292]  btrfs_ioctl_balance+0x368/0x380 [btrfs]
[496003.642309]  btrfs_ioctl+0x1170/0x24e0 [btrfs]
[496003.642312]  ? mem_cgroup_try_charge+0x86/0x1a0
[496003.642315]  ? __handle_mm_fault+0x640/0x10e0
[496003.642318]  ? do_vfs_ioctl+0x9f/0x600
[496003.642319]  do_vfs_ioctl+0x9f/0x600
[496003.642321]  ? handle_mm_fault+0xc6/0x1b0
[496003.642325]  ? __do_page_fault+0x289/0x500
[496003.642327]  SyS_ioctl+0x74/0x80
[496003.642330]  system_call_fast_compare_end+0xc/0x6f
[496003.642332] RIP: 0033:0x7f64186f8e07
[496003.642333] RSP: 002b:00007ffcdf69d1b8 EFLAGS: 00000206
[496003.642334] Code: 8b a7 f0 01 00 00 4d 8b b4 24 40 14 00 00 4d 8d ae 70 05 00 00 4c 89 ef e8 c2 b9 3e c2 49 8b 9e 68 05 00 00 48 8b 45 00 48 85 db <48> 8b 10 75 0e e9 ad 00 00 00 48 8b 5b 10 48 85 db 74 11 48 3b
[496003.642376] RIP: __del_reloc_root+0x3c/0x100 [btrfs] RSP: ffff9e81c7b8fab0
[496003.642377] CR2: 0000000000000000
[496003.642393] ---[ end trace 6f05416539a50c4e ]---
[496003.692818] BTRFS info (device xvdb): relocating block group 10657726464 flags data
[496003.981807] BTRFS: error (device xvdc) in btrfs_run_delayed_refs:3089: errno=-28 No space left
[496003.989872] BTRFS warning (device xvdc): Skipping commit of aborted transaction.
[496003.996616] BTRFS: error (device xvdc) in cleanup_transaction:1873: errno=-28 No space left
[496004.517098] BTRFS info (device xvdb): found 7338 extents

At the time of this Oops, "btrfs balance start -dusage=40,limit=10 <fs>" was running (it had been running for about 3 hours). Disk I/O utilization was close to 100%, both before the balance started and between the time the balance started and the Oops above.

"btrfs filesystem usage" output, before rebooting:
Overall:
    Device size:		   1.37TiB
    Device allocated:		 688.04GiB
    Device unallocated:		 711.96GiB
    Device missing:		     0.00B
    Used:			 608.88GiB
    Free (estimated):		 788.31GiB	(min: 788.31GiB)
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		 512.00MiB	(used: 0.00B)

Data,single: Size:673.01GiB, Used:596.66GiB
   /dev/xvdc	 673.01GiB

Metadata,single: Size:15.00GiB, Used:12.23GiB
   /dev/xvdc	  15.00GiB

System,single: Size:32.00MiB, Used:112.00KiB
   /dev/xvdc	  32.00MiB

Unallocated:
   /dev/xvdc	 711.96GiB

Sum of /sys/fs/btrfs/<UUID>/allocation/*/disk_total (after reboot and after the interrupted balance finished) is 683GB; total device size is 1400GB, so about 49% utilization.

Responding to Shehbaz's email:

On Feb 16, 2018, at 10:34 PM, Shehbaz Jaffer <shehbazjaffer007@gmail.com> wrote:
> Could you confirm if metadata DUP is enabled for your system by
> running the following cmd:
> 
> $btrfs fi df /mnt # mount is the mount point
> Data, single: total=8.00MiB, used=64.00KiB
> System, single: total=4.00MiB, used=16.00KiB
> Metadata, single: total=168.00MiB, used=112.00KiB
> GlobalReserve, single: total=16.00MiB, used=0.00B
> 
> If metadata is single in your case as well (and not DUP), that may be
> the problem for btrfs-scrub not working effectively on the fly
> (mid-stream bit-rot correction), causing reliability issues. A couple
> of such bugs that are observed specifically for SSDs is reported here:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=198463
> https://bugzilla.kernel.org/show_bug.cgi?id=198807
> 
> These do not occur for HDD, and I believe should not occur when
> filesystem is mounted with nossd mode.

"btrfs filesystem df" output, after rebooting:

Data, single: total=668.01GiB, used=548.25GiB
System, single: total=32.00MiB, used=112.00KiB
Metadata, single: total=15.00GiB, used=9.11GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

We don't run btrfs scrubs. I'm not convinced that duplicate metadata or running scrubs would've helped here, as I'm not seeing any signs of corruption other than disk space allocation being messed up.
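
For completeness, if we did want to test the DUP-metadata suggestion, the conversion itself would just be a balance with a convert filter on this single-device filesystem, roughly:

    # sketch (untried here): convert existing metadata chunks to the DUP profile
    btrfs balance start -mconvert=dup <fs>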

Thanks,

Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-01 19:04   ` Alex Adriaanse
@ 2018-03-01 19:40     ` Nikolay Borisov
  2018-03-02 17:29       ` Liu Bo
  2018-03-02  4:02     ` Qu Wenruo
  1 sibling, 1 reply; 31+ messages in thread
From: Nikolay Borisov @ 2018-03-01 19:40 UTC (permalink / raw)
  To: Alex Adriaanse; +Cc: linux-btrfs



On  1.03.2018 21:04, Alex Adriaanse wrote:
> On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>> I would suggest changing this to eliminate the balance with '-dusage=10' (it's redundant with the '-dusage=20' one unless your filesystem is in pathologically bad shape), and adding equivalent filters for balancing metadata (which generally goes pretty fast).
>>
>> Unless you've got a huge filesystem, you can also cut down on that limit filter.  100 data chunks that are 40% full is up to 40GB of data to move on a normally sized filesystem, or potentially up to 200GB if you've got a really big filesystem (I forget what point BTRFS starts scaling up chunk sizes at, but I'm pretty sure it's in the TB range).
> 
> Thanks so much for the suggestions so far, everyone. I wanted to report back on this. Last Friday I made the following changes per suggestions from this thread:
> 
> 1. Change the nightly balance to the following:
> 
>     btrfs balance start -dusage=20 <fs>
>     btrfs balance start -dusage=40,limit=10 <fs>
>     btrfs balance start -musage=30 <fs>
> 
> 2. Upgrade kernels for all VMs to 4.14.13-1~bpo9+1, which contains the SSD space allocation fix.
> 
> 3. Boot Linux with the elevator=noop option
> 
> 4. Change /sys/block/xvd*/queue/scheduler to "none"
> 
> 5. Mount all our Btrfs filesystems with the "enospc_debug" option.

So that's good; however, you didn't apply the out-of-tree patch I pointed you at (it has already been merged into for-next, so it will likely land in 4.17). As a result, when you hit your ENOSPC error there is no extra
information being printed, so we can't really reason about what might be
going wrong in the metadata flushing algorithms.



<snip>
> [496003.641729] BTRFS: error (device xvdc) in __btrfs_free_extent:7076: errno=-28 No space left
> [496003.641994] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9332: errno=-28 No space left
> [496003.641996] BTRFS info (device xvdc): forced readonly
> [496003.641998] BTRFS: error (device xvdc) in merge_reloc_roots:2470: errno=-28 No space left
> [496003.642060] BUG: unable to handle kernel NULL pointer dereference at           (null)
> [496003.642086] IP: __del_reloc_root+0x3c/0x100 [btrfs]
> [496003.642087] PGD 80000005fe08c067 P4D 80000005fe08c067 PUD 3bd2f4067 PMD 0
> [496003.642091] Oops: 0000 [#1] SMP PTI
> [496003.642093] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
> [496003.642128] CPU: 1 PID: 25327 Comm: btrfs Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
> [496003.642129] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> [496003.642130] task: ffff8fbffb8dd080 task.stack: ffff9e81c7b8c000
> [496003.642149] RIP: 0010:__del_reloc_root+0x3c/0x100 [btrfs]


If you happen to have the vmlinux of that kernel, can you run the
following from the kernel source directory:

./scripts/faddr2line  __del_reloc_root+0x3c/0x100 vmlinux


> [496003.642151] RSP: 0018:ffff9e81c7b8fab0 EFLAGS: 00010286
> [496003.642153] RAX: 0000000000000000 RBX: ffff8fb90a10a3c0 RCX: ffffca5d1fda5a5f
> [496003.642154] RDX: 0000000000000001 RSI: ffff8fc05eae62c0 RDI: ffff8fbc4fd87d70
> [496003.642154] RBP: ffff8fbbb5139000 R08: 0000000000000000 R09: 0000000000000000
> [496003.642155] R10: ffff8fc05eae62c0 R11: 00000000000001bc R12: ffff8fc0fbeac000
> [496003.642156] R13: ffff8fbc4fd87d70 R14: ffff8fbc4fd87800 R15: 00000000ffffffe4
> [496003.642157] FS:  00007f64196708c0(0000) GS:ffff8fc100a40000(0000) knlGS:0000000000000000
> [496003.642159] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [496003.642160] CR2: 0000000000000000 CR3: 000000069b972004 CR4: 00000000001606e0
> [496003.642162] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [496003.642163] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [496003.642164] Call Trace:
> [496003.642185]  free_reloc_roots+0x22/0x60 [btrfs]
> [496003.642202]  merge_reloc_roots+0x184/0x260 [btrfs]
> [496003.642217]  relocate_block_group+0x29a/0x610 [btrfs]
> [496003.642232]  btrfs_relocate_block_group+0x17b/0x230 [btrfs]
> [496003.642254]  btrfs_relocate_chunk+0x38/0xb0 [btrfs]
> [496003.642272]  btrfs_balance+0xa15/0x1250 [btrfs]
> [496003.642292]  btrfs_ioctl_balance+0x368/0x380 [btrfs]
> [496003.642309]  btrfs_ioctl+0x1170/0x24e0 [btrfs]
> [496003.642312]  ? mem_cgroup_try_charge+0x86/0x1a0
> [496003.642315]  ? __handle_mm_fault+0x640/0x10e0
> [496003.642318]  ? do_vfs_ioctl+0x9f/0x600
> [496003.642319]  do_vfs_ioctl+0x9f/0x600
> [496003.642321]  ? handle_mm_fault+0xc6/0x1b0
> [496003.642325]  ? __do_page_fault+0x289/0x500
> [496003.642327]  SyS_ioctl+0x74/0x80
> [496003.642330]  system_call_fast_compare_end+0xc/0x6f
> [496003.642332] RIP: 0033:0x7f64186f8e07
> [496003.642333] RSP: 002b:00007ffcdf69d1b8 EFLAGS: 00000206
> [496003.642334] Code: 8b a7 f0 01 00 00 4d 8b b4 24 40 14 00 00 4d 8d ae 70 05 00 00 4c 89 ef e8 c2 b9 3e c2 49 8b 9e 68 05 00 00 48 8b 45 00 48 85 db <48> 8b 10 75 0e e9 ad 00 00 00 48 8b 5b 10 48 85 db 74 11 48 3b
> [496003.642376] RIP: __del_reloc_root+0x3c/0x100 [btrfs] RSP: ffff9e81c7b8fab0
> [496003.642377] CR2: 0000000000000000
> [496003.642393] ---[ end trace 6f05416539a50c4e ]---

This is the second time I've seen this crash in __del_reloc_root in your
logs. It seems you are hitting some yet-undiscovered bug when ENOSPC is
hit during balance. I suspect balance is not hardened enough to deal
with ENOSPC errors.
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-01 19:04   ` Alex Adriaanse
  2018-03-01 19:40     ` Nikolay Borisov
@ 2018-03-02  4:02     ` Qu Wenruo
  1 sibling, 0 replies; 31+ messages in thread
From: Qu Wenruo @ 2018-03-02  4:02 UTC (permalink / raw)
  To: Alex Adriaanse, Austin S. Hemmelgarn; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 16330 bytes --]



On 2018-03-02 03:04, Alex Adriaanse wrote:
> On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>> I would suggest changing this to eliminate the balance with '-dusage=10' (it's redundant with the '-dusage=20' one unless your filesystem is in pathologically bad shape), and adding equivalent filters for balancing metadata (which generally goes pretty fast).
>>
>> Unless you've got a huge filesystem, you can also cut down on that limit filter.  100 data chunks that are 40% full is up to 40GB of data to move on a normally sized filesystem, or potentially up to 200GB if you've got a really big filesystem (I forget what point BTRFS starts scaling up chunk sizes at, but I'm pretty sure it's in the TB range).
> 
> Thanks so much for the suggestions so far, everyone. I wanted to report back on this. Last Friday I made the following changes per suggestions from this thread:
> 
> 1. Change the nightly balance to the following:
> 
>     btrfs balance start -dusage=20 <fs>
>     btrfs balance start -dusage=40,limit=10 <fs>
>     btrfs balance start -musage=30 <fs>
> 
> 2. Upgrade kernels for all VMs to 4.14.13-1~bpo9+1, which contains the SSD space allocation fix.
> 
> 3. Boot Linux with the elevator=noop option
> 
> 4. Change /sys/block/xvd*/queue/scheduler to "none"
> 
> 5. Mount all our Btrfs filesystems with the "enospc_debug" option.
> 
> 6. I did NOT add the "nossd" flag because I didn't think it'd make much of a difference after that SSD space allocation fix.
> 
> 7. After applying the above changes, ran a full balance on all the Btrfs filesystems. I also have not experimented with autodefrag yet.
> 
> 
> Despite the changes above, we just experienced another crash this morning. Kernel message (with enospc_debug turned on for the given mountpoint):

Would you please try to use "btrfs check" to check the filesystem offline?
I'm wondering whether the extent tree or the free space cache got corrupted
and is confusing the kernel about its space allocation.


I'm not completely sure, but it may also be something wrong with the
space cache.

So either mounting it with the nospace_cache option or using "btrfs check
--clear-space-cache v1" may help.
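
Something along these lines (just a sketch; the clear has to be done while the filesystem is unmounted):

    # offline: drop the v1 free space cache so it is rebuilt on the next mount
    btrfs check --clear-space-cache v1 /dev/xvdc

    # or rule out the cache by mounting without it
    mount -o nospace_cache /dev/xvdc <fs>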

Thanks,
Qu

> 
> [496003.170278] use_block_rsv: 46 callbacks suppressed
> [496003.170279] BTRFS: block rsv returned -28
> [496003.173875] ------------[ cut here ]------------
> [496003.177186] WARNING: CPU: 2 PID: 362 at /build/linux-3RM5ap/linux-4.14.13/fs/btrfs/extent-tree.c:8458 btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [496003.185369] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
> [496003.218484] CPU: 2 PID: 362 Comm: btrfs-transacti Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
> [496003.224618] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> [496003.228702] task: ffff8fc0fb6bd0c0 task.stack: ffff9e81c3ac0000
> [496003.233081] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [496003.237220] RSP: 0018:ffff9e81c3ac3958 EFLAGS: 00010282
> [496003.241404] RAX: 000000000000001d RBX: ffff8fc0fbeac128 RCX: 0000000000000000
> [496003.248004] RDX: 0000000000000000 RSI: ffff8fc100a966f8 RDI: ffff8fc100a966f8
> [496003.253896] RBP: 0000000000004000 R08: 0000000000000001 R09: 000000000001667b
> [496003.258508] R10: 0000000000000001 R11: 000000000001667b R12: ffff8fc0fbeac000
> [496003.264759] R13: ffff8fc0fac22800 R14: 0000000000000001 R15: 00000000ffffffe4
> [496003.271203] FS:  0000000000000000(0000) GS:ffff8fc100a80000(0000) knlGS:0000000000000000
> [496003.278169] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [496003.283917] CR2: 00007efe00f36000 CR3: 0000000102a0a001 CR4: 00000000001606e0
> [496003.290309] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [496003.296985] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [496003.303335] Call Trace:
> [496003.307113]  ? __pagevec_lru_add_fn+0x270/0x270
> [496003.312126]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
> [496003.316995]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
> [496003.321568]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
> [496003.326684]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
> [496003.332724]  ? set_extent_bit+0x19/0x20 [btrfs]
> [496003.337991]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
> [496003.343436]  ? btrfs_merge_delayed_refs+0x8f/0x560 [btrfs]
> [496003.349322]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
> [496003.355157]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
> [496003.360707]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
> [496003.366022]  ? remove_wait_queue+0x60/0x60
> [496003.370898]  transaction_kthread+0x195/0x1b0 [btrfs]
> [496003.376411]  kthread+0xfc/0x130
> [496003.380741]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
> [496003.386404]  ? kthread_create_on_node+0x70/0x70
> [496003.391287]  ? do_group_exit+0x3a/0xa0
> [496003.396201]  ret_from_fork+0x1f/0x30
> [496003.400779] Code: ff 48 c7 c6 28 b7 4c c0 48 c7 c7 a0 11 52 c0 e8 2c b0 43 c2 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 38 29 4d c0 e8 90 0f ea c1 <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 6e eb
> [496003.416648] ---[ end trace 6f05416539a50c4c ]---
> [496003.422202] BTRFS: Transaction aborted (error -28)
> [496003.422227] ------------[ cut here ]------------
> [496003.427612] WARNING: CPU: 2 PID: 362 at /build/linux-3RM5ap/linux-4.14.13/fs/btrfs/extent-tree.c:7076 __btrfs_free_extent.isra.61+0xaed/0xd30 [btrfs]
> [496003.440365] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
> [496003.492382] CPU: 2 PID: 362 Comm: btrfs-transacti Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
> [496003.502085] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> [496003.508078] task: ffff8fc0fb6bd0c0 task.stack: ffff9e81c3ac0000
> [496003.514294] RIP: 0010:__btrfs_free_extent.isra.61+0xaed/0xd30 [btrfs]
> [496003.520807] RSP: 0018:ffff9e81c3ac3c30 EFLAGS: 00010286
> [496003.525146] RAX: 0000000000000026 RBX: 000000d9dc6cc000 RCX: 0000000000000000
> [496003.530148] RDX: 0000000000000000 RSI: ffff8fc100a966f8 RDI: ffff8fc100a966f8
> [496003.535193] RBP: 00000000ffffffe4 R08: 0000000000000001 R09: 00000000000166a3
> [496003.540516] R10: 0000000000000001 R11: 00000000000166a3 R12: ffff8fc0fbeac000
> [496003.545968] R13: ffff8fba3bedd930 R14: 0000000000000000 R15: 0000000000000002
> [496003.551066] FS:  0000000000000000(0000) GS:ffff8fc100a80000(0000) knlGS:0000000000000000
> [496003.557073] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [496003.561445] CR2: 00007efe00f36000 CR3: 0000000102a0a001 CR4: 00000000001606e0
> [496003.566625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [496003.571756] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [496003.576948] Call Trace:
> [496003.579607]  ? btrfs_merge_delayed_refs+0x8f/0x560 [btrfs]
> [496003.583885]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
> [496003.588099]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
> [496003.592474]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
> [496003.596766]  ? remove_wait_queue+0x60/0x60
> [496003.600288]  transaction_kthread+0x195/0x1b0 [btrfs]
> [496003.605278]  kthread+0xfc/0x130
> [496003.608768]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
> [496003.614107]  ? kthread_create_on_node+0x70/0x70
> [496003.617867]  ? do_group_exit+0x3a/0xa0
> [496003.621231]  ret_from_fork+0x1f/0x30
> [496003.624472] Code: 13 1b 0a 00 e9 39 f9 ff ff 8b 74 24 30 48 c7 c7 a8 24 4d c0 e8 a0 b7 ea c1 0f ff eb cc 89 ee 48 c7 c7 a8 24 4d c0 e8 8e b7 ea c1 <0f> ff e9 f9 f8 ff ff 8b 94 24 c0 00 00 00 48 89 c1 49 89 d8 48
> [496003.637764] ---[ end trace 6f05416539a50c4d ]---
> [496003.641729] BTRFS: error (device xvdc) in __btrfs_free_extent:7076: errno=-28 No space left
> [496003.641994] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9332: errno=-28 No space left
> [496003.641996] BTRFS info (device xvdc): forced readonly
> [496003.641998] BTRFS: error (device xvdc) in merge_reloc_roots:2470: errno=-28 No space left
> [496003.642060] BUG: unable to handle kernel NULL pointer dereference at           (null)
> [496003.642086] IP: __del_reloc_root+0x3c/0x100 [btrfs]
> [496003.642087] PGD 80000005fe08c067 P4D 80000005fe08c067 PUD 3bd2f4067 PMD 0
> [496003.642091] Oops: 0000 [#1] SMP PTI
> [496003.642093] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
> [496003.642128] CPU: 1 PID: 25327 Comm: btrfs Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
> [496003.642129] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> [496003.642130] task: ffff8fbffb8dd080 task.stack: ffff9e81c7b8c000
> [496003.642149] RIP: 0010:__del_reloc_root+0x3c/0x100 [btrfs]
> [496003.642151] RSP: 0018:ffff9e81c7b8fab0 EFLAGS: 00010286
> [496003.642153] RAX: 0000000000000000 RBX: ffff8fb90a10a3c0 RCX: ffffca5d1fda5a5f
> [496003.642154] RDX: 0000000000000001 RSI: ffff8fc05eae62c0 RDI: ffff8fbc4fd87d70
> [496003.642154] RBP: ffff8fbbb5139000 R08: 0000000000000000 R09: 0000000000000000
> [496003.642155] R10: ffff8fc05eae62c0 R11: 00000000000001bc R12: ffff8fc0fbeac000
> [496003.642156] R13: ffff8fbc4fd87d70 R14: ffff8fbc4fd87800 R15: 00000000ffffffe4
> [496003.642157] FS:  00007f64196708c0(0000) GS:ffff8fc100a40000(0000) knlGS:0000000000000000
> [496003.642159] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [496003.642160] CR2: 0000000000000000 CR3: 000000069b972004 CR4: 00000000001606e0
> [496003.642162] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [496003.642163] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [496003.642164] Call Trace:
> [496003.642185]  free_reloc_roots+0x22/0x60 [btrfs]
> [496003.642202]  merge_reloc_roots+0x184/0x260 [btrfs]
> [496003.642217]  relocate_block_group+0x29a/0x610 [btrfs]
> [496003.642232]  btrfs_relocate_block_group+0x17b/0x230 [btrfs]
> [496003.642254]  btrfs_relocate_chunk+0x38/0xb0 [btrfs]
> [496003.642272]  btrfs_balance+0xa15/0x1250 [btrfs]
> [496003.642292]  btrfs_ioctl_balance+0x368/0x380 [btrfs]
> [496003.642309]  btrfs_ioctl+0x1170/0x24e0 [btrfs]
> [496003.642312]  ? mem_cgroup_try_charge+0x86/0x1a0
> [496003.642315]  ? __handle_mm_fault+0x640/0x10e0
> [496003.642318]  ? do_vfs_ioctl+0x9f/0x600
> [496003.642319]  do_vfs_ioctl+0x9f/0x600
> [496003.642321]  ? handle_mm_fault+0xc6/0x1b0
> [496003.642325]  ? __do_page_fault+0x289/0x500
> [496003.642327]  SyS_ioctl+0x74/0x80
> [496003.642330]  system_call_fast_compare_end+0xc/0x6f
> [496003.642332] RIP: 0033:0x7f64186f8e07
> [496003.642333] RSP: 002b:00007ffcdf69d1b8 EFLAGS: 00000206
> [496003.642334] Code: 8b a7 f0 01 00 00 4d 8b b4 24 40 14 00 00 4d 8d ae 70 05 00 00 4c 89 ef e8 c2 b9 3e c2 49 8b 9e 68 05 00 00 48 8b 45 00 48 85 db <48> 8b 10 75 0e e9 ad 00 00 00 48 8b 5b 10 48 85 db 74 11 48 3b
> [496003.642376] RIP: __del_reloc_root+0x3c/0x100 [btrfs] RSP: ffff9e81c7b8fab0
> [496003.642377] CR2: 0000000000000000
> [496003.642393] ---[ end trace 6f05416539a50c4e ]---
> [496003.692818] BTRFS info (device xvdb): relocating block group 10657726464 flags data
> [496003.981807] BTRFS: error (device xvdc) in btrfs_run_delayed_refs:3089: errno=-28 No space left
> [496003.989872] BTRFS warning (device xvdc): Skipping commit of aborted transaction.
> [496003.996616] BTRFS: error (device xvdc) in cleanup_transaction:1873: errno=-28 No space left
> [496004.517098] BTRFS info (device xvdb): found 7338 extents
> 
> At the time of this Oops "btrfs balance start -dusage 40 limit 10 <fs>" was running (it had been running for about 3 hours). Disk I/O utilization was close to 100%, both before the balance started, as well as between the time the balance started and the Oops above.
> 
> "btrfs filesystem usage" output, before rebooting:
> Overall:
>     Device size:		   1.37TiB
>     Device allocated:		 688.04GiB
>     Device unallocated:		 711.96GiB
>     Device missing:		     0.00B
>     Used:			 608.88GiB
>     Free (estimated):		 788.31GiB	(min: 788.31GiB)
>     Data ratio:			      1.00
>     Metadata ratio:		      1.00
>     Global reserve:		 512.00MiB	(used: 0.00B)
> 
> Data,single: Size:673.01GiB, Used:596.66GiB
>    /dev/xvdc	 673.01GiB
> 
> Metadata,single: Size:15.00GiB, Used:12.23GiB
>    /dev/xvdc	  15.00GiB
> 
> System,single: Size:32.00MiB, Used:112.00KiB
>    /dev/xvdc	  32.00MiB
> 
> Unallocated:
>    /dev/xvdc	 711.96GiB
> 
> Sum of /sys/fs/btrfs/<UUID/allocation/*/disk_total (after reboot and after an interrupted balance finished) is 683GB; total device size is 1400GB, so about 49% utilization.
> 
> Responding to Shehbaz's email:
> 
> On Feb 16, 2018, at 10:34 PM, Shehbaz Jaffer <shehbazjaffer007@gmail.com> wrote:
>> Could you confirm if metadata DUP is enabled for your system by
>> running the following cmd:
>>
>> $btrfs fi df /mnt # mount is the mount point
>> Data, single: total=8.00MiB, used=64.00KiB
>> System, single: total=4.00MiB, used=16.00KiB
>> Metadata, single: total=168.00MiB, used=112.00KiB
>> GlobalReserve, single: total=16.00MiB, used=0.00B
>>
>> If metadata is single in your case as well (and not DUP), that may be
>> the problem for btrfs-scrub not working effectively on the fly
>> (mid-stream bit-rot correction), causing reliability issues. A couple
>> of such bugs that are observed specifically for SSDs is reported here:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=198463
>> https://bugzilla.kernel.org/show_bug.cgi?id=198807
>>
>> These do not occur for HDD, and I believe should not occur when
>> filesystem is mounted with nossd mode.
> 
> "btrfs filesystem df" output, after rebooting:
> 
> Data, single: total=668.01GiB, used=548.25GiB
> System, single: total=32.00MiB, used=112.00KiB
> Metadata, single: total=15.00GiB, used=9.11GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> We don't run btrfs scrubs. I'm not convinced that duplicate metadata or running scrubs would've helped here, as I'm not experiencing any other signs of corruption other than disk space allocation being messed up.
> 
> Thanks,
> 
> Alex
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-01 19:40     ` Nikolay Borisov
@ 2018-03-02 17:29       ` Liu Bo
  2018-03-08 17:40         ` Alex Adriaanse
  0 siblings, 1 reply; 31+ messages in thread
From: Liu Bo @ 2018-03-02 17:29 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Alex Adriaanse, linux-btrfs

On Thu, Mar 01, 2018 at 09:40:41PM +0200, Nikolay Borisov wrote:
> 
> 
> On  1.03.2018 21:04, Alex Adriaanse wrote:
> > On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
...
> <snip>
> > [496003.641729] BTRFS: error (device xvdc) in __btrfs_free_extent:7076: errno=-28 No space left
> > [496003.641994] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9332: errno=-28 No space left
> > [496003.641996] BTRFS info (device xvdc): forced readonly
> > [496003.641998] BTRFS: error (device xvdc) in merge_reloc_roots:2470: errno=-28 No space left
> > [496003.642060] BUG: unable to handle kernel NULL pointer dereference at           (null)
> > [496003.642086] IP: __del_reloc_root+0x3c/0x100 [btrfs]
> > [496003.642087] PGD 80000005fe08c067 P4D 80000005fe08c067 PUD 3bd2f4067 PMD 0
> > [496003.642091] Oops: 0000 [#1] SMP PTI
> > [496003.642093] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
> > [496003.642128] CPU: 1 PID: 25327 Comm: btrfs Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
> > [496003.642129] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> > [496003.642130] task: ffff8fbffb8dd080 task.stack: ffff9e81c7b8c000
> > [496003.642149] RIP: 0010:__del_reloc_root+0x3c/0x100 [btrfs]
> 
> 
> if you happen to have the vmlinux of that kernel can you run the
> following from the kernel source directory:
> 
> ./scripts/faddr2line  __del_reloc_root+0x3c/0x100 vmlinux
>

I thought this was fixed by bb166d7 btrfs: fix NULL pointer dereference from free_reloc_roots(),
Alex, do you mind checking if it's included in your kernel?

You can also check whether the following change is merged into the Debian kernel source.

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3a49a3c..9841fae 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2400,11 +2400,11 @@ void free_reloc_roots(struct list_head *list)
        while (!list_empty(list)) {
                reloc_root = list_entry(list->next, struct btrfs_root,
                                        root_list);
+               __del_reloc_root(reloc_root);
                free_extent_buffer(reloc_root->node);
                free_extent_buffer(reloc_root->commit_root);
                reloc_root->node = NULL;
                reloc_root->commit_root = NULL;
-               __del_reloc_root(reloc_root);
        }
 }
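
One way to check (a sketch, assuming an upstream git tree and the unpacked Debian kernel source; "bb166d7" is the abbreviated upstream commit id):

    # upstream tree: is the fix an ancestor of the tag this kernel is based on?
    git merge-base --is-ancestor bb166d7 v4.14.13 && echo "fix included"

    # Debian kernel source: __del_reloc_root() should be called before the
    # extent buffers are freed in free_reloc_roots()
    grep -n -A5 'root_list);' fs/btrfs/relocation.c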


Thanks,

-liubo

> 
> > [496003.642151] RSP: 0018:ffff9e81c7b8fab0 EFLAGS: 00010286
> > [496003.642153] RAX: 0000000000000000 RBX: ffff8fb90a10a3c0 RCX: ffffca5d1fda5a5f
> > [496003.642154] RDX: 0000000000000001 RSI: ffff8fc05eae62c0 RDI: ffff8fbc4fd87d70
> > [496003.642154] RBP: ffff8fbbb5139000 R08: 0000000000000000 R09: 0000000000000000
> > [496003.642155] R10: ffff8fc05eae62c0 R11: 00000000000001bc R12: ffff8fc0fbeac000
> > [496003.642156] R13: ffff8fbc4fd87d70 R14: ffff8fbc4fd87800 R15: 00000000ffffffe4
> > [496003.642157] FS:  00007f64196708c0(0000) GS:ffff8fc100a40000(0000) knlGS:0000000000000000
> > [496003.642159] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [496003.642160] CR2: 0000000000000000 CR3: 000000069b972004 CR4: 00000000001606e0
> > [496003.642162] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [496003.642163] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [496003.642164] Call Trace:
> > [496003.642185]  free_reloc_roots+0x22/0x60 [btrfs]
> > [496003.642202]  merge_reloc_roots+0x184/0x260 [btrfs]
> > [496003.642217]  relocate_block_group+0x29a/0x610 [btrfs]
> > [496003.642232]  btrfs_relocate_block_group+0x17b/0x230 [btrfs]
> > [496003.642254]  btrfs_relocate_chunk+0x38/0xb0 [btrfs]
> > [496003.642272]  btrfs_balance+0xa15/0x1250 [btrfs]
> > [496003.642292]  btrfs_ioctl_balance+0x368/0x380 [btrfs]
> > [496003.642309]  btrfs_ioctl+0x1170/0x24e0 [btrfs]
> > [496003.642312]  ? mem_cgroup_try_charge+0x86/0x1a0
> > [496003.642315]  ? __handle_mm_fault+0x640/0x10e0
> > [496003.642318]  ? do_vfs_ioctl+0x9f/0x600
> > [496003.642319]  do_vfs_ioctl+0x9f/0x600
> > [496003.642321]  ? handle_mm_fault+0xc6/0x1b0
> > [496003.642325]  ? __do_page_fault+0x289/0x500
> > [496003.642327]  SyS_ioctl+0x74/0x80
> > [496003.642330]  system_call_fast_compare_end+0xc/0x6f
> > [496003.642332] RIP: 0033:0x7f64186f8e07
> > [496003.642333] RSP: 002b:00007ffcdf69d1b8 EFLAGS: 00000206
> > [496003.642334] Code: 8b a7 f0 01 00 00 4d 8b b4 24 40 14 00 00 4d 8d ae 70 05 00 00 4c 89 ef e8 c2 b9 3e c2 49 8b 9e 68 05 00 00 48 8b 45 00 48 85 db <48> 8b 10 75 0e e9 ad 00 00 00 48 8b 5b 10 48 85 db 74 11 48 3b
> > [496003.642376] RIP: __del_reloc_root+0x3c/0x100 [btrfs] RSP: ffff9e81c7b8fab0
> > [496003.642377] CR2: 0000000000000000
> > [496003.642393] ---[ end trace 6f05416539a50c4e ]---
> 

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-02 17:29       ` Liu Bo
@ 2018-03-08 17:40         ` Alex Adriaanse
  2018-03-09  9:54           ` Nikolay Borisov
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Adriaanse @ 2018-03-08 17:40 UTC (permalink / raw)
  To: bo.li.liu, Nikolay Borisov, Qu Wenruo; +Cc: linux-btrfs

On Mar 2, 2018, at 11:29 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Thu, Mar 01, 2018 at 09:40:41PM +0200, Nikolay Borisov wrote:
>> On  1.03.2018 21:04, Alex Adriaanse wrote:
>>> Thanks so much for the suggestions so far, everyone. I wanted to report back on this. Last Friday I made the following changes per suggestions from this thread:
>>> 
>>> 1. Change the nightly balance to the following:
>>> 
>>>    btrfs balance start -dusage=20 <fs>
>>>    btrfs balance start -dusage=40,limit=10 <fs>
>>>    btrfs balance start -musage=30 <fs>
>>> 
>>> 2. Upgrade kernels for all VMs to 4.14.13-1~bpo9+1, which contains the SSD space allocation fix.
>>> 
>>> 3. Boot Linux with the elevator=noop option
>>> 
>>> 4. Change /sys/block/xvd*/queue/scheduler to "none"
>>> 
>>> 5. Mount all our Btrfs filesystems with the "enospc_debug" option.
>> 
>> SO that's good, however you didn't apply the out of tree patch (it has
>> already been merged into the for-next so will likely land in 4.17) I
>> pointed you at. As a result when you your ENOSPC error there is no extra
>> information being printed so we can't really reason about what might be
>> going wrong in the metadata flushing algorithms.

Sorry, I clearly missed that one. I have applied the patch you referenced and rebooted the VM in question. This morning we had another FS failure on the same machine that caused it to go into readonly mode. This happened after that device was experiencing 100% I/O utilization for some time. No balance was running at the time; last balance finished about 6 hours prior to the error.

Kernel messages:
[211238.262683] use_block_rsv: 163 callbacks suppressed
[211238.262683] BTRFS: block rsv returned -28
[211238.266718] ------------[ cut here ]------------
[211238.270462] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:8463 btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[211238.277203] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse drm ena scsi_mod i2c_piix4 button
[211238.319618] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: G        W       4.14.13 #3
[211238.325479] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[211238.330742] task: ffff9cb43abb70c0 task.stack: ffffb234c3b58000
[211238.335575] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[211238.340454] RSP: 0018:ffffb234c3b5b958 EFLAGS: 00010282
[211238.344782] RAX: 000000000000001d RBX: ffff9cb43bdea128 RCX: 0000000000000000
[211238.350562] RDX: 0000000000000000 RSI: ffff9cb440a166f8 RDI: ffff9cb440a166f8
[211238.356066] RBP: 0000000000004000 R08: 0000000000000001 R09: 0000000000007d81
[211238.361649] R10: 0000000000000001 R11: 0000000000007d81 R12: ffff9cb43bdea000
[211238.367304] R13: ffff9cb437f2c800 R14: 0000000000000001 R15: 00000000ffffffe4
[211238.372658] FS:  0000000000000000(0000) GS:ffff9cb440a00000(0000) knlGS:0000000000000000
[211238.379048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[211238.384681] CR2: 00007f90a6677000 CR3: 00000003cea0a006 CR4: 00000000001606f0
[211238.391380] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[211238.398050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[211238.404730] Call Trace:
[211238.407880]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
[211238.412455]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
[211238.416292]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
[211238.420630]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
[211238.425215]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
[211238.429663]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
[211238.434077]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
[211238.438541]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[211238.442899]  ? remove_wait_queue+0x60/0x60
[211238.446503]  transaction_kthread+0x195/0x1b0 [btrfs]
[211238.450578]  kthread+0xfc/0x130
[211238.453924]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[211238.458381]  ? kthread_create_on_node+0x70/0x70
[211238.462225]  ? do_group_exit+0x3a/0xa0
[211238.465586]  ret_from_fork+0x1f/0x30
[211238.468814] Code: ff 48 c7 c6 28 97 58 c0 48 c7 c7 a0 e1 5d c0 e8 0c d0 f7 d5 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 58 0c 59 c0 e8 70 2f 9e d5 <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
[211238.482366] ---[ end trace 48dd1ab4e2e46f6e ]---
[211238.486524] BTRFS info (device xvdc): space_info 4 has 18446744073258958848 free, is not full
[211238.493014] BTRFS info (device xvdc): space_info total=10737418240, used=7828127744, pinned=2128166912, reserved=243367936, may_use=988282880, readonly=65536
[211238.503799] BTRFS: Transaction aborted (error -28)
[211238.503824] ------------[ cut here ]------------
[211238.507828] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:7081 __btrfs_free_extent.isra.61+0xaed/0xd30 [btrfs]
[211238.514371] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse drm ena scsi_mod i2c_piix4 button
[211238.552170] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: G        W       4.14.13 #3
[211238.557481] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[211238.562201] task: ffff9cb43abb70c0 task.stack: ffffb234c3b58000
[211238.566531] RIP: 0010:__btrfs_free_extent.isra.61+0xaed/0xd30 [btrfs]
[211238.571114] RSP: 0018:ffffb234c3b5bc30 EFLAGS: 00010286
[211238.574984] RAX: 0000000000000026 RBX: 000000ffb55b9000 RCX: 0000000000000006
[211238.579691] RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff9cb440a166f0
[211238.584312] RBP: 00000000ffffffe4 R08: 0000000000000001 R09: 0000000000007da8
[211238.589144] R10: 0000000000000001 R11: 0000000000007da8 R12: ffff9cb43bdea000
[211238.594002] R13: ffff9caca472ed90 R14: 0000000000000000 R15: 000000000000a872
[211238.598817] FS:  0000000000000000(0000) GS:ffff9cb440a00000(0000) knlGS:0000000000000000
[211238.604230] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[211238.608419] CR2: 00007f90a6677000 CR3: 00000003cea0a006 CR4: 00000000001606f0
[211238.613183] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[211238.617894] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[211238.622673] Call Trace:
[211238.625288]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
[211238.629399]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
[211238.633234]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[211238.637135]  ? remove_wait_queue+0x60/0x60
[211238.640505]  transaction_kthread+0x195/0x1b0 [btrfs]
[211238.644313]  kthread+0xfc/0x130
[211238.647251]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[211238.651407]  ? kthread_create_on_node+0x70/0x70
[211238.654830]  ? do_group_exit+0x3a/0xa0
[211238.657961]  ret_from_fork+0x1f/0x30
[211238.661090] Code: 33 1b 0a 00 e9 39 f9 ff ff 8b 74 24 30 48 c7 c7 c8 07 59 c0 e8 a0 d7 9e d5 0f ff eb cc 89 ee 48 c7 c7 c8 07 59 c0 e8 8e d7 9e d5 <0f> ff e9 f9 f8 ff ff 8b 94 24 c0 00 00 00 48 89 c1 49 89 d8 48
[211238.672818] ---[ end trace 48dd1ab4e2e46f6f ]---
[211238.676465] BTRFS: error (device xvdc) in __btrfs_free_extent:7081: errno=-28 No space left
[211238.676713] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9337: errno=-28 No space left
[211238.676716] BTRFS info (device xvdc): forced readonly
[211238.694217] BTRFS: error (device xvdc) in btrfs_run_delayed_refs:3089: errno=-28 No space left
[211238.700660] BTRFS warning (device xvdc): Skipping commit of aborted transaction.
[211238.706303] BTRFS: error (device xvdc) in cleanup_transaction:1873: errno=-28 No space left
...
[211243.218453] BTRFS error (device xvdc): pending csums is 4448256


btrfs filesystem usage:

Overall:
    Device size:		   1.37TiB
    Device allocated:		 656.04GiB
    Device unallocated:		 743.96GiB
    Device missing:		     0.00B
    Used:			 630.21GiB
    Free (estimated):		 767.05GiB	(min: 767.05GiB)
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		 512.00MiB	(used: 0.00B)

Data,single: Size:646.01GiB, Used:622.92GiB
   /dev/xvdc	 646.01GiB

Metadata,single: Size:10.00GiB, Used:7.29GiB
   /dev/xvdc	  10.00GiB

System,single: Size:32.00MiB, Used:112.00KiB
   /dev/xvdc	  32.00MiB

Unallocated:
   /dev/xvdc	 743.96GiB

>> <snip>
>>> [496003.641729] BTRFS: error (device xvdc) in __btrfs_free_extent:7076: errno=-28 No space left
>>> [496003.641994] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9332: errno=-28 No space left
>>> [496003.641996] BTRFS info (device xvdc): forced readonly
>>> [496003.641998] BTRFS: error (device xvdc) in merge_reloc_roots:2470: errno=-28 No space left
>>> [496003.642060] BUG: unable to handle kernel NULL pointer dereference at           (null)
>>> [496003.642086] IP: __del_reloc_root+0x3c/0x100 [btrfs]
>>> [496003.642087] PGD 80000005fe08c067 P4D 80000005fe08c067 PUD 3bd2f4067 PMD 0
>>> [496003.642091] Oops: 0000 [#1] SMP PTI
>>> [496003.642093] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
>>> [496003.642128] CPU: 1 PID: 25327 Comm: btrfs Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
>>> [496003.642129] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
>>> [496003.642130] task: ffff8fbffb8dd080 task.stack: ffff9e81c7b8c000
>>> [496003.642149] RIP: 0010:__del_reloc_root+0x3c/0x100 [btrfs]
>> 
>> 
>> if you happen to have the vmlinux of that kernel can you run the
>> following from the kernel source directory:
>> 
>> ./scripts/faddr2line  __del_reloc_root+0x3c/0x100 vmlinux
>> 

I used extract-vmlinux on the Debian-provided kernel, which is what was running previously, then ran the faddr2line command (with the filename first), but it couldn't find a match. I also did the same thing for the current kernel (for which I have a full vmlinux file available) with the function/offset from today's message, and still, no match.

$ scripts/faddr2line vmlinux btrfs_alloc_tree_block+0x39b/0x4c0
no match for btrfs_alloc_tree_block+0x39b/0x4c0


> I thought this was fixed by bb166d7 btrfs: fix NULL pointer dereference from free_reloc_roots(),
> Alex, do you mind checking if it's included in your kernel?
> 
> You can also check if the following change is merged in kernel-src deb.
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 3a49a3c..9841fae 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -2400,11 +2400,11 @@ void free_reloc_roots(struct list_head *list)
>        while (!list_empty(list)) {
>                reloc_root = list_entry(list->next, struct btrfs_root,
>                                        root_list);
> +               __del_reloc_root(reloc_root);
>                free_extent_buffer(reloc_root->node);
>                free_extent_buffer(reloc_root->commit_root);
>                reloc_root->node = NULL;
>                reloc_root->commit_root = NULL;
> -               __del_reloc_root(reloc_root);
>        }
> }
> 


I can confirm that this commit has indeed been applied both to the kernel we were running this morning that errored out, and to the stock Debian linux-image-4.14.0-0.bpo.3-amd64 package that we were running when last week's error happened.


On Mar 1, 2018, at 10:02 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> Would you please try to use "btrfs check" to check the filesystem offline?
> I'm wondering if it's extent tree or free space cache get corrupted and
> makes kernel confused about its space allocation.
> 
> 
> I'm not completely sure, but it may also be something wrong with the
> space cache.
> 
> So either mount it with nospace_cache option of use "btrfs check
> --clear-space-cache v1" may help.


Running btrfs check some time after the VM had been rebooted and FS had been remounted (and then unmounted), I didn't see any btrfs check errors. However, running btrfs check today immediately after rebooting the VM, before remounting the FS, yielded the following output:

# btrfs check /dev/xvdc
Checking filesystem on /dev/xvdc
UUID: 2dd35565-641a-40f7-b29d-916c5c7e3ffe
checking extents
parent transid verify failed on 929281753088 wanted 202268 found 202782
parent transid verify failed on 929281753088 wanted 202268 found 202782
Ignoring transid failure
leaf parent key incorrect 929281753088
bad block 929281753088
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
checking csums
checking root refs
Transid errors in file system
found 676316246080 bytes used err is 1
total csum bytes: 652799620
total tree bytes: 7539572736
total fs tree bytes: 3933503488
total extent tree bytes: 2580004864
btree space waste bytes: 1502581682
file data blocks allocated: 4644964134912
 referenced 1503039082496


Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-08 17:40         ` Alex Adriaanse
@ 2018-03-09  9:54           ` Nikolay Borisov
  2018-03-09 19:05             ` Alex Adriaanse
  0 siblings, 1 reply; 31+ messages in thread
From: Nikolay Borisov @ 2018-03-09  9:54 UTC (permalink / raw)
  To: Alex Adriaanse, bo.li.liu, Qu Wenruo; +Cc: linux-btrfs

<snip>
> Sorry, I clearly missed that one. I have applied the patch you referenced and rebooted the VM in question. This morning we had another FS failure on the same machine that caused it to go into readonly mode. This happened after that device was experiencing 100% I/O utilization for some time. No balance was running at the time; last balance finished about 6 hours prior to the error.
> 
> Kernel messages:
> [211238.262683] use_block_rsv: 163 callbacks suppressed
> [211238.262683] BTRFS: block rsv returned -28
> [211238.266718] ------------[ cut here ]------------
> [211238.270462] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:8463 btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [211238.277203] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse drm ena scsi_mod i2c_piix4 button
> [211238.319618] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: G        W       4.14.13 #3
> [211238.325479] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> [211238.330742] task: ffff9cb43abb70c0 task.stack: ffffb234c3b58000
> [211238.335575] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [211238.340454] RSP: 0018:ffffb234c3b5b958 EFLAGS: 00010282
> [211238.344782] RAX: 000000000000001d RBX: ffff9cb43bdea128 RCX: 0000000000000000
> [211238.350562] RDX: 0000000000000000 RSI: ffff9cb440a166f8 RDI: ffff9cb440a166f8
> [211238.356066] RBP: 0000000000004000 R08: 0000000000000001 R09: 0000000000007d81
> [211238.361649] R10: 0000000000000001 R11: 0000000000007d81 R12: ffff9cb43bdea000
> [211238.367304] R13: ffff9cb437f2c800 R14: 0000000000000001 R15: 00000000ffffffe4
> [211238.372658] FS:  0000000000000000(0000) GS:ffff9cb440a00000(0000) knlGS:0000000000000000
> [211238.379048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [211238.384681] CR2: 00007f90a6677000 CR3: 00000003cea0a006 CR4: 00000000001606f0
> [211238.391380] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [211238.398050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [211238.404730] Call Trace:
> [211238.407880]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
> [211238.412455]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
> [211238.416292]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
> [211238.420630]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
> [211238.425215]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
> [211238.429663]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
> [211238.434077]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
> [211238.438541]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
> [211238.442899]  ? remove_wait_queue+0x60/0x60
> [211238.446503]  transaction_kthread+0x195/0x1b0 [btrfs]
> [211238.450578]  kthread+0xfc/0x130
> [211238.453924]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
> [211238.458381]  ? kthread_create_on_node+0x70/0x70
> [211238.462225]  ? do_group_exit+0x3a/0xa0
> [211238.465586]  ret_from_fork+0x1f/0x30
> [211238.468814] Code: ff 48 c7 c6 28 97 58 c0 48 c7 c7 a0 e1 5d c0 e8 0c d0 f7 d5 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 58 0c 59 c0 e8 70 2f 9e d5 <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
> [211238.482366] ---[ end trace 48dd1ab4e2e46f6e ]---
> [211238.486524] BTRFS info (device xvdc): space_info 4 has 18446744073258958848 free, is not full
> [211238.493014] BTRFS info (device xvdc): space_info total=10737418240, used=7828127744, pinned=2128166912, reserved=243367936, may_use=988282880, readonly=65536

Ok, so the numbers here are helpful; they show that we have enough space
to allocate a chunk. I've also looked at the logic in 4.14.13 and all
the necessary patches are there. Unfortunately none of this matters,
because reserve_metadata_bytes is being called with
BTRFS_RESERVE_NO_FLUSH from use_block_rsv, meaning the code won't make
any effort to flush anything at all.

Can you tell us again what the workload is - is it some sort of database,
constantly rewriting its files? If so, btrfs is not really suited to
rewrite-heavy workloads, since they cause excessive CoW. In such cases
you really ought to set nodatacow on the affected files. For more
information:

https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F
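
A sketch of what I mean (the path is just an example; note that +C only takes effect for newly created or empty files, so existing files have to be re-created or copied back in):

    # mark the directory so files created in it inherit No_COW, then verify
    chattr +C /path/to/database/dir
    lsattr -d /path/to/database/dir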

The other thing that comes to mind is to try tuning the default commit
interval. Currently this is 30 seconds, meaning a transaction will
happen roughly every 30 seconds (unless there is enough data batched
that it should happen "NOW", where "NOW" is defined as "during the life
cycle of some arbitrary operation"). Perhaps in your case you could
lower it to, say, 15 seconds. This will likely have a performance
impact, but it should reduce premature ENOSPC errors since there should
be a lot less data in each transaction.
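
For example (a sketch; commit= is the standard btrfs mount option for this):

    # lower the transaction commit interval to 15 seconds on a live mount
    mount -o remount,commit=15 <fs>

    # or make it persistent via fstab, e.g.
    # /dev/xvdc  <fs>  btrfs  compress=lzo,commit=15  0  0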

In the meantime I will investigate what can be done when we get
ENOSPC from NO_FLUSH contexts; my feeling is that we should be able to
force-allocate a new chunk, but this will take some time to verify.

<snip>

> 
> 
> btrfs filesystem usage:
> 
> Overall:
>     Device size:		   1.37TiB
>     Device allocated:		 656.04GiB
>     Device unallocated:		 743.96GiB
>     Device missing:		     0.00B
>     Used:			 630.21GiB
>     Free (estimated):		 767.05GiB	(min: 767.05GiB)
>     Data ratio:			      1.00
>     Metadata ratio:		      1.00
>     Global reserve:		 512.00MiB	(used: 0.00B)
> 
> Data,single: Size:646.01GiB, Used:622.92GiB
>    /dev/xvdc	 646.01GiB
> 
> Metadata,single: Size:10.00GiB, Used:7.29GiB
>    /dev/xvdc	  10.00GiB
> 
> System,single: Size:32.00MiB, Used:112.00KiB
>    /dev/xvdc	  32.00MiB
> 
> Unallocated:
>    /dev/xvdc	 743.96GiB
> 
>>> <snip>
>>>> [496003.641729] BTRFS: error (device xvdc) in __btrfs_free_extent:7076: errno=-28 No space left
>>>> [496003.641994] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9332: errno=-28 No space left
>>>> [496003.641996] BTRFS info (device xvdc): forced readonly
>>>> [496003.641998] BTRFS: error (device xvdc) in merge_reloc_roots:2470: errno=-28 No space left
>>>> [496003.642060] BUG: unable to handle kernel NULL pointer dereference at           (null)
>>>> [496003.642086] IP: __del_reloc_root+0x3c/0x100 [btrfs]
>>>> [496003.642087] PGD 80000005fe08c067 P4D 80000005fe08c067 PUD 3bd2f4067 PMD 0
>>>> [496003.642091] Oops: 0000 [#1] SMP PTI
>>>> [496003.642093] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
>>>> [496003.642128] CPU: 1 PID: 25327 Comm: btrfs Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
>>>> [496003.642129] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
>>>> [496003.642130] task: ffff8fbffb8dd080 task.stack: ffff9e81c7b8c000
>>>> [496003.642149] RIP: 0010:__del_reloc_root+0x3c/0x100 [btrfs]
>>>
>>>
>>> if you happen to have the vmlinux of that kernel can you run the
>>> following from the kernel source directory:
>>>
>>> ./scripts/faddr2line  __del_reloc_root+0x3c/0x100 vmlinux
>>>
> 
> I used extract-vmlinux on the Debian-provided kernel, which is what was running previously, then ran the faddr2line command (with the filename first), but it couldn't find a match. I also did the same thing for the current kernel (for which I have a full vmlinux file available) with the function/offset from today's message, and still, no match.
> 
> $ scripts/faddr2line vmlinux btrfs_alloc_tree_block+0x39b/0x4c0
> no match for btrfs_alloc_tree_block+0x39b/0x4c0
> 
> 
>> I thought this was fixed by bb166d7 btrfs: fix NULL pointer dereference from free_reloc_roots(),
>> Alex, do you mind checking if it's included in your kernel?
>>
>> You can also check if the following change is merged in kernel-src deb.
>>
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index 3a49a3c..9841fae 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -2400,11 +2400,11 @@ void free_reloc_roots(struct list_head *list)
>>        while (!list_empty(list)) {
>>                reloc_root = list_entry(list->next, struct btrfs_root,
>>                                        root_list);
>> +               __del_reloc_root(reloc_root);
>>                free_extent_buffer(reloc_root->node);
>>                free_extent_buffer(reloc_root->commit_root);
>>                reloc_root->node = NULL;
>>                reloc_root->commit_root = NULL;
>> -               __del_reloc_root(reloc_root);
>>        }
>> }
>>
> 
> 
> I can confirm that this commit has indeed been applied to both the kernel we were running this morning that error'ed out, as well as the stock Debian linux-image-4.14.0-0.bpo.3-amd64 package that we were running when last week's error happened.
> 
> 
> On Mar 1, 2018, at 10:02 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> Would you please try to use "btrfs check" to check the filesystem offline?
>> I'm wondering if it's extent tree or free space cache get corrupted and
>> makes kernel confused about its space allocation.
>>
>>
>> I'm not completely sure, but it may also be something wrong with the
>> space cache.
>>
>> So either mount it with nospace_cache option of use "btrfs check
>> --clear-space-cache v1" may help.
> 
> 
> Running btrfs check some time after the VM had been rebooted and FS had been remounted (and then unmounted), I didn't see any btrfs check errors. However, running btrfs check today immediately after rebooting the VM, before remounting the FS, yielded the following output:
> 
> # btrfs check /dev/xvdc
> Checking filesystem on /dev/xvdc
> UUID: 2dd35565-641a-40f7-b29d-916c5c7e3ffe
> checking extents
> parent transid verify failed on 929281753088 wanted 202268 found 202782
> parent transid verify failed on 929281753088 wanted 202268 found 202782
> Ignoring transid failure
> leaf parent key incorrect 929281753088
> bad block 929281753088
> Errors found in extent allocation tree or chunk allocation
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> Transid errors in file system
> found 676316246080 bytes used err is 1
> total csum bytes: 652799620
> total tree bytes: 7539572736
> total fs tree bytes: 3933503488
> total extent tree bytes: 2580004864
> btree space waste bytes: 1502581682
> file data blocks allocated: 4644964134912
>  referenced 1503039082496
> 
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-09  9:54           ` Nikolay Borisov
@ 2018-03-09 19:05             ` Alex Adriaanse
  2018-03-10 12:04               ` Nikolay Borisov
  2018-03-13 13:47               ` Patrik Lundquist
  0 siblings, 2 replies; 31+ messages in thread
From: Alex Adriaanse @ 2018-03-09 19:05 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Mar 9, 2018, at 3:54 AM, Nikolay Borisov <nborisov@suse.com> wrote:
> <snip>
>> Sorry, I clearly missed that one. I have applied the patch you referenced and rebooted the VM in question. This morning we had another FS failure on the same machine that caused it to go into readonly mode. This happened after that device was experiencing 100% I/O utilization for some time. No balance was running at the time; last balance finished about 6 hours prior to the error.
>> 
>> Kernel messages:
>> [211238.262683] use_block_rsv: 163 callbacks suppressed
>> [211238.262683] BTRFS: block rsv returned -28
>> [211238.266718] ------------[ cut here ]------------
>> [211238.270462] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:8463 btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>> [211238.277203] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse drm ena scsi_mod i2c_piix4 button
>> [211238.319618] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: G        W       4.14.13 #3
>> [211238.325479] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
>> [211238.330742] task: ffff9cb43abb70c0 task.stack: ffffb234c3b58000
>> [211238.335575] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>> [211238.340454] RSP: 0018:ffffb234c3b5b958 EFLAGS: 00010282
>> [211238.344782] RAX: 000000000000001d RBX: ffff9cb43bdea128 RCX: 0000000000000000
>> [211238.350562] RDX: 0000000000000000 RSI: ffff9cb440a166f8 RDI: ffff9cb440a166f8
>> [211238.356066] RBP: 0000000000004000 R08: 0000000000000001 R09: 0000000000007d81
>> [211238.361649] R10: 0000000000000001 R11: 0000000000007d81 R12: ffff9cb43bdea000
>> [211238.367304] R13: ffff9cb437f2c800 R14: 0000000000000001 R15: 00000000ffffffe4
>> [211238.372658] FS:  0000000000000000(0000) GS:ffff9cb440a00000(0000) knlGS:0000000000000000
>> [211238.379048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [211238.384681] CR2: 00007f90a6677000 CR3: 00000003cea0a006 CR4: 00000000001606f0
>> [211238.391380] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [211238.398050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [211238.404730] Call Trace:
>> [211238.407880]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
>> [211238.412455]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
>> [211238.416292]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
>> [211238.420630]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
>> [211238.425215]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
>> [211238.429663]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
>> [211238.434077]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
>> [211238.438541]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
>> [211238.442899]  ? remove_wait_queue+0x60/0x60
>> [211238.446503]  transaction_kthread+0x195/0x1b0 [btrfs]
>> [211238.450578]  kthread+0xfc/0x130
>> [211238.453924]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
>> [211238.458381]  ? kthread_create_on_node+0x70/0x70
>> [211238.462225]  ? do_group_exit+0x3a/0xa0
>> [211238.465586]  ret_from_fork+0x1f/0x30
>> [211238.468814] Code: ff 48 c7 c6 28 97 58 c0 48 c7 c7 a0 e1 5d c0 e8 0c d0 f7 d5 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 58 0c 59 c0 e8 70 2f 9e d5 <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
>> [211238.482366] ---[ end trace 48dd1ab4e2e46f6e ]---
>> [211238.486524] BTRFS info (device xvdc): space_info 4 has 18446744073258958848 free, is not full
>> [211238.493014] BTRFS info (device xvdc): space_info total=10737418240, used=7828127744, pinned=2128166912, reserved=243367936, may_use=988282880, readonly=65536
> 
> Ok so the numbers here are helpful, they show that we have enough space
> to allocate a chunk. I've also looked at the logic in 4.14.13 and all
> the necessary patches are there. Unfortunately none of this matters due
> to the fact that reserve_metadata_bytes is being called with
> BTRFS_RESERVE_NO_FLUSH from use_block_rsv, meaning the code won't make
> any effort to flush anything at all.
> 
> Can you tell again what the workload is - is it some sort of a database,
> constantly writing to its files?

Yes, we have PostgreSQL databases running these VMs that put a heavy I/O load on these machines. We also have snapshots being deleted and created every 15 minutes. Looking at historical atop data for the two most recent crashes:

1. Right before last week's crash, looking at the last three minutes before the crash, Postgres databases were averaging a total of 45 MB/s in reads and 29 MB/s in writes. btrfs balance was also running at that time, which was reading at 1 MB/s and writing at 30 MB/s. btrfs-transacti kernel processes were doing only a small amount of disk I/O, until a minute before the crash, at which time they ramped up to 0.4 MB/s in reads and 6 MB/s in writes. For cumulative I/O on the entire VM, in the three minutes prior to the crash we saw an average of 95% I/O utilization, 2,200 read ops/sec, 47 MB/s read, 2,700 write ops/sec, 77 MB/s written.

2. Right before yesterday's crash (which had no balance running), Postgres databases were averaging a total of 81 MB/s in reads and 14 MB/s in writes. btrfs-cleaner and btrfs-transacti kernel processes weren't doing much I/O until the minute before the crash, at which point btrfs-cleaner was reading at 2 MB/s and writing at 16 MB/s, and btrfs-transacti was reading at 1 MB/s and writing at 5 MB/s. For cumulative I/O on the entire VM, in the three minutes prior to the crash we saw an average of 103% I/O utilization, 2,600 read ops/sec, 86 MB/s read, 1,300 write ops/sec, 35 MB/s written.

I just realized both crashes happened within 5 minutes of snapshot creation + deletion. Not sure if that's just a coincidence.


> If so, btrfs is not really suited for
> rewrite-heavy workloads since this causes excessive CoW. In such cases
> you really ought to set nodatacow on the specified files. For more
> information:
> 
> https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F

Am I correct to understand that nodatacow doesn't really avoid CoW when you're using snapshots? In a filesystem that's snapshotted every 15 minutes, is there a difference between normal CoW and nodatacow when (in the case of Postgres) you update a small portion of a 1GB file many times per minute? Do you anticipate us seeing a benefit in stability and performance if we set nodatacow for the entire FS while retaining snapshots? Does nodatacow increase the chance of corruption in a database like Postgres, i.e. are writes still properly ordered/sync'ed when flushed to disk?


> The other thing that comes to mind is to try and tune the default commit
> interval. Currently this is 30 seconds, meaning a transaction will
> happen roughly every 30 seconds (unless there is enough data batched
> that it should happen "NOW", where "NOW" is defined as "during the life
> cycle of some arbitrary operation"). Perhaps in your case you could
> potentially lower it to, say 15s. This will likely have performance
> impact but should reduce the premature ENOSPC since there should be a
> lot less data in the transaction.
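> For example, something like this as a sketch (the mount point is just a
> placeholder):
>
>     mount -o remount,commit=15 /data
>
> or add commit=15 to the mount options in /etc/fstab to make it permanent.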
> 
> In the meantime I will investigate what can be done in case we get
> ENOSPC from NO_FLUSH contexts, my feeling is that we should be able to
> force allocate a new chunk. But this will take some time to verify.

Thanks for your help!

Alex


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-09 19:05             ` Alex Adriaanse
@ 2018-03-10 12:04               ` Nikolay Borisov
  2018-03-10 14:29                 ` Christoph Anton Mitterer
  2018-03-13 13:47               ` Patrik Lundquist
  1 sibling, 1 reply; 31+ messages in thread
From: Nikolay Borisov @ 2018-03-10 12:04 UTC (permalink / raw)
  To: Alex Adriaanse; +Cc: linux-btrfs



On  9.03.2018 21:05, Alex Adriaanse wrote:
> Am I correct to understand that nodatacow doesn't really avoid CoW when you're using snapshots? In a filesystem that's snapshotted 

Yes, so nodatacow won't interfere with how snapshots operate. For more
information on that topic check the following mailing list thread:
https://www.spinics.net/lists/linux-btrfs/msg62715.html

> every 15 minutes, is there a difference between normal CoW and nodatacow
> when (in the case of Postgres) you update a small portion of a 1GB file
> many times per minute? Do you anticipate us seeing a benefit in
> stability and performance if we set nodatacow for the
So regarding this, you can check:
https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

Essentially every small, random Postgres update in the db file will
cause a CoW operation + checksum IO, which causes, and I quote,
"thrashing on HDDs and excessive multi-second spikes of CPU load on
systems with an SSD or a large amount of RAM."

So for OLTP workloads you definitely want nodatacow enabled; bear in
mind this also disables crc checksumming, but your db engine should
already have such functionality implemented in it.
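
For example, something along these lines (untested sketch, the path is
just a placeholder; the attribute only applies to files created after it
is set, so set it on an empty directory and copy/restore the data into it):

    mkdir -p /var/lib/postgresql/nocow
    chattr +C /var/lib/postgresql/nocow
    lsattr -d /var/lib/postgresql/nocow   # should now list the 'C' attribute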

> entire FS while retaining snapshots? Does nodatacow increase the chance
> of corruption in a database like Postgres, i.e. are writes still
> properly ordered/sync'ed when flushed to disk?

Well, most modern DBs already implement some sort of WAL, so the
reliability responsibility is shifted onto the db engine.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-10 12:04               ` Nikolay Borisov
@ 2018-03-10 14:29                 ` Christoph Anton Mitterer
  2018-03-11 17:51                   ` Goffredo Baroncelli
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Anton Mitterer @ 2018-03-10 14:29 UTC (permalink / raw)
  To: Nikolay Borisov, Alex Adriaanse; +Cc: linux-btrfs

On Sat, 2018-03-10 at 14:04 +0200, Nikolay Borisov wrote:
> So for OLTP workloads you definitely want nodatacow enabled, bear in
> mind this also disables crc checksumming, but your db engine should
> already have such functionality implemented in it.

Unlike repeated claims made here on the list and other places... I
wouldn't know of *any* DB system which actually does this by default, or
in a way that would be comparable to filesystem-level checksumming.


Look back in the archives... when I asked several times for
checksumming support *with* nodatacow, I evaluated the existing status
for the big ones (postgres, mysql, sqlite, bdb)... and all of them had
this either not enabled by default, not implemented at all, or requiring
special support from the program using the DB.


Similarly, btw: not a single VM image type I evaluated back then had any
form of checksumming integrated.


Still, one of the major deficiencies (not in comparison to other fs,
but in comparison to how it should be) of btrfs unfortunately :-(


Cheers,
Chris.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-10 14:29                 ` Christoph Anton Mitterer
@ 2018-03-11 17:51                   ` Goffredo Baroncelli
  2018-03-11 22:37                     ` Christoph Anton Mitterer
  0 siblings, 1 reply; 31+ messages in thread
From: Goffredo Baroncelli @ 2018-03-11 17:51 UTC (permalink / raw)
  To: Christoph Anton Mitterer, Nikolay Borisov, Alex Adriaanse; +Cc: linux-btrfs

On 03/10/2018 03:29 PM, Christoph Anton Mitterer wrote:
> On Sat, 2018-03-10 at 14:04 +0200, Nikolay Borisov wrote:
>> So for OLTP workloads you definitely want nodatacow enabled, bear in
>> mind this also disables crc checksumming, but your db engine should
>> already have such functionality implemented in it.
> 
> Unlike repeated claims made here on the list and other places... I
> woudln't now *any* DB system which actually does this per default and
> or in a way that would be comparable to filesystem lvl checksumming.
> 

I agree with you; also, nobody warns that without checksums, in the case of a RAID filesystem, BTRFS is no longer able to check whether a stripe is correct or not

> 
> Look back in the archives... when I've asked several times for
> checksumming support *with* nodatacow, I evaluated the existing status
> for the big ones (postgres,mysql,sqlite,bdb)... and all of them had
> this either not enabled per default, not at all, or requiring special
> support for the program using the DB.
> 
> 
> Similar btw: no single VM image type I've evaluated back then had any
> form of checksumming integrated.
> 
> 
> Still, one of the major deficiencies (not in comparison to other fs,
> but in comparison to how it should be) of btrfs unfortunately :-(

COW is needed to properly checksum the data. Otherwise it is not possible to ensure coherency between data and checksum (however, I have to point out that BTRFS fails even in this case [*]).
We could rearrange this sentence, saying: if you want checksums, you need COW...

> 
> 
> Cheers,
> Chris.

[*] https://www.spinics.net/lists/linux-btrfs/msg69185.html

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-11 17:51                   ` Goffredo Baroncelli
@ 2018-03-11 22:37                     ` Christoph Anton Mitterer
  2018-03-12 21:22                       ` Goffredo Baroncelli
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Anton Mitterer @ 2018-03-11 22:37 UTC (permalink / raw)
  To: kreijack; +Cc: linux-btrfs

On Sun, 2018-03-11 at 18:51 +0100, Goffredo Baroncelli wrote:
> 
> COW is needed to properly checksum the data. Otherwise is not
> possible to ensure the coherency between data and checksum (however I
> have to point out that BTRFS fails even in this case [*]).
> We could rearrange this sentence, saying that: if you want checksum,
> you need COW...

No,... not really... the meta-data is anyway always CoWed... so if you
do checksums *and* nodatacow,... the only thing that could possibly
happen (in the worst case) is that data that actually made it
correctly to the disk is falsely determined bad, because the metadata
(i.e. the checksums) weren't updated correctly.

That however is probably much less likely than the other way round,...
i.e. bad data went to disk and would be detected with checksumming.


I had lots of discussions about this here on the list, and no one ever
brought up a real argument against it... I also had an off-list
discussion with Chris Mason who IIRC confirmed that it would actually
work as I imagine it... with the only two problems:
- good data possibly being marked bad because of bad checksums
- reads giving back EIO where people would rather prefer bad data
(not really sure if these were really his two arguments,... I'd have to
look it up, so don't nail me down).


Long story short:

In any case, I think giving back bad data without EIO is unacceptable.
If someone really doesn't care (e.g. because he has higher level
checksumming and possibly even repair) he could still manually disable
checksumming.

The small chance of having a false positive weighs IMO far less than
having very large amounts of data (DBs, VM images are our typical cases)
completely unprotected.

And not having checksumming with nodatacow breaks any safe raid repair
(so in that case "repair" may even overwrite good data),... which is
IMO also unacceptable.
And the typical use cases for nodatacow (VMs, DBs) are in turn not so
uncommon to also want RAID.


I really like btrfs,... and it's not that other fs (which typically
have no checksumming at all) would perform better here... but not
having it for these major use cases is a big disappointment for me.


Cheers,
Chris.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-11 22:37                     ` Christoph Anton Mitterer
@ 2018-03-12 21:22                       ` Goffredo Baroncelli
  2018-03-12 21:48                         ` Christoph Anton Mitterer
  0 siblings, 1 reply; 31+ messages in thread
From: Goffredo Baroncelli @ 2018-03-12 21:22 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

On 03/11/2018 11:37 PM, Christoph Anton Mitterer wrote:
> On Sun, 2018-03-11 at 18:51 +0100, Goffredo Baroncelli wrote:
>>
>> COW is needed to properly checksum the data. Otherwise is not
>> possible to ensure the coherency between data and checksum (however I
>> have to point out that BTRFS fails even in this case [*]).
>> We could rearrange this sentence, saying that: if you want checksum,
>> you need COW...
> 
> No,... not really... the meta-data is anyway always CoWed... so if you
> do checksum *and* notdatacow,..., the only thing that could possibly
> happen (in the worst case) is, that data that actually made it
> correctly to the disk is falsely determined bad, as the metadata (i.e.
> the checksums) weren't upgraded correctly.
> 
> That however is probably much less likely than the other way round,..
> i.e. bad data went to disk and would be detected with checksuming.

Unfortunately no, the likelihood might be 100%: there are some patterns which trigger this problem quite easily. See the link which I posted in my previous email. There was a program which created a bad checksum (in COW+DATASUM mode), and the file became unreadable.

> 
> 
> I had lots of discussions about this here on the list, and no one ever
> brought up a real argument against it... I also had an off-list
> discussion with Chris Mason who IIRC confirmed that it would actually
> work as I imagine it... with the only two problems:
> - good data possibly be marked bad because of bad checksums
> - reads giving back EIO where people would rather prefer bad data

If you cannot know if a checksum is bad or the data is bad, the checksum is not useful at all!

If I read correctly what you wrote, it seems that you consider it a "minor issue" that the checksum is not correct. If you accept the possibility that a checksum might be wrong, you won't trust the checksum anymore; so the checksum becomes useless.
 

> (not really sure if this were really his two arguments,... I'd have to
> look it up, so don't nail me down).
> 
> 
> Long story short:
> 
> In any case, I think giving back bad data without EIO is unacceptable.
> If someone really doesn't care (e.g. because he has higher level
> checksumming and possibly even repair) he could still manually disable
> checksumming.
> 
> The little chance of having a false positive weights IMO far less that
> have very large amounts of data (DBs, VM images are our typical cases)
> completely unprotected.

Again, you are assuming that the likelihood of having a bad checksum is low. Unfortunately this is not true. There are patterns which exploit this bug with a likelihood of 100%.

> 
> And not having checksumming with notdatacow breaks any safe raid repair
> (so in that case "repair" may even overwrite good data),... which is
> IMO also unacceptable.
> And the typical use cases for nodatacow (VMs, DBs) are in turn not so
> uncommon to want RAID.
> 
> 
> I really like btrfs,... and it's not that other fs (which typically
> have no checksumming at all) would perform better here... but not
> having it for these major use case is a big disappointment for me.
> 
> 
> Cheers,
> Chris.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-12 21:22                       ` Goffredo Baroncelli
@ 2018-03-12 21:48                         ` Christoph Anton Mitterer
  2018-03-13 19:36                           ` Goffredo Baroncelli
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Anton Mitterer @ 2018-03-12 21:48 UTC (permalink / raw)
  To: kreijack; +Cc: linux-btrfs

On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote:
> Unfortunately no, the likelihood might be 100%: there are some
> patterns which trigger this problem quite easily. See The link which
> I posted in my previous email. There was a program which creates a
> bad checksum (in COW+DATASUM mode), and the file became unreadable.
But that rather seems like a plain bug?!

No reason that would conceptually make checksumming+nodatacow
impossible.

AFAIU, the conceptual thing would be about:
- data is written in nodatacow
  => thus a checksum must be written as well, so write it
- what can then of course happen is
  - both csum and data are written => fine
  - csum is written but data not and then some crash => csum will show
    that => fine
  - data is written but csum not and then some crash => csum will give
    false positive

Still better a few false positives than many unnoticed data corruptions
and no true raid repair.


> If you cannot know if a checksum is bad or the data is bad, the
> checksum is not useful at all!
Why not? It's anyway only uncertain in the case of a crash,... and it at
least tells you that something is fishy.
A program which cares about its data will provide its own journaling
mechanism and can simply recover via that... or users could then just
roll back to a backup.
Or one could provide some API/userland tool to recompute the csums of
the affected file (and possibly live with bad data).


> If I read correctly what you wrote, it seems that you consider a
> "minor issue" the fact that the checksum is not correct. If you
> accept the possibility that a checksum might be wrong, you wont trust
> anymore the checksum; so the checksum became not useful.
Compared to not having checksumming at all, checksumming in the
nodatacow case has simply no disadvantage.
Because without checksums you never have any idea whether your data is
correct or not... whereas checksumming + nodatacow, which can give a
false positive after a crash (when the data was written correctly but
the checksum wasn't), at least covers the other cases of data corruption
(silent data corruption; csum written, but data not written or only
partially, in case of a crash).


> Again, you are assuming that the likelihood of having a bad checksum
> is low. Unfortunately this is not true. There are pattern which
> exploits this bug with a likelihood=100%.

Okay I don't understand why this would be so and wouldn't assume that
the IO pattern can affect it heavily... but I'm not really a btrfs
expert.

My blind assumption would have been that writing an extent of data
takes much longer to complete than writing the corresponding checksum.

Even if not... it should only be a problem in case of a crash during
that,... and then I'd still prefer to get the false positive than bad
data.


Anyway... it's not going to happen so the discussion is pointless.
I think people can probably use dm-integrity (which btw: does no CoW
either (IIRC) and still can provide integrity... ;-) ) to see whether
their data is valid.
Not nice, but since it won't change in btrfs, it's a possible alternative.


Cheers,
Chris.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-09 19:05             ` Alex Adriaanse
  2018-03-10 12:04               ` Nikolay Borisov
@ 2018-03-13 13:47               ` Patrik Lundquist
  1 sibling, 0 replies; 31+ messages in thread
From: Patrik Lundquist @ 2018-03-13 13:47 UTC (permalink / raw)
  To: Alex Adriaanse; +Cc: linux-btrfs

On 9 March 2018 at 20:05, Alex Adriaanse <alex@oseberg.io> wrote:
>
> Yes, we have PostgreSQL databases running these VMs that put a heavy I/O load on these machines.

Dump the databases and recreate them with --data-checksums and Btrfs
No_COW attribute.

You can add this to /etc/postgresql-common/createcluster.conf in
Debian/Ubuntu if you use pg_createcluster:
initdb_options = '--data-checksums'
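
A rough sketch of the whole procedure (cluster version/name and paths
are placeholders, adapt to your setup):

    pg_dumpall > /backup/all.sql       # dump while the old cluster is still up
    pg_dropcluster --stop 10 main      # drop the old cluster
    chattr +C /var/lib/postgresql      # new files/dirs below it inherit No_COW
    pg_createcluster --start 10 main   # picks up initdb_options = '--data-checksums'
    psql -f /backup/all.sql postgres   # restore the dump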

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-12 21:48                         ` Christoph Anton Mitterer
@ 2018-03-13 19:36                           ` Goffredo Baroncelli
  2018-03-13 20:10                             ` Christoph Anton Mitterer
  2018-03-14 12:02                             ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 31+ messages in thread
From: Goffredo Baroncelli @ 2018-03-13 19:36 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

On 03/12/2018 10:48 PM, Christoph Anton Mitterer wrote:
> On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote:
>> Unfortunately no, the likelihood might be 100%: there are some
>> patterns which trigger this problem quite easily. See The link which
>> I posted in my previous email. There was a program which creates a
>> bad checksum (in COW+DATASUM mode), and the file became unreadable.
> But that rather seems like a plain bug?!

You are right; unfortunately it seems that it is catalogued as WONT-FIX :(

> No reason that would conceptually make checksumming+notdatacow
> impossible.
> 
> AFAIU, the conceptual thin would be about:
> - data is written in nodatacow
>   => thus a checksum must be written as well, so write it
> - what can then of course happen is
>   - both csum and data are written => fine
>   - csum is written but data not and then some crash => csum will show
>     that => fine
>   - data is written but csum not and then some crash => csum will give
>     false positive
> 
> Still better few false positives, as many unnoticed data corruptions
> and no true raid repair.

A checksum mismatch is returned as -EIO by a read() syscall. This is an event handled badly by most programs.
E.g. suppose that a page of a VM RAM image file has a wrong checksum. When the VM starts, it tries to read the page, gets -EIO, and aborts. It is even possible that it could not print which page is corrupted. In this case, how does the user understand the problem, and what could he do?


[....]

> 
>> Again, you are assuming that the likelihood of having a bad checksum
>> is low. Unfortunately this is not true. There are pattern which
>> exploits this bug with a likelihood=100%.
> 
> Okay I don't understand why this would be so and wouldn't assume that
> the IO pattern can affect it heavily... but I'm not really btrfs
> expert.
> 
> My blind assumption would have been that writing an extent of data
> takes much longer to complete than writing the corresponding checksum.

The problem is the following: there is a time window between the checksum computation and the writing of the data to the disk (which is done at the lower level via a DMA channel), where if the data is updated the checksum will mismatch. This happens if we have two threads, where the first commits the data to the disk, and the second one updates the data (I think that both VMs and databases could behave this way).

In btrfs, a checksum mismatch creates an -EIO error during the reading. In a conventional filesystem (or a btrfs filesystem w/o datasum) there is no checksum, so this problem doesn't exist. 

I am curious how ZFS solves this problem.

However I have to point out that this problem is not solved by COW. COW only solves the problem of an interrupted commit of the filesystem, where the data is updated in place (so it is visible to the user) but the metadata is not.


> 
> Even if not... I should be only a problem in case of a crash during
> that,.. and than I'd still prefer to get the false positive than bad
> data.

How can you know if it is "bad data" or a "bad checksum"?


> 
> 
> Anyway... it's not going to happen so the discussion is pointless.
> I think people can probably use dm-integrity (which btw: does no CoW
> either (IIRC) and still can provide integrity... ;-) ) to see whether
> their data is valid.
> No nice but since it won't change on btrfs, a possible alternative.

Even in this case I am curious how dm-integrity would solve this issue.

> 
> 
> Cheers,
> Chris.


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-13 19:36                           ` Goffredo Baroncelli
@ 2018-03-13 20:10                             ` Christoph Anton Mitterer
  2018-03-14 12:02                             ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Anton Mitterer @ 2018-03-13 20:10 UTC (permalink / raw)
  To: kreijack; +Cc: linux-btrfs

On Tue, 2018-03-13 at 20:36 +0100, Goffredo Baroncelli wrote:
> A checksum mismatch, is returned as -EIO by a read() syscall. This is
> an event handled badly by most part of the programs.
Then these programs must simply be fixed... otherwise they'll also fail
under normal circumstances with btrfs, if there is any corruption.


> The problem is the following: there is a time window between the
> checksum computation and the writing the data on the disk (which is
> done at the lower level via a DMA channel), where if the data is
> update the checksum would mismatch. This happens if we have two
> threads, where the first commits the data on the disk, and the second
> one updates the data (I think that both VM and database could behave
> so).
Well that's clear... but isn't that time frame also there if the extent
is just written without CoW (regardless of checksumming)?
Obviously there would need to be some protection here anyway, so that
such data is taken e.g. from RAM, before the write has completed, so
that the read wouldn't take place while the write has only half
finished?!
So I'd naively assume one could just extend that protection to the
completion of the checksum write,...



> In btrfs, a checksum mismatch creates an -EIO error during the
> reading. In a conventional filesystem (or a btrfs filesystem w/o
> datasum) there is no checksum, so this problem doesn't exist.
If ext writes an extent (can't that be up to 128MiB there?), then I'm
sure it cannot write that atomically (in terms of hardware)... so there
is likely some protection around this operation, that there are no
concurrent reads of that particular extent from the disk, while the
write hasn't finished yet.



> > Even if not... I should be only a problem in case of a crash during
> > that,.. and than I'd still prefer to get the false positive than
> > bad
> > data.
> 
> How you can know if it is a "bad data" or a "bad checksum" ?
Well as I've said, in my naive thinking this should only be a problem
in case of a crash... and then, yes, one cannot say whether it's bad
data or checksum (that's exactly what I'm saying)... but I'd rather
prefer to know that something might be fishy, than not knowing anything
and perhaps even getting good data "RAID-repaired" with bad data...


Cheers,
Chris.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-13 19:36                           ` Goffredo Baroncelli
  2018-03-13 20:10                             ` Christoph Anton Mitterer
@ 2018-03-14 12:02                             ` Austin S. Hemmelgarn
  2018-03-14 18:39                               ` Goffredo Baroncelli
  1 sibling, 1 reply; 31+ messages in thread
From: Austin S. Hemmelgarn @ 2018-03-14 12:02 UTC (permalink / raw)
  To: kreijack, Christoph Anton Mitterer; +Cc: linux-btrfs

On 2018-03-13 15:36, Goffredo Baroncelli wrote:
> On 03/12/2018 10:48 PM, Christoph Anton Mitterer wrote:
>> On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote:
>>> Unfortunately no, the likelihood might be 100%: there are some
>>> patterns which trigger this problem quite easily. See The link which
>>> I posted in my previous email. There was a program which creates a
>>> bad checksum (in COW+DATASUM mode), and the file became unreadable.
>> But that rather seems like a plain bug?!
> 
> You are right, unfortunately it seems that it is catalogate as WONT-FIX :(
> 
>> No reason that would conceptually make checksumming+notdatacow
>> impossible.
>>
>> AFAIU, the conceptual thin would be about:
>> - data is written in nodatacow
>>    => thus a checksum must be written as well, so write it
>> - what can then of course happen is
>>    - both csum and data are written => fine
>>    - csum is written but data not and then some crash => csum will show
>>      that => fine
>>    - data is written but csum not and then some crash => csum will give
>>      false positive
>>
>> Still better few false positives, as many unnoticed data corruptions
>> and no true raid repair.
> 
> A checksum mismatch, is returned as -EIO by a read() syscall. This is an event handled badly by most part of the programs.
> I.e. suppose that a page of a VM ram image file has a wrong checksum. When the VM starts, tries to read the page, got -EIO and aborts. It is even possible that it could not print which page is corrupted. In this case, how the user understand the problem, and what he could do ?
Check the kernel log on the host system, which should have an error 
message saying which block failed.  If the VM itself actually gets to 
the point of booting into an OS (and properly propagates things like 
-EIO to the guest environment like it should), that OS should also log 
where the error was.
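
Something like this on the host is usually all that's needed (the exact
wording of the message varies a bit between kernel versions):

    dmesg | grep -i 'csum failed'
    # e.g.: BTRFS warning (device xvdc): csum failed ino 257 off 1052672 ...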

Most of the reason user applications don't tell you where the error was 
is because the kernel already does it on any sensible system, and the 
kernel tells you _exactly_ where the error was (exact block and device 
that threw the error), which user applications can't really do (they 
generally can't get sufficiently low-level information to give you all 
the info the kernel does).
> 
>>
>>> Again, you are assuming that the likelihood of having a bad checksum
>>> is low. Unfortunately this is not true. There are pattern which
>>> exploits this bug with a likelihood=100%.
>>
>> Okay I don't understand why this would be so and wouldn't assume that
>> the IO pattern can affect it heavily... but I'm not really btrfs
>> expert.
>>
>> My blind assumption would have been that writing an extent of data
>> takes much longer to complete than writing the corresponding checksum.
> 
> The problem is the following: there is a time window between the checksum computation and the writing the data on the disk (which is done at the lower level via a DMA channel), where if the data is update the checksum would mismatch. This happens if we have two threads, where the first commits the data on the disk, and the second one updates the data (I think that both VM and database could behave so).
Though it only matters if you use O_DIRECT or the files in question are 
NOCOW.
> 
> In btrfs, a checksum mismatch creates an -EIO error during the reading. In a conventional filesystem (or a btrfs filesystem w/o datasum) there is no checksum, so this problem doesn't exist.
> 
> I am curious how ZFS solves this problem.
It doesn't support disabling COW or the O_DIRECT flag, so it just never 
has the problem in the first place.
> 
> However I have to point out that this problem is not solved by the COW. COW solved only the problem about an interrupted commit of the filesystem, where the data is update in place (so it is available by the user), but the metadata not.
COW is irrelevant if you're bypassing it.  It's only enforced for 
metadata so that you don't have to check the FS every time you mount it 
(because the way BTRFS uses it guarantees consistency of the metadata).
> 
>>
>> Even if not... I should be only a problem in case of a crash during
>> that,.. and than I'd still prefer to get the false positive than bad
>> data.
> 
> How you can know if it is a "bad data" or a "bad checksum" ?
You can't directly.  Just like you can't know which copy in a two-device 
MD RAID1 array is bad when they mismatch.

That's part of why I'm not all that fond of the idea of having checksums 
without COW: you need to verify the data using secondary means anyway, 
so why exactly should you waste time verifying it twice?
> 
>>
>> Anyway... it's not going to happen so the discussion is pointless.
>> I think people can probably use dm-integrity (which btw: does no CoW
>> either (IIRC) and still can provide integrity... ;-) ) to see whether
>> their data is valid.
>> No nice but since it won't change on btrfs, a possible alternative.
> 
> Even in this case I am curious about dm-integrity would sole this issue.
dm-integrity uses journaling, and actually based on the testing I've 
done, will typically have much worse performance than the overhead of 
just enabling COW on files on BTRFS and manually defragmenting them on a 
regular basis.
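
As a rough point of comparison, the periodic defrag can be as simple as
running something like this from cron (the path is just a placeholder):

    btrfs filesystem defragment -r /var/lib/postgresql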

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-14 12:02                             ` Austin S. Hemmelgarn
@ 2018-03-14 18:39                               ` Goffredo Baroncelli
  2018-03-14 19:27                                 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 31+ messages in thread
From: Goffredo Baroncelli @ 2018-03-14 18:39 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Christoph Anton Mitterer; +Cc: linux-btrfs

On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
[...]
>>
>> In btrfs, a checksum mismatch creates an -EIO error during the reading. In a conventional filesystem (or a btrfs filesystem w/o datasum) there is no checksum, so this problem doesn't exist.
>>
>> I am curious how ZFS solves this problem.
> It doesn't support disabling COW or the O_DIRECT flag, so it just never has the problem in the first place.

I would like to perform some tests; however, I think that you are right. If you take a "double buffering" approach (copy the data into the page cache, compute the checksum, then write the data to disk), the mismatch should not happen. Of course this is incompatible with O_DIRECT; but disabling O_DIRECT is only a prerequisite for the "double buffering"; alone it wouldn't be sufficient. What about mmap? Are we sure that this does double buffering?

I would prefer that btrfs didn't allow O_DIRECT on COW files. I prefer this to the checksum mismatch bug.


>>
>> However I have to point out that this problem is not solved by the COW. COW solved only the problem about an interrupted commit of the filesystem, where the data is update in place (so it is available by the user), but the metadata not.
> COW is irrelevant if you're bypassing it.  It's only enforced for metadata so that you don't have to check the FS every time you mount it (because the way BTRFS uses it guarantees consistency of the metadata).
>>
>>>
>>> Even if not... I should be only a problem in case of a crash during
>>> that,.. and than I'd still prefer to get the false positive than bad
>>> data.
>>
>> How you can know if it is a "bad data" or a "bad checksum" ?
> You can't directly.  Just like you can't know which copy in a two-device MD RAID1 array is bad when they mismatch.
> 
> That's part of why I'm not all that fond of the idea of having checksums without COW, you need to verify the data using secondary means anyway, so why exactly should you waste time verifying it twice?

This is true

>>
>>>
>>> Anyway... it's not going to happen so the discussion is pointless.
>>> I think people can probably use dm-integrity (which btw: does no CoW
>>> either (IIRC) and still can provide integrity... ;-) ) to see whether
>>> their data is valid.
>>> No nice but since it won't change on btrfs, a possible alternative.
>>
>> Even in this case I am curious about dm-integrity would sole this issue.
> dm-integrity uses journaling, and actually based on the testing I've done, will typically have much worse performance than the overhead of just enabling COW on files on BTRFS and manually defragmenting them on a regular basis.

Good to know
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-14 18:39                               ` Goffredo Baroncelli
@ 2018-03-14 19:27                                 ` Austin S. Hemmelgarn
  2018-03-14 22:17                                   ` Goffredo Baroncelli
  0 siblings, 1 reply; 31+ messages in thread
From: Austin S. Hemmelgarn @ 2018-03-14 19:27 UTC (permalink / raw)
  To: kreijack, Christoph Anton Mitterer; +Cc: linux-btrfs

On 2018-03-14 14:39, Goffredo Baroncelli wrote:
> On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
> [...]
>>>
>>> In btrfs, a checksum mismatch creates an -EIO error during the reading. In a conventional filesystem (or a btrfs filesystem w/o datasum) there is no checksum, so this problem doesn't exist.
>>>
>>> I am curious how ZFS solves this problem.
>> It doesn't support disabling COW or the O_DIRECT flag, so it just never has the problem in the first place.
> 
> I would like to perform some tests: however I think that you are right. if you make a "double buffering" approach (copy the data in the page cache, compute the checksum, then write the data to disk), the mismatch should not happen. Of course this is incompatible with O_DIRECT; but disabling O_DIRECT is a prerequisite to the "double buffering"; alone it couldn't be sufficient; what about mmap ? Are we sure that this does a double buffering ?
There's a whole lot of applications that would be showing some pretty 
serious issues if checksumming didn't work correctly with mmap(), so I 
think it does work correctly given that we don't have hordes of angry 
users and sysadmins beating down the doors.
> 
> I would prefer that btrfs doesn't allow O_DIRECT with the COW files. I prefer this to the checksum mismatch bug.
This is only reasonable if you are writing to the files.  Checksums 
appear to be checked on O_DIRECT reads, and outside of databases and 
VM's, read-only access accounts for a significant percentage of O_DIRECT 
usage, partly because it is needed for AIO support (nginx for example 
can serve files using AIO and O_DIRECT and gets a pretty serious 
performance boost on heavily loaded systems by doing so).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Ongoing Btrfs stability issues
  2018-03-14 19:27                                 ` Austin S. Hemmelgarn
@ 2018-03-14 22:17                                   ` Goffredo Baroncelli
  0 siblings, 0 replies; 31+ messages in thread
From: Goffredo Baroncelli @ 2018-03-14 22:17 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Christoph Anton Mitterer; +Cc: linux-btrfs

On 03/14/2018 08:27 PM, Austin S. Hemmelgarn wrote:
> On 2018-03-14 14:39, Goffredo Baroncelli wrote:
>> On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
>> [...]
>>>>
>>>> In btrfs, a checksum mismatch creates an -EIO error during the reading. In a conventional filesystem (or a btrfs filesystem w/o datasum) there is no checksum, so this problem doesn't exist.
>>>>
>>>> I am curious how ZFS solves this problem.
>>> It doesn't support disabling COW or the O_DIRECT flag, so it just never has the problem in the first place.
>>
>> I would like to perform some tests: however I think that you are right. if you make a "double buffering" approach (copy the data in the page cache, compute the checksum, then write the data to disk), the mismatch should not happen. Of course this is incompatible with O_DIRECT; but disabling O_DIRECT is a prerequisite to the "double buffering"; alone it couldn't be sufficient; what about mmap ? Are we sure that this does a double buffering ?
> There's a whole lot of applications that would be showing some pretty serious issues if checksumming didn't work correctly with mmap(), so I think it does work correctly given that we don't have hordes of angry users and sysadmins beating down the doors.

I tried updating a page and writing it in parallel from different threads; I was unable to reproduce a checksum mismatch, so it seems that mmap is safe from this point of view.

>>
>> I would prefer that btrfs doesn't allow O_DIRECT with the COW files. I prefer this to the checksum mismatch bug.
> This is only reasonable if you are writing to the files.  Checksums appear to be checked on O_DIRECT reads, and outside of databases and VM's, read-only access accounts for a significant percentage of O_DIRECT usage, partly because it is needed for AIO support (nginx for example can serve files using AIO and O_DIRECT and gets a pretty serious performance boost on heavily loaded systems by doing so).
> 

So O_DIRECT should be unsupported/ignored only for writing? It could be a good compromise...

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-03-14 22:17 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-15 16:18 Ongoing Btrfs stability issues Alex Adriaanse
2018-02-15 18:00 ` Nikolay Borisov
2018-02-15 19:41   ` Alex Adriaanse
2018-02-15 20:42     ` Nikolay Borisov
2018-02-16  4:54       ` Alex Adriaanse
2018-02-16  7:40         ` Nikolay Borisov
2018-02-16 19:44 ` Austin S. Hemmelgarn
2018-02-17  3:03   ` Duncan
2018-02-17  4:34     ` Shehbaz Jaffer
2018-02-17 15:18       ` Hans van Kranenburg
2018-02-17 16:42         ` Shehbaz Jaffer
2018-03-01 19:04   ` Alex Adriaanse
2018-03-01 19:40     ` Nikolay Borisov
2018-03-02 17:29       ` Liu Bo
2018-03-08 17:40         ` Alex Adriaanse
2018-03-09  9:54           ` Nikolay Borisov
2018-03-09 19:05             ` Alex Adriaanse
2018-03-10 12:04               ` Nikolay Borisov
2018-03-10 14:29                 ` Christoph Anton Mitterer
2018-03-11 17:51                   ` Goffredo Baroncelli
2018-03-11 22:37                     ` Christoph Anton Mitterer
2018-03-12 21:22                       ` Goffredo Baroncelli
2018-03-12 21:48                         ` Christoph Anton Mitterer
2018-03-13 19:36                           ` Goffredo Baroncelli
2018-03-13 20:10                             ` Christoph Anton Mitterer
2018-03-14 12:02                             ` Austin S. Hemmelgarn
2018-03-14 18:39                               ` Goffredo Baroncelli
2018-03-14 19:27                                 ` Austin S. Hemmelgarn
2018-03-14 22:17                                   ` Goffredo Baroncelli
2018-03-13 13:47               ` Patrik Lundquist
2018-03-02  4:02     ` Qu Wenruo
