linux-btrfs.vger.kernel.org archive mirror
* very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
@ 2018-09-19  8:43 Tomasz Chmielewski
  2018-09-19  9:33 ` Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Tomasz Chmielewski @ 2018-09-19  8:43 UTC (permalink / raw)
  To: Btrfs BTRFS

I have a mysql slave which writes to a RAID-1 btrfs filesystem (with 
4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full.

The slave receives around 0.5-1 MB/s of data from the master over the 
network, which is then saved to MySQL's relay log and executed. In ideal 
conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s 
of data written to disk.

MySQL directory and files in it are chattr +C (since the directory was 
created, so all files are really +C); there are no snapshots.


Now, an interesting thing.

When the filesystem is mounted with these options in fstab:

defaults,noatime,discard


We can see a *constant* write of 25-100 MB/s to each disk. The system is 
generally unresponsive and it sometimes takes long seconds for a simple 
command executed in bash to return.


However, as soon as we remount the filesystem with space_cache=v2 - 
writes drop to just around 3-10 MB/s to each disk. If we remount to 
space_cache - lots of writes, system unresponsive. Again remount to 
space_cache=v2 - low writes, system responsive.
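
For reference, the remount sequence described above looks roughly like this (a sketch; /mnt/data is a placeholder for the actual mountpoint, and note that the first mount with space_cache=v2 creates the free space tree, which then persists across mounts):

```shell
mount -o remount,space_cache=v2 /mnt/data   # low writes, system responsive
mount -o remount,space_cache /mnt/data      # heavy writes, unresponsive
mount -o remount,space_cache=v2 /mnt/data   # low writes again
```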


That's a huuge, 10x overhead! Is it expected? Especially that 
space_cache=v1 is still the default mount option?


Tomasz Chmielewski
https://lxadm.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
  2018-09-19  8:43 very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2) Tomasz Chmielewski
@ 2018-09-19  9:33 ` Qu Wenruo
  2018-09-19 12:00 ` Remi Gauvin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2018-09-19  9:33 UTC (permalink / raw)
  To: Tomasz Chmielewski, Btrfs BTRFS





On 2018/9/19 4:43 PM, Tomasz Chmielewski wrote:
> I have a mysql slave which writes to a RAID-1 btrfs filesystem (with
> 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full.

This sounds a little concerning.
Not because of the usage percentage itself, but because of the size, and
how many free space cache files could be updated in each transaction.

Details follow below.

> 
> The slave receives around 0.5-1 MB/s of data from the master over the
> network, which is then saved to MySQL's relay log and executed. In ideal
> conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s
> of data written to disk.
> 
> MySQL directory and files in it are chattr +C (since the directory was
> created, so all files are really +C); there are no snapshots.

I'm not deeply familiar with the space cache nor the MySQL workload, but
at least we don't need to worry about extra data CoW.

> 
> 
> Now, an interesting thing.
> 
> When the filesystem is mounted with these options in fstab:
> 
> defaults,noatime,discard
> 
> 
> We can see a *constant* write of 25-100 MB/s to each disk. The system is
> generally unresponsive and it sometimes takes long seconds for a simple
> command executed in bash to return.

The main concern here is how many metadata block groups are involved in
one transaction.

From my observation, although free space cache files (v1 space cache)
are marked NODATACOW, they in fact get updated in a CoW manner.

This means that if, say, 100 metadata block groups get updated, then we
need to write around 12M of data just for the space cache.

On the other hand, if we fixed the v1 space cache to really do NODATACOW,
it should hugely reduce the I/O for the free space cache.
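
The arithmetic behind that "~12M" figure can be sketched like this (the per-file size is my assumption for illustration, not something stated in the mail; actual v1 cache file sizes vary with free space fragmentation):

```python
# Back-of-the-envelope check of the "~12M for 100 block groups" estimate.
# Assumption: one v1 free space cache file is on the order of 128 KiB
# for a typical block group.
cache_file_bytes = 128 * 1024
block_groups_updated = 100

total_bytes = cache_file_bytes * block_groups_updated
print(f"{total_bytes / (1024 * 1024):.1f} MiB rewritten per commit")  # 12.5 MiB
```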

> 
> 
> However, as soon as we remount the filesystem with space_cache=v2 -
> writes drop to just around 3-10 MB/s to each disk. If we remount to
> space_cache - lots of writes, system unresponsive. Again remount to
> space_cache=v2 - low writes, system responsive.

Have you tried nospace_cache? I think it should behave a little worse
than the v2 space cache, but much better than the *broken* v1 space cache.

And the v2 space cache is already based on a btrfs btree, which gets
CoWed like all other btrfs btrees, so there is no need to rewrite the
whole space cache for each metadata block group. (Although in theory,
the overhead should still be larger than with a *working* v1 cache.)

Thanks,
Qu

> 
> 
> That's a huuge, 10x overhead! Is it expected? Especially that
> space_cache=v1 is still the default mount option?
> 
> 
> Tomasz Chmielewski
> https://lxadm.com



* Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
  2018-09-19  8:43 very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2) Tomasz Chmielewski
  2018-09-19  9:33 ` Qu Wenruo
@ 2018-09-19 12:00 ` Remi Gauvin
  2018-09-19 17:58 ` Hans van Kranenburg
  2018-09-20  7:46 ` Duncan
  3 siblings, 0 replies; 9+ messages in thread
From: Remi Gauvin @ 2018-09-19 12:00 UTC (permalink / raw)
  To: linux-btrfs


On 2018-09-19 04:43 AM, Tomasz Chmielewski wrote:
> I have a mysql slave which writes to a RAID-1 btrfs filesystem (with
> 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full.
> 
> The slave receives around 0.5-1 MB/s of data from the master over the
> network, which is then saved to MySQL's relay log and executed. In ideal
> conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s
> of data written to disk.
> 
> MySQL directory and files in it are chattr +C (since the directory was
> created, so all files are really +C); there are no snapshots.

Not related to the issue you are reporting, but I thought it's worth
mentioning (since not many do) that using chattr +C on a BTRFS RAID-1
is a dangerous thing. Without CoW, the two copies are never
synchronized, even if a scrub is executed. So any kind of unclean
shutdown that interrupts writes (not to mention the extreme of a
temporarily disconnected drive) will result in files that are
inconsistent (i.e., depending on which disk happens to be read at the
time, the data will be different on each read).
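
To check which files actually carry the No_COW attribute (the MySQL data directory path here is an example):

```shell
# The 'C' flag in lsattr output marks NOCOW files/directories.
lsattr -d /var/lib/mysql       # the directory itself
lsattr /var/lib/mysql | head   # files created inside it inherit +C
```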




* Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
  2018-09-19  8:43 very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2) Tomasz Chmielewski
  2018-09-19  9:33 ` Qu Wenruo
  2018-09-19 12:00 ` Remi Gauvin
@ 2018-09-19 17:58 ` Hans van Kranenburg
  2018-09-19 20:04   ` Martin Steigerwald
  2018-09-20  7:46 ` Duncan
  3 siblings, 1 reply; 9+ messages in thread
From: Hans van Kranenburg @ 2018-09-19 17:58 UTC (permalink / raw)
  To: Tomasz Chmielewski, Btrfs BTRFS

Hi,

On 09/19/2018 10:43 AM, Tomasz Chmielewski wrote:
> I have a mysql slave which writes to a RAID-1 btrfs filesystem (with
> 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full.
> 
> The slave receives around 0.5-1 MB/s of data from the master over the
> network, which is then saved to MySQL's relay log and executed. In ideal
> conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s
> of data written to disk.
> 
> MySQL directory and files in it are chattr +C (since the directory was
> created, so all files are really +C); there are no snapshots.
> 
> 
> Now, an interesting thing.
> 
> When the filesystem is mounted with these options in fstab:
> 
> defaults,noatime,discard
> 
> We can see a *constant* write of 25-100 MB/s to each disk. The system is
> generally unresponsive and it sometimes takes long seconds for a simple
> command executed in bash to return.

Did you already test the difference with/without 'discard'? Also, I
think that, depending on the tooling you use to view disk I/O,
discards may also show up in the disk write statistics.
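
One way to separate the two effects is to watch per-device throughput while toggling only the 'discard' option (device names are examples; iostat comes from the sysstat package):

```shell
# Per-device throughput in MB, sampled every second.
iostat -d -m 1 /dev/sda /dev/sdb /dev/sdc
# Newer sysstat versions report discard requests in separate columns
# (d/s, dMB/s) when the kernel exposes those statistics.
```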

> However, as soon as we remount the filesystem with space_cache=v2 -
> writes drop to just around 3-10 MB/s to each disk. If we remount to
> space_cache - lots of writes, system unresponsive. Again remount to
> space_cache=v2 - low writes, system responsive.
> 
> That's a huuge, 10x overhead! Is it expected? Especially that
> space_cache=v1 is still the default mount option?

Yes, that does not surprise me.

https://events.static.linuxfound.org/sites/events/files/slides/vault2016_0.pdf

Free space cache v1 is the default because of issues with btrfs-progs,
not because it's unwise to use the kernel code. I can totally recommend
using it. The linked presentation above gives some good background
information.

Another interesting thing is finding out what kind of data btrfs is
writing when it's pushing that many MB/s to disk. Doing this is not
trivial.

I've been spending quite some time researching these kinds of issues.

Here's what I found out:
https://www.spinics.net/lists/linux-btrfs/msg70624.html (oh wow, that's
almost a year ago already)

There are a bunch of tracepoints in the kernel code that could help
debug all of this further, but I've not yet gotten around to writing
something convenient that uses them to show live what's happening.

I'm still using the way of combining extent allocators in btrfs
described under "Thanks to a bug, solved in [2]" in the above mailing
list post, to keep things workable on the larger filesystem.
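
A minimal sketch of poking at those tracepoints by hand (requires root and tracefs; the exact event names should be verified against your kernel's events/btrfs/ directory):

```shell
cd /sys/kernel/debug/tracing
# List the btrfs tracepoints this kernel provides:
ls events/btrfs/
# Enable one and watch events live:
echo 1 > events/btrfs/btrfs_transaction_commit/enable
cat trace_pipe
```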

-- 
Hans van Kranenburg


* Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
  2018-09-19 17:58 ` Hans van Kranenburg
@ 2018-09-19 20:04   ` Martin Steigerwald
  2018-09-19 20:11     ` Hans van Kranenburg
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Steigerwald @ 2018-09-19 20:04 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Tomasz Chmielewski, Btrfs BTRFS

Hans van Kranenburg - 19.09.18, 19:58:
> However, as soon as we remount the filesystem with space_cache=v2 -
> 
> > writes drop to just around 3-10 MB/s to each disk. If we remount to
> > space_cache - lots of writes, system unresponsive. Again remount to
> > space_cache=v2 - low writes, system responsive.
> > 
> > That's a huuge, 10x overhead! Is it expected? Especially that
> > space_cache=v1 is still the default mount option?
> 
> Yes, that does not surprise me.
> 
> https://events.static.linuxfound.org/sites/events/files/slides/vault20
> 16_0.pdf
> 
> Free space cache v1 is the default because of issues with btrfs-progs,
> not because it's unwise to use the kernel code. I can totally
> recommend using it. The linked presentation above gives some good
> background information.

What issues in btrfs-progs are that?

I am wondering whether to switch to freespace tree v2. Would it provide 
benefit for a regular / and /home filesystems as dual SSD BTRFS RAID-1 
on a laptop?

Thanks,
-- 
Martin


* Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
  2018-09-19 20:04   ` Martin Steigerwald
@ 2018-09-19 20:11     ` Hans van Kranenburg
  2018-09-19 20:30       ` Nikolay Borisov
  2018-09-20  0:55       ` Qu Wenruo
  0 siblings, 2 replies; 9+ messages in thread
From: Hans van Kranenburg @ 2018-09-19 20:11 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Tomasz Chmielewski, Btrfs BTRFS

On 09/19/2018 10:04 PM, Martin Steigerwald wrote:
> Hans van Kranenburg - 19.09.18, 19:58:
>> However, as soon as we remount the filesystem with space_cache=v2 -
>>
>>> writes drop to just around 3-10 MB/s to each disk. If we remount to
>>> space_cache - lots of writes, system unresponsive. Again remount to
>>> space_cache=v2 - low writes, system responsive.
>>>
>>> That's a huuge, 10x overhead! Is it expected? Especially that
>>> space_cache=v1 is still the default mount option?
>>
>> Yes, that does not surprise me.
>>
>> https://events.static.linuxfound.org/sites/events/files/slides/vault20
>> 16_0.pdf
>>
>> Free space cache v1 is the default because of issues with btrfs-progs,
>> not because it's unwise to use the kernel code. I can totally
>> recommend using it. The linked presentation above gives some good
>> background information.
> 
> What issues in btrfs-progs are that?

Missing code to make offline changes to a filesystem that has a free
space tree. So when using btrfstune / repair / whatever you first need
to remove the whole free space tree with a command, and then add it back
on the next mount.
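
The dance described above looks roughly like this (a sketch; check your btrfs-progs version's documentation, since option spellings have changed over time, and /dev/sdX is a placeholder):

```shell
# Remove the free space tree so offline tools can modify the filesystem:
btrfs check --clear-space-cache v2 /dev/sdX
# ... run btrfstune / btrfs check --repair / etc. here ...
# The tree is recreated on the next mount with the option set:
mount -o space_cache=v2 /dev/sdX /mnt
```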

For me personally that's not a problem (I don't have to make offline
changes), but I understand that having that situation out of the box for
every new user would be a bit awkward.

> I am wondering whether to switch to freespace tree v2. Would it provide 
> benefit for a regular / and /home filesystems as dual SSD BTRFS RAID-1 
> on a laptop?

As shown in the linked presentation, it provides benefit on a largeish
filesystem and if your writes are touching a lot of different block
groups (since v1 writes out the full space cache for all of them on
every transaction commit). I'd say, it provides benefit as soon as you
encounter filesystem delays because of it, and as soon as you see using
it eases the pain a lot. So, yes, that's your case.

-- 
Hans van Kranenburg


* Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
  2018-09-19 20:11     ` Hans van Kranenburg
@ 2018-09-19 20:30       ` Nikolay Borisov
  2018-09-20  0:55       ` Qu Wenruo
  1 sibling, 0 replies; 9+ messages in thread
From: Nikolay Borisov @ 2018-09-19 20:30 UTC (permalink / raw)
  To: Hans van Kranenburg, Martin Steigerwald; +Cc: Tomasz Chmielewski, Btrfs BTRFS



On 19.09.2018 23:11, Hans van Kranenburg wrote:
> On 09/19/2018 10:04 PM, Martin Steigerwald wrote:
>> Hans van Kranenburg - 19.09.18, 19:58:
>>> However, as soon as we remount the filesystem with space_cache=v2 -
>>>
>>>> writes drop to just around 3-10 MB/s to each disk. If we remount to
>>>> space_cache - lots of writes, system unresponsive. Again remount to
>>>> space_cache=v2 - low writes, system responsive.
>>>>
>>>> That's a huuge, 10x overhead! Is it expected? Especially that
>>>> space_cache=v1 is still the default mount option?
>>>
>>> Yes, that does not surprise me.
>>>
>>> https://events.static.linuxfound.org/sites/events/files/slides/vault20
>>> 16_0.pdf
>>>
>>> Free space cache v1 is the default because of issues with btrfs-progs,
>>> not because it's unwise to use the kernel code. I can totally
>>> recommend using it. The linked presentation above gives some good
>>> background information.
>>
>> What issues in btrfs-progs are that?
> 
> Missing code to make offline changes to a filesystem that has a free
> space tree. So when using btrfstune / repair / whatever you first need
> to remove the whole free space tree with a command, and then add it back
> on the next mount.

And as a matter of fact, this code has already been published on the
mailing list for review, and some parts have even been merged, so we
are in good shape to get it into progs and eventually switch the
kernel default to the v2 cache.
> 
> For me personally that's not a problem (I don't have to make offline
> changes), but I understand that having that situation out of the box for
> every new user would be a bit awkward.
> 
>> I am wondering whether to switch to freespace tree v2. Would it provide 
>> benefit for a regular / and /home filesystems as dual SSD BTRFS RAID-1 
>> on a laptop?
> 
> As shown in the linked presentation, it provides benefit on a largeish
> filesystem and if your writes are touching a lot of different block
> groups (since v1 writes out the full space cache for all of them on
> every transaction commit). I'd say, it provides benefit as soon as you
> encounter filesystem delays because of it, and as soon as you see using
> it eases the pain a lot. So, yes, that's your case.
> 


* Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
  2018-09-19 20:11     ` Hans van Kranenburg
  2018-09-19 20:30       ` Nikolay Borisov
@ 2018-09-20  0:55       ` Qu Wenruo
  1 sibling, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2018-09-20  0:55 UTC (permalink / raw)
  To: Hans van Kranenburg, Martin Steigerwald; +Cc: Tomasz Chmielewski, Btrfs BTRFS





On 2018/9/20 4:11 AM, Hans van Kranenburg wrote:
> On 09/19/2018 10:04 PM, Martin Steigerwald wrote:
>> Hans van Kranenburg - 19.09.18, 19:58:
>>> However, as soon as we remount the filesystem with space_cache=v2 -
>>>
>>>> writes drop to just around 3-10 MB/s to each disk. If we remount to
>>>> space_cache - lots of writes, system unresponsive. Again remount to
>>>> space_cache=v2 - low writes, system responsive.
>>>>
>>>> That's a huuge, 10x overhead! Is it expected? Especially that
>>>> space_cache=v1 is still the default mount option?
>>>
>>> Yes, that does not surprise me.
>>>
>>> https://events.static.linuxfound.org/sites/events/files/slides/vault20
>>> 16_0.pdf
>>>
>>> Free space cache v1 is the default because of issues with btrfs-progs,
>>> not because it's unwise to use the kernel code. I can totally
>>> recommend using it. The linked presentation above gives some good
>>> background information.
>>
>> What issues in btrfs-progs are that?
> 
> Missing code to make offline changes to a filesystem that has a free
> space tree. So when using btrfstune / repair / whatever you first need
> to remove the whole free space tree with a command, and then add it back
> on the next mount.
> 
> For me personally that's not a problem (I don't have to make offline
> changes), but I understand that having that situation out of the box for
> every new user would be a bit awkward.
> 
>> I am wondering whether to switch to freespace tree v2. Would it provide 
>> benefit for a regular / and /home filesystems as dual SSD BTRFS RAID-1 
>> on a laptop?
> 
> As shown in the linked presentation, it provides benefit on a largeish
> filesystem and if your writes are touching a lot of different block
> groups (since v1 writes out the full space cache for all of them on
> every transaction commit).

In fact, that's the problem.

From the free space cache inode flags, it's
NODATASUM|NODATACOW|NOCOMPRESS|PREALLOC.

But in fact, if it's modified, the whole file just gets CoWed.

If we could change it to follow the inode flags, we could reduce the
overhead to even less than v2's.
(v1 needs at least (1 + n) * sectorsize (4K), one sector being the
header which contains the csum, while v2 needs metadata CoW, which is
at least nodesize (16K by default).)
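
Spelling out that size comparison with the numbers from the paragraph above (a sketch of the reasoning, using the default sizes mentioned):

```python
# Minimum bytes written per updated block group, per the reasoning above.
sectorsize = 4 * 1024   # 4K data sector size
nodesize = 16 * 1024    # 16K default metadata node size

def v1_nodatacow_cost(n_data_sectors=1):
    # A truly NODATACOW v1 cache: 1 header sector (csums) + n data sectors.
    return (1 + n_data_sectors) * sectorsize

def v2_min_cost():
    # v2 lives in a btree, so an update CoWs at least one metadata node.
    return nodesize

print(v1_nodatacow_cost())  # 8192
print(v2_min_cost())        # 16384
```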

Thanks,
Qu

> I'd say, it provides benefit as soon as you
> encounter filesystem delays because of it, and as soon as you see using
> it eases the pain a lot. So, yes, that's your case.
> 



* Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
  2018-09-19  8:43 very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2) Tomasz Chmielewski
                   ` (2 preceding siblings ...)
  2018-09-19 17:58 ` Hans van Kranenburg
@ 2018-09-20  7:46 ` Duncan
  3 siblings, 0 replies; 9+ messages in thread
From: Duncan @ 2018-09-20  7:46 UTC (permalink / raw)
  To: linux-btrfs

Tomasz Chmielewski posted on Wed, 19 Sep 2018 10:43:18 +0200 as excerpted:

> I have a mysql slave which writes to a RAID-1 btrfs filesystem (with
> 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full.
> 
> The slave receives around 0.5-1 MB/s of data from the master over the
> network, which is then saved to MySQL's relay log and executed. In ideal
> conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s
> of data written to disk.
> 
> MySQL directory and files in it are chattr +C (since the directory was
> created, so all files are really +C); there are no snapshots.
> 
> 
> Now, an interesting thing.
> 
> When the filesystem is mounted with these options in fstab:
> 
> defaults,noatime,discard
> 
> 
> We can see a *constant* write of 25-100 MB/s to each disk. The system is
> generally unresponsive and it sometimes takes long seconds for a simple
> command executed in bash to return.
> 
> 
> However, as soon as we remount the filesystem with space_cache=v2 -
> writes drop to just around 3-10 MB/s to each disk. If we remount to
> space_cache - lots of writes, system unresponsive. Again remount to
> space_cache=v2 - low writes, system responsive.
> 
> 
> That's a huuge, 10x overhead! Is it expected? Especially that
> space_cache=v1 is still the default mount option?

The other replies are good but I've not seen this pointed out yet...

Perhaps you are accounting for this already, but you don't /say/ you
are, while you do mention repeatedly toggling the space-cache options,
which would trigger it, so you /need/ to account for it...

I'm not sure about space_cache=v2 (it's probably more efficient, even
if it does have to do the same), but I'm quite sure that space_cache=v1
takes some time after the initial mount to scan the filesystem and
actually create the map of available free space that is the space_cache.

Now, you said SSDs, which should be reasonably fast, but you also say a
3-device btrfs raid1, with each device ~2 TB and the filesystem ~40%
full, which should be ~2 TB of data. That data is likely somewhat
fragmented, so it's likely rather more than 2 TB of data chunks to scan
for free space, and that's going to take /some/ time even on SSDs!

So if you're toggling settings like that in your tests, be sure to let 
the filesystem rebuild its cache that you just toggled and give it time 
to complete that and quiesce, before you start trying to measure write 
amplification.

Otherwise it's not write-amplification you're measuring, but the churn 
from the filesystem still trying to reset its cache after you toggled it!


Also, 4.17 is well after the ssd mount option fixes that went in in
4.14 (the option is usually auto-detected; check /proc/mounts, the
mount output, or dmesg to see whether the ssd mount option is being
added). But if the filesystem has been in use for several kernel
cycles, in particular from before 4.14, with the ssd mount option
active, and you've not rebalanced since then, you may well still have
serious free space fragmentation from that. That could increase the
amount of data in the space_cache map rather drastically, thus
increasing the time it takes to update the space_cache, particularly
v1, after toggling it on.
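
Checking whether the ssd option is active is quick (sda is an example device name):

```shell
# Is the filesystem mounted with the 'ssd' option?
grep btrfs /proc/mounts
# The auto-detection is based on this flag (0 = non-rotational, i.e. SSD):
cat /sys/block/sda/queue/rotational
```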

A balance can help correct that, but it might well be easier and should 
result in a better layout to simply blow the filesystem away with a 
mkfs.btrfs and start over.


Meanwhile, as Remi already mentioned, you might want to reconsider
nocow on btrfs raid1: nocow defeats checksumming, so scrub, which
verifies checksums, simply skips those files, and if the two copies
get out of sync for some reason...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Thread overview: 9+ messages
2018-09-19  8:43 very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2) Tomasz Chmielewski
2018-09-19  9:33 ` Qu Wenruo
2018-09-19 12:00 ` Remi Gauvin
2018-09-19 17:58 ` Hans van Kranenburg
2018-09-19 20:04   ` Martin Steigerwald
2018-09-19 20:11     ` Hans van Kranenburg
2018-09-19 20:30       ` Nikolay Borisov
2018-09-20  0:55       ` Qu Wenruo
2018-09-20  7:46 ` Duncan
