On 2018/9/19 4:43 PM, Tomasz Chmielewski wrote:
> I have a mysql slave which writes to a RAID-1 btrfs filesystem (with
> 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full.

This sounds a little concerning. Not because of the usage percentage
itself, but because of the filesystem size and how much free space
cache may need to be updated in each transaction. Details follow below.

> The slave receives around 0.5-1 MB/s of data from the master over the
> network, which is then saved to MySQL's relay log and executed. In ideal
> conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s
> of data written to disk.
>
> MySQL directory and files in it are chattr +C (since the directory was
> created, so all files are really +C); there are no snapshots.

I'm not familiar with the space cache nor the MySQL workload, but at
least we don't need to worry about extra data CoW here. (A quick sketch
for double-checking the NOCOW flag is appended at the end of this mail.)

> Now, an interesting thing.
>
> When the filesystem is mounted with these options in fstab:
>
> defaults,noatime,discard
>
> We can see a *constant* write of 25-100 MB/s to each disk. The system is
> generally unresponsive and it sometimes takes long seconds for a simple
> command executed in bash to return.

The main concern here is how many metadata block groups are involved in
one transaction.

From my observation, although free space cache files (v1 space cache)
are marked NODATACOW, they in fact get updated in a CoW manner. This
means that if, say, 100 metadata block groups get updated in one
transaction, we need to write around 12M of data just for the space
cache (see the back-of-envelope calculation at the end of this mail).

On the other hand, if we fix the v1 space cache to really do NODATACOW,
it should hugely reduce the IO for the free space cache.

> However, as soon as we remount the filesystem with space_cache=v2 -
> writes drop to just around 3-10 MB/s to each disk. If we remount to
> space_cache - lots of writes, system unresponsive. Again remount to
> space_cache=v2 - low writes, system responsive.

Have you tried nospace_cache? I think it should behave a little worse
than the v2 space cache, but much better than the *broken* v1 space
cache.

And the v2 space cache is already based on a btrfs btree, which gets
CoWed like all the other btrfs btrees, so there is no need to rewrite
the whole cache for each touched metadata block group. (Although in
theory the overhead should still be larger than a *working* v1 cache.)

Thanks,
Qu

> That's a huuge, 10x overhead! Is it expected? Especially that
> space_cache=v1 is still the default mount option?
>
> Tomasz Chmielewski
> https://lxadm.com
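
P.S. the NOCOW sketch mentioned above. This is only a quick, untested
demo using the generic FS_IOC_GETFLAGS ioctl (the same attribute that
chattr +C / lsattr manipulate), to verify that a given file really
carries the NOCOW flag; nothing here is btrfs-specific:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* FS_IOC_GETFLAGS fetches the inode attribute flags;
	 * FS_NOCOW_FL is the bit set by `chattr +C`. */
	int flags = 0;
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
		perror("FS_IOC_GETFLAGS");
		close(fd);
		return 1;
	}

	printf("%s: NOCOW is %s\n", argv[1],
	       (flags & FS_NOCOW_FL) ? "set" : "NOT set");
	close(fd);
	return 0;
}

Remember that +C only takes effect for empty files (or is inherited
from the directory, as in your setup), so it's worth checking the files
themselves, not just the directory.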
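And the back-of-envelope calculation behind the ~12M figure. The
128KiB per-cache-file size is an assumption on my side (roughly what a
v1 cache file for one block group comes to), while the 100 block groups
are just the example number from above:

#include <stdio.h>

int main(void)
{
	/* Assumed typical on-disk size of one v1 space cache file.
	 * This is an estimate, not a measured value. */
	const double cache_file_kib = 128.0;

	/* Metadata block groups touched in one transaction,
	 * the example number used above. */
	const int touched_block_groups = 100;

	/* Since each touched block group currently gets its whole
	 * cache file rewritten (the CoW-like behavior described
	 * above), the per-commit cost is simply the product. */
	printf("~%.1f MiB of cache rewritten per commit\n",
	       cache_file_kib * touched_block_groups / 1024.0);
	return 0;
}

This prints "~12.5 MiB of cache rewritten per commit", consistent with
the ~12M estimate above.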