* Unexpected reflink/subvol snapshot behaviour
@ 2021-01-21 22:20 Dave Chinner
  2021-01-23  8:42 ` Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Dave Chinner @ 2021-01-21 22:20 UTC (permalink / raw)
  To: linux-btrfs

Hi btrfs-gurus,

I'm running a simple reflink/snapshot/COW scalability test at the
moment. It is just a loop that does "fio overwrite of 10,000 4kB
random direct IOs in a 4GB file; snapshot" and I want to check a
couple of things I'm seeing with btrfs. fio config file is appended
to the email.
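
Roughly, the driver is a loop like the following (a minimal sketch;
the snapshot/clone destinations and variable names here are
placeholders, not the exact script):

  for i in $(seq 0 999); do
      fio "$fio_config"    # 10,000 random 4kB direct IO overwrites
      # subvol snapshot variant:
      time btrfs subvolume snapshot "$DST" "$SNAPDIR/snap-$i"
      # or the file clone (reflink) variant:
      # time cp --reflink=always "$DST/testfile" "$DST/testfile.$i"
  done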

Firstly, what is the expected "space amplification" of such a
workload over 1000 iterations on btrfs? This will write 40GB of user
data, and I'm seeing btrfs consume ~220GB of space for the workload
regardless of whether I use subvol snapshot or file clones
(reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
wondering if this is expected or whether there's something else
going on. XFS amplification for 1000 iterations using reflink is
only 1.4x, so 5.5x seems somewhat excessive to me.

On a similar note, the IO bandwidth consumed by btrfs is way out of
proportion with the amount of user data being written. I'm seeing
multiple GBs being written by btrfs on every iteration - easily
exceeding 5GB of writes per cycle in the later iterations of the
test. Given that only 40MB of user data is being written per cycle,
there's a write amplification factor of well over 100x occurring
here. In comparison, XFS is writing roughly consistently at 80MB/s
to disk over the course of the entire workload, largely because of
journal traffic for the transactions run during COW and clone
operations.  Is such a huge amount of IO expected for btrfs in
this situation?
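
One simple way to measure the per-cycle write volume is to diff the
block device counters around an iteration; a rough sketch, with the
device name as a placeholder:

  # field 10 of /proc/diskstats is sectors written (512 bytes each)
  before=$(awk '$3 == "sdb" { print $10 }' /proc/diskstats)
  fio "$fio_config"; time btrfs subvolume snapshot "$DST" "$SNAPDIR/snap-$i"
  after=$(awk '$3 == "sdb" { print $10 }' /proc/diskstats)
  echo "cycle wrote $(( (after - before) * 512 / 1024 / 1024 )) MiB"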

As a side effect of that IO load, btrfs is driving the machine hard
into memory reclaim because of the page cache footprint of each
writeback cycle. btrfs is dirtying a large number of metadata pages
in the page cache (at least 50% of the ram in the machine is dirtied
on every snapshot/reflink cycle). Hence when the system needs memory
reclaim, it hits large amounts of memory it can't reclaim
immediately and things go bad very quickly.  This is causing
everything on the machine to stall while btrfs dumps the dirty
metadata pages to disk at over 1GB/s and 10,000 IOPS for several
seconds. Is this expected behaviour?
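
The page cache behaviour is easy to watch from a second terminal
while the test runs; a trivial sketch:

  # sample dirty and writeback page cache volume once a second
  while sleep 1; do
      grep -E '^(Dirty|Writeback):' /proc/meminfo
  done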

Next, subvol snapshot and clone time appears to scale with the
number of snapshots/clones already present. The initial clone/subvol
snapshot commands take a few milliseconds. At 50 snapshots it takes
1.5s. At 200 snapshots it takes 7.5s. At 500 it takes 15s and at
>850 it seems to level off at about 30s a snapshot. There are
outliers that take double this time (63s was the longest) and the
variation between iterations can be quite substantial. Is this
expected scalability?

On subvol snapshot execution, there appears to be a bug manifesting
occasionally, and it may be one of the reasons for things being so
variable. The visible manifestation is that every so often a subvol
snapshot takes 0.02s instead of the multiple seconds all the
snapshots around it are taking:

 $ grep -C 1 ": 0.0" results/btrfs-snap/2021-01-21-22\:08\:15-1000/snapshot_times | sed 's/://'
 snapshot 0 0.02
 snapshot 1 0.06
 snapshot 2 0.10
 --
 snapshot 25 0.77
 snapshot 26 0.02
 snapshot 27 0.85
 --
 snapshot 51 1.45
 snapshot 52 0.02
 snapshot 53 1.51
 --
 snapshot 78 2.35
 snapshot 79 0.03
 snapshot 80 2.31
 --
 snapshot 104 3.22
 snapshot 105 0.02
 snapshot 106 3.44
 --
 snapshot 130 4.25
 snapshot 131 0.02
 snapshot 132 4.53
 --
 snapshot 156 5.38
 snapshot 157 0.02
 snapshot 158 5.76
 --
 snapshot 183 6.17
 snapshot 184 0.02
 snapshot 185 6.94
 --
 snapshot 209 8.08
 snapshot 210 0.04
 snapshot 211 6.91
 --
 snapshot 235 8.77
 snapshot 236 0.02
 snapshot 237 9.80
 --
 snapshot 288 10.91
 snapshot 289 0.04
 snapshot 290 9.07
 --
 snapshot 314 11.81
 snapshot 315 0.04
 snapshot 316 11.74
 --
 snapshot 340 11.83
 snapshot 341 0.05
 snapshot 342 12.11
 --
 snapshot 367 11.95
 snapshot 368 0.06
 snapshot 369 11.83
 --
 snapshot 393 13.66
 snapshot 394 0.03
 snapshot 395 10.98
 --
 snapshot 419 14.04
 snapshot 420 0.04
 snapshot 421 12.62
 --
 snapshot 472 22.10
 snapshot 473 0.03
 snapshot 474 14.90
 --
 snapshot 498 14.48
 snapshot 499 0.03
 snapshot 500 17.46
 --
 snapshot 524 20.50
 snapshot 525 0.04
 snapshot 526 18.01
 --
 snapshot 577 55.81
 snapshot 578 0.08
 snapshot 579 34.02
 --
 snapshot 603 22.81
 snapshot 604 0.03
 snapshot 605 19.26
 --
 snapshot 682 30.88
 snapshot 683 0.02
 snapshot 684 14.83
 --
 snapshot 708 19.90
 snapshot 709 0.03
 snapshot 710 15.38
 --
 snapshot 761 25.63
 snapshot 762 0.05
 snapshot 763 15.58
 --
 snapshot 787 15.33
 snapshot 788 0.03
 snapshot 789 15.08
 --
 snapshot 866 23.77
 snapshot 867 0.04
 snapshot 868 27.40
 --
 snapshot 892 15.33
 snapshot 893 0.03
 snapshot 894 13.38
 --
 snapshot 945 15.32
 snapshot 946 0.05
 snapshot 947 15.52
 --
 snapshot 971 15.30
 snapshot 972 0.03
 snapshot 973 14.88

It seems .... unlikely that random snapshots of exactly the same
repeating workload have such a variance in execution time. And then
I noticed that they exactly correlate with the order-of-magnitude
fio performance drops that manifested occasionally:

$ for i in `grep ": 0.0" results/btrfs-snap/2021-01-21-22\:08\:15-1000/snapshot_times | sed 's/://' |cut -d " " -f 2`; do grep -C 1 " $i:" results/btrfs-snap/2021-01-21-22\:08\:15-1000/fio_times ; echo --- ; done
fio loop 0:   write: IOPS=43.7k, BW=171MiB/s (179MB/s)(39.1MiB/229msec); 0 zone resets
fio loop 1:   write: IOPS=30.1k, BW=118MiB/s (123MB/s)(39.1MiB/332msec); 0 zone resets
---
fio loop 0:   write: IOPS=43.7k, BW=171MiB/s (179MB/s)(39.1MiB/229msec); 0 zone resets
fio loop 1:   write: IOPS=30.1k, BW=118MiB/s (123MB/s)(39.1MiB/332msec); 0 zone resets
fio loop 2:   write: IOPS=33.7k, BW=132MiB/s (138MB/s)(39.1MiB/297msec); 0 zone resets
---
fio loop 25:   write: IOPS=15.7k, BW=61.3MiB/s (64.3MB/s)(39.1MiB/637msec); 0 zone resets
fio loop 26:   write: IOPS=5537, BW=21.6MiB/s (22.7MB/s)(39.1MiB/1806msec); 0 zone resets
fio loop 27:   write: IOPS=15.4k, BW=60.2MiB/s (63.1MB/s)(39.1MiB/649msec); 0 zone resets
---
fio loop 51:   write: IOPS=12.5k, BW=48.0MiB/s (51.3MB/s)(39.1MiB/798msec); 0 zone resets
fio loop 52:   write: IOPS=3480, BW=13.6MiB/s (14.3MB/s)(39.1MiB/2873msec); 0 zone resets
fio loop 53:   write: IOPS=9345, BW=36.5MiB/s (38.3MB/s)(39.1MiB/1070msec); 0 zone resets
---
fio loop 78:   write: IOPS=6887, BW=26.9MiB/s (28.2MB/s)(39.1MiB/1452msec); 0 zone resets
fio loop 79:   write: IOPS=1955, BW=7823KiB/s (8011kB/s)(39.1MiB/5113msec); 0 zone resets
fio loop 80:   write: IOPS=7751, BW=30.3MiB/s (31.8MB/s)(39.1MiB/1290msec); 0 zone resets
---
fio loop 104:   write: IOPS=8340, BW=32.6MiB/s (34.2MB/s)(39.1MiB/1199msec); 0 zone resets
fio loop 105:   write: IOPS=1546, BW=6184KiB/s (6333kB/s)(39.1MiB/6468msec); 0 zone resets
fio loop 106:   write: IOPS=7262, BW=28.4MiB/s (29.7MB/s)(39.1MiB/1377msec); 0 zone resets
---
fio loop 130:   write: IOPS=7788, BW=30.4MiB/s (31.9MB/s)(39.1MiB/1284msec); 0 zone resets
fio loop 131:   write: IOPS=1268, BW=5074KiB/s (5195kB/s)(39.1MiB/7884msec); 0 zone resets
fio loop 132:   write: IOPS=6468, BW=25.3MiB/s (26.5MB/s)(39.1MiB/1546msec); 0 zone resets
---
fio loop 156:   write: IOPS=7137, BW=27.9MiB/s (29.2MB/s)(39.1MiB/1401msec); 0 zone resets
fio loop 157:   write: IOPS=1487, BW=5949KiB/s (6092kB/s)(39.1MiB/6724msec); 0 zone resets
fio loop 158:   write: IOPS=8904, BW=34.8MiB/s (36.5MB/s)(39.1MiB/1123msec); 0 zone resets
---
fio loop 183:   write: IOPS=6002, BW=23.4MiB/s (24.6MB/s)(39.1MiB/1666msec); 0 zone resets
fio loop 184:   write: IOPS=936, BW=3746KiB/s (3836kB/s)(39.1MiB/10679msec); 0 zone resets
fio loop 185:   write: IOPS=7230, BW=28.2MiB/s (29.6MB/s)(39.1MiB/1383msec); 0 zone resets
---
fio loop 209:   write: IOPS=5521, BW=21.6MiB/s (22.6MB/s)(39.1MiB/1811msec); 0 zone resets
fio loop 210:   write: IOPS=775, BW=3101KiB/s (3175kB/s)(39.1MiB/12899msec); 0 zone resets
fio loop 211:   write: IOPS=6489, BW=25.3MiB/s (26.6MB/s)(39.1MiB/1541msec); 0 zone resets
---
fio loop 235:   write: IOPS=7230, BW=28.2MiB/s (29.6MB/s)(39.1MiB/1383msec); 0 zone resets
fio loop 236:   write: IOPS=758, BW=3035KiB/s (3108kB/s)(39.1MiB/13178msec); 0 zone resets
fio loop 237:   write: IOPS=8071, BW=31.5MiB/s (33.1MB/s)(39.1MiB/1239msec); 0 zone resets
---
fio loop 288:   write: IOPS=5552, BW=21.7MiB/s (22.7MB/s)(39.1MiB/1801msec); 0 zone resets
fio loop 289:   write: IOPS=652, BW=2612KiB/s (2675kB/s)(39.1MiB/15314msec); 0 zone resets
fio loop 290:   write: IOPS=6027, BW=23.5MiB/s (24.7MB/s)(39.1MiB/1659msec); 0 zone resets
---
fio loop 314:   write: IOPS=5186, BW=20.3MiB/s (21.2MB/s)(39.1MiB/1928msec); 0 zone resets
fio loop 315:   write: IOPS=669, BW=2680KiB/s (2744kB/s)(39.1MiB/14926msec); 0 zone resets
fio loop 316:   write: IOPS=7163, BW=27.0MiB/s (29.3MB/s)(39.1MiB/1396msec); 0 zone resets
---
fio loop 340:   write: IOPS=5170, BW=20.2MiB/s (21.2MB/s)(39.1MiB/1934msec); 0 zone resets
fio loop 341:   write: IOPS=697, BW=2791KiB/s (2858kB/s)(39.1MiB/14333msec); 0 zone resets
fio loop 342:   write: IOPS=6345, BW=24.8MiB/s (25.0MB/s)(39.1MiB/1576msec); 0 zone resets
---
fio loop 367:   write: IOPS=5509, BW=21.5MiB/s (22.6MB/s)(39.1MiB/1815msec); 0 zone resets
fio loop 368:   write: IOPS=607, BW=2429KiB/s (2488kB/s)(39.1MiB/16466msec); 0 zone resets
fio loop 369:   write: IOPS=6402, BW=25.0MiB/s (26.2MB/s)(39.1MiB/1562msec); 0 zone resets
---
fio loop 393:   write: IOPS=7331, BW=28.6MiB/s (30.0MB/s)(39.1MiB/1364msec); 0 zone resets
fio loop 394:   write: IOPS=637, BW=2550KiB/s (2612kB/s)(39.1MiB/15684msec); 0 zone resets
fio loop 395:   write: IOPS=7358, BW=28.7MiB/s (30.1MB/s)(39.1MiB/1359msec); 0 zone resets
---
fio loop 419:   write: IOPS=6480, BW=25.3MiB/s (26.5MB/s)(39.1MiB/1543msec); 0 zone resets
fio loop 420:   write: IOPS=620, BW=2484KiB/s (2543kB/s)(39.1MiB/16104msec); 0 zone resets
fio loop 421:   write: IOPS=7007, BW=27.4MiB/s (28.7MB/s)(39.1MiB/1427msec); 0 zone resets
---
fio loop 472:   write: IOPS=6313, BW=24.7MiB/s (25.9MB/s)(39.1MiB/1584msec); 0 zone resets
fio loop 473:   write: IOPS=455, BW=1822KiB/s (1866kB/s)(39.1MiB/21951msec); 0 zone resets
fio loop 474:   write: IOPS=6715, BW=26.2MiB/s (27.5MB/s)(39.1MiB/1489msec); 0 zone resets
---
fio loop 498:   write: IOPS=7662, BW=29.9MiB/s (31.4MB/s)(39.1MiB/1305msec); 0 zone resets
fio loop 499:   write: IOPS=470, BW=1882KiB/s (1928kB/s)(39.1MiB/21249msec); 0 zone resets
fio loop 500:   write: IOPS=4228, BW=16.5MiB/s (17.3MB/s)(39.1MiB/2365msec); 0 zone resets
---
fio loop 524:   write: IOPS=6697, BW=26.2MiB/s (27.4MB/s)(39.1MiB/1493msec); 0 zone resets
fio loop 525:   write: IOPS=454, BW=1818KiB/s (1861kB/s)(39.1MiB/22004msec); 0 zone resets
fio loop 526:   write: IOPS=7112, BW=27.8MiB/s (29.1MB/s)(39.1MiB/1406msec); 0 zone resets
---
fio loop 577:   write: IOPS=4222, BW=16.5MiB/s (17.3MB/s)(39.1MiB/2368msec); 0 zone resets
fio loop 578:   write: IOPS=150, BW=602KiB/s (617kB/s)(39.1MiB/66416msec); 0 zone resets
fio loop 579:   write: IOPS=6038, BW=23.6MiB/s (24.7MB/s)(39.1MiB/1656msec); 0 zone resets
---
fio loop 603:   write: IOPS=5991, BW=23.4MiB/s (24.5MB/s)(39.1MiB/1669msec); 0 zone resets
fio loop 604:   write: IOPS=441, BW=1764KiB/s (1806kB/s)(39.1MiB/22674msec); 0 zone resets
fio loop 605:   write: IOPS=6056, BW=23.7MiB/s (24.8MB/s)(39.1MiB/1651msec); 0 zone resets
---
fio loop 682:   write: IOPS=6226, BW=24.3MiB/s (25.5MB/s)(39.1MiB/1606msec); 0 zone resets
fio loop 683:   write: IOPS=322, BW=1290KiB/s (1321kB/s)(39.1MiB/31002msec); 0 zone resets
fio loop 684:   write: IOPS=5934, BW=23.2MiB/s (24.3MB/s)(39.1MiB/1685msec); 0 zone resets
---
fio loop 708:   write: IOPS=5614, BW=21.9MiB/s (22.0MB/s)(39.1MiB/1781msec); 0 zone resets
fio loop 709:   write: IOPS=473, BW=1894KiB/s (1939kB/s)(39.1MiB/21124msec); 0 zone resets
fio loop 710:   write: IOPS=6816, BW=26.6MiB/s (27.9MB/s)(39.1MiB/1467msec); 0 zone resets
---
fio loop 761:   write: IOPS=6301, BW=24.6MiB/s (25.8MB/s)(39.1MiB/1587msec); 0 zone resets
fio loop 762:   write: IOPS=448, BW=1796KiB/s (1839kB/s)(39.1MiB/22275msec); 0 zone resets
fio loop 763:   write: IOPS=7490, BW=29.3MiB/s (30.7MB/s)(39.1MiB/1335msec); 0 zone resets
---
fio loop 787:   write: IOPS=6729, BW=26.3MiB/s (27.6MB/s)(39.1MiB/1486msec); 0 zone resets
fio loop 788:   write: IOPS=579, BW=2318KiB/s (2374kB/s)(39.1MiB/17253msec); 0 zone resets
fio loop 789:   write: IOPS=5356, BW=20.9MiB/s (21.9MB/s)(39.1MiB/1867msec); 0 zone resets
---
fio loop 866:   write: IOPS=6720, BW=26.3MiB/s (27.5MB/s)(39.1MiB/1488msec); 0 zone resets
fio loop 867:   write: IOPS=314, BW=1258KiB/s (1288kB/s)(39.1MiB/31791msec); 0 zone resets
fio loop 868:   write: IOPS=5602, BW=21.9MiB/s (22.9MB/s)(39.1MiB/1785msec); 0 zone resets
---
fio loop 892:   write: IOPS=6915, BW=27.0MiB/s (28.3MB/s)(39.1MiB/1446msec); 0 zone resets
fio loop 893:   write: IOPS=598, BW=2395KiB/s (2452kB/s)(39.1MiB/16704msec); 0 zone resets
fio loop 894:   write: IOPS=6544, BW=25.6MiB/s (26.8MB/s)(39.1MiB/1528msec); 0 zone resets
---
fio loop 945:   write: IOPS=6176, BW=24.1MiB/s (25.3MB/s)(39.1MiB/1619msec); 0 zone resets
fio loop 946:   write: IOPS=570, BW=2281KiB/s (2336kB/s)(39.1MiB/17536msec); 0 zone resets
fio loop 947:   write: IOPS=6631, BW=25.9MiB/s (27.2MB/s)(39.1MiB/1508msec); 0 zone resets
---
fio loop 971:   write: IOPS=8539, BW=33.4MiB/s (34.0MB/s)(39.1MiB/1171msec); 0 zone resets
fio loop 972:   write: IOPS=579, BW=2317KiB/s (2372kB/s)(39.1MiB/17265msec); 0 zone resets
fio loop 973:   write: IOPS=6265, BW=24.5MiB/s (25.7MB/s)(39.1MiB/1596msec); 0 zone resets
---


In these instances, fio takes about as long as I would expect the
snapshot to have taken to run. Regardless of the cause, something
looks to be broken here...

An astute reader might also notice that fio performance really drops
away quickly as the number of snapshots goes up. Loop 0 is the "no
snapshots" performance. By 10 snapshots, performance is half the
no-snapshot rate. By 50 snapshots, performance is a quarter of the
no-snapshot performance. It levels out around 6-7000 IOPS, which is
about 15% of the non-snapshot performance. Is this expected
performance degradation as snapshot count increases?

And before you ask, reflink copies of the fio file rather than
subvol snapshots have largely the same performance, IO and
behavioural characteristics. The only difference is that clone
copying also has a cyclic FIO performance dip (every 3-4 cycles)
that corresponds with the system driving hard into memory reclaim
during periodic writeback from btrfs.

FYI, I've compared btrfs reflink to XFS reflink, too, and XFS fio
performance stays largely consistent across all 1000 iterations at
around 13-14k +/-2k IOPS. The reflink time also scales linearly with
the number of extents in the source file and levels off at about
10-11s per cycle as the extent count in the source file levels off
at ~850,000 extents. XFS completes the 1000 iterations of
write/clone in about 4 hours, btrfs completes the same part of the
workload in about 9 hours.

Oh, I almost forgot - FIEMAP performance. After the reflink test, I
map all the extents in all the cloned files to a) count the extents
and b) confirm that the difference between clones is correct (~10000
extents not shared with the previous iteration). Pulling the extent
maps out of XFS takes about 3s a clone (~850,000 extents), or 30
minutes for the whole set when run serialised. btrfs takes 90-100s
per clone - after 8 hours it had only managed to map 380 files and
was running at 6-7000 read IOPS the entire time. IOWs, it was taking
_half a million_ read IOs to map the extents of a single clone that
only had a million extents in it. Is it expected that FIEMAP is so
slow and IO intensive on cloned files?
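
The mapping pass is essentially just FIEMAP over every clone; a
sketch of it, with placeholder paths (xfs_io issues FIEMAP on any
filesystem that supports it):

  for f in "$DST"/testfile.*; do
      # one output line per extent, minus the filename header
      echo "$f: $(xfs_io -r -c fiemap "$f" | tail -n +2 | wc -l) extents"
  done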

As there are no performance anomalies or memory reclaim issues with
XFS running this workload, I suspect the issues I note above are
btrfs issues, not expected behaviour.  I'm not sure what the
expected scalability of btrfs file clones and snapshots is though,
so I'm interested to hear if these results are expected or not.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

JOBS=4
IODEPTH=4
IOCOUNT=$((10000 / $JOBS))
FILESIZE=4g

cat >$fio_config <<EOF
[global]
name=${DST}.name
directory=${DST}
size=${FILESIZE}
randrepeat=0
bs=4k
ioengine=libaio
iodepth=${IODEPTH}
iodepth_low=2
direct=1
end_fsync=1
fallocate=none
overwrite=1
number_ios=${IOCOUNT}
runtime=30s
group_reporting=1
disable_lat=1
lat_percentiles=0
clat_percentiles=0
slat_percentiles=0
disk_util=0

[j1]
filename=testfile
rw=randwrite

[j2]
filename=testfile
rw=randwrite

[j3]
filename=testfile
rw=randwrite

[j4]
filename=testfile
rw=randwrite
EOF



* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-21 22:20 Unexpected reflink/subvol snapshot behaviour Dave Chinner
@ 2021-01-23  8:42 ` Qu Wenruo
  2021-01-23  8:51   ` Qu Wenruo
                     ` (3 more replies)
  2021-01-24  0:19 ` Zygo Blaxell
  2021-02-02  2:14 ` Darrick J. Wong
  2 siblings, 4 replies; 16+ messages in thread
From: Qu Wenruo @ 2021-01-23  8:42 UTC (permalink / raw)
  To: Dave Chinner, linux-btrfs



On 2021/1/22 6:20 AM, Dave Chinner wrote:
> Hi btrfs-gurus,
>
> I'm running a simple reflink/snapshot/COW scalability test at the
> moment. It is just a loop that does "fio overwrite of 10,000 4kB
> random direct IOs in a 4GB file; snapshot" and I want to check a
> couple of things I'm seeing with btrfs. fio config file is appended
> to the email.
>
> Firstly, what is the expected "space amplification" of such a
> workload over 1000 iterations on btrfs? This will write 40GB of user
> data, and I'm seeing btrfs consume ~220GB of space for the workload
> regardless of whether I use subvol snapshot or file clones
> (reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
> wondering if this is expected or whether there's something else
> going on. XFS amplification for 1000 iterations using reflink is
> only 1.4x, so 5.5x seems somewhat excessive to me.

This is mostly due to the way btrfs handles COW and the lazy extent
freeing behavior.

For btrfs, an extent only gets freed when there is no reference left
on any part of it.

This means that if a file has one 128K file extent written to disk
and we then overwrite 4K of it, that 4K is COWed to a new 4K extent
while the whole 128K extent is kept as is; the no-longer-referenced
4K range inside it stays allocated, costing an extra 4K of space.

This not only increases the space usage, but also the metadata usage.
It does, however, reduce the complexity of the extent tree and of
snapshot creation.


In the worst case, btrfs can allocate a 128MiB file extent and then
see 127MiB of it overwritten (and thus COWed elsewhere). That takes
127MiB + 128MiB of space until the last 1MiB of the original extent
stops being referenced, at which point the full 128MiB can finally be
freed.


Thus the above reflink/snapshot + DIO write workload is going to be
very unfriendly to a filesystem with lazy extent freeing and data COW
by default.

That's also why btrfs has a worse fragmentation problem.
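
A minimal way to see this on a scratch btrfs (paths are placeholders
and the exact layout output will vary):

  # write one 128K data extent, then COW-overwrite 4K in the middle
  xfs_io -f -c "pwrite 0 128k" -c fsync /mnt/scratch/foo
  xfs_io -c "pwrite 64k 4k" -c fsync /mnt/scratch/foo
  filefrag -v /mnt/scratch/foo
  # the file now references [0,64k) and [68k,128k) of the original 128K
  # extent plus one new 4K extent, but 128K + 4K stays allocated on disk
  # because part of the original extent is still referenced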

>
> On a similar note, the IO bandwidth consumed by btrfs is way out of
> proportion with the amount of user data being written. I'm seeing
> multiple GBs being written by btrfs on every iteration - easily
> exceeding 5GB of writes per cycle in the later iterations of the
> test. Given that only 40MB of user data is being written per cycle,
> there's a write amplification factor of well over 100x ocurring
> here. In comparison, XFS is writing roughly consistently at 80MB/s
> to disk over the course of the entire workload, largely because of
> journal traffic for the transactions run during COW and clone
> operations.  Is such a huge amount of of IO expected for btrfs in
> this situation?

That's interesting. Do you have any analysis of the types of bios
submitted to the device?

My educated guess is that metadata takes most of the space, and due
to the default DUP metadata profile it gets doubled to 5G?
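
The data vs. metadata split (and the DUP profile) can be checked
directly, e.g. (mount point is a placeholder):

  btrfs filesystem df /mnt/scratch      # per-type usage and profiles
  btrfs filesystem usage /mnt/scratch   # detailed breakdown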

>
> As a side effect of that IO load, btrfs is driving the machine hard
> into memory reclaim because the page cache footprint of each
> writeback cycle. btrfs is dirtying a large number of metadata pages
> in the page cache (at least 50% of the ram in the machine is dirtied
> on every snapshot/reflink cycle). Hence when the system needs memory
> reclaim, it hits large amounts of memory it can't reclaim
> immediately and things go bad very quickly.  This is causing
> everything on the machine to stall while btrfs dumps the dirty
> metadata pages to disk at over 1GB/s and 10,000 IOPS for several
> seconds. Is this expected behaviour?

This may be caused by the above-mentioned lazy extent freeing
(bookend extent) behavior.

Especially with 4K dio, each 4K write creates a new extent, greatly
increasing metadata usage.

The 10,000 4KiB DIO writes inside a 4GiB file easily lead to 10,000
new extents in just one iteration.
And after several iterations, the 4GiB file will be so heavily
fragmented that every extent is just 4K in size (2^20 extents, which
take around 100MiB of metadata for one subvol alone).
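
Rough numbers behind that estimate, assuming on the order of 100
bytes of metadata per 4K extent (file extent item plus its share of
extent tree records - the per-extent cost is an approximation):

  echo "$(( 4 * 1024 * 1024 * 1024 / 4096 )) extents"        # 1048576 = 2^20
  echo "$(( 2 ** 20 * 100 / 1024 / 1024 )) MiB of metadata"  # ~100 MiB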

And since you're also taking snapshots, each new extent always keeps
a reference in each subvol, so it can never be freed, and that causes
tons of slowdown purely because of the amount of metadata.

>
> Next, subvol snapshot and clone time appears to be scale with the
> number of snapshots/clones already present. The initial clone/subvol
> snapshot command take a few milliseconds. At 50 snapshots it take
> 1.5s. At 200 snapshots it takes 7.5s. At 500 it takes 15s and at
>> 850 it seems to level off at about 30s a snapshot. There are
> outliers that take double this time (63s was the longest) and the
> variation between iterations can be quite substantial. Is this
> expected scalablity?

Creating a snapshot forces the current subvolume to be fully
committed before the snapshot is really taken.

Considering the metadata overhead above, I believe most of the
performance penalty comes from the metadata writeback, not from the
snapshot creation itself.

If you just create a big subvolume, sync the fs, and then take as
many snapshots as you wish, the overhead should be pretty much the
same as snapshotting an empty subvolume.
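
Something like the following should demonstrate that (a sketch;
subvolume and snapshot paths are placeholders):

  # pay the writeback/commit cost once for the existing data...
  sync
  # ...then repeated snapshots should cost about the same as
  # snapshotting an empty subvolume
  for i in $(seq 1 10); do
      time btrfs subvolume snapshot /mnt/scratch/subvol /mnt/scratch/snap-$i
  done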

>
> On subvol snapshot execution, there appears to be a bug manifesting
> occasionally and may be one of the reasons for things being so
> variable. The visible manifestation is that every so often a subvol
> snapshot takes 0.02s instead of the multiple seconds all the
> snapshots around it are taking:

That 0.02s is the real overhead for snapshot creation.

The short creation time means those snapshots just piggy-back on a
transaction commit that is already happening, so they don't have to
wait for a full transaction commit of their own; they only need to do
the snapshot itself.


[...]

> In these instances, fio takes about as long as I would expect the
> snapshot to have taken to run. Regardless of the cause, something
> looks to be broken here...
>
> An astute reader might also notice that fio performance really drops
> away quickly as the number of snapshots goes up. Loop 0 is the "no
> snapshots" performance. By 10 snapshots, performance is half the
> no-snapshot rate. By 50 snapshots, performance is a quarter of the
> no-snapshot peroframnce. It levels out around 6-7000 IOPS, which is
> about 15% of the non-snapshot performance. Is this expected
> performance degradation as snapshot count increases?

No, this is mostly due to the exploding amount of metadata caused by
the near-worst-case workload.

Yeah, btrfs is pretty bad at handling small dio writes, which can
easily explode the metadata usage.

Thus for such a dio case, we recommend using a preallocated file +
nodatacow, so that no new extents are created (unless a snapshot is
involved).
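
A sketch of that setup (paths are placeholders; note that the +C
attribute has to be applied while the file is still empty):

  touch /mnt/scratch/testfile
  chattr +C /mnt/scratch/testfile        # nodatacow for this file
  fallocate -l 4G /mnt/scratch/testfile  # preallocate the blocks
  # overwrites now reuse the same blocks; but the first overwrite of a
  # block after a snapshot is still COWed once, as mentioned above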

>
> And before you ask, reflink copies of the fio file rather than
> subvol snapshots have largely the same performance, IO and
> behavioural characteristics. The only difference is that clone
> copying also has a cyclic FIO performance dip (every 3-4 cycles)
> that corresponds with the system driving hard into memory reclaim
> during periodic writeback from btrfs.
>
> FYI, I've compared btrfs reflink to XFS reflink, too, and XFS fio
> performance stays largely consistent across all 1000 iterations at
> around 13-14k +/-2k IOPS. The reflink time also scales linearly with
> the number of extents in the source file and levels off at about
> 10-11s per cycle as the extent count in the source file levels off
> at ~850,000 extents. XFS completes the 1000 iterations of
> write/clone in about 4 hours, btrfs completels the same part of the
> workload in about 9 hours.
>
> Oh, I almost forget - FIEMAP performance. After the reflink test, I
> map all the extents in all the cloned files to a) count the extents
> and b) confirm that the difference between clones is correct (~10000
> extents not shared with the previous iteration). Pulling the extent
> maps out of XFS takes about 3s a clone (~850,000 extents), or 30
> minutes for the whole set when run serialised. btrfs takes 90-100s
> per clone - after 8 hours it had only managed to map 380 files and
> was running at 6-7000 read IOPS the entire time. IOWs, it was taking
> _half a million_ read IOs to map the extents of a single clone that
> only had a million extents in it. Is it expected that FIEMAP is so
> slow and IO intensive on cloned files?

With that exploding number of fragments, it definitely needs a lot of
metadata reads, right?

>
> As there are no performance anomolies or memory reclaim issues with
> XFS running this workload, I suspect the issues I note above are
> btrfs issues, not expected behaviour.  I'm not sure what the
> expected scalability of btrfs file clones and snapshots are though,
> so I'm interested to hear if these results are expected or not.

I hate to say it, but yes, you have found the worst-case workload for
btrfs.

4K dio + snapshots is the best way to explode the already high btrfs
metadata usage and to exploit the lazy extent reclaim behavior.

If no snapshot is involved, at least the damage is bounded: a 4GiB
file can have at most 1M 4K file extents.
But with snapshots there is no upper limit any more.

Thanks,
Qu

>
> Cheers,
>
> Dave.
>


* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-23  8:42 ` Qu Wenruo
@ 2021-01-23  8:51   ` Qu Wenruo
  2021-01-23 10:39   ` Roman Mamedov
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2021-01-23  8:51 UTC (permalink / raw)
  To: Dave Chinner, linux-btrfs



On 2021/1/23 4:42 PM, Qu Wenruo wrote:
>
>
> On 2021/1/22 6:20 AM, Dave Chinner wrote:
>> Hi btrfs-gurus,
>>
>> I'm running a simple reflink/snapshot/COW scalability test at the
>> moment. It is just a loop that does "fio overwrite of 10,000 4kB
>> random direct IOs in a 4GB file; snapshot" and I want to check a
>> couple of things I'm seeing with btrfs. fio config file is appended
>> to the email.
>>
>> Firstly, what is the expected "space amplification" of such a
>> workload over 1000 iterations on btrfs? This will write 40GB of user
>> data, and I'm seeing btrfs consume ~220GB of space for the workload
>> regardless of whether I use subvol snapshot or file clones
>> (reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
>> wondering if this is expected or whether there's something else
>> going on. XFS amplification for 1000 iterations using reflink is
>> only 1.4x, so 5.5x seems somewhat excessive to me.
>
> This is mostly due to the way btrfs handles COW and the lazy extent
> freeing behavior.
>
> For btrfs, an extent only get freed when there is no reference on any
> part of it, and
>
> This means, if we have an file which has one 128K file extent written to
> disk, and then write 4K, which will be COWed to another 4K extent, the
> 128K extent is still kept as is, even the no longer referred 4K range is
> still kept there, with extra 4K space usage.
>
> This not only increase the space usage, but also increase metadata usage.
> But reduce the complexity on extent tree and snapshot creation.
>
>
> For the worst case, btrfs can allocate a 128 MiB file extent, and have
> good luck to write 127MiB into the extent. It will take 127MiB + 128MiB
> space, until the last 1MiB of the original extent get freed, the full
> 128MiB can be freed.
>
>
> Thus above reflink/snapshot + DIO write is going to be very unfriendly
> for fs with lazy extent freeing and default data COW behavior.
>
> That's also why btrfs has a worse fragmentation problem.
>
>>
>> On a similar note, the IO bandwidth consumed by btrfs is way out of
>> proportion with the amount of user data being written. I'm seeing
>> multiple GBs being written by btrfs on every iteration - easily
>> exceeding 5GB of writes per cycle in the later iterations of the
>> test. Given that only 40MB of user data is being written per cycle,
>> there's a write amplification factor of well over 100x ocurring
>> here. In comparison, XFS is writing roughly consistently at 80MB/s
>> to disk over the course of the entire workload, largely because of
>> journal traffic for the transactions run during COW and clone
>> operations.  Is such a huge amount of of IO expected for btrfs in
>> this situation?
>
> That's interesting. Any analyse on the type of bios submitted for the
> device?
>
> My educated guess is, metadata takes most of the space, and due to
> default DUP metadata profile, it get doubled to 5G?
>
>>
>> As a side effect of that IO load, btrfs is driving the machine hard
>> into memory reclaim because the page cache footprint of each
>> writeback cycle. btrfs is dirtying a large number of metadata pages
>> in the page cache (at least 50% of the ram in the machine is dirtied
>> on every snapshot/reflink cycle). Hence when the system needs memory
>> reclaim, it hits large amounts of memory it can't reclaim
>> immediately and things go bad very quickly.  This is causing
>> everything on the machine to stall while btrfs dumps the dirty
>> metadata pages to disk at over 1GB/s and 10,000 IOPS for several
>> seconds. Is this expected behaviour?
>
> This may be caused by above mentioned lazy extent freeing (bookend
> extent) behavior.
>
> Especially when 4K dio is submitted, each 4K write will cause an new
> extent, greatly increasing metadata usage.
>
> For the 10,000 4KiB DIO write inside a 4GiB file, it would easily lead
> to 10,000 extents just in one iteration.
> And with several iteration, the 4GiB file will be so heavily fragmented
> that all extents are just in 4K size. (2^20 extents, which will take
> 100MiB metadata just for one subvol).

To add more: due to the extra metadata usage, the workload increases
memory pressure, forcing btrfs to flush its metadata back to disk.

As long as a btrfs tree block has not yet been written to disk, btrfs
can skip the COW for that metadata and just modify it in place in
memory.

But once a tree block has been written back to disk, btrfs has to COW
it again on the next modification. This further increases the
metadata usage, creating a negative spiral of ever more metadata.

>
> And since you're also taking snapshot, this means each new extent in
> each subvol will always has its reference there, no way to be freed, and
> cause tons of slowdown just because the amount of metadata.
>
>>
>> Next, subvol snapshot and clone time appears to be scale with the
>> number of snapshots/clones already present. The initial clone/subvol
>> snapshot command take a few milliseconds. At 50 snapshots it take
>> 1.5s. At 200 snapshots it takes 7.5s. At 500 it takes 15s and at
>>> 850 it seems to level off at about 30s a snapshot. There are
>> outliers that take double this time (63s was the longest) and the
>> variation between iterations can be quite substantial. Is this
>> expected scalablity?
>
> The snapshot will make the current subvolume to be fully committed
> before really taking the snapshot.
>
> Considering above metadata overhead, I believe most of the performance
> penalty should come from the metadata writeback, not the snapshot
> creation itself.
>
> If you just create a big subvolume, sync the fs, and try to take as many
> snapshot as you wish, the overhead should be pretty the same as
> snapshotting an empty subvolume.
>
>>
>> On subvol snapshot execution, there appears to be a bug manifesting
>> occasionally and may be one of the reasons for things being so
>> variable. The visible manifestation is that every so often a subvol
>> snapshot takes 0.02s instead of the multiple seconds all the
>> snapshots around it are taking:
>
> That 0.02s the real overhead for snapshot creation.
>
> The short snapshot creation time means those snapshot creation just wait
> for the same transaction to be committed, thus they don't need to wait
> for the full transaction committment, just need to do the snapshot.
>
>
> [...]
>
>> In these instances, fio takes about as long as I would expect the
>> snapshot to have taken to run. Regardless of the cause, something
>> looks to be broken here...
>>
>> An astute reader might also notice that fio performance really drops
>> away quickly as the number of snapshots goes up. Loop 0 is the "no
>> snapshots" performance. By 10 snapshots, performance is half the
>> no-snapshot rate. By 50 snapshots, performance is a quarter of the
>> no-snapshot peroframnce. It levels out around 6-7000 IOPS, which is
>> about 15% of the non-snapshot performance. Is this expected
>> performance degradation as snapshot count increases?
>
> No, this is mostly due to the exploding amount of metadata caused by the
> near-worst case workload.
>
> Yeah, btrfs is pretty bad at handling small dio writes, which can easily
> explode the metadata usage.
>
> Thus for such dio case, we recommend to use preallocated file +
> nodatacow, so that we won't create new extents (unless snapshot is
> involved).
>
>>
>> And before you ask, reflink copies of the fio file rather than
>> subvol snapshots have largely the same performance, IO and
>> behavioural characteristics. The only difference is that clone
>> copying also has a cyclic FIO performance dip (every 3-4 cycles)
>> that corresponds with the system driving hard into memory reclaim
>> during periodic writeback from btrfs.
>>
>> FYI, I've compared btrfs reflink to XFS reflink, too, and XFS fio
>> performance stays largely consistent across all 1000 iterations at
>> around 13-14k +/-2k IOPS. The reflink time also scales linearly with
>> the number of extents in the source file and levels off at about
>> 10-11s per cycle as the extent count in the source file levels off
>> at ~850,000 extents. XFS completes the 1000 iterations of
>> write/clone in about 4 hours, btrfs completels the same part of the
>> workload in about 9 hours.
>>
>> Oh, I almost forget - FIEMAP performance. After the reflink test, I
>> map all the extents in all the cloned files to a) count the extents
>> and b) confirm that the difference between clones is correct (~10000
>> extents not shared with the previous iteration). Pulling the extent
>> maps out of XFS takes about 3s a clone (~850,000 extents), or 30
>> minutes for the whole set when run serialised. btrfs takes 90-100s
>> per clone - after 8 hours it had only managed to map 380 files and
>> was running at 6-7000 read IOPS the entire time. IOWs, it was taking
>> _half a million_ read IOs to map the extents of a single clone that
>> only had a million extents in it. Is it expected that FIEMAP is so
>> slow and IO intensive on cloned files?
>
> Exploding fragments, definitely needs a lot of metadata read, right?
>
>>
>> As there are no performance anomolies or memory reclaim issues with
>> XFS running this workload, I suspect the issues I note above are
>> btrfs issues, not expected behaviour.  I'm not sure what the
>> expected scalability of btrfs file clones and snapshots are though,
>> so I'm interested to hear if these results are expected or not.
>
> I hate to say that, yes, you find the worst scenario workload for btrfs.
>
> 4K dio + snapshot is the best way to explode the already high btrfs
> metadata usage, and exploit the lazy extent reclaim behavior.
>
> But if no snapshot is involved, at least you can limit the damage, a
> 4GiB file can only be at most 1M 4K file extents.
> But with snapshots, there is no upper limit now.
>
> Thanks,
> Qu
>
>>
>> Cheers,
>>
>> Dave.
>>


* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-23  8:42 ` Qu Wenruo
  2021-01-23  8:51   ` Qu Wenruo
@ 2021-01-23 10:39   ` Roman Mamedov
  2021-01-23 10:58     ` Qu Wenruo
  2021-01-24 13:08   ` Filipe Manana
  2021-01-24 22:36   ` Dave Chinner
  3 siblings, 1 reply; 16+ messages in thread
From: Roman Mamedov @ 2021-01-23 10:39 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Dave Chinner, linux-btrfs

On Sat, 23 Jan 2021 16:42:33 +0800
Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:

> For the worst case, btrfs can allocate a 128 MiB file extent, and have
> good luck to write 127MiB into the extent. It will take 127MiB + 128MiB
> space, until the last 1MiB of the original extent get freed, the full
> 128MiB can be freed.

Does that mean enabling compression actually mitigates this issue as
a side effect, since each extent will then be limited to only 128K?

What are the typical extent sizes without compression?

-- 
With respect,
Roman


* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-23 10:39   ` Roman Mamedov
@ 2021-01-23 10:58     ` Qu Wenruo
  0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2021-01-23 10:58 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Dave Chinner, linux-btrfs



On 2021/1/23 6:39 PM, Roman Mamedov wrote:
> On Sat, 23 Jan 2021 16:42:33 +0800
> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>> For the worst case, btrfs can allocate a 128 MiB file extent, and have
>> good luck to write 127MiB into the extent. It will take 127MiB + 128MiB
>> space, until the last 1MiB of the original extent get freed, the full
>> 128MiB can be freed.
>
> Does it mean enabling compression actually mitigates this issue as a
> side-effect? Since each extent will be limited to only 128K.

Yes! Compression has the side effect of reducing the maximum data
extent size from 128MiB to 128KiB, which mitigates the problem by a
factor of 1024.

But the problem is still there.
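
For reference, compression can be enabled either per mount or per
file; a sketch (device, mount point and path are placeholders):

  mount -o compress=zstd /dev/sdb /mnt/scratch   # fs-wide, new writes only
  # or per file/directory:
  btrfs property set /mnt/scratch/testfile compression zstd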

Thanks,
Qu
>
> What are the typical extent sizes without compression?
>


* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-21 22:20 Unexpected reflink/subvol snapshot behaviour Dave Chinner
  2021-01-23  8:42 ` Qu Wenruo
@ 2021-01-24  0:19 ` Zygo Blaxell
  2021-01-24 21:43   ` Dave Chinner
  2021-02-02  2:14 ` Darrick J. Wong
  2 siblings, 1 reply; 16+ messages in thread
From: Zygo Blaxell @ 2021-01-24  0:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-btrfs

On Fri, Jan 22, 2021 at 09:20:51AM +1100, Dave Chinner wrote:
> Hi btrfs-gurus,
> 
> I'm running a simple reflink/snapshot/COW scalability test at the
> moment. It is just a loop that does "fio overwrite of 10,000 4kB
> random direct IOs in a 4GB file; snapshot" and I want to check a
> couple of things I'm seeing with btrfs. fio config file is appended
> to the email.
> 
> Firstly, what is the expected "space amplification" of such a
> workload over 1000 iterations on btrfs? This will write 40GB of user
> data, and I'm seeing btrfs consume ~220GB of space for the workload
> regardless of whether I use subvol snapshot or file clones
> (reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
> wondering if this is expected or whether there's something else
> going on. XFS amplification for 1000 iterations using reflink is
> only 1.4x, so 5.5x seems somewhat excessive to me.
> 
> On a similar note, the IO bandwidth consumed by btrfs is way out of
> proportion with the amount of user data being written. I'm seeing
> multiple GBs being written by btrfs on every iteration - easily
> exceeding 5GB of writes per cycle in the later iterations of the
> test. Given that only 40MB of user data is being written per cycle,
> there's a write amplification factor of well over 100x ocurring
> here. In comparison, XFS is writing roughly consistently at 80MB/s
> to disk over the course of the entire workload, largely because of
> journal traffic for the transactions run during COW and clone
> operations.  Is such a huge amount of of IO expected for btrfs in
> this situation?
> 
> As a side effect of that IO load, btrfs is driving the machine hard
> into memory reclaim because the page cache footprint of each
> writeback cycle. btrfs is dirtying a large number of metadata pages
> in the page cache (at least 50% of the ram in the machine is dirtied
> on every snapshot/reflink cycle). Hence when the system needs memory
> reclaim, it hits large amounts of memory it can't reclaim
> immediately and things go bad very quickly.  This is causing
> everything on the machine to stall while btrfs dumps the dirty
> metadata pages to disk at over 1GB/s and 10,000 IOPS for several
> seconds. Is this expected behaviour?
> 
> Next, subvol snapshot and clone time appears to be scale with the
> number of snapshots/clones already present. The initial clone/subvol
> snapshot command take a few milliseconds. At 50 snapshots it take
> 1.5s. At 200 snapshots it takes 7.5s. At 500 it takes 15s and at
> >850 it seems to level off at about 30s a snapshot. There are
> outliers that take double this time (63s was the longest) and the
> variation between iterations can be quite substantial. Is this
> expected scalablity?
> 
> On subvol snapshot execution, there appears to be a bug manifesting
> occasionally and may be one of the reasons for things being so
> variable. The visible manifestation is that every so often a subvol
> snapshot takes 0.02s instead of the multiple seconds all the
> snapshots around it are taking:
> 
>  $ grep -C 1 ": 0.0" results/btrfs-snap/2021-01-21-22\:08\:15-1000/snapshot_times | sed 's/://'
>  snapshot 0 0.02
>  snapshot 1 0.06
>  snapshot 2 0.10
>  --
>  snapshot 25 0.77
>  snapshot 26 0.02
>  snapshot 27 0.85
>  --
>  snapshot 51 1.45
>  snapshot 52 0.02
>  snapshot 53 1.51
>  --
>  snapshot 78 2.35
>  snapshot 79 0.03
>  snapshot 80 2.31
>  --
>  snapshot 104 3.22
>  snapshot 105 0.02
>  snapshot 106 3.44
>  --
>  snapshot 130 4.25
>  snapshot 131 0.02
>  snapshot 132 4.53
>  --
>  snapshot 156 5.38
>  snapshot 157 0.02
>  snapshot 158 5.76
>  --
>  snapshot 183 6.17
>  snapshot 184 0.02
>  snapshot 185 6.94
>  --
>  snapshot 209 8.08
>  snapshot 210 0.04
>  snapshot 211 6.91
>  --
>  snapshot 235 8.77
>  snapshot 236 0.02
>  snapshot 237 9.80
>  --
>  snapshot 288 10.91
>  snapshot 289 0.04
>  snapshot 290 9.07
>  --
>  snapshot 314 11.81
>  snapshot 315 0.04
>  snapshot 316 11.74
>  --
>  snapshot 340 11.83
>  snapshot 341 0.05
>  snapshot 342 12.11
>  --
>  snapshot 367 11.95
>  snapshot 368 0.06
>  snapshot 369 11.83
>  --
>  snapshot 393 13.66
>  snapshot 394 0.03
>  snapshot 395 10.98
>  --
>  snapshot 419 14.04
>  snapshot 420 0.04
>  snapshot 421 12.62
>  --
>  snapshot 472 22.10
>  snapshot 473 0.03
>  snapshot 474 14.90
>  --
>  snapshot 498 14.48
>  snapshot 499 0.03
>  snapshot 500 17.46
>  --
>  snapshot 524 20.50
>  snapshot 525 0.04
>  snapshot 526 18.01
>  --
>  snapshot 577 55.81
>  snapshot 578 0.08
>  snapshot 579 34.02
>  --
>  snapshot 603 22.81
>  snapshot 604 0.03
>  snapshot 605 19.26
>  --
>  snapshot 682 30.88
>  snapshot 683 0.02
>  snapshot 684 14.83
>  --
>  snapshot 708 19.90
>  snapshot 709 0.03
>  snapshot 710 15.38
>  --
>  snapshot 761 25.63
>  snapshot 762 0.05
>  snapshot 763 15.58
>  --
>  snapshot 787 15.33
>  snapshot 788 0.03
>  snapshot 789 15.08
>  --
>  snapshot 866 23.77
>  snapshot 867 0.04
>  snapshot 868 27.40
>  --
>  snapshot 892 15.33
>  snapshot 893 0.03
>  snapshot 894 13.38
>  --
>  snapshot 945 15.32
>  snapshot 946 0.05
>  snapshot 947 15.52
>  --
>  snapshot 971 15.30
>  snapshot 972 0.03
>  snapshot 973 14.88
> 
> It seems .... unlikely that random snapshots of exactly the same
> repeating workloadi have such a variance in execution time. And then

btrfs delays a lot of metadata updates (millions, if you have enough
memory) and then runs them in giant batches during commits, so they can
show up as latency spikes at random times while you're benchmarking.
That is likely part of what is happening here.

The current behavior is something of a regression--there used to be a
latency feedback loop to avoid queueing up too many metadata updates
before throttling the processes that were generating the updates.
It's not clear that simply reverting that change is a good way forward.

> I noticed that they exactly correlate with the order of magnitude
> fio performance drops that manifested occasionally:
> 
> $ for i in `grep ": 0.0" results/btrfs-snap/2021-01-21-22\:08\:15-1000/snapshot_times | sed 's/://' |cut -d " " -f 2`; do grep -C 1 " $i:" results/btrfs-snap/2021-01-21-22\:08\:15-1000/fio_times ; echo --- ; done
> fio loop 0:   write: IOPS=43.7k, BW=171MiB/s (179MB/s)(39.1MiB/229msec); 0 zone resets
> fio loop 1:   write: IOPS=30.1k, BW=118MiB/s (123MB/s)(39.1MiB/332msec); 0 zone resets
> ---
> fio loop 0:   write: IOPS=43.7k, BW=171MiB/s (179MB/s)(39.1MiB/229msec); 0 zone resets
> fio loop 1:   write: IOPS=30.1k, BW=118MiB/s (123MB/s)(39.1MiB/332msec); 0 zone resets
> fio loop 2:   write: IOPS=33.7k, BW=132MiB/s (138MB/s)(39.1MiB/297msec); 0 zone resets
> ---
> fio loop 25:   write: IOPS=15.7k, BW=61.3MiB/s (64.3MB/s)(39.1MiB/637msec); 0 zone resets
> fio loop 26:   write: IOPS=5537, BW=21.6MiB/s (22.7MB/s)(39.1MiB/1806msec); 0 zone resets
> fio loop 27:   write: IOPS=15.4k, BW=60.2MiB/s (63.1MB/s)(39.1MiB/649msec); 0 zone resets
> ---
> fio loop 51:   write: IOPS=12.5k, BW=48.0MiB/s (51.3MB/s)(39.1MiB/798msec); 0 zone resets
> fio loop 52:   write: IOPS=3480, BW=13.6MiB/s (14.3MB/s)(39.1MiB/2873msec); 0 zone resets
> fio loop 53:   write: IOPS=9345, BW=36.5MiB/s (38.3MB/s)(39.1MiB/1070msec); 0 zone resets
> ---
> fio loop 78:   write: IOPS=6887, BW=26.9MiB/s (28.2MB/s)(39.1MiB/1452msec); 0 zone resets
> fio loop 79:   write: IOPS=1955, BW=7823KiB/s (8011kB/s)(39.1MiB/5113msec); 0 zone resets
> fio loop 80:   write: IOPS=7751, BW=30.3MiB/s (31.8MB/s)(39.1MiB/1290msec); 0 zone resets
> ---
> fio loop 104:   write: IOPS=8340, BW=32.6MiB/s (34.2MB/s)(39.1MiB/1199msec); 0 zone resets
> fio loop 105:   write: IOPS=1546, BW=6184KiB/s (6333kB/s)(39.1MiB/6468msec); 0 zone resets
> fio loop 106:   write: IOPS=7262, BW=28.4MiB/s (29.7MB/s)(39.1MiB/1377msec); 0 zone resets
> ---
> fio loop 130:   write: IOPS=7788, BW=30.4MiB/s (31.9MB/s)(39.1MiB/1284msec); 0 zone resets
> fio loop 131:   write: IOPS=1268, BW=5074KiB/s (5195kB/s)(39.1MiB/7884msec); 0 zone resets
> fio loop 132:   write: IOPS=6468, BW=25.3MiB/s (26.5MB/s)(39.1MiB/1546msec); 0 zone resets
> ---
> fio loop 156:   write: IOPS=7137, BW=27.9MiB/s (29.2MB/s)(39.1MiB/1401msec); 0 zone resets
> fio loop 157:   write: IOPS=1487, BW=5949KiB/s (6092kB/s)(39.1MiB/6724msec); 0 zone resets
> fio loop 158:   write: IOPS=8904, BW=34.8MiB/s (36.5MB/s)(39.1MiB/1123msec); 0 zone resets
> ---
> fio loop 183:   write: IOPS=6002, BW=23.4MiB/s (24.6MB/s)(39.1MiB/1666msec); 0 zone resets
> fio loop 184:   write: IOPS=936, BW=3746KiB/s (3836kB/s)(39.1MiB/10679msec); 0 zone resets
> fio loop 185:   write: IOPS=7230, BW=28.2MiB/s (29.6MB/s)(39.1MiB/1383msec); 0 zone resets
> ---
> fio loop 209:   write: IOPS=5521, BW=21.6MiB/s (22.6MB/s)(39.1MiB/1811msec); 0 zone resets
> fio loop 210:   write: IOPS=775, BW=3101KiB/s (3175kB/s)(39.1MiB/12899msec); 0 zone resets
> fio loop 211:   write: IOPS=6489, BW=25.3MiB/s (26.6MB/s)(39.1MiB/1541msec); 0 zone resets
> ---
> fio loop 235:   write: IOPS=7230, BW=28.2MiB/s (29.6MB/s)(39.1MiB/1383msec); 0 zone resets
> fio loop 236:   write: IOPS=758, BW=3035KiB/s (3108kB/s)(39.1MiB/13178msec); 0 zone resets
> fio loop 237:   write: IOPS=8071, BW=31.5MiB/s (33.1MB/s)(39.1MiB/1239msec); 0 zone resets
> ---
> fio loop 288:   write: IOPS=5552, BW=21.7MiB/s (22.7MB/s)(39.1MiB/1801msec); 0 zone resets
> fio loop 289:   write: IOPS=652, BW=2612KiB/s (2675kB/s)(39.1MiB/15314msec); 0 zone resets
> fio loop 290:   write: IOPS=6027, BW=23.5MiB/s (24.7MB/s)(39.1MiB/1659msec); 0 zone resets
> ---
> fio loop 314:   write: IOPS=5186, BW=20.3MiB/s (21.2MB/s)(39.1MiB/1928msec); 0 zone resets
> fio loop 315:   write: IOPS=669, BW=2680KiB/s (2744kB/s)(39.1MiB/14926msec); 0 zone resets
> fio loop 316:   write: IOPS=7163, BW=27.0MiB/s (29.3MB/s)(39.1MiB/1396msec); 0 zone resets
> ---
> fio loop 340:   write: IOPS=5170, BW=20.2MiB/s (21.2MB/s)(39.1MiB/1934msec); 0 zone resets
> fio loop 341:   write: IOPS=697, BW=2791KiB/s (2858kB/s)(39.1MiB/14333msec); 0 zone resets
> fio loop 342:   write: IOPS=6345, BW=24.8MiB/s (25.0MB/s)(39.1MiB/1576msec); 0 zone resets
> ---
> fio loop 367:   write: IOPS=5509, BW=21.5MiB/s (22.6MB/s)(39.1MiB/1815msec); 0 zone resets
> fio loop 368:   write: IOPS=607, BW=2429KiB/s (2488kB/s)(39.1MiB/16466msec); 0 zone resets
> fio loop 369:   write: IOPS=6402, BW=25.0MiB/s (26.2MB/s)(39.1MiB/1562msec); 0 zone resets
> ---
> fio loop 393:   write: IOPS=7331, BW=28.6MiB/s (30.0MB/s)(39.1MiB/1364msec); 0 zone resets
> fio loop 394:   write: IOPS=637, BW=2550KiB/s (2612kB/s)(39.1MiB/15684msec); 0 zone resets
> fio loop 395:   write: IOPS=7358, BW=28.7MiB/s (30.1MB/s)(39.1MiB/1359msec); 0 zone resets
> ---
> fio loop 419:   write: IOPS=6480, BW=25.3MiB/s (26.5MB/s)(39.1MiB/1543msec); 0 zone resets
> fio loop 420:   write: IOPS=620, BW=2484KiB/s (2543kB/s)(39.1MiB/16104msec); 0 zone resets
> fio loop 421:   write: IOPS=7007, BW=27.4MiB/s (28.7MB/s)(39.1MiB/1427msec); 0 zone resets
> ---
> fio loop 472:   write: IOPS=6313, BW=24.7MiB/s (25.9MB/s)(39.1MiB/1584msec); 0 zone resets
> fio loop 473:   write: IOPS=455, BW=1822KiB/s (1866kB/s)(39.1MiB/21951msec); 0 zone resets
> fio loop 474:   write: IOPS=6715, BW=26.2MiB/s (27.5MB/s)(39.1MiB/1489msec); 0 zone resets
> ---
> fio loop 498:   write: IOPS=7662, BW=29.9MiB/s (31.4MB/s)(39.1MiB/1305msec); 0 zone resets
> fio loop 499:   write: IOPS=470, BW=1882KiB/s (1928kB/s)(39.1MiB/21249msec); 0 zone resets
> fio loop 500:   write: IOPS=4228, BW=16.5MiB/s (17.3MB/s)(39.1MiB/2365msec); 0 zone resets
> ---
> fio loop 524:   write: IOPS=6697, BW=26.2MiB/s (27.4MB/s)(39.1MiB/1493msec); 0 zone resets
> fio loop 525:   write: IOPS=454, BW=1818KiB/s (1861kB/s)(39.1MiB/22004msec); 0 zone resets
> fio loop 526:   write: IOPS=7112, BW=27.8MiB/s (29.1MB/s)(39.1MiB/1406msec); 0 zone resets
> ---
> fio loop 577:   write: IOPS=4222, BW=16.5MiB/s (17.3MB/s)(39.1MiB/2368msec); 0 zone resets
> fio loop 578:   write: IOPS=150, BW=602KiB/s (617kB/s)(39.1MiB/66416msec); 0 zone resets
> fio loop 579:   write: IOPS=6038, BW=23.6MiB/s (24.7MB/s)(39.1MiB/1656msec); 0 zone resets
> ---
> fio loop 603:   write: IOPS=5991, BW=23.4MiB/s (24.5MB/s)(39.1MiB/1669msec); 0 zone resets
> fio loop 604:   write: IOPS=441, BW=1764KiB/s (1806kB/s)(39.1MiB/22674msec); 0 zone resets
> fio loop 605:   write: IOPS=6056, BW=23.7MiB/s (24.8MB/s)(39.1MiB/1651msec); 0 zone resets
> ---
> fio loop 682:   write: IOPS=6226, BW=24.3MiB/s (25.5MB/s)(39.1MiB/1606msec); 0 zone resets
> fio loop 683:   write: IOPS=322, BW=1290KiB/s (1321kB/s)(39.1MiB/31002msec); 0 zone resets
> fio loop 684:   write: IOPS=5934, BW=23.2MiB/s (24.3MB/s)(39.1MiB/1685msec); 0 zone resets
> ---
> fio loop 708:   write: IOPS=5614, BW=21.9MiB/s (22.0MB/s)(39.1MiB/1781msec); 0 zone resets
> fio loop 709:   write: IOPS=473, BW=1894KiB/s (1939kB/s)(39.1MiB/21124msec); 0 zone resets
> fio loop 710:   write: IOPS=6816, BW=26.6MiB/s (27.9MB/s)(39.1MiB/1467msec); 0 zone resets
> ---
> fio loop 761:   write: IOPS=6301, BW=24.6MiB/s (25.8MB/s)(39.1MiB/1587msec); 0 zone resets
> fio loop 762:   write: IOPS=448, BW=1796KiB/s (1839kB/s)(39.1MiB/22275msec); 0 zone resets
> fio loop 763:   write: IOPS=7490, BW=29.3MiB/s (30.7MB/s)(39.1MiB/1335msec); 0 zone resets
> ---
> fio loop 787:   write: IOPS=6729, BW=26.3MiB/s (27.6MB/s)(39.1MiB/1486msec); 0 zone resets
> fio loop 788:   write: IOPS=579, BW=2318KiB/s (2374kB/s)(39.1MiB/17253msec); 0 zone resets
> fio loop 789:   write: IOPS=5356, BW=20.9MiB/s (21.9MB/s)(39.1MiB/1867msec); 0 zone resets
> ---
> fio loop 866:   write: IOPS=6720, BW=26.3MiB/s (27.5MB/s)(39.1MiB/1488msec); 0 zone resets
> fio loop 867:   write: IOPS=314, BW=1258KiB/s (1288kB/s)(39.1MiB/31791msec); 0 zone resets
> fio loop 868:   write: IOPS=5602, BW=21.9MiB/s (22.9MB/s)(39.1MiB/1785msec); 0 zone resets
> ---
> fio loop 892:   write: IOPS=6915, BW=27.0MiB/s (28.3MB/s)(39.1MiB/1446msec); 0 zone resets
> fio loop 893:   write: IOPS=598, BW=2395KiB/s (2452kB/s)(39.1MiB/16704msec); 0 zone resets
> fio loop 894:   write: IOPS=6544, BW=25.6MiB/s (26.8MB/s)(39.1MiB/1528msec); 0 zone resets
> ---
> fio loop 945:   write: IOPS=6176, BW=24.1MiB/s (25.3MB/s)(39.1MiB/1619msec); 0 zone resets
> fio loop 946:   write: IOPS=570, BW=2281KiB/s (2336kB/s)(39.1MiB/17536msec); 0 zone resets
> fio loop 947:   write: IOPS=6631, BW=25.9MiB/s (27.2MB/s)(39.1MiB/1508msec); 0 zone resets
> ---
> fio loop 971:   write: IOPS=8539, BW=33.4MiB/s (34.0MB/s)(39.1MiB/1171msec); 0 zone resets
> fio loop 972:   write: IOPS=579, BW=2317KiB/s (2372kB/s)(39.1MiB/17265msec); 0 zone resets
> fio loop 973:   write: IOPS=6265, BW=24.5MiB/s (25.7MB/s)(39.1MiB/1596msec); 0 zone resets
> ---
> 
> 
> In these instances, fio takes about as long as I would expect the
> snapshot to have taken to run. Regardless of the cause, something
> looks to be broken here...
> 
> An astute reader might also notice that fio performance really drops
> away quickly as the number of snapshots goes up. Loop 0 is the "no
> snapshots" performance. By 10 snapshots, performance is half the
> no-snapshot rate. By 50 snapshots, performance is a quarter of the
> no-snapshot peroframnce. It levels out around 6-7000 IOPS, which is
> about 15% of the non-snapshot performance. Is this expected
> performance degradation as snapshot count increases?

Somewhere in that workload, there's probably a pretty high write
multiplier for unsharing subvol metadata pages in snapshots.  The worst
case is about 300x write multiply for the first few pages after a snapshot
is created, because every item referenced on the shared subvol metadata
page (there can be 150-300 of them) must have a new backreference added
to the newly created unshared metadata page, and in the worst case every
one of those new items lives in a separate metadata page that also
has to be read, modified, and written.  The write multiplier rapidly
levels off to 1x once all the snapshot's metadata pages are unshared,
after random writes to around 0.3% of the subvol.  So writing 4K to a
file in a subvol right after a snapshot was taken could hit the disks
with up to 20MB of random read and write iops before it's over.

Fragmentation pushes everything toward the worst-case scenario because it
spreads the referenced items around to separate pages, which could explain
the asymptotic performance curve for snapshots.  Without fragmentation,
all the referenced items tend to appear on the same or at least a few
adjacent pages, so the unsharing cost is much lower.  It's the same
number of pages to unshare whether it's 1 snapshot or 1000, but the
referenced items will get spread around a lot after 1000 iterations
of that fio loop.

Reflinks don't share metadata pages, so they don't have this problem
(except when the dst of the reflink is modifying metadata pages that are
shared with a snapshot, like any other write).

> And before you ask, reflink copies of the fio file rather than
> subvol snapshots have largely the same performance, IO and
> behavioural characteristics. The only difference is that clone
> copying also has a cyclic FIO performance dip (every 3-4 cycles)
> that corresponds with the system driving hard into memory reclaim
> during periodic writeback from btrfs.
> 
> FYI, I've compared btrfs reflink to XFS reflink, too, and XFS fio
> performance stays largely consistent across all 1000 iterations at
> around 13-14k +/-2k IOPS. The reflink time also scales linearly with
> the number of extents in the source file and levels off at about
> 10-11s per cycle as the extent count in the source file levels off
> at ~850,000 extents. XFS completes the 1000 iterations of
> write/clone in about 4 hours, btrfs completels the same part of the
> workload in about 9 hours.
> 
> Oh, I almost forget - FIEMAP performance. After the reflink test, I
> map all the extents in all the cloned files to a) count the extents
> and b) confirm that the difference between clones is correct (~10000
> extents not shared with the previous iteration). Pulling the extent
> maps out of XFS takes about 3s a clone (~850,000 extents), or 30
> minutes for the whole set when run serialised. btrfs takes 90-100s
> per clone - after 8 hours it had only managed to map 380 files and
> was running at 6-7000 read IOPS the entire time. IOWs, it was taking
> _half a million_ read IOs to map the extents of a single clone that
> only had a million extents in it. Is it expected that FIEMAP is so
> slow and IO intensive on cloned files?

There were severe performance issues with FIEMAP (or anything else that
does backref lookup) on kernels before 5.7, especially on files bigger
than a few hundred MB (among other things, it was searching the entire
file for matching forward ref instead of just around the area where the
backref was).  FIEMAP looks at backrefs to populate the 'shared' bit,
so it was affected by this bug.

There might still be a big IO overhead for backref search on current
kernels.  The worst case is some gigabytes of metadata pages for extent
references, if every referencing item ends up stored on its own metadata
page, and if FIEMAP has to read many of them before it finds a reference
that matches the logical file offset so it can set or clear the 'shared'
bit.

I'm not sure the worst case is even bounded--you could have billions of
references to an extent and I don't know of any reason why you couldn't
fill a disk with them (other than btrfs getting too slow to finish before
the disk crumbles to dust).

TREE_SEARCH_V2 doesn't have a 'shared' bit to populate, so it runs _much_
faster than FIEMAP.
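
For reference, the per-clone timing discussed above can be reproduced from
userspace with plain filefrag, which drives the FIEMAP ioctl (a sketch only;
the directory and clone naming are placeholders):

  for f in /mnt/testdir/testfile.clone.*; do
      /usr/bin/time -f "%e seconds" filefrag "$f"    # "<file>: N extents found" plus elapsed time
  done
  filefrag -v /mnt/testdir/testfile.clone.0 | head   # -v lists each extent with its flags, including 'shared'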

> As there are no performance anomolies or memory reclaim issues with
> XFS running this workload, I suspect the issues I note above are
> btrfs issues, not expected behaviour.  I'm not sure what the
> expected scalability of btrfs file clones and snapshots are though,
> so I'm interested to hear if these results are expected or not.

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> JOBS=4
> IODEPTH=4
> IOCOUNT=$((10000 / $JOBS))
> FILESIZE=4g
> 
> cat >$fio_config <<EOF
> [global]
> name=${DST}.name
> directory=${DST}
> size=${FILESIZE}
> randrepeat=0
> bs=4k
> ioengine=libaio
> iodepth=${IODEPTH}
> iodepth_low=2
> direct=1
> end_fsync=1
> fallocate=none
> overwrite=1
> number_ios=${IOCOUNT}
> runtime=30s
> group_reporting=1
> disable_lat=1
> lat_percentiles=0
> clat_percentiles=0
> slat_percentiles=0
> disk_util=0
> 
> [j1]
> filename=testfile
> rw=randwrite
> 
> [j2]
> filename=testfile
> rw=randwrite
> 
> [j3]
> filename=testfile
> rw=randwrite
> 
> [j4]
> filename=testfile
> rw=randwrite
> EOF
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-23  8:42 ` Qu Wenruo
  2021-01-23  8:51   ` Qu Wenruo
  2021-01-23 10:39   ` Roman Mamedov
@ 2021-01-24 13:08   ` Filipe Manana
  2021-01-24 22:36   ` Dave Chinner
  3 siblings, 0 replies; 16+ messages in thread
From: Filipe Manana @ 2021-01-24 13:08 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Dave Chinner, linux-btrfs

On Sat, Jan 23, 2021 at 8:46 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2021/1/22 上午6:20, Dave Chinner wrote:
> > Hi btrfs-gurus,
> >
> > I'm running a simple reflink/snapshot/COW scalability test at the
> > moment. It is just a loop that does "fio overwrite of 10,000 4kB
> > random direct IOs in a 4GB file; snapshot" and I want to check a
> > couple of things I'm seeing with btrfs. fio config file is appended
> > to the email.
> >
> > Firstly, what is the expected "space amplification" of such a
> > workload over 1000 iterations on btrfs? This will write 40GB of user
> > data, and I'm seeing btrfs consume ~220GB of space for the workload
> > regardless of whether I use subvol snapshot or file clones
> > (reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
> > wondering if this is expected or whether there's something else
> > going on. XFS amplification for 1000 iterations using reflink is
> > only 1.4x, so 5.5x seems somewhat excessive to me.
>
> This is mostly due to the way btrfs handles COW and the lazy extent
> freeing behavior.
>
> For btrfs, an extent only get freed when there is no reference on any
> part of it, and
>
> This means, if we have an file which has one 128K file extent written to
> disk, and then write 4K, which will be COWed to another 4K extent, the
> 128K extent is still kept as is, even the no longer referred 4K range is
> still kept there, with extra 4K space usage.
>
> This not only increase the space usage, but also increase metadata usage.
> But reduce the complexity on extent tree and snapshot creation.
>
>
> For the worst case, btrfs can allocate a 128 MiB file extent, and have
> good luck to write 127MiB into the extent. It will take 127MiB + 128MiB
> space, until the last 1MiB of the original extent get freed, the full
> 128MiB can be freed.

That is all true, but it does not apply to Dave's test.
If you look at the fio job, it does direct IO writes, all with a fixed
size of 4K, plus the file they write into was not preallocated
(fallocate=none).

>
>
> Thus above reflink/snapshot + DIO write is going to be very unfriendly
> for fs with lazy extent freeing and default data COW behavior.
>
> That's also why btrfs has a worse fragmentation problem.
>
> >
> > On a similar note, the IO bandwidth consumed by btrfs is way out of
> > proportion with the amount of user data being written. I'm seeing
> > multiple GBs being written by btrfs on every iteration - easily
> > exceeding 5GB of writes per cycle in the later iterations of the
> > test. Given that only 40MB of user data is being written per cycle,
> > there's a write amplification factor of well over 100x ocurring
> > here. In comparison, XFS is writing roughly consistently at 80MB/s
> > to disk over the course of the entire workload, largely because of
> > journal traffic for the transactions run during COW and clone
> > operations.  Is such a huge amount of of IO expected for btrfs in
> > this situation?
>
> That's interesting. Any analyse on the type of bios submitted for the
> device?
>
> My educated guess is, metadata takes most of the space, and due to
> default DUP metadata profile, it get doubled to 5G?
>
> >
> > As a side effect of that IO load, btrfs is driving the machine hard
> > into memory reclaim because the page cache footprint of each
> > writeback cycle. btrfs is dirtying a large number of metadata pages
> > in the page cache (at least 50% of the ram in the machine is dirtied
> > on every snapshot/reflink cycle). Hence when the system needs memory
> > reclaim, it hits large amounts of memory it can't reclaim
> > immediately and things go bad very quickly.  This is causing
> > everything on the machine to stall while btrfs dumps the dirty
> > metadata pages to disk at over 1GB/s and 10,000 IOPS for several
> > seconds. Is this expected behaviour?
>
> This may be caused by above mentioned lazy extent freeing (bookend
> extent) behavior.
>
> Especially when 4K dio is submitted, each 4K write will cause an new
> extent, greatly increasing metadata usage.
>
> For the 10,000 4KiB DIO write inside a 4GiB file, it would easily lead
> to 10,000 extents just in one iteration.
> And with several iteration, the 4GiB file will be so heavily fragmented
> that all extents are just in 4K size. (2^20 extents, which will take
> 100MiB metadata just for one subvol).
>
> And since you're also taking snapshot, this means each new extent in
> each subvol will always has its reference there, no way to be freed, and
> cause tons of slowdown just because the amount of metadata.
>
> >
> > Next, subvol snapshot and clone time appears to be scale with the
> > number of snapshots/clones already present. The initial clone/subvol
> > snapshot command take a few milliseconds. At 50 snapshots it take
> > 1.5s. At 200 snapshots it takes 7.5s. At 500 it takes 15s and at
> >> 850 it seems to level off at about 30s a snapshot. There are
> > outliers that take double this time (63s was the longest) and the
> > variation between iterations can be quite substantial. Is this
> > expected scalablity?
>
> The snapshot will make the current subvolume to be fully committed
> before really taking the snapshot.
>
> Considering above metadata overhead, I believe most of the performance
> penalty should come from the metadata writeback, not the snapshot
> creation itself.
>
> If you just create a big subvolume, sync the fs, and try to take as many
> snapshot as you wish, the overhead should be pretty the same as
> snapshotting an empty subvolume.
>
> >
> > On subvol snapshot execution, there appears to be a bug manifesting
> > occasionally and may be one of the reasons for things being so
> > variable. The visible manifestation is that every so often a subvol
> > snapshot takes 0.02s instead of the multiple seconds all the
> > snapshots around it are taking:
>
> That 0.02s the real overhead for snapshot creation.
>
> The short snapshot creation time means those snapshot creation just wait
> for the same transaction to be committed, thus they don't need to wait
> for the full transaction committment, just need to do the snapshot.
>
>
> [...]
>
> > In these instances, fio takes about as long as I would expect the
> > snapshot to have taken to run. Regardless of the cause, something
> > looks to be broken here...
> >
> > An astute reader might also notice that fio performance really drops
> > away quickly as the number of snapshots goes up. Loop 0 is the "no
> > snapshots" performance. By 10 snapshots, performance is half the
> > no-snapshot rate. By 50 snapshots, performance is a quarter of the
> > no-snapshot peroframnce. It levels out around 6-7000 IOPS, which is
> > about 15% of the non-snapshot performance. Is this expected
> > performance degradation as snapshot count increases?
>
> No, this is mostly due to the exploding amount of metadata caused by the
> near-worst case workload.
>
> Yeah, btrfs is pretty bad at handling small dio writes, which can easily
> explode the metadata usage.
>
> Thus for such dio case, we recommend to use preallocated file +
> nodatacow, so that we won't create new extents (unless snapshot is
> involved).
>
> >
> > And before you ask, reflink copies of the fio file rather than
> > subvol snapshots have largely the same performance, IO and
> > behavioural characteristics. The only difference is that clone
> > copying also has a cyclic FIO performance dip (every 3-4 cycles)
> > that corresponds with the system driving hard into memory reclaim
> > during periodic writeback from btrfs.
> >
> > FYI, I've compared btrfs reflink to XFS reflink, too, and XFS fio
> > performance stays largely consistent across all 1000 iterations at
> > around 13-14k +/-2k IOPS. The reflink time also scales linearly with
> > the number of extents in the source file and levels off at about
> > 10-11s per cycle as the extent count in the source file levels off
> > at ~850,000 extents. XFS completes the 1000 iterations of
> > write/clone in about 4 hours, btrfs completels the same part of the
> > workload in about 9 hours.
> >
> > Oh, I almost forget - FIEMAP performance. After the reflink test, I
> > map all the extents in all the cloned files to a) count the extents
> > and b) confirm that the difference between clones is correct (~10000
> > extents not shared with the previous iteration). Pulling the extent
> > maps out of XFS takes about 3s a clone (~850,000 extents), or 30
> > minutes for the whole set when run serialised. btrfs takes 90-100s
> > per clone - after 8 hours it had only managed to map 380 files and
> > was running at 6-7000 read IOPS the entire time. IOWs, it was taking
> > _half a million_ read IOs to map the extents of a single clone that
> > only had a million extents in it. Is it expected that FIEMAP is so
> > slow and IO intensive on cloned files?
>
> Exploding fragments, definitely needs a lot of metadata read, right?
>
> >
> > As there are no performance anomolies or memory reclaim issues with
> > XFS running this workload, I suspect the issues I note above are
> > btrfs issues, not expected behaviour.  I'm not sure what the
> > expected scalability of btrfs file clones and snapshots are though,
> > so I'm interested to hear if these results are expected or not.
>
> I hate to say that, yes, you find the worst scenario workload for btrfs.
>
> 4K dio + snapshot is the best way to explode the already high btrfs
> metadata usage, and exploit the lazy extent reclaim behavior.
>
> But if no snapshot is involved, at least you can limit the damage, a
> 4GiB file can only be at most 1M 4K file extents.
> But with snapshots, there is no upper limit now.
>
> Thanks,
> Qu
>
> >
> > Cheers,
> >
> > Dave.
> >



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-24  0:19 ` Zygo Blaxell
@ 2021-01-24 21:43   ` Dave Chinner
  2021-01-30  1:03     ` Zygo Blaxell
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2021-01-24 21:43 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Sat, Jan 23, 2021 at 07:19:03PM -0500, Zygo Blaxell wrote:
> On Fri, Jan 22, 2021 at 09:20:51AM +1100, Dave Chinner wrote:
> > Hi btrfs-gurus,
> > 
> > I'm running a simple reflink/snapshot/COW scalability test at the
> > moment. It is just a loop that does "fio overwrite of 10,000 4kB
> > random direct IOs in a 4GB file; snapshot" and I want to check a
> > couple of things I'm seeing with btrfs. fio config file is appended
> > to the email.

....

> >  --
> >  snapshot 945 15.32
> >  snapshot 946 0.05
> >  snapshot 947 15.52
> >  --
> >  snapshot 971 15.30
> >  snapshot 972 0.03
> >  snapshot 973 14.88
> > 
> > It seems .... unlikely that random snapshots of exactly the same
> > repeating workloadi have such a variance in execution time. And then
> 
> btrfs delays a lot of metadata updates (millions, if you have enough
> memory) and then runs them in giant batches during commits, so they can
> show up as latency spikes at random times while you're benchmarking.
> That is likely part of what is happening here.

Evidence points to this being exactly the problem - multiple
gigabytes of page cache dirtying at writeback points leading to
memory pressure and huge amounts of physical IO being issued. A
single CPU running this workload basically stalls the kernel on a mostly idle
32GB/32p machine with 150,000 random 4kB write IOPS capability for
seconds at a time.

> The current behavior is something of a regression--there used to be a
> latency feedback loop to avoid queueing up too many metadata updates
> before throttling the processes that were generating the updates.
> It's not clear that simply reverting that change is a good way forward.

I see. Rock and a hard place. Am I correct in assuming that I
shouldn't expect a fix for either the excessive metadata writeback
bandwidth or the non-deterministic system-wide behaviour that
results from it any time soon?

> > ---
> > fio loop 971:   write: IOPS=8539, BW=33.4MiB/s (34.0MB/s)(39.1MiB/1171msec); 0 zone resets
> > fio loop 972:   write: IOPS=579, BW=2317KiB/s (2372kB/s)(39.1MiB/17265msec); 0 zone resets
> > fio loop 973:   write: IOPS=6265, BW=24.5MiB/s (25.7MB/s)(39.1MiB/1596msec); 0 zone resets
> > ---
> > 
> > 
> > In these instances, fio takes about as long as I would expect the
> > snapshot to have taken to run. Regardless of the cause, something
> > looks to be broken here...
> > 
> > An astute reader might also notice that fio performance really drops
> > away quickly as the number of snapshots goes up. Loop 0 is the "no
> > snapshots" performance. By 10 snapshots, performance is half the
> > no-snapshot rate. By 50 snapshots, performance is a quarter of the
> > no-snapshot peroframnce. It levels out around 6-7000 IOPS, which is
> > about 15% of the non-snapshot performance. Is this expected
> > performance degradation as snapshot count increases?
> 
> Somewhere in that workload, there's probably a pretty high write
> multiplier for unsharing subvol metadata pages in snapshots.  The worst
> case is about 300x write multiply for the first few pages after a snapshot
> is created, because every item referenced on the shared subvol metadata
> page (there can be 150-300 of them) must have a new backreference added
> to the newly created unshared metadata page, and in the worst case every
> one of those new items lives in a separate metadata page that also
> has to be read, modified, and written.  The write multiplier rapidly
> levels off to 1x once all the snapshot's metadata pages are unshared,
> after random writes to around 0.3% of the subvol.  So writing 4K to a
> file in a subvol right after a snapshot was taken could hit the disks
> with up to 20MB of random read and write iops before it's over.

That's .... really bad, but it tallies with 10,000 4kB data writes
triggering 10GB of dirty metadata pages and GB/s of write bandwidth.

Given this is how the snapshot+COW algorithm is designed, I'm having
trouble seeing how this problem could be mitigated. Am I correct in
assuming that this level of write amplification as snapshot cycles
increase "is what it is"?

> Fragmentation pushes everything toward the worst-case scenario because it
> spreads the referenced items around to separate pages, which could explain
> the asymptotic performance curve for snapshots.

The performance continues to worsen long after the per-file extent
count maxes out at just over 1 million (about cycle 100). So it
seems to be related more to the metadata overhead of the subvol, not
so much the individual file.

FWIW, concentrating on "it's a single file with lots of extents"
misses the bigger picture of "it's a compact simulation of a subvol
with tens of thousands of files in it and being randomly updated by
the production workload between snapshots". IOWs, I'm using this
workload to perform accelerated aging on a constantly modified
filesystem under a rolling snapshot regime. A snapshot every few
minutes, 24x7, is ~5-10,000 snapshots a year. I'm compressing the
modification time domain down from 5-10 minutes to a few hundred
milliseconds so I can run thousands of iterations a day and hence
see what happens over a period of months in a couple of hours...
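
In shell terms, the production regime being compressed is roughly the loop
below (paths invented for illustration; the test replaces the churn and the
sleep with the fio overwrite loop):

  while :; do
      # ... production workload randomly modifies files under /mnt/subvol ...
      btrfs subvolume snapshot -r /mnt/subvol "/mnt/snaps/$(date +%s)"
      sleep 300    # a snapshot every ~5 minutes, 24x7
  done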

> Without fragmentation,
> all the referenced items tend to appear on the same or at least a few
> adjacent pages, so the unsharing cost is much lower.  It's the same
> number of pages to unshare whether it's 1 snapshot or 1000, but the
> referenced items will get spread around a lot after 1000 iterarations
> of that fio loop.

Yeah, sure, but when the data is not physically localised (such as a
large dataset in a subvol), you get the random overwrite behaviour
this test exercises....

> Reflinks don't share metadata pages, so they don't have this problem
> (except when the dst of the reflink is modifying metadata pages that are
> shared with a snapshot, like any other write).

Evidence suggests that they do have the same problem.  i.e. this:

> > And before you ask, reflink copies of the fio file rather than
> > subvol snapshots have largely the same performance, IO and
> > behavioural characteristics. The only difference is that clone
> > copying also has a cyclic FIO performance dip (every 3-4 cycles)
> > that corresponds with the system driving hard into memory reclaim
> > during periodic writeback from btrfs.

would be explained by the same metadata COW explosion after a
reflink on the per-inode extent tree, rather than the full subvol
tree. Is that the case?

> > Oh, I almost forget - FIEMAP performance. After the reflink test, I
> > map all the extents in all the cloned files to a) count the extents
> > and b) confirm that the difference between clones is correct (~10000
> > extents not shared with the previous iteration). Pulling the extent
> > maps out of XFS takes about 3s a clone (~850,000 extents), or 30
> > minutes for the whole set when run serialised. btrfs takes 90-100s
> > per clone - after 8 hours it had only managed to map 380 files and
> > was running at 6-7000 read IOPS the entire time. IOWs, it was taking
> > _half a million_ read IOs to map the extents of a single clone that
> > only had a million extents in it. Is it expected that FIEMAP is so
> > slow and IO intensive on cloned files?
> 
> There were severe performance issues with FIEMAP (or anything else that
> does backref lookup) on kernels before 5.7, especially on files bigger
> than a few hundred MB (among other things, it was searching the entire
> file for matching forward ref instead of just around the area where the
> backref was).  FIEMAP looks at backrefs to populate the 'shared' bit,
> so it was affected by this bug.

I'm testing on a vanilla 5.10 kernel, so this bug should not be
present in the kernel.

> There might still be a big IO overhead for backref search on current
> kernels.  The worst case is some gigabytes of metadata pages for extent
> references, if every referencing item ends up stored on its own metadata
> page, and if FIEMAP has to read many of them before it finds a reference
> that matches the logical file offset so it can set or clear the 'shared'
> bit.

So it's at least a quadratic-complexity algorithm?

> I'm not sure the worst case is even bounded--you could have billions of
> references to an extent and I don't know of any reason why you couldn't
> fill a disk with them (other than btrfs getting too slow to finish before
> the disk crumbles to dust).

Hmmm. This also sounds like a result of the way btrfs is physically
structured. Am I correct to assume this behaviour won't change any
time soon?

> TREE_SEARCH_V2 doesn't have a 'shared' bit to populate, so it runs _much_
> faster than FIEMAP.

Assuming I don't need to know about shared extents. The whole point
of using fiemap here is to be able to look at the shared extents in
the file. That's a diagnostic we use in the field for analysing
problems with reflink copied files on XFS, so I see no reason why we
wouldn't need that information on btrfs. So using a special ioctl
that doesn't provide shared extent visibility is not a viable
solution here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-23  8:42 ` Qu Wenruo
                     ` (2 preceding siblings ...)
  2021-01-24 13:08   ` Filipe Manana
@ 2021-01-24 22:36   ` Dave Chinner
  2021-01-25  1:09     ` Qu Wenruo
  2021-01-29 23:25     ` Zygo Blaxell
  3 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2021-01-24 22:36 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Jan 23, 2021 at 04:42:33PM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/1/22 上午6:20, Dave Chinner wrote:
> > Hi btrfs-gurus,
> > 
> > I'm running a simple reflink/snapshot/COW scalability test at the
> > moment. It is just a loop that does "fio overwrite of 10,000 4kB
> > random direct IOs in a 4GB file; snapshot" and I want to check a
> > couple of things I'm seeing with btrfs. fio config file is appended
> > to the email.
> > 
> > Firstly, what is the expected "space amplification" of such a
> > workload over 1000 iterations on btrfs? This will write 40GB of user
> > data, and I'm seeing btrfs consume ~220GB of space for the workload
> > regardless of whether I use subvol snapshot or file clones
> > (reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
> > wondering if this is expected or whether there's something else
> > going on. XFS amplification for 1000 iterations using reflink is
> > only 1.4x, so 5.5x seems somewhat excessive to me.
> 
> This is mostly due to the way btrfs handles COW and the lazy extent
> freeing behavior.
> 
> For btrfs, an extent only get freed when there is no reference on any
> part of it, and
> 
> This means, if we have an file which has one 128K file extent written to
> disk, and then write 4K, which will be COWed to another 4K extent, the
> 128K extent is still kept as is, even the no longer referred 4K range is
> still kept there, with extra 4K space usage.

That's not relevant to the workload I'm running. Once it reaches
steady state, it's just doing 4kB overwrites of shared 4kB extents.

> Thus above reflink/snapshot + DIO write is going to be very unfriendly
> for fs with lazy extent freeing and default data COW behavior.

I'm not freeing any extents at all. I haven't even got to running
the parts of the tests where I remove random snapshots/reflink files
to measure the impact of decreasing reference counts. This test just
looks at increasing reference counts and COW overwrite performance.

> That's also why btrfs has a worse fragmentation problem.

Every other filesystem I've tested is fragmenting the reflink files
to the same level. btrfs is not fragmenting the file any worse than
the others - the workload is intended to fragment the file into a
million individual 4kB extents and then keep overwriting and
snapshotting to examine how the filesystem structures age under such
workloads.

> > On a similar note, the IO bandwidth consumed by btrfs is way out of
> > proportion with the amount of user data being written. I'm seeing
> > multiple GBs being written by btrfs on every iteration - easily
> > exceeding 5GB of writes per cycle in the later iterations of the
> > test. Given that only 40MB of user data is being written per cycle,
> > there's a write amplification factor of well over 100x ocurring
> > here. In comparison, XFS is writing roughly consistently at 80MB/s
> > to disk over the course of the entire workload, largely because of
> > journal traffic for the transactions run during COW and clone
> > operations.  Is such a huge amount of of IO expected for btrfs in
> > this situation?
> 
> That's interesting. Any analyse on the type of bios submitted for the
> device?

No, and I don't actually care because it's not relevant to what
I'm trying to understand. I've given you enough information to
reproduce the behaviour if you want to analyse it yourself.

> > Next, subvol snapshot and clone time appears to be scale with the
> > number of snapshots/clones already present. The initial clone/subvol
> > snapshot command take a few milliseconds. At 50 snapshots it take
> > 1.5s. At 200 snapshots it takes 7.5s. At 500 it takes 15s and at
> > > 850 it seems to level off at about 30s a snapshot. There are
> > outliers that take double this time (63s was the longest) and the
> > variation between iterations can be quite substantial. Is this
> > expected scalablity?
> 
> The snapshot will make the current subvolume to be fully committed
> before really taking the snapshot.
> 
> Considering above metadata overhead, I believe most of the performance
> penalty should come from the metadata writeback, not the snapshot
> creation itself.
> 
> If you just create a big subvolume, sync the fs, and try to take as many
> snapshot as you wish, the overhead should be pretty the same as
> snapshotting an empty subvolume.

The fio workload runs fsync at the end of the overwrite, which means
all the writes and the metadata needed to reference it *must* be on
stable storage. Everything else is snapshot overhead, whether it be
the freeze of the filesystem in the case of dm-thin snapshots or
xfs-on-loopback-with-reflink-snapshots, of the internal sync that
btrfs does so that the snapshot produces a consistent snapshot
image...

> > On subvol snapshot execution, there appears to be a bug manifesting
> > occasionally and may be one of the reasons for things being so
> > variable. The visible manifestation is that every so often a subvol
> > snapshot takes 0.02s instead of the multiple seconds all the
> > snapshots around it are taking:
> 
> That 0.02s the real overhead for snapshot creation.
> 
> The short snapshot creation time means those snapshot creation just wait
> for the same transaction to be committed, thus they don't need to wait
> for the full transaction committment, just need to do the snapshot.

That doesn't explain why fio sometimes appears to be running much
slower than at other times. Maybe this implies an fsync() bug w.r.t.
DIO overwrites and that btrfs should always be running the fio
workload at 500-1000 IOPS and snapshots should always run at 0.02s.

IOWs, the problem here is the inconsistent behaviour: the workload
is deterministic and repeats in exactly the same way every time, so
the behaviour of the filesystem should be the same for every single
iteration. Snapshot should either always take 0.02s and fio is
really slow, or fio should be fast and the snapshot really slow
because the snapshot has wider "metadata on stable storage"
requirements than fsync. The workload should not be swapping
randomly between the two behaviours....

> [...]
> 
> > In these instances, fio takes about as long as I would expect the
> > snapshot to have taken to run. Regardless of the cause, something
> > looks to be broken here...
> > 
> > An astute reader might also notice that fio performance really drops
> > away quickly as the number of snapshots goes up. Loop 0 is the "no
> > snapshots" performance. By 10 snapshots, performance is half the
> > no-snapshot rate. By 50 snapshots, performance is a quarter of the
> > no-snapshot peroframnce. It levels out around 6-7000 IOPS, which is
> > about 15% of the non-snapshot performance. Is this expected
> > performance degradation as snapshot count increases?
> 
> No, this is mostly due to the exploding amount of metadata caused by the
> near-worst case workload.
> 
> Yeah, btrfs is pretty bad at handling small dio writes, which can easily
> explode the metadata usage.
> 
> Thus for such dio case, we recommend to use preallocated file +
> nodatacow, so that we won't create new extents (unless snapshot is
> involved).

Big picture. This is an accelerated aging test, not a production
workload. Telling me how to work around the problems associated with
4kB overwrite (as if I don't already know about nodatacow and all
the functionality you lose by enabling it!) doesn't make the
problems with increasing snapshot counts that I'm exposing go away.

I'm interested in knowing how btrfs scales and performs with large
reflink/snapshot counts - I explicitly chose 4kB DIO overwrite as
the fastest method of exposing such scalability issues because I
know exactly how bad this is for the COW algorithms all current
snapshot/reflink technologies implement.

Please don't shoot the messenger because you think the workload is
unrealistic - it simply indicates that you haven't understood the
goal of the test workload I am running.

Accelerating aging involves using unfriendly workloads to push the
filesystem into bad places in minutes instead of months or years.
It is not a production workload that needs optimising - if anything,
I need to make it even more aggressive and nasty, because it's just
not causing XFS reflinks any serious scalability problems at all...

> > Oh, I almost forget - FIEMAP performance. After the reflink test, I
> > map all the extents in all the cloned files to a) count the extents
> > and b) confirm that the difference between clones is correct (~10000
> > extents not shared with the previous iteration). Pulling the extent
> > maps out of XFS takes about 3s a clone (~850,000 extents), or 30
> > minutes for the whole set when run serialised. btrfs takes 90-100s
> > per clone - after 8 hours it had only managed to map 380 files and
> > was running at 6-7000 read IOPS the entire time. IOWs, it was taking
> > _half a million_ read IOs to map the extents of a single clone that
> > only had a million extents in it. Is it expected that FIEMAP is so
> > slow and IO intensive on cloned files?
> 
> Exploding fragments, definitely needs a lot of metadata read, right?

Well, at 1000 files, XFS does zero metadata read IO because the
extent lists for all 1000 snapshots easily fit in RAM - about 2GB of
RAM is needed, and that's the entire per-inode memory overhead of
the test. Hence when the fiemap cycle starts, it just pulls all this
from RAM and we do zero metadata read IO.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-24 22:36   ` Dave Chinner
@ 2021-01-25  1:09     ` Qu Wenruo
  2021-01-29 23:25     ` Zygo Blaxell
  1 sibling, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2021-01-25  1:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-btrfs



On 2021/1/25 6:36 AM, Dave Chinner wrote:
> On Sat, Jan 23, 2021 at 04:42:33PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2021/1/22 上午6:20, Dave Chinner wrote:
>>> Hi btrfs-gurus,
>>>
>>> I'm running a simple reflink/snapshot/COW scalability test at the
>>> moment. It is just a loop that does "fio overwrite of 10,000 4kB
>>> random direct IOs in a 4GB file; snapshot" and I want to check a
>>> couple of things I'm seeing with btrfs. fio config file is appended
>>> to the email.
>>>
>>> Firstly, what is the expected "space amplification" of such a
>>> workload over 1000 iterations on btrfs? This will write 40GB of user
>>> data, and I'm seeing btrfs consume ~220GB of space for the workload
>>> regardless of whether I use subvol snapshot or file clones
>>> (reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
>>> wondering if this is expected or whether there's something else
>>> going on. XFS amplification for 1000 iterations using reflink is
>>> only 1.4x, so 5.5x seems somewhat excessive to me.
>>
>> This is mostly due to the way btrfs handles COW and the lazy extent
>> freeing behavior.
>>
>> For btrfs, an extent only get freed when there is no reference on any
>> part of it, and
>>
>> This means, if we have an file which has one 128K file extent written to
>> disk, and then write 4K, which will be COWed to another 4K extent, the
>> 128K extent is still kept as is, even the no longer referred 4K range is
>> still kept there, with extra 4K space usage.
>
> That's not relevant to the workload I'm running. Once it reaches
> steady state, it's just doing 4kB overwrites of shared 4kB extents.
>
>> Thus above reflink/snapshot + DIO write is going to be very unfriendly
>> for fs with lazy extent freeing and default data COW behavior.
>
> I'm not freeing any extents at all. I haven't even got to running
> the parts of the tests where I remove random snapshots/reflink files
> to measure the impact of decreasing reference counts. THis test just
> looks at increasing reference counts and COW overwrite performance.
>
>> That's also why btrfs has a worse fragmentation problem.
>
> Every other filesystem I've tested is fragmenting the reflink files
> to the same level. btrfs is not fragmenting the file any worse than
> the others - the workload is intended to fragment the file into a
> million individual 4kB extents and then keep overwriting and
> snapshotting to examine how the filesystem structures age under such
> workloads.
>
>>> On a similar note, the IO bandwidth consumed by btrfs is way out of
>>> proportion with the amount of user data being written. I'm seeing
>>> multiple GBs being written by btrfs on every iteration - easily
>>> exceeding 5GB of writes per cycle in the later iterations of the
>>> test. Given that only 40MB of user data is being written per cycle,
>>> there's a write amplification factor of well over 100x ocurring
>>> here. In comparison, XFS is writing roughly consistently at 80MB/s
>>> to disk over the course of the entire workload, largely because of
>>> journal traffic for the transactions run during COW and clone
>>> operations.  Is such a huge amount of of IO expected for btrfs in
>>> this situation?
>>
>> That's interesting. Any analyse on the type of bios submitted for the
>> device?
>
> No, and I don't actually care because it's not relevant to what
> I'm trying to understand. I've given you enough information to
> reproduce the behaviour if you want to analyse it yourself.
>
>>> Next, subvol snapshot and clone time appears to be scale with the
>>> number of snapshots/clones already present. The initial clone/subvol
>>> snapshot command take a few milliseconds. At 50 snapshots it take
>>> 1.5s. At 200 snapshots it takes 7.5s. At 500 it takes 15s and at
>>>> 850 it seems to level off at about 30s a snapshot. There are
>>> outliers that take double this time (63s was the longest) and the
>>> variation between iterations can be quite substantial. Is this
>>> expected scalablity?
>>
>> The snapshot will make the current subvolume to be fully committed
>> before really taking the snapshot.
>>
>> Considering above metadata overhead, I believe most of the performance
>> penalty should come from the metadata writeback, not the snapshot
>> creation itself.
>>
>> If you just create a big subvolume, sync the fs, and try to take as many
>> snapshot as you wish, the overhead should be pretty the same as
>> snapshotting an empty subvolume.
>
> The fio workload runs fsync at the end of the overwrite, which means
> all the writes and the metadata needed to reference it *must* be on
> stable storage. Everything else is snapshot overhead, whether it be
> the freeze of the filesystem in the case of dm-thin snapshots or
> xfs-on-loopback-with-reflink-snapshots, of the internal sync that
> btrfs does so that the snapshot produces a consistent snapshot
> image...
>
>>> On subvol snapshot execution, there appears to be a bug manifesting
>>> occasionally and may be one of the reasons for things being so
>>> variable. The visible manifestation is that every so often a subvol
>>> snapshot takes 0.02s instead of the multiple seconds all the
>>> snapshots around it are taking:
>>
>> That 0.02s the real overhead for snapshot creation.
>>
>> The short snapshot creation time means those snapshot creation just wait
>> for the same transaction to be committed, thus they don't need to wait
>> for the full transaction committment, just need to do the snapshot.
>
> That doesn't explain why fio sometimes appears to be running much
> slower than at other times. Maybe this implies a fsync() bug w.r.t.
> DIO overwrites and that btrfs should always be running the fio
> worklaod at 500-1000 iops and snapshots should always run at 0.02s
>
> IOWs, the problem here is the inconsistent behaviour: the workload
> is deterministic and repeats in exactly the same way every time, so
> the behaviour of the filesystem should be the same for every single
> iteration. Snapshot should either always take 0.02s and fio is
> really slow, or fio should be fast and the snapshot really slow
> because the snapshot has wider "metadata on stable storage"
> requirements than fsync. The workload should not be swapping
> randomly between the two behaviours....

That makes sense, although I still wonder if the writeback of a large
amount of metadata is involved in this case.

But yes, the behavior is indeed not ideal.

>
>> [...]
>>
>>> In these instances, fio takes about as long as I would expect the
>>> snapshot to have taken to run. Regardless of the cause, something
>>> looks to be broken here...
>>>
>>> An astute reader might also notice that fio performance really drops
>>> away quickly as the number of snapshots goes up. Loop 0 is the "no
>>> snapshots" performance. By 10 snapshots, performance is half the
>>> no-snapshot rate. By 50 snapshots, performance is a quarter of the
>>> no-snapshot peroframnce. It levels out around 6-7000 IOPS, which is
>>> about 15% of the non-snapshot performance. Is this expected
>>> performance degradation as snapshot count increases?
>>
>> No, this is mostly due to the exploding amount of metadata caused by the
>> near-worst case workload.
>>
>> Yeah, btrfs is pretty bad at handling small dio writes, which can easily
>> explode the metadata usage.
>>
>> Thus for such dio case, we recommend to use preallocated file +
>> nodatacow, so that we won't create new extents (unless snapshot is
>> involved).
>
> Big picture. This is an accelerated aging test, not a prodcution
> workload. Telling me how to work around the problems associated with
> 4kB overwrite (as if I don't already know about nodatacow and all
> the functionality you lose by enabling it!) doesn't make the
> problems with increasing snapshot counts that I'm exposing go away.
>
> I'm interesting in knowing how btrfs scales and performs with large
> reflink/snapshot counts - I explicitly chose 4kB DIO overwrite as
> the fastest method of exposing such scalability issues because I
> know exactly how bad this is for the COW algorithms all current
> snapshot/reflink technologies implement.
>
> Please don't shoot the messenger because you think the workload is
> unrealistic - it simply indicates that you haven't understood the
> goal of the test worklaod I am running.

I didn't know the objective was to emulate the aging problem; in that case
4K dio is completely fine, as we also use it to bump metadata size quickly.
(But we never intentionally push it to such a stage as your test cases do.)

>
> Accelerating aging involves using unfriendly workloads to push the
> filesystem into bad places in minutes instead of months or years.
> It is not a production workload that needs optimising - if anything,
> I need to make it even more aggressive and nasty, because it's just
> not causing XFS reflinks any serious scalability problems at all...
>
>>> Oh, I almost forget - FIEMAP performance. After the reflink test, I
>>> map all the extents in all the cloned files to a) count the extents
>>> and b) confirm that the difference between clones is correct (~10000
>>> extents not shared with the previous iteration). Pulling the extent
>>> maps out of XFS takes about 3s a clone (~850,000 extents), or 30
>>> minutes for the whole set when run serialised. btrfs takes 90-100s
>>> per clone - after 8 hours it had only managed to map 380 files and
>>> was running at 6-7000 read IOPS the entire time. IOWs, it was taking
>>> _half a million_ read IOs to map the extents of a single clone that
>>> only had a million extents in it. Is it expected that FIEMAP is so
>>> slow and IO intensive on cloned files?
>>
>> Exploding fragments, definitely needs a lot of metadata read, right?
>
> Well, at 1000 files, XFS does zero metadata read IO because the
> extent lists for all 1000 snapshots easily fit in RAM - about 2GB of
> RAM is needed, and that's the entire per-inode memory overhead of
> the test. Hence when the fiemap cycle starts, it just pulls all this
> from RAM and we do zero metadata read IO.

Well, you know btrfs takes more metadata space.

Btrfs takes 53 bytes for one file extent, even without the extra header
for tree blocks.

For a 4GiB file with all 4KiB extents, that's 2^20 * 53, at least 106MiB
just for one such 4GiB file.

2GiB of RAM can only hold at most 20 such files.

Thanks,
Qu

>
> Cheers,
>
> Dave.
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-24 22:36   ` Dave Chinner
  2021-01-25  1:09     ` Qu Wenruo
@ 2021-01-29 23:25     ` Zygo Blaxell
  2021-02-02  0:13       ` Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread
From: Zygo Blaxell @ 2021-01-29 23:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Qu Wenruo, linux-btrfs

On Mon, Jan 25, 2021 at 09:36:55AM +1100, Dave Chinner wrote:
> On Sat, Jan 23, 2021 at 04:42:33PM +0800, Qu Wenruo wrote:
> > 
> > 
> > On 2021/1/22 上午6:20, Dave Chinner wrote:
> > > Hi btrfs-gurus,
> > > 
> > > I'm running a simple reflink/snapshot/COW scalability test at the
> > > moment. It is just a loop that does "fio overwrite of 10,000 4kB
> > > random direct IOs in a 4GB file; snapshot" and I want to check a
> > > couple of things I'm seeing with btrfs. fio config file is appended
> > > to the email.
> > > 
> > > Firstly, what is the expected "space amplification" of such a
> > > workload over 1000 iterations on btrfs? This will write 40GB of user
> > > data, and I'm seeing btrfs consume ~220GB of space for the workload
> > > regardless of whether I use subvol snapshot or file clones
> > > (reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
> > > wondering if this is expected or whether there's something else
> > > going on. XFS amplification for 1000 iterations using reflink is
> > > only 1.4x, so 5.5x seems somewhat excessive to me.

Each iteration produces a little under 80MB of metadata (the forward
and backward refs are pretty big compared to the size of 4K data blocks,
a little under 10% of the size).  You're writing randomly over 0.3% of
the subvol (40MB / 4GB = about 1%, plus or minus random) so each snapshot
unshares most of its metadata pages and degenerates into reflink copies.
That works out to a little under 80GB of metadata by the time the 1000
snapshots are created.

If you have dup metadata, multiply metadata size by 2.  Add the original
44GB of data (the extra 4G is because of prealloc) to the metadata size
assuming dup, and there's 204GB, not too far away from 220.

> > This is mostly due to the way btrfs handles COW and the lazy extent
> > freeing behavior.
> > 
> > For btrfs, an extent only get freed when there is no reference on any
> > part of it, and
> > 
> > This means, if we have an file which has one 128K file extent written to
> > disk, and then write 4K, which will be COWed to another 4K extent, the
> > 128K extent is still kept as is, even the no longer referred 4K range is
> > still kept there, with extra 4K space usage.
> 
> That's not relevant to the workload I'm running. Once it reaches
> steady state, it's just doing 4kB overwrites of shared 4kB extents.

Actually it is relevant, because that's _not_ what your workload is doing.

Despite having 'fallocate=none' in the fio config, fio still preallocates the testfile.
That triggers btrfs's preallocation behavior for datacow extents: every
unshared block is written in place, inside the original 128MB prealloc
extents.  The writes to snapshots are creating shared references to
these blocks within the original extents, instead of creating separate
physical extents with unshared references.  Once the blocks contain data,
an overwrite will create a separate physical extent, but that applies
only when a previously written block is overwritten, and that doesn't
start happening in large numbers until later in the test.
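
That layout is easy to inspect -- something like this (test file path
assumed) shows it:

  filefrag -v /mnt/testdir/testfile | head -20
  # ranges that were preallocated but never written carry the 'unwritten' flag;
  # first-time writes land in place inside those original large extents, and only
  # overwrites of already-written blocks get separate physical extents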

So you are not creating a million 4K extents with an average of just
under 500 refs each (1 to 1000 snapshots minus some that get overwritten
at random).  You are creating 32 128M extents, with an average of around
16 million shared references each (32768 reflinks * 500 snapshots on
average, minus a little for random overlap).

By the time you look at these extents with FIEMAP, FIEMAP is stuck
potentially running tens of trillions of iterations trying to fill in
the "SHARED" bit for millions of extents.

Also, because you're doing prealloc on a datacow file, you are taking
a hit to calculate the block sharing on the writes, too.  Every write
that lands on the prealloc extent has to check to see if the written
block overlaps any other written block in the same extent, and that's
a shared reference check.  Overwrites don't need this check, so
performance might level out or even get better toward the end of the
test as the number of references to the original 128M extents starts to
fall off.

The same thing happens with reflink copies, except that the nested loop
over the 500 * 32768 extent refs to detect sharing moves some parts to
the inner loop (with deeper metadata tree walks) and some to the outer
loop when there are snapshots.  It'll affect the timing of FIEMAP and
the prealloc writes.

IMHO, PREALLOC should be ignored for all datacow files on btrfs.  It can't
do things people expect with a datacow file (in particular the ENOSPC
guarantee is only possible for the first write), and it does a bunch of
expensive, counterintuitive stuff that people don't expect.

PREALLOC is useful for nodatacow files and does implement expected
behavior, but it should only be used on those.
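
For completeness, the nodatacow pattern that PREALLOC is actually useful for
looks something like this (path is a placeholder, and +C has to be applied
while the file is still empty):

  touch /mnt/testdir/testfile
  chattr +C /mnt/testdir/testfile        # nodatacow: overwrites happen in place
  fallocate -l 4G /mnt/testdir/testfile  # prealloc now behaves as expected (until a snapshot forces COW again)
  lsattr /mnt/testdir/testfile           # the 'C' attribute should be listed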

> > Thus above reflink/snapshot + DIO write is going to be very unfriendly
> > for fs with lazy extent freeing and default data COW behavior.
> 
> I'm not freeing any extents at all. I haven't even got to running
> the parts of the tests where I remove random snapshots/reflink files
> to measure the impact of decreasing reference counts. THis test just
> looks at increasing reference counts and COW overwrite performance.
> 
> > That's also why btrfs has a worse fragmentation problem.
> 
> Every other filesystem I've tested is fragmenting the reflink files
> to the same level. btrfs is not fragmenting the file any worse than
> the others - the workload is intended to fragment the file into a
> million individual 4kB extents and then keep overwriting and
> snapshotting to examine how the filesystem structures age under such
> workloads.
> 
> > > On a similar note, the IO bandwidth consumed by btrfs is way out of
> > > proportion with the amount of user data being written. I'm seeing
> > > multiple GBs being written by btrfs on every iteration - easily
> > > exceeding 5GB of writes per cycle in the later iterations of the
> > > test. Given that only 40MB of user data is being written per cycle,
> > > there's a write amplification factor of well over 100x ocurring
> > > here. In comparison, XFS is writing roughly consistently at 80MB/s
> > > to disk over the course of the entire workload, largely because of
> > > journal traffic for the transactions run during COW and clone
> > > operations.  Is such a huge amount of of IO expected for btrfs in
> > > this situation?
> > 
> > That's interesting. Any analyse on the type of bios submitted for the
> > device?
> 
> No, and I don't actually care because it's not relevant to what
> I'm trying to understand. I've given you enough information to
> reproduce the behaviour if you want to analyse it yourself.
> 
> > > Next, subvol snapshot and clone time appears to be scale with the
> > > number of snapshots/clones already present. The initial clone/subvol
> > > snapshot command take a few milliseconds. At 50 snapshots it take
> > > 1.5s. At 200 snapshots it takes 7.5s. At 500 it takes 15s and at
> > > > 850 it seems to level off at about 30s a snapshot. There are
> > > outliers that take double this time (63s was the longest) and the
> > > variation between iterations can be quite substantial. Is this
> > > expected scalablity?
> > 
> > The snapshot will make the current subvolume to be fully committed
> > before really taking the snapshot.
> > 
> > Considering above metadata overhead, I believe most of the performance
> > penalty should come from the metadata writeback, not the snapshot
> > creation itself.
> > 
> > If you just create a big subvolume, sync the fs, and try to take as many
> > snapshot as you wish, the overhead should be pretty the same as
> > snapshotting an empty subvolume.
> 
> The fio workload runs fsync at the end of the overwrite, which means
> all the writes and the metadata needed to reference it *must* be on
> stable storage. 

That is not how btrfs fsync works, and your assertions that follow from
this misunderstanding are also wrong.

fsync doesn't make the delayed refs update queue smaller.  It might
even make the queue bigger, by moving delalloc writes that would have
been deferred to the next transaction into the current transaction.

btrfs fsync flushes out the data blocks to disk, then it writes journal
commands that say "create a file, point an extent record at those data
blocks, reflink the extent into the file" to the log tree, then it returns
to the user.  All the metadata reference updates for those data blocks
are left for the following btrfs transaction to commit to the filesystem
trees in the background.  If there is a crash, the following mount reads
the log tree and requeues the metadata updates in memory, so persistence
is achieved.  After each commit is finished, the log tree is discarded.

The goal is that the caller of fsync() doesn't have to wait for the whole
filesystem tree to be committed.  The caller only pays the cost to have
the specific parts of the tree they care about persisted.

In your workload, the fsync() doesn't do anything useful--it flushes
out a few MB of data blocks and a breadcrumb trail of journal commands.
When you call snapshot create, everything that was previously deferred
has to be turned into a concrete filesystem tree, both data and metadata.
The snapshot create is paying for all the work fsync() avoided.

In the rare cases where fsync() happens to run at the same time as a
transaction commit (or maybe just before), the transaction commit and
the fsync() get synchronized by trying to touch the same locks, and
return at close to the same time.  In those cases, the snapshot only
has to write out a new subvol root and some free space map changes,
which takes 0.02s.
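
A rough way to see that split (paths invented): the fsync returns once the
data blocks and log tree are on disk, while the snapshot right after it pays
for committing the deferred metadata updates:

  xfs_io -f -d -c "pwrite -b 4k 0 40m" -c fsync /mnt/subvol/testfile   # data + log tree only
  time btrfs subvolume snapshot -r /mnt/subvol "/mnt/snap.$(date +%s)" # full transaction commit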

> Everything else is snapshot overhead, whether it be
> the freeze of the filesystem in the case of dm-thin snapshots or
> xfs-on-loopback-with-reflink-snapshots, of the internal sync that
> btrfs does so that the snapshot produces a consistent snapshot
> image...
> 
> > > On subvol snapshot execution, there appears to be a bug manifesting
> > > occasionally and may be one of the reasons for things being so
> > > variable. The visible manifestation is that every so often a subvol
> > > snapshot takes 0.02s instead of the multiple seconds all the
> > > snapshots around it are taking:
> > 
> > That 0.02s the real overhead for snapshot creation.
> > 
> > The short snapshot creation time means those snapshot creation just wait
> > for the same transaction to be committed, thus they don't need to wait
> > for the full transaction committment, just need to do the snapshot.
> 
> That doesn't explain why fio sometimes appears to be running much
> slower than at other times. Maybe this implies a fsync() bug w.r.t.
> DIO overwrites and that btrfs should always be running the fio
> worklaod at 500-1000 iops and snapshots should always run at 0.02s
> 
> IOWs, the problem here is the inconsistent behaviour: the workload
> is deterministic and repeats in exactly the same way every time, so
> the behaviour of the filesystem should be the same for every single
> iteration. Snapshot should either always take 0.02s and fio is
> really slow, or fio should be fast and the snapshot really slow
> because the snapshot has wider "metadata on stable storage"
> requirements than fsync. The workload should not be swapping
> randomly between the two behaviours....

There is a transaction commit on a periodic timer, every 30 seconds
by default.  If it doesn't compete for locks and block the fio process,
it will at least compete for disk bandwidth and slow it down.  The amount
of work the snapshot create has to do in your test case is dominated by
the amount of delayed ref work queued up between the end of the previous
periodic commit and the start of the snapshot create (your metadata
outnumbers your data by 4 to 1).  This timing will be nondeterministic
in your test setup.

If you mount with -o commit=999999999 (or some sufficiently large value)
then you'll get more determinism, as all the transaction commits will
then be triggered by memory pressure and your snapshot creates.
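
One way to apply that on an already-mounted filesystem (mount point assumed):

  mount -o remount,commit=999999999 /mnt
  grep ' /mnt ' /proc/mounts    # the commit= interval should now show up in the options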

> > [...]
> > 
> > > In these instances, fio takes about as long as I would expect the
> > > snapshot to have taken to run. Regardless of the cause, something
> > > looks to be broken here...
> > > 
> > > An astute reader might also notice that fio performance really drops
> > > away quickly as the number of snapshots goes up. Loop 0 is the "no
> > > snapshots" performance. By 10 snapshots, performance is half the
> > > no-snapshot rate. By 50 snapshots, performance is a quarter of the
> > > no-snapshot peroframnce. It levels out around 6-7000 IOPS, which is
> > > about 15% of the non-snapshot performance. Is this expected
> > > performance degradation as snapshot count increases?

Performance immediately following a snapshot is expected to degrade as
the number of distinct parent or child pages referenced by a metadata
page increases.  This is not the same thing as snapshot count.

Your test is causing both to increase at the same time, and also keeping
the testfile in the "immediately following a snapshot" state.

If you have 1000 snapshots and your writes have high metadata locality
(e.g. you are appending to a single log file in each snapshot) then
the write multipliers are very close to 1.0x.  If you have low metadata
locality, even one snapshot will be followed by a big write multiplication
burst.

> > No, this is mostly due to the exploding amount of metadata caused by the
> > near-worst case workload.

Every 2 orders of magnitude more metadata items increases the O(log(N))
costs of btrfs by one unit.  By 50 snapshots or reflinks you have hundreds
of millions of metadata items; at that point it's 6x slower and not
increasing very much any more...not too far off what we'd expect.

One problem with this theory is that we'd expect the same behavior for
reflinks too, so it might not be correct.

> > Yeah, btrfs is pretty bad at handling small dio writes, which can easily
> > explode the metadata usage.
> > 
> > Thus for such dio case, we recommend to use preallocated file +
> > nodatacow, so that we won't create new extents (unless snapshot is
> > involved).
> 
> Big picture. This is an accelerated aging test, not a production
> workload. Telling me how to work around the problems associated with
> 4kB overwrite (as if I don't already know about nodatacow and all
> the functionality you lose by enabling it!) doesn't make the
> problems with increasing snapshot counts that I'm exposing go away.

I'm familiar with this workload.  I've been running something similar to
your target workload since 2014.  We build NAS backup appliance boxes:
each has about 100 client subvols ranging in size from 1GB to 10TB,
thousands to millions of files each, 1-5% daily turnover.

Multiple snapshots per hour at this scale is a really ambitious target
for btrfs.  We can theoretically do somewhere between 15 and 180 snapshot
rotates per day before the machine starts falling behind on the deletes
and running out of space.  Snapshot create and delete on btrfs come
with giant unbounded latency spikes, so we don't run them all the time.
We'll create snapshots any time a client finishes an update, but we only
delete old snapshots to recover disk space during a 3-hour maintenance
window.

While the snapshot rotates are happening, btrfs leaves CPU cores and
disks idle.  Current performance is far from the theoretical limits.

There is some active development in this area, especially in the last
year.  Several improvements happened in 2020, including a few silly bug
fixes of the form "don't make two threads fight each other for locks"
and "don't forget to wake up some important background process because
we optimized away some trigger event it was waiting for."

> I'm interested in knowing how btrfs scales and performs with large
> reflink/snapshot counts - I explicitly chose 4kB DIO overwrite as
> the fastest method of exposing such scalability issues because I
> know exactly how bad this is for the COW algorithms all current
> snapshot/reflink technologies implement.
> 
> Please don't shoot the messenger because you think the workload is
> unrealistic - it simply indicates that you haven't understood the
> goal of the test workload I am running.
> 
> Accelerating aging involves using unfriendly workloads to push the
> filesystem into bad places in minutes instead of months or years.
> It is not a production workload that needs optimising - if anything,
> I need to make it even more aggressive and nasty, because it's just
> not causing XFS reflinks any serious scalability problems at all...
> 
> > > Oh, I almost forget - FIEMAP performance. After the reflink test, I
> > > map all the extents in all the cloned files to a) count the extents
> > > and b) confirm that the difference between clones is correct (~10000
> > > extents not shared with the previous iteration). Pulling the extent
> > > maps out of XFS takes about 3s a clone (~850,000 extents), or 30
> > > minutes for the whole set when run serialised. btrfs takes 90-100s
> > > per clone - after 8 hours it had only managed to map 380 files and
> > > was running at 6-7000 read IOPS the entire time. IOWs, it was taking
> > > _half a million_ read IOs to map the extents of a single clone that
> > > only had a million extents in it. Is it expected that FIEMAP is so
> > > slow and IO intensive on cloned files?
> > 
> > Exploding fragments, definitely needs a lot of metadata read, right?
> 
> Well, at 1000 files, XFS does zero metadata read IO because the
> extent lists for all 1000 snapshots easily fit in RAM - about 2GB of
> RAM is needed, and that's the entire per-inode memory overhead of
> the test. Hence when the fiemap cycle starts, it just pulls all this
> from RAM and we do zero metadata read IO.

If, before you start the test, you run 'truncate -s 4g testfile', so
that fio doesn't preallocate the file, things behave somewhat better,
though "better" for 80GB of metadata is still pretty awful.

If I run the test without the prealloc, filefrag takes about 4.5 seconds
to iterate 1044066 extents from a cold cache, and does 10 snapshot files
in 1.6 seconds with a warm cache (32 seconds from cold).
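
Roughly, with the paths from your fio config substituted in (the filefrag
run is just an illustration of how I'm counting extents):

  # create the file sparse up front so fio skips its write() layout pass
  $ truncate -s 4g $DST/testfile
  # ... run the overwrite/snapshot loop, then count extents via FIEMAP:
  $ filefrag $DST/testfile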

The sheer size of the metadata does prevent the whole thing from being
cached in RAM, at least on a 32G machine.

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-24 21:43   ` Dave Chinner
@ 2021-01-30  1:03     ` Zygo Blaxell
  0 siblings, 0 replies; 16+ messages in thread
From: Zygo Blaxell @ 2021-01-30  1:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-btrfs

On Mon, Jan 25, 2021 at 08:43:46AM +1100, Dave Chinner wrote:
> On Sat, Jan 23, 2021 at 07:19:03PM -0500, Zygo Blaxell wrote:
> > On Fri, Jan 22, 2021 at 09:20:51AM +1100, Dave Chinner wrote:
> > > Hi btrfs-gurus,
> > > 
> > > I'm running a simple reflink/snapshot/COW scalability test at the
> > > moment. It is just a loop that does "fio overwrite of 10,000 4kB
> > > random direct IOs in a 4GB file; snapshot" and I want to check a
> > > couple of things I'm seeing with btrfs. fio config file is appended
> > > to the email.
> 
> ....
> 
> > >  --
> > >  snapshot 945 15.32
> > >  snapshot 946 0.05
> > >  snapshot 947 15.52
> > >  --
> > >  snapshot 971 15.30
> > >  snapshot 972 0.03
> > >  snapshot 973 14.88
> > > 
> > > It seems .... unlikely that random snapshots of exactly the same
> > > repeating workload have such a variance in execution time. And then
> > 
> > btrfs delays a lot of metadata updates (millions, if you have enough
> > memory) and then runs them in giant batches during commits, so they can
> > show up as latency spikes at random times while you're benchmarking.
> > That is likely part of what is happening here.
> 
> Evidence points to this being exactly the problem - multiple
> gigabytes of page cache dirtying at writeback points leading to
> memory pressure and huge amounts of physical IO being issued. A
> single CPU running this workload basically stalls a kernel on a mostly idle
> 32GB/32p machine with 150,000 random 4kB write IOPS capability for
> seconds at a time.
> 
> > The current behavior is something of a regression--there used to be a
> > latency feedback loop to avoid queueing up too many metadata updates
> > before throttling the processes that were generating the updates.
> > It's not clear that simply reverting that change is a good way forward.
> 
> I see. Rock and a hard place. Am I correct in assuming that I
> shouldn't expect a fix for either the excessive metadata writeback
> bandwidth or the non-deterministic system-wide behaviour that
> results from it any time soon?

Josef did some investigation into it a year ago based on some of my
test cases.  There were some simple patches that made big improvements,
but they uncovered other problems.  Fixes for those are working their
way through the dev pipeline.

Part of the problem in your test case is just the sheer size of the
metadata.  That's unlikely to change soon, but you could throw RAM at
the problem.

> > > ---
> > > fio loop 971:   write: IOPS=8539, BW=33.4MiB/s (34.0MB/s)(39.1MiB/1171msec); 0 zone resets
> > > fio loop 972:   write: IOPS=579, BW=2317KiB/s (2372kB/s)(39.1MiB/17265msec); 0 zone resets
> > > fio loop 973:   write: IOPS=6265, BW=24.5MiB/s (25.7MB/s)(39.1MiB/1596msec); 0 zone resets
> > > ---
> > > 
> > > 
> > > In these instances, fio takes about as long as I would expect the
> > > snapshot to have taken to run. Regardless of the cause, something
> > > looks to be broken here...
> > > 
> > > An astute reader might also notice that fio performance really drops
> > > away quickly as the number of snapshots goes up. Loop 0 is the "no
> > > snapshots" performance. By 10 snapshots, performance is half the
> > > no-snapshot rate. By 50 snapshots, performance is a quarter of the
> > > no-snapshot performance. It levels out around 6-7000 IOPS, which is
> > > about 15% of the non-snapshot performance. Is this expected
> > > performance degradation as snapshot count increases?

> > Somewhere in that workload, there's probably a pretty high write
> > multiplier for unsharing subvol metadata pages in snapshots.  The worst
> > case is about 300x write multiply for the first few pages after a snapshot
> > is created, because every item referenced on the shared subvol metadata
> > page (there can be 150-300 of them) must have a new backreference added
> > to the newly created unshared metadata page, and in the worst case every
> > one of those new items lives in a separate metadata page that also
> > has to be read, modified, and written.  The write multiplier rapidly
> > levels off to 1x once all the snapshot's metadata pages are unshared,
> > after random writes to around 0.3% of the subvol.  So writing 4K to a
> > file in a subvol right after a snapshot was taken could hit the disks
> > with up to 20MB of random read and write iops before it's over.
> 
> That's .... really bad, but it tallies with 10,000 4kB data writes
> triggering 10GB of dirty metadata pages and GB/s of write bandwidth.
> 
> Given this is how the snapshot+COW algorithm is designed, I'm having
> trouble seeing how this problem could be mitigated. Am I correct in
> assuming that this level of write amplification as snapshot cycles
> increase "is what it is"?

I don't see any way to get rid of that write multiplication case without
making incompatible disk format changes, and maybe rethinking the entire
snapshot concept.

In normal workloads (even yours), the write multiplication burst ends
pretty quickly, but you also keep creating new snapshots and starting
new bursts.

It's more of a concern for user desktops that run backups overnight.
Things run many times slower as the user logs in in the morning,
and unshares metadata in $HOME.  By the time the user has logged in,
the write multiplication is mostly over.

> > Fragmentation pushes everything toward the worst-case scenario because it
> > spreads the referenced items around to separate pages, which could explain
> > the asymptotic performance curve for snapshots.
> 
> The performance continues to worsen long after the per-file extent
> count maxes out at just over 1 million (about cycle 100). So it
> seems more to be related to the metadata overhead of the subvol, not
> so much the individual file.
> 
> FWIW, concentrating on "it's a single file with lots of extents"
> misses the bigger picture of "it's a compact simulation of a subvol
> with tens of thousands of files in it and being randomly updated by
> the production workload between snapshots". IOWs, I'm using this
> workload to perform accelerated aging on a constantly modified
> filesystem under a rolling snapshot regime. A snapshot every few
> minutes, 24x7, is ~5-10,000 snapshots a year. I'm compressing the
> modification time domain down from 5-10 minutes to a few hundred
> milliseconds so I can run thousands of iterations a day and hence
> see what happens over a period of months in a couple of hours...

I'm thinking of the effects of free space fragmentation here, not
individual file fragmentation, i.e. objects get spread out over the
disk because they are landing in free space holes that are spread out
over the disk.  These objects have related items that are closely packed
in some trees (subvols) and sparsely packed in other trees (extent, csum)
because one tree is keyed by logical address and the other by physical.

There's not much difference in btrfs between a lot of small files and
a few big ones--in subvol metadata, they get densely packed into btree
pages either way.  A whole directory tree of files can live on one page.

> > Without fragmentation,
> > all the referenced items tend to appear on the same or at least a few
> > adjacent pages, so the unsharing cost is much lower.  It's the same
> > number of pages to unshare whether it's 1 snapshot or 1000, but the
> > referenced items will get spread around a lot after 1000 iterarations
> > of that fio loop.
> 
> Yeah, sure, but when the data is not physically located (such as a
> large dataset in a subvol), you get the random overwrite behaviour
> this test exercises....
> 
> > Reflinks don't share metadata pages, so they don't have this problem
> > (except when the dst of the reflink is modifying metadata pages that are
> > shared with a snapshot, like any other write).
> 
> Evidence suggests that they do have the same problem.  i.e. this:
> 
> > > And before you ask, reflink copies of the fio file rather than
> > > subvol snapshots have largely the same performance, IO and
> > > behavioural characteristics. The only difference is that clone
> > > copying also has a cyclic FIO performance dip (every 3-4 cycles)
> > > that corresponds with the system driving hard into memory reclaim
> > > during periodic writeback from btrfs.
> 
> would be explained by the same metadata COW explosion after a
> reflink on the per-inode extent tree, rather than full subvol
> tree. Is that the case?

I don't think so.  The metadata COW explosion would explain why reflinks
are _faster_ than snapshots, which I think you said you didn't observe.
(Are we talking about the 300x write multiplication explosion for
fresh snapshots here?  There are arguably multiple events that could be
described as "metadata explosion" in this test case.)

The 300x metadata write multiplication transforms a snapshot (immutable
shared subvol metadata) into a reflink copy (mutable unshared subvol
metadata).  If the file is a reflink copy to start with, then there's
no need to do any more work to make it one.

> > > Oh, I almost forget - FIEMAP performance. After the reflink test, I
> > > map all the extents in all the cloned files to a) count the extents
> > > and b) confirm that the difference between clones is correct (~10000
> > > extents not shared with the previous iteration). Pulling the extent
> > > maps out of XFS takes about 3s a clone (~850,000 extents), or 30
> > > minutes for the whole set when run serialised. btrfs takes 90-100s
> > > per clone - after 8 hours it had only managed to map 380 files and
> > > was running at 6-7000 read IOPS the entire time. IOWs, it was taking
> > > _half a million_ read IOs to map the extents of a single clone that
> > > only had a million extents in it. Is it expected that FIEMAP is so
> > > slow and IO intensive on cloned files?
> > 
> > There were severe performance issues with FIEMAP (or anything else that
> > does backref lookup) on kernels before 5.7, especially on files bigger
> > than a few hundred MB (among other things, it was searching the entire
> > file for matching forward ref instead of just around the area where the
> > backref was).  FIEMAP looks at backrefs to populate the 'shared' bit,
> > so it was affected by this bug.
> 
> I'm testing on a vanilla 5.10 kernel, so this bug should not be
> present in the kernel.
> 
> > There might still be a big IO overhead for backref search on current
> > kernels.  The worst case is some gigabytes of metadata pages for extent
> > references, if every referencing item ends up stored on its own metadata
> > page, and if FIEMAP has to read many of them before it finds a reference
> > that matches the logical file offset so it can set or clear the 'shared'
> > bit.
> 
> So it's at least a quadratic complexity algorithm?

Worst case is O(number_of_reflinks * number_of_snapshots * btree_depth)
for each query (which for FIEMAP would be the first block in each
logical extent reference).

Some of the backref items come with CPU search overheads because the
backrefs sometimes point to pages, not individual items.

btree depth is O(log(n)) random seeks on btrees which are probably far
more expensive than the CPU, so the CPU term won't show up in O-notation.

If your extent references mostly overlap the same physical blocks, they
will short-circuit some computations for FIEMAP.  e.g. if a physical
extent has 32768 references, but they are all 4K and none overlap, then
FIEMAP will loop all 32768 times to determine a block is not SHARED.
A second example: if an extent has 1000 references but they all refer to
the entire physical extent, then the loop will terminate on the second
iteration.

I'd call it O(N * log(n)) for each extent, with some noticeably large
constant terms, assuming there aren't any silly bugs left.  There are
a lot of variables that can affect the performance.

There's no caching, so every FIEMAP call forgets everything it learns
before the next FIEMAP call.  I don't know if it even caches the metadata
pages it reads.

> > I'm not sure the worst case is even bounded--you could have billions of
> > references to an extent and I don't know of any reason why you couldn't
> > fill a disk with them (other than btrfs getting too slow to finish before
> > the disk crumbles to dust).
> 
> Hmmm. This also sounds like a result of the way btrfs is physically
> structured. Am I correct to assume this behaviour won't change any
> time soon?

FIEMAP, a foreign ioctl that originated on an alien filesystem, has
three basic problems on btrfs:

1.  On btrfs, physical extents and logical extent references have separate
offsets and sizes.  FIEMAP provides for only one size and offset in struct
fiemap_extent, so the btrfs version discards the physical size information
and adds the physical offset to the physical extent's base address.
This makes FIEMAP useless for analyzing physical extent sharing on btrfs
in any case where physical.offset != 0 or physical.size != logical.size
(i.e. when you use dedupe, reflinks, snapshots, compression, prealloc,
nodatacow, or just partially overwrite extents in files).

2.  Physical extent sharing in btrfs is a per-block property, not a
per-extent property as understood by FIEMAP.  The btrfs implementation
of FIEMAP doesn't break up logical extent records into contiguous
SHARED/not-SHARED pieces, so when the struct fiemap_extent describes more
than one logical block, the SHARED bit is not necessarily accurate for
the entire logical extent.

3.  btrfs doesn't maintain backref information to the precision required
to answer FIEMAP queries trivially.  In common cases the forward
reference information can be changed without needing to update the
backward reference information at all, even though the forward reference
changes location or is duplicated.

The obvious fix for item 1 is to add the two missing fields to struct
fiemap_extent.  The fix for item 2 is less obvious:  we could add a
fm_flag that says "I don't care about sharing (or I'll compute my own
sharing), please don't waste my time on SHARED thanks" or "I really
care about sharing, please split up logical extents into contiguous
SHARED/not-SHARED and do all the extra work to calculate that accurately"
or "I only care about definitely-not-shared blocks, you can have a
few false positives for speed" so btrfs FIEMAP users could control the
cost vs quality of FIEMAP's output.  It's not obvious that item 3 even
needs fixing.

If we make those changes, we get an ioctl that looks very much like
TREE_SEARCH_V2 in the low-cost mode.  Applications that need accurate
physical sharing information on btrfs already use TREE_SEARCH_V2, so
FIEMAP does not get much attention on btrfs.

FIEMAP is the equivalent of looking at btrfs through a funhouse mirror.
FIEMAP on btrfs is slow and full of lies.

> > TREE_SEARCH_V2 doesn't have a 'shared' bit to populate, so it runs _much_
> > faster than FIEMAP.
> 
> Assuming I don't need to know about shared extents. The whole point
> of using fiemap here is to be able to look at the shared extents in
> the file. That's a diagnostic we use in the field for analysing
> problems with reflink copied files on XFS, so I see no reason why we
> wouldn't need that information on btrfs. So using a special ioctl
> that doesn't provide shared extent visibility is not a viable
> solution here...

In your specific case, FIEMAP doesn't indicate that the prealloc at the
beginning of your test created thousands of non-overlapping 4K references
to 128MB extents in every snapshot, because there's no place to put that
information in its output.

TREE_SEARCH_V2 makes this obvious, it shows every tiny extent is a slice
of a few unique huge extents.  I don't even need to process the output
to see that--the same physical extent base address appears over and over
again on every other line, with only the physical offset changing from
one logical extent to the next.

compsize calculates block-level reference overlap in less than a second
per file (at least until the reference data gets larger than memory).
It doesn't report any of that information filefrag-style, but it does
produce the data internally.
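
For example (path hypothetical; this is just the stock compsize tool):

  # totals of on-disk vs. referenced bytes for the given files, taking
  # shared extents into account
  $ compsize /mnt/testfs/testdir/testfile

It reads the extent items through TREE_SEARCH_V2 rather than FIEMAP,
which is why it can finish in seconds.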

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-29 23:25     ` Zygo Blaxell
@ 2021-02-02  0:13       ` Dave Chinner
  2021-02-12  3:04         ` Zygo Blaxell
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2021-02-02  0:13 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Qu Wenruo, linux-btrfs

On Fri, Jan 29, 2021 at 06:25:50PM -0500, Zygo Blaxell wrote:
> On Mon, Jan 25, 2021 at 09:36:55AM +1100, Dave Chinner wrote:
> > On Sat, Jan 23, 2021 at 04:42:33PM +0800, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2021/1/22 上午6:20, Dave Chinner wrote:
> > > > Hi btrfs-gurus,
> > > > 
> > > > I'm running a simple reflink/snapshot/COW scalability test at the
> > > > moment. It is just a loop that does "fio overwrite of 10,000 4kB
> > > > random direct IOs in a 4GB file; snapshot" and I want to check a
> > > > couple of things I'm seeing with btrfs. fio config file is appended
> > > > to the email.
> > > > 
> > > > Firstly, what is the expected "space amplification" of such a
> > > > workload over 1000 iterations on btrfs? This will write 40GB of user
> > > > data, and I'm seeing btrfs consume ~220GB of space for the workload
> > > > regardless of whether I use subvol snapshot or file clones
> > > > (reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
> > > > wondering if this is expected or whether there's something else
> > > > going on. XFS amplification for 1000 iterations using reflink is
> > > > only 1.4x, so 5.5x seems somewhat excessive to me.
> 
> Each iteration produces a little under 80MB of metadata (the forward
> and backward refs are pretty big compared to the size of 4K data blocks,
> a little under 10% of the size).  You're writing randomly over 0.3% of
> the subvol (4GB / 40MB = about 1%, plus or minus random) so each snapshot
> unshares most of its metadata pages and degenerates into reflink copies.
> That works out to a little under 80GB of metadata by the time the 1000
> snapshots are created.
> 
> If you have dup metadata, multiply metadata size by 2.  Add the original
> 44GB of data (the extra 4G is because of prealloc) to the metadata size
> assuming dup, and there's 204GB, not too far away from 220.
> 
> > > This is mostly due to the way btrfs handles COW and the lazy extent
> > > freeing behavior.
> > > 
> > > For btrfs, an extent only get freed when there is no reference on any
> > > part of it, and
> > > 
> > > This means, if we have an file which has one 128K file extent written to
> > > disk, and then write 4K, which will be COWed to another 4K extent, the
> > > 128K extent is still kept as is, even the no longer referred 4K range is
> > > still kept there, with extra 4K space usage.
> > 
> > That's not relevant to the workload I'm running. Once it reaches
> > steady state, it's just doing 4kB overwrites of shared 4kB extents.
> 
> Actually it is relevant, because that's _not_ what your workload is doing.
> 
> Despite having 'prealloc=0' in fio_config, fio preallocates the testfile.

I'm using fallocate=none, so file layout uses write(), not
fallocate() to preallocate unwritten extents. I explicitly chose
this so that there were no interactions with unwritten extents in the
workload and it is therefore a pure overwrite test.

strace output of fio laying out the base file in iteration 0 before
any overwrites are done:

155715 unlink("/mnt/scr/testdir/testfile") = -1 ENOENT (No such file or directory)
155715 openat(AT_FDCWD, "/mnt/scr/testdir/testfile", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 5
155715 ftruncate(5, 4294967296)         = 0
155715 write(5, "\37\256\233\346J EH\303\365\303\205\2730\374\37\270\376\326u5\3036\17\327\337^/{\35T\t"..., 4096) = 4096
155715 write(5, "\341\7\3477j\36,\30\374`\f\303\331g\267\6\37\214\343\7u\v\316\4\203\361\354\332|=\370\23"..., 4096) = 4096
155715 write(5, "\250\241\264\10\27\375*\0275\224B\220H\333\361\5\206\322\355\307G\363L\27P\272\272\17r\272o\1"..., 4096) = 4096
155715 write(5, "\261P=\300\371\303fN\26*\257\217`\370f\4B\345\352|4\227v\35\250\\\374\34q\224b\24"..., 4096) = 4096
155715 write(5, "\222Z\340\254\335\37R,R\vS\210\213m\31\23jaa\353\207i[\16-,\267L\200wm\4"..., 4096) = 4096
....
155715 write(5, "z\306bv\3\322$>\317X\217\v\17\337\230\25\31k\n5\32\226\24\31c\315\24\21>x\204\23"..., 4096) = 4096
155715 write(5, "\215w\352\252FN\332b\361\316\226\231S\364\322\n\336Y\272\v\277\221\207\7;K\210\264Zl\243\r"..., 4096) = 4096
155715 write(5, "`\201Qe\331L$\5,0\372k\362\326\327\v\5Fi5\341\3045\f\300\250\252\203\347\35\371\v"..., 4096) = 4096
155715 fsync(5 <unfinished ...>
....
155715 <... fsync resumed> )            = 0
155715 fadvise64(5, 0, 4294967296, POSIX_FADV_DONTNEED <unfinished ...>
....
155715 <... fadvise64 resumed> )        = 0
155715 close(5)                         = 0
....
155715 stat("/mnt/scr/testdir/testfile", {st_mode=S_IFREG|0644, st_size=4294967296, ...}) = 0

> That triggers btrfs's preallocation behavior for datacow extents: every
> unshared block is written in place, inside the original 128MB prealloc
> extents.

I still think this is irrelevant, because there are no preallocated
extents created by fio. Hence this whole analysis based on the
initial file being laid out with preallocated unwritten extents
looks wrong to me, unless btrfs is shooting itself in the foot and
doing preallocation of unwritten space behind the scenes for
sequential writes into a sparse file.

> So you are not creating a million 4K extents with an average of just
> under 500 refs each (1 to 1000 snapshots minus some that get overwritten
> at random).  You are creating 32 128M extents, with an average of around
> 16 million shared references each (32768 reflinks * 500 snapshots on
> average, minus a little for random overlap).
> 
> By the time you look at these extents with FIEMAP, FIEMAP is stuck
> potentially running tens of trillions of iterations trying to fill in
> the "SHARED" bit for millions of extents.

Yup, that's pretty bad. Is there any plan to fix this?

> Also, because you're doing prealloc on a datacow file, you are
> taking a hit to calculate the block sharing on the writes, too.
> Every write that lands on the prealloc extent has to check to see
> if the written block overlaps any other written block in the same
> extent, and that's a shared reference check.  Overwrites don't
> need this check, so performance might level out or even get better
> toward the end of the test as the number of references to the
> original 128M extents starts to fall off.
> 
> The same thing happens with reflink copies, except that the nested
> loop over the 500 * 32768 extent refs to detect sharing moves some
> parts to the inner loop (with deeper metadata tree walks) and some
> to the outer loop when there are snapshots.  It'll affect the
> timing of FIEMAP and the prealloc writes.
> 
> IMHO, PREALLOC should be ignored for all datacow files on btrfs.
> It can't do things people expect with a datacow file (in
> particular the ENOSPC guarantee is only possible for the first
> write), and it does a bunch of
> expensive, counterintuitive stuff that people don't expect.

So what you are saying is that preallocated extents in btrfs are
compromised from an architectural POV and are largely
unfixable?

> PREALLOC is useful for nodatacow files and does implement expected
> behavior, but it should only be used on those.

Yeah, that's not an option for general-use filesystems
that run applications that use fallocate() for preallocation, where the
user either doesn't know about it or cannot turn it off.

> > > > Next, subvol snapshot and clone time appears to be scale with the
> > > > number of snapshots/clones already present. The initial clone/subvol
> > > > snapshot command take a few milliseconds. At 50 snapshots it take
> > > > 1.5s. At 200 snapshots it takes 7.5s. At 500 it takes 15s and at
> > > > > 850 it seems to level off at about 30s a snapshot. There are
> > > > outliers that take double this time (63s was the longest) and the
> > > > variation between iterations can be quite substantial. Is this
> > > > expected scalablity?
> > > 
> > > The snapshot will make the current subvolume to be fully committed
> > > before really taking the snapshot.
> > > 
> > > Considering above metadata overhead, I believe most of the performance
> > > penalty should come from the metadata writeback, not the snapshot
> > > creation itself.
> > > 
> > > If you just create a big subvolume, sync the fs, and try to take as many
> > > snapshot as you wish, the overhead should be pretty the same as
> > > snapshotting an empty subvolume.
> > 
> > The fio workload runs fsync at the end of the overwrite, which means
> > all the writes and the metadata needed to reference it *must* be on
> > stable storage. 
> 
> That is not how btrfs fsync works, and your assertions that follow from
> this misunderstanding are also wrong.

I suspect you misunderstand what I said.

"metadata on stable storage" for a journalling filesystem means
"stable in the journal", not at it's final resting place. I'll snip
your description of the btrfs fsync journal because I know that
btrfs does this and why it was implemented the way it was and not the
way WAFL or ZFS solved the same "COW metadata is expensive
for fsync()" problem...

> In your workload, the fsync() doesn't do anything useful--it flushes
> out a few MB of data blocks and a breadcrumb trail of journal commands.
> When you call snapshot create, everything that was previously deferred
> has to be turned into a concrete filesystem tree, both data and metadata.
> The snapshot create is paying for all the work fsync() avoided.

Yup, that's the same as other filesystems. In the case of XFS, a
filesystem freeze before a device snapshot performs the metadata
writeback.


> In the rare cases where fsync() happens to run at the same time as a
> transaction commit (or maybe just before), the transaction commit and
> the fsync() get synchronized by trying to touch the same locks, and
> return at close to the same time.  In those cases, the snapshot only
> has to write out a new subvol root and some free space map changes,
> which takes 0.02s.

Ok, so there are internal transaction commit/metadata COW lock-step
conflicts in btrfs? I'm guessing that they can hit anything that
runs fsync() and at any time? i.e. non-deterministic long tail
latencies can hit at any time?

> > > The short snapshot creation time means those snapshot creations just wait
> > > for the same transaction to be committed, thus they don't need to wait
> > > for the full transaction commitment, just need to do the snapshot.
> > 
> > That doesn't explain why fio sometimes appears to be running much
> > slower than at other times. Maybe this implies a fsync() bug w.r.t.
> > DIO overwrites and that btrfs should always be running the fio
> > workload at 500-1000 iops and snapshots should always run at 0.02s
> > 
> > IOWs, the problem here is the inconsistent behaviour: the workload
> > is deterministic and repeats in exactly the same way every time, so
> > the behaviour of the filesystem should be the same for every single
> > iteration. Snapshot should either always take 0.02s and fio is
> > really slow, or fio should be fast and the snapshot really slow
> > because the snapshot has wider "metadata on stable storage"
> > requirements than fsync. The workload should not be swapping
> > randomly between the two behaviours....
> 
> There is a transaction commit on a periodic timer, every 30 seconds
> by default.  If it doesn't compete for locks and block the fio process,
> it will at least compete for disk bandwidth and slow it down.  The amount
> of work the snapshot create has to do in your test case is dominated by
> the amount of delayed ref work queued up between the end of the previous
> periodic commit and the start of the snapshot create (your metadata
> outnumbers your data by 4 to 1).  This timing will be nondeterministic
> in your test setup.
> 
> If you mount with -o commit=999999999 (or some sufficiently large value)
> then you'll get more determinism, as all the transaction commits will
> then be triggered by memory pressure and your snapshot creates.

I doubt that aggregating more changes in memory will improve
determinism. Sure, it might delay the interference for some time,
but then the machine will eventually be out of memory. That can be
triggered by a user allocation and so the user will now complain about
long application stalls due to kernel memory reclaim....

Besides, I'm not trying to tune out bad behaviours - I'm trying to
learn where the bad behaviours lie and how easy they are to
trigger...

> > > > In these instances, fio takes about as long as I would expect the
> > > > snapshot to have taken to run. Regardless of the cause, something
> > > > looks to be broken here...
> > > > 
> > > > An astute reader might also notice that fio performance really drops
> > > > away quickly as the number of snapshots goes up. Loop 0 is the "no
> > > > snapshots" performance. By 10 snapshots, performance is half the
> > > > no-snapshot rate. By 50 snapshots, performance is a quarter of the
> > > > no-snapshot performance. It levels out around 6-7000 IOPS, which is
> > > > about 15% of the non-snapshot performance. Is this expected
> > > > performance degradation as snapshot count increases?
> 
> Performance immediately following a snapshot is expected to degrade as
> the number of distinct parent or child pages referenced by a metadata
> page increases.  This is not the same thing as snapshot count.
> 
> Your test is causing both to increase at the same time, and also keeping
> the testfile in the "immediately following a snapshot" state.

Of course it does this - I'm compressing the time for all the
overwrites down from minutes or hours into seconds. It doesn't
change the amount of COW or metadata that needs to be updated if the
writes are spread out over 10 minutes or an hour, it just means the
metadata updates are less of a limiting factor.

> If you have 1000 snapshots and your writes have high metadata locality
> (e.g. you are appending to a single log file in each snapshot) then
> the write multipliers are very close to 1.0x.  If you have low metadata
> locality, even one snapshot will be followed by a big write multiplication
> burst.

Yup, so the general use case with snapshots is low data and metadata
write locality, as most files tend to get written once and then not
touched again. Hence seeing 10,000 individual random data overwrites
in a snapshot over a snapshot epoch is a fair estimation of what we
might see with a rolling snapshot every X minutes.

> > > No, this is mostly due to the exploding amount of metadata caused by the
> > > near-worst case workload.
> 
> Every 2 orders of magnitude more metadata items increases the O(log(N))
> costs of btrfs by one unit.  By 50 snapshots or reflinks you have hundreds
> of millions of metadata items, it's 6x slower and not increasing very
> much any more...not too far off what we'd expect.

Ok, so the problem is the exponential cost of maintaining all the
cross-btree references, not the btree itself. So it really is an
architectural issue and not something that can be fixed?

> One problem with this theory is that we'd expect the same behavior for
> reflinks too, so it might not be correct.

Reflinks show similar behaviour, just that they don't have "stop
the world" metadata writeback points like a snapshot has.

> > > Yeah, btrfs is pretty bad at handling small dio writes, which can easily
> > > explode the metadata usage.
> > > 
> > > Thus for such dio case, we recommend to use preallocated file +
> > > nodatacow, so that we won't create new extents (unless snapshot is
> > > involved).
> > 
> > Big picture. This is an accelerated aging test, not a production
> > workload. Telling me how to work around the problems associated with
> > 4kB overwrite (as if I don't already know about nodatacow and all
> > the functionality you lose by enabling it!) doesn't make the
> > problems with increasing snapshot counts that I'm exposing go away.
> 
> I'm familiar with this workload.  I've been running something similar to
> your target workload since 2014.  We build NAS backup appliance boxes:
> each has about 100 client subvols ranging in size from 1GB to 10TB,
> thousands to millions of files each, 1-5% daily turnover.
> 
> Multiple snapshots per hour at this scale is a really ambitious target
> for btrfs.  We can theoretically do somewhere between 15 and 180 snapshot
> rotates per day before the machine starts falling behind on the deletes
> and running out of space.  Snapshot create and delete on btrfs come
> with giant unbounded latency spikes, so we don't run them all the time.
> We'll create snapshots any time a client finishes an update, but we only
> delete old snapshots to recover disk space during a 3-hour maintenance
> window.
>
> While the snapshot rotates are happening, btrfs leaves CPU cores and
> disks idle.  Current performance is far from the theoretical limits.

Which, IMO, is kinda sad because zero-cost snapshot-based
workloads/workflows are what btrfs was specifically intended to
provide to users...

> There is some active development in this area, especially in the last
> year.  Several improvements happened in 2020, including a few silly bug
> fixes of the form "don't make two threads fight each other for locks"
> and "don't forget to wake up some important background process because
> we optimized away some trigger event it was waiting for."

But the problems of scalability don't look like bugs to me - the
exponential explosion of metadata objects when snapshots and
overwrites occur looks more like a fundamental architectural problem
than a minor implementation bug here or there.

> > > > Oh, I almost forget - FIEMAP performance. After the reflink test, I
> > > > map all the extents in all the cloned files to a) count the extents
> > > > and b) confirm that the difference between clones is correct (~10000
> > > > extents not shared with the previous iteration). Pulling the extent
> > > > maps out of XFS takes about 3s a clone (~850,000 extents), or 30
> > > > minutes for the whole set when run serialised. btrfs takes 90-100s
> > > > per clone - after 8 hours it had only managed to map 380 files and
> > > > was running at 6-7000 read IOPS the entire time. IOWs, it was taking
> > > > _half a million_ read IOs to map the extents of a single clone that
> > > > only had a million extents in it. Is it expected that FIEMAP is so
> > > > slow and IO intensive on cloned files?
> > > 
> > > Exploding fragments, definitely needs a lot of metadata read, right?
> > 
> > Well, at 1000 files, XFS does zero metadata read IO because the
> > extent lists for all 1000 snapshots easily fit in RAM - about 2GB of
> > RAM is needed, and that's the entire per-inode memory overhead of
> > the test. Hence when the fiemap cycle starts, it just pulls all this
> > from RAM and we do zero metadata read IO.
> 
> If, before you start the test, you run 'truncate -s 4g testfile', so
> that fio doesn't preallocate the file, things behave somewhat better,
> though "better" for 80GB of metadata is still pretty awful.

Yup, that's exactly how I have fio configured to behave. So the 80GB
of metadata is for what you consider to be the "better" case.

> If I run the test without the prealloc, filefrag takes about 4.5 seconds
> to iterate 1044066 extents from a cold cache, and does 10 snapshot files
> in 1.6 seconds with a warm cache (32 seconds from cold).

Yup, that cold cache behaviour is awful.

> The sheer size of the metadata does prevent the whole thing from being
> cached in RAM, at least on a 32G machine.

Yup, that's one of the problems I originally reported.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Unexpected reflink/subvol snapshot behaviour
  2021-01-21 22:20 Unexpected reflink/subvol snapshot behaviour Dave Chinner
  2021-01-23  8:42 ` Qu Wenruo
  2021-01-24  0:19 ` Zygo Blaxell
@ 2021-02-02  2:14 ` Darrick J. Wong
  2021-02-02  6:02   ` Dave Chinner
  2 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2021-02-02  2:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-btrfs, xfs

On Fri, Jan 22, 2021 at 09:20:51AM +1100, Dave Chinner wrote:
> Hi btrfs-gurus,
> 
> I'm running a simple reflink/snapshot/COW scalability test at the
> moment. It is just a loop that does "fio overwrite of 10,000 4kB
> random direct IOs in a 4GB file; snapshot" and I want to check a
> couple of things I'm seeing with btrfs. fio config file is appended
> to the email.
> 
> Firstly, what is the expected "space amplification" of such a
> workload over 1000 iterations on btrfs? This will write 40GB of user
> data, and I'm seeing btrfs consume ~220GB of space for the workload
> regardless of whether I use subvol snapshot or file clones
> (reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
> wondering if this is expected or whether there's something else
> going on. XFS amplification for 1000 iterations using reflink is
> only 1.4x, so 5.5x seems somewhat excessive to me.
> 
> On a similar note, the IO bandwidth consumed by btrfs is way out of
> proportion with the amount of user data being written. I'm seeing
> multiple GBs being written by btrfs on every iteration - easily
> exceeding 5GB of writes per cycle in the later iterations of the
> test. Given that only 40MB of user data is being written per cycle,
> there's a write amplification factor of well over 100x ocurring
> here. In comparison, XFS is writing roughly consistently at 80MB/s
> to disk over the course of the entire workload, largely because of
> journal traffic for the transactions run during COW and clone
> operations.  Is such a huge amount of of IO expected for btrfs in
> this situation?

<just gonna snip this part>

> FYI, I've compared btrfs reflink to XFS reflink, too, and XFS fio
> performance stays largely consistent across all 1000 iterations at
> around 13-14k +/-2k IOPS. The reflink time also scales linearly with
> the number of extents in the source file and levels off at about
> 10-11s per cycle as the extent count in the source file levels off
> at ~850,000 extents. XFS completes the 1000 iterations of
> write/clone in about 4 hours, btrfs completes the same part of the
> workload in about 9 hours.

Just out of curiosity, do any of the patches in [1] improve those
numbers for xfs?  As you noted a long time ago, the transaction
reservations are kind of huge, so I fixed those and shook out a few
other warts while I was at it.

--D

[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=reflink-speedups
> 
> Oh, I almost forget - FIEMAP performance. After the reflink test, I
> map all the extents in all the cloned files to a) count the extents
> and b) confirm that the difference between clones is correct (~10000
> extents not shared with the previous iteration). Pulling the extent
> maps out of XFS takes about 3s a clone (~850,000 extents), or 30
> minutes for the whole set when run serialised. btrfs takes 90-100s
> per clone - after 8 hours it had only managed to map 380 files and
> was running at 6-7000 read IOPS the entire time. IOWs, it was taking
> _half a million_ read IOs to map the extents of a single clone that
> only had a million extents in it. Is it expected that FIEMAP is so
> slow and IO intensive on cloned files?
> 
> As there are no performance anomalies or memory reclaim issues with
> XFS running this workload, I suspect the issues I note above are
> btrfs issues, not expected behaviour.  I'm not sure what the
> expected scalability of btrfs file clones and snapshots are though,
> so I'm interested to hear if these results are expected or not.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> JOBS=4
> IODEPTH=4
> IOCOUNT=$((10000 / $JOBS))
> FILESIZE=4g
> 
> cat >$fio_config <<EOF
> [global]
> name=${DST}.name
> directory=${DST}
> size=${FILESIZE}
> randrepeat=0
> bs=4k
> ioengine=libaio
> iodepth=${IODEPTH}
> iodepth_low=2
> direct=1
> end_fsync=1
> fallocate=none
> overwrite=1
> number_ios=${IOCOUNT}
> runtime=30s
> group_reporting=1
> disable_lat=1
> lat_percentiles=0
> clat_percentiles=0
> slat_percentiles=0
> disk_util=0
> 
> [j1]
> filename=testfile
> rw=randwrite
> 
> [j2]
> filename=testfile
> rw=randwrite
> 
> [j3]
> filename=testfile
> rw=randwrite
> 
> [j4]
> filename=testfile
> rw=randwrite
> EOF
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Unexpected reflink/subvol snapshot behaviour
  2021-02-02  2:14 ` Darrick J. Wong
@ 2021-02-02  6:02   ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2021-02-02  6:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-btrfs, xfs

On Mon, Feb 01, 2021 at 06:14:21PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 22, 2021 at 09:20:51AM +1100, Dave Chinner wrote:
> > Hi btrfs-gurus,
> > 
> > I'm running a simple reflink/snapshot/COW scalability test at the
> > moment. It is just a loop that does "fio overwrite of 10,000 4kB
> > random direct IOs in a 4GB file; snapshot" and I want to check a
> > couple of things I'm seeing with btrfs. fio config file is appended
> > to the email.
> > 
> > Firstly, what is the expected "space amplification" of such a
> > workload over 1000 iterations on btrfs? This will write 40GB of user
> > data, and I'm seeing btrfs consume ~220GB of space for the workload
> > regardless of whether I use subvol snapshot or file clones
> > (reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
> > wondering if this is expected or whether there's something else
> > going on. XFS amplification for 1000 iterations using reflink is
> > only 1.4x, so 5.5x seems somewhat excessive to me.
> > 
> > On a similar note, the IO bandwidth consumed by btrfs is way out of
> > proportion with the amount of user data being written. I'm seeing
> > multiple GBs being written by btrfs on every iteration - easily
> > exceeding 5GB of writes per cycle in the later iterations of the
> > test. Given that only 40MB of user data is being written per cycle,
> > there's a write amplification factor of well over 100x ocurring
> > here. In comparison, XFS is writing roughly consistently at 80MB/s
> > to disk over the course of the entire workload, largely because of
> > journal traffic for the transactions run during COW and clone
> > operations.  Is such a huge amount of of IO expected for btrfs in
> > this situation?
> 
> <just gonna snip this part>
> 
> > FYI, I've compared btrfs reflink to XFS reflink, too, and XFS fio
> > performance stays largely consistent across all 1000 iterations at
> > around 13-14k +/-2k IOPS. The reflink time also scales linearly with
> > the number of extents in the source file and levels off at about
> > 10-11s per cycle as the extent count in the source file levels off
> > at ~850,000 extents. XFS completes the 1000 iterations of
> > write/clone in about 4 hours, btrfs completes the same part of the
> > workload in about 9 hours.
> 
> Just out of curiosity, do any of the patches in [1] improve those
> numbers for xfs?  As you noted a long time ago, the transaction
> reservations are kind of huge, so I fixed those and shook out a few
> other warts while I was at it.

I'll give it a spin, but my initial reaction is "I don't think so".
The workload does not have the concurrency necessary to be
sensitive to log reservation space running out...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Unexpected reflink/subvol snapshot behaviour
  2021-02-02  0:13       ` Dave Chinner
@ 2021-02-12  3:04         ` Zygo Blaxell
  0 siblings, 0 replies; 16+ messages in thread
From: Zygo Blaxell @ 2021-02-12  3:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Qu Wenruo, linux-btrfs

On Tue, Feb 02, 2021 at 11:13:34AM +1100, Dave Chinner wrote:
> On Fri, Jan 29, 2021 at 06:25:50PM -0500, Zygo Blaxell wrote:
> > On Mon, Jan 25, 2021 at 09:36:55AM +1100, Dave Chinner wrote:
> > > On Sat, Jan 23, 2021 at 04:42:33PM +0800, Qu Wenruo wrote:
> > > > 
> > > > 
> > > > On 2021/1/22 上午6:20, Dave Chinner wrote:
> > > > > Hi btrfs-gurus,
> > > > This means, if we have an file which has one 128K file extent written to
> > > > disk, and then write 4K, which will be COWed to another 4K extent, the
> > > > 128K extent is still kept as is, even the no longer referred 4K range is
> > > > still kept there, with extra 4K space usage.
> > > 
> > > That's not relevant to the workload I'm running. Once it reaches
> > > steady state, it's just doing 4kB overwrites of shared 4kB extents.
> > 
> > Actually it is relevant, because that's _not_ what your workload is doing.
> > 
> > Despite having 'prealloc=0' in fio_config, fio preallocates the testfile.
> 
> I'm using fallocate=none, so preallocation uses write(), not
> fallocate() to preallocation unwritten extents. I explicitly chose
> this so that there was no interactions with unwritten extents in the
> workload and it therefore is a pure overwrite test.
> 
> strace output of fio laying out the base file in iteration 0 before
> any overwrites are done:
> 
> 155715 unlink("/mnt/scr/testdir/testfile") = -1 ENOENT (No such file or directory)
> 155715 openat(AT_FDCWD, "/mnt/scr/testdir/testfile", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 5
> 155715 ftruncate(5, 4294967296)         = 0
> 155715 write(5, "\37\256\233\346J EH\303\365\303\205\2730\374\37\270\376\326u5\3036\17\327\337^/{\35T\t"..., 4096) = 4096
> 155715 write(5, "\341\7\3477j\36,\30\374`\f\303\331g\267\6\37\214\343\7u\v\316\4\203\361\354\332|=\370\23"..., 4096) = 4096
> 155715 write(5, "\250\241\264\10\27\375*\0275\224B\220H\333\361\5\206\322\355\307G\363L\27P\272\272\17r\272o\1"..., 4096) = 4096
> 155715 write(5, "\261P=\300\371\303fN\26*\257\217`\370f\4B\345\352|4\227v\35\250\\\374\34q\224b\24"..., 4096) = 4096
> 155715 write(5, "\222Z\340\254\335\37R,R\vS\210\213m\31\23jaa\353\207i[\16-,\267L\200wm\4"..., 4096) = 4096
> ....
> 155715 write(5, "z\306bv\3\322$>\317X\217\v\17\337\230\25\31k\n5\32\226\24\31c\315\24\21>x\204\23"..., 4096) = 4096
> 155715 write(5, "\215w\352\252FN\332b\361\316\226\231S\364\322\n\336Y\272\v\277\221\207\7;K\210\264Zl\243\r"..., 4096) = 4096
> 155715 write(5, "`\201Qe\331L$\5,0\372k\362\326\327\v\5Fi5\341\3045\f\300\250\252\203\347\35\371\v"..., 4096) = 4096
> 155715 fsync(5 <unfinished ...>
> ....
> 155715 <... fsync resumed> )            = 0
> 155715 fadvise64(5, 0, 4294967296, POSIX_FADV_DONTNEED <unfinished ...>
> ....
> 155715 <... fadvise64 resumed> )        = 0
> 155715 close(5)                         = 0
> ....
> 155715 stat("/mnt/scr/testdir/testfile", {st_mode=S_IFREG|0644, st_size=4294967296, ...}) = 0

Hmmm you are of course right, although all the writes get coalesced into
giant extents by delalloc anyway.  The end result looks very similar in
the extent tree, but it doesn't have prealloc bits set in the metadata,
and I somehow missed that detail.

> > By the time you look at these extents with FIEMAP, FIEMAP is stuck
> > potentially running tens of trillions of iterations trying to fill in
> > the "SHARED" bit for millions of extents.
> 
> Yup, that's pretty bad. Is there any plan to fix this?

Not that I know of.  FIEMAP is too limited to be useful with btrfs,
and it's pretty useless to work on FIEMAP performance before those other
issues are resolved because both sets of issues have the same cause.

Arguably, removing the other issues could also fix btrfs FIEMAP
(i.e. change btrfs to force logical and physical extents to have the
same offset and size, and implement a faster reverse lookup table) but
those changes could break several things that currently work well in
btrfs (e.g. adequate FIEMAP handling for compression seems trivially
impossible to implement without separate logical and physical extent
sizes, and snapshots would always incur full reflink costs up front
instead of being able to defer them).

On the other hand, since there's not a prealloc here, this test case is
not hitting the _really_ slow code at all...

> > IMHO, PREALLOC should be ignored for all datacow files on btrfs.
> > It can't do things people expect with a datacow file (in
> > particular the ENOSPC guarantee is only possible for the first
> > write), and it does a bunch of
> > expensive, counterintuitive stuff that people don't expect.
> 
> So what you are saying is that preallocated extents in btrfs are
> compromised from an architectural POV and are largely
> unfixable?

I'm saying fallocate _on datacow files_ in btrfs is broken (as opposed
to nodatacow files where we can just write directly on the allocated
blocks like most other filesystems do).  It just seems obvious to me that
fallocate on datacow could never work well enough to be of practical use.
Here's why:

fallocate makes promises like this one:

	After a successful call to posix_fallocate(), subsequent writes to
	bytes in the specified range are guaranteed not to fail because
	of lack of disk space.

while reflink says the following about fallocate:

	Because a copy-on-write operation requires the allocation
	of new storage, the fallocate(2) operation may unshare shared
	blocks to guarantee that subsequent writes will not fail because
	of lack of disk space.

This seems to say that if the reflink happens first, fallocate will
allocate or reserve duplicate space to implement the no-fail guarantee.

But what happens if the fallocate happens first, and then the reflink?
The doc doesn't say.  There's no guidance in either text about how long
the no-fail guarantee from fallocate lasts, or what events invalidate it,
or what other operations are obligated to maintain it.

One could naively assume that the no-fail guarantee lasts forever, that
reflink will make duplicate space reservations for CoW of fallocated
extents (or even make duplicate copies of the existing data), and that
the reflinks will also provide no-fail guarantees.  The doc doesn't make
any such promises.

Another interpretation is that reflink is allowed to remove the no-fail
guarantee from fallocated extents, and use shared physical storage and
possibly-failing copy-on-write allocations in the future.  The last
stated intent of the user was for shared physical storage of the extent,
so we could argue this doesn't violate the principle of least surprise.
The doc doesn't forbid this, and this is what actually happens on btrfs.
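
As a concrete sketch of the fallocate-then-reflink sequence (paths and
sizes hypothetical, assuming a small scratch btrfs):

  # fallocate on a datacow file, then reflink it
  $ fallocate -l 1G /mnt/scratch/prealloc
  $ cp --reflink=always /mnt/scratch/prealloc /mnt/scratch/clone
  # consume the remaining free space
  $ dd if=/dev/zero of=/mnt/scratch/filler bs=1M
  # overwrite one 4K block inside the preallocated range: it is now a CoW
  # write into a full filesystem, so it can fail with ENOSPC despite the
  # earlier fallocate
  $ dd if=/dev/urandom of=/mnt/scratch/prealloc bs=4k count=1 seek=1000 \
        conv=notrunc oflag=sync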

Every btrfs transaction starts by creating a reflink to every datacow
extent in the filesystem.  A snapshot is created in memory (recall
that snapshots are effectively deferred reflinks), btrfs updates the
snapshot rather than the on-disk data (as required for datacow), and then
swaps the snapshot for the original subvol after flushing, deleting the
on-disk subvol.  btrfs datacow files thus behave as if they always have
a new reflink, which removes the no-fail guarantee immediately after
fallocate() is done.

Without the no-fail guarantee, fallocate is (mostly) meaningless.
It would have been sane to say "lol no, we don't implement fallocate
on datacow files because the concepts are obviously incompatible,"
and stop there.  Back in 2008, someone didn't stop, and today we have
btrfs's broken fallocate emulation for datacow files.

The fallocate emulation is good enough to pass a simple unit test (the
one where we preallocate some empty space, fill up the filesystem, then
write once to the preallocated region, and don't get ENOSPC most of the
time) but it fails a lot:

	- it doesn't reserve metadata space for partially overwritten
	prealloc extents or data csums, so writes to fallocated extents
	that don't overwrite the entire extent at once can return ENOSPC
	at any time,

	- it triggers the expensive block-level sharing check for any
	partial overwrite of a prealloc extent,

	- it triggers a normal extent-level sharing check for any future
	write to the file as long as the file exists (it's a flag in
	the inode that cannot be turned off even if all preallocated
	extents are removed),

	- it forgets the prealloc bit in the extent when data is written
	to a block, so overwrites don't get the no-fail guarantee,

	- it doesn't provide space or metadata annotation for the no-fail
	guarantee on existing data blocks at all,

	- it doesn't implement unsharing or space reservation for
	reflinks or snapshots of fallocated extents,

	- it usually ends up wasting a lot of space in practice, due to
	btrfs's shared ref counting (one byte of reference holds 128MB
	of immutable data on the filesystem),

	- it permanently disables compression on files where fallocate
	is used (due to the inode flag that cannot be unset).

Possibly other problems too; this is just a list of some random issues
users complained about over the years.  Some of these apply to nodatacow
files on btrfs too, but fallocate on nodatacow files can work if there
are a) no reflinks or snapshots and b) all the blocks are filled with
non-zero data so there is no metadata expansion for zero-filling blocks.
i.e. don't use prealloc on btrfs nodatacow files either, just write
filler data to them if you really need the no-fail guarantee to work.
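
A sketch of that workaround, under the stated assumptions (the path,
flag handling, and the 64 MiB size are illustrative, not a drop-in
tool): create the file empty, set FS_NOCOW_FL while it still has no
data, then reserve the space by writing non-zero filler instead of
calling fallocate:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	static char filler[1 << 16];
	int attr = 0;
	off_t off, size = 64 << 20;	/* amount of space to reserve */
	int fd = open("/mnt/btrfs/reserved-file",
		      O_RDWR | O_CREAT | O_EXCL, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* FS_NOCOW_FL only takes effect while the file is still empty */
	if (ioctl(fd, FS_IOC_GETFLAGS, &attr) == 0) {
		attr |= FS_NOCOW_FL;
		if (ioctl(fd, FS_IOC_SETFLAGS, &attr))
			perror("FS_IOC_SETFLAGS");
	}
	memset(filler, 0xff, sizeof(filler));	/* non-zero, so no holes */
	for (off = 0; off < size; off += sizeof(filler))
		if (pwrite(fd, filler, sizeof(filler), off) !=
		    (ssize_t)sizeof(filler)) {
			perror("pwrite");
			return 1;
		}
	if (fsync(fd)) {
		perror("fsync");
		return 1;
	}
	close(fd);
	return 0;
}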

btrfs could fix all those issues:  persistently reserve metadata space
for data csums and partial overwrite extent items, persistently reserve
space for reflinks, persist the fallocate bit (writes guaranteed not
to fail) separately from the prealloc bit (space allocated but not
filled, reads as zero) so data overwrites are guaranteed not to fail,
track subvol-level fallocate usage so we can reserve space when making
snapshots, redesign compression, data csums, and reference sharing.

All of the fixes increase the runtime cost of fallocate, possibly to
insane levels, or they turn datacow files into nodatacow files and violate
btrfs data integrity rules, or they impose write overheads proportional
to the number of reflinks, or the preallocated space is subject to
arbitrary amounts of fragmentation, or they have surprising reserved
space requirements, or snapshot create and reflink grow a bunch of knobs
for controlling whether the snapshot/reflink inherits the shared space
guarantee from its parent, or async data writes have to be journaled,
or something even worse than any of these.

Of course any of these fixes will require incompat bits, because the
current on-disk format is useless for implementing fallocate.

Real users of fallocate tend to be performance-sensitive:  one critical
feature of most fallocate implementations is that writes have zero
overhead in the future because allocation is done in the present.
On btrfs datacow files, normal writes never have zero overhead, and
writes to fallocated extents can have _additional_ overhead compared to
normal writes because of the extra work btrfs does to emulate fallocate.
We can eliminate half of the proposable fallocate fixes because they
have even worse overheads than what we have now.

fallocate doesn't separate the no-fail guarantee from the (mostly
unstated) contiguous allocation expectation (i.e. most users assume
fallocate will try to allocate contiguous space, because most fallocate
implementations do that, but it's not a stated effect of the fallocate
system call).  If we separate these, btrfs could simply reserve space
for all fallocate overwrites in a pool subtracted from free data space
(the same way space for the current transaction is reserved in metadata
space), and snapshots and reflinks can just add to the pool size (or
fail if there's not enough free space for reserved fallocate blocks).
Such allocations would not be contiguous (which isn't guaranteed) but
writes would not fail with ENOSPC (which is guaranteed), so it would
be technically correct; however, the non-contiguous allocation behavior
would be so different from what users expect from other filesystems that
it's questionable whether we should bother.

The one (and possibly only) case where prealloc on datacow is useful
is when writing a file in random order--very carefully, with strictly
page-aligned writes, each page written exactly once.  This reduces
fragmentation, which would otherwise create a pile of metadata roughly 2%
of the size of the data (at 4K page size).  A torrent client could use
this if it was very careful to buffer up partial block writes--if it's
not careful, the resulting prealloc file ends up very badly fragmented,
and a lot of space is wasted.
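
Here is a sketch of what "very careful" means (block size, file size,
and the bookkeeping array are illustrative): every write is one
complete, aligned block at its final offset, written exactly once, and
partial pieces are buffered in memory (not shown) until a whole block
is available:

#define _FILE_OFFSET_BITS 64
#include <stdbool.h>
#include <unistd.h>

#define BLKSZ	4096
#define NBLOCKS	(1024 * 1024)		/* 4 GiB file */

static bool written[NBLOCKS];		/* "exactly once" bookkeeping */

/* Submit one complete, aligned block; refuse repeats and out-of-range
 * indexes.  Returns 0 on success, -1 otherwise. */
int submit_block(int fd, const void *data, unsigned idx)
{
	if (idx >= NBLOCKS || written[idx])
		return -1;
	if (pwrite(fd, data, BLKSZ, (off_t)idx * BLKSZ) != BLKSZ)
		return -1;
	written[idx] = true;
	return 0;
}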

systemd famously ran into this issue back in 2015.  At the time, systemd
used prealloc on datacow files for its journal, but prealloc on datacow
was useless with systemd's write pattern, so systemd now uses prealloc
on nodatacow files, and tries (incorrectly, it turns out) to flip them
back to datacow once systemd has closed them.

Databases and VM images (which frequently overwrite blocks) are strictly
worse with prealloc on datacow files than without prealloc or without
datacow.

> > PREALLOC is useful for nodatacow files and does implement expected
> > behavior, but it should only be used on those.
> 
> Yeah, that's not an option for general use filesystems
> that run applications that use fallocate() for preallocation and the
> user either doesn't know about it or cannot turn it off.

Hence my proposal that the filesystem silently ignore fallocate when the
file is datacow, as that is the least bad way to satisfy the various
conflicting requirements.  Deprecate the half-finished, stillborn,
insane fallocate-on-datacow feature and remove it in some future kernel.

The most correct way would be to reject posix_fallocate with an error
since btrfs doesn't implement it sanely on datacow files, but as you
point out, that would require LD_PRELOAD hacks on all the applications
that think fallocate is mandatory.

As things are now, I already use LD_PRELOAD hacks to prevent applications
from using prealloc, because prealloc on datacow is such a bad idea at scale.
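
The shim is nothing clever.  Something along these lines does the job
(names and build commands are illustrative, not my actual code; a
complete shim would also cover the 64-bit variants, pass non-prealloc
modes such as FALLOC_FL_PUNCH_HOLE through to the real fallocate via
dlsym(RTLD_NEXT, ...), and it only catches dynamically linked callers):

/* noprealloc.c: pretend preallocation succeeded without doing it. */
#define _GNU_SOURCE
#include <fcntl.h>

int fallocate(int fd, int mode, off_t offset, off_t len)
{
	(void)fd; (void)mode; (void)offset; (void)len;
	return 0;
}

int posix_fallocate(int fd, off_t offset, off_t len)
{
	(void)fd; (void)offset; (void)len;
	return 0;
}

/* Build and use:
 *   gcc -shared -fPIC -o noprealloc.so noprealloc.c
 *   LD_PRELOAD=$PWD/noprealloc.so some-application
 */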

> > > > > Next, subvol snapshot and clone time appears to be scale with the
> > > > > number of snapshots/clones already present. The initial clone/subvol
> > > > > snapshot command take a few milliseconds. At 50 snapshots it take
> > > > > 1.5s. At 200 snapshots it takes 7.5s. At 500 it takes 15s and at
> > > > > > 850 it seems to level off at about 30s a snapshot. There are
> > > > > outliers that take double this time (63s was the longest) and the
> > > > > variation between iterations can be quite substantial. Is this
> > > > > expected scalablity?
> > > > 
> > > > The snapshot will make the current subvolume to be fully committed
> > > > before really taking the snapshot.
> > > > 
> > > > Considering above metadata overhead, I believe most of the performance
> > > > penalty should come from the metadata writeback, not the snapshot
> > > > creation itself.
> > > > 
> > > > If you just create a big subvolume, sync the fs, and try to take as many
> > > > snapshot as you wish, the overhead should be pretty the same as
> > > > snapshotting an empty subvolume.
> > > 
> > > The fio workload runs fsync at the end of the overwrite, which means
> > > all the writes and the metadata needed to reference it *must* be on
> > > stable storage. 
> > 
> > That is not how btrfs fsync works, and your assertions that follow from
> > this misunderstanding are also wrong.
> 
> I suspect you misunderstand what I said.
> 
> "metadata on stable storage" for a journalling filesystem means
> "stable in the journal", not at it's final resting place. 
> I'll snip your description of the btrfs fsync journal because I know that
> btrfs does this and why it was implemented the way it was and not the
> way WAFL or ZFS solved the same "COW metadata is expensive
> for fsync()" problem...

On filesystems with mutable metadata, fsync can cheaply overwrite
metadata (possibly even without writing the journal, e.g. when new
files or metadata pages are created in previously unreferenced space),
mark the metadata pages clean in memory, and thus not have to write
them out again later with async writes.  On such filesystems, fsync()
does reduce future async metadata write latency because it can entirely
remove metadata updates from async queues (not just the data blocks).

My point was that btrfs doesn't ever do that.  btrfs uses transactions
and wandering trees instead of a journal for metadata.  There's no
(implemented) short cut for a metadata tree update, because it has
to insert items into existing btree pages, and that's read, update,
write for every page, all the way from the leaves to the roots, and the
metadata trees are _big_.

As far as I can tell, ZFS's ZIL/SLOG implementation works the same way.
Sync writes don't remove any async write workload from ZFS transaction
aggregation...and ZFS's metadata is even larger than btrfs's.

Neither ZFS nor btrfs have a background process that reads these logs and
implements the metadata tree changes asynchronously after transaction
commit, so there's no fast path to dequeueing metadata changes from
kernel memory and thus no (positive) latency impact from fsync.
 
> > In the rare cases where fsync() happens to run at the same time as a
> > transaction commit (or maybe just before), the transaction commit and
> > the fsync() get synchronized by trying to touch the same locks, and
> > return at close to the same time.  In those cases, the snapshot only
> > has to write out a new subvol root and some free space map changes,
> > which takes 0.02s.
> 
> Ok, so there are internal transaction commit/metadata COW lock-step
> conflicts in btrfs? 

Exactly.  delalloc can avoid the locks to some degree, and there is some
attempt to spread out the write load over time in background threads that
don't block anything, but once those run out of memory, or a flushing
commit like snapshot create is triggered, btrfs will stop the world
to wait for a metadata tree update to finish.

> I'm guessing that they can hit anything that runs fsync() and at
> any time?

They can hit anything that is mutating the filesystem at any time.
All mutating functions try to start or join a transaction, and all can
be locked out during the critical section of a transaction commit.

'mkdir' could take 10 microseconds or 10 months under the right
conditions.

> i.e. non-deterministic long tail latencies can hit at any time?

Yes.  Currently it's not possible to run a latency-sensitive workload
on btrfs in the presence of continuous writers.

It will always take a long time to flush millions of 4K extent refs to
disk with 80 GB of metadata, but it seems to be easily possible to cap
the latencies at seconds per commit, instead of allowing them to grow
to multiple hours.

There used to be request throttling to prevent the latencies from getting
stupidly large, but it was removed more or less accidentally in 5.0, so
now the latencies are bounded only by disk space.

Some older parts of the btrfs code (like snapshot delete) never had
throttling implemented in any version.  These are the easiest ways to
get multi-minute commit latencies.

> I doubt that aggregating more changes in memory will improve
> determinism. Sure, it might delay the interference for some time,
> but then the machine will eventually be out of memory. That can be
> triggered by a user allocation and so the user will now complain about
> long application stalls due to kernel memory reclaim....

Yeah, that advice would have only helped your test be more deterministic,
to help confirm that this theory of why the latencies are occurring is
the correct one.  It won't solve any problem.

> > If you have 1000 snapshots and your writes have high metadata locality
> > (e.g. you are appending to a single log file in each snapshot) then
> > the write multipliers are very close to 1.0x.  If you have low metadata
> > locality, even one snapshot will be followed by a big write multiplication
> > burst.
> 
> Yup, so general use case with snapshots is low data and metadata
> write locality as most files tend to get written once and then not
> touched again. 

That's the high locality case.  Most changes will occur at the high end
of the subvol where the new files are created (plus a few pages in the
middle for their dirents), so the majority of the metadata pages in the
subvol are untouched and btrfs avoids having to make reflinks of them.
Most new metadata items will appear on a few pages.

Files that are updated at random throughout their logical space, like
a big VM image or database (or directories where files are updated
randomly, e.g. a cache directory where filenames are hashes, or a
build tree where source files persist a long time but derived files
are wiped out by periodic 'make clean') will have low locality due to
deletions (including overwrites) at random points within the subvol,
while overwrites and new data still appear at the end of the subvol.
The same number of metadata item changes will affect many more metadata
pages over a large logical distance.  These are the workloads that show
up as hot spots when we are looking at what our rotating snapshot servers
are doing.

> > > > No, this is mostly due to the exploding amount of metadata caused by the
> > > > near-worst case workload.
> > 
> > Every 2 orders of magnitude more metadata items increases the O(log(N))
> > costs of btrfs by one unit.  By 50 snapshots or reflinks you have hundreds
> > of millions of metadata items, it's 6x slower and not increasing very
> > much any more...not too far off what we'd expect.
> 
> Ok, so the problem is the exponential cost of maintaining all the
> cross-btree references, not the btree itself. 

I wouldn't say that the cost is exponential (explosive, sure, that's a
good ex- word for this).  The size growth is O(n*log(n)) for the number
of extent refs (shared and unshared reflinks are implemented the same
way on btrfs) and the curve follows that shape for the first 600 or so
snapshots (I didn't run all 1000).

CPU growth seems to follow the metadata size, provided you don't step
on the related prealloc or FIEMAP performance landmines which are
O(reflinks * snapshots * log(n)) for each individual physical extent.
Large numbers growing in two dimensions, but not exponentially.

Obviously the curve bends sharply at the point where metadata no longer
fits in RAM and spills out on disk.  With 80GB of metadata, the constant
terms in O()-notation are going to be huge, and the flood of random
metadata page IOs probably large enough to find nonlinearities in the
storage hardware as well (we've hit SLC and DRAM cache throughput limits
in SSDs with this kind of workload).

> So it really is an architectural issue and not something that can
> be fixed?

Well, it's hard to tell with so many trivial performance problems floating
around in btrfs, but the results in your test do seem to be dominated
by metadata size effects.

Someone could be working on an ultra-skinny reflink metadata format, or
mutable extent ref maps (which would be useful for dedupe use cases as
well), or on-disk deferred metadata ref updates, or a high-speed FIEMAP
accelerator cache just because Dave Chinner wants one.  That would be
a lot like bolting a whole other filesystem onto the side of btrfs.
It was done for fsync, so it's not impossible, but it doesn't seem likely.

For that matter, somebody could be teaching XFS how to do compression,
data csums, and self-healing from mirror devices, and then I could flip
a few servers over to test it... ;)

> > I'm familiar with this workload.  I've been running something similar to
> > your target workload since 2014.  We build NAS backup appliance boxes:
> > each has about 100 client subvols ranging in size from 1GB to 10TB,
> > thousands to millions of files each, 1-5% daily turnover.
> > 
> > Multiple snapshots per hour at this scale is a really ambitious target
> > for btrfs.  We can theoretically do somewhere between 15 and 180 snapshot
> > rotates per day before the machine starts falling behind on the deletes
> > and running out of space.  Snapshot create and delete on btrfs come
> > with giant unbounded latency spikes, so we don't run them all the time.
> > We'll create snapshots any time a client finishes an update, but we only
> > delete old snapshots to recover disk space during a 3-hour maintenance
> > window.
> >
> > While the snapshot rotates are happening, btrfs leaves CPU cores and
> > disks idle.  Current performance is far from the theoretical limits.
> 
> Which, IMO, is kinda sad because zero-cost snapshot-based
> workload/workflows is what btrfs was specifically intended to
> provide to users...

This is a frequent misconception among btrfs users.  The costs are
not zero.  They are _deferred_, and paid out over the lifetime of the
snapshot.

They are good for use cases where you don't modify enough of a subvol to
force a complete reflink copy before you start deleting the snapshot.
This avoids the cost of creating and later deleting a full reflink
copy.  You only pay for what you modified.  This is useful for origin
servers sending backups--the backup snapshot gets deleted after use,
and you have only a handful of snapshots lying around between backups
to calculate incrementals.

For other use cases, snapshot costs include the full cost of a reflink
copy spread out over later writes to shared subvols.  Once a snapshot
has been fully transformed into a reflink copy, deleting the snapshot has
roughly the same cost as rm -fr.  This is the case you were testing, and
btrfs's ability to defer snapshot costs gives no advantage for this case.

Of course, the savings gained by deferred reflink on btrfs could be less
than the entire snapshot lifecycle cost on other filesystems due to the
ratio of btrfs metadata size to other filesystems' metadata sizes.

> Yup, that's exactly how I have fio configured to behave. So the 80GB
> of metadata is for what you consider to be the "better" case.

It should be ~80GB of metadata in both cases.  The total could be 70GB
or 90GB due to random variation between runs.

The prealloc / write fill thing just makes FIEMAP slower (more CPU
iterations) and adds 4GB of data space.  prealloc doesn't change the
metadata size by more than a few KB, and FIEMAP doesn't change anything
on disk at all.

On our fileservers with multi-million-extent subvols, we get from 50 GB
to 180 GB of metadata from 20 TB to 50 TB of data, depending on average
file size and how well dedupe is doing.  Every TB of data is a GB of
data csums (with the default 4-byte crc32c per 4KiB block, that works
out to roughly 1/1000 of the data size).  After that, the rest is
roughly equal parts directory
entries, inline file data, and extent forward/backward reference
pairs, with the sizes varying depending on content of the filesystem.
Dedupe will reflink common extents hundreds or thousands of times each,
and then we have hundreds or thousands of snapshots (daily snapshots *
years of retention).

There can easily be 30-100 GB of reflinks in the metadata, so 80 GB is
not an atypical size for this kind of workload.

You haven't confirmed whether your ~220GB size was measured with single
or dup metadata, or whether you were using 'df' or 'btrfs fi df' to
measure it.  'df' counts all 1GB metadata chunks as used space, whether
they are fully occupied or not.  'dup' metadata literally doubles the
metadata size.  If you're using dup metadata and 'df' "used" space, that
number will include 44GB of data, and the remainder will be about 210%
of the actual metadata size.  I used 'btrfs fi df', which shows directly
how much pre-duplication metadata and data space is used, so I'm getting
a best-case estimate.

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
