* XFS reflink vs ThinLVM
From: Gionatan Danti @ 2020-01-13 10:22 UTC (permalink / raw)
To: linux-xfs; +Cc: 'g.danti@assyoma.it'

Hi all,
as RHEL/CentOS 8 finally ships with XFS reflink enabled, I was thinking about how to put that very useful feature to good use. In doing so, I noticed that there is significant overlap between XFS CoW (via reflink) and dm-thin CoW (via LVM thin volumes).

I am fully aware that they are far from identical, both in use and scope: ThinLVM creates multiple volumes from a single pool, with volume-level atomic snapshots; XFS CoW, on the other hand, works inside a single volume, with file-level atomic snapshots.

Still, in at least one use case they are quite similar: single-volume storage of virtual machine files, with vdisk-level snapshots. So let's say I have a single big volume for storing virtual disk image files, and I use XFS reflink to take atomic, per-file snapshots via a simple "cp --reflink vdisk.img vdisk_snap.img".

How do you feel about using reflink for such a purpose? Is it the right tool for the job? Or do you think a "classic" approach with dm-thin and LVM snapshots should be preferred? Off the top of my head, I can think of the following pros and cons of reflink vs thin LVM:

PROS:
- XFS reflink works at 4k granularity;
- significantly simpler setup and fs expansion, especially when stacked devices (e.g. VDO) are employed.

CONS:
- XFS reflink works at 4k granularity, leading to added fragmentation (albeit mitigated by speculative preallocation?);
- no filesystem-wide atomic snapshot (ie: the various vdisk files are reflinked one by one, at slightly different times).

Side note: I am aware that a snapshot taken without guest quiescing is akin to a crashed guest, but let's ignore that for the moment.

Am I missing something?
Thanks.
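[For concreteness, the per-file snapshot scheme described above amounts to something like the sketch below. File names are placeholders; --reflink=auto is used here only so the sketch also runs (as a plain copy) on non-reflink filesystems — on a reflink-enabled XFS you would want =always, which fails loudly instead of silently degrading to a full copy.]

```shell
# Sketch of the per-vdisk snapshot scheme; names are placeholders.
# Prefer --reflink=always on reflink-capable XFS; =auto falls back
# to a normal (non-shared) copy on filesystems without reflink.
snap_vdisk() {
    cp --reflink=auto "$1" "${1%.img}_snap.img"
}

# Stand-in 4 MiB vdisk, then snapshot it:
dd if=/dev/zero of=vdisk.img bs=1M count=4 status=none
snap_vdisk vdisk.img      # creates vdisk_snap.img
```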
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: XFS reflink vs ThinLVM
From: Carlos Maiolino @ 2020-01-13 11:10 UTC (permalink / raw)
To: Gionatan Danti; +Cc: linux-xfs

Hi Gionatan.

On Mon, Jan 13, 2020 at 11:22:51AM +0100, Gionatan Danti wrote:
> How do you feel about using reflink for such a purpose? Is it the right tool
> for the job? Or do you think a "classic" approach with dm-thin and LVM
> snapshots should be preferred?
[...]
> CONS:
> - XFS reflink works at 4k granularity, leading to added fragmentation
>   (albeit mitigated by speculative preallocation?);
> - no filesystem-wide atomic snapshot (ie: the various vdisk files are
>   reflinked one by one, at slightly different times).

First of all, I think there is no 'right' answer; instead, use whatever best fits you and your environment. As you mentioned, there are PROs and CONs to each solution.

I use XFS reflink to CoW the virtual machines I use for testing, and as far as I know many others do the same; it works very well. But, as you said, these are file-based disk images, as opposed to the volume-based disk images used by DM/LVM.

About your concern regarding fragmentation: the granularity is not really 4k, as it really depends on the extent sizes. Well, yes, the fundamental granularity is the block size, but we basically never allocate a single block.

Also, you can control it by using extent size hints, which will help reduce the fragmentation you are concerned about. Check the 'extsize' and 'cowextsize' arguments for mkfs.xfs and xfs_io.

Cheers

--
Carlos
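[A minimal sketch of setting the hints Carlos mentions. Device and mount point are placeholders, and the privileged commands are guarded so nothing runs unless a real scratch block device is supplied; the hint values (128k, 1m) are illustrative, not recommendations.]

```shell
# Placeholders: point DEV at a *scratch* device before running.
set -e
DEV=${DEV:-/dev/sdX1}
MNT=${MNT:-/mnt/images}
status=skipped
if [ -b "$DEV" ] && [ -d "$MNT" ]; then
    mkfs.xfs -f -m reflink=1 "$DEV"
    mount "$DEV" "$MNT"
    # Hints set on a directory are inherited by files created inside it.
    xfs_io -c 'cowextsize 128k' "$MNT"   # CoW extent size hint
    xfs_io -c 'extsize 1m' "$MNT"        # ordinary allocation size hint
    xfs_io -c 'cowextsize' "$MNT"        # query the hint back
    status=configured
fi
echo "$status"
```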
* Re: XFS reflink vs ThinLVM
From: Gionatan Danti @ 2020-01-13 11:25 UTC (permalink / raw)
To: linux-xfs; +Cc: 'g.danti@assyoma.it'

On 13/01/20 12:10, Carlos Maiolino wrote:
> Also, you can control it by using extent size hints, which will help reduce
> the fragmentation you are concerned about. Check the 'extsize' and
> 'cowextsize' arguments for mkfs.xfs and xfs_io.

Hi Carlos, thank you for pointing me to the "cowextsize" option. From what I can read, it defaults to 32 blocks x 4 KB = 128 KB, which is a very reasonable granularity for the CoW space/fragmentation tradeoff.

On the other hand, "extsize" seems to apply only to the realtime filesystem section (which I don't plan to use), right?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: XFS reflink vs ThinLVM
From: Carlos Maiolino @ 2020-01-13 11:43 UTC (permalink / raw)
To: Gionatan Danti; +Cc: linux-xfs

Hi Gionatan.

On Mon, Jan 13, 2020 at 12:25:26PM +0100, Gionatan Danti wrote:
> On the other hand, "extsize" seems to apply only to the realtime filesystem
> section (which I don't plan to use), right?

I should have mentioned it, my apologies.

The 'extsize' argument for mkfs.xfs will set the size of the blocks in the RT section. The 'extsize' command in xfs_io, though, will set the extent size hint on any file of any XFS filesystem (or any filesystem supporting FS_IOC_FSSETXATTR).

Notice that you can use xfs_io extsize to set the extent size hint on a directory, and all files created under that directory will inherit the same hint.

Cheers.

--
Carlos
* Re: XFS reflink vs ThinLVM
From: Gionatan Danti @ 2020-01-13 12:21 UTC (permalink / raw)
To: linux-xfs; +Cc: 'g.danti@assyoma.it'

On 13/01/20 12:43, Carlos Maiolino wrote:
> Notice that you can use xfs_io extsize to set the extent size hint on a
> directory, and all files created under that directory will inherit the
> same hint.

My bad, I forgot about xfs_io.
Thanks for the detailed explanation.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: XFS reflink vs ThinLVM
From: Gionatan Danti @ 2020-01-13 15:34 UTC (permalink / raw)
To: linux-xfs; +Cc: 'g.danti@assyoma.it'

On 13/01/20 13:21, Gionatan Danti wrote:
> My bad, I forgot about xfs_io.
> Thanks for the detailed explanation.

Well, I did some tests with a reflinked file and I must say I am impressed by how well XFS handles small rewrites (for example 4K).

From my understanding, by mapping at 4K granularity but allocating at 128K, it avoids most read/write amplification *and* keeps fragmentation low. After "speculative_cow_prealloc_lifetime" it reclaims the allocated-but-unused space, giving any available free space back to the filesystem. Is this understanding correct?

I have a question: how can I see the allocated-but-unused CoW extents? For example, given the following files:

[root@neutron xfs]# stat test.img copy.img
  File: test.img
  Size: 1073741824   Blocks: 2097400   IO Block: 4096   regular file
Device: 810h/2064d   Inode: 131   Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)  Gid: (0/root)
Context: unconfined_u:object_r:unlabeled_t:s0
Access: 2020-01-13 15:40:50.280711297 +0100
Modify: 2020-01-13 16:21:55.564726283 +0100
Change: 2020-01-13 16:21:55.564726283 +0100
 Birth: -

  File: copy.img
  Size: 1073741824   Blocks: 2097152   IO Block: 4096   regular file
Device: 810h/2064d   Inode: 132   Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)  Gid: (0/root)
Context: unconfined_u:object_r:unlabeled_t:s0
Access: 2020-01-13 15:40:50.280711297 +0100
Modify: 2020-01-13 15:40:57.828552412 +0100
Change: 2020-01-13 15:41:48.190492279 +0100
 Birth: -

I can clearly see that test.img has an additional 124K allocated after a 4K rewrite. This matches my expectation: a 4K rewrite really allocates a 128K block, leading to 124K of temporarily "wasted" space.

But both "filefrag -v" and "xfs_bmap -vep" show only the used space as seen by a userspace application (ie: 262144 blocks of 4096 bytes = 1073741824 bytes). How can I check the total allocated space as reported by stat?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
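[As a generic, not XFS-specific, illustration of the Size-vs-Blocks gap in the stat output above: st_blocks counts allocated units (usually 512 bytes, reported by stat %B), independent of the apparent file size. A sparse file makes the gap obvious in the other direction:]

```shell
# Apparent size (st_size) vs allocated space (st_blocks * unit):
# a freshly truncated sparse file shows a large Size but ~0 Blocks.
truncate -s 1G sparse.img
apparent=$(stat -c %s sparse.img)
allocated=$(( $(stat -c %b sparse.img) * $(stat -c %B sparse.img) ))
echo "apparent=$apparent allocated=$allocated"
```

The same arithmetic on the stat output quoted above gives 2097400 x 512 = 1073868800 bytes, i.e. 124 KiB more than the 1 GiB apparent size — the still-unreclaimed CoW preallocation.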
* Re: XFS reflink vs ThinLVM
From: Darrick J. Wong @ 2020-01-13 16:53 UTC (permalink / raw)
To: Gionatan Danti; +Cc: linux-xfs

On Mon, Jan 13, 2020 at 04:34:50PM +0100, Gionatan Danti wrote:
> > > The 'extsize' argument for mkfs.xfs will set the size of the blocks in
> > > the RT section.

mkfs.xfs -d extszinherit=NNN is what you want here.

> From my understanding, by mapping at 4K granularity but allocating at 128K,
> it avoids most read/write amplification *and* keeps fragmentation low. After
> "speculative_cow_prealloc_lifetime" it reclaims the allocated-but-unused
> space, giving any available free space back to the filesystem. Is this
> understanding correct?

Right.

> I have a question: how can I see the allocated-but-unused CoW extents?
[...]
> But both "filefrag -v" and "xfs_bmap -vep" show only the used space as seen
> by a userspace application (ie: 262144 blocks of 4096 bytes = 1073741824
> bytes).

xfs_bmap -c, but only if you have xfs debugging enabled.

> How can I check the total allocated space as reported by stat?

If you happen to have rmap enabled, you can use the xfs_io fsmap command to look for 'cow reservation' blocks, since that 124k is (according to the ondisk metadata, anyway) owned by the refcount btree until it gets remapped into the file on writeback.

--D
* Re: XFS reflink vs ThinLVM
From: Gionatan Danti @ 2020-01-13 17:00 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs, 'g.danti@assyoma.it'

On 13/01/20 17:53, Darrick J. Wong wrote:
> mkfs.xfs -d extszinherit=NNN is what you want here.

Hi Darrick, thank you, I missed that option.

> Right.

Ok

> xfs_bmap -c, but only if you have xfs debugging enabled.

[root@neutron xfs]# xfs_bmap -c test.img
/usr/sbin/xfs_bmap: illegal option -- c
Usage: xfs_bmap [-adelpvV] [-n nx] file...

Maybe my xfs_bmap version is too old?

> If you happen to have rmap enabled, you can use the xfs_io fsmap command
> to look for 'cow reservation' blocks, since that 124k is (according to
> ondisk metadata, anyway) owned by the refcount btree until it gets
> remapped into the file on writeback.

I see. By default, on RHEL at least, rmapbt is disabled. As a side note, do you suggest enabling it when creating a new fs?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: XFS reflink vs ThinLVM
From: Darrick J. Wong @ 2020-01-13 18:09 UTC (permalink / raw)
To: Gionatan Danti; +Cc: linux-xfs

On Mon, Jan 13, 2020 at 06:00:15PM +0100, Gionatan Danti wrote:
> [root@neutron xfs]# xfs_bmap -c test.img
> /usr/sbin/xfs_bmap: illegal option -- c
> Usage: xfs_bmap [-adelpvV] [-n nx] file...
>
> Maybe my xfs_bmap version is too old?

Doh, sorry, thinko on my part. -c is exposed in the raw xfs_io command but not in the xfs_bmap wrapper.

xfs_io -c 'bmap -c -e -l -p -v <whatever>' test.img

> I see. By default, on RHEL at least, rmapbt is disabled. As a side note, do
> you suggest enabling it when creating a new fs?

If you are interested in online scrub, then I'd say yes, because it's the secret sauce that gives online metadata checking most of its power. I confess I haven't done a lot of performance analysis of rmap lately; the metadata ops overhead might still be in the ~10% range.

The two issues preventing rmap from being turned on by default (at least in my head) are (1) scrub itself is still EXPERIMENTAL and (2) it's not 100% clear that online fsck is such a killer app that everyone will want it, since you always pay the performance overhead of enabling rmap regardless of whether you use xfs_scrub. (If your workload is easily restored from backup/Docker and you need all the performance you can squeeze, then perhaps don't enable this.)

Note that I've been running with rmap=1 and scrub=1 on all systems since 2016, and frankly I've noticed the system stumbling over broken writeback throttling much more than whatever the tracing tools attribute to rmap.

--D
* Re: XFS reflink vs ThinLVM
From: Gionatan Danti @ 2020-01-14 8:45 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs

On 13/01/20 19:09, Darrick J. Wong wrote:
> xfs_io -c 'bmap -c -e -l -p -v <whatever>' test.img

Ok, good to know. Thanks.

> The two issues preventing rmap from being turned on by default (at least
> in my head) are (1) scrub itself is still EXPERIMENTAL and (2) it's not
> 100% clear that online fsck is such a killer app that everyone will want
> it, since you always pay the performance overhead of enabling rmap
> regardless of whether you use xfs_scrub.

Well, I really think online scrub, when ready, will be a killer feature. So, for a "mere" 10% performance penalty, I would enable rmapbt unless there is a concrete chance of exposing some bug.

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: XFS reflink vs ThinLVM
From: Gionatan Danti @ 2020-01-15 11:37 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs

Hi all, I have an additional question about extszinherit/extsize.

If I understand it correctly, by default it is 0: any non-EOF write to a sparse file will allocate just as much space as it needs. If these writes are random and small enough (ie: 4k random writes), a subsequent sequential read of the same file will have much lower performance (because sequential IOs are turned into random accesses by the logical/physical block remapping).

Setting a 128K extszinherit (for the entire filesystem) or extsize (for a file/dir) will markedly improve the situation, as much bigger contiguous LBA regions can be read for each IO (note: I know SSD and NVMe disks are much less impacted by fragmentation, but I am mainly speaking about HDDs here).

So, my question: is there anything wrong with, and/or anything I should be aware of when, using a 128K extsize, i.e. setting it the same as cowextsize? The only possible drawback I can think of is a coarser allocation granularity within the sparse file (ie: a 4k write will allocate a full 128k extent).

Am I missing something?
Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: XFS reflink vs ThinLVM
From: Darrick J. Wong @ 2020-01-15 16:39 UTC (permalink / raw)
To: Gionatan Danti; +Cc: linux-xfs

On Wed, Jan 15, 2020 at 12:37:52PM +0100, Gionatan Danti wrote:
> So, my question: is there anything wrong with, and/or anything I should be
> aware of when, using a 128K extsize, i.e. setting it the same as cowextsize?
> The only possible drawback I can think of is a coarser allocation
> granularity within the sparse file (ie: a 4k write will allocate a full
> 128k extent).
>
> Am I missing something?

extszinherit > 0 disables delayed allocation, which means that (in your case above) if you wrote 1G to a file (using the pagecache) you'd get 8192x 128K calls to the allocator instead of making a single 1G allocation during writeback. If you have a lot of memory (or a high vm dirty ratio) then you want delalloc over extsize. Most of the time you want delalloc, frankly.

--D
* Re: XFS reflink vs ThinLVM
From: Gionatan Danti @ 2020-01-15 17:45 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs, g.danti

On 15-01-2020 17:39, Darrick J. Wong wrote:
> extszinherit > 0 disables delayed allocation, which means that (in your
> case above) if you wrote 1G to a file (using the pagecache) you'd get
> 8192x 128K calls to the allocator instead of making a single 1G
> allocation during writeback.

Thanks for the valuable information; I did not know about that specific interaction between extsize and delalloc.

> If you have a lot of memory (or a high vm dirty ratio) then you want
> delalloc over extsize. Most of the time you want delalloc, frankly.

Let me briefly describe the expected workload: thinly provisioned virtual image storage. The problem with a "plain" sparse file (ie: without an extsize hint) is that, after some time, the underlying vdisk file will be very fragmented: consecutive physical blocks will be assigned to very different logical blocks, leading to sub-par performance when reading back the whole file (e.g. for backup purposes).

I can easily simulate a worst-case scenario with fio, issuing random writes to a pre-created sparse file. While the random writes complete very fast (because they are more-or-less sequentially written inside the sparse file), reading back that file has very low performance: 10 MB/s vs 600+ MB/s for a preallocated file.

Using a 128k extsize brings the sequential read to ~100 MB/s (which is reasonable on that old hardware), and a 16M extsize is in the range of 500+ MB/s.

Given that use case, do you suggest sticking with delalloc or setting an appropriate extsize?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
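[The worst-case aging experiment described above can be reproduced roughly as follows. This is a sketch: the fio parameters are illustrative (the thread used --size=4G over ~30 minutes; a small size is used here so the sketch finishes quickly), and the run is guarded so it only executes where fio is installed.]

```shell
# Guarded sketch of the sparse-file aging test; runs only if fio exists.
set -e
if command -v fio >/dev/null; then
    truncate -s 64M test.img                              # pre-created sparse file
    fio --name=test --filename=test.img --rw=randwrite --bs=4k --size=64M
    dd if=test.img of=/dev/null bs=1M                     # sequential read-back
    filefrag test.img || true                             # extent count after aging
    status=ran
else
    status=skipped
fi
echo "$status"
```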
* Re: XFS reflink vs ThinLVM
From: Gionatan Danti @ 2020-01-17 21:58 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs, g.danti

On 15-01-2020 18:45, Gionatan Danti wrote:
> I can easily simulate a worst-case scenario with fio, issuing random
> writes to a pre-created sparse file. While the random writes complete
> very fast (because they are more-or-less sequentially written inside
> the sparse file), reading back that file has very low performance:
> 10 MB/s vs 600+ MB/s for a preallocated file.

I would like to share some other observations/results, which I hope can be useful for other people.

Further testing shows that "cp --reflink" of a highly fragmented file is a relatively long operation, easily in the range of 30s or more, during which the guest virtual machine is basically denied any access to the underlying virtual disk file.

While the number of fragments required to reach reflink times of 30+ seconds is very high, this would be a quite common case when using thinly provisioned virtual disk files. With a sparse file, any write done at the guest OS level has a very good chance of creating its own fragment (ie: allocating a discontiguous chunk, as seen by the logical/physical block mapping), leading to very fragmented files.

So, back to the main topic: reflink is an invaluable tool, to be used *with* (rather than instead of) thin LVM:
- thin LVM is the right tool for taking rolling volume snapshots;
- reflink is extremely useful for "on-demand" snapshots of key files.

Thank you all for the very detailed and useful information you provided.
Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: XFS reflink vs ThinLVM
From: Darrick J. Wong @ 2020-01-17 23:42 UTC (permalink / raw)
To: Gionatan Danti; +Cc: linux-xfs

On Fri, Jan 17, 2020 at 10:58:15PM +0100, Gionatan Danti wrote:
> Further testing shows that "cp --reflink" of a highly fragmented file is a
> relatively long operation, easily in the range of 30s or more, during which
> the guest virtual machine is basically denied any access to the underlying
> virtual disk file.

How many fragments, and how big of a sparse file?

--D
* Re: XFS reflink vs ThinLVM
  2020-01-17 23:42             ` Darrick J. Wong
@ 2020-01-18 11:08               ` Gionatan Danti
  2020-01-18 23:06                 ` Darrick J. Wong
  0 siblings, 1 reply; 20+ messages in thread
From: Gionatan Danti @ 2020-01-18 11:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, g.danti

On 18-01-2020 00:42 Darrick J. Wong wrote:
> How many fragments, and how big of a sparse file?

A just-installed CentOS 8 guest using a 20 GB sparse file vdisk had about
2000 fragments.

After running "fio --name=test --filename=test.img --rw=randwrite --size=4G"
for about 30 mins, it ended with over 1M fragments/extents. At that point,
reflinking that file took over 2 mins, and unlinking it about 4 mins.

I understand the fio randwrite pattern is a worst-case scenario; still, I
think the results are interesting and telling for "aged" virtual machines.

As a side note, a just-installed Win2019 guest backed by an 80 GB sparse
file had about 18000 fragments.

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: XFS reflink vs ThinLVM
  2020-01-18 11:08               ` Gionatan Danti
@ 2020-01-18 23:06                 ` Darrick J. Wong
  2020-01-19  8:45                   ` Gionatan Danti
  0 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2020-01-18 23:06 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-xfs

On Sat, Jan 18, 2020 at 12:08:48PM +0100, Gionatan Danti wrote:
> On 18-01-2020 00:42 Darrick J. Wong wrote:
> > How many fragments, and how big of a sparse file?
>
> A just-installed CentOS 8 guest using a 20 GB sparse file vdisk had about
> 2000 fragments.
>
> After running "fio --name=test --filename=test.img --rw=randwrite --size=4G"

4GB / 1M extents == 4096, which is probably the fs blocksize :)

I wonder, do you get different results if you set an extent size hint on
the dir before running fio?

I forgot(?) to mention that if you're mostly dealing with sparse VM
images then you might as well set an extent size hint and forego delayed
allocation, because it won't help you much.

--D

> for about 30 mins, it ended with over 1M fragments/extents. At that
> point, reflinking that file took over 2 mins, and unlinking it about 4
> mins.
>
> I understand the fio randwrite pattern is a worst-case scenario; still,
> I think the results are interesting and telling for "aged" virtual
> machines.
>
> As a side note, a just-installed Win2019 guest backed by an 80 GB sparse
> file had about 18000 fragments.
>
> Thanks.
>
> --
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: XFS reflink vs ThinLVM
  2020-01-18 23:06                 ` Darrick J. Wong
@ 2020-01-19  8:45                   ` Gionatan Danti
  0 siblings, 0 replies; 20+ messages in thread
From: Gionatan Danti @ 2020-01-19 8:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, g.danti

On 19-01-2020 00:06 Darrick J. Wong wrote:
> 4GB / 1M extents == 4096, which is probably the fs blocksize :)

Yes, I made the same observation: due to the random allocation, the
underlying vdisk had block-sized extents.

> I wonder, do you get different results if you set an extent size hint
> on the dir before running fio?

Yes: setting extsize to 128K strongly reduces the number of allocated
extents (eg: 4G / 128K = 32K extents). A similar result can be obtained by
tapping into cowextsize, after "cp --reflink"-ing the original file: any
subsequent 4K write inside the guest will then cause a 128K CoW allocation
(the default cowextsize) on the backing file.

However, while *much* better, it is my understanding that XFS reflink is a
variable-length process: as all extents have to be scanned/reflinked, the
reflink time is not constant, and meanwhile it is impossible to read/write
the file being reflinked. Am I right? On the other hand, thinlvm snapshots,
operating at the block level, are a (more-or-less) constant-time operation,
causing much less disruption to the normal IO flow of the guest volumes.

I absolutely don't want to downplay reflink's usefulness; rather, it is an
extremely useful feature which can be put to very good use.

> I forgot(?) to mention that if you're mostly dealing with sparse VM
> images then you might as well set an extent size hint and forego delayed
> allocation because it won't help you much.

This was my conclusion as well. Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: XFS reflink vs ThinLVM
  2020-01-13 10:22 XFS reflink vs ThinLVM Gionatan Danti
  2020-01-13 11:10 ` Carlos Maiolino
@ 2020-01-13 16:14 ` Chris Murphy
  2020-01-13 16:25   ` Gionatan Danti
  1 sibling, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2020-01-13 16:14 UTC (permalink / raw)
  To: xfs list; +Cc: Gionatan Danti

On Mon, Jan 13, 2020 at 3:28 AM Gionatan Danti <g.danti@assyoma.it> wrote:
>
> Still, in at least one use case they are quite similar: single-volume
> storage of virtual machine files, with vdisk-level snapshots. So let's
> say I have a single big volume for storing virtual disk image files,
> and I use XFS reflink to take atomic, per-file snapshots via a simple
> "cp --reflink vdisk.img vdisk_snap.img".

Is --reflink on XFS atomic? In particular for a VM file that's being
used, that's possibly quite a lot of metadata on disk and in flight in
the host and in the guest.

I ask because I'm not certain --reflink copies on Btrfs are atomic; I'll
have to ask over there too. Whereas btrfs subvolume snapshots are
considered atomic.

--
Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: XFS reflink vs ThinLVM
  2020-01-13 16:14 ` Chris Murphy
@ 2020-01-13 16:25   ` Gionatan Danti
  0 siblings, 0 replies; 20+ messages in thread
From: Gionatan Danti @ 2020-01-13 16:25 UTC (permalink / raw)
  To: Chris Murphy, xfs list; +Cc: 'g.danti@assyoma.it'

On 13/01/20 17:14, Chris Murphy wrote:
> Is --reflink on XFS atomic? In particular for a VM file that's being
> used, that's possibly quite a lot of metadata on disk and in flight in
> the host and in the guest.
>
> I ask because I'm not certain --reflink copies on Btrfs are atomic;
> I'll have to ask over there too. Whereas btrfs subvolume snapshots are
> considered atomic.

Hi, I asked that question some time ago and, based on what I read here
[1], it *should* be atomic. Feel free to correct me, anyway.

[1] https://www.spinics.net/lists/linux-xfs/msg15969.html

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2020-01-19  8:45 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-13 10:22 XFS reflink vs ThinLVM Gionatan Danti
2020-01-13 11:10 ` Carlos Maiolino
2020-01-13 11:25   ` Gionatan Danti
2020-01-13 11:43     ` Carlos Maiolino
2020-01-13 12:21       ` Gionatan Danti
2020-01-13 15:34         ` Gionatan Danti
2020-01-13 16:53           ` Darrick J. Wong
2020-01-13 17:00             ` Gionatan Danti
2020-01-13 18:09               ` Darrick J. Wong
2020-01-14  8:45                 ` Gionatan Danti
2020-01-15 11:37                   ` Gionatan Danti
2020-01-15 16:39                     ` Darrick J. Wong
2020-01-15 17:45                       ` Gionatan Danti
2020-01-17 21:58                         ` Gionatan Danti
2020-01-17 23:42                           ` Darrick J. Wong
2020-01-18 11:08                             ` Gionatan Danti
2020-01-18 23:06                               ` Darrick J. Wong
2020-01-19  8:45                                 ` Gionatan Danti
2020-01-13 16:14 ` Chris Murphy
2020-01-13 16:25   ` Gionatan Danti