Linux-XFS Archive on lore.kernel.org
* XFS reflink vs ThinLVM
@ 2020-01-13 10:22 Gionatan Danti
  2020-01-13 11:10 ` Carlos Maiolino
  2020-01-13 16:14 ` Chris Murphy
  0 siblings, 2 replies; 20+ messages in thread
From: Gionatan Danti @ 2020-01-13 10:22 UTC (permalink / raw)
  To: linux-xfs; +Cc: g.danti

Hi all,
as RHEL/CentOS 8 finally ships with XFS reflink enabled, I was thinking 
about how to put that very useful feature to good use. In doing so, I 
noticed that there is significant overlap between XFS CoW (via reflink) 
and dm-thin CoW (via LVM thin volumes).

I am fully aware that they are far from identical, both in use and 
scope: ThinLVM is used to create multiple volumes from a single pool, 
with volume-level atomic snapshots; XFS CoW, on the other hand, works 
inside a single volume, with file-level atomic snapshots.

Still, in at least one use case they are quite similar: single-volume 
storage of virtual machine files, with vdisk-level snapshots. So let's say 
I have a single big volume for storing virtual disk image files, and I 
use XFS reflink to take atomic, per-file snapshots via a simple "cp 
--reflink vdisk.img vdisk_snap.img".
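
As a concrete sketch of that per-file snapshot flow (file names are made
up, and a dd-created file stands in for a real disk image; on an XFS
filesystem made with '-m reflink=1', --reflink=always shares extents and
fails rather than degrading to a full copy, while --reflink=auto falls
back silently elsewhere):

```shell
# Stand-in for a real vdisk image (hypothetical names, any filesystem).
dd if=/dev/zero of=vdisk.img bs=1M count=4 status=none
# Reflink where supported, plain copy otherwise.
cp --reflink=auto vdisk.img vdisk_snap.img
# The snapshot is byte-identical to the source at copy time.
cmp -s vdisk.img vdisk_snap.img && echo "snapshot matches source"
```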

How do you feel about using reflink for such a purpose? Is it the right 
tool for the job? Or do you think a "classic" approach with dm-thin and 
LVM snapshots should be preferred? Off the top of my head, I can think of 
the following pros and cons when using reflink vs thin LVM:

PROS:
- xfs reflink works at 4k granularity;
- significantly simpler setup and filesystem expansion, especially when 
stacked devices (e.g. VDO) are employed.

CONS:
- xfs reflink works at 4k granularity, leading to added fragmentation 
(albeit mitigated by speculative preallocation?);
- no filesystem-wide atomic snapshot (i.e. the various vdisk files are 
reflinked one by one, at slightly different times).

Side note: I am aware that a snapshot taken without guest quiescing is 
akin to a crashed guest, but let's ignore that for the moment.

Am I missing something?
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS reflink vs ThinLVM
  2020-01-13 10:22 XFS reflink vs ThinLVM Gionatan Danti
@ 2020-01-13 11:10 ` Carlos Maiolino
  2020-01-13 11:25   ` Gionatan Danti
  2020-01-13 16:14 ` Chris Murphy
  1 sibling, 1 reply; 20+ messages in thread
From: Carlos Maiolino @ 2020-01-13 11:10 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-xfs

Hi Gionatan.

On Mon, Jan 13, 2020 at 11:22:51AM +0100, Gionatan Danti wrote:
> Hi all,
> as RHEL/CentOS 8 finally ships with XFS reflink enabled, I was thinking on
> how to put that very useful feature to good use. Doing that, I noticed how
> there is a significant overlap between XFS CoW (via reflink) and dm-thin CoW
> (via LVM thin volumes).
> 
> I am fully aware that they are far from identical, both in use and scope:
> ThinLVM is used to create multiple volumes from a single pool, with
> volume-level atomic snapshot; on the other hand, XFS CoW works inside a
> single volume and with file-level atomic snapshot.
> 
> Still, in at least one use case they are quite similar: single-volume
> storage of virtual machine files, with vdisk-level snapshot. So lets say I
> have a single big volume for storing virtual disk image file, and using XFS
> reflink to take atomic, per file snapshot via a simple "cp --reflink
> vdisk.img vdisk_snap.img".
> 
> How do you feel about using reflink for such a purpose? Is the right tool
> for the job? Or do you think a "classic" approach with dmthin and lvm
> snapshot should be preferred? On top of my head, I can thin about the
> following pros and cons when using reflink vs thin lvm:
> 

> PRO:
> - xfs reflink works at 4k granularity;
> - significantly simpler setup and fs expansion, especially when staked
> devices (ie: vdo) are employed.
> 
> CONS:
> - xfs reflink works at 4k granularity, leading to added fragmentation
> (albeit mitigated by speculative preallocation?);
> - no filesystem-wide atomic snapshot (ie: various vdisk files are reflinked
> one-by-one, at small but different times).
> 
> Side note: I am aware of the fact that a snapshot taken without guest
> quiescing is akin to a crashed guest, but lets ignore that for the moment.
> 

First of all, I think there is no 'right' answer; instead, use what best fits
you and your environment. As you mentioned, there are PROs and CONs to each
solution.

I use XFS reflink to CoW the virtual machines I use for testing. As far as I
know many others do the same, and it works very well, but as you said, it is
file-based disk images, as opposed to the volume-based disk images used by DM
and LVM.

About your concern regarding fragmentation... the granularity is not really 4k;
it depends on the extent sizes. Yes, the fundamental granularity is the block
size, but we basically never allocate a single block...

Also, you can control it by using extent size hints, which will help reduce the
fragmentation you are concerned about.
Check 'extsize' and 'cowextsize' arguments for mkfs.xfs and xfs_io.
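
For instance, a hedged sketch of where those hints get set (device and
paths are hypothetical, and the commands are captured into a variable
rather than executed, since mkfs.xfs would destroy data and xfs_io needs
a mounted XFS filesystem):

```shell
# Not executed: only printed, so the snippet is safe to run anywhere.
SNIP=$(cat <<'EOF'
# filesystem-wide at mkfs time (values in filesystem blocks: 32 x 4 KiB = 128 KiB)
mkfs.xfs -m reflink=1 -d extszinherit=32,cowextsize=32 /dev/vg0/vmstore
# or per directory after mounting; files created under it inherit both hints
xfs_io -c 'extsize 128k' -c 'cowextsize 128k' /mnt/vmstore
EOF
)
printf '%s\n' "$SNIP"
```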


Cheers

-- 
Carlos



* Re: XFS reflink vs ThinLVM
  2020-01-13 11:10 ` Carlos Maiolino
@ 2020-01-13 11:25   ` Gionatan Danti
  2020-01-13 11:43     ` Carlos Maiolino
  0 siblings, 1 reply; 20+ messages in thread
From: Gionatan Danti @ 2020-01-13 11:25 UTC (permalink / raw)
  To: linux-xfs; +Cc: g.danti

On 13/01/20 12:10, Carlos Maiolino wrote:
> First of all, I think there is no 'right' answer, but instead, use what best fit
> you and your environment. As you mentioned, there are PROs and CONS for each
> different solution.
> 
> I use XFS reflink to CoW my Virtual Machines I use for testing. As I know many
> others do the same, and it works very well, but as you said. It is file-based
> disk images, opposed to volume-based disk images, used by DM and LVM.man.
> 
> About your concern regarding fragmentation... The granularity is not really 4k,
> as it really depends on the extent sizes. Well, yes, the fundamental granularity
> is block size, but we basically never allocate a single block...
> 
> Also, you can control it by using extent size hints, which will help reduce the
> fragmentation you are concerned about.
> Check 'extsize' and 'cowextsize' arguments for mkfs.xfs and xfs_io.

Hi Carlos, thank you for pointing me to the "cowextsize" option. From 
what I can read, it defaults to 32 blocks x 4 KB = 128 KB, which is a 
very reasonable granularity for the CoW space/fragmentation tradeoff.

On the other hand, "extsize" seems to apply only to the realtime section 
of the filesystem (which I don't plan to use), right?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-13 11:25   ` Gionatan Danti
@ 2020-01-13 11:43     ` Carlos Maiolino
  2020-01-13 12:21       ` Gionatan Danti
  0 siblings, 1 reply; 20+ messages in thread
From: Carlos Maiolino @ 2020-01-13 11:43 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-xfs

Hi Gionatan.

On Mon, Jan 13, 2020 at 12:25:26PM +0100, Gionatan Danti wrote:
> On 13/01/20 12:10, Carlos Maiolino wrote:
> > First of all, I think there is no 'right' answer, but instead, use what best fit
> > you and your environment. As you mentioned, there are PROs and CONS for each
> > different solution.
> > 
> > I use XFS reflink to CoW my Virtual Machines I use for testing. As I know many
> > others do the same, and it works very well, but as you said. It is file-based
> > disk images, opposed to volume-based disk images, used by DM and LVM.man.
> > 
> > About your concern regarding fragmentation... The granularity is not really 4k,
> > as it really depends on the extent sizes. Well, yes, the fundamental granularity
> > is block size, but we basically never allocate a single block...
> > 
> > Also, you can control it by using extent size hints, which will help reduce the
> > fragmentation you are concerned about.
> > Check 'extsize' and 'cowextsize' arguments for mkfs.xfs and xfs_io.
> 
> Hi Carlos, thank you for pointing me to the "cowextsize" option. From what I
> can read, it default to 32 blocks x 4 KB = 128 KB, which is a very
> reasonable granularity for CoW space/fragmentation tradeoff.
> 
> On the other hand, "extsize" seems to apply only to realtime filesystem
> section (which I don't plan to use), right?

I should have mentioned it, my apologies.

The 'extsize' argument for mkfs.xfs sets the size of the blocks in the RT
section.

The 'extsize' command in xfs_io, however, sets the extent size hint on any
file of any XFS filesystem (or any filesystem supporting FS_IOC_FSSETXATTR).

Note that you can use xfs_io's extsize to set the extent size hint on a
directory, and all files created under that directory will inherit the same
hint.

Cheers.

> 
> Thanks.
> 
> -- 
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8
> 

-- 
Carlos



* Re: XFS reflink vs ThinLVM
  2020-01-13 11:43     ` Carlos Maiolino
@ 2020-01-13 12:21       ` Gionatan Danti
  2020-01-13 15:34         ` Gionatan Danti
  0 siblings, 1 reply; 20+ messages in thread
From: Gionatan Danti @ 2020-01-13 12:21 UTC (permalink / raw)
  To: linux-xfs; +Cc: g.danti

On 13/01/20 12:43, Carlos Maiolino wrote:
> I should have mentioned it, my apologies.
> 
> 'extsize' argument for mkfs.xfs will set the size of the blocks in the RT
> section.
> 
> Although, the 'extsize' command in xfs_io, will set the extent size hints on any
> file of any xfs filesystem (or filesystem supporting FS_IOC_FSSETXATTR).
> 
> Notice you can use xfs_io extsize to set the extent size hint to a directory,
> and all files under the directory will inherit the same extent hint.

My bad, I forgot about xfs_io.
Thanks for the detailed explanation.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-13 12:21       ` Gionatan Danti
@ 2020-01-13 15:34         ` Gionatan Danti
  2020-01-13 16:53           ` Darrick J. Wong
  0 siblings, 1 reply; 20+ messages in thread
From: Gionatan Danti @ 2020-01-13 15:34 UTC (permalink / raw)
  To: linux-xfs; +Cc: g.danti

On 13/01/20 13:21, Gionatan Danti wrote:
> On 13/01/20 12:43, Carlos Maiolino wrote:
>> I should have mentioned it, my apologies.
>>
>> 'extsize' argument for mkfs.xfs will set the size of the blocks in the RT
>> section.
>>
>> Although, the 'extsize' command in xfs_io, will set the extent size 
>> hints on any
>> file of any xfs filesystem (or filesystem supporting FS_IOC_FSSETXATTR).
>>
>> Notice you can use xfs_io extsize to set the extent size hint to a 
>> directory,
>> and all files under the directory will inherit the same extent hint.
> 
> My bad, I forgot about xfs_io.
> Thanks for the detailed explanation.

Well, I did some tests with a reflinked file and I must say I am 
impressed by how well XFS handles small rewrites (for example 4K).

From my understanding, by mapping at 4K granularity but allocating at 
128K, it avoids most read/write amplification *and* keeps fragmentation 
low. After "speculative_cow_prealloc_lifetime" it reclaims the 
allocated-but-unused space, returning any available free space to 
the filesystem. Is this understanding correct?
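
As a side note, that knob can be inspected via sysctl, guarded here since
it only exists on kernels with the XFS module loaded (the value is in
seconds):

```shell
# Falls back to a placeholder where the fs.xfs sysctls are absent.
LIFETIME=$(sysctl -n fs.xfs.speculative_cow_prealloc_lifetime 2>/dev/null \
  || echo "unavailable")
echo "speculative_cow_prealloc_lifetime: $LIFETIME"
```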

I have a question: how can I see the allocated-but-unused CoW extents? 
For example, given the following files:

[root@neutron xfs]# stat test.img copy.img
   File: test.img
   Size: 1073741824      Blocks: 2097400    IO Block: 4096   regular file
Device: 810h/2064d      Inode: 131         Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:unlabeled_t:s0
Access: 2020-01-13 15:40:50.280711297 +0100
Modify: 2020-01-13 16:21:55.564726283 +0100
Change: 2020-01-13 16:21:55.564726283 +0100
  Birth: -

   File: copy.img
   Size: 1073741824      Blocks: 2097152    IO Block: 4096   regular file
Device: 810h/2064d      Inode: 132         Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:unlabeled_t:s0
Access: 2020-01-13 15:40:50.280711297 +0100
Modify: 2020-01-13 15:40:57.828552412 +0100
Change: 2020-01-13 15:41:48.190492279 +0100
  Birth: -

I can clearly see that test.img has an additional 124K allocated after a 
4K rewrite. This matches my expectation: a 4K rewrite really allocates a 
full 128K extent, leading to 124K of temporarily "wasted" space.
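
A quick sanity check of that arithmetic (stat's "Blocks" field counts
512-byte units, so the delta between the two files works out to 124 KiB):

```shell
# Block counts taken from the stat output above.
delta_blocks=$((2097400 - 2097152))
delta_kib=$((delta_blocks * 512 / 1024))
echo "${delta_blocks} blocks = ${delta_kib} KiB"   # → 248 blocks = 124 KiB
```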

But both "filefrag -v" and "xfs_bmap -vep" show only the used space as 
seen by a userspace application (i.e. 262144 blocks of 4096 bytes = 
1073741824 bytes).

How can I check the total allocated space as reported by stat?
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-13 10:22 XFS reflink vs ThinLVM Gionatan Danti
  2020-01-13 11:10 ` Carlos Maiolino
@ 2020-01-13 16:14 ` Chris Murphy
  2020-01-13 16:25   ` Gionatan Danti
  1 sibling, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2020-01-13 16:14 UTC (permalink / raw)
  To: xfs list; +Cc: Gionatan Danti

On Mon, Jan 13, 2020 at 3:28 AM Gionatan Danti <g.danti@assyoma.it> wrote:
>
> Still, in at least one use case they are quite similar: single-volume
> storage of virtual machine files, with vdisk-level snapshot. So lets say
> I have a single big volume for storing virtual disk image file, and
> using XFS reflink to take atomic, per file snapshot via a simple "cp
> --reflink vdisk.img vdisk_snap.img".

Is --reflink on XFS atomic? In particular for a VM file that's being
used, that's possibly quite a lot of metadata on disk and in-flight in
the host and in the guest.

I ask because I'm not certain --reflink copies on Btrfs are atomic,
I'll have to ask over there too. Whereas btrfs subvolume snapshots are
considered atomic.



-- 
Chris Murphy


* Re: XFS reflink vs ThinLVM
  2020-01-13 16:14 ` Chris Murphy
@ 2020-01-13 16:25   ` Gionatan Danti
  0 siblings, 0 replies; 20+ messages in thread
From: Gionatan Danti @ 2020-01-13 16:25 UTC (permalink / raw)
  To: Chris Murphy, xfs list; +Cc: g.danti

On 13/01/20 17:14, Chris Murphy wrote:
> Is --reflink on XFS atomic? In particular for a VM file that's being
> used, that's possibly quite a lot of metadata on disk and in-flight in
> the host and in the guest.
> 
> I ask because I'm not certain --reflink copies on Btrfs are atomic,
> I'll have to ask over there too. Whereas btrfs subvolume snapshots are
> considered atomic.

Hi, I asked that question some time ago and, based on what I read here 
[1], it *should* be atomic.

Feel free to correct me, anyway.

[1] https://www.spinics.net/lists/linux-xfs/msg15969.html

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-13 15:34         ` Gionatan Danti
@ 2020-01-13 16:53           ` Darrick J. Wong
  2020-01-13 17:00             ` Gionatan Danti
  0 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2020-01-13 16:53 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-xfs

On Mon, Jan 13, 2020 at 04:34:50PM +0100, Gionatan Danti wrote:
> On 13/01/20 13:21, Gionatan Danti wrote:
> > On 13/01/20 12:43, Carlos Maiolino wrote:
> > > I should have mentioned it, my apologies.
> > > 
> > > 'extsize' argument for mkfs.xfs will set the size of the blocks in the RT
> > > section.

mkfs.xfs -d extszinherit=NNN is what you want here.

> > > 
> > > Although, the 'extsize' command in xfs_io, will set the extent size
> > > hints on any
> > > file of any xfs filesystem (or filesystem supporting FS_IOC_FSSETXATTR).
> > > 
> > > Notice you can use xfs_io extsize to set the extent size hint to a
> > > directory,
> > > and all files under the directory will inherit the same extent hint.
> > 
> > My bad, I forgot about xfs_io.
> > Thanks for the detailed explanation.
> 
> Well, I did some test with a reflinked file and I must say I am impressed on
> how well XFS handles small rewrites (for example 4K).
> 
> From my understanding, by mapping at 4K granularity but allocating at 128K,
> it avoid most read/write amplification *and* keep low fragmentation. After
> "speculative_cow_prealloc_lifetime" it reclaim the allocated but unused
> space, bringing back any available free space to the filesystem. Is this
> understanding correct?

Right.

> I have a question: how can I see the allocated-but-unused cow extents? For
> example, giving the following files:
> 
> [root@neutron xfs]# stat test.img copy.img
>   File: test.img
>   Size: 1073741824      Blocks: 2097400    IO Block: 4096   regular file
> Device: 810h/2064d      Inode: 131         Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: unconfined_u:object_r:unlabeled_t:s0
> Access: 2020-01-13 15:40:50.280711297 +0100
> Modify: 2020-01-13 16:21:55.564726283 +0100
> Change: 2020-01-13 16:21:55.564726283 +0100
>  Birth: -
> 
>   File: copy.img
>   Size: 1073741824      Blocks: 2097152    IO Block: 4096   regular file
> Device: 810h/2064d      Inode: 132         Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: unconfined_u:object_r:unlabeled_t:s0
> Access: 2020-01-13 15:40:50.280711297 +0100
> Modify: 2020-01-13 15:40:57.828552412 +0100
> Change: 2020-01-13 15:41:48.190492279 +0100
>  Birth: -
> 
> I can clearly see that test.img has an additional 124K allocated after a 4K
> rewrite. This matches my expectation: a 4K rewrite really allocates a 128K
> blocks, leading to 124K of temporarily "wasted" space.
> 
> But both "filefrag -v" and "xfs_bmap -vep" show only the used space as seen
> by an userspace application (ie: 262144 blocks of 4096 bytes = 1073741824
> bytes).

xfs_bmap -c, but only if you have xfs debugging enabled.

> How can I check the total allocated space as reported by stat?
> Thanks.

If you happen to have rmap enabled, you can use the xfs_io fsmap command
to look for 'cow reservation' blocks, since that 124k is (according to
ondisk metadata, anyway) owned by the refcount btree until it gets
remapped into the file on writeback.
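
A guarded sketch of that lookup (the mount point is hypothetical, and
fsmap needs rmapbt=1 set at mkfs time, so the snippet degrades to a
message anywhere else):

```shell
MNT=/mnt/vmstore   # hypothetical XFS mount created with rmapbt=1
# 'cow reservation' lines are the staged-but-unwritten CoW blocks.
if xfs_io -c 'fsmap -v' "$MNT" 2>/dev/null | grep 'cow reservation'; then
  echo "found staged CoW reservations"
else
  echo "nothing to show (no xfs_io, no rmapbt, or no CoW staged) for $MNT"
fi
```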

--D

> -- 
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-13 16:53           ` Darrick J. Wong
@ 2020-01-13 17:00             ` Gionatan Danti
  2020-01-13 18:09               ` Darrick J. Wong
  0 siblings, 1 reply; 20+ messages in thread
From: Gionatan Danti @ 2020-01-13 17:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, g.danti

On 13/01/20 17:53, Darrick J. Wong wrote:
> mkfs.xfs -d extszinherit=NNN is what you want here.

Hi Darrick, thank you, I missed that option.

> Right.

Ok

> xfs_bmap -c, but only if you have xfs debugging enabled.

[root@neutron xfs]# xfs_bmap -c test.img
/usr/sbin/xfs_bmap: illegal option -- c
Usage: xfs_bmap [-adelpvV] [-n nx] file...

Maybe my xfs_bmap version is too old.

> If you happen to have rmap enabled, you can use the xfs_io fsmap command
> to look for 'cow reservation' blocks, since that 124k is (according to
> ondisk metadata, anyway) owned by the refcount btree until it gets
> remapped into the file on writeback.

I see. By default, on RHEL at least, rmapbt is disabled. As a side note, 
do you suggest enabling it when creating a new fs?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-13 17:00             ` Gionatan Danti
@ 2020-01-13 18:09               ` Darrick J. Wong
  2020-01-14  8:45                 ` Gionatan Danti
  0 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2020-01-13 18:09 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-xfs

On Mon, Jan 13, 2020 at 06:00:15PM +0100, Gionatan Danti wrote:
> On 13/01/20 17:53, Darrick J. Wong wrote:
> > mkfs.xfs -d extszinherit=NNN is what you want here.
> 
> Hi Darrik, thank you, I missed that option.
> 
> > Right.
> 
> Ok
> 
> > xfs_bmap -c, but only if you have xfs debugging enabled.
> 
> [root@neutron xfs]# xfs_bmap -c test.img
> /usr/sbin/xfs_bmap: illegal option -- c
> Usage: xfs_bmap [-adelpvV] [-n nx] file...
> 
> Maybe my xfs_bmap version is too old:

Doh, sorry, thinko on my part.  -c is exposed in the raw xfs_io command
but not in the xfs_bmap wrapper.

xfs_io -c 'bmap -c -e -l -p -v' test.img

> > If you happen to have rmap enabled, you can use the xfs_io fsmap command
> > to look for 'cow reservation' blocks, since that 124k is (according to
> > ondisk metadata, anyway) owned by the refcount btree until it gets
> > remapped into the file on writeback.
> 
> I see. By default, on RHEL at least, rmapbt is disabled. As a side note, do
> you suggest enabling it when creating a new fs?

If you are interested in online scrub, then I'd say yes because it's the
secret sauce that gives online metadata checking most of its power.  I
confess, I haven't done a lot of performance analysis of rmap lately,
the metadata ops overhead might still be in the ~10% range.

The two issues preventing rmap from being turned on by default (at least
in my head) are (1) scrub itself is still EXPERIMENTAL and (2) it's not
100% clear that online fsck is such a killer app that everyone will want
it, since you always pay the performance overhead of enabling rmap
regardless of whether you use xfs_scrub.

(If your workload is easily restored from backup/Docker and you need all
the performance you can squeeze then perhaps don't enable this.)

Note that I've been running with rmap=1 and scrub=1 on all systems since
2016, and frankly I've noticed the system stumbling over broken
writeback throttling much more than whatever the tracing tools attribute
to rmap.

--D

> Thanks.
> 
> -- 
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-13 18:09               ` Darrick J. Wong
@ 2020-01-14  8:45                 ` Gionatan Danti
  2020-01-15 11:37                   ` Gionatan Danti
  0 siblings, 1 reply; 20+ messages in thread
From: Gionatan Danti @ 2020-01-14  8:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 13/01/20 19:09, Darrick J. Wong wrote:
> xfs_io -c 'bmap -c -e -l -p -v <whatever>' test.img

Ok, good to know. Thanks.

> If you are interested in online scrub, then I'd say yes because it's the
> secret sauce that gives online metadata checking most of its power.  I
> confess, I haven't done a lot of performance analysis of rmap lately,
> the metadata ops overhead might still be in the ~10% range.
> 
> The two issues preventing rmap from being turned on by default (at least
> in my head) are (1) scrub itself is still EXPERIMENTAL and (2) it's not
> 100% clear that online fsck is such a killer app that everyone will want
> it, since you always pay the performance overhead of enabling rmap
> regardless of whether you use xfs_scrub.

Well, I really think online scrub, when ready, will be a killer feature. 
So, for a "mere" 10% performance penalty, I would enable rmapbt unless 
there is a concrete chance of exposing some bug.

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-14  8:45                 ` Gionatan Danti
@ 2020-01-15 11:37                   ` Gionatan Danti
  2020-01-15 16:39                     ` Darrick J. Wong
  0 siblings, 1 reply; 20+ messages in thread
From: Gionatan Danti @ 2020-01-15 11:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 14/01/20 09:45, Gionatan Danti wrote:
> On 13/01/20 19:09, Darrick J. Wong wrote:
>> xfs_io -c 'bmap -c -e -l -p -v <whatever>' test.img
> 
> Ok, good to know. Thanks.

Hi all, I have an additional question about extszinherit/extsize.

If I understand it correctly, by default it is 0: any non-EOF write on 
a sparse file allocates only as much space as it needs. If these writes 
are random and small enough (e.g. 4k random writes), a subsequent 
sequential read of the same file will have much lower performance 
(because sequential IOs are transformed into random accesses by the 
logical/physical block remapping).

Setting a 128K extszinherit (for the entire filesystem) or extsize (for 
a file/dir) will markedly improve the situation, as much bigger 
contiguous LBA regions can be read with each IO (note: I know SSDs and 
NVMe disks are much less impacted by fragmentation, but I am mainly 
speaking about HDDs here).

So, my question: is there anything wrong with, or anything I should be 
aware of when, using a 128K extsize, i.e. setting it the same as 
cowextsize? The only drawback I can think of is a coarser granularity 
when allocating from the sparse file (i.e. a 4k write will allocate a 
full 128k extent).

Am I missing something?
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-15 11:37                   ` Gionatan Danti
@ 2020-01-15 16:39                     ` Darrick J. Wong
  2020-01-15 17:45                       ` Gionatan Danti
  0 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2020-01-15 16:39 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-xfs

On Wed, Jan 15, 2020 at 12:37:52PM +0100, Gionatan Danti wrote:
> On 14/01/20 09:45, Gionatan Danti wrote:
> > On 13/01/20 19:09, Darrick J. Wong wrote:
> > > xfs_io -c 'bmap -c -e -l -p -v <whatever>' test.img
> > 
> > Ok, good to know. Thanks.
> 
> Hi all, I have an additional question about extszinherit/extsize.
> 
> If I understand it correctly, by default it is 0: any non-EOF writes on a
> sparse file will allocate how much space it needs. If these writes are
> random and small enough (ie: 4k random writes), a subsequent sequential read
> of the same file will have much lower performance (because sequential IO are
> transformed in random accesses by the logical/physical block remapping).
> 
> Setting a 128K extszinherit (for the entire filesystem) or extsize (for a
> file/dir) will markedly improve the situation, as much bigger contiguous LBA
> regions can be read for each IO (note: I know SSD and NVME disks are much
> less impacted by fragmentation, but I am mainly speaking about HDD here).
> 
> So, my question: there is anything wrong and/or I should be aware when using
> a 128K extsize, so setting it the same as cowextsize? The only possible
> drawback I can think is a coarse granularity when allocating from the sparse
> file (ie: a 4k write will allocate the full 128k extent).
> 
> Am I missing something?

extszinherit > 0 disables delayed allocation, which means that (in your
case above) if you wrote 1G to a file (using the pagecache) you'd get
8192x 128K calls to the allocator instead of making a single 1G
allocation during writeback.  If you have a lot of memory (or a high vmm
dirty ratio) then you want delalloc over extsize.  Most of the time you
want delalloc, frankly.

--D

> Thanks.
> 
> -- 
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-15 16:39                     ` Darrick J. Wong
@ 2020-01-15 17:45                       ` Gionatan Danti
  2020-01-17 21:58                         ` Gionatan Danti
  0 siblings, 1 reply; 20+ messages in thread
From: Gionatan Danti @ 2020-01-15 17:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, g.danti

Il 15-01-2020 17:39 Darrick J. Wong ha scritto:
> extszinherit > 0 disables delayed allocation, which means that (in your
> case above) if you wrote 1G to a file (using the pagecache) you'd get
> 8192x 128K calls to the allocator instead of making a single 1G
> allocation during writeback.

Thanks for the valuable information; I did not know about that specific 
interaction between extsize and delalloc.

> If you have a lot of memory (or a high vmm
> dirty ratio) then you want delalloc over extsize.  Most of the time you
> want delalloc, frankly.

Let me briefly describe the expected workload: thinly provisioned 
virtual image storage. The problem with a "plain" sparse file (i.e. 
without an extsize hint) is that, after some time, the underlying vdisk 
file will be very fragmented: consecutive physical blocks will be 
assigned to very different logical blocks, leading to sub-par performance 
when reading back the whole file (e.g. for backup purposes).

I can easily simulate a worst-case scenario with fio, issuing random 
writes to a pre-created sparse file. While the random writes complete 
very fast (because they are written more or less sequentially inside the 
sparse file), reading back that file has very low performance: 10 
MB/s vs 600+ MB/s for a preallocated file.
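
A hedged fio job along those lines (file name and size are made up;
'fallocate=none' keeps the file sparse so physical allocation order
follows the random write order, which is what produces the fragmentation):

```shell
# Write the job file; only run it where fio happens to be installed.
cat > randfill.fio <<'EOF'
[randfill]
filename=sparse.img
size=16m
rw=randwrite
bs=4k
ioengine=psync
fallocate=none
EOF
command -v fio >/dev/null && fio randfill.fio || echo "fio not installed; job file only"
```

Reading sparse.img back sequentially afterwards (e.g. with dd) shows the
throughput collapse described above on rotating disks.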

Using a 128k extsize brings sequential reads to ~100 MB/s (which is 
reasonable on that old hardware), and a 16M extsize is in the range of 
500+ MB/s.

Given that use case, do you suggest sticking with delalloc or setting an 
appropriate extsize?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-15 17:45                       ` Gionatan Danti
@ 2020-01-17 21:58                         ` Gionatan Danti
  2020-01-17 23:42                           ` Darrick J. Wong
  0 siblings, 1 reply; 20+ messages in thread
From: Gionatan Danti @ 2020-01-17 21:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, g.danti

Il 15-01-2020 18:45 Gionatan Danti ha scritto:
> Let me briefly describe the expected workload: thinly provisioned
> virtual image storage. The problem with "plain" sparse file (ie:
> without extsize hint) is that, after some time, the underlying vdisk
> file will be very fragmented: consecutive physical blocks will be
> assigned to very different logical blocks, leading to sub-par
> performance when reading back the whole file (eg: for backup purpose).
> 
> I can easily simulate a worst-case scenario with fio, issuing random
> write to a pre-created sparse file. While the random writes complete
> very fast (because they are more-or-less sequentially written inside
> the sparse file), reading back that file will have very low
> performance: 10 MB/s vs 600+ MB/s for a preallocated file.

I would like to share some other observations/results, which I hope can 
be useful for other people.

Further testing shows that "cp --reflink" of a highly fragmented file is 
a relatively long operation, easily in the range of 30s or more, during 
which the guest virtual machine is basically denied any access to the 
underlying virtual disk file.

While the number of fragments required to reach a reflink time of 30+ 
seconds is very high, this would be quite a common case when using 
thinly provisioned virtual disk files. With a sparse file, any write 
done at the guest OS level has a very good chance of creating its own 
fragment (ie: allocating a chunk that is discontiguous in the 
logical/physical block mapping), leading to very fragmented files.

So, back to the main topic: reflink is an invaluable tool, to be used 
*with* (rather than instead of) thin lvm:
- thinlvm is the right tool for taking rolling volume snapshots;
- reflink is extremely useful for "on-demand" snapshots of key files.

Thank you all for the very detailed and useful information you provided.
Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-17 21:58                         ` Gionatan Danti
@ 2020-01-17 23:42                           ` Darrick J. Wong
  2020-01-18 11:08                             ` Gionatan Danti
  0 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2020-01-17 23:42 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-xfs

On Fri, Jan 17, 2020 at 10:58:15PM +0100, Gionatan Danti wrote:
> Il 15-01-2020 18:45 Gionatan Danti ha scritto:
> > Let me briefly describe the expected workload: thinly provisioned
> > virtual image storage. The problem with "plain" sparse file (ie:
> > without extsize hint) is that, after some time, the underlying vdisk
> > file will be very fragmented: consecutive physical blocks will be
> > assigned to very different logical blocks, leading to sub-par
> > performance when reading back the whole file (eg: for backup purpose).
> > 
> > I can easily simulate a worst-case scenario with fio, issuing random
> > write to a pre-created sparse file. While the random writes complete
> > very fast (because they are more-or-less sequentially written inside
> > the sparse file), reading back that file will have very low
> > performance: 10 MB/s vs 600+ MB/s for a preallocated file.
> 
> I would like to share some other observation/results, which I hope can be
> useful for other peoples.
> 
> Further testing shows that "cp --reflink" an highly fragmented files is a
> relatively long operation, easily in the range of 30s or more, during which
> the guest virtual machine is basically denied any access to the underlying
> virtual disk file.

How many fragments, and how big of a sparse file?

--D

> While the number of fragments required to reach reflink time of 30+ seconds
> is very high, this would be a quite common case when using thinly
> provisioned virtual disk files. With sparse file, any write done at guest OS
> level has a very good chance to create its own fragment (ie: allocating a
> discontiguous chunk as seen by logical/physical block mapping), leading to
> very fragmented files.
> 
> So, back to main topic: reflink is an invaluable tool, to be used *with*
> (rather than instead of) thin lvm:
> - thinlvm is the right tool for taking rolling volume snapshot;
> - reflink is extremely useful for "on-demand" snapshot of key files.
> 
> Thank you all for the very detailed and useful information you provided.
> Regards.
> 
> -- 
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-17 23:42                           ` Darrick J. Wong
@ 2020-01-18 11:08                             ` Gionatan Danti
  2020-01-18 23:06                               ` Darrick J. Wong
  0 siblings, 1 reply; 20+ messages in thread
From: Gionatan Danti @ 2020-01-18 11:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, g.danti

Il 18-01-2020 00:42 Darrick J. Wong ha scritto:
> How many fragments, and how big of a sparse file?

A freshly installed CentOS 8 guest using a 20 GB sparse file vdisk had 
about 2000 fragments.

After running "fio --name=test --filename=test.img --rw=randwrite 
--size=4G" for about 30 mins, it ended up with over 1M 
fragments/extents. At that point, reflinking that file took over 2 
mins, and unlinking it about 4 mins.

I understand fio's randwrite pattern is a worst-case scenario; still, 
I think the results are interesting and telling for "aged" virtual 
machines.

As a side note, a freshly installed Win2019 guest backed by an 80 GB 
sparse file had about 18000 fragments.
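For anyone wanting to reproduce the measurement, a sketch of commands 
that count a file's extents (printed rather than executed, since they 
need a real file on XFS; test.img is the fio file from above):

```shell
# Dry-run sketch: echo the inspection commands instead of running them.

# filefrag prints a one-line extent count summary:
echo filefrag test.img

# xfs_bmap -v prints one line per extent after its header lines,
# so the extent count is roughly:
echo 'xfs_bmap -v test.img | tail -n +3 | wc -l'
```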
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-18 11:08                             ` Gionatan Danti
@ 2020-01-18 23:06                               ` Darrick J. Wong
  2020-01-19  8:45                                 ` Gionatan Danti
  0 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2020-01-18 23:06 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-xfs

On Sat, Jan 18, 2020 at 12:08:48PM +0100, Gionatan Danti wrote:
> Il 18-01-2020 00:42 Darrick J. Wong ha scritto:
> > How many fragments, and how big of a sparse file?
> 
> A just installed CentOS 8 guest using a 20 GB sparse file vdisk had about
> 2000 fragments.
> 
> After running "fio --name=test --filename=test.img --rw=randwrite --size=4G"

4GB / 1M extents == 4096, which is probably the fs blocksize :)

I wonder, do you get different results if you set an extent size hint
on the dir before running fio?

I forgot(?) to mention that if you're mostly dealing with sparse VM
images then you might as well set an extent size hint and forgo delayed
allocation because it won't help you much.
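The division above generalizes to other hint sizes; here is a quick 
worked sketch of the worst-case extent count for the 4 GiB fio file at 
a few allocation granularities (assuming every extent ends up exactly 
one allocation unit long):

```shell
# Worst case: every extent is exactly one allocation unit.
file_size=$((4 * 1024 * 1024 * 1024))   # the 4 GiB fio test file

# 4 KiB blocks, a 128 KiB hint, and a 16 MiB hint:
# prints 1048576, 32768 and 256 extents respectively.
for gran in $((4 * 1024)) $((128 * 1024)) $((16 * 1024 * 1024)); do
    printf '%8d-byte granularity -> %7d extents\n' \
        "$gran" $((file_size / gran))
done
```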

--D

> for about 30 mins, it ended with over 1M fragments/extents. At that point,
> reflinking that file took over 2 mins, and unlinking it about 4 mins.
> 
> I understand fio randwrite pattern is a worst case scenario; still, I think
> the results are interesting and telling for "aged" virtual machines.
> 
> As a side note, a just installed Win2019 guest backed with an 80 GB sparse
> file had about 18000 fragments.
> Thanks.
> 
> -- 
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8


* Re: XFS reflink vs ThinLVM
  2020-01-18 23:06                               ` Darrick J. Wong
@ 2020-01-19  8:45                                 ` Gionatan Danti
  0 siblings, 0 replies; 20+ messages in thread
From: Gionatan Danti @ 2020-01-19  8:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, g.danti

Il 19-01-2020 00:06 Darrick J. Wong ha scritto:
> 4GB / 1M extents == 4096, which is probably the fs blocksize :)

Yes, I made the same observation: due to random allocation, the 
underlying vdisk had block-sized extents.

> I wonder, do you get different results if you set an extent size hint
> on the dir before running fio?

Yes: setting extsize to 128K strongly reduces the number of allocated 
extents (eg: 4G / 128K = 32K extents). A similar result can be 
obtained by tapping into cowextsize and reflinking the original file 
via cp --reflink: any subsequent 4K write inside the guest will then 
cause a 128K CoW allocation (with the default setting) on the backing 
file.
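As a sketch, the cowextsize variant would look something like this 
(commands printed rather than executed, since they need a file on a 
real XFS filesystem; vdisk.img is a hypothetical path):

```shell
# Dry-run sketch: echo the commands instead of running them.

# Set a 128 KiB CoW extent size hint on the backing file...
echo xfs_io -c 'cowextsize 128k' vdisk.img

# ...then take the snapshot; subsequent 4K guest writes will CoW
# in 128 KiB chunks rather than single 4 KiB blocks.
echo cp --reflink=always vdisk.img vdisk_snap.img
```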

However, while *much* better, it is my understanding that XFS reflink 
is a variable-duration operation: as every extent has to be 
scanned/reflinked, the reflink time is not constant, and in the 
meantime it is impossible to read from or write to the reflinked file. 
Am I right?

On the other hand, thinlvm snapshots, operating at the block level, 
are a (more-or-less) constant-time operation, causing much less 
disruption to the normal IO flow of the guest volumes.

I absolutely don't want to downplay reflink's usefulness; it is an 
extremely valuable feature which can be put to very good use.

> I forgot(?) to mention that if you're mostly dealing with sparse VM
> images then you might as well set a extent size hint and forego delayed
> allocation because it won't help you much.

This was my conclusion as well.
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8



Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-13 10:22 XFS reflink vs ThinLVM Gionatan Danti
2020-01-13 11:10 ` Carlos Maiolino
2020-01-13 11:25   ` Gionatan Danti
2020-01-13 11:43     ` Carlos Maiolino
2020-01-13 12:21       ` Gionatan Danti
2020-01-13 15:34         ` Gionatan Danti
2020-01-13 16:53           ` Darrick J. Wong
2020-01-13 17:00             ` Gionatan Danti
2020-01-13 18:09               ` Darrick J. Wong
2020-01-14  8:45                 ` Gionatan Danti
2020-01-15 11:37                   ` Gionatan Danti
2020-01-15 16:39                     ` Darrick J. Wong
2020-01-15 17:45                       ` Gionatan Danti
2020-01-17 21:58                         ` Gionatan Danti
2020-01-17 23:42                           ` Darrick J. Wong
2020-01-18 11:08                             ` Gionatan Danti
2020-01-18 23:06                               ` Darrick J. Wong
2020-01-19  8:45                                 ` Gionatan Danti
2020-01-13 16:14 ` Chris Murphy
2020-01-13 16:25   ` Gionatan Danti
