* Large files, nodatacow and fragmentation
@ 2014-08-11 18:36 G. Richard Bellamy
  2014-08-11 19:14 ` Roman Mamedov
  0 siblings, 1 reply; 20+ messages in thread
From: G. Richard Bellamy @ 2014-08-11 18:36 UTC (permalink / raw)
  To: linux-btrfs

I've been playing with btrfs as a backing store for my KVM images.

I've used 'chattr +C' on the directory where those images are stored.

You can see my recipe below [1]. I've read the gotchas found here [2]

I'm having continuing performance issues inside the Guest VM that is
created inside the btrfs subvolume, using a qcow2 format. I'm having a
hard time determining whether the issues are related to KVM or btrfs,
or if this is even a reasonable topic of discussion.

I've seen the comments on this list saying that if I want a COW
filesystem with sparse files, that I'd be better off with ZFS. I'd
like to use an in-tree COW filesystem, but if it's just not gonna
happen yet with btrfs, I guess that's just the way it is.

That being said, how would I determine what the root issue is?
Specifically, the qcow2 file in question seems to have increasing
fragmentation, even with the No_COW attr.

[1]
$ mkfs.btrfs -m raid10 -d raid10 /dev/sda /dev/sdb /dev/sdc /dev/sdd
$ mount /dev/sda /mnt
$ cd /mnt
$ btrfs subvolume create __data
$ btrfs subvolume create __data/libvirt
$ cd /
$ umount /mnt
$ mount /dev/sda /var/lib/libvirt
$ chattr +C /var/lib/libvirt/images
$ cp /run/media/rbellamy/433acf1d-a1a4-4596-a6a7-005e643b24e0/libvirt/images/atlas.qcow2
/var/lib/libvirt/images/
$ filefrag /var/lib/libvirt/images/atlas.qcow2
/var/lib/libvirt/images/atlas.qcow2: 0 extents found
[START UP THE VM - DO SOME THINGS]
$ filefrag /var/lib/libvirt/images/atlas.qcow2
/var/lib/libvirt/images/atlas.qcow2: 12236 extents found
[START UP THE VM - DO SOME THINGS]
$ filefrag /var/lib/libvirt/images/atlas.qcow2
/var/lib/libvirt/images/atlas.qcow2: 34988 extents found

[2]
https://btrfs.wiki.kernel.org/index.php/Gotchas

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-08-11 18:36 Large files, nodatacow and fragmentation G. Richard Bellamy
@ 2014-08-11 19:14 ` Roman Mamedov
  2014-08-11 21:37   ` G. Richard Bellamy
  2014-08-11 23:31   ` Chris Murphy
  0 siblings, 2 replies; 20+ messages in thread
From: Roman Mamedov @ 2014-08-11 19:14 UTC (permalink / raw)
  To: G. Richard Bellamy; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2003 bytes --]

On Mon, 11 Aug 2014 11:36:46 -0700
"G. Richard Bellamy" <rbellamy@pteradigm.com> wrote:

> I've been playing with btrfs as a backing store for my KVM images.
> 
> I've used 'chattr +C' on the directory where those images are stored.
> 
> You can see my recipe below [1]. I've read the gotchas found here [2]
> 
> I'm having continuing performance issues inside the Guest VM that is
> created inside the btrfs subvolume, using a qcow2 format. I'm having a
> hard time determining whether the issues are related to KVM or btrfs,
> or if this is even a reasonable topic of discussion.
> 
> I've seen the comments on this list saying that if I want a COW
> filesystem with sparse files, that I'd be better off with ZFS. I'd
> like to use an in-tree COW filesystem, but if it's just not gonna
> happen yet with btrfs, I guess that's just the way it is.
> 
> That being said, how would I determine what the root issue is?
> Specifically, the qcow2 file in question seems to have increasing
> fragmentation, even with the No_COW attr.

First of all, why do you require a COW filesystem in the first place... if all
you do is just use it in a NoCOW mode?

Second, why qcow2? It can also have internal fragmentation which is unlikely to
do anything good for performance.

Try RAW format images; to reduce the space requirements, with the latest
Qemu/KVM you can pass through the TRIM command from inside the VM guest (at least
in the IDE controller mode) so that the backing filesystem will unmap areas
that are no longer in use inside the VM, in effect "re-sparsifying" the image.
This is VERY nifty. But yeah this can cause some fragmentation even with NoCOW.
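
(If you go that route: the knob is the discard/unmap option, and the
exact spelling depends on your QEMU/libvirt versions. In the libvirt
disk definition it is something along the lines of

  <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>

or discard=unmap on a plain -drive line, plus the guest actually
issuing TRIM (e.g. fstrim on a Linux guest). The values above are only
illustrative.)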

In my personal use case NoCOW is only utilized partly, because all subvolumes
with running VMs are being snapshotted about every 30 minutes, and those
snapshots are kept for two weeks. The performance is passable; at least when
using KVM's "cache=writeback" mode (or less safe ones).
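
(For the curious, a rotation like that can be done with something as
simple as a cron job along these lines; the paths are purely examples:

  btrfs subvolume snapshot -r /mnt/vms /mnt/snapshots/vms-$(date +%Y%m%d-%H%M)

plus a matching 'btrfs subvolume delete' for anything older than two
weeks.)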

-- 
With respect,
Roman

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-08-11 19:14 ` Roman Mamedov
@ 2014-08-11 21:37   ` G. Richard Bellamy
  2014-08-11 23:31   ` Chris Murphy
  1 sibling, 0 replies; 20+ messages in thread
From: G. Richard Bellamy @ 2014-08-11 21:37 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-btrfs

On Mon, Aug 11, 2014 at 12:14 PM, Roman Mamedov <rm@romanrm.net> wrote:
>
> First of all, why do you require a COW filesystem in the first place... if all
> you do is just use it in a NoCOW mode?
>
> Second, why qcow2? It can also have internal fragmentation which is unlikely to
> do anything good for performance.

Both great questions. I'm experimenting with btrfs, and the various
permutations of btrfs with KVM.

So, why btrfs vs lvm or ext4:
1. Because nocow isn't all I'm doing with that filesystem.
2. I like the way btrfs subvolumes work, vs lvm. I can have the nocow
files in one subvolume, and still get great snapshot performance out
of others.
3. I get the performance of a raid10 without the lvm management
overhead, plus online rebalancing and easy online resizing (see the
commands sketched right after this list).
4. And frankly, I just kinda want to make it work.
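
For point 3, the commands I have in mind are just the usual online
operations, roughly (device names and mount point are placeholders):

$ btrfs balance start /var/lib/libvirt           # online rebalance/restripe
$ btrfs device add /dev/sde /var/lib/libvirt     # grow the pool online
$ btrfs filesystem resize max /var/lib/libvirt   # online resize after enlarging an underlying device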

>
> Try RAW format images; to reduce the space requirements, with the latest
> Qemu/KVM you can pass through the TRIM command from inside the VM guest (at least
> in the IDE controller mode) so that the backing filesystem will unmap areas
> that are no longer in use inside the VM, in effect "re-sparsifying" the image.
> This is VERY nifty. But yeah this can cause some fragmentation even with NoCOW.
>
> In my personal use case NoCOW is only utilized partly, because all subvolumes
> with running VMs are being snapshotted about every 30 minutes, and those
> snapshots are kept for two weeks. The performance is passable; at least when
> using KVM's "cache=writeback" mode (or less safe ones).

I've done my reading on qcow2 vs raw, and it indicated that while raw
gives better performance, the difference isn't significant enough to
give up the ability to take a qemu snapshot. I've not done the analysis
myself, so I could be reading things wrong.

There's a great thread, "Are nocow files snapshot-aware?" [1]. My take
from that reading is that doing a btrfs snapshot of a nocow file seems
reasonable on a semi-regular basis, but DON'T do it every 30 seconds.

Also that whole thread is predicated on the idea that your nocow files
are themselves managed by a process/system that can read and write to
them atomically, thus I decided against using the raw format.

>
> --
> With respect,
> Roman

Thanks Roman. But really we haven't addressed my original question,
which is - how would I determine the root cause of the fragmentation
in this nocow file on top of a btrfs subvolume?

[1] http://www.spinics.net/lists/linux-btrfs/msg31341.html

Kind Regards,
Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-08-11 19:14 ` Roman Mamedov
  2014-08-11 21:37   ` G. Richard Bellamy
@ 2014-08-11 23:31   ` Chris Murphy
  2014-08-14  3:57     ` G. Richard Bellamy
  1 sibling, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2014-08-11 23:31 UTC (permalink / raw)
  To: linux-btrfs


On Aug 11, 2014, at 1:14 PM, Roman Mamedov <rm@romanrm.net> wrote:


> 
> Second, why qcow2? It can also have internal fragmentation which is unlikely to
> do anything good for performance.

It really depends on what version of libvirt and qemu-img you've got. I did some testing during Fedora 20 prior to release, and the best results for my configuration (a laptop with an HDD at that time) were: btrfs host, +C qcow2, btrfs guest, both with 16KB leaf size, and the drive pointing to the qcow2 file with the cache policy set to unsafe. And even when obliterating the VM while writing data, I never lost the guest Btrfs file system. Not that I recommend it; the cache policy is unsafe, after all. I did lose some data, but it was limited to commit time. We're not talking huge differences; the metric I was using was installing Fedora 20, timed by the installer log start/stop times for the unattended portion of the install. It also matters somewhat to pre-allocate metadata when creating the qcow2 file.
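
For reference, the 16KB leaf size is a mkfs-time choice; on the
btrfs-progs of that era it was set with something like the following
(device is a placeholder; newer btrfs-progs spell the option
-n/--nodesize and default to 16K anyway):

$ mkfs.btrfs -l 16384 /dev/sdX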

I also tested XFS on XFS, ext4 on ext4, also in qcow2. And also on raw images. And also on LV's. I'd think the LV would have been faster since it completely eliminates one of the file systems (there is no host fs).

Anyway, what I determined was the only way to know is to actually test your workload, or a good approximation of it, with various configurations.

And another test is LVM thinp LVs once libvirt has support for using them (which may already have happened, I haven't revisited this since Oct 2013 testing), because those snapshots should be as usable as Btrfs snapshots, unlike conventional LVM snapshots which are slow and need explicit preallocation.


Chris Murphy


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-08-11 23:31   ` Chris Murphy
@ 2014-08-14  3:57     ` G. Richard Bellamy
  2014-08-14  4:23       ` Chris Murphy
  0 siblings, 1 reply; 20+ messages in thread
From: G. Richard Bellamy @ 2014-08-14  3:57 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On Mon, Aug 11, 2014 at 11:36 AM, G. Richard Bellamy
<rbellamy@pteradigm.com> wrote:
> That being said, how would I determine what the root issue is?
> Specifically, the qcow2 file in question seems to have increasing
> fragmentation, even with the No_COW attr.
>
> [1]
> $ mkfs.btrfs -m raid10 -d raid10 /dev/sda /dev/sdb /dev/sdc /dev/sdd
> $ mount /dev/sda /mnt
> $ cd /mnt
> $ btrfs subvolume create __data
> $ btrfs subvolume create __data/libvirt
> $ cd /
> $ umount /mnt
> $ mount /dev/sda /var/lib/libvirt
> $ chattr +C /var/lib/libvirt/images
> $ cp /run/media/rbellamy/433acf1d-a1a4-4596-a6a7-005e643b24e0/libvirt/images/atlas.qcow2
> /var/lib/libvirt/images/
> $ filefrag /var/lib/libvirt/images/atlas.qcow2
> /var/lib/libvirt/images/atlas.qcow2: 0 extents found
> [START UP THE VM - DO SOME THINGS]
> $ filefrag /var/lib/libvirt/images/atlas.qcow2
> /var/lib/libvirt/images/atlas.qcow2: 12236 extents found
> [START UP THE VM - DO SOME THINGS]
> $ filefrag /var/lib/libvirt/images/atlas.qcow2
> /var/lib/libvirt/images/atlas.qcow2: 34988 extents found

I appreciate the information to date.

$ filefrag /var/lib/libvirt/images/atlas.qcow2
/var/lib/libvirt/images/atlas.qcow2: 39738 extents found

That's after a few reboots and limited access to the guest vm OS.

So, I think my question still stands: how can I determine empirically
what is causing the fragmentation? Or maybe a better question is, what
would be reasonable fragmentation of a nocow file in btrfs? For that
matter, what about a regular cow file in btrfs? I'm willing to read
technical documentation or source if that is where I need to go to
become better educated, so any pointers at all will be very much
appreciated.

-rb


On Mon, Aug 11, 2014 at 4:31 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Aug 11, 2014, at 1:14 PM, Roman Mamedov <rm@romanrm.net> wrote:
>
>
>>
>> Second, why qcow2? It can also have internal fragmentation which is unlikely to
>> do anything good for performance.
>
> It really depends on what version of libvirt and qemu-img you've got. I did some testing during Fedora 20 prior to release, and the best results for my configuration (a laptop with an HDD at that time) were: btrfs host, +C qcow2, btrfs guest, both with 16KB leaf size, and the drive pointing to the qcow2 file with the cache policy set to unsafe. And even when obliterating the VM while writing data, I never lost the guest Btrfs file system. Not that I recommend it; the cache policy is unsafe, after all. I did lose some data, but it was limited to commit time. We're not talking huge differences; the metric I was using was installing Fedora 20, timed by the installer log start/stop times for the unattended portion of the install. It also matters somewhat to pre-allocate metadata when creating the qcow2 file.
>
> I also tested XFS on XFS, ext4 on ext4, also in qcow2. And also on raw images. And also on LV's. I'd think the LV would have been faster since it completely eliminates one of the file systems (there is no host fs).
>
> Anyway, what I determined was the only way to know is to actually test your workload, or a good approximation of it, with various configurations.
>
> And another test is LVM thinp LVs once libvirt has support for using them (which may already have happened, I haven't revisited this since Oct 2013 testing), because those snapshots should be as usable as Btrfs snapshots, unlike conventional LVM snapshots which are slow and need explicit preallocation.
>
>
> Chris Murphy
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-08-14  3:57     ` G. Richard Bellamy
@ 2014-08-14  4:23       ` Chris Murphy
  2014-08-14 14:30         ` G. Richard Bellamy
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2014-08-14  4:23 UTC (permalink / raw)
  To: G. Richard Bellamy; +Cc: linux-btrfs


On Aug 13, 2014, at 9:57 PM, "G. Richard Bellamy" <rbellamy@pteradigm.com> wrote:

> On Mon, Aug 11, 2014 at 11:36 AM, G. Richard Bellamy
> <rbellamy@pteradigm.com> wrote:
>> That being said, how would I determine what the root issue is?
>> Specifically, the qcow2 file in question seems to have increasing
>> fragmentation, even with the No_COW attr.
>> 
>> [1]
>> $ mkfs.btrfs -m raid10 -d raid10 /dev/sda /dev/sdb /dev/sdc /dev/sdd
>> $ mount /dev/sda /mnt
>> $ cd /mnt
>> $ btrfs subvolume create __data
>> $ btrfs subvolume create __data/libvirt
>> $ cd /
>> $ umount /mnt
>> $ mount /dev/sda /var/lib/libvirt
>> $ chattr +C /var/lib/libvirt/images
>> $ cp /run/media/rbellamy/433acf1d-a1a4-4596-a6a7-005e643b24e0/libvirt/images/atlas.qcow2
>> /var/lib/libvirt/images/
>> $ filefrag /var/lib/libvirt/images/atlas.qcow2
>> /var/lib/libvirt/images/atlas.qcow2: 0 extents found

Sorta weird, it ought to have at least 1. Or maybe it wasn't sync'd yet from the copy.

>> [START UP THE VM - DO SOME THINGS]
>> $ filefrag /var/lib/libvirt/images/atlas.qcow2
>> /var/lib/libvirt/images/atlas.qcow2: 12236 extents found
>> [START UP THE VM - DO SOME THINGS]
>> $ filefrag /var/lib/libvirt/images/atlas.qcow2
>> /var/lib/libvirt/images/atlas.qcow2: 34988 extents found
> 
> I appreciate the information to date.

lsattr /var/lib/libvirt/images/atlas.qcow2

Is the xattr actually in place on that file?
> 
> So, I think my question still stands: how can I determine empirically
> what is causing the fragmentation?

It will fragment somewhat but I can't say that I've seen this much fragmentation with xattr C applied to qcow2. What's the workload? How was the qcow2 created? I recommend -o preallocation=metadata,compat=1.1,lazy_refcounts=on when creating it. My workloads were rather simplistic: OS installs and reinstalls. What's the filesystem being used in the guest that's using the qcow2 as backing?
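
Concretely, that is along the lines of the following when creating the
image; the name and size here are just placeholders:

$ qemu-img create -f qcow2 -o preallocation=metadata,compat=1.1,lazy_refcounts=on atlas.qcow2 350G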

It might be that your workload is best suited for a preallocated raw file that inherits +C, or even possibly an LV.


Chris Murphy


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-08-14  4:23       ` Chris Murphy
@ 2014-08-14 14:30         ` G. Richard Bellamy
  2014-08-14 15:05           ` Austin S Hemmelgarn
  2014-08-14 18:40           ` Chris Murphy
  0 siblings, 2 replies; 20+ messages in thread
From: G. Richard Bellamy @ 2014-08-14 14:30 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On Wed, Aug 13, 2014 at 9:23 PM, Chris Murphy <lists@colorremedies.com> wrote:
> lsattr /var/lib/libvirt/images/atlas.qcow2
>
> Is the xattr actually in place on that file?

2014-08-14 07:07:36
$ filefrag /var/lib/libvirt/images/atlas.qcow2
/var/lib/libvirt/images/atlas.qcow2: 46378 extents found
2014-08-14 07:08:34
$ lsattr /var/lib/libvirt/images/atlas.qcow2
---------------C /var/lib/libvirt/images/atlas.qcow2

So, yeah, the attribute is set.

>
> It will fragment somewhat but I can't say that I've seen this much fragmentation with xattr C applied to qcow2. What's the workload? How was the qcow2 created? I recommend -o preallocation=metadata,compat=1.1,lazy_refcounts=on when creating it. My workloads were rather simplistic: OS installs and reinstalls. What's the filesystem being used in the guest that's using the qcow2 as backing?

When I created the file, I definitely preallocated the metadata, but
did not set compat or lazy_refcounts. However, isn't that more a
function of how qemu + KVM manage the image, rather than of btrfs?
This is a p2v target, if that matters. Workload has been minimal since
virtualizing because I have yet to get usable performance with this
configuration. The filesystem in the guest is Win7 NTFS. I have seen
massive thrashing of the underlying volume during VSS operations in
the guest, if that signifies.

>
> It might be that your workload is best suited for a preallocated raw file that inherits +C, or even possibly an LV.

I'm close to that decision. As I mentioned, I much prefer the btrfs
subvolume story over lvm, so moving to raw is probably more desirable
than that... however, then I run into my lack of understanding of the
difference between qcow2 and raw with respect to recoverability, e.g.
does raw have the same ACID characteristics as a qcow2 image, or is
atomicity a completely separate concern from the format? The ability
for the owning process to recover from corruption or inconsistency is
a key factor in deciding whether or not to turn COW off in btrfs - if
your overlying system is capable of such recovery, like a database
engine or (presumably) virtualization layer, then COW isn't a
necessary function from the underlying system.

So, just since I started this reply, you can see the difference in
fragmentation:
2014-08-14 07:25:04
$ filefrag /var/lib/libvirt/images/atlas.qcow2
/var/lib/libvirt/images/atlas.qcow2: 46461 extents found

That's 17 minutes, an OS without interaction (I wasn't doing anything
with it, but it may have been doing its own work like updates, etc.),
and I see a fragmentation increase of 83 extents, and a raid10 volume
that was beating itself up (I could hear the drives chattering away as
they worked).

-rb

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-08-14 14:30         ` G. Richard Bellamy
@ 2014-08-14 15:05           ` Austin S Hemmelgarn
  2014-08-14 18:15             ` G. Richard Bellamy
  2014-08-14 18:40           ` Chris Murphy
  1 sibling, 1 reply; 20+ messages in thread
From: Austin S Hemmelgarn @ 2014-08-14 15:05 UTC (permalink / raw)
  To: G. Richard Bellamy, Chris Murphy; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4412 bytes --]

On 2014-08-14 10:30, G. Richard Bellamy wrote:
> On Wed, Aug 13, 2014 at 9:23 PM, Chris Murphy <lists@colorremedies.com> wrote:
>> lsattr /var/lib/libvirt/images/atlas.qcow2
>>
>> Is the xattr actually in place on that file?
> 
> 2014-08-14 07:07:36
> $ filefrag /var/lib/libvirt/images/atlas.qcow2
> /var/lib/libvirt/images/atlas.qcow2: 46378 extents found
> 2014-08-14 07:08:34
> $ lsattr /var/lib/libvirt/images/atlas.qcow2
> ---------------C /var/lib/libvirt/images/atlas.qcow2
> 
> So, yeah, the attribute is set.
> 
>>
>> It will fragment somewhat but I can't say that I've seen this much fragmentation with xattr C applied to qcow2. What's the workload? How was the qcow2 created? I recommend -o preallocation=metadata,compat=1.1,lazy_refcounts=on when creating it. My workloads were rather simplistic: OS installs and reinstalls. What's the filesystem being used in the guest that's using the qcow2 as backing?
> 
> When I created the file, I definitely preallocated the metadata, but
> did not set compat or lazy_refcounts. However, isn't that more a
> function of how qemu + KVM manage the image, rather than of btrfs?
> This is a p2v target, if that matters. Workload has been minimal since
> virtualizing because I have yet to get usable performance with this
> configuration. The filesystem in the guest is Win7 NTFS. I have seen
> massive thrashing of the underlying volume during VSS operations in
> the guest, if that signifies.
> 
>>
>> It might be that your workload is best suited for a preallocated raw file that inherits +C, or even possibly an LV.
> 
> I'm close to that decision. As I mentioned, I much prefer the btrfs
> subvolume story over lvm, so moving to raw is probably more desirable
> than that... however, then I run into my lack of understanding of the
> difference between qcow2 and raw with respect to recoverability, e.g.
> does raw have the same ACID characteristics as a qcow2 image, or is
> atomicity a completely separate concern from the format? The ability
> for the owning process to recover from corruption or inconsistency is
> a key factor in deciding whether or not to turn COW off in btrfs - if
> your overlying system is capable of such recovery, like a database
> engine or (presumably) virtualization layer, then COW isn't a
> necessary function from the underlying system.
> 
> So, just since I started this reply, you can see the difference in
> fragmentation:
> 2014-08-14 07:25:04
> $ filefrag /var/lib/libvirt/images/atlas.qcow2
> /var/lib/libvirt/images/atlas.qcow2: 46461 extents found
> 
> That's 17 minutes, an OS without interaction (I wasn't doing anything
> with it, but it may have been doing its own work like updates, etc.),
> and I see a fragmentation increase of 83 extents, and a raid10 volume
> that was beating itself up (I could hear the drives chattering away as
> they worked).
The fact that it is Windows using NTFS is probably part of the problem.
Here are some things you can do to decrease its background disk
utilization (these also improve performance on real hardware):
1. Disable system restore points.  These aren't really necessary if you
are running in a VM and can take snapshots from the host OS.
2. Disable the indexing service.  This does a lot of background disk IO,
and most people don't need the high speed search functionality.
3. Turn off Windows Features that you don't need.  This won't help disk
utilization much, but can greatly improve overall system performance.
4. Disable the paging file.  Windows does a lot of unnecessary
background paging, which can cause lots of unneeded disk IO.  Be careful
doing this however, as it may cause problems for memory hungry applications.
5. See if you can disable boot time services you don't need.  Bluetooth,
SmartCard, and Adaptive Screen Brightness are all things you probably
don't need in a VM environment.

Of these, 1, 2, and 4 will probably help the most.  The other thing is
that NTFS is a journaling file system, and putting a journaled file
system image on a COW backing store will always cause some degree of
thrashing, because the same few hundred MB of the disk get rewritten
over and over again, and the only way to work around that on BTRFS is to
make the file NOCOW, AND preallocate the entire file in one operation
(use the fallocate command from util-linux to do this).
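
As a rough sketch for a preallocated raw image (size and path are
placeholders):

$ touch /var/lib/libvirt/images/atlas.raw
$ chattr +C /var/lib/libvirt/images/atlas.raw    # must be set while the file is empty, or inherited from the directory
$ fallocate -l 350G /var/lib/libvirt/images/atlas.raw   # preallocate the whole file in one operation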



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2455 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-08-14 15:05           ` Austin S Hemmelgarn
@ 2014-08-14 18:15             ` G. Richard Bellamy
  0 siblings, 0 replies; 20+ messages in thread
From: G. Richard Bellamy @ 2014-08-14 18:15 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: Chris Murphy, linux-btrfs

On Thu, Aug 14, 2014 at 8:05 AM, Austin S Hemmelgarn
<ahferroin7@gmail.com> wrote:
> The fact that it is Windows using NTFS is probably part of the problem.
>  Here are some things you can do to decrease its background disk
> utilization (these also improve performance on real hardware):
> 1. Disable system restore points.  These aren't really necessary if you
> are running in a VM and can take snapshots from the host OS.

Great point - doing that.

> 2. Disable the indexing service.  This does a lot of background disk IO,
> and most people don't need the high speed search functionality.

Being a developer, while I usually don't need this functionality, when
I /do/ need it, I need it /right now/.

> 3. Turn off Windows Features that you don't need.  This won't help disk
> utilization much, but can greatly improve overall system performance.

Normal for me, so it's already done. Well, except for #1 above. ;)

> 4. Disable the paging file.  Windows does a lot of unnecessary
> background paging, which can cause lots of unneeded disk IO.  Be careful
> doing this however, as it may cause problems for memory hungry applications.

The VM is getting 16G, so I see no reason to keep a large paging file;
good call here as well. However, I have had some disk-related BSODs
which produced some MEMORY.DMP files I was planning to investigate if
all external troubleshooting paths were exhausted, and I wouldn't have
been able to capture those with a minimal page file. Also, I would
need to once again increase the page file if I wanted to create an
NMI-based crash dump.

> 5. See if you can disable boot time services you don't need.  Bluetooth,
> SmartCard, and Adaptive Screen Brightness are all things you probably
> don't need in a VM environment.

Done as a matter of course.

>
> Of these, 1, 2, and 4 will probably help the most.  The other thing is
> that NTFS is a journaling file system, and putting a journaled file
> system image on a COW backing store will always cause some degree of
> thrashing, because the same few hundred MB of the disk get rewritten
> over and over again, and the only way to work around that on BTRFS is to
> make the file NOCOW, AND preallocate the entire file in one operation
> (use the fallocate command from util-linux to do this).

I've been working through this for long enough that I've done both QEMU
preallocation and fallocate. The current image is large enough that I
just set the nocow attr on the target directory and copied the
preallocated qcow2 image from a USB drive... thus it wasn't
fallocate'd. I'm again hitting my ignorance threshold, so I'm unclear
whether fallocate will behave the same as a copy from usb/ext4 to my
btrfs subvolume.

-rb

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-08-14 14:30         ` G. Richard Bellamy
  2014-08-14 15:05           ` Austin S Hemmelgarn
@ 2014-08-14 18:40           ` Chris Murphy
  2014-08-14 23:16             ` G. Richard Bellamy
  1 sibling, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2014-08-14 18:40 UTC (permalink / raw)
  To: linux-btrfs


On Aug 14, 2014, at 8:30 AM, G. Richard Bellamy <rbellamy@pteradigm.com> wrote:

> This is a p2v target, if that matters. Workload has been minimal since
> virtualizing because I have yet to get usable performance with this
> configuration. The filesystem in the guest is Win7 NTFS. I have seen
> massive thrashing of the underlying volume during VSS operations in
> the guest, if that signifies.

Does the VM have enough memory? Is this swap activity? You can't have a VM depend on swap.


> 
>> 
>> It might be that your workload is best suited for a preallocated raw file that inherits +C, or even possibly an LV.
> 
> I'm close to that decision. As I mentioned, I much prefer the btrfs
> subvolume story over lvm, so moving to raw is probably more desirable
> than that... however, then I run into my lack of understanding of the
> difference between qcow2 and raw with respect to recoverability, e.g.
> does raw have the same ACID characteristics as a qcow2 image, or is
> atomicity a completely separate concern from the format?

NTFS isn't atomic unless you're using transactional NTFS. If your use case requires transactional NTFS I'd think you need to give it an LV, or even better a physical drive or partition of its own, only because adding layers in between transactional NTFS and the physical media seems to me like asking for increasingly non-deterministic results. For sure it increases the test matrix, and if atomic writes is the goal, you have to be willing to sabotage it when testing to know if you're going to get the outcome you expect. Unless someone else has done this exact same setup…

I can't say whether, or to what degree, the layers make this pathological. But transactional NTFS + libvirt bus + libvirt cache policy + qcow2 + Btrfs just seems like a lot of layers. And then the drive itself has a couple: its own write cache should be disabled using hdparm for your described use case, and then also knowing whether it honors write barrier at all or sufficiently.
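
(Disabling the drive write cache is just, for each member device,
something like:

# hdparm -W 0 /dev/sdX    # turn off the drive's volatile write cache

with sdX being each disk in the array.)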

It sounds like your guest VM might be swapping or indexing, i.e. it's doing random writes, not overwrites, and the guest VM caching policy is causing them to be flushed to disk quickly rather than allowing them to be cached so that the host filesystem can combine those multiple writes into larger sequential writes. That's how you can get many fragments even with xattr +C, and chances are you'd get fragmentation with any file system in that case, if the behavior in the VM is many new random writes rather than overwrites. Transactional NTFS could do that also.

So I'd find out what these writes you're getting are all about, and if you can do something to stop them. If you can't, then you need to look at whether qcow2 on XFS fragments as badly, and if it doesn't then maybe you've found a bug (some interaction between libvirt, qemu and Btrfs?) because I'd expect +C on Btrfs to perform approximately the same as ext4 or XFS in terms of fragmentation.

So what is it about Btrfs subvolumes that you prefer for this use case? There might be other ways to mitigate that feature loss when going to XFS; the raid10 can be done with either md/mdadm or LVM; and there may be a fit for bcache here because you actually would get these random writes committed to stable media much faster in that case, and a lot of work has been done to make this more reliable than battery backed write caches on hardware raid.

Also take my results with a grain of salt because I was using libvirt's unsafe caching policy.


> The ability
> for the owning process to recover from corruption or inconsistency is
> a key factor in deciding whether or not to turn COW off in btrfs - if
> your overlying system is capable of such recovery, like a database
> engine or (presumably) virtualization layer, then COW isn't a
> necessary function from the underlying system.

By using +C you've already turned off COW for that file in Btrfs, and you've turned off checksumming. While you still have Btrfs snapshots, you also still have qcow2 snapshots.

Anyway, I'd make no assumptions about a particular setup actually recovering consistently and lacking corruption without testing it. Test the VM with "virsh destroy", which is an ungraceful shutdown of the VM. Start it back up and see if your database, or whatever, recovers. A more aggressive test is to kill the VM itself with SIGKILL, which is going to clobber anything it's cached that hasn't yet been submitted to host controlled storage. And even more aggressive would be a sysrq+b. And finally the power cable - which you can bypass if you're on a UPS.
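
Concretely, with whatever your domain name actually is, the first test
is just:

# virsh destroy atlas    # ungraceful power-off of the guest; 'atlas' is only a guess at the domain name
# virsh start atlas      # boot it again and see whether everything recovers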


Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-08-14 18:40           ` Chris Murphy
@ 2014-08-14 23:16             ` G. Richard Bellamy
  2014-08-15  1:05               ` Chris Murphy
  0 siblings, 1 reply; 20+ messages in thread
From: G. Richard Bellamy @ 2014-08-14 23:16 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On Thu, Aug 14, 2014 at 11:40 AM, Chris Murphy <lists@colorremedies.com> wrote:
> and there may be a fit for bcache here because you actually would get these random writes committed to stable media much faster in that case, and a lot of work has been done to make this more reliable than battery backed write caches on hardware raid.

umph... heard of bcache, but never looked at it or considered it as an
option in this scenario. After reading the doco and some of the design
documents, it's looking like bcache and md/mdadm or LVM could do the
trick.

The gotchas state clearly that btrfs on top of bcache is not recommended.

However, can bcache be put 'in front' of a btrfs raid10 volume? I
think not, since btrfs volumes are not presented as individual block
devices, instead you've got several block devices (e.g. /dev/sda and
/dev/sdb are in a btrfs raid1, and can be seen individually by the
OS)... however I wish it could, since bcache "...turns random writes
into sequential writes", which solve entirely the problem which
prompts the nocow option in btrfs.

Much to think on here.

-rb

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-08-14 23:16             ` G. Richard Bellamy
@ 2014-08-15  1:05               ` Chris Murphy
  2014-09-02 18:31                 ` G. Richard Bellamy
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2014-08-15  1:05 UTC (permalink / raw)
  To: linux-btrfs


On Aug 14, 2014, at 5:16 PM, G. Richard Bellamy <rbellamy@pteradigm.com> wrote:

> On Thu, Aug 14, 2014 at 11:40 AM, Chris Murphy <lists@colorremedies.com> wrote:
>> and there may be a fit for bcache here because you actually would get these random writes committed to stable media much faster in that case, and a lot of work has been done to make this more reliable than battery backed write caches on hardware raid.
> 
> umph... heard of bcache, but never looked at it or considered it as an
> option in this scenario. After reading the doco and some of the design
> documents, it's looking like bcache and md/mdadm or LVM could do the
> trick.

They are all separate things. I haven't worked with the LVM caching (which uses dm-cache as the backend, similar to how it uses md code on the backend for all of its RAID level support), there could be some advantages there if you have to use LVM anyway, but the design goal of bcache sounds more suited for your workload. And it's got


> 
> The gotchas state clearly that btrfs on top of bcache is not recommended.

Yeah I'm not sure if the suggested changes from 3.12 btrfs + bcache problems went through. Eventually they should work together. But I'd use bcache with XFS or ext4 when not in the mood for bleeding something.

> 
> However, can bcache be put 'in front' of a btrfs raid10 volume?

More correctly you will mkfs.btrfs on the bcache devices, which are logical devices made from one or more backing devices, and a cache device.


# make-bcache -w 2k -b 512k -C /dev/sdc -B /dev/sd[defg]
# lsblk
sdc               8:32   0    8G  0 disk 
├─bcache0       252:0    0    8G  0 disk 
├─bcache1       252:1    0    8G  0 disk 
├─bcache2       252:2    0    8G  0 disk 
└─bcache3       252:3    0    8G  0 disk 
sdd               8:48   0    8G  0 disk 
└─bcache0       252:0    0    8G  0 disk 
sde               8:64   0    8G  0 disk 
└─bcache1       252:1    0    8G  0 disk 
sdf               8:80   0    8G  0 disk 
└─bcache2       252:2    0    8G  0 disk 
sdg               8:96   0    8G  0 disk 
└─bcache3       252:3    0    8G  0 disk 
# mkfs.btrfs -draid10 -mraid10 /dev/bcache[0123]
# mount /dev/bcache0 /mnt
# btrfs fi df /mnt
Data, RAID10: total=2.00GiB, used=27.91MiB
System, RAID10: total=64.00MiB, used=16.00KiB
Metadata, RAID10: total=256.00MiB, used=160.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
[* Yes I cheated and did a balance first so the df output looks cleaner.]


> I
> think not, since btrfs volumes are not presented as individual block
> devices, instead you've got several block devices (e.g. /dev/sda and
> /dev/sdb are in a btrfs raid1, and can be seen individually by the
> OS)... however I wish it could, since bcache "...turns random writes
> into sequential writes", which solve entirely the problem which
> prompts the nocow option in btrfs.

Yeah but you've got something perturbing this in the VM guest, and probably also libvirt caching isn't ideal for that workload either. Now it may be safe, but at the expense of being chatty. I'm not yet convinced you avoid this problem with XFS, in which case you're in a better position to safely use bcache.


Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-08-15  1:05               ` Chris Murphy
@ 2014-09-02 18:31                 ` G. Richard Bellamy
  2014-09-02 19:17                   ` Chris Murphy
  2014-09-02 19:17                   ` Austin S Hemmelgarn
  0 siblings, 2 replies; 20+ messages in thread
From: G. Richard Bellamy @ 2014-09-02 18:31 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

I thought I'd follow-up and give everyone an update, in case anyone
had further interest.

I've rebuilt the RAID10 volume in question with a Samsung 840 Pro for
bcache front device.

It's 5x600GB SAS 15k RPM drives RAID10, with the 512GB SSD bcache.

2014-09-02 11:23:16
root@eanna i /var/lib/libvirt/images # lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda         8:0    0 558.9G  0 disk
└─bcache3 254:3    0 558.9G  0 disk /var/lib/btrfs/data
sdb         8:16   0 558.9G  0 disk
└─bcache2 254:2    0 558.9G  0 disk
sdc         8:32   0 558.9G  0 disk
└─bcache1 254:1    0 558.9G  0 disk
sdd         8:48   0 558.9G  0 disk
└─bcache0 254:0    0 558.9G  0 disk
sde         8:64   0 558.9G  0 disk
└─bcache4 254:4    0 558.9G  0 disk
sdf         8:80   0   1.8T  0 disk
└─sdf1      8:81   0   1.8T  0 part
sdg         8:96   0   477G  0 disk /var/lib/btrfs/system
sdh         8:112  0   477G  0 disk
sdi         8:128  0   477G  0 disk
├─bcache0 254:0    0 558.9G  0 disk
├─bcache1 254:1    0 558.9G  0 disk
├─bcache2 254:2    0 558.9G  0 disk
├─bcache3 254:3    0 558.9G  0 disk /var/lib/btrfs/data
└─bcache4 254:4    0 558.9G  0 disk
sr0        11:0    1  1024M  0 rom

I further split the system and data drives of the VM Win7 guest. It's
very interesting to see the huge level of fragmentation I'm seeing,
even with the help of ordered writes offered by bcache - in other
words while bcache seems to be offering me stability and better
behavior to the guest, the underlying filesystem is still seeing a
level of fragmentation that has me scratching my head.

That being said, I don't know what would be normal fragmentation of a
VM Win7 guest system drive, so could be I'm just operating in my zone
of ignorance again.

2014-09-01 14:41:19
root@eanna i /var/lib/libvirt/images # filefrag atlas-*
atlas-data.qcow2: 7 extents found
atlas-system.qcow2: 154 extents found
2014-09-01 18:12:27
root@eanna i /var/lib/libvirt/images # filefrag atlas-*
atlas-data.qcow2: 564 extents found
atlas-system.qcow2: 28171 extents found
2014-09-02 08:22:00
root@eanna i /var/lib/libvirt/images # filefrag atlas-*
atlas-data.qcow2: 564 extents found
atlas-system.qcow2: 35281 extents found
2014-09-02 08:44:43
root@eanna i /var/lib/libvirt/images # filefrag atlas-*
atlas-data.qcow2: 564 extents found
atlas-system.qcow2: 37203 extents found
2014-09-02 10:14:32
root@eanna i /var/lib/libvirt/images # filefrag atlas-*
atlas-data.qcow2: 564 extents found
atlas-system.qcow2: 40903 extents found

On Thu, Aug 14, 2014 at 6:05 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Aug 14, 2014, at 5:16 PM, G. Richard Bellamy <rbellamy@pteradigm.com> wrote:
>
>> On Thu, Aug 14, 2014 at 11:40 AM, Chris Murphy <lists@colorremedies.com> wrote:
>>> and there may be a fit for bcache here because you actually would get these random writes committed to stable media much faster in that case, and a lot of work has been done to make this more reliable than battery backed write caches on hardware raid.
>>
>> umph... heard of bcache, but never looked at it or considered it as an
>> option in this scenario. After reading the doco and some of the design
>> documents, it's looking like bcache and md/mdadm or LVM could do the
>> trick.
>
> They are all separate things. I haven't worked with the LVM caching (which uses dm-cache as the backend, similar to how it uses md code on the backend for all of its RAID level support), there could be some advantages there if you have to use LVM anyway, but the design goal of bcache sounds more suited for your workload. And it's got
>
>
>>
>> The gotchas state clearly that btrfs on top of bcache is not recommended.
>
> Yeah I'm not sure if the suggested changes from 3.12 btrfs + bcache problems went through. Eventually they should work together. But I'd use bcache with XFS or ext4 when not in the mood for bleeding something.
>
>>
>> However, can bcache be put 'in front' of a btrfs raid10 volume?
>
> More correctly you will mkfs.btrfs on the bcache devices, which are logical devices made from one or more backing devices, and a cache device.
>
>
> # make-bcache -w 2k -b 512k -C /dev/sdc -B /dev/sd[defg]
> # lsblk
> sdc               8:32   0    8G  0 disk
> ├─bcache0       252:0    0    8G  0 disk
> ├─bcache1       252:1    0    8G  0 disk
> ├─bcache2       252:2    0    8G  0 disk
> └─bcache3       252:3    0    8G  0 disk
> sdd               8:48   0    8G  0 disk
> └─bcache0       252:0    0    8G  0 disk
> sde               8:64   0    8G  0 disk
> └─bcache1       252:1    0    8G  0 disk
> sdf               8:80   0    8G  0 disk
> └─bcache2       252:2    0    8G  0 disk
> sdg               8:96   0    8G  0 disk
> └─bcache3       252:3    0    8G  0 disk
> # mkfs.btrfs -draid10 -mraid10 /dev/bcache[0123]
> # mount /dev/bcache0 /mnt
> # btrfs fi df /mnt
> Data, RAID10: total=2.00GiB, used=27.91MiB
> System, RAID10: total=64.00MiB, used=16.00KiB
> Metadata, RAID10: total=256.00MiB, used=160.00KiB
> GlobalReserve, single: total=16.00MiB, used=0.00B
> [* Yes I cheated and did a balance first so the df output looks cleaner.]
>
>
>> I
>> think not, since btrfs volumes are not presented as individual block
>> devices, instead you've got several block devices (e.g. /dev/sda and
>> /dev/sdb are in a btrfs raid1, and can be seen individually by the
>> OS)... however I wish it could, since bcache "...turns random writes
>> into sequential writes", which solve entirely the problem which
>> prompts the nocow option in btrfs.
>
> Yeah but you've got something perturbing this in the VM guest, and probably also libvirt caching isn't ideal for that workload either. Now it may be safe, but at the expense of being chatty. I'm not yet convinced you avoid this problem with XFS, in which case you're in a better position to safely use bcache.
>
>
> Chris Murphy--
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-09-02 18:31                 ` G. Richard Bellamy
@ 2014-09-02 19:17                   ` Chris Murphy
  2014-09-02 19:17                   ` Austin S Hemmelgarn
  1 sibling, 0 replies; 20+ messages in thread
From: Chris Murphy @ 2014-09-02 19:17 UTC (permalink / raw)
  To: linux-btrfs


On Sep 2, 2014, at 12:31 PM, G. Richard Bellamy <rbellamy@pteradigm.com> wrote:

> I thought I'd follow-up and give everyone an update, in case anyone
> had further interest.
> 
> I've rebuilt the RAID10 volume in question with a Samsung 840 Pro for
> bcache front device.
> 
> It's 5x600GB SAS 15k RPM drives RAID10, with the 512GB SSD bcache.
> 
> 2014-09-02 11:23:16
> root@eanna i /var/lib/libvirt/images # lsblk
> NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda         8:0    0 558.9G  0 disk
> └─bcache3 254:3    0 558.9G  0 disk /var/lib/btrfs/data
> sdb         8:16   0 558.9G  0 disk
> └─bcache2 254:2    0 558.9G  0 disk
> sdc         8:32   0 558.9G  0 disk
> └─bcache1 254:1    0 558.9G  0 disk
> sdd         8:48   0 558.9G  0 disk
> └─bcache0 254:0    0 558.9G  0 disk
> sde         8:64   0 558.9G  0 disk
> └─bcache4 254:4    0 558.9G  0 disk
> sdf         8:80   0   1.8T  0 disk
> └─sdf1      8:81   0   1.8T  0 part
> sdg         8:96   0   477G  0 disk /var/lib/btrfs/system
> sdh         8:112  0   477G  0 disk
> sdi         8:128  0   477G  0 disk
> ├─bcache0 254:0    0 558.9G  0 disk
> ├─bcache1 254:1    0 558.9G  0 disk
> ├─bcache2 254:2    0 558.9G  0 disk
> ├─bcache3 254:3    0 558.9G  0 disk /var/lib/btrfs/data
> └─bcache4 254:4    0 558.9G  0 disk
> sr0        11:0    1  1024M  0 rom
> 
> I further split the system and data drives of the VM Win7 guest. It's
> very interesting to see the huge level of fragmentation I'm seeing,
> even with the help of ordered writes offered by bcache - in other
> words while bcache seems to be offering me stability and better
> behavior to the guest, the underlying filesystem is still seeing a
> level of fragmentation that has me scratching my head.
> 
> That being said, I don't know what would be normal fragmentation of a
> VM Win7 guest system drive, so could be I'm just operating in my zone
> of ignorance again.
> 
> 2014-09-01 14:41:19
> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
> atlas-data.qcow2: 7 extents found
> atlas-system.qcow2: 154 extents found
> 2014-09-01 18:12:27
> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
> atlas-data.qcow2: 564 extents found
> atlas-system.qcow2: 28171 extents found
> 2014-09-02 08:22:00
> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
> atlas-data.qcow2: 564 extents found
> atlas-system.qcow2: 35281 extents found
> 2014-09-02 08:44:43
> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
> atlas-data.qcow2: 564 extents found
> atlas-system.qcow2: 37203 extents found
> 2014-09-02 10:14:32
> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
> atlas-data.qcow2: 564 extents found
> atlas-system.qcow2: 40903 extents found

Hmm interesting. What is happening to atlas-data.qcow2 this whole time? It goes from 7 extents to 564 within 3.5 hours and stays there, implying either no writes, or only overwrites are happening, not new writes (writes to previously unwritten LBAs as far as the VM guest is concerned). The file atlas-system.qcow2 meanwhile has a huge spike of fragments in the first 3.5 hours as it's being populated by some activity, and then it looks like it tapers off quite a bit, indicating either fewer writes or more overwrites, but still with quite a bit of new writes.

Most of my experience with qcow2 on btrfs with the +C xattr has been a lot of new writes, and then mostly overwrites. The pattern I see there is a lot of initial fragmentation, and then much less, which makes sense in my case because the bulk of subsequent writes are overwrites. But I also noticed that, despite the fragmentation, it didn't seem to negatively impact performance.


Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-09-02 18:31                 ` G. Richard Bellamy
  2014-09-02 19:17                   ` Chris Murphy
@ 2014-09-02 19:17                   ` Austin S Hemmelgarn
  2014-09-02 23:30                     ` G. Richard Bellamy
  1 sibling, 1 reply; 20+ messages in thread
From: Austin S Hemmelgarn @ 2014-09-02 19:17 UTC (permalink / raw)
  To: G. Richard Bellamy, Chris Murphy; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3295 bytes --]

On 2014-09-02 14:31, G. Richard Bellamy wrote:
> I thought I'd follow-up and give everyone an update, in case anyone
> had further interest.
> 
> I've rebuilt the RAID10 volume in question with a Samsung 840 Pro for
> bcache front device.
> 
> It's 5x600GB SAS 15k RPM drives RAID10, with the 512GB SSD bcache.
> 
> 2014-09-02 11:23:16
> root@eanna i /var/lib/libvirt/images # lsblk
> NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda         8:0    0 558.9G  0 disk
> └─bcache3 254:3    0 558.9G  0 disk /var/lib/btrfs/data
> sdb         8:16   0 558.9G  0 disk
> └─bcache2 254:2    0 558.9G  0 disk
> sdc         8:32   0 558.9G  0 disk
> └─bcache1 254:1    0 558.9G  0 disk
> sdd         8:48   0 558.9G  0 disk
> └─bcache0 254:0    0 558.9G  0 disk
> sde         8:64   0 558.9G  0 disk
> └─bcache4 254:4    0 558.9G  0 disk
> sdf         8:80   0   1.8T  0 disk
> └─sdf1      8:81   0   1.8T  0 part
> sdg         8:96   0   477G  0 disk /var/lib/btrfs/system
> sdh         8:112  0   477G  0 disk
> sdi         8:128  0   477G  0 disk
> ├─bcache0 254:0    0 558.9G  0 disk
> ├─bcache1 254:1    0 558.9G  0 disk
> ├─bcache2 254:2    0 558.9G  0 disk
> ├─bcache3 254:3    0 558.9G  0 disk /var/lib/btrfs/data
> └─bcache4 254:4    0 558.9G  0 disk
> sr0        11:0    1  1024M  0 rom
> 
> I further split the system and data drives of the VM Win7 guest. It's
> very interesting to see the huge level of fragmentation I'm seeing,
> even with the help of ordered writes offered by bcache - in other
> words while bcache seems to be offering me stability and better
> behavior to the guest, the underlying filesystem is still seeing a
> level of fragmentation that has me scratching my head.
> 
> That being said, I don't know what would be normal fragmentation of a
> VM Win7 guest system drive, so could be I'm just operating in my zone
> of ignorance again.
> 
> 2014-09-01 14:41:19
> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
> atlas-data.qcow2: 7 extents found
> atlas-system.qcow2: 154 extents found
> 2014-09-01 18:12:27
> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
> atlas-data.qcow2: 564 extents found
> atlas-system.qcow2: 28171 extents found
> 2014-09-02 08:22:00
> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
> atlas-data.qcow2: 564 extents found
> atlas-system.qcow2: 35281 extents found
> 2014-09-02 08:44:43
> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
> atlas-data.qcow2: 564 extents found
> atlas-system.qcow2: 37203 extents found
> 2014-09-02 10:14:32
> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
> atlas-data.qcow2: 564 extents found
> atlas-system.qcow2: 40903 extents found
> 
This may sound odd, but are you exposing the disk to the Win7 guest as a
non-rotational device? Win7 and higher tend to have different write
behavior when they think they are on an SSD (or something else where
seek latency is effectively 0).  Most VMM's (at least, most that I've
seen) will use fallocate to punch holes for ranges that get TRIM'ed in
the guest, so if windows is sending TRIM commands, that may also be part
of the issue.  Also, you might try reducing the amount of logging in the
guest.
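
If you want to check from inside the guest whether Windows even thinks
it should be sending TRIM, you can run the following in an elevated
command prompt; 0 means delete notifications (TRIM) are enabled, 1
means they are disabled:

C:\> fsutil behavior query DisableDeleteNotify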


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2455 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-09-02 19:17                   ` Austin S Hemmelgarn
@ 2014-09-02 23:30                     ` G. Richard Bellamy
  2014-09-03  6:01                       ` Chris Murphy
  0 siblings, 1 reply; 20+ messages in thread
From: G. Richard Bellamy @ 2014-09-02 23:30 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: Chris Murphy, linux-btrfs

Thanks @chris & @austin. You both bring up interesting questions and points.

@chris: atlas-data.qcow2 isn't running any software or logging at this
time; I isolated my D:\ drive onto that file via clonezilla and
virt-resize.
Microsoft DiskPart version 6.1.7601
Copyright (C) 1999-2008 Microsoft Corporation.
On computer: ATLAS

DISKPART> list disk

  Disk ###  Status         Size     Free     Dyn  Gpt
  --------  -------------  -------  -------  ---  ---
  Disk 0    Online          350 GB      0 B
  Disk 1    Online          300 GB      0 B

DISKPART> list vol

  Volume ###  Ltr  Label        Fs     Type        Size     Status     Info
  ----------  ---  -----------  -----  ----------  -------  ---------  --------
  Volume 0     E                       CD-ROM          0 B  No Media
  Volume 1     F   CDROM        CDFS   DVD-ROM       70 MB  Healthy
  Volume 2         System Rese  NTFS   Partition    100 MB  Healthy    System
  Volume 3     C   System       NTFS   Partition    349 GB  Healthy    Boot
  Volume 4     D   Data         NTFS   Partition    299 GB  Healthy

Volume 2 & 3 == atlas-system.qcow2
Volume 4 == atlas-data.qcow2

...and the current fragmentation:
2014-09-02 16:27:45
root@eanna i /var/lib/libvirt/images # filefrag atlas-*
atlas-data.qcow2: 564 extents found
atlas-system.qcow2: 47412 extents found

@austin, the Windows 7 guest sees both disks as spinning rust.

On Tue, Sep 2, 2014 at 12:17 PM, Austin S Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2014-09-02 14:31, G. Richard Bellamy wrote:
>> I thought I'd follow-up and give everyone an update, in case anyone
>> had further interest.
>>
>> I've rebuilt the RAID10 volume in question with a Samsung 840 Pro for
>> bcache front device.
>>
>> It's 5x600GB SAS 15k RPM drives RAID10, with the 512GB SSD bcache.
>>
>> 2014-09-02 11:23:16
>> root@eanna i /var/lib/libvirt/images # lsblk
>> NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
>> sda         8:0    0 558.9G  0 disk
>> └─bcache3 254:3    0 558.9G  0 disk /var/lib/btrfs/data
>> sdb         8:16   0 558.9G  0 disk
>> └─bcache2 254:2    0 558.9G  0 disk
>> sdc         8:32   0 558.9G  0 disk
>> └─bcache1 254:1    0 558.9G  0 disk
>> sdd         8:48   0 558.9G  0 disk
>> └─bcache0 254:0    0 558.9G  0 disk
>> sde         8:64   0 558.9G  0 disk
>> └─bcache4 254:4    0 558.9G  0 disk
>> sdf         8:80   0   1.8T  0 disk
>> └─sdf1      8:81   0   1.8T  0 part
>> sdg         8:96   0   477G  0 disk /var/lib/btrfs/system
>> sdh         8:112  0   477G  0 disk
>> sdi         8:128  0   477G  0 disk
>> ├─bcache0 254:0    0 558.9G  0 disk
>> ├─bcache1 254:1    0 558.9G  0 disk
>> ├─bcache2 254:2    0 558.9G  0 disk
>> ├─bcache3 254:3    0 558.9G  0 disk /var/lib/btrfs/data
>> └─bcache4 254:4    0 558.9G  0 disk
>> sr0        11:0    1  1024M  0 rom
>>
>> I further split the system and data drives of the VM Win7 guest. It's
>> very interesting to see the huge level of fragmentation I'm seeing,
>> even with the help of ordered writes offered by bcache - in other
>> words while bcache seems to be offering me stability and better
>> behavior to the guest, the underlying filesystem is still seeing a
>> level of fragmentation that has me scratching my head.
>>
>> That being said, I don't know what would be normal fragmentation of a
>> VM Win7 guest system drive, so could be I'm just operating in my zone
>> of ignorance again.
>>
>> 2014-09-01 14:41:19
>> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
>> atlas-data.qcow2: 7 extents found
>> atlas-system.qcow2: 154 extents found
>> 2014-09-01 18:12:27
>> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
>> atlas-data.qcow2: 564 extents found
>> atlas-system.qcow2: 28171 extents found
>> 2014-09-02 08:22:00
>> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
>> atlas-data.qcow2: 564 extents found
>> atlas-system.qcow2: 35281 extents found
>> 2014-09-02 08:44:43
>> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
>> atlas-data.qcow2: 564 extents found
>> atlas-system.qcow2: 37203 extents found
>> 2014-09-02 10:14:32
>> root@eanna i /var/lib/libvirt/images # filefrag atlas-*
>> atlas-data.qcow2: 564 extents found
>> atlas-system.qcow2: 40903 extents found
>>
> This may sound odd, but are you exposing the disk to the Win7 guest as a
> non-rotational device? Win7 and higher tend to have different write
> behavior when they think they are on an SSD (or something else where
> seek latency is effectively 0).  Most VMMs (at least, most that I've
> seen) will use fallocate to punch holes for ranges that get TRIM'ed in
> the guest, so if Windows is sending TRIM commands, that may also be part
> of the issue.  Also, you might try reducing the amount of logging in the
> guest.
>
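
(For anyone checking the TRIM angle Austin raises: whether guest TRIMs actually turn into hole-punching in the qcow2 depends on the disk's discard setting in the domain definition. A quick way to look, assuming a libvirt-managed domain named 'atlas' - the name is only a guess based on the image file:

$ virsh dumpxml atlas | grep -E "<driver|<target|discard"

If the <driver> element carries discard='unmap', TRIMs from the guest are passed down and qemu punches holes in the image; discard='ignore', or leaving the attribute out, avoids that.)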

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-09-02 23:30                     ` G. Richard Bellamy
@ 2014-09-03  6:01                       ` Chris Murphy
  2014-09-03  6:26                         ` Chris Murphy
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2014-09-03  6:01 UTC (permalink / raw)
  To: G. Richard Bellamy; +Cc: Austin S Hemmelgarn, linux-btrfs

I created two pools, one xfs and one btrfs, with default formatting and mount options. I then created a qcow2 file on each using virt-manager, also with default options and default caching (whatever that is - I think it's writethrough, but don't hold me to it).

I then installed Windows 7 (not concurrently) onto each qcow2 file. Immediately after install, at the BIOS screen of the first reboot following installation, the fragment counts were:

btrfs:	1576
xfs:	665

After letting each of them proceed through its subsequent reboots, then sit for ~1 hour while mscorsvw.exe and svchost.exe did a bunch of stuff busying up the drive and chewing CPU, I got the following counts once they went idle:

btrfs	8042
xfs	8369


As a note, there was about 20-30 minutes of idle time during which the xfs extent count sat at 2564; within 5 minutes of svchost.exe going all chaotic the fragment count was 8369, and it then stayed there for another 10 or so minutes, and even through a couple of reboots. So there are Windows 7 processes that can cause a lot of fragmentation of qcow2 files in a very short period of time. I have no idea what they're doing, but a lot of this is probably just the nature of qcow2 files.
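
(The counts above were taken by hand; a trivial host-side loop does the same thing if anyone wants to repeat it - the path is a placeholder:

$ while true; do date; filefrag /path/to/win7.qcow2; sleep 300; done

That makes it easy to correlate jumps in extent count with guest activity.)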

Also, 'qemu-img create' has a relatively new option, 'nocow'. With -o nocow=on, qemu-img sets the No_COW attribute (chattr +C) on the qcow2 file at the time of its creation. This is a btrfs-only option, and it is described in the man page.
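
(A minimal sketch of that option, for reference - the path and size here are made up:

$ qemu-img create -f qcow2 -o nocow=on /var/lib/libvirt/images/win7.qcow2 40G
$ lsattr /var/lib/libvirt/images/win7.qcow2    # should now show the 'C' flag

Setting the attribute at creation time matters because No_COW only takes effect reliably on an empty file.)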

Does anyone know if filefrag's physical_offset value has any correlation to disk sectors at all? Or is it just another kind of logical value? I ask because the vast majority of the extents have sequential ordering based on the reported physical_offset. It's not like the extents are being created haphazardly all over the drive. So I'm not convinced that this kind of fragmentation is really a big problem.
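
(For anyone who wants to look at the same thing: filefrag -v prints one line per extent with the logical and physical offsets it gets back from the FIEMAP ioctl, e.g.

$ filefrag -v /path/to/win7.qcow2 | head -n 20

with the path again a placeholder. Whether those "physical" offsets map to device sectors on a multi-device btrfs is exactly the open question above.)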


Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-09-03  6:01                       ` Chris Murphy
@ 2014-09-03  6:26                         ` Chris Murphy
  2014-09-03 15:45                           ` G. Richard Bellamy
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2014-09-03  6:26 UTC (permalink / raw)
  To: linux-btrfs


On Sep 3, 2014, at 12:01 AM, Chris Murphy <lists@colorremedies.com> wrote:

> I created two pools, one xfs one btrfs, default formatting and mount options. I then created a qcow2 file on each using virt-manager, also using default options. And default caching (whatever that is, I think it's writethrough but don't hold me to it).

On the btrfs qcow2, the C (No_COW) attribute was set.


Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-09-03  6:26                         ` Chris Murphy
@ 2014-09-03 15:45                           ` G. Richard Bellamy
  2014-09-03 18:53                             ` Clemens Eisserer
  0 siblings, 1 reply; 20+ messages in thread
From: G. Richard Bellamy @ 2014-09-03 15:45 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

It is interesting that, for me, the number of extents before and after
adding bcache is essentially the same.

The lesson here for me is that the fragmentation of a btrfs nodatacow
file is not mitigated by bcache. There seems to be nothing I can do to
prevent that fragmentation, and it may in fact be expected behavior.

I cannot prove that adding the SSD bcache front-end improved
performance of the guest VM, though subjectively it seems to have had
a positive effect.
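
(One rough way to put a host-side number on it would be a read-only random-read run against the image file, compared with the cache attached vs. detached. This is only a sketch; the path is assumed from earlier messages and the fio parameters are arbitrary:

$ fio --name=randread --filename=/var/lib/libvirt/images/atlas-system.qcow2 \
      --rw=randread --bs=4k --direct=1 --ioengine=libaio --iodepth=16 \
      --size=1G --runtime=60 --time_based

Reads are safe against the qcow2 file, but it's still best run with the guest shut down so guest I/O doesn't skew the numbers.)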

There is something systemically pathological with the VM in question,
but that's a different mailing list. :)

-rb



On Tue, Sep 2, 2014 at 11:26 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Sep 3, 2014, at 12:01 AM, Chris Murphy <lists@colorremedies.com> wrote:
>
>> I created two pools, one xfs and one btrfs, with default formatting and mount options. I then created a qcow2 file on each using virt-manager, also with default options and default caching (whatever that is - I think it's writethrough, but don't hold me to it).
>
> On the btrfs qcow2, the C (No_COW) attribute was set.
>
>
> Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Large files, nodatacow and fragmentation
  2014-09-03 15:45                           ` G. Richard Bellamy
@ 2014-09-03 18:53                             ` Clemens Eisserer
  0 siblings, 0 replies; 20+ messages in thread
From: Clemens Eisserer @ 2014-09-03 18:53 UTC (permalink / raw)
  To: G. Richard Bellamy, linux-btrfs

Hi Richard,

> It is interesting that, for me, the number of extents before and after
> adding bcache is essentially the same.
>
> The lesson here for me is that the fragmentation of a btrfs nodatacow
> file is not mitigated by bcache. There seems to be nothing I can do to
> prevent that fragmentation, and it may in fact be expected behavior.

This is to be expected - bcache behaves like a single, transparent
block device, so as far as btrfs is concerned it makes no difference
whether it sits on a "real" device or a bcache one.
A performance increase is expected, however ;)
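
(For completeness, the cache mode and state of that transparent device can be read from sysfs on the host - bcache0 is assumed to be the relevant backing device here:

$ cat /sys/block/bcache0/bcache/cache_mode
$ cat /sys/block/bcache0/bcache/state

Writethrough vs. writeback changes when writes reach the SSD, but either way btrfs still sees one ordinary block device, which is why the extent counts don't change.)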

Best regards, Clemens

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2014-09-03 18:53 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-11 18:36 Large files, nodatacow and fragmentation G. Richard Bellamy
2014-08-11 19:14 ` Roman Mamedov
2014-08-11 21:37   ` G. Richard Bellamy
2014-08-11 23:31   ` Chris Murphy
2014-08-14  3:57     ` G. Richard Bellamy
2014-08-14  4:23       ` Chris Murphy
2014-08-14 14:30         ` G. Richard Bellamy
2014-08-14 15:05           ` Austin S Hemmelgarn
2014-08-14 18:15             ` G. Richard Bellamy
2014-08-14 18:40           ` Chris Murphy
2014-08-14 23:16             ` G. Richard Bellamy
2014-08-15  1:05               ` Chris Murphy
2014-09-02 18:31                 ` G. Richard Bellamy
2014-09-02 19:17                   ` Chris Murphy
2014-09-02 19:17                   ` Austin S Hemmelgarn
2014-09-02 23:30                     ` G. Richard Bellamy
2014-09-03  6:01                       ` Chris Murphy
2014-09-03  6:26                         ` Chris Murphy
2014-09-03 15:45                           ` G. Richard Bellamy
2014-09-03 18:53                             ` Clemens Eisserer
