* qemu-kvm VM died during partial raid1 problems of btrfs
@ 2017-09-12  8:02 Marat Khalili
  2017-09-12  8:25 ` Timofey Titovets
  2017-09-13 13:23 ` Chris Murphy
  0 siblings, 2 replies; 32+ messages in thread
From: Marat Khalili @ 2017-09-12  8:02 UTC (permalink / raw)
  To: Linux fs Btrfs

Thanks to the help from the list I've successfully replaced part of a 
btrfs raid1 filesystem. However, while I waited for the best opinions on the 
course of action, the root filesystem of one of the qemu-kvm VMs went 
read-only, and this root was of course backed by a qcow2 file on the 
problematic btrfs (the root filesystem inside the VM itself is ext4, not 
btrfs). It may well be a coincidence, or something induced by 
heavier-than-usual IO load, but it is hard for me to ignore the 
possibility that the hardware error was somehow propagated to the VM. Is 
that possible?

No other processes on the machine developed any problems, but:
(1) it is quite possible that the problematic sector belonged to this 
qcow2 file;
(2) it is a kernel-based VM after all, and it might bypass the normal IO 
paths of userspace processes;
(3) it is possible that qemu uses O_DIRECT or something similar, and btrfs 
raid1 does not fully protect that kind of access.
Does this make any sense?

I could not log in to the VM normally to see the logs, and made the big 
mistake of rebooting it. Now all I see in its logs is a big hole, since, 
well, it went read-only :( I'll try to find out whether (1) above is true 
after I finish migrating data off the HDD and remove it. I wonder where 
else I can look?

--

With Best Regards,
Marat Khalili


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12  8:02 qemu-kvm VM died during partial raid1 problems of btrfs Marat Khalili
@ 2017-09-12  8:25 ` Timofey Titovets
  2017-09-12  8:42   ` Marat Khalili
  2017-09-13 13:23 ` Chris Murphy
  1 sibling, 1 reply; 32+ messages in thread
From: Timofey Titovets @ 2017-09-12  8:25 UTC (permalink / raw)
  To: Marat Khalili; +Cc: Linux fs Btrfs

2017-09-12 11:02 GMT+03:00 Marat Khalili <mkh@rqc.ru>:
> Thanks to the help from the list I've successfully replaced part of btrfs
> raid1 filesystem. However, while I waited for best opinions on the course of
> actions, the root filesystem of one the qemu-kvm VMs went read-only, and
> this root was of course based in a qcow2 file on the problematic btrfs (the
> root filesystem of the VM itself is ext4, not btrfs). It is very well
> possible that it is a coincidence or something inducted by heavier than
> usual IO load, but it is hard for me to ignore the possibility that somehow
> the hardware error was propagated to VM. Is it possible?
>
> No other processes on the machine developed any problems, but:
> (1) it is very well possible that problematic sector belonged to this qcow2
> file;
> (2) it is a Kernel VM after all, and it might bypass normal IO paths of
> userspace processes;
> (3) it is possible that it uses O_DIRECT or something, and btrfs raid1 does
> not fully protect this kind of access.
> Does this make any sense?
>
> I could not login to the VM normally to see logs, and made big mistake of
> rebooting it. Now all I see in its logs is big hole, since, well, it went
> read-only :( I'll try to find out if (1) above is true after I finish
> migrating data from HDD and remove the it. I wonder where else can I look?
>
> --
>
> With Best Regards,
> Marat Khalili

AFAIK, if btrfs hits a read error while reading from RAID1, the application
will also see that error, and if the application can't handle it -> you've
got a problem.

So btrfs RAID1 ONLY protects the data, not the application (qemu in your case).

-- 
Have a nice day,
Timofey.


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12  8:25 ` Timofey Titovets
@ 2017-09-12  8:42   ` Marat Khalili
  2017-09-12  9:21     ` Timofey Titovets
  2017-09-12 10:01     ` Duncan
  0 siblings, 2 replies; 32+ messages in thread
From: Marat Khalili @ 2017-09-12  8:42 UTC (permalink / raw)
  To: Timofey Titovets; +Cc: Linux fs Btrfs

On 12/09/17 11:25, Timofey Titovets wrote:
> AFAIK, if while read BTRFS get Read Error in RAID1, application will
> also see that error and if application can't handle it -> you got a
> problems
>
> So Btrfs RAID1 ONLY protect data, not application (qemu in your case).
That's news to me! Why doesn't it try the other copy, and when does it 
correct the error then? Any idea how to work around it, at least for 
qemu? (Assemble the array from within the VM?)

--

With Best Regards,
Marat Khalili


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12  8:42   ` Marat Khalili
@ 2017-09-12  9:21     ` Timofey Titovets
  2017-09-12  9:29       ` Marat Khalili
  2017-09-12 10:01     ` Duncan
  1 sibling, 1 reply; 32+ messages in thread
From: Timofey Titovets @ 2017-09-12  9:21 UTC (permalink / raw)
  To: Marat Khalili; +Cc: Linux fs Btrfs

2017-09-12 11:42 GMT+03:00 Marat Khalili <mkh@rqc.ru>:
> On 12/09/17 11:25, Timofey Titovets wrote:
>>
>> AFAIK, if while read BTRFS get Read Error in RAID1, application will
>> also see that error and if application can't handle it -> you got a
>> problems
>>
>> So Btrfs RAID1 ONLY protect data, not application (qemu in your case).
>
> That's news to me! Why doesn't it try another copy and when does it correct
> the error then? Any idea on how to work it around at least for qemu?
> (Assemble the array from within the VM?)
>
>
> --
>
> With Best Regards,
> Marat Khalili

Can't reproduce that on the latest kernel, 4.13.1.
To reproduce, I use 2 USB flash drives in btrfs raid1 plus a fio run to generate load.
During the test I pull one flash drive; some time ago (a year?) this
produced a fio error, but now the test continues without problems.
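
Roughly the setup, for anyone who wants to repeat it (device names and fio
parameters are just an example):

mkfs.btrfs -f -d raid1 -m raid1 /dev/sdX /dev/sdY
mount /dev/sdX /mnt/test
fio --name=stress --directory=/mnt/test --rw=randrw --bs=4k --size=2G \
    --numjobs=4 --time_based --runtime=600
# while fio is running, pull one of the two USB drives and watch fio and dmesg for I/O errors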

-- 
Have a nice day,
Timofey.


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12  9:21     ` Timofey Titovets
@ 2017-09-12  9:29       ` Marat Khalili
  2017-09-12  9:35         ` Timofey Titovets
  0 siblings, 1 reply; 32+ messages in thread
From: Marat Khalili @ 2017-09-12  9:29 UTC (permalink / raw)
  To: Timofey Titovets; +Cc: Linux fs Btrfs

On 12/09/17 12:21, Timofey Titovets wrote:
> Can't reproduce that on latest kernel: 4.13.1
Great! Thank you very much for the test. Do you know if it's fixed in 
4.10? (Or which particular version fixes it?)

--

With Best Regards,
Marat Khalili



* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12  9:29       ` Marat Khalili
@ 2017-09-12  9:35         ` Timofey Titovets
  0 siblings, 0 replies; 32+ messages in thread
From: Timofey Titovets @ 2017-09-12  9:35 UTC (permalink / raw)
  To: Marat Khalili; +Cc: Linux fs Btrfs

2017-09-12 12:29 GMT+03:00 Marat Khalili <mkh@rqc.ru>:
> On 12/09/17 12:21, Timofey Titovets wrote:
>>
>> Can't reproduce that on latest kernel: 4.13.1
>
> Great! Thank you very much for the test. Do you know if it's fixed in 4.10?
> (or what particular version does?)
> --
>
> With Best Regards,
> Marat Khalili
>

Nope, I have been reading all the list messages for at least 3 years and I
can't remember any merged patches that would fix that; maybe it is related
to the latest BIO API rework and changes, but I'm unsure.
Sorry.

-- 
Have a nice day,
Timofey.


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12  8:42   ` Marat Khalili
  2017-09-12  9:21     ` Timofey Titovets
@ 2017-09-12 10:01     ` Duncan
  2017-09-12 10:32       ` Adam Borowski
  1 sibling, 1 reply; 32+ messages in thread
From: Duncan @ 2017-09-12 10:01 UTC (permalink / raw)
  To: linux-btrfs

Marat Khalili posted on Tue, 12 Sep 2017 11:42:52 +0300 as excerpted:

> On 12/09/17 11:25, Timofey Titovets wrote:
>> AFAIK, if while read BTRFS get Read Error in RAID1, application will
>> also see that error and if application can't handle it -> you got a
>> problems
>>
>> So Btrfs RAID1 ONLY protect data, not application (qemu in your case).

> That's news to me! Why doesn't it try another copy and when does it
> correct the error then?

AFAIK that's wrong -- the only time the app should see the error on btrfs 
raid1 is if the second copy is also bad (and if it's good, the bad copy 
is automatically rewritten... elsewhere of course, due to cow)... or if 
the problem with btrfs is bad enough it sends the entire filesystem read-
only, which I don't believe happened in your case (it was the ext4 on the 
VM that went ro).

So you should be able to rest easy on that, at least. =:^)
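
If you want to double-check what btrfs itself saw on the host, something 
like this (the mountpoint is just an example) should show whether it hit 
and silently corrected anything:

btrfs device stats /mnt/pool      # per-device read/write/corruption/generation error counters
btrfs scrub start -Bd /mnt/pool   # read everything, verify checksums, repair from the good raid1 copy
dmesg | grep -i btrfs             # look for "read error corrected" style messages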

> Any idea on how to work it around at least for
> qemu? (Assemble the array from within the VM?)

BTW, I am most definitely /not/ a VM expert, and won't pretend to 
understand the details or be able to explain further, but IIRC from what 
I've read on-list, qcow2 isn't the best alternative for hosting VMs on 
top of btrfs.  Something about it being cow-based as well, which means cow
(qcow2)-on-cow(btrfs), which tends to lead to /extreme/ fragmentation, 
leading to low performance.  I'd guess that due to the additional stress, 
it may also trigger race conditions and/or deadlocks that wouldn't 
ordinarily trigger.

I don't know enough about it to know what the alternatives to qcow2 are, 
but something that is not itself cow, when it sits on cow-based btrfs, would 
presumably be a better alternative.

Sorry I can't do better on that, but this should at least give you enough 
information to look for more, if no one reposts the details here.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 10:01     ` Duncan
@ 2017-09-12 10:32       ` Adam Borowski
  2017-09-12 10:39         ` Marat Khalili
  2017-09-12 11:09         ` Roman Mamedov
  0 siblings, 2 replies; 32+ messages in thread
From: Adam Borowski @ 2017-09-12 10:32 UTC (permalink / raw)
  To: linux-btrfs

On Tue, Sep 12, 2017 at 10:01:07AM +0000, Duncan wrote:
> BTW, I am most definitely /not/ a VM expert, and won't pretend to 
> understand the details or be able to explain further, but IIRC from what 
> I've read on-list, qcow2 isn't the best alternative for hosting VMs on 
> top of btrfs.  Something about it being cow-based as well, which means cow
> (qcow2)-on-cow(btrfs), which tends to lead to /extreme/ fragmentation, 
> leading to low performance.
> 
> I don't know enough about it to know what the alternatives to qcow2 are, 
> but something that not itself cow when it's on cow-based btrfs, would 
> presumably be a better alternative.

Just use raw -- btrfs already has every feature that qcow2 has, and does it
better.  This doesn't mean btrfs is the best choice for hosting VM files,
just that raw-over-btrfs is strictly better than qcow2-over-btrfs.

And like qcow2, with raw over btrfs you have the choice between a fully
pre-written nocow file and a sparse file.  For the latter, you want discard
in the guest (not supported over ide and virtio, supported over scsi and
virtio-scsi), and you get the full list of btrfs goodies like snapshots or
dedup.
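
For illustration only (paths, sizes and the exact qemu flags here are made
up, adjust to taste), the two variants look roughly like this:

# variant 1: preallocated nocow file (chattr +C must be applied while the file is still empty)
touch /mnt/btr1/qemu/guest1.raw
chattr +C /mnt/btr1/qemu/guest1.raw
fallocate -l 40G /mnt/btr1/qemu/guest1.raw

# variant 2: sparse cow file, with discard passed through via virtio-scsi
truncate -s 40G /mnt/btr1/qemu/guest2.raw
qemu-system-x86_64 -enable-kvm -m 4096 \
  -device virtio-scsi-pci \
  -drive file=/mnt/btr1/qemu/guest2.raw,format=raw,if=none,id=hd0,discard=unmap \
  -device scsi-hd,drive=hd0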


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄⠀⠀⠀⠀ (funeral doom metal).


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 10:32       ` Adam Borowski
@ 2017-09-12 10:39         ` Marat Khalili
  2017-09-12 11:01           ` Timofey Titovets
  2017-09-12 11:09         ` Roman Mamedov
  1 sibling, 1 reply; 32+ messages in thread
From: Marat Khalili @ 2017-09-12 10:39 UTC (permalink / raw)
  To: Adam Borowski, Duncan; +Cc: linux-btrfs

On 12/09/17 13:01, Duncan wrote:
> AFAIK that's wrong -- the only time the app should see the error on btrfs
> raid1 is if the second copy is also bad
So thought I, but...

> IIRC from what I've read on-list, qcow2 isn't the best alternative for hosting VMs on
> top of btrfs.
Yeah, I've seen discussions about it here too, but in my case VMs write 
very little (mostly logs and distro updates), so I decided it can live 
as it is for a while. But I'm looking for better solutions as long as 
they are not too complicated.


On 12/09/17 13:32, Adam Borowski wrote:
> Just use raw -- btrfs already has every feature that qcow2 has, and does it
> better.  This doesn't mean btrfs is the best choice for hosting VM files,
> just that raw-over-btrfs is strictly better than qcow2-over-btrfs.
Thanks for the advice. I wasn't sure I wouldn't lose features, and was too 
lazy to investigate/ask. Now it looks simple.

--

With Best Regards,
Marat Khalili


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 10:39         ` Marat Khalili
@ 2017-09-12 11:01           ` Timofey Titovets
  2017-09-12 11:12             ` Adam Borowski
  0 siblings, 1 reply; 32+ messages in thread
From: Timofey Titovets @ 2017-09-12 11:01 UTC (permalink / raw)
  To: Marat Khalili; +Cc: Adam Borowski, Duncan, linux-btrfs

2017-09-12 13:39 GMT+03:00 Marat Khalili <mkh@rqc.ru>:
> On 12/09/17 13:01, Duncan wrote:
>>
>> AFAIK that's wrong -- the only time the app should see the error on btrfs
>> raid1 is if the second copy is also bad
>
> So thought I, but...
>
>> IIRC from what I've read on-list, qcow2 isn't the best alternative for
>> hosting VMs on
>> top of btrfs.
>
> Yeah, I've seen discussions about it here too, but in my case VMs write very
> little (mostly logs and distro updates), so I decided it can live as it is
> for a while. But I'm looking for better solutions as long as they are not
> too complicated.
>
>
> On 12/09/17 13:32, Adam Borowski wrote:
>>
>> Just use raw -- btrfs already has every feature that qcow2 has, and does
>> it
>> better.  This doesn't mean btrfs is the best choice for hosting VM files,
>> just that raw-over-btrfs is strictly better than qcow2-over-btrfs.
>
> Thanks for advice, I wasn't sure I won't lose features, and was too lazy to
> investigate/ask. Now it looks simple.
>
> --
>
> With Best Regards,
> Marat Khalili

The main problem with raw over btrfs is that (IIRC) nothing supports the
btrfs features:

 - Patches for libvirt are not merged and are obsolete
 - Patches for Proxmox are also not merged
 - Other hypervisors like VirtualBox and VMware just ignore btrfs features.

So with raw you will have problems like no snapshot support.

But yes, raw over btrfs is the best solution performance-wise.

-- 
Have a nice day,
Timofey.


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 10:32       ` Adam Borowski
  2017-09-12 10:39         ` Marat Khalili
@ 2017-09-12 11:09         ` Roman Mamedov
  1 sibling, 0 replies; 32+ messages in thread
From: Roman Mamedov @ 2017-09-12 11:09 UTC (permalink / raw)
  To: Adam Borowski; +Cc: linux-btrfs

On Tue, 12 Sep 2017 12:32:14 +0200
Adam Borowski <kilobyte@angband.pl> wrote:

> discard in the guest (not supported over ide and virtio, supported over scsi
> and virtio-scsi)

IDE does support discard in QEMU; I use that all the time.

It got broken briefly in QEMU 2.1 [1], but was then fixed again.

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=757927


-- 
With respect,
Roman


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 11:01           ` Timofey Titovets
@ 2017-09-12 11:12             ` Adam Borowski
  2017-09-12 11:17               ` Timofey Titovets
  2017-09-12 11:26               ` Marat Khalili
  0 siblings, 2 replies; 32+ messages in thread
From: Adam Borowski @ 2017-09-12 11:12 UTC (permalink / raw)
  To: Marat Khalili, Duncan, linux-btrfs

On Tue, Sep 12, 2017 at 02:01:53PM +0300, Timofey Titovets wrote:
> > On 12/09/17 13:32, Adam Borowski wrote:
> >> Just use raw -- btrfs already has every feature that qcow2 has, and
> >> does it better.  This doesn't mean btrfs is the best choice for hosting
> >> VM files, just that raw-over-btrfs is strictly better than
> >> qcow2-over-btrfs.
> >
> > Thanks for advice, I wasn't sure I won't lose features, and was too lazy to
> > investigate/ask. Now it looks simple.
> 
> The main problem with Raw over Btrfs is that (IIRC) no one support
> btrfs features.
> 
>  - Patches for libvirt not merged and obsolete
>  - Patches for Proxmox also not merged
>  - Other VM hypervisor like Virtualbox, VMware just ignore btrfs features.
> 
> So with raw you will have a problems like: no snapshot support

Why would you need support in the hypervisor if cp --reflink=always is
enough?  Likewise, I wouldn't expect hypervisors to implement support for
every dedup tool -- it'd be a layering violation[1].  It's not emacs or
systemd, you really can use an external tool instead of adding a lawnmower
to the kitchen sink.
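
E.g., with made-up file names:

cp --reflink=always guest.raw guest.raw.before-update   # instant snapshot, shares all extents
# ...run the VM, decide you don't like the result...
mv guest.raw.before-update guest.raw                    # roll back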


Meow!

[1] Yeah, talking about layering violations in btrfs context is a bit weird,
but it's better to at least try.
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄⠀⠀⠀⠀ (funeral doom metal).


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 11:12             ` Adam Borowski
@ 2017-09-12 11:17               ` Timofey Titovets
  2017-09-12 11:26               ` Marat Khalili
  1 sibling, 0 replies; 32+ messages in thread
From: Timofey Titovets @ 2017-09-12 11:17 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Marat Khalili, Duncan, linux-btrfs

2017-09-12 14:12 GMT+03:00 Adam Borowski <kilobyte@angband.pl>:
> On Tue, Sep 12, 2017 at 02:01:53PM +0300, Timofey Titovets wrote:
>> > On 12/09/17 13:32, Adam Borowski wrote:
>> >> Just use raw -- btrfs already has every feature that qcow2 has, and
>> >> does it better.  This doesn't mean btrfs is the best choice for hosting
>> >> VM files, just that raw-over-btrfs is strictly better than
>> >> qcow2-over-btrfs.
>> >
>> > Thanks for advice, I wasn't sure I won't lose features, and was too lazy to
>> > investigate/ask. Now it looks simple.
>>
>> The main problem with Raw over Btrfs is that (IIRC) no one support
>> btrfs features.
>>
>>  - Patches for libvirt not merged and obsolete
>>  - Patches for Proxmox also not merged
>>  - Other VM hypervisor like Virtualbox, VMware just ignore btrfs features.
>>
>> So with raw you will have a problems like: no snapshot support
>
> Why would you need support in the hypervisor if cp --reflink=always is
> enough?  Likewise, I wouldn't expect hypervisors to implement support for
> every dedup tool -- it'd be a layering violation[1].  It's not emacs or
> systemd, you really can use an external tool instead of adding a lawnmower
> to the kitchen sink.
>
>
> Meow!
>
> [1] Yeah, talking about layering violations in btrfs context is a bit weird,
> but it's better to at least try.

In that case, why do hypervisors add support for LVM snapshots, ZFS, RBD
snapshots, etc.?

The user can do that by hand, so it's useless, no? (rhetorical question)

This is not about layering violation but about teaming and
integration between tools.

-- 
Have a nice day,
Timofey.


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 11:12             ` Adam Borowski
  2017-09-12 11:17               ` Timofey Titovets
@ 2017-09-12 11:26               ` Marat Khalili
  2017-09-12 17:21                 ` Adam Borowski
  1 sibling, 1 reply; 32+ messages in thread
From: Marat Khalili @ 2017-09-12 11:26 UTC (permalink / raw)
  To: Adam Borowski, Duncan; +Cc: linux-btrfs

On 12/09/17 14:12, Adam Borowski wrote:
> Why would you need support in the hypervisor if cp --reflink=always is
> enough?
+1 :)

But I've already found one problem: I use rsync snapshots for backups, 
and although rsync does have a --sparse argument, apparently it conflicts 
with --inplace. You cannot have all the nice things :(

I think I'll simply try to minimize the size of the VM root partitions and 
won't think too much about a gig or two of extra zeroes in the backup, at 
least until some auto-punch-holes mount option arrives.
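
(I suppose the obvious trade-off, with made-up paths, would be to drop 
--inplace so that --sparse can work, roughly:

rsync -axX --delete --sparse --numeric-ids /var/lib/vms/ backuphost:/backup/vms/

but then every changed image gets rewritten in full on the destination.)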

--

With Best Regards,
Marat Khalili


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 11:26               ` Marat Khalili
@ 2017-09-12 17:21                 ` Adam Borowski
  2017-09-12 17:36                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 32+ messages in thread
From: Adam Borowski @ 2017-09-12 17:21 UTC (permalink / raw)
  To: Marat Khalili; +Cc: Duncan, linux-btrfs

On Tue, Sep 12, 2017 at 02:26:39PM +0300, Marat Khalili wrote:
> On 12/09/17 14:12, Adam Borowski wrote:
> > Why would you need support in the hypervisor if cp --reflink=always is
> > enough?
> +1 :)
> 
> But I've already found one problem: I use rsync snapshots for backups, and
> although rsync does have --sparse argument, apparently it conflicts with
> --inplace. You cannot have all nice things :(

There's fallocate -d, but that for some reason touches mtime which makes
rsync go again.  This can be handled manually but is still not nice.
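
The manual handling looks roughly like this (file name made up; still racy
if the image happens to be in use):

touch -r guest.raw /tmp/guest.stamp    # remember the old mtime
fallocate -d guest.raw                 # punch holes where blocks are already zero
touch -r /tmp/guest.stamp guest.raw    # put the old mtime back so rsync stays quiet
rm /tmp/guest.stamp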


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄⠀⠀⠀⠀ (funeral doom metal).


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 17:21                 ` Adam Borowski
@ 2017-09-12 17:36                   ` Austin S. Hemmelgarn
  2017-09-12 18:43                     ` Adam Borowski
  0 siblings, 1 reply; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-12 17:36 UTC (permalink / raw)
  To: Adam Borowski, Marat Khalili; +Cc: Duncan, linux-btrfs

On 2017-09-12 13:21, Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 02:26:39PM +0300, Marat Khalili wrote:
>> On 12/09/17 14:12, Adam Borowski wrote:
>>> Why would you need support in the hypervisor if cp --reflink=always is
>>> enough?
>> +1 :)
>>
>> But I've already found one problem: I use rsync snapshots for backups, and
>> although rsync does have --sparse argument, apparently it conflicts with
>> --inplace. You cannot have all nice things :(
(Replying here to the above, as I can't seem to find the original in my 
e-mail client to reply to)

--inplace and --sparse are inherently at odds with each other.  The only 
way that they could work together is if rsync were taught about hole 
punching (fallocate with FALLOC_FL_PUNCH_HOLE), and that isn't likely to 
ever happen because it's Linux specific (at least, it's functionally Linux 
specific).  Without hole punching, the only way to create a sparse file is 
to seek over areas that are supposed to be empty when writing the file 
out initially, but you can't do that with an existing file because you 
then have old data where you're supposed to have zeroes.
> 
> There's fallocate -d, but that for some reason touches mtime which makes
> rsync go again.  This can be handled manually but is still not nice.

It touches mtime because it updates the block allocations, which in turn 
touch ctime, which on most (possibly all, not sure though) POSIX systems 
implies an mtime update.  It's essentially the same as truncate updating 
the mtime when you extend the file; the only difference is that punching 
a hole (FALLOC_FL_PUNCH_HOLE) doesn't change the file size.
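
For reference, the manual equivalent of what fallocate -d automates over a 
known-zero range looks like this (file name and range are illustrative):

fallocate --punch-hole --offset 1MiB --length 4MiB guest.raw   # --keep-size is implied
du -h --apparent-size guest.raw ; du -h guest.raw              # apparent size unchanged, allocation shrinks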


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 17:36                   ` Austin S. Hemmelgarn
@ 2017-09-12 18:43                     ` Adam Borowski
  2017-09-12 18:47                       ` Christoph Hellwig
  2017-09-12 19:11                       ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 32+ messages in thread
From: Adam Borowski @ 2017-09-12 18:43 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Marat Khalili, Duncan, linux-btrfs

On Tue, Sep 12, 2017 at 01:36:48PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 13:21, Adam Borowski wrote:
> > There's fallocate -d, but that for some reason touches mtime which makes
> > rsync go again.  This can be handled manually but is still not nice.

> It touches mtime because it updates the block allocations, which in turn
> touch ctime, which on most (possibly all, not sure though) POSIX systems
> implies an mtime update.  It's essentially the same as truncate updating the
> mtime when you extend the file, the only difference is that the
> FALLOCATE_PUNCH_HOLES ioctl doesn't change the file size.

Yeah, the underlying call does modify the file; it's merely fallocate -d
calling it on regions that are already zero.  The kernel doesn't know that,
so fallocate would have to restore the mtime by itself.

There's also another problem: such a check + punch are racy.  Unlike defrag
or FILE_EXTENT_SAME, you thus can't use it on a file that's in use (or could
suddenly become in use).  Fixing this would need kernel support, either as
FILE_EXTENT_SAME with /dev/zero or as a new mode of fallocate.

For now, though, I wonder -- should we send the fine folks at util-linux a
patch to make fallocate -d restore mtime, either always or as an option?


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄⠀⠀⠀⠀ (funeral doom metal).


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 18:43                     ` Adam Borowski
@ 2017-09-12 18:47                       ` Christoph Hellwig
  2017-09-12 19:12                         ` Austin S. Hemmelgarn
  2017-09-12 19:11                       ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2017-09-12 18:47 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Austin S. Hemmelgarn, Marat Khalili, Duncan, linux-btrfs

On Tue, Sep 12, 2017 at 08:43:59PM +0200, Adam Borowski wrote:
> For now, though, I wonder -- should we send fine folks at util-linux a patch
> to make fallocate -d restore mtime, either always or on an option?

Don't do that.  Please just add a new ioctl or fallocate command
that punches a hole if the range is zeroed, similar to what dedup
does.  It can probably even reuse a few helpers.


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 18:43                     ` Adam Borowski
  2017-09-12 18:47                       ` Christoph Hellwig
@ 2017-09-12 19:11                       ` Austin S. Hemmelgarn
  2017-09-12 20:00                         ` Adam Borowski
  1 sibling, 1 reply; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-12 19:11 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Marat Khalili, Duncan, linux-btrfs

On 2017-09-12 14:43, Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 01:36:48PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-12 13:21, Adam Borowski wrote:
>>> There's fallocate -d, but that for some reason touches mtime which makes
>>> rsync go again.  This can be handled manually but is still not nice.
> 
>> It touches mtime because it updates the block allocations, which in turn
>> touch ctime, which on most (possibly all, not sure though) POSIX systems
>> implies an mtime update.  It's essentially the same as truncate updating the
>> mtime when you extend the file, the only difference is that the
>> FALLOCATE_PUNCH_HOLES ioctl doesn't change the file size.
> 
> Yeah, the underlying ioctl does modify the file, it's merely fallocate -d
> calling it on regions that are already zero.  The ioctl doesn't know that,
> so fallocate would have to restore the mtime by itself.
> 
> There's also another problem: such a check + ioctl are racey.  Unlike defrag
> or FILE_EXTENT_SAME, you can't thus use it on a file that's in use (or could
> suddenly become in use).  Fixing this would need kernel support, either as
> FILE_EXTENT_SAME with /dev/zero or as a new mode of fallocate.
A new fallocate mode would be more likely.  Adding special code to the 
EXTENT_SAME ioctl and then requiring implementation on filesystems that 
don't otherwise support it is not likely to get anywhere.  A new 
fallocate mode, though, would be straightforward, especially since a naive 
implementation is easy (block further requests to that range, complete 
all outstanding ones, check the range, punch the hole if possible, and 
then reopen requests for the range).

That said, I'm not 100% certain if it's necessary.  Intentionally 
calling fallocate on a file in use is not something most people are 
going to do normally anyway, since there is already a TOCTOU race in the 
fallocate -d implementation as things are right now.
> 
> For now, though, I wonder -- should we send fine folks at util-linux a patch
> to make fallocate -d restore mtime, either always or on an option?
It would need to be an option, because it also suffers from a TOCTOU 
race (other things might have changed the mtime while you were punching 
holes), and it breaks from existing behavior.  I think such an option 
would be useful, but not universally (for example, I don't care if the 
mtime on my VM images changes, as it typically matches the current date 
and time since the VMs are running constantly other than when doing 
maintenance like punching holes in the images).

You're the one with particular interest though, so I guess it's 
ultimately up to you how you choose to implement things in the patch ;)


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 18:47                       ` Christoph Hellwig
@ 2017-09-12 19:12                         ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-12 19:12 UTC (permalink / raw)
  To: Christoph Hellwig, Adam Borowski; +Cc: Marat Khalili, Duncan, linux-btrfs

On 2017-09-12 14:47, Christoph Hellwig wrote:
> On Tue, Sep 12, 2017 at 08:43:59PM +0200, Adam Borowski wrote:
>> For now, though, I wonder -- should we send fine folks at util-linux a patch
>> to make fallocate -d restore mtime, either always or on an option?
> 
> Don't do that.  Please just add a new ioctl or fallocate command
> that punches a hole if the range is zeroed, similar to what dedup
> does.  It can probably even reuse a few helpers.
> 
Agreed, that would be far preferred.


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 19:11                       ` Austin S. Hemmelgarn
@ 2017-09-12 20:00                         ` Adam Borowski
  2017-09-12 20:12                           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 32+ messages in thread
From: Adam Borowski @ 2017-09-12 20:00 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Marat Khalili, Duncan, linux-btrfs

On Tue, Sep 12, 2017 at 03:11:52PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 14:43, Adam Borowski wrote:
> > On Tue, Sep 12, 2017 at 01:36:48PM -0400, Austin S. Hemmelgarn wrote:
> > > On 2017-09-12 13:21, Adam Borowski wrote:
> > > > There's fallocate -d, but that for some reason touches mtime which makes
> > > > rsync go again.  This can be handled manually but is still not nice.
> > 
> > Yeah, the underlying ioctl does modify the file, it's merely fallocate -d
> > calling it on regions that are already zero.  The ioctl doesn't know that,
> > so fallocate would have to restore the mtime by itself.
> > 
> > There's also another problem: such a check + ioctl are racey.  Unlike defrag
> > or FILE_EXTENT_SAME, you can't thus use it on a file that's in use (or could
> > suddenly become in use).  Fixing this would need kernel support, either as
> > FILE_EXTENT_SAME with /dev/zero or as a new mode of fallocate.
> A new fallocate mode would be more likely.  Adding special code to the
> EXTENT_SAME ioctl and then requiring implementation on filesystems that
> don't otherwise support it is not likely to get anywhere.  A new fallocate
> mode though would be easy, especially considering that a naive
> implementation is easy

Sounds like a good idea.  If we go this way, there's a question about the
interface: there's a choice between:
A) check if the whole range is zero; if even a single bit is one, abort
B) dig many holes, with a given granularity (perhaps left to the
   filesystem's choice)
or even both.  The former is more consistent with FILE_EXTENT_SAME, the
latter can be smarter (like, digging a 4k hole is bad for fragmentation but
replacing a whole extent, no matter how small, is always a win).

> That said, I'm not 100% certain if it's necessary.  Intentionally calling
> fallocate on a file in use is not something most people are going to do
> normally anyway, since there is already a TOCTOU race in the fallocate -d
> implementation as things are right now.

_Current_ fallocate -d suffers from races; the whole gain from doing this
kernel-side would be eliminating those races.  The use cases are about the
same as FILE_EXTENT_SAME: you don't need to stop the world.  Heck, as I
mentioned before, it conceptually _is_ FILE_EXTENT_SAME with /dev/zero,
other than your (good) point about non-btrfs non-xfs.

> > For now, though, I wonder -- should we send fine folks at util-linux a patch
> > to make fallocate -d restore mtime, either always or on an option?
> It would need to be an option, because it also suffers from a TOCTOU race
> (other things might have changed the mtime while you were punching holes),
> and it breaks from existing behavior.  I think such an option would be
> useful, but not universally (for example, I don't care if the mtime on my VM
> images changes, as it typically matches the current date and time since the
> VM's are running constantly other than when doing maintenance like punching
> holes in the images).

Noted.  Both Marat's and my use cases, though, involve VMs that are off most
of the time, and at least for me, turned on only to test something. 
Touching mtime makes rsync run again, and it's freaking _slow_: worse than
40 minutes for a 40GB VM (source:SSD target:deduped HDD).


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄⠀⠀⠀⠀ (funeral doom metal).


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 20:00                         ` Adam Borowski
@ 2017-09-12 20:12                           ` Austin S. Hemmelgarn
  2017-09-12 21:13                             ` Adam Borowski
  0 siblings, 1 reply; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-12 20:12 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Marat Khalili, Duncan, linux-btrfs

On 2017-09-12 16:00, Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 03:11:52PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-12 14:43, Adam Borowski wrote:
>>> On Tue, Sep 12, 2017 at 01:36:48PM -0400, Austin S. Hemmelgarn wrote:
>>>> On 2017-09-12 13:21, Adam Borowski wrote:
>>>>> There's fallocate -d, but that for some reason touches mtime which makes
>>>>> rsync go again.  This can be handled manually but is still not nice.
>>>
>>> Yeah, the underlying ioctl does modify the file, it's merely fallocate -d
>>> calling it on regions that are already zero.  The ioctl doesn't know that,
>>> so fallocate would have to restore the mtime by itself.
>>>
>>> There's also another problem: such a check + ioctl are racey.  Unlike defrag
>>> or FILE_EXTENT_SAME, you can't thus use it on a file that's in use (or could
>>> suddenly become in use).  Fixing this would need kernel support, either as
>>> FILE_EXTENT_SAME with /dev/zero or as a new mode of fallocate.
>> A new fallocate mode would be more likely.  Adding special code to the
>> EXTENT_SAME ioctl and then requiring implementation on filesystems that
>> don't otherwise support it is not likely to get anywhere.  A new fallocate
>> mode though would be easy, especially considering that a naive
>> implementation is easy
> 
> Sounds like a good idea.  If we go this way, there's a question about
> interface: there's choice between:
> A) check if the whole range is zero, if even a single bit is one, abort
> B) dig many holes, with a given granulation (perhaps left to the
>     filesystem's choice)
> or even both.  The former is more consistent with FILE_EXTENT_SAME, the
> latter can be smarter (like, digging a 4k hole is bad for fragmentation but
> replacing a whole extent, no matter how small, is always a win).
The first.  It's more flexible, and the logic required for the second 
option is policy, which should not be in the kernel.  Matching the 
EXTENT_SAME semantics would probably also make the implementation 
significantly easier, and with some minor work might give a trivial 
implementation for any FS that already supports that ioctl.
> 
>> That said, I'm not 100% certain if it's necessary.  Intentionally calling
>> fallocate on a file in use is not something most people are going to do
>> normally anyway, since there is already a TOCTOU race in the fallocate -d
>> implementation as things are right now.
> 
> _Current_ fallocate -d suffers from races, the whole gain from doing this
> kernel-side would be eliminating those races.  Use cases about the same as
> FILE_EXTENT_SAME: you don't need to stop the world.  Heck, as I mentioned
> before, it conceptually _is_ FILE_EXTENT_SAME with /dev/null, other than
> your (good) point about non-btrfs non-xfs.
I meant we shouldn't worry about a race involving the mtime check given 
that there's an existing race inherent in the ioctl already.
> 
>>> For now, though, I wonder -- should we send fine folks at util-linux a patch
>>> to make fallocate -d restore mtime, either always or on an option?
>> It would need to be an option, because it also suffers from a TOCTOU race
>> (other things might have changed the mtime while you were punching holes),
>> and it breaks from existing behavior.  I think such an option would be
>> useful, but not universally (for example, I don't care if the mtime on my VM
>> images changes, as it typically matches the current date and time since the
>> VM's are running constantly other than when doing maintenance like punching
>> holes in the images).
> 
> Noted.  Both Marat's and my use cases, though, involve VMs that are off most
> of the time, and at least for me, turned on only to test something.
> Touching mtime makes rsync run again, and it's freaking _slow_: worse than
> 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if 
you're going direct to a hard drive.  I get better performance than that 
on my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s 
there, but it's for archival storage so I don't really care).  I'm 
actually curious what the exact rsync command you are using is (you can 
obviously redact paths as you see fit), as the only way I can think of 
that it should be that slow is if you're using both --checksum (but if 
you're using this, you can tell rsync to skip the mtime check, and that 
issue goes away) and --inplace, _and_ your HDD is slow to begin with.


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 20:12                           ` Austin S. Hemmelgarn
@ 2017-09-12 21:13                             ` Adam Borowski
  2017-09-13  0:52                               ` Timofey Titovets
                                                 ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Adam Borowski @ 2017-09-12 21:13 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Marat Khalili, Duncan, linux-btrfs

On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 16:00, Adam Borowski wrote:
> > Noted.  Both Marat's and my use cases, though, involve VMs that are off most
> > of the time, and at least for me, turned on only to test something.
> > Touching mtime makes rsync run again, and it's freaking _slow_: worse than
> > 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
> you're going direct to a hard drive.  I get better performance than that on
> my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s there,
> but it's for archival storage so I don't really care).  I'm actually curious
> what the exact rsync command you are using is (you can obviously redact
> paths as you see fit), as the only way I can think of that it should be that
> slow is if you're using both --checksum (but if you're using this, you can
> tell rsync to skip the mtime check, and that issue goes away) and --inplace,
> _and_ your HDD is slow to begin with.

rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
with nothing notable on SMART, in a Qnap 253a, kernel 4.9.

Both source and target are btrfs, but here switching to send|receive
wouldn't give much, as this particular guest is Win10 Insider Edition --
a thingy that shows what the folks from Redmond have cooked up, with roughly
weekly updates to the tune of ~10GB of writes and 10GB of deletions (even if
they do incremental transfers, the installation still rewrites the whole system).

Lemme look a bit more, rsync performance is indeed really abysmal compared
to what it should be.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄⠀⠀⠀⠀ (funeral doom metal).


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 21:13                             ` Adam Borowski
@ 2017-09-13  0:52                               ` Timofey Titovets
  2017-09-13 12:55                                 ` Austin S. Hemmelgarn
  2017-09-13 12:21                               ` Austin S. Hemmelgarn
  2017-09-13 14:47                               ` Martin Raiber
  2 siblings, 1 reply; 32+ messages in thread
From: Timofey Titovets @ 2017-09-13  0:52 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Austin S. Hemmelgarn, Marat Khalili, Duncan, linux-btrfs

2017-09-13 0:13 GMT+03:00 Adam Borowski <kilobyte@angband.pl>:
> On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-12 16:00, Adam Borowski wrote:
>> > Noted.  Both Marat's and my use cases, though, involve VMs that are off most
>> > of the time, and at least for me, turned on only to test something.
>> > Touching mtime makes rsync run again, and it's freaking _slow_: worse than
>> > 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
>> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
>> you're going direct to a hard drive.  I get better performance than that on
>> my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s there,
>> but it's for archival storage so I don't really care).  I'm actually curious
>> what the exact rsync command you are using is (you can obviously redact
>> paths as you see fit), as the only way I can think of that it should be that
>> slow is if you're using both --checksum (but if you're using this, you can
>> tell rsync to skip the mtime check, and that issue goes away) and --inplace,
>> _and_ your HDD is slow to begin with.
>
> rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
> The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
> with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
>
> Both source and target are btrfs, but here switching to send|receive
> wouldn't give much as this particular guest is Win10 Insider Edition --
> a thingy that shows what the folks from Redmond have cooked up, with roughly
> weekly updates to the tune of ~10GB writes 10GB deletions (if they do
> incremental transfers, installation still rewrites everything system).
>
> Lemme look a bit more, rsync performance is indeed really abysmal compared
> to what it should be.
>
>
> Meow!
> --
> ⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
> ⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
> ⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
> ⠈⠳⣄⠀⠀⠀⠀ (funeral doom metal).

No, no, no, no...
No new ioctl, no change to fallocate.
First: the VM can punch holes itself; if you use qemu -> qemu knows how to do it,
and a Windows guest also knows how to do it.

Different hypervisor? -> google -> file an issue asking for support; all of
Linux/Windows/macOS support holes in files.

No new code, no new strange stuff to fix things that aren't broken.

You want to replace zeroes? EXTENT_SAME can do that:

truncate -s 4M test_hole                       # 4M file that is entirely a hole
dd if=/dev/zero of=./test_zero bs=4M count=1   # 4M file of explicitly written zeroes

duperemove -vhrd ./test_hole ./test_zero

~ du -hs test_*
0       test_hole
0       test_zero

-- 
Have a nice day,
Timofey.


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 21:13                             ` Adam Borowski
  2017-09-13  0:52                               ` Timofey Titovets
@ 2017-09-13 12:21                               ` Austin S. Hemmelgarn
  2017-09-18 11:53                                 ` Adam Borowski
  2017-09-13 14:47                               ` Martin Raiber
  2 siblings, 1 reply; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-13 12:21 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Marat Khalili, linux-btrfs

On 2017-09-12 17:13, Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-12 16:00, Adam Borowski wrote:
>>> Noted.  Both Marat's and my use cases, though, involve VMs that are off most
>>> of the time, and at least for me, turned on only to test something.
>>> Touching mtime makes rsync run again, and it's freaking _slow_: worse than
>>> 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
>> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
>> you're going direct to a hard drive.  I get better performance than that on
>> my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s there,
>> but it's for archival storage so I don't really care).  I'm actually curious
>> what the exact rsync command you are using is (you can obviously redact
>> paths as you see fit), as the only way I can think of that it should be that
>> slow is if you're using both --checksum (but if you're using this, you can
>> tell rsync to skip the mtime check, and that issue goes away) and --inplace,
>> _and_ your HDD is slow to begin with.
> 
> rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
> The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
> with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
compress=zlib is probably your biggest culprit.  As odd as this sounds, 
I'd suggest switching that to lzo (seriously, the performance difference 
is ludicrous), and then setting up a cron job (or systemd timer) to run 
defrag over things to switch to zlib.  As a general point of comparison, 
we do archival backups to a file server running BTRFS where I work, and 
the archiving process runs about four to ten times faster if we take 
this approach (LZO for initial compression, then recompress using defrag 
once the initial transfer is done) than just using zlib directly.

`--inplace` is probably not helping either (on BTRFS, if you're rewriting most 
of the file anyway, it is actually marginally more efficient to just write out 
a whole new file and then replace the old one with a rename), but it is 
probably not as much of an issue as compress=zlib.
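
Roughly, with an illustrative mountpoint and path:

mount -o remount,compress=lzo /mnt/backup              # cheap compression while the transfer runs
# later, from cron or a systemd timer, recompress at leisure:
btrfs filesystem defragment -r -czlib /mnt/backup/qemu
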
> 
> Both source and target are btrfs, but here switching to send|receive
> wouldn't give much as this particular guest is Win10 Insider Edition --
> a thingy that shows what the folks from Redmond have cooked up, with roughly
> weekly updates to the tune of ~10GB writes 10GB deletions (if they do
> incremental transfers, installation still rewrites everything system).
>
> Lemme look a bit more, rsync performance is indeed really abysmal compared
> to what it should be.



* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-13  0:52                               ` Timofey Titovets
@ 2017-09-13 12:55                                 ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-13 12:55 UTC (permalink / raw)
  To: Timofey Titovets, Adam Borowski; +Cc: Marat Khalili, Duncan, linux-btrfs

On 2017-09-12 20:52, Timofey Titovets wrote:
> No, no, no, no...
> No new ioctl, no change in fallocate.
> Fisrt: VM can do punch hole, if you use qemu -> qemu know how to do it.
> Windows Guest also know how to do it.
> 
> Different Hypervisor? -> google -> Make issue to support, all
> Linux/Windows/Mac OS support holes in files.
Not everybody who uses sparse files is using virtual machines.
> 
> No new code, no new strange stuff to fix not broken things.
Um, the fallocate PUNCH_HOLE mode _is_ broken.  There's a race condition 
that can trivially cause data loss.
> 
> You want replace zeroes? EXTENT_SAME can do that.
But only on a small number of filesystems, and it requires extra work 
that shouldn't be necessary.
> 
> truncate -s 4M test_hole
> dd if=/dev/zero of=./test_zero bs=4M
> 
> duperemove -vhrd ./test_hole ./test_zero
And performance for this approach is absolute shit compared to fallocate -d.

Actual numbers, using a 4G test file (which is still small for what 
you're talking about) and a 4M hole file:
fallocate -d:		0.19 user, 0.85 system, 1.26 real
duperemove -vhrd:	0.75 user, 137.70 system, 144.80 real

So, for a 4G file, it took duperemove (and the EXTENT_SAME ioctl) 114.92 
times as long to achieve the same net effect.  From a practical 
perspective, this isn't viable for regular usage just because of how 
long it takes.  Most of that overhead is that the EXTENT_SAME ioctl does 
a byte-by-byte comparison of the ranges to make sure they match, but 
that isn't strictly necessary to avoid this race condition.  All that's 
actually needed is to determine whether there is outstanding I/O on that 
region and, if so, to do some special handling before freezing the region.


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12  8:02 qemu-kvm VM died during partial raid1 problems of btrfs Marat Khalili
  2017-09-12  8:25 ` Timofey Titovets
@ 2017-09-13 13:23 ` Chris Murphy
  2017-09-13 14:15   ` Marat Khalili
  1 sibling, 1 reply; 32+ messages in thread
From: Chris Murphy @ 2017-09-13 13:23 UTC (permalink / raw)
  To: Marat Khalili; +Cc: Linux fs Btrfs

On Tue, Sep 12, 2017 at 10:02 AM, Marat Khalili <mkh@rqc.ru> wrote:

> (3) it is possible that it uses O_DIRECT or something, and btrfs raid1 does
> not fully protect this kind of access.

Right, known problem. To use O_DIRECT implies also using nodatacow (or
at least nodatasum), i.e. the +C attribute is set, as done by qemu-img -o
nocow=on
https://www.spinics.net/lists/linux-btrfs/msg68244.html
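
For example (image name and size made up):

qemu-img create -f qcow2 -o nocow=on vm.qcow2 40G
# or set the flag on the images directory, so newly created files inherit it:
chattr +C /var/lib/libvirt/images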


-- 
Chris Murphy


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-13 13:23 ` Chris Murphy
@ 2017-09-13 14:15   ` Marat Khalili
  2017-09-13 17:52     ` Goffredo Baroncelli
  0 siblings, 1 reply; 32+ messages in thread
From: Marat Khalili @ 2017-09-13 14:15 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux fs Btrfs

On 13/09/17 16:23, Chris Murphy wrote:
> Right, known problem. To use o_direct implies also using nodatacow (or
> at least nodatasum), e.g. xattr +C is set, done by qemu-img -o
> nocow=on
> https://www.spinics.net/lists/linux-btrfs/msg68244.html
Can you please elaborate? I don't have exactly the same problem as 
described at the link, but I'm still worried that qemu in particular can 
be less resilient to partial raid1 failures even on newer kernels, due 
to missing checksums for instance. (BTW I didn't find any xattrs on my 
VM images, nor do I plan to set any.)
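
If I understand correctly, the +C flag is an inode attribute (chattr/lsattr)
rather than an extended attribute, so the way to check an image would be
something like (path is just an example):

lsattr /var/lib/libvirt/images/*.qcow2   # a 'C' in the flags column means nodatacow is set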

--

With Best Regards,
Marat Khalili


* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-12 21:13                             ` Adam Borowski
  2017-09-13  0:52                               ` Timofey Titovets
  2017-09-13 12:21                               ` Austin S. Hemmelgarn
@ 2017-09-13 14:47                               ` Martin Raiber
  2017-09-13 15:25                                 ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 32+ messages in thread
From: Martin Raiber @ 2017-09-13 14:47 UTC (permalink / raw)
  To: Adam Borowski, Austin S. Hemmelgarn; +Cc: Marat Khalili, Duncan, linux-btrfs

Hi,

On 12.09.2017 23:13 Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-12 16:00, Adam Borowski wrote:
>>> Noted.  Both Marat's and my use cases, though, involve VMs that are off most
>>> of the time, and at least for me, turned on only to test something.
>>> Touching mtime makes rsync run again, and it's freaking _slow_: worse than
>>> 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
>> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
>> you're going direct to a hard drive.  I get better performance than that on
>> my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s there,
>> but it's for archival storage so I don't really care).  I'm actually curious
>> what the exact rsync command you are using is (you can obviously redact
>> paths as you see fit), as the only way I can think of that it should be that
>> slow is if you're using both --checksum (but if you're using this, you can
>> tell rsync to skip the mtime check, and that issue goes away) and --inplace,
>> _and_ your HDD is slow to begin with.
> rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
> The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
> with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
>
> Both source and target are btrfs, but here switching to send|receive
> wouldn't give much as this particular guest is Win10 Insider Edition --
> a thingy that shows what the folks from Redmond have cooked up, with roughly
> weekly updates to the tune of ~10GB writes 10GB deletions (if they do
> incremental transfers, installation still rewrites everything system).
>
> Lemme look a bit more, rsync performance is indeed really abysmal compared
> to what it should be.

Self promo, but consider using UrBackup (OSS software, too) instead? For
Windows VMs I would install the client in the VM. It excludes unnecessary
stuff like page files or the shadow storage area from the image
backups, and it has a mode to store image backups as raw btrfs files.
Linux VMs I'd back up as files, either from the hypervisor or from inside the VM.
If you want to back up big btrfs image files it can do that too, and
faster than rsync; plus it can do incremental backups with sparse files.

Regards,
Martin Raiber



* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-13 14:47                               ` Martin Raiber
@ 2017-09-13 15:25                                 ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-13 15:25 UTC (permalink / raw)
  To: Martin Raiber, Adam Borowski; +Cc: Marat Khalili, Duncan, linux-btrfs

On 2017-09-13 10:47, Martin Raiber wrote:
> Hi,
> 
> On 12.09.2017 23:13 Adam Borowski wrote:
>> On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
>>> On 2017-09-12 16:00, Adam Borowski wrote:
>>>> Noted.  Both Marat's and my use cases, though, involve VMs that are off most
>>>> of the time, and at least for me, turned on only to test something.
>>>> Touching mtime makes rsync run again, and it's freaking _slow_: worse than
>>>> 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
>>> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
>>> you're going direct to a hard drive.  I get better performance than that on
>>> my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s there,
>>> but it's for archival storage so I don't really care).  I'm actually curious
>>> what the exact rsync command you are using is (you can obviously redact
>>> paths as you see fit), as the only way I can think of that it should be that
>>> slow is if you're using both --checksum (but if you're using this, you can
>>> tell rsync to skip the mtime check, and that issue goes away) and --inplace,
>>> _and_ your HDD is slow to begin with.
>> rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
>> The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
>> with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
>>
>> Both source and target are btrfs, but here switching to send|receive
>> wouldn't give much as this particular guest is Win10 Insider Edition --
>> a thingy that shows what the folks from Redmond have cooked up, with roughly
>> weekly updates to the tune of ~10GB of writes and 10GB of deletions (even if
>> they do incremental transfers, the installation still rewrites the whole system).
>>
>> Lemme look a bit more, rsync performance is indeed really abysmal compared
>> to what it should be.
> 
> Self-promotion, but consider using UrBackup (OSS software, too) instead?
> For Windows VMs I would install the client in the VM. It excludes
> unnecessary stuff such as page files or the shadow storage area from the
> image backups, and it has a mode to store image backups as raw btrfs
> files. Linux VMs I'd back up as files, either from the hypervisor or
> from inside the VM. If you want to back up big btrfs image files it can
> do that too, and faster than rsync; plus it can do incremental backups
> with sparse files.
Even without UrBackup (I'll need to look into that actually, we're 
looking for new backup software where I work since MS has been debating 
removing File History, and the custom scripts my predecessor wrote are 
showing their 20+ year age at this point), it's usually better to just 
run the backup from inside the VM if at all possible.  You end up saving 
space, and don't waste time backing up stuff you don't need.

In this particular use case, it would also save other system resources, 
since you only need to back up the VM if something has changed, and by 
definition nothing could have changed in the VM (at least, nothing could 
have legitimately changed) if it's not running.
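
(A minimal sketch of that "only if something changed" check, with
hypothetical paths -- compare the image's mtime against a stamp file
left by the previous successful run, and skip the transfer otherwise:

# paths are examples only
IMG=/mnt/btr1/qemu/win10.img
STAMP=/var/lib/backup/win10.stamp
if [ "$IMG" -nt "$STAMP" ]; then
    rsync -axX --inplace --numeric-ids "$IMG" mordor:/SOME/DIR/win10.img \
        && touch "$STAMP"
fi
)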

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-13 14:15   ` Marat Khalili
@ 2017-09-13 17:52     ` Goffredo Baroncelli
  0 siblings, 0 replies; 32+ messages in thread
From: Goffredo Baroncelli @ 2017-09-13 17:52 UTC (permalink / raw)
  To: Marat Khalili; +Cc: Chris Murphy, Linux fs Btrfs, Josef Bacik

On 09/13/2017 04:15 PM, Marat Khalili wrote:
> On 13/09/17 16:23, Chris Murphy wrote:
>> Right, known problem. To use o_direct implies also using nodatacow (or
>> at least nodatasum), e.g. xattr +C is set, done by qemu-img -o
>> nocow=on
>> https://www.spinics.net/lists/linux-btrfs/msg68244.html
> Can you please elaborate? I don't have exactly the same problem as described at the link, but I'm still worried that qemu in particular can be less resilient to partial raid1 failures even on newer kernels, for instance due to missing checksums. (BTW, I didn't find any xattrs on my VM images, nor do I plan to set any.)

From what Josef Bacik wrote, I understood that it is not only related to RAID1. I tried to ask for further clarification, without success :(

It seems that simply using O_DIRECT can produce checksum mismatches. My understanding is that, in order to avoid copying the data through an intermediate buffer, the checksum computation is subject to a data race: the kernel may compute the checksum *while* the user space program is still changing the data.
This leads to an I/O error on a subsequent read.

To avoid that, BTRFS would have to copy the data into a temporary buffer and then compute the checksum over the copy. But that copy is exactly what common sense suggests O_DIRECT should avoid.

If I understood correctly (which is a BIG if), I think that O_DIRECT should be unsupported (i.e. return -EINVAL) if the file is not marked nodatasum.
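
(In the meantime, the practical way to get that behaviour for VM images
is to make them nodatacow, e.g. by marking the directory that holds them
with +C so newly created files inherit the flag -- a sketch only, the
path is an example, and note it has no effect on data already written to
existing files:

$ chattr +C /var/lib/libvirt/images
$ lsattr -d /var/lib/libvirt/images
)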

I looked at what ZFS on Linux does: it seems that it doesn't support O_DIRECT [1] for the same reason (see the comment by ryao from Jul 23, 2015 for further details).

Anyway, I suggest reading what the open(2) man page says about O_DIRECT: it has to be used carefully when fork() is involved; the man page concludes:
[...]
In  summary,  O_DIRECT  is  a  potentially powerful tool that should be used with caution.  It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.
[...]

[1] https://github.com/zfsonlinux/zfs/issues/224


> 
> -- 
> 
> With Best Regards,
> Marat Khalili
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: qemu-kvm VM died during partial raid1 problems of btrfs
  2017-09-13 12:21                               ` Austin S. Hemmelgarn
@ 2017-09-18 11:53                                 ` Adam Borowski
  0 siblings, 0 replies; 32+ messages in thread
From: Adam Borowski @ 2017-09-18 11:53 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Marat Khalili, linux-btrfs

On Wed, Sep 13, 2017 at 08:21:01AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 17:13, Adam Borowski wrote:
> > On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
> > > On 2017-09-12 16:00, Adam Borowski wrote:
> > > > Noted.  Both Marat's and my use cases, though, involve VMs that are off most
> > > > of the time, and at least for me, turned on only to test something.
> > > > Touching mtime makes rsync run again, and it's freaking _slow_: worse than
> > > > 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
> > > 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
> > > you're going direct to a hard drive.  I get better performance than that on
> > > my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s there,
> > > but it's for archival storage so I don't really care).  I'm actually curious
> > > what the exact rsync command you are using is (you can obviously redact
> > > paths as you see fit), as the only way I can think of that it should be that
> > > slow is if you're using both --checksum (but if you're using this, you can
> > > tell rsync to skip the mtime check, and that issue goes away) and --inplace,
> > > _and_ your HDD is slow to begin with.
> >
> > rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
> > The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
> > with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
> compress=zlib is probably your biggest culprit.  As odd as this sounds, I'd
> suggest switching that to lzo (seriously, the performance difference is
> ludicrous), and then setting up a cron job (or systemd timer) to run defrag
> over things to switch to zlib.  As a general point of comparison, we do
> archival backups to a file server running BTRFS where I work, and the
> archiving process runs about four to ten times faster if we take this
> approach (LZO for initial compression, then recompress using defrag once the
> initial transfer is done) than just using zlib directly.
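
(For reference, the recompress step described above would look roughly
like this -- assuming the backups land under /mnt/backup/qemu, which is
just an example path, on a filesystem mounted with -o compress=lzo for
the initial transfer:

btrfs filesystem defragment -r -czlib /mnt/backup/qemu
)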

Turns out that lzo is actually the slowest, but only by a bit.

I tried a different disk, in the same Qnap; also an old disk, but 7200 rpm
rather than 5400.  Mostly empty, only a handful of subvolumes, not much
reflinking.  I made three separate copies, ran fallocate -d on them,
upgraded Windows inside the VM, then:

[/mnt/btr1/qemu]$ for x in none lzo zlib;do time rsync -axX --delete --inplace --numeric-ids win10.img mordor:/SOME/DIR/$x/win10.img;done

real    31m37.459s
user    27m21.587s
sys     2m16.210s

real    33m28.258s
user    27m19.745s
sys     2m17.642s

real    32m57.058s
user    27m24.297s
sys     2m17.640s

Note the "user" values.  So rsync does something bad on the source side.
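
(A guess worth testing: over a network rsync still runs its delta-transfer
algorithm, which is CPU-heavy on the sender; -W/--whole-file skips it, at
the cost of always sending the full image:

rsync -axXW --delete --inplace --numeric-ids win10.img mordor:/SOME/DIR/win10.img
)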

Despite fragmentation, reads on the source are not a problem:

[/mnt/btr1/qemu]$ time cat <win10.img >/dev/null

real	1m28.815s
user	0m0.061s
sys	0m48.094s
[/mnt/btr1/qemu]$ /usr/sbin/filefrag win10.img 
win10.img: 63682 extents found
[/mnt/btr1/qemu]$ btrfs fi def win10.img
[/mnt/btr1/qemu]$ /usr/sbin/filefrag win10.img 
win10.img: 18015 extents found
[/mnt/btr1/qemu]$ time cat <win10.img >/dev/null

real	1m17.879s
user	0m0.076s
sys	0m37.757s

> `--inplace` is probably not helping (especially if most of the file changed,
> on BTRFS, it actually is marginally more efficient to just write out a whole
> new file and then replace the old one with a rename if you're rewriting most
> of the file), but is probably not as much of an issue as compress=zlib.

Yeah, scp + dedupe would run faster.  For deduplication, instead of
duperemove it'd be better to call file_extent_same on the first 128K, then
the second, ... -- without even hashing the blocks beforehand.
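
(Roughly like this, using xfs_io's dedupe command as a front end for the
ioctl -- a sketch only, the paths are placeholders; the kernel compares
each range itself and only links the ones that are identical, which is
why no hashing is needed up front:

SRC=/SOME/DIR/previous/win10.img   # hypothetical previous backup
DST=/SOME/DIR/current/win10.img    # hypothetical fresh copy
SIZE=$(stat -c %s "$DST")
off=0
while [ "$off" -lt "$SIZE" ]; do
    # dedupe 128K of DST at $off against the same offset in SRC;
    # non-matching ranges are simply left alone by the kernel
    # (the final, partial block may be rejected -- fine for a sketch)
    xfs_io -c "dedupe $SRC $off $off 131072" "$DST"
    off=$((off + 131072))
done
)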

Not that this particular VM takes enough backup space to make spending too
much time worthwhile, but it's a good test case for performance issues like
this.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄⠀⠀⠀⠀ (funeral doom metal).

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread

Thread overview: 32+ messages
2017-09-12  8:02 qemu-kvm VM died during partial raid1 problems of btrfs Marat Khalili
2017-09-12  8:25 ` Timofey Titovets
2017-09-12  8:42   ` Marat Khalili
2017-09-12  9:21     ` Timofey Titovets
2017-09-12  9:29       ` Marat Khalili
2017-09-12  9:35         ` Timofey Titovets
2017-09-12 10:01     ` Duncan
2017-09-12 10:32       ` Adam Borowski
2017-09-12 10:39         ` Marat Khalili
2017-09-12 11:01           ` Timofey Titovets
2017-09-12 11:12             ` Adam Borowski
2017-09-12 11:17               ` Timofey Titovets
2017-09-12 11:26               ` Marat Khalili
2017-09-12 17:21                 ` Adam Borowski
2017-09-12 17:36                   ` Austin S. Hemmelgarn
2017-09-12 18:43                     ` Adam Borowski
2017-09-12 18:47                       ` Christoph Hellwig
2017-09-12 19:12                         ` Austin S. Hemmelgarn
2017-09-12 19:11                       ` Austin S. Hemmelgarn
2017-09-12 20:00                         ` Adam Borowski
2017-09-12 20:12                           ` Austin S. Hemmelgarn
2017-09-12 21:13                             ` Adam Borowski
2017-09-13  0:52                               ` Timofey Titovets
2017-09-13 12:55                                 ` Austin S. Hemmelgarn
2017-09-13 12:21                               ` Austin S. Hemmelgarn
2017-09-18 11:53                                 ` Adam Borowski
2017-09-13 14:47                               ` Martin Raiber
2017-09-13 15:25                                 ` Austin S. Hemmelgarn
2017-09-12 11:09         ` Roman Mamedov
2017-09-13 13:23 ` Chris Murphy
2017-09-13 14:15   ` Marat Khalili
2017-09-13 17:52     ` Goffredo Baroncelli
