linux-btrfs.vger.kernel.org archive mirror
* dm-integrity + mdadm + btrfs = no journal?
@ 2019-01-29 23:15 Hans van Kranenburg
  2019-01-30  1:02 ` Chris Murphy
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Hans van Kranenburg @ 2019-01-29 23:15 UTC (permalink / raw)
  To: linux-btrfs

Hi,

Thought experiment time...

I have an HP z820 workstation here (with ECC memory, yay!) and 4x250G
10k SAS disks (and some spare disks). It's donated hardware, and I'm
going to use it to replace the current server in the office of a
non-profit organization (so it's not work stuff this time).

The machine is going to run Debian/Xen and a few virtual machines
(current one also does, but the hardware is now really starting to fall
apart).

I have been thinking a bit how to (re)organize disk storage in this
scenario.

1. Let's use btrfs everywhere. \:D/
2. For running Xen virtual machines, I prefer block devices on LVM. No
image files, no btrfs-on-btrfs etc...
3. Oh, and there's also 1 MS Windows VM that will be in the mix.

Obviously I can't start using multi-device btrfs in each and every
virtual machine (a big pile of horror when one disk dies or starts
misbehaving).

So, what I was thinking of is:

* Use dm-integrity on partitions on the individual disks
* Use mdadm RAID10 on top (which is then able to repair bitrot)
* Use LVM on top
* Etc...
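
A toy model of why this layering should be able to repair bitrot: the integrity layer turns a checksum mismatch into a read error, and the mirror above it answers that error from the other copy and rewrites the bad one. This is only a sketch in plain Python, not the kernel's on-disk format. The class names, the use of CRC32 as the tag, and the two-way mirror standing in for RAID10 are all assumptions made for illustration.

```python
import zlib


class IntegrityDisk:
    """Toy stand-in for one dm-integrity device: each sector has data and a tag."""

    def __init__(self, sectors):
        self.data = [b"\0" * 8 for _ in range(sectors)]
        self.tags = [zlib.crc32(d) for d in self.data]

    def write(self, n, payload):
        self.data[n] = payload
        self.tags[n] = zlib.crc32(payload)

    def read(self, n):
        # A tag mismatch is reported as an I/O error instead of returning bad data.
        if zlib.crc32(self.data[n]) != self.tags[n]:
            raise IOError(f"integrity error in sector {n}")
        return self.data[n]


class Mirror:
    """Toy stand-in for the md layer: two copies, repair on read error."""

    def __init__(self, a, b):
        self.legs = [a, b]

    def write(self, n, payload):
        for leg in self.legs:
            leg.write(n, payload)

    def read(self, n):
        for i, leg in enumerate(self.legs):
            try:
                good = leg.read(n)
            except IOError:
                continue
            # Every leg before this one failed to return the sector: rewrite it.
            for other in self.legs[:i]:
                other.write(n, good)
            return good
        raise IOError(f"sector {n} unreadable on all legs")


md = Mirror(IntegrityDisk(4), IntegrityDisk(4))
md.write(2, b"payload!")
md.legs[0].data[2] = b"pAyload!"   # silent bit flip on the first disk
recovered = md.read(2)             # served from the second disk
repaired = md.legs[0].read(2)      # the first disk has been rewritten
print(recovered, repaired)
```

Without the integrity layer, the first leg would simply have returned the flipped data, and the mirror would have had no way to tell which copy was the good one.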

For all of the filesystems, I would be doing backups to a remote
location outside of the building with send/receive.

The Windows VM will be an image file on a btrfs filesystem in the Xen
dom0. It's idle most of the time, and I think cow+autodefrag can easily
handle it. I'd like to be able to take snapshots of it which can be sent
to a remote location.

Now, to finally throw in the big question: If I use btrfs everywhere,
can I run dm-integrity without a journal?

As far as I can reason about.. I could. As long as there's no 'nocow'
happening, the only thing that needs to happen correctly is superblock
writes, right?

-- 
Hans van Kranenburg


* Re: dm-integrity + mdadm + btrfs = no journal?
  2019-01-29 23:15 dm-integrity + mdadm + btrfs = no journal? Hans van Kranenburg
@ 2019-01-30  1:02 ` Chris Murphy
  2019-01-30  8:42 ` Roman Mamedov
  2019-01-30 12:58 ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 9+ messages in thread
From: Chris Murphy @ 2019-01-30  1:02 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: linux-btrfs

On Tue, Jan 29, 2019 at 4:15 PM Hans van Kranenburg
<Hans.van.Kranenburg@mendix.com> wrote:
>
> Hi,
>
> Thought experiment time...
>
> I have an HP z820 workstation here (with ECC memory, yay!) and 4x250G
> 10k SAS disks (and some spare disks). It's donated hardware, and I'm
> going to use it to replace the current server in the office of a
> non-profit organization (so it's not work stuff this time).
>
> The machine is going to run Debian/Xen and a few virtual machines
> (current one also does, but the hardware is now really starting to fall
> apart).
>
> I have been thinking a bit how to (re)organize disk storage in this
> scenario.
>
> 1. Let's use btrfs everywhere. \:D/
> 2. For running Xen virtual machines, I prefer block devices on LVM. No
> image files, no btrfs-on-btrfs etc...
> 3. Oh, and there's also 1 MS Windows VM that will be in the mix.
>
> Obviously I can't start using multi-device btrfs in each and every
> virtual machine (a big pile of horror when one disk dies or starts
> misbehaving).
>
> So, what I was thinking of is:
>
> * Use dm-integrity on partitions on the individual disks
> * Use mdadm RAID10 on top (which is then able to repair bitrot)
> * Use LVM on top
> * Etc...
>
> For all of the filesystems, I would be doing backups to a remote
> location outside of the building with send/receive.
>
> The Windows VM will be an image file on a btrfs filesystem in the Xen
> dom0. It's idle most of the time, and I think cow+autodefrag can easily
> handle it. I'd like to be able to take snapshots of it which can be sent
> to a remote location.

I'd consider thin-provisioned (thinp) LVs for the VMs. They are far more
efficient for snapshots than conventional (thick) LVM snapshots. There
is no command to compute and send/receive only the LVM extents that
have changed, though. And this applies to NTFS as well. In effect, you
can shrink any LV without literally shrinking it: you just need to run
fstrim on the mounted volume (or use the discard mount option inside
each VM, or enable fstrim.timer), and this will cause unused LVM logical
extents to be returned to the thin pool, where they can be used by any
other LV that draws from that pool.
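
To make the "shrink without shrinking" point concrete, here is a toy model of a thin pool in plain Python. It is only a sketch of the accounting, not of how LVM implements it. The class and method names are made up for illustration, and a discard is assumed to map one-to-one onto pool extents.

```python
class ThinPool:
    """Toy thin pool: extents are handed out on first write, returned on discard."""

    def __init__(self, total_extents):
        self.free = total_extents
        self.mapped = {}  # volume name -> set of mapped logical extents

    def create_volume(self, name):
        self.mapped[name] = set()

    def write(self, name, extent):
        if extent not in self.mapped[name]:
            if self.free == 0:
                raise RuntimeError("thin pool exhausted")
            self.free -= 1
            self.mapped[name].add(extent)

    def discard(self, name, extents):
        # What a trim from the guest filesystem amounts to in this model.
        for extent in extents:
            if extent in self.mapped[name]:
                self.mapped[name].remove(extent)
                self.free += 1


pool = ThinPool(total_extents=10)
pool.create_volume("vm1")
pool.create_volume("vm2")
for e in range(8):
    pool.write("vm1", e)           # vm1 maps 8 extents
free_before = pool.free
pool.discard("vm1", range(3, 8))   # guest deleted files, then trimmed
free_after = pool.free
for e in range(6):
    pool.write("vm2", e)           # vm2 reuses the returned extents
print(free_before, free_after, pool.free)
```

Before the discard, the six writes from vm2 would have exhausted the pool; after it, they fit.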

It's been a couple years since I tested NTFS in a Raw file on Btrfs
but at that time it was just pathological and I gave up. It was so
slow. Btrfs on Raw image on Btrfs was way faster. You could also
consider XFS on LVM or plain partition, with a qcow2 file as backing.
Snapshots are supported by creating a new image that points to another
as a backing file. And you can easily backup these snapshots as they
are discrete files.


> Now, to finally throw in the big question: If I use btrfs everywhere,
> can I run dm-integrity without a journal?

The documentation says that if you run without a journal, dm-integrity
is no longer crash safe, i.e. writes are no longer atomic. That to me
is the whole point of dm-integrity, so I wouldn't do it even if I'm
using Btrfs on top.


>
> As far as I can reason about.. I could. As long as there's no 'nocow'
> happening, the only thing that needs to happen correctly is superblock
> writes, right?

Metadata is always cow. There is a nodatacow mount option but that
doesn't affect metadata. But in either case the whole point of having
dm-integrity in place is to have a Linux block device layer that tells
you if there has been some kind of storage stack corruption; to make
silent corruption visible. And that's not as effective if it's
possible to get such corruption in the course of a crash or power
failure of some kind. It might be useful to ask on the linux-integrity
list.

http://vger.kernel.org/vger-lists.html#linux-integrity

-- 
Chris Murphy


* Re: dm-integrity + mdadm + btrfs = no journal?
  2019-01-29 23:15 dm-integrity + mdadm + btrfs = no journal? Hans van Kranenburg
  2019-01-30  1:02 ` Chris Murphy
@ 2019-01-30  8:42 ` Roman Mamedov
  2019-01-30 12:58 ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 9+ messages in thread
From: Roman Mamedov @ 2019-01-30  8:42 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: linux-btrfs

On Tue, 29 Jan 2019 23:15:18 +0000
Hans van Kranenburg <Hans.van.Kranenburg@mendix.com> wrote:

> So, what I was thinking of is:
> 
> * Use dm-integrity on partitions on the individual disks
> * Use mdadm RAID10 on top (which is then able to repair bitrot)
> * Use LVM on top
> * Etc...

You never explicitly say what the whole idea is, or what you are protecting
against. From the mentions of bitrot and dm-integrity, you seem to think that
when hardware is "starting to fall apart" the disks will eventually start
returning wrong/corrupt data.

Thing is, they do not. What you get from disks going bad is uncorrectable
read errors (UNC), not silent corruption. The latter is still possible, but
it is more likely to be caused by SATA controller issues (or its
driver/firmware) than by the disks. And those are hardly related to whether
the hardware is "new" or "old".

-- 
With respect,
Roman


* Re: dm-integrity + mdadm + btrfs = no journal?
  2019-01-29 23:15 dm-integrity + mdadm + btrfs = no journal? Hans van Kranenburg
  2019-01-30  1:02 ` Chris Murphy
  2019-01-30  8:42 ` Roman Mamedov
@ 2019-01-30 12:58 ` Austin S. Hemmelgarn
  2019-01-30 15:26   ` Christoph Anton Mitterer
  2 siblings, 1 reply; 9+ messages in thread
From: Austin S. Hemmelgarn @ 2019-01-30 12:58 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs

On 2019-01-29 18:15, Hans van Kranenburg wrote:
> Hi,
> 
> Thought experiment time...
> 
> I have an HP z820 workstation here (with ECC memory, yay!) and 4x250G
> 10k SAS disks (and some spare disks). It's donated hardware, and I'm
> going to use it to replace the current server in the office of a
> non-profit organization (so it's not work stuff this time).
> 
> The machine is going to run Debian/Xen and a few virtual machines
> (current one also does, but the hardware is now really starting to fall
> apart).
> 
> I have been thinking a bit how to (re)organize disk storage in this
> scenario.
> 
> 1. Let's use btrfs everywhere. \:D/
> 2. For running Xen virtual machines, I prefer block devices on LVM. No
> image files, no btrfs-on-btrfs etc...
> 3. Oh, and there's also 1 MS Windows VM that will be in the mix.
> 
> Obviously I can't start using multi-device btrfs in each and every
> virtual machine (a big pile of horror when one disk dies or starts
> misbehaving).
> 
> So, what I was thinking of is:
> 
> * Use dm-integrity on partitions on the individual disks
> * Use mdadm RAID10 on top (which is then able to repair bitrot)
> * Use LVM on top
> * Etc...
> 
> For all of the filesystems, I would be doing backups to a remote
> location outside of the building with send/receive.
> 
> The Windows VM will be an image file on a btrfs filesystem in the Xen
> dom0. It's idle most of the time, and I think cow+autodefrag can easily
> handle it. I'd like to be able to take snapshots of it which can be sent
> to a remote location.
I would suggest against this.  NTFS is a pathologically bad case even 
when using it from inside Linux and leaving it almost completely idle. 
When used from Windows, it has horrible performance and trashes 
performance of _all_ other VM images on the same disk.

Also, just in general, I've only seen at best mediocre results from
using BTRFS for VM image storage when using Xen.  I'm not sure exactly
why, but I think it has something to do with how the Xen block backend
accesses the filesystem.
> 
> Now, to finally throw in the big question: If I use btrfs everywhere,
> can I run dm-integrity without a journal?
> 
> As far as I can reason about.. I could. As long as there's no 'nocow'
> happening, the only thing that needs to happen correctly is superblock
> writes, right?
Running dm-integrity without a journal is roughly equivalent to using 
the nobarrier mount option (the journal is used to provide the same 
guarantees that barriers do).  IOW, don't do this unless you are willing 
to lose the whole volume.


* Re: dm-integrity + mdadm + btrfs = no journal?
  2019-01-30 12:58 ` Austin S. Hemmelgarn
@ 2019-01-30 15:26   ` Christoph Anton Mitterer
  2019-01-30 16:00     ` Austin S. Hemmelgarn
  2019-01-30 16:38     ` Hans van Kranenburg
  0 siblings, 2 replies; 9+ messages in thread
From: Christoph Anton Mitterer @ 2019-01-30 15:26 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Hans van Kranenburg, linux-btrfs

On Wed, 2019-01-30 at 07:58 -0500, Austin S. Hemmelgarn wrote:
> Running dm-integrity without a journal is roughly equivalent to
> using 
> the nobarrier mount option (the journal is used to provide the same 
> guarantees that barriers do).  IOW, don't do this unless you are
> willing 
> to lose the whole volume.

That sounds a bit strange to me.

My understanding was that the idea of being able to disable the journal
of dm-integrity was just to avoid any double work, if equivalent
guarantees are already given by higher levels.

If btrfs is by itself already safe (by using barriers), then I'd have
expected that no transaction is committed unless it got through all
lower layers... so either everything works well on the dm-integrity
base (and thus no journal is needed)... or it fails there... but then
btrfs would already be safe by its own means (barriers + CoW)?


Cheers,
Chris.



* Re: dm-integrity + mdadm + btrfs = no journal?
  2019-01-30 15:26   ` Christoph Anton Mitterer
@ 2019-01-30 16:00     ` Austin S. Hemmelgarn
  2019-01-30 16:31       ` Christoph Anton Mitterer
  2019-01-30 16:38     ` Hans van Kranenburg
  1 sibling, 1 reply; 9+ messages in thread
From: Austin S. Hemmelgarn @ 2019-01-30 16:00 UTC (permalink / raw)
  To: Christoph Anton Mitterer, Hans van Kranenburg, linux-btrfs

On 2019-01-30 10:26, Christoph Anton Mitterer wrote:
> On Wed, 2019-01-30 at 07:58 -0500, Austin S. Hemmelgarn wrote:
>> Running dm-integrity without a journal is roughly equivalent to
>> using
>> the nobarrier mount option (the journal is used to provide the same
>> guarantees that barriers do).  IOW, don't do this unless you are
>> willing
>> to lose the whole volume.
> 
> That sounds a bit strange to me.
Probably because I forgot to qualify the statement properly and should 
have worded it differently.  It should read:

Running dm-integrity on a device which doesn't support barriers without 
a journal is risky, because the journal can help mitigate the issues 
arising from the lack of barrier support.  So, make sure your storage 
devices support barriers properly first.
> 
> My understanding was that the idea of being able to disable the journal
> of dm-integrity was just to avoid any double work, if equivalent
> guarantees are already given by higher levels.
Except they aren't completely if the storage device doesn't support 
barriers properly.
> 
> If btrfs is by itself already safe (by using barriers), then I'd have
> expected that no transaction is committed unless it got through all
> lower layers... so either everything works well on the dm-integrity
> base (and thus no journal is needed)... or it fails there... but then
> btrfs would already be safe by its own means (barriers + CoW)?
BTRFS is _mostly_ safe.  The problem is that there are still devices out 
there that don't have proper barrier support.  Without barriers, the 
superblocks can hit the disk before the most recent transactions do, and 
in that case you're kind of screwed.  dm-integrity's journaling can help 
protect against this to a limited degree (it doesn't completely solve 
the issue, but it's better than nothing), but at the cost of higher 
overhead from duplicated work.
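
The reordering problem can be shown with a small enumeration in plain Python. This is a deliberately simplified model, not how btrfs or any real device behaves in detail: a commit is reduced to "write the new tree, flush, write the superblock", and a device that ignores flushes is assumed to be free to persist any subset of the outstanding writes before a crash. All names are hypothetical.

```python
from itertools import combinations

FLUSH = "FLUSH"


def crash_states(writes, honors_flush):
    """Return every set of writes that may be on the media after a crash."""
    # Split the write stream into batches separated by flushes. A device that
    # ignores flushes effectively sees one big batch.
    batches, current = [], []
    for w in writes:
        if w == FLUSH:
            if honors_flush:
                batches.append(current)
                current = []
        else:
            current.append(w)
    batches.append(current)

    states, done = [], {}
    for batch in batches:
        # Earlier batches are fully persisted; any subset of this one may be.
        for r in range(len(batch) + 1):
            for subset in combinations(batch, r):
                state = dict(done)
                state.update(subset)
                states.append(state)
        done.update(batch)
    return states


def is_mountable(disk):
    # The superblock must point at a tree that actually reached the media.
    return f"tree_{disk['super']}" in disk


initial = {"super": "gen1", "tree_gen1": "ok"}
commit = [("tree_gen2", "ok"), FLUSH, ("super", "gen2")]


def all_mountable(honors_flush):
    return all(
        is_mountable({**initial, **state})
        for state in crash_states(commit, honors_flush)
    )


with_barriers = all_mountable(True)
without_barriers = all_mountable(False)
print(with_barriers, without_barriers)
```

With flushes honored, every crash state leaves a superblock that points at a complete tree. Without them, one possible state has the generation 2 superblock on disk while the generation 2 tree is not.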


* Re: dm-integrity + mdadm + btrfs = no journal?
  2019-01-30 16:00     ` Austin S. Hemmelgarn
@ 2019-01-30 16:31       ` Christoph Anton Mitterer
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Anton Mitterer @ 2019-01-30 16:31 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Hans van Kranenburg, linux-btrfs

On Wed, 2019-01-30 at 11:00 -0500, Austin S. Hemmelgarn wrote:
> Running dm-integrity on a device which doesn't support barriers
> without 
> a journal is risky, because the journal can help mitigate the issues 
> arising from the lack of barrier support.

Does it? Isn't it then suffering from the same problems that any IO
suffers (e.g. also from btrfs itself, with no further block layer
beneath), when there are no barriers supported?


>   So, make sure your storage 
> devices support barriers properly first.
Is there any proper way to do so? And if so, shouldn't the kernel then
do this automatically?


> > If btrfs is by itself already safe (by using barriers), then I'd
> > have expected that no transaction is committed unless it got
> > through all lower layers... so either everything works well on the
> > dm-integrity base (and thus no journal is needed)... or it fails
> > there... but then btrfs would already be safe by its own means
> > (barriers + CoW)?
> BTRFS is _mostly_ safe.  The problem is that there are still devices
> out 
> there that don't have proper barrier support.  Without barriers, the 
> superblocks can hit the disk before the most recent transactions do,
> and 
> in that case you're kind of screwed.  dm-integrity's journaling can
> help 
> protect against this to a limited degree (it doesn't completely
> solve 
> the issue, but it's better than nothing), but at the cost of higher 
> overhead from duplicated work.

Okay. But then the general official btrfs advice should probably be:
if the device supports barriers (and if it doesn't, one is at risk
anyway), journaling at the lower dm-integrity level can be safely
disabled, right?

I would expect that it makes no difference whether nodatasum is
used or not... because even if one has a journal below which can be
recovered, btrfs would not be able to make any use of this, or would
it?


If all this is truly the case (i.e. double checked by senior
developers), then it should go into the manpages, perhaps even both
the btrfs and the cryptsetup ones.


Cheers,
Chris.



* Re: dm-integrity + mdadm + btrfs = no journal?
  2019-01-30 15:26   ` Christoph Anton Mitterer
  2019-01-30 16:00     ` Austin S. Hemmelgarn
@ 2019-01-30 16:38     ` Hans van Kranenburg
  2019-01-30 16:56       ` Hans van Kranenburg
  1 sibling, 1 reply; 9+ messages in thread
From: Hans van Kranenburg @ 2019-01-30 16:38 UTC (permalink / raw)
  To: Christoph Anton Mitterer, Austin S. Hemmelgarn, linux-btrfs

On 1/30/19 4:26 PM, Christoph Anton Mitterer wrote:
> On Wed, 2019-01-30 at 07:58 -0500, Austin S. Hemmelgarn wrote:
>> Running dm-integrity without a journal is roughly equivalent to
>> using 
>> the nobarrier mount option (the journal is used to provide the same 
>> guarantees that barriers do).  IOW, don't do this unless you are
>> willing 
>> to lose the whole volume.
> 
> That sounds a bit strange to me.
> 
> My understanding was that the idea of being able to disable the journal
> of dm-integrity was just to avoid any double work, if equivalent
> guarantees are already given by higher levels.
> 
> If btrfs is by itself already safe (by using barriers), then I'd have
> expected that no transaction is committed unless it got through all
> lower layers... so either everything works well on the dm-integrity
> base (and thus no journal is needed)... or it fails there... but then
> btrfs would already be safe by its own means (barriers + CoW)?

This. Exactly this.

The reason that this journal of dm-integrity has to be used is that
data and the checksum of that data get written in two different places.
The result of using it is that you'll always read back data with
matching checksums, either the previous data, or the new data.
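
A toy model of that guarantee, in plain Python. This is only a sketch: one sector, CRC32 as the tag, and a journal entry that is assumed to be committed atomically, none of which is taken from the real dm-integrity format. The names are made up. It enumerates every point at which power could be cut during a write and records what a later read returns.

```python
import zlib


class IntegritySector:
    """One sector of a toy integrity device: data and tag are stored separately."""

    def __init__(self, payload):
        self.data = payload
        self.tag = zlib.crc32(payload)
        self.journal = None  # committed (data, tag) pair awaiting write-back

    def write(self, payload, journaled, crash_before_step=None):
        tag = zlib.crc32(payload)
        steps = []
        if journaled:
            steps.append(("journal", (payload, tag)))
        steps.append(("data", payload))
        steps.append(("tag", tag))
        if journaled:
            steps.append(("journal", None))
        for i, (field, value) in enumerate(steps):
            if crash_before_step == i:
                return  # power cut: the remaining steps never reach the media
            setattr(self, field, value)

    def recover(self):
        # Replay a committed journal entry when the device is activated again.
        if self.journal is not None:
            self.data, self.tag = self.journal
            self.journal = None

    def read(self):
        if zlib.crc32(self.data) != self.tag:
            raise IOError("integrity tag mismatch")
        return self.data


def outcomes(journaled):
    """Everything a read can return after a crash at any point of one write."""
    n_steps = 4 if journaled else 2
    seen = set()
    for crash in range(n_steps + 1):  # crash == n_steps means no crash at all
        sector = IntegritySector(b"old")
        sector.write(b"new", journaled, crash_before_step=crash)
        sector.recover()
        try:
            seen.add(sector.read())
        except IOError:
            seen.add("EIO")
    return seen


with_journal = outcomes(journaled=True)
without_journal = outcomes(journaled=False)
print(with_journal, without_journal)
```

With the journal, a read returns either the old or the new data. Without it, there is one extra outcome: the data was written but the tag was not, so the sector reads back as an I/O error.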

https://arxiv.org/pdf/1807.00309.pdf
See Section 4.4 "Recovery on Write Failure".

"A device must provide atomic updating of both data and metadata.  A
situation in which one part is written to media while another part
failed must not occur."

Now, the great thing here is that btrfs does not overwrite disk data in
place. It writes out new data, metadata and then the superblock. So,
e.g. on power loss, I don't care about whatever happened to writes that
are not visible because the superblock was never written? Btrfs will not
read these disk sectors back, because it's unused space.
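
The same argument as a toy model in plain Python, with a made-up layout: sector 0 plays the superblock and lists the sectors the filesystem references, and TORN marks a sector whose data and integrity tag disagree after a power cut. The superblock write itself is assumed to be atomic here, which is a simplification.

```python
TORN = object()  # marks a sector whose data and integrity tag disagree


def read(disk, sector):
    if disk[sector] is TORN:
        raise IOError(f"integrity error in sector {sector}")
    return disk[sector]


def committed_view(disk):
    """Read everything the superblock (sector 0) currently references."""
    return [read(disk, s) for s in disk[0]]


# Sector 0 holds the superblock: the list of sectors that make up the filesystem.
disk = {0: [1, 2], 1: "meta v1", 2: "data v1", 3: None, 4: None}

# CoW update, interrupted: new blocks go to free sectors 3 and 4, and the power
# fails while sector 4 is in flight, before the superblock is rewritten.
cow = dict(disk)
cow[3] = "meta v2"
cow[4] = TORN
cow_view = committed_view(cow)  # still the complete old version

# In-place (nocow-style) update, interrupted in the same way.
nocow = dict(disk)
nocow[2] = TORN
try:
    nocow_view = committed_view(nocow)
except IOError as exc:
    nocow_view = str(exc)

print(cow_view)
print(nocow_view)
```

In the CoW case the torn sector sits in space the old superblock never references, so nothing reads it. In the in-place case the torn sector is part of the committed filesystem and the read fails.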

Also, it's not a write hole like in RAID56, because when "pulling the
plug" between writing out data and metadata, the checksums of older
existing data sectors are not corrupted, only new writes that were in
flight... I think... But the PDF still mentions (also in 4.4)
"Furthermore, metadata sectors are packed with tags for multiple
sectors; thus, a write failure must not cause an integrity validation
failure for other sectors". From the design, however, I cannot see how
this could happen.

I asked on dm-devel list a while ago about this, but the mailing list
post never got any reply.

Hans


* Re: dm-integrity + mdadm + btrfs = no journal?
  2019-01-30 16:38     ` Hans van Kranenburg
@ 2019-01-30 16:56       ` Hans van Kranenburg
  0 siblings, 0 replies; 9+ messages in thread
From: Hans van Kranenburg @ 2019-01-30 16:56 UTC (permalink / raw)
  To: Christoph Anton Mitterer, Austin S. Hemmelgarn, linux-btrfs

On 1/30/19 5:38 PM, Hans van Kranenburg wrote:
> On 1/30/19 4:26 PM, Christoph Anton Mitterer wrote:
>> On Wed, 2019-01-30 at 07:58 -0500, Austin S. Hemmelgarn wrote:
>>> Running dm-integrity without a journal is roughly equivalent to
>>> using 
>>> the nobarrier mount option (the journal is used to provide the same 
>>> guarantees that barriers do).  IOW, don't do this unless you are
>>> willing 
>>> to lose the whole volume.
>>
>> That sounds a bit strange to me.
>>
>> My understanding was that the idea of being able to disable the journal
>> of dm-integrity was just to avoid any double work, if equivalent
>> guarantees are already given by higher levels.
>>
>> If btrfs is by itself already safe (by using barriers), then I'd have
>> expected that no transaction is committed unless it got through all
>> lower layers... so either everything works well on the dm-integrity
>> base (and thus no journal is needed)... or it fails there... but then
>> btrfs would already be safe by its own means (barriers + CoW)?
> 
> This. Exactly this.
> 
> The reason that this journal of dm-integrity has to be used is that
> data and the checksum of that data get written in two different places.
> The result of using it is that you'll always read back data with
> matching checksums, either the previous data, or the new data.
> 
> https://arxiv.org/pdf/1807.00309.pdf
> See Section 4.4 "Recovery on Write Failure".
> 
> "A device must provide atomic updating of both data and metadata.  A
> situation in which one part is written to media while another part
> failed must not occur."
> 
> Now, the great thing here is that btrfs does not overwrite disk data in
> place. It writes out new data, metadata and then the superblock. So,
> e.g. on power loss, I don't care about whatever happened to writes that
> are not visible because the superblock was never written? Btrfs will not
> read these disk sectors back, because it's unused space.

So, to reiterate from the first post, this means that I cannot use nocow
or directio, because they go around the cow safety net.

Also, there is still a risk, which is of course writing the superblocks.
If all copies of the superblock on a single device are written, and all
of them lack the updated checksum, then I'll lose the fs, and will have
to either repair that manually, or restore from send/receive backups of
a few minutes ago.

> Also, it's not a write hole like in RAID56, because when "pulling the
> plug" between writing out data and metadata, the checksums of older
> existing data sectors are not corrupted, only new writes that were in
> flight... I think... But the PDF still mentions (also in 4.4)
> "Furthermore, metadata sectors are packed with tags for multiple
> sectors; thus, a write failure must not cause an integrity validation
> failure for other sectors". From the design, however, I cannot see how
> this could happen.
> 
> I asked on dm-devel list a while ago about this, but the mailing list
> post never got any reply.

Hans

