* dm-integrity + mdadm + btrfs = no journal?
From: Hans van Kranenburg @ 2019-01-29 23:15 UTC
To: linux-btrfs

Hi,

Thought experiment time...

I have an HP z820 workstation here (with ECC memory, yay!) and 4x250G 10k SAS disks (and some spare disks). It's donated hardware, and I'm going to use it to replace the current server in the office of a non-profit organization (so it's not work stuff this time).

The machine is going to run Debian/Xen and a few virtual machines (current one also does, but the hardware is now really starting to fall apart).

I have been thinking a bit how to (re)organize disk storage in this scenario.

1. Let's use btrfs everywhere. \:D/
2. For running Xen virtual machines, I prefer block devices on LVM. No image files, no btrfs-on-btrfs etc...
3. Oh, and there's also 1 MS Windows VM that will be in the mix.

Obviously I can't start using multi-device btrfs in each and every virtual machine (a big pile of horror when one disk dies or starts misbehaving).

So, what I was thinking of is:

* Use dm-integrity on partitions on the individual disks
* Use mdadm RAID10 on top (which is then able to repair bitrot)
* Use LVM on top
* Etc...

For all of the filesystems, I would be doing backups to a remote location outside of the building with send/receive.

The Windows VM will be an image file on a btrfs filesystem in the Xen dom0. It's idle most of the time, and I think cow+autodefrag can easily handle it. I'd like to be able to take snapshots of it which can be sent to a remote location.

Now, to finally throw in the big question: If I use btrfs everywhere, can I run dm-integrity without a journal?

As far as I can reason about.. I could. As long as there's no 'nocow' happening, the only thing that needs to happen correctly is superblock writes, right?
--
Hans van Kranenburg
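
The layering proposed in this first message (dm-integrity per partition, mdadm RAID10 on top, LVM on top of that) can be sketched as a command plan. The Python below only builds and prints the commands; it does not run anything. The partition names, the `/dev/md0` array and the `vg0` volume group are hypothetical, and the `--integrity-no-journal` option of `integritysetup open` is given as I understand the man page, so check it against the installed version before relying on it.

```python
def plan_stack(partitions, journal=True, md_dev="/dev/md0", vg="vg0"):
    """Return the commands (as argv lists) for the proposed storage stack."""
    cmds = []
    mapped = []
    for part in partitions:
        name = "int-" + part.rsplit("/", 1)[-1]  # e.g. /dev/sda2 -> int-sda2
        cmds.append(["integritysetup", "format", part])
        open_cmd = ["integritysetup", "open", part, name]
        if not journal:
            # The journal is an open-time setting (assumption: flag name
            # as documented in integritysetup(8)).
            open_cmd.append("--integrity-no-journal")
        cmds.append(open_cmd)
        mapped.append("/dev/mapper/" + name)
    # RAID10 across the integrity-protected devices, then LVM on the array.
    cmds.append(["mdadm", "--create", md_dev, "--level=10",
                 f"--raid-devices={len(mapped)}", *mapped])
    cmds.append(["pvcreate", md_dev])
    cmds.append(["vgcreate", vg, md_dev])
    return cmds


if __name__ == "__main__":
    # Hypothetical partitions on the four SAS disks.
    parts = ["/dev/sda2", "/dev/sdb2", "/dev/sdc2", "/dev/sdd2"]
    for argv in plan_stack(parts, journal=False):
        print(" ".join(argv))
```

Because each member device reports a read error on a checksum mismatch instead of returning bad data, the RAID layer can rebuild the affected sector from the mirror copy, which is the "able to repair bitrot" property mentioned in the list.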
* Re: dm-integrity + mdadm + btrfs = no journal?
From: Chris Murphy @ 2019-01-30 1:02 UTC
To: Hans van Kranenburg; +Cc: linux-btrfs

On Tue, Jan 29, 2019 at 4:15 PM Hans van Kranenburg <Hans.van.Kranenburg@mendix.com> wrote:
>
> Hi,
>
> Thought experiment time...
>
> I have an HP z820 workstation here (with ECC memory, yay!) and 4x250G
> 10k SAS disks (and some spare disks). It's donated hardware, and I'm
> going to use it to replace the current server in the office of a
> non-profit organization (so it's not work stuff this time).
>
> The machine is going to run Debian/Xen and a few virtual machines
> (current one also does, but the hardware is now really starting to fall
> apart).
>
> I have been thinking a bit how to (re)organize disk storage in this
> scenario.
>
> 1. Let's use btrfs everywhere. \:D/
> 2. For running Xen virtual machines, I prefer block devices on LVM. No
> image files, no btrfs-on-btrfs etc...
> 3. Oh, and there's also 1 MS Windows VM that will be in the mix.
>
> Obviously I can't start using multi-device btrfs in each and every
> virtual machine (a big pile of horror when one disk dies or starts
> misbehaving).
>
> So, what I was thinking of is:
>
> * Use dm-integrity on partitions on the individual disks
> * Use mdadm RAID10 on top (which is then able to repair bitrot)
> * Use LVM on top
> * Etc...
>
> For all of the filesystems, I would be doing backups to a remote
> location outside of the building with send/receive.
>
> The Windows VM will be an image file on a btrfs filesystem in the Xen
> dom0. It's idle most of the time, and I think cow+autodefrag can easily
> handle it. I'd like to be able to take snapshots of it which can be sent
> to a remote location.

I'd consider thinp LV's for VM's.
They are way more efficient for snapshots than thickp (conventional) LVM snapshots. There is no command to compute and send/receive only the LVM extents that have changed, though. And this includes NTFS. In effect, you can shrink any LV without literally shrinking it, you just need to execute fstrim on the mounted volume (you can use the discard mount option from inside each VM; or enable fstrim.timer), and this will cause unused LVM logical extents to be returned to the thin pool, which can then be used by any other LV that draws from that pool.

It's been a couple of years since I tested NTFS in a Raw file on Btrfs but at that time it was just pathological and I gave up. It was so slow. Btrfs on Raw image on Btrfs was way faster.

You could also consider XFS on LVM or plain partition, with a qcow2 file as backing. Snapshots are supported by creating a new image that points to another as a backing file. And you can easily back up these snapshots as they are discrete files.

> Now, to finally throw in the big question: If I use btrfs everywhere,
> can I run dm-integrity without a journal?

The documentation says if you run without a journal, dm-integrity is no longer crash safe, i.e. its updates are no longer atomic. That to me is the whole point of dm-integrity so I wouldn't do it even if I'm using Btrfs on top.

>
> As far as I can reason about.. I could. As long as there's no 'nocow'
> happening, the only thing that needs to happen correctly is superblock
> writes, right?

Metadata is always cow. There is a nodatacow mount option but that doesn't affect metadata. But in either case the whole point of having dm-integrity in place is to have a Linux block device layer that tells you if there has been some kind of storage stack corruption; to make silent corruption visible. And that's not as effective if it's possible to get such corruption in the course of a crash or power failure of some kind.

It might be useful to ask on the linux-integrity list.
http://vger.kernel.org/vger-lists.html#linux-integrity

--
Chris Murphy
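
The point about fstrim returning extents to the thin pool can be illustrated with a toy model. This is a sketch of the accounting only, not of LVM's actual data structures; the `ThinPool` class, its method names and the extent counts are all made up for illustration.

```python
class ThinPool:
    """Toy thin pool: extents are only consumed when an LV writes to them."""

    def __init__(self, total_extents):
        self.total_extents = total_extents
        self.mapped = set()  # (lv_name, logical_extent) pairs backed by the pool

    def free_extents(self):
        return self.total_extents - len(self.mapped)

    def write(self, lv, extent):
        """First write to a logical extent allocates one pool extent."""
        if (lv, extent) in self.mapped:
            return
        if self.free_extents() == 0:
            raise RuntimeError("thin pool is full")
        self.mapped.add((lv, extent))

    def discard(self, lv, extents):
        """What fstrim/discard achieves: unused extents go back to the pool."""
        for extent in extents:
            self.mapped.discard((lv, extent))


if __name__ == "__main__":
    pool = ThinPool(total_extents=10)
    for e in range(6):
        pool.write("vm1", e)
    print("free after vm1 writes:", pool.free_extents())
    pool.discard("vm1", range(2, 6))  # files deleted in the guest, then fstrim
    print("free after fstrim in vm1:", pool.free_extents())
```

In this model the LV never changes its nominal size; only the number of pool extents backing it shrinks, which is the "shrink without literally shrinking" effect described above.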
* Re: dm-integrity + mdadm + btrfs = no journal?
From: Roman Mamedov @ 2019-01-30 8:42 UTC
To: Hans van Kranenburg; +Cc: linux-btrfs

On Tue, 29 Jan 2019 23:15:18 +0000 Hans van Kranenburg <Hans.van.Kranenburg@mendix.com> wrote:

> So, what I was thinking of is:
>
> * Use dm-integrity on partitions on the individual disks
> * Use mdadm RAID10 on top (which is then able to repair bitrot)
> * Use LVM on top
> * Etc...

You never explicitly say what the whole idea is, or what you are protecting against. By mentions of bitrot and of dm-integrity, you seem to think that when hardware is "starting to fall apart" the disks will eventually start returning wrong/corrupt data. Thing is, they do not. What you will get on disks going bad is uncorrectable read errors (UNC), not silent corruption. The latter is still possible, but more likely to be caused by SATA controller issues (or its driver/firmware), not disks. And those are hardly related to whether it's "new" or "old".

--
With respect,
Roman
* Re: dm-integrity + mdadm + btrfs = no journal?
From: Austin S. Hemmelgarn @ 2019-01-30 12:58 UTC
To: Hans van Kranenburg, linux-btrfs

On 2019-01-29 18:15, Hans van Kranenburg wrote:
> Hi,
>
> Thought experiment time...
>
> I have an HP z820 workstation here (with ECC memory, yay!) and 4x250G
> 10k SAS disks (and some spare disks). It's donated hardware, and I'm
> going to use it to replace the current server in the office of a
> non-profit organization (so it's not work stuff this time).
>
> The machine is going to run Debian/Xen and a few virtual machines
> (current one also does, but the hardware is now really starting to fall
> apart).
>
> I have been thinking a bit how to (re)organize disk storage in this
> scenario.
>
> 1. Let's use btrfs everywhere. \:D/
> 2. For running Xen virtual machines, I prefer block devices on LVM. No
> image files, no btrfs-on-btrfs etc...
> 3. Oh, and there's also 1 MS Windows VM that will be in the mix.
>
> Obviously I can't start using multi-device btrfs in each and every
> virtual machine (a big pile of horror when one disk dies or starts
> misbehaving).
>
> So, what I was thinking of is:
>
> * Use dm-integrity on partitions on the individual disks
> * Use mdadm RAID10 on top (which is then able to repair bitrot)
> * Use LVM on top
> * Etc...
>
> For all of the filesystems, I would be doing backups to a remote
> location outside of the building with send/receive.
>
> The Windows VM will be an image file on a btrfs filesystem in the Xen
> dom0. It's idle most of the time, and I think cow+autodefrag can easily
> handle it. I'd like to be able to take snapshots of it which can be sent
> to a remote location.

I would suggest against this.
NTFS is a pathologically bad case even when using it from inside Linux and leaving it almost completely idle. When used from Windows, it has horrible performance and trashes performance of _all_ other VM images on the same disk.

Also, just in general, I've only seen at best mediocre results from using BTRFS for VM image storage when using Xen. I'm not sure exactly why, but I think it has something to do with how the Xen block backend accesses the filesystem.

>
> Now, to finally throw in the big question: If I use btrfs everywhere,
> can I run dm-integrity without a journal?
>
> As far as I can reason about.. I could. As long as there's no 'nocow'
> happening, the only thing that needs to happen correctly is superblock
> writes, right?

Running dm-integrity without a journal is roughly equivalent to using the nobarrier mount option (the journal is used to provide the same guarantees that barriers do). IOW, don't do this unless you are willing to lose the whole volume.
* Re: dm-integrity + mdadm + btrfs = no journal?
From: Christoph Anton Mitterer @ 2019-01-30 15:26 UTC
To: Austin S. Hemmelgarn, Hans van Kranenburg, linux-btrfs

On Wed, 2019-01-30 at 07:58 -0500, Austin S. Hemmelgarn wrote:
> Running dm-integrity without a journal is roughly equivalent to using
> the nobarrier mount option (the journal is used to provide the same
> guarantees that barriers do). IOW, don't do this unless you are willing
> to lose the whole volume.

That sounds a bit strange to me.

My understanding was that the idea of being able to disable the journal of dm-integrity was just to avoid any double work, if equivalent guarantees are already given by higher levels.

If btrfs is by itself already safe (by using barriers), then I'd have expected that no transaction is committed, unless it got through all lower layers... so either everything works well on the dm-integrity base (and thus no journal is needed)... or it fails there... but then btrfs would already be safe by its own means (barriers + CoW)?

Cheers,
Chris.
* Re: dm-integrity + mdadm + btrfs = no journal?
From: Austin S. Hemmelgarn @ 2019-01-30 16:00 UTC
To: Christoph Anton Mitterer, Hans van Kranenburg, linux-btrfs

On 2019-01-30 10:26, Christoph Anton Mitterer wrote:
> On Wed, 2019-01-30 at 07:58 -0500, Austin S. Hemmelgarn wrote:
>> Running dm-integrity without a journal is roughly equivalent to using
>> the nobarrier mount option (the journal is used to provide the same
>> guarantees that barriers do). IOW, don't do this unless you are willing
>> to lose the whole volume.
>
> That sounds a bit strange to me.

Probably because I forgot to qualify the statement properly and should have worded it differently. It should read:

Running dm-integrity on a device which doesn't support barriers without a journal is risky, because the journal can help mitigate the issues arising from the lack of barrier support. So, make sure your storage devices support barriers properly first.

>
> My understanding was that the idea of being able to disable the journal
> of dm-integrity was just to avoid any double work, if equivalent
> guarantees are already given by higher levels.

Except they aren't completely if the storage device doesn't support barriers properly.

>
> If btrfs is by itself already safe (by using barriers), then I'd have
> expected that no transaction is committed, unless it got through all
> lower layers... so either everything works well on the dm-integrity
> base (and thus no journal is needed)... or it fails there... but then
> btrfs would already be safe by its own means (barriers + CoW)?

BTRFS is _mostly_ safe. The problem is that there are still devices out there that don't have proper barrier support.
Without barriers, the superblocks can hit the disk before the most recent transactions do, and in that case you're kind of screwed. dm-integrity's journaling can help protect against this to a limited degree (it doesn't completely solve the issue, but it's better than nothing), but at the cost of higher overhead from duplicated work.
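
The ordering problem described in this message can be made concrete with a small simulation. It is a toy model, not btrfs code: a "disk" holds tree blocks plus one superblock pointer, a commit writes the new tree blocks and then the superblock, and power is lost after some number of writes have become durable. The function names, block addresses and the single-pointer superblock are simplifications chosen for illustration.

```python
def crash_during_commit(disk, new_blocks, new_root, barriers, writes_done):
    """Apply the first `writes_done` writes of a commit, then lose power.

    disk: {"blocks": {addr: data}, "super": root_addr}
    barriers=True:  tree blocks are durable before the superblock is written.
    barriers=False: worst case, the device persists the superblock first.
    """
    tree_writes = [("block", addr, data) for addr, data in new_blocks.items()]
    super_write = [("super", new_root, None)]
    order = tree_writes + super_write if barriers else super_write + tree_writes

    after = {"blocks": dict(disk["blocks"]), "super": disk["super"]}
    for kind, addr, data in order[:writes_done]:
        if kind == "block":
            after["blocks"][addr] = data  # CoW: new address, old blocks untouched
        else:
            after["super"] = addr
    return after


def mountable(disk):
    """Usable only if the superblock points at a root that was really written."""
    return disk["super"] in disk["blocks"]


if __name__ == "__main__":
    old = {"blocks": {100: "root v1"}, "super": 100}
    new_blocks = {200: "leaf v2", 201: "root v2"}
    for barriers in (True, False):
        for n in range(4):
            state = crash_during_commit(old, new_blocks, 201, barriers, n)
            print(f"barriers={barriers} writes_done={n} mountable={mountable(state)}")
```

With barriers honoured, every crash point leaves the superblock pointing at either the old or the new root. With the superblock allowed to land first, a crash before the tree blocks are durable leaves it pointing at a root that is not on disk.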
* Re: dm-integrity + mdadm + btrfs = no journal?
From: Christoph Anton Mitterer @ 2019-01-30 16:31 UTC
To: Austin S. Hemmelgarn, Hans van Kranenburg, linux-btrfs

On Wed, 2019-01-30 at 11:00 -0500, Austin S. Hemmelgarn wrote:
> Running dm-integrity on a device which doesn't support barriers without
> a journal is risky, because the journal can help mitigate the issues
> arising from the lack of barrier support.

Does it? Isn't it then suffering from the same problems that any IO suffers (e.g. also from btrfs itself, with no further block layer beneath), when there are no barriers supported?

> So, make sure your storage
> devices support barriers properly first.

Is there any proper way to do so? And if so... shouldn't the kernel then do this automatically?

> > If btrfs is by itself already safe (by using barriers), then I'd have
> > expected that no transaction is committed, unless it got through all
> > lower layers... so either everything works well on the dm-integrity
> > base (and thus no journal is needed)... or it fails there... but then
> > btrfs would already be safe by its own means (barriers + CoW)?
> BTRFS is _mostly_ safe. The problem is that there are still devices out
> there that don't have proper barrier support. Without barriers, the
> superblocks can hit the disk before the most recent transactions do, and
> in that case you're kind of screwed. dm-integrity's journaling can help
> protect against this to a limited degree (it doesn't completely solve
> the issue, but it's better than nothing), but at the cost of higher
> overhead from duplicated work.

Okay. But then the general official btrfs advice should probably be: [If the device supports barriers - and if not one is anyway at risk]... journaling at the lower dm-integrity level can be safely disabled, right?
I would expect that there's no difference as to whether nodatasum is used or not... cause even if one has a journal below which can be recovered, btrfs would not be able to make any use of this, or would it?

If all this is truly the case (i.e. double checked by senior developers), then it should perhaps go into both the btrfs and cryptsetup manpages.

Cheers,
Chris.
* Re: dm-integrity + mdadm + btrfs = no journal?
From: Hans van Kranenburg @ 2019-01-30 16:38 UTC
To: Christoph Anton Mitterer, Austin S. Hemmelgarn, linux-btrfs

On 1/30/19 4:26 PM, Christoph Anton Mitterer wrote:
> On Wed, 2019-01-30 at 07:58 -0500, Austin S. Hemmelgarn wrote:
>> Running dm-integrity without a journal is roughly equivalent to using
>> the nobarrier mount option (the journal is used to provide the same
>> guarantees that barriers do). IOW, don't do this unless you are willing
>> to lose the whole volume.
>
> That sounds a bit strange to me.
>
> My understanding was that the idea of being able to disable the journal
> of dm-integrity was just to avoid any double work, if equivalent
> guarantees are already given by higher levels.
>
> If btrfs is by itself already safe (by using barriers), then I'd have
> expected that no transaction is committed, unless it got through all
> lower layers... so either everything works well on the dm-integrity
> base (and thus no journal is needed)... or it fails there... but then
> btrfs would already be safe by its own means (barriers + CoW)?

This. Exactly this.

The reason that this journal of dm-integrity has to be used is that data and the checksum of that data get written in two different places. The result of using it is that you'll always read back data with matching checksums, either the previous data, or the new data.

https://arxiv.org/pdf/1807.00309.pdf
See Section 4.4 "Recovery on Write Failure".

"A device must provide atomic updating of both data and metadata. A situation in which one part is written to media while another part failed must not occur."

Now, the great thing here is that btrfs does not overwrite disk data in place.
It writes out new data, metadata and then the superblock. So, e.g. on power loss, I don't care about whatever happened to writes that are not visible because the superblock was never written? Btrfs will not read these disk sectors back, because it's unused space.

Also, it's not a write hole like in RAID56, because when "pulling the plug" between writing out data and metadata, the checksums of older existing data sectors are not corrupted, only new writes that were in flight... I think... But the pdf is still mentioning (also in 4.4) "Furthermore, metadata sectors are packed with tags for multiple sectors; thus, a write failure must not cause an integrity validation failure for other sectors". From the design, I can however not see how this could happen.

I asked on dm-devel list a while ago about this, but the mailing list post never got any reply.

Hans
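
The "data and checksum live in two different places" argument can be sketched as a toy model of a single sector. This is not how dm-integrity is implemented: the `IntegrityDevice` class is hypothetical, `zlib.crc32` merely stands in for the real integrity tag, and a write is reduced to a list of steps of which only the first `crash_after` become durable.

```python
import zlib


class IntegrityDevice:
    """One sector whose data and integrity tag are stored separately."""

    def __init__(self, data=b"old"):
        self.data = data
        self.tag = zlib.crc32(data)
        self.journal = None  # (data, tag) entry awaiting write-back, if any

    def write(self, data, journal, crash_after=None):
        """Write `data`; only the first `crash_after` steps reach the media."""
        tag = zlib.crc32(data)
        steps = [("data", data), ("tag", tag)]
        if journal:
            # Log data+tag together first, clear the entry once both landed.
            steps = [("journal", (data, tag))] + steps + [("journal", None)]
        for attr, value in steps[:crash_after]:
            setattr(self, attr, value)

    def recover(self):
        """On activation, replay a journal entry that survived the crash."""
        if self.journal is not None:
            self.data, self.tag = self.journal
            self.journal = None

    def read(self):
        if zlib.crc32(self.data) != self.tag:
            raise IOError("integrity tag mismatch")
        return self.data


if __name__ == "__main__":
    dev = IntegrityDevice(b"old")
    dev.write(b"new", journal=False, crash_after=1)  # data landed, tag did not
    try:
        dev.read()
    except IOError as err:
        print("no journal:", err)

    dev = IntegrityDevice(b"old")
    dev.write(b"new", journal=True, crash_after=2)
    dev.recover()
    print("journal:", dev.read())
```

In this model the journaled device always reads back either the old or the new contents, while the unjournaled one can end up with a sector that fails verification. The argument in the message above is that, with CoW, such a sector belongs to space that no committed btrfs tree references yet; the model does not cover the quoted concern about tag sectors shared between several data sectors.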
* Re: dm-integrity + mdadm + btrfs = no journal?
From: Hans van Kranenburg @ 2019-01-30 16:56 UTC
To: Christoph Anton Mitterer, Austin S. Hemmelgarn, linux-btrfs

On 1/30/19 5:38 PM, Hans van Kranenburg wrote:
> On 1/30/19 4:26 PM, Christoph Anton Mitterer wrote:
>> On Wed, 2019-01-30 at 07:58 -0500, Austin S. Hemmelgarn wrote:
>>> Running dm-integrity without a journal is roughly equivalent to using
>>> the nobarrier mount option (the journal is used to provide the same
>>> guarantees that barriers do). IOW, don't do this unless you are willing
>>> to lose the whole volume.
>>
>> That sounds a bit strange to me.
>>
>> My understanding was that the idea of being able to disable the journal
>> of dm-integrity was just to avoid any double work, if equivalent
>> guarantees are already given by higher levels.
>>
>> If btrfs is by itself already safe (by using barriers), then I'd have
>> expected that no transaction is committed, unless it got through all
>> lower layers... so either everything works well on the dm-integrity
>> base (and thus no journal is needed)... or it fails there... but then
>> btrfs would already be safe by its own means (barriers + CoW)?
>
> This. Exactly this.
>
> The reason that this journal of dm-integrity has to be used is because
> data and the checksum of that data gets written in two different places.
> The result of using it is that you'll always read back data with
> matching checksums, either the previous data, or the new data.
>
> https://arxiv.org/pdf/1807.00309.pdf
> See Section 4.4 "Recovery on Write Failure".
>
> "A device must provide atomic updating of both data and metadata. A
> situation in which one part is written to media while another part
> failed must not occur."
>
> Now, the great thing here is that btrfs does not overwrite disk data in place.
> It writes out new data, metadata and then the superblock. So,
> e.g. on power loss, I don't care about whatever happened to writes that
> are not visible because the superblock was never written? Btrfs will not
> read these disk sectors back, because it's unused space.

So, to reiterate from the first post, this means that I cannot use nocow or directio, because they go around the cow safety net.

Also, there is still a risk, which is of course writing the superblocks. If all copies of the superblock on a single device are written, and all of them lack the updated checksum, then I'll lose the fs, and will have to either repair that manually, or restore from send/receive backups of a few minutes ago.

> Also, it's not a write hole like in RAID56, because when "pulling the
> plug" between writing out data and metadata, the checksums of older
> existing data sectors are not corrupted, only new writes that were in
> flight... I think... But the pdf is still mentioning (also in 4.4)
> "Furthermore, metadata sectors are packed with tags for multiple
> sectors; thus, a write failure must not cause an integrity validation
> failure for other sectors". From the design, I can however not see how
> this could happen.
>
> I asked on dm-devel list a while ago about this, but the mailing list
> post never got any reply.

Hans
end of thread

Thread overview: 9+ messages
2019-01-29 23:15 dm-integrity + mdadm + btrfs = no journal? Hans van Kranenburg
2019-01-30  1:02 ` Chris Murphy
2019-01-30  8:42 ` Roman Mamedov
2019-01-30 12:58 ` Austin S. Hemmelgarn
2019-01-30 15:26   ` Christoph Anton Mitterer
2019-01-30 16:00     ` Austin S. Hemmelgarn
2019-01-30 16:31       ` Christoph Anton Mitterer
2019-01-30 16:38     ` Hans van Kranenburg
2019-01-30 16:56       ` Hans van Kranenburg