From: Demi Marie Obenour <demi@invisiblethingslab.com>
To: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Cc: LVM general discussion and development <linux-lvm@redhat.com>
Subject: Re: [linux-lvm] LVM performance vs direct dm-thin
Date: Tue, 1 Feb 2022 21:09:52 -0500
Message-ID: <Yfnn+BcjuukqjfVO@itl-email>
In-Reply-To: <849ab633-ec3d-a0a5-38bf-72b87bbba2c5@gmail.com>
On Sun, Jan 30, 2022 at 06:43:13PM +0100, Zdenek Kabelac wrote:
> On 30. 01. 22 at 17:45, Demi Marie Obenour wrote:
> > On Sun, Jan 30, 2022 at 11:52:52AM +0100, Zdenek Kabelac wrote:
> > > On 30. 01. 22 at 1:32, Demi Marie Obenour wrote:
> > > > On Sat, Jan 29, 2022 at 10:32:52PM +0100, Zdenek Kabelac wrote:
> > > > > On 29. 01. 22 at 21:34, Demi Marie Obenour wrote:
> > > > > > How much slower are operations on an LVM2 thin pool compared to manually
> > > > > > managing a dm-thin target via ioctls? I am mostly concerned about
> > > > > > volume snapshot, creation, and destruction. Data integrity is very
> > > > > > important, so taking shortcuts that risk data loss is out of the
> > > > > > question. However, the application may have some additional information
> > > > > > that LVM2 does not have. For instance, it may know that the volume that
> > > > > > it is snapshotting is not in use, or that a certain volume it is
> > > > > > creating will never be used after power-off.
> > > > > >
> > > >
> > > > > So brave developers may always write their own management tools for their
> > > > > constrained environment requirements that will be significantly faster in
> > > > > terms of how many thins you could create per minute (btw you will need to
> > > > > also consider dropping usage of udev on such system)
> > > >
> > > > What kind of constraints are you referring to? Is it possible and safe
> > > > to have udev running, but told to ignore the thins in question?
> > >
> > > Lvm2 is oriented more towards managing a set of different disks,
> > > where the user is adding/removing/replacing them. So it's more about
> > > recoverability, good support for manual repair (ascii metadata),
> > > tracking history of changes, backward compatibility, support
> > > of conversion to different volume types (e.g. caching of thins, pvmove...)
> > > Support for no/udev & no/systemd, clusters and nearly every linux distro
> > > available... So there is a lot - and this all adds quite some complexity.
> >
> > I am certain it does, and that makes a lot of sense. Thanks for the
> > hard work! Those features are all useful for Qubes OS, too — just not
> > in the VM startup/shutdown path.
> >
> > > So once you scratch all this - and you say you only care about a single disk
> > > then you are able to use more efficient metadata formats which you could
> > > even keep permanently in memory during the lifetime - this all adds great
> > > performance.
> > >
> > > But it all depends how you could constrain your environment.
> > >
> > > It's worth mentioning that there is lvm2 support for 'external' 'thin volume'
> > > creators - so lvm2 only maintains 'thin-pool' data & metadata LV - but thin
> > > volume creation, activation, deactivation of thins is left to external tool.
> > > This has been used by docker for a while - later on they switched to
> > > overlayFs I believe..
> >
> > That indeed sounds like a good choice for Qubes OS. It would allow the
> > data and metadata LVs to be any volume type that lvm2 supports, and
> > managed using all of lvm2’s features. So one could still put the
> > metadata on a RAID-10 volume while everything else is RAID-6, or set up
> > a dm-cache volume to store the data (please correct me if I am wrong).
> > Qubes OS has already moved to using a separate thin pool for virtual
> > machines, as it prevents dom0 (privileged management VM) from being run
> > out of disk space (by accident or malice). That means that the thin
> > pool used for guests is managed only by Qubes OS, and so the standard
> > lvm2 tools do not need to touch it.
> >
> > Is this a setup that you would recommend, and would be comfortable using
> > in production? As far as metadata is concerned, Qubes OS has its own
> > XML file containing metadata about all qubes, which should suffice for
> > this purpose. To prevent races during updates and ensure automatic
> > crash recovery, is it sufficient to store metadata for both new and old
> > transaction IDs, and pick the correct one based on the device-mapper
> > status line? I have seen lvm2 get in an inconsistent state (transaction
> > ID off by one) that required manual repair before, which is quite
> > unnerving for a desktop OS.
>
> My biased advice would be to stay with lvm2. There is a lot of work, many
> things are not well documented and getting everything running correctly will
> take a lot of effort (Docker in fact did not manage to do it well and was
> unable to provide any recoverability)

What did Docker do wrong? Would it be possible for a future version of
lvm2 to be able to automatically recover from off-by-one thin pool
transaction IDs?

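The recovery scheme asked about in the quoted text above (keep metadata for both the old and the new transaction ID, then pick one based on the device-mapper status line) could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the status text has the form printed by `dmsetup status` for a thin-pool target (`<start> <length> thin-pool <transaction id> ...`), and the function names and the shape of the `candidates` mapping are hypothetical, not part of lvm2 or Qubes OS. Whether this approach is sufficient for crash safety is exactly the open question; the sketch does not settle it.

```python
from typing import Any, Dict


def parse_thin_pool_transaction_id(status_line: str) -> int:
    """Extract the transaction ID from a thin-pool status line.

    Assumed layout (as printed by `dmsetup status <pool>`):
        <start> <length> thin-pool <transaction id> <used meta>/<total meta> ...
    A failed pool reports "Fail" in place of the ID, which raises ValueError.
    """
    fields = status_line.split()
    if len(fields) < 4 or fields[2] != "thin-pool":
        raise ValueError(f"not a thin-pool status line: {status_line!r}")
    return int(fields[3])


def pick_metadata(status_line: str, candidates: Dict[int, Any]) -> Any:
    """Return the stored metadata whose transaction ID matches the kernel's.

    `candidates` is a hypothetical mapping from transaction ID to the
    application's own metadata snapshot (old and new).
    """
    tid = parse_thin_pool_transaction_id(status_line)
    if tid not in candidates:
        # Neither copy matches: do not guess, require manual intervention.
        raise LookupError(f"no stored metadata for transaction ID {tid}")
    return candidates[tid]
```

If neither stored copy matches the ID the kernel reports, the sketch refuses to choose, which corresponds to the "manual repair" situation described above.
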
> > One feature that would be nice is to be able to import an
> > externally-provided mapping of thin pool device numbers to LV names, so
> > that lvm2 could provide a (read-only, and not guaranteed fresh) view of
> > system state for reporting purposes.
>
> Once you have evidence that it's lvm2 causing a major issue - you could
> consider whether it's worth stepping into a separate project.

Agreed.

> > > > > It's worth mentioning - the more bullet-proof you want to make your
> > > > > project - the closer to the extra processing done by lvm2 you will get.
> > > >
> > > > Why is this? How does lvm2 compare to stratis, for example?
> > >
> > > Stratis is yet another volume manager written in Rust combined with XFS for
> > > easier user experience. That's all I'd probably say about it...
> >
> > That’s fine. I guess my question is why making lvm2 bullet-proof needs
> > so much overhead.
>
> It's difficult - if you were distributing lvm2 with an exact kernel version
> & udev & systemd with a single linux distro - it would remove a huge set of
> troubles...

Qubes OS comes close to this in practice. systemd and udev versions are
known and fixed, and Qubes OS ships its own kernels.

> > > > > However before you will step into these waters - you should probably
> > > > > evaluate whether thin-pool actually meets your needs if you have that high
> > > > > expectation for number of supported volumes - so you will not end up with
> > > > > hyper fast snapshot creation while the actual usage then is not meeting your
> > > > > needs...
> > > >
> > > > What needs are you thinking of specifically? Qubes OS needs block
> > > > devices, so filesystem-backed storage would require the use of loop
> > > > devices unless I use ZFS zvols. Do you have any specific
> > > > recommendations?
> > >
> > > As long as you live in the world without crashes, buggy kernels, apps and
> > > failing hard drives everything looks very simple.
> >
> > Would you mind explaining further? LVM2 RAID and cache volumes should
> > provide most of the benefits that Qubes OS desires, unless I am missing
> > something.
>
> I'm not familiar with QubesOS - but in many cases in real-life world we
> can't push to our users latest&greatest - so we need to live with bugs and
> add workarounds...

Qubes OS is more than capable of shipping fixes for kernel bugs. Is
that what you are referring to?

> > > Since you mentioned ZFS - you might want to focus on using a 'ZFS-only' solution.
> > > Combining ZFS or Btrfs with lvm2 is always going to be a painful way as
> > > those filesystems have their own volume management.
> >
> > Absolutely! That said, I do wonder what your thoughts on using loop
> > devices for VM storage are. I know they are slower than thin volumes,
> > but they are also much easier to manage, since they are just ordinary
> > disk files. Any filesystem with reflink can provide the needed
> > copy-on-write support.
>
> Chain filesystem->block_layer->filesystem->block_layer is something you most
> likely do not want to use for any well performing solution...
> But it's ok for testing...

How much of this is due to the slow loop driver? How much of it could
be mitigated if btrfs supported an equivalent of zvols?

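For context on the reflink-based copy-on-write mentioned in the quoted text above, a minimal sketch of cloning a disk image file is shown here. It assumes Linux and uses the `FICLONE` ioctl (defined in `<linux/fs.h>` as `_IOW(0x94, 9, int)`); the helper name `clone_or_copy` is hypothetical. On filesystems without reflink support the ioctl fails and the sketch falls back to an ordinary full copy, which loses the copy-on-write benefit.

```python
import fcntl
import shutil

# FICLONE ioctl request number from <linux/fs.h>: _IOW(0x94, 9, int).
FICLONE = 0x40049409


def clone_or_copy(src_path: str, dst_path: str) -> bool:
    """Clone src to dst via reflink if possible, else do a full copy.

    Returns True if the data was reflinked (blocks shared copy-on-write),
    False if a regular copy was made instead.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # Ask the filesystem to share src's extents with dst.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            # Not supported here (e.g. ext4, tmpfs, or a cross-device clone).
            pass
    shutil.copyfile(src_path, dst_path)
    return False
```

The resulting file would still have to be exposed to a VM as a block device, for example through a loop device, which is the layering the quoted reply warns about.
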
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab