On Sun, Jan 30, 2022 at 06:43:13PM +0100, Zdenek Kabelac wrote:
> Dne 30. 01. 22 v 17:45 Demi Marie Obenour napsal(a):
> > On Sun, Jan 30, 2022 at 11:52:52AM +0100, Zdenek Kabelac wrote:
> > > Dne 30. 01. 22 v 1:32 Demi Marie Obenour napsal(a):
> > > > On Sat, Jan 29, 2022 at 10:32:52PM +0100, Zdenek Kabelac wrote:
> > > > > Dne 29. 01. 22 v 21:34 Demi Marie Obenour napsal(a):
> > > > > > How much slower are operations on an LVM2 thin pool compared to manually managing a dm-thin target via ioctls? I am mostly concerned about volume snapshot, creation, and destruction. Data integrity is very important, so taking shortcuts that risk data loss is out of the question. However, the application may have some additional information that LVM2 does not have. For instance, it may know that the volume that it is snapshotting is not in use, or that a certain volume it is creating will never be used after power-off.
> > > > >
> > > > > So brave developers may always write their own management tools for their constrained environment requirements that will be significantly faster in terms of how many thins you could create per minute (btw you will need to also consider dropping usage of udev on such system)
> > > >
> > > > What kind of constraints are you referring to? Is it possible and safe to have udev running, but told to ignore the thins in question?
> > >
> > > Lvm2 is oriented more towards managing set of different disks, where user is adding/removing/replacing them. So it's more about recoverability, good support for manual repair (ascii metadata), tracking history of changes, backward compatibility, support of conversion to different volume types (i.e. caching of thins, pvmove...) Support for no/udev & no/systemd, clusters and nearly every linux distro available... So there is a lot - and this all adds quite complexity.
> >
> > I am certain it does, and that makes a lot of sense. Thanks for the hard work! Those features are all useful for Qubes OS, too — just not in the VM startup/shutdown path.
> >
> > > So once you scratch all this - and you say you only care about single disc then you are able to use more efficient metadata formats which you could even keep permanently in memory during the lifetime - this all adds great performance.
> > >
> > > But it all depends how you could constrain your environment.
> > >
> > > It's worth to mention there is lvm2 support for 'external' 'thin volume' creators - so lvm2 only maintains 'thin-pool' data & metadata LV - but thin volume creation, activation, deactivation of thins is left to external tool. This has been used by docker for a while - later on they switched to overlayFs I believe..
> >
> > That indeed sounds like a good choice for Qubes OS. It would allow the data and metadata LVs to be any volume type that lvm2 supports, and managed using all of lvm2’s features. So one could still put the metadata on a RAID-10 volume while everything else is RAID-6, or set up a dm-cache volume to store the data (please correct me if I am wrong). Qubes OS has already moved to using a separate thin pool for virtual machines, as it prevents dom0 (privileged management VM) from being run out of disk space (by accident or malice). That means that the thin pool used for guests is managed only by Qubes OS, and so the standard lvm2 tools do not need to touch it.
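
To make sure I understand the external-creator split, the sketch below is roughly what I would expect such a tool to do, based on the messages documented in the kernel's thin-provisioning.rst. The pool path, device names, IDs and sizes are made up, and I am shelling out to dmsetup from Python purely for illustration (a real tool would presumably speak the ioctl interface directly):

    # Rough sketch only: an "external thin volume creator" that leaves the
    # thin-pool LV to lvm2 and drives the thin devices itself.  POOL, names,
    # device IDs and sizes are hypothetical.
    import subprocess

    POOL = "/dev/mapper/qubes_dom0-vm--pool-tpool"   # assumed thin-pool target
    SIZE = 20 * 1024 * 1024 * 2                      # 20 GiB in 512-byte sectors

    def dm(*args):
        subprocess.run(["dmsetup", *args], check=True)

    def create_thin(dev_id):
        dm("message", POOL, "0", f"create_thin {dev_id}")

    def snapshot(origin_name, snap_id, origin_id):
        # An active origin must be quiesced while the snapshot is created.
        dm("suspend", origin_name)
        try:
            dm("message", POOL, "0", f"create_snap {snap_id} {origin_id}")
        finally:
            dm("resume", origin_name)

    def activate(name, dev_id):
        dm("create", name, "--table", f"0 {SIZE} thin {POOL} {dev_id}")

    def deactivate(name):
        dm("remove", name)

    def delete(dev_id):
        dm("message", POOL, "0", f"delete {dev_id}")

Error handling, udev coordination and metadata bookkeeping are omitted, and I assume that is where most of the real complexity lives.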

> > Is this a setup that you would recommend, and would be comfortable using in production? As far as metadata is concerned, Qubes OS has its own XML file containing metadata about all qubes, which should suffice for this purpose. To prevent races during updates and ensure automatic crash recovery, is it sufficient to store metadata for both new and old transaction IDs, and pick the correct one based on the device-mapper status line? I have seen lvm2 get in an inconsistent state (transaction ID off by one) that required manual repair before, which is quite unnerving for a desktop OS.
>
> My biased advice would be to stay with lvm2. There is lot of work, many things are not well documented and getting everything running correctly will take a lot of effort (Docker in fact did not manage to do it well and was incapable to provide any recoverability)

What did Docker do wrong? Would it be possible for a future version of lvm2 to be able to automatically recover from off-by-one thin pool transaction IDs?
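
To spell out the scheme I have in mind: Qubes OS would keep its own record for both the old and the new transaction ID, and the thin-pool status line would be the tie-breaker after a crash. Roughly the following sketch, where the pool path is made up and I am assuming the transaction ID is the first field after the target name in the status output:

    # Sketch of the two-copy metadata idea, not production code.
    import subprocess

    POOL = "/dev/mapper/qubes_dom0-vm--pool-tpool"   # hypothetical pool device

    def pool_transaction_id(pool=POOL):
        # "dmsetup status <pool>" prints: <start> <length> thin-pool <trans_id> ...
        out = subprocess.run(["dmsetup", "status", pool],
                             capture_output=True, text=True, check=True).stdout
        return int(out.split()[3])

    def current_metadata(old_copy, new_copy):
        # Each copy records the transaction ID it was written against; the copy
        # matching the kernel's view is the one that was actually committed.
        tid = pool_transaction_id()
        if new_copy["transaction_id"] == tid:
            return new_copy   # the commit landed before the crash
        if old_copy["transaction_id"] == tid:
            return old_copy   # the crash happened before the commit
        raise RuntimeError(f"unexpected thin-pool transaction ID {tid}")

If lvm2 could do something equivalent for its own metadata, the off-by-one state I ran into would presumably become self-healing.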

> > One feature that would be nice is to be able to import an externally-provided mapping of thin pool device numbers to LV names, so that lvm2 could provide a (read-only, and not guaranteed fresh) view of system state for reporting purposes.
>
> Once you will have evidence it's the lvm2 causing major issue - you could consider whether it's worth to step into a separate project.

Agreed.

> > > > > It's worth to mention - the more bullet-proof you will want to make your project - the more closer to the extra processing made by lvm2 you will get.
> > > >
> > > > Why is this? How does lvm2 compare to stratis, for example?
> > >
> > > Stratis is yet another volume manager written in Rust combined with XFS for easier user experience. That's all I'd probably say about it...
> >
> > That’s fine. I guess my question is why making lvm2 bullet-proof needs so much overhead.
>
> It's difficult - if you would be distributing lvm2 with exact kernel version & udev & systemd with a single linux distro - it reduces huge set of troubles...

Qubes OS comes close to this in practice: systemd and udev versions are known and fixed, and Qubes OS ships its own kernels.

> > > > > However before you will step into these waters - you should probably evaluate whether thin-pool actually meet your needs if you have that high expectation for number of supported volumes - so you will not end up with hyper fast snapshot creation while the actual usage then is not meeting your needs...
> > > >
> > > > What needs are you thinking of specifically? Qubes OS needs block devices, so filesystem-backed storage would require the use of loop devices unless I use ZFS zvols. Do you have any specific recommendations?
> > >
> > > As long as you live in the world without crashes, buggy kernels, apps and failing hard drives everything looks very simple.
> >
> > Would you mind explaining further? LVM2 RAID and cache volumes should provide most of the benefits that Qubes OS desires, unless I am missing something.
>
> I'm not familiar with QubesOS - but in many cases in real-life world we can't push to our users latest&greatest - so we need to live with bugs and add workarounds...

Qubes OS is more than capable of shipping fixes for kernel bugs. Is that what you are referring to?

> > > Since you mentioned ZFS - you might want to focus on using 'ZFS-only' solution. Combining ZFS or Btrfs with lvm2 is always going to be a painful way as those filesystems have their own volume management.
> >
> > Absolutely! That said, I do wonder what your thoughts on using loop devices for VM storage are. I know they are slower than thin volumes, but they are also much easier to manage, since they are just ordinary disk files. Any filesystem with reflink can provide the needed copy-on-write support.
>
> Chain filesystem->block_layer->filesystem->block_layer is something you most likely do not want to use for any well performing solution...
> But it's ok for testing...

How much of this is due to the slow loop driver? How much of it could be mitigated if btrfs supported an equivalent of zvols?

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
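
P.S. To make the loop-device idea concrete, the flow I am picturing is roughly the sketch below: each volume is an ordinary file on a reflink-capable filesystem, a snapshot is a reflink copy, and a loop device is attached while the VM runs (paths and names are made up):

    # Rough sketch of file-backed volumes with reflink snapshots.
    import subprocess

    def snapshot(src_image, dst_image):
        # Shares extents until either file is written (needs XFS, btrfs, ...).
        subprocess.run(["cp", "--reflink=always", src_image, dst_image], check=True)

    def attach(image):
        # Attach the image to a free loop device; direct I/O avoids double caching.
        out = subprocess.run(
            ["losetup", "--find", "--show", "--direct-io=on", image],
            capture_output=True, text=True, check=True)
        return out.stdout.strip()   # e.g. "/dev/loop3"

    def detach(loop_device):
        subprocess.run(["losetup", "--detach", loop_device], check=True)

losetup's --direct-io at least avoids caching the same data twice, but I do not know how much of the remaining overhead comes from the loop driver itself.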