linux-lvm.redhat.com archive mirror
* add volatile flag to PV/LVs (for cache) to avoid degraded state on reboot
@ 2024-01-12 18:19 lists.linux.dev
  2024-01-17 11:08 ` Zdenek Kabelac
  0 siblings, 1 reply; 7+ messages in thread
From: lists.linux.dev @ 2024-01-12 18:19 UTC (permalink / raw)
  To: linux-lvm

Hi,

at first, a happy new year to everyone.

I'm currently considering using dm-cache with a ramdisk/volatile PV for a small project and noticed some usability issues that make it less appealing.
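
For reference, this is roughly the kind of setup I have in mind (device, VG/LV names and sizes below are just examples, not my actual configuration):

  # create a 1 GiB ramdisk (rd_size is in KiB)
  modprobe brd rd_nr=1 rd_size=1048576
  # turn it into a PV and add it to the existing VG
  pvcreate --pvmetadatacopies 0 /dev/ram0
  vgextend vg0 /dev/ram0
  # build a cache pool on the ramdisk and attach it write-through
  lvcreate --type cache-pool -L 900m -n ramcache vg0 /dev/ram0
  lvconvert --type cache --cachemode writethrough --cachepool vg0/ramcache vg0/data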


Currently the problems are:
1. Adding a cache to a VG causes the entire VG to depend on the cache. If one of the cache drives fails or is missing, the VG cannot be activated, and even worse, if it is the VG containing the root filesystem the entire system fails to boot - even though we may already know that there is no data loss, only degraded access times.
2. It requires manual scripting to activate the VG and handle potentially missing/failing cache PVs.
3. LVM has no way to clearly indicate that a physical volume is volatile and that data loss on it is expected - maybe even including the PV header itself. Alternatively, a way to indicate "if something is wrong with the cache, just forget about it (if possible)".
4. Just recreating the PV with 'pvcreate --zero --pvmetadatacopies 0 --norestorefile --uuid' appears to be enough to get a write-through cache, and thereby the associated volume, working again (see the example after this list). So it doesn't look like LVM cares about the cache data being lost, only about the PV itself. Refusing to activate the VG therefore appears a bit too conservative, and the error handling here could probably be improved (see above).
5. Also, as there is currently no place within the LVM metadata to label a PV/VG/LV as "volatile", it is not clear either to LVM or to admins looking at the output of tools like lvdisplay that a specific LV is volatile. There are therefore also no safeguards or warnings against actions that would cause data loss (like adding a ramdisk to a raid0, or adding a write-back instead of a write-through cache).
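
To illustrate point 4, this is roughly the manual recovery I do after a reboot; the UUID is a placeholder and the device/VG names are only examples:

  # the ramdisk is empty again after reboot; recreate the PV with the UUID
  # recorded in the VG metadata so that LVM finds its "missing" PV again
  pvcreate --zero y --pvmetadatacopies 0 --norestorefile \
           --uuid <uuid-from-the-VG-metadata> /dev/ram0
  vgchange -ay vg0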


Therefore I'd like to ask if it would be possible to make two small improvements:
1. Add a "volatile" flag to PVs, LVs, and VGs to clearly indicate that they are non-persistent and that data loss on them is expected.
2. And one of:
 a. Change the error handling and add automatic recovery from missing PVs when the LV or VG has the volatile flag, e.g. by automatically `--uncache`-ing the volume and activating it without the cache whose PV is missing. This is even more important for boot volumes, where such a configuration would otherwise prevent the system from booting at all.
 b. Alternatively, add native support for ramdisks. This would mainly require extending the VG metadata with an 'is-RAMdisk' flag that causes the lookup of the PV to be skipped and a new ramdisk to be allocated while the VG is being activated (its size is known from the VG metadata, as we know how much is allocated/used). This could also help with unit tests and CI/CD usage (where currently the PV is manually created with brd before activating/creating the VG), including LVM's own test/lib/aux.sh, test/shell/devicesfile-misc.sh, test/shell/devicesfile-refresh.sh and test/shell/devicesfile-serial.sh.
 c. Same as 2a, but instead of automatically uncaching the volume, add a flag to the VG metadata that allows LVM to use the hints file to find the PV and automatically re-initialize it regardless of its header. Maybe combined with an additional configuration option to require the block device to be zeroed (i.e. only the first 4 sectors, to avoid reading the entire device) as a safeguard against the accidental data loss that looking for the correct PV header normally protects against.
 d. Same as 2b, but limited to caches only. Considering how caching is currently implemented, restricting ramdisks to caches may cause unnecessary additional work and be less useful than adding them as a new kind of PV. It also wouldn't help the additional use case of unit tests and CI/CD pipelines, and the general variant would additionally simplify "playing with" and learning about LVM.
 e. Add an option to have lvconvert enable caching but WITHOUT saving it within the VG's metadata, causing LVM to forget about the cache, i.e. the next time the system boots LVM would activate the VG normally, without the cache. For write-through caches this should always be safe, and for write-back it only causes data loss when the system crashes without flushing the cache.

My personal favourite is 2b, followed by 2e.
2b basically realizes my entire use case natively within LVM, and 2e at least avoids the need to automate the LVM recovery just to be able to reboot the system, allowing me to write a systemd service that adds the cache at runtime.
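
Assuming something like 2e existed, such a systemd service would essentially be a oneshot unit running a script along these lines - a rough sketch only, with a hypothetical path and example names:

  #!/bin/sh
  # /usr/local/sbin/add-ram-cache (hypothetical) - recreate the ramdisk and
  # re-attach the non-persistent write-through cache on every boot
  set -e
  modprobe brd rd_nr=1 rd_size=1048576
  pvcreate --pvmetadatacopies 0 /dev/ram0
  vgextend vg0 /dev/ram0
  lvcreate --type cache-pool -L 900m -n ramcache vg0 /dev/ram0
  lvconvert -y --type cache --cachemode writethrough --cachepool vg0/ramcache vg0/data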

Best regards


* Re: add volatile flag to PV/LVs (for cache) to avoid degraded state on reboot
  2024-01-12 18:19 add volatile flag to PV/LVs (for cache) to avoid degraded state on reboot lists.linux.dev
@ 2024-01-17 11:08 ` Zdenek Kabelac
  2024-01-17 22:00   ` Gionatan Danti
  0 siblings, 1 reply; 7+ messages in thread
From: Zdenek Kabelac @ 2024-01-17 11:08 UTC (permalink / raw)
  To: lists.linux.dev, linux-lvm

On 12. 01. 24 at 19:19, lists.linux.dev@frank.fyi wrote:
> Hi,
> 
> at first, a happy new year to everyone.
> 
> I'm currently considering using dm-cache with a ramdisk/volatile PV for a small project and noticed some usability issues that make it less appealing.
> 
> 
> Currently the problems are:
> 1. Adding a cache to a VG causes the entire VG to depend on the cache. If one of the cache drives fails or is missing, the VG cannot be activated, and even worse, if it is the VG containing the root filesystem the entire system fails to boot - even though we may already know that there is no data loss, only degraded access times.
> 2. It requires manual scripting to activate the VG and handle potentially missing/failing cache PVs.
> 3. LVM has no way to clearly indicate that a physical volume is volatile and that data loss on it is expected - maybe even including the PV header itself. Alternatively, a way to indicate "if something is wrong with the cache, just forget about it (if possible)".
> 4. Just recreating the PV with 'pvcreate --zero --pvmetadatacopies 0 --norestorefile --uuid' appears to be enough to get a write-through cache, and thereby the associated volume, working again. So it doesn't look like LVM cares about the cache data being lost, only about the PV itself. Refusing to activate the VG therefore appears a bit too conservative, and the error handling here could probably be improved (see above).
> 5. Also, as there is currently no place within the LVM metadata to label a PV/VG/LV as "volatile", it is not clear either to LVM or to admins looking at the output of tools like lvdisplay that a specific LV is volatile. There are therefore also no safeguards or warnings against actions that would cause data loss (like adding a ramdisk to a raid0, or adding a write-back instead of a write-through cache).
> 
> 
> Therefore I'd like to ask if it would be possible to make two small improvements:
> 1. Add a "volatile" flag to PVs, LVs, and VGs to clearly indicate that they are non-persistent and that data loss on them is expected.
> 2. And one of:
>   a. Change the error handling and add automatic recovery from missing PVs when the LV or VG has the volatile flag, e.g. by automatically `--uncache`-ing the volume and activating it without the cache whose PV is missing. This is even more important for boot volumes, where such a configuration would otherwise prevent the system from booting at all.
>   b. Alternatively, add native support for ramdisks. This would mainly require extending the VG metadata with an 'is-RAMdisk' flag that causes the lookup of the PV to be skipped and a new ramdisk to be allocated while the VG is being activated (its size is known from the VG metadata, as we know how much is allocated/used). This could also help with unit tests and CI/CD usage (where currently the PV is manually created with brd before activating/creating the VG), including LVM's own test/lib/aux.sh, test/shell/devicesfile-misc.sh, test/shell/devicesfile-refresh.sh and test/shell/devicesfile-serial.sh.
>   c. Same as 2a, but instead of automatically uncaching the volume, add a flag to the VG metadata that allows LVM to use the hints file to find the PV and automatically re-initialize it regardless of its header. Maybe combined with an additional configuration option to require the block device to be zeroed (i.e. only the first 4 sectors, to avoid reading the entire device) as a safeguard against the accidental data loss that looking for the correct PV header normally protects against.
>   d. Same as 2b, but limited to caches only. Considering how caching is currently implemented, restricting ramdisks to caches may cause unnecessary additional work and be less useful than adding them as a new kind of PV. It also wouldn't help the additional use case of unit tests and CI/CD pipelines, and the general variant would additionally simplify "playing with" and learning about LVM.
>   e. Add an option to have lvconvert enable caching but WITHOUT saving it within the VG's metadata, causing LVM to forget about the cache, i.e. the next time the system boots LVM would activate the VG normally, without the cache. For write-through caches this should always be safe, and for write-back it only causes data loss when the system crashes without flushing the cache.
> 
> My personal favourite is 2b, followed by 2e.
> 2b basically realizes my entire use case natively within LVM, and 2e at least avoids the need to automate the LVM recovery just to be able to reboot the system, allowing me to write a systemd service that adds the cache at runtime.



Hi

We do have several such things in our TODO plans - but it's actually way more 
complicated than you might think.  It's also not completely true that a 
'writethrough' cache cannot have dirty blocks (i.e. blocks present only in 
the cache because writes to the origin failed).

Another important note here is that the dm-cache target is not intended to be 
a 'bigger page cache' - it has a different purpose and different usage.

So using a 'ramdisk' for dm-cache is kind of pointless when the same RAM can 
likely be used more effectively by the system's page-cache logic.

For extending the amount of dirty cached pages there is the 'dm-writecache' 
target, which in a way extends the page cache by the size of your fast 
NVMe/SSD device - but it does not accelerate 'reads' from hotspots.
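
Just for illustration - attaching a writecache with lvm2 looks roughly like 
this, where the VG/LV/device names are only examples:

   lvcreate -n fast -L 10G vg /dev/nvme0n1
   lvconvert --type writecache --cachevol fast vg/home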

Lvm should cope (eventually with the --force option) with the removal of 
missing devices holding cached blocks - however there can still be some dead 
spots. But ATM we are not seeing it as a major problem.  A hotspot cache is 
simply not supposed to be randomly removed from your system - as it's not 
easy to rebuild.

But it might be possible to more easily automate the bootup process in case a 
PV with a cache is missing (something like for 'raidLVs' with missing legs).
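
For raidLVs this already exists as degraded activation; for a cached LV with 
a lost cache PV, such automation would roughly have to do what one does by 
hand today (an untested sketch only, names are examples):

   # raid LVs with a missing leg can already be activated like this:
   vgchange -ay --activationmode degraded vg
   # for a cached LV whose cache PV is gone, roughly:
   lvconvert --uncache --force vg/lv
   vgreduce --removemissing vg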

Regards

Zdenek


* Re: add volatile flag to PV/LVs (for cache) to avoid degraded state on reboot
  2024-01-17 11:08 ` Zdenek Kabelac
@ 2024-01-17 22:00   ` Gionatan Danti
  2024-01-18 15:40     ` Zdenek Kabelac
  0 siblings, 1 reply; 7+ messages in thread
From: Gionatan Danti @ 2024-01-17 22:00 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: lists.linux.dev, linux-lvm

On 2024-01-17 12:08, Zdenek Kabelac wrote:
> It's also not completely true that a 'writethrough' cache cannot have
> dirty blocks (i.e. blocks present only in the cache because writes to
> the origin failed).

Hi, really? From dm-cache docs:

"If writethrough is selected then a write to a cached block will not
complete until it has hit both the origin and cache devices.  Clean
blocks should remain clean."

So I would not expect to see dirty blocks on a write-through cache, unless 
the origin device is unable to write at all - which means that removing 
the cache device would be no worse than not having it at all in the 
first place.

What am I missing?

> But ATM we are not seeing it as some major trouble.  Hotspot cache is
> simply not supposed to be randomly removed from your systems - as it
> it's not easy to rebuild.

As a write-through cache should not contain dirty data, using a single 
SSD for caching should be OK. I think that if such an expendable (and 
write-through) SSD fails, one should be able to boot without issues.

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it


* Re: add volatile flag to PV/LVs (for cache) to avoid degraded state on reboot
  2024-01-17 22:00   ` Gionatan Danti
@ 2024-01-18 15:40     ` Zdenek Kabelac
  2024-01-18 19:50       ` Gionatan Danti
  2024-01-20 20:29       ` lists.linux.dev
  0 siblings, 2 replies; 7+ messages in thread
From: Zdenek Kabelac @ 2024-01-18 15:40 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: lists.linux.dev, linux-lvm

On 17. 01. 24 at 23:00, Gionatan Danti wrote:
> On 2024-01-17 12:08, Zdenek Kabelac wrote:
>> It's also not completely true that a 'writethrough' cache cannot have
>> dirty blocks (i.e. blocks present only in the cache because writes to
>> the origin failed).
> 
> Hi, really? From dm-cache docs:
> 
> "If writethrough is selected then a write to a cached block will not
> complete until it has hit both the origin and cache devices.  Clean
> blocks should remain clean."
> 
> So I would not expect to see dirty blocks on a write-through cache, unless the 
> origin device is unable to write at all - which means that removing the cache 
> device would be no worse than not having it at all in the first place.
> 
> What am I missing?
> 

The cache can contain blocks that are still being 'synchronized' to the cache 
origin. So while the 'writing' process doesn't get an ACK for its writes, the 
cache may have valid blocks that are 'dirty' in the sense of still being 
synchronized to the origin device.

And while this is usually not a problem when the system works properly, it 
turns into a weird 'state machine' model when e.g. the origin device has 
errors - which might even be 'transient', with all the variety of storage 
types and raid arrays with integrity and self-healing and so on...

So while it's usually not a problem for a laptop with 2 disks, the world is 
more complex...


>> But ATM we are not seeing it as some major trouble.  Hotspot cache is
>> simply not supposed to be randomly removed from your systems - as it
>> it's not easy to rebuild.
> 
> As a write-through cache should not contain dirty data, using a single SSD for 
> caching should be OK. I think that if such expendable (and write-through) SSD 
> fails, one should be able to boot without issues.

This is mostly true - yet lvm2 would have to be 'available' in the boot 
ramdisk, and the booting process would have to be able to recognize the 
problem, call some sort of 'lvconvert --repair', and proceed with the boot.

As mentioned, there is some similarity with a raid with a failed leg - so 
some sort of 'degraded' activation might also be an option here.
But it further needs some lvm2 metadata update to maintain the 'state' of 
the metadata - so that if there is another 'reboot' and the PV with the 
cache appears again, it does not interfere with the system (i.e. provide 
some historical cached blocks); so, just like a mirrored leg, it needs 
some care...
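
Very roughly, such an initramfs hook could look like this (only a sketch of 
the idea, not an existing or tested script; VG/LV names are placeholders):

   # after the normal activation of the root VG fails due to the missing
   # cache PV, drop or repair the cache and retry in degraded mode
   if ! vgchange -ay vg_root; then
       lvconvert --uncache --force vg_root/root || lvconvert --repair vg_root/root
       vgchange -ay --activationmode degraded vg_root
   fi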

Regards

Zdenek



* Re: add volatile flag to PV/LVs (for cache) to avoid degraded state on reboot
  2024-01-18 15:40     ` Zdenek Kabelac
@ 2024-01-18 19:50       ` Gionatan Danti
  2024-01-20 20:29       ` lists.linux.dev
  1 sibling, 0 replies; 7+ messages in thread
From: Gionatan Danti @ 2024-01-18 19:50 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: lists.linux.dev, linux-lvm

On 2024-01-18 16:40, Zdenek Kabelac wrote:
> But it further needs some lvm2 metadata update to maintain the 'state' of
> the metadata - so that if there is another 'reboot' and the PV with the
> cache appears again, it does not interfere with the system (i.e. provide
> some historical cached blocks); so, just like a mirrored leg, it needs
> some care...

Hi Zdenek,
yes, this is a valid point: if the cache device reappears and starts 
providing old blocks, then data corruption can happen.

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it


* Re: add volatile flag to PV/LVs (for cache) to avoid degraded state on reboot
  2024-01-18 15:40     ` Zdenek Kabelac
  2024-01-18 19:50       ` Gionatan Danti
@ 2024-01-20 20:29       ` lists.linux.dev
       [not found]         ` <1279983342.522347.1705803666448@mail.yahoo.com>
  1 sibling, 1 reply; 7+ messages in thread
From: lists.linux.dev @ 2024-01-20 20:29 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: linux-lvm

On Thu, Jan 18, 2024 at 04:40:47PM +0100, Zdenek Kabelac wrote:
> The cache can contain blocks that are still being 'synchronized' to the cache
> origin. So while the 'writing' process doesn't get an ACK for its writes, the
> cache may have valid blocks that are 'dirty' in the sense of still being
> synchronized to the origin device.
> 
> And while this is usually not a problem when the system works properly, it
> turns into a weird 'state machine' model when e.g. the origin device has
> errors - which might even be 'transient', with all the variety of storage
> types and raid arrays with integrity and self-healing and so on...
> 
> So while it's usually not a problem for a laptop with 2 disks, the world is
> more complex...

Ehm, but wouldn't anything other than discarding that block from the cache and using whatever is on the backing storage introduce unpredictable errors?
As you already said, the write was never ACKed, so the software that tried to write it never expected it to be written.
Why exactly are we allowed to use the data from the write-through cache to modify the data on the backing storage in such cases?
I.e. why can we safely consider it valid data?
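
(For what it's worth, this is how I have been checking whether a cache actually holds dirty blocks - the VG name is just an example:

   lvs -a -o lv_name,cache_total_blocks,cache_used_blocks,cache_dirty_blocks vg0
)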

> the metadata - so that if there is another 'reboot' and the PV with the cache
> appears again, it does not interfere with the system (i.e. provide some
> historical cached blocks); so, just like a mirrored leg, it needs some care...

Same here - why do we have to consider these blocks at all and can't just discard them? We know when a drive re-appears, so we could simply not use it without validation, or, if the volatile flag I suggested were used, just wipe it and start over...

After all, I don't know anyone who designs their storage systems with the assumption that the write-through cache has to be redundant.
Even more, I know enough people in data center environments who reuse their "failing but still kinda good" SSDs and NVMes for write-through caches, on the assumption that their failure at most impacts read performance but not data safety.

Is there some common misconception at play? Or what exactly am I missing here?

Sincerely,
Klaus Frank


* Re: add volatile flag to PV/LVs (for cache) to avoid degraded state on reboot
       [not found]         ` <1279983342.522347.1705803666448@mail.yahoo.com>
@ 2024-01-22 10:58           ` Zdenek Kabelac
  0 siblings, 0 replies; 7+ messages in thread
From: Zdenek Kabelac @ 2024-01-22 10:58 UTC (permalink / raw)
  To: matthew patton, lists.linux.dev; +Cc: linux-lvm

On 21. 01. 24 at 3:21, matthew patton wrote:
>> As you already said, the write was never ACKed, so the software that tried to write it never expected it to be written.
> 
> we don't care about the user program and what it thinks got written or not. 
> That's way higher up the stack.
> 
> Any write-thru cache has NO business writing new data to the cache first; it 
> must hit the source media first. Once that is done it can be ACK'd. The ONLY 
> other part of the "transaction" is an update to the cache management 
> block-mapping to invalidate the block so as to prevent stale reads.
> 
> THEN IF there is a case to be made for re-caching the new data (we know it 
> was a block under active management), that is a SECOND OP that can also be 
> made asynchronous. Write-thru should ALWAYS perform and behave like the cache 
> device doesn't exist at all.


Hi

Anyone can surely write a caching policy following the rules above, however 
the current DM cache works differently with its cached 'blocks'.

The method above would require dropping/demoting the whole cached block out 
of the cache first, then updating the content on the origin device, and then 
promoting the whole updated block back into the cache - i.e. when a user 
writes a 512B sector, a cached block of 512KiB would need to be re-cached...

So here I'd wish good luck with the performance of such an engine; the 
current DM cache engine uses parallel writes - thus there can be a moment 
where the cache simply has the more recent and valid data.

The problem will happen when the origin has faulty sectors - the DM target 
takes this risk - but it should not have any impact on properly written 
software that uses transactional mechanisms correctly.

So if there is room for a much slower caching policy that will never ever 
have any dirty pages - someone can bravely step in and write a new caching 
policy for such an engine.
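
For illustration - a policy is already selectable per cached LV via lvchange, 
so a new one would plug in the same way ('cleaner' and 'smq' below are the 
existing policies, the LV name is only an example):

   lvchange --cachepolicy cleaner vg/lv
   lvchange --cachepolicy smq --cachesettings migration_threshold=2048 vg/lv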

Regards

Zdenek



Thread overview: 7+ messages
2024-01-12 18:19 add volatile flag to PV/LVs (for cache) to avoid degraded state on reboot lists.linux.dev
2024-01-17 11:08 ` Zdenek Kabelac
2024-01-17 22:00   ` Gionatan Danti
2024-01-18 15:40     ` Zdenek Kabelac
2024-01-18 19:50       ` Gionatan Danti
2024-01-20 20:29       ` lists.linux.dev
     [not found]         ` <1279983342.522347.1705803666448@mail.yahoo.com>
2024-01-22 10:58           ` Zdenek Kabelac
