All of lore.kernel.org
 help / color / mirror / Atom feed
* [linux-lvm] Reserve space for specific thin logical volumes
@ 2017-09-08 10:35 Gionatan Danti
  2017-09-08 11:06 ` Xen
  2017-09-09 22:04 ` Gionatan Danti
  0 siblings, 2 replies; 91+ messages in thread
From: Gionatan Danti @ 2017-09-08 10:35 UTC (permalink / raw)
  To: linux-lvm

Hi list,
as by the subject: is it possible to reserve space for specific thin 
logical volumes?

This can be useful to "protect" critical volumes from having their space 
"eaten" by other, potentially misconfigured, thin volumes.

Another, somewhat more convoluted, use case is to prevent snapshot 
creation when thin pool space is too low, causing the pool to fill up 
completely (with all the associated dramas for the other thin volumes).

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-08 10:35 [linux-lvm] Reserve space for specific thin logical volumes Gionatan Danti
@ 2017-09-08 11:06 ` Xen
  2017-09-09 22:04 ` Gionatan Danti
  1 sibling, 0 replies; 91+ messages in thread
From: Xen @ 2017-09-08 11:06 UTC (permalink / raw)
  To: linux-lvm

Gionatan Danti schreef op 08-09-2017 12:35:
> Hi list,
> as by the subject: is it possible to reserve space for specific thin
> logical volumes?
> 
> This can be useful to "protect" critical volumes from having their
> space "eaten" by other, potentially misconfigured, thin volumes.
> 
> Another, somewhat more convoluted, use case is to prevent snapshot
> creation when thin pool space is too low, causing the pool to fill up
> completely (with all the associated dramas for the other thin
> volumes).

For my 'ideals' thin space reservation (which would be like allocation 
in advance) would definitely be a welcome thing.

You can also think of it in terms of a default pre-allocation setting. 
I.e. every volume keeps a bit of space over-allocated while only doing 
so if there is actually room in the thin volume (some kind of lazy 
allocation?).

Of course not trying to steal your question here and I do not know if 
any such thing is possible but it might be and I wouldn't mind hearing 
the answer as well.

No offense intended. Regards.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-08 10:35 [linux-lvm] Reserve space for specific thin logical volumes Gionatan Danti
  2017-09-08 11:06 ` Xen
@ 2017-09-09 22:04 ` Gionatan Danti
  2017-09-11 10:35   ` Zdenek Kabelac
  1 sibling, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-09 22:04 UTC (permalink / raw)
  To: linux-lvm

Il 08-09-2017 12:35 Gionatan Danti ha scritto:
> Hi list,
> as by the subject: is it possible to reserve space for specific thin
> logical volumes?
> 
> This can be useful to "protect" critical volumes from having their
> space "eaten" by other, potentially misconfigured, thin volumes.
> 
> Another, somewhat more convoluted, use case is to prevent snapshot
> creation when thin pool space is too low, causing the pool to fill up
> completely (with all the associated dramas for the other thin
> volumes).
> 
> Thanks.

Hi all,
anyone with some informations?

Any comment would be very appreciated :)
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-09 22:04 ` Gionatan Danti
@ 2017-09-11 10:35   ` Zdenek Kabelac
  2017-09-11 10:55     ` Xen
  2017-09-11 21:59     ` Gionatan Danti
  0 siblings, 2 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-11 10:35 UTC (permalink / raw)
  To: LVM general discussion and development, Gionatan Danti

Dne 10.9.2017 v 00:04 Gionatan Danti napsal(a):
> Il 08-09-2017 12:35 Gionatan Danti ha scritto:
>> Hi list,
>> as by the subject: is it possible to reserve space for specific thin
>> logical volumes?
>>
>> This can be useful to "protect" critical volumes from having their
>> space "eaten" by other, potentially misconfigured, thin volumes.
>>
>> Another, somewhat more convoluted, use case is to prevent snapshot
>> creation when thin pool space is too low, causing the pool to fill up
>> completely (with all the associated dramas for the other thin
>> volumes).
>>
>> Thanks.
> 
> Hi all,
> anyone with some informations?
> 
> Any comment would be very appreciated :)
> Thanks.


Hi


Not sure for which information are you looking for ??

Having 'reserved' space for thinLV - means - you have to add more space
to this thin-pool -  there is not much point in keeping space in VG,
which could be only used for extension of particular LV ??

What we do have thought is 'shard' "_pmspare' extra space for metadata 
recovery, but there is nothing  like that for data space (and not even planned).

There is support for so-called- fully-provisioned thinLVs withing thin-pool 
in-plan, but that probably doesn't suit your needs.


The first question here is - why do you want to use thin-provisioning ?

As thin-provisioning is about 'promising the space you can deliver later when 
needed'  - it's not about hidden magic to make the space out-of-nowhere.
The idea of planning to operate thin-pool on 100% fullness boundary is simply 
not going to work well - it's  not been designed for that use-case - so if 
that's been your plan - you will need to seek for other solution.
(Unless you seek for those 100% provisioned devices)

Regards


Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 10:35   ` Zdenek Kabelac
@ 2017-09-11 10:55     ` Xen
  2017-09-11 11:20       ` Zdenek Kabelac
  2017-09-11 21:59     ` Gionatan Danti
  1 sibling, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-11 10:55 UTC (permalink / raw)
  To: linux-lvm

Zdenek Kabelac schreef op 11-09-2017 12:35:

> As thin-provisioning is about 'promising the space you can deliver
> later when needed'  - it's not about hidden magic to make the space
> out-of-nowhere.
> The idea of planning to operate thin-pool on 100% fullness boundary is
> simply not going to work well - it's  not been designed for that
> use-case

I am going to rear my head again and say that a great many people would 
probably want a thin-provisioning that does exactly that ;-).

I mean you have it designed for auto-extension but there are also many 
people that do not want to auto-extend and just share available 
resources more flexibly.

For those people safety around 100% fullness boundary becomes more 
important.

I don't really think there is another solution for that.

I don't think BTRFS is really a good solution for that.

So what alternatives are there, Zdenek? LVM is really the only thing 
that feels "good" to us.

Are there structural design inhibitions that would really prevent this 
thing from ever arising?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 10:55     ` Xen
@ 2017-09-11 11:20       ` Zdenek Kabelac
  2017-09-11 12:06         ` Xen
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-11 11:20 UTC (permalink / raw)
  To: linux-lvm

Dne 11.9.2017 v 12:55 Xen napsal(a):
> Zdenek Kabelac schreef op 11-09-2017 12:35:
> 
>> As thin-provisioning is about 'promising the space you can deliver
>> later when needed'� - it's not about hidden magic to make the space
>> out-of-nowhere.
>> The idea of planning to operate thin-pool on 100% fullness boundary is
>> simply not going to work well - it's� not been designed for that
>> use-case
> 
> I am going to rear my head again and say that a great many people would 
> probably want a thin-provisioning that does exactly that ;-).
> 

Wondering from where they could get this idea...
We always communicate clearly - do not plan to use 100% full unresizable 
thin-pool as a part of regular work-flow - it's always critical situation 
often even leading to system's reboot and full check of all volumes.

> I mean you have it designed for auto-extension but there are also many people 
> that do not want to auto-extend and just share available resources more flexibly.
> 
> For those people safety around 100% fullness boundary becomes more important.
> 
> I don't really think there is another solution for that.
> 
> I don't think BTRFS is really a good solution for that.
> 
> So what alternatives are there, Zdenek? LVM is really the only thing that 
> feels "good" to us.
> 

Thin-pool  needs to be ACTIVELY monitored and proactively either added more PV 
free space to the VG or eliminating  unneeded 'existing' provisioned blocks 
(fstrim,  dropping snapshots, removal of unneeded thinLVs.... -  whatever 
comes on your mind to make a more free space in thin-pool  - lvm2 fully 
supports now to call 'smart' scripts directly out of dmeventd for such action.


It's illusion to hope anyone will be able to operate lvm2 thin-pool at 100% 
fullness reliable - there should be always enough room to give 'scripts'
reaction time to gain some more space in-time  - so thin-pool can serve free 
chunks for provisioning - that's been design - to deliver blocks when needed,
not to brake system

> Are there structural design inhibitions that would really prevent this thing 
> from ever arising?

Yes, performance and resources consumption.... :)

And there is fundamental difference between full 'block device' sharing
space with other device - compared with single full filesystem - you can't 
compare these 2 things at all.....


Regards


Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 11:20       ` Zdenek Kabelac
@ 2017-09-11 12:06         ` Xen
  2017-09-11 12:45           ` Xen
  2017-09-11 13:11           ` Zdenek Kabelac
  0 siblings, 2 replies; 91+ messages in thread
From: Xen @ 2017-09-11 12:06 UTC (permalink / raw)
  To: linux-lvm

Zdenek Kabelac schreef op 11-09-2017 13:20:

> Wondering from where they could get this idea...
> We always communicate clearly - do not plan to use 100% full
> unresizable thin-pool as a part of regular work-flow

No one really PLANS for that.

They probably plan for some 80% usage or less.

But they *do* use thin provisioning for over-provisioning.

So the issue is runaway processes.

Typically the issue won't be "as planned" behaviour.

I still intend to write more better monitoring support for myself if I 
ever get the chance to code again.

> - it's always
> critical situation often even leading to system's reboot and full
> check of all volumes.

I know that but the issue is to prevent the critical situation (if the 
design should allow for that).

TWO levels of failure:

- Filesystem level failure
- Block layer level failure

File system level failure can also not be critical because of using 
non-critical volume because LVM might fail even though filesystem does 
not fail or applications.

Block level layer failure is much more serious, and can prevent system 
from recovering when it otherwise could.

> Thin-pool  needs to be ACTIVELY monitored

But monitoring is labour intensive task unless monitoring systems are in 
place with email reporting and so on.

Do those systems exist? Do we have them available?

I know I wrote one the other day and it is still working so I am not so 
much in a problem right now.

But in general it is still a poor solution for me (because I didn't 
develop it further and it is just a Bash script using some log reading 
functionality using the older version of LVM's reporting feature into 
the syslog (systemd-journald).

> and proactively either added
> more PV free space to the VG

That is not possible in the use case described. Not all systems have 
instantly more space available, or even able to expand, and may still 
want to use LVM thin provisioning because of the flexibility it 
provides.

> or eliminating  unneeded 'existing'
> provisioned blocks (fstrim

Yes that is very good to do that, but also needs setup.

> ,  dropping snapshots

Might also be good in more fully-fledged system.

> , removal of unneeded
> thinLVs....

Only manual intervention this one... and last resort only to prevent 
crash so not really useful in general situation?

> -  whatever comes on your mind to make a more free space
> in thin-pool

I guess but that is lot of manual intervention. We like to also be safe 
in case we're sleeping ;-).

> - lvm2 fully supports now to call 'smart' scripts
> directly out of dmeventd for such action.

Yes that is very good, thank you for that. I am still on older LVM 
making use of existing logging feature, which also works for me for now.

> It's illusion to hope anyone will be able to operate lvm2 thin-pool at
> 100% fullness reliable

That's not what we want.

100% is not the goal. Is exceptional situation to begin with.

> - there should be always enough room to give
> 'scripts' reaction time

Sure but some level of "room reservation" is only to buy time -- or 
really perhaps to make sure main system volume doesn't crash when data 
volume fills up by accident.

But system volumes already have reserved space filesystem level.

But do they also have this space reserved in actuality? I doubt it. Not 
on the LVM level.

So it is only to mirror that filesystem feature.

Now you could do something on the filesystem level to ensure that those 
blocks are already allocated on LVM level, that would be good too.

> to gain some more space in-time

Yes email monitoring would be most important I think for most people.

> - so thin-pool can
> serve free chunks for provisioning - that's been design

Aye but does design have to be complete failure when condition runs out?

I am just asking whether or not there is a clear design limitation that 
would ever prevent safety in operation when 100% full (by accident).

You said before that there was design limitation, that concurrent 
process cannot know whether the last block has been allocated.

> - to deliver
> blocks when needed,
> not to brake system

But it's exceptional situation to begin with.

>> Are there structural design inhibitions that would really prevent this 
>> thing from ever arising?
> 
> Yes, performance and resources consumption.... :)

Right, that was my question I guess.

So you said before it was a concurrent thread issue.

Concurrent allocation issue using search algorithm to find empty blocks.

> And there is fundamental difference between full 'block device' sharing
> space with other device - compared with single full filesystem - you
> can't compare these 2 things at all.....

You mean BTRFS being full filesystem.

I still think theoretically solution would be easy if you wanted it.

I mean I have been programmer for many years too ;-).

But it seems to me desire is not there.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 12:06         ` Xen
@ 2017-09-11 12:45           ` Xen
  2017-09-11 13:11           ` Zdenek Kabelac
  1 sibling, 0 replies; 91+ messages in thread
From: Xen @ 2017-09-11 12:45 UTC (permalink / raw)
  To: linux-lvm

Xen schreef op 11-09-2017 14:06:

> But system volumes already have reserved space filesystem level.
> 
> But do they also have this space reserved in actuality? I doubt it.
> Not on the LVM level.
> 
> So it is only to mirror that filesystem feature.
> 
> Now you could do something on the filesystem level to ensure that
> those blocks are already allocated on LVM level, that would be good
> too.

This made no sense, sorry.

No system should really run main system volume on LVM thin (or at least 
there no great need for it) so the typical failure case would be:

- data volume fills up
- entire system crashes

THAT is the only problem LVM has today.

It's not just that the thin pool is going to be unreliable

But that it also causes a kernel panic in due time. Usually within 10-20 
seconds.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 12:06         ` Xen
  2017-09-11 12:45           ` Xen
@ 2017-09-11 13:11           ` Zdenek Kabelac
  2017-09-11 13:46             ` Xen
                               ` (3 more replies)
  1 sibling, 4 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-11 13:11 UTC (permalink / raw)
  To: LVM general discussion and development, Xen

Dne 11.9.2017 v 14:06 Xen napsal(a):
> Zdenek Kabelac schreef op 11-09-2017 13:20:
> 
>> Wondering from where they could get this idea...
>> We always communicate clearly - do not plan to use 100% full
>> unresizable thin-pool as a part of regular work-flow
> 
> No one really PLANS for that.
> 
> They probably plan for some 80% usage or less.


Thin-provisioning is - about 'postponing'  available space to be delivered in 
time - let's have an example:

You order some work which cost $100.
You have just $30, but you know, you will have $90 next week -
so the work can start....

But it seems some users know it will cost $100, but they still think the work 
could be done with $10 and it's will 'just' work the same....

Sorry it won't....


> But they *do* use thin provisioning for over-provisioning.

Noone is blaming anyone for over-provisioning - but using over-provising 
without the plan of adding this space in case the space is really needed - 
that's the main issue and problem here.

thin-provisiong is giving you extra TIME - not the SPACE :)

> 
> File system level failure can also not be critical because of using 
> non-critical volume because LVM might fail even though filesystem does not 
> fail or applications.

So my Laptop machine has 32G RAM - so you can have 60% of dirty-pages
those may raise pretty major 'provisioning' storm....

> 
> Block level layer failure is much more serious, and can prevent system from 
> recovering when it otherwise could.

Yep - the idea is - when thin-pool gets full - it will stop working,
but you can't rely on 'usable' system when this happens....

Of course - it differs on case by case - if you run your /rootvolume
out of such overfilled thin-pool - you have much bigger set of problems
compared with user which has just some mount  data volume - so
the rest of system is sitting on some 'fully provisioned' volume....

But we are talking about generic case here no on some individual sub-cases
where some limitation might give you the chance to rescue better...

> 
> That is not possible in the use case described. Not all systems have instantly 
> more space available, or even able to expand, and may still want to use LVM 
> thin provisioning because of the flexibility it provides.

Again - it's admin's  gambling here - if he let the system overprovisiong
and doesn't have 'backup' plan -  you can't blame here  lvm2.....

> 
> Only manual intervention this one... and last resort only to prevent crash so 
> not really useful in general situation?

Let's simplify it for the case:

You have  1G thin-pool
You use 10G of thinLV on top of 1G thin-pool

And you ask for 'sane' behavior ??

You would have to probably write your whole own linux kernel - to continue 
work reasonable well when 'write-failure' starts to appear.

It's completely out-of-hand of  dm/lvm2.....

The most 'sane' is to stop and reboot and fix missing space....

Any idea of having 'reserved' space for 'prioritized' applications and other 
crazy ideas leads to nowhere.

Actually there is very good link to read about:

https://lwn.net/Articles/104185/

Hopefully this will bring your mind further ;)

>> - lvm2 fully supports now to call 'smart' scripts
>> directly out of dmeventd for such action.
> 
> Yes that is very good, thank you for that. I am still on older LVM making use 
> of existing logging feature, which also works for me for now.

Well yeah - it's not useless to discuses solution for  old releases of lvm2...

Lvm2 should be compilable and usable on older distros as well - so upgrade and 
do not torture yourself with older lvm2....



> 
>> It's illusion to hope anyone will be able to operate lvm2 thin-pool at
>> 100% fullness reliable
> 
> That's not what we want.
> 
> 100% is not the goal. Is exceptional situation to begin with.

And we believe it's fine to solve exceptional case  by reboot.
Since the effort you would need to put into solve all kernel corner case is 
absurdly high compared with the fact 'it's exception' for normally used and 
configured and monitored thin-pool....

So don't expect lvm2 team will be solving this - there are more prio work....


>> - there should be always enough room to give
>> 'scripts' reaction time
> 
> Sure but some level of "room reservation" is only to buy time -- or really 
> perhaps to make sure main system volume doesn't crash when data volume fills  > up by accident.

If the system volume IS that important - don't use it with over-provisiong!

The answer is that simple.

You can user different thin-pool for your system LV where you can maintain
snapshot without over-provisioning.

It's way more practical solution the trying to fix  OOM problem :)

>> to gain some more space in-time
> 
> Yes email monitoring would be most important I think for most people.
Put mail messaging into  plugin script then.
Or use any monitoring software for messages in syslog - this worked pretty 
well 20 years back - and hopefully still works well :)

>> serve free chunks for provisioning - that's been design
> 
> Aye but does design have to be complete failure when condition runs out?

YES

> I am just asking whether or not there is a clear design limitation that would 
> ever prevent safety in operation when 100% full (by accident).

Don't user over-provisioning in case you don't want to see failure.

It's the same as you should not overcommit your RAM in case you do not want to 
see OOM....


> I still think theoretically solution would be easy if you wanted it.

My best advice - please you should try to write it - so you would see more in 
depth how yours 'theoretical solution'  meets with reality....


Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 13:11           ` Zdenek Kabelac
@ 2017-09-11 13:46             ` Xen
  2017-09-12 11:46               ` Zdenek Kabelac
  2017-09-11 14:00             ` Xen
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-11 13:46 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: LVM general discussion and development

Zdenek Kabelac schreef op 11-09-2017 15:11:

> Thin-provisioning is - about 'postponing'  available space to be
> delivered in time

That is just one use case.

Many more people probably use it for other use case.

Which is fixed storage space and thin provisioning of available storage.

> You order some work which cost $100.
> You have just $30, but you know, you will have $90 next week -
> so the work can start....

I know the typical use case that you advocate yes.

> But it seems some users know it will cost $100, but they still think
> the work could be done with $10 and it's will 'just' work the same....

No that's not what people want.

People want efficient usage of data without BTRFS, that's all.

>> File system level failure can also not be critical because of using 
>> non-critical volume because LVM might fail even though filesystem does 
>> not fail or applications.
> 
> So my Laptop machine has 32G RAM - so you can have 60% of dirty-pages
> those may raise pretty major 'provisioning' storm....

Yes but still system does not need to crash, right.

>> Block level layer failure is much more serious, and can prevent system 
>> from recovering when it otherwise could.
> 
> Yep - the idea is - when thin-pool gets full - it will stop working,
> but you can't rely on 'usable' system when this happens....
> 
> Of course - it differs on case by case - if you run your /rootvolume
> out of such overfilled thin-pool - you have much bigger set of problems
> compared with user which has just some mount  data volume - so
> the rest of system is sitting on some 'fully provisioned' volume....

Yes.

> But we are talking about generic case here no on some individual 
> sub-cases
> where some limitation might give you the chance to rescue better...

But no one in his right mind currently runs /rootvolume out of thin pool 
and in pretty much all cases probably it is only used for data or for 
example of hosting virtual hosts/containers/virtualized 
environments/guests.

So Data use for thin volume is pretty much intended/common/standard use 
case.

Now maybe amount of people that will be able to have running system 
after data volumes overprovision/fill up/crash is limited.

However, from both a theoretical and practical standpoint being able to 
just shut down whatever services use those data volumes -- which is only 
possible if base system is still running -- makes for far easier 
recovery than anything else, because how are you going to boot system 
reliably without using any of those data volumes? You need rescue mode 
etc.

So I would say it is the general use case where LVM thin is used for 
data, or otherwise it is the "special" use case used by 90% of people...


In any case it wouldn't hurt anyone who didn't fall into that "special 
use case" scenario, it would benefit everyone.

Unless you are speaking perhaps about unmitigatable performance 
considerations.

Then it becomes indeed a tradeoff but you are the better judge of that.

> Again - it's admin's  gambling here - if he let the system 
> overprovisiong
> and doesn't have 'backup' plan -  you can't blame here  lvm2.....

He might have system backups.

He might be able to recover his system if his system is still allowed to 
be logged into.

That should be enough backup plan for most people who do not have 
expandable storage.

So maybe this is not main use case for LVM2, but it is still common use 
case that people keep asking about. So there is a demand for this.

Normal data volumes filling up is pretty much same situation.

Same user will not have backup plan in case volumes fill up.

Thin provisioning does not make that worse, normally.

That's where we start out from.

Thin provisioning with overprosisioning and expandable storage does 
improve that thing for those people, that want to have larger 
filesystems to cater to growth.

But people using slightly larger filesystems only for data space sharing 
between volumes...

Are trying to get a bit more flexibility (for example for moving data 
from partition to partition).

So for example I have 50GB VPS with Thin for data volumes.

If I want to reorganize my data across volumes I only have to ensure 
enough space in thin pool, or move in smaller parts so there is enough 
space for that.

Then I run fstrim and then everything is alright again.

This is benefit of me for thin pool.

It just makes moving data around a bit (a lot) easier.

So I first check thin space and then do operation.

So the only time when I near the "full" mark is when I do these 
operations.

My system is not data intensive (with just 50GB) and does not run quick 
risk of filling up -- but it could happen.

So that's all.

Regards.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 13:11           ` Zdenek Kabelac
  2017-09-11 13:46             ` Xen
@ 2017-09-11 14:00             ` Xen
  2017-09-11 17:34               ` Zdenek Kabelac
  2017-09-11 15:31             ` Eric Ren
  2017-09-11 16:55             ` David Teigland
  3 siblings, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-11 14:00 UTC (permalink / raw)
  To: linux-lvm

Just responding to second part of your email.

>> Only manual intervention this one... and last resort only to prevent 
>> crash so not really useful in general situation?
> 
> Let's simplify it for the case:
> 
> You have  1G thin-pool
> You use 10G of thinLV on top of 1G thin-pool
> 
> And you ask for 'sane' behavior ??

Why not? Really.

> Any idea of having 'reserved' space for 'prioritized' applications and
> other crazy ideas leads to nowhere.

It already exists in Linux filesystems since long time (root user).

> Actually there is very good link to read about:
> 
> https://lwn.net/Articles/104185/

That was cute.

But we're not asking aeroplane to keep flying.

We are asking aeroplane to not take down fuelling plane hovering nearby 
too.

> Well yeah - it's not useless to discuses solution for  old releases of 
> lvm2...

This is insignifant difference. There is no point in here for you to 
argue this.

> Lvm2 should be compilable and usable on older distros as well - so
> upgrade and do not torture yourself with older lvm2....

I'm not doing anything with my system right now so... upgrading LVM2 
would be more torture.


> And we believe it's fine to solve exceptional case  by reboot.

Well it's hard to disagree with that but for me it might take weeks 
before I discover the system is offline.

Otherwise most services would probably continue.

So now I need to install remote monitoring that checks the system is 
still up and running etc.

If all solutions require more and more and more and more monitoring, 
that's not good.


> Since the effort you would need to put into solve all kernel corner
> case is absurdly high compared with the fact 'it's exception' for
> normally used and configured and monitored thin-pool....

Well I take you on your word, it is just not my impression it would be 
*that* hard but this depends on design I guess and you are the arbiter 
on that I guess.

> So don't expect lvm2 team will be solving this - there are more prio 
> work....

Sure, whatever.

Safety is never prio right ;-).

But anyway.

>> Sure but some level of "room reservation" is only to buy time -- or 
>> really perhaps to make sure main system volume doesn't crash when data 
>> volume fills  > up by accident.
> 
> If the system volume IS that important - don't use it with 
> over-provisiong!

System-volume is not overprovisioned.

Just something else running in the system....

That will crash the ENTIRE SYSTEM when it fills up.

Even if it was not used by ANY APPLICATION WHATSOEVER!!!

> The answer is that simple.

But you misunderstand that I was talking about a system volume that was 
not a thin volume.

> You can user different thin-pool for your system LV where you can 
> maintain
> snapshot without over-provisioning.

My system LV is not even ON a thin pool.

> It's way more practical solution the trying to fix  OOM problem :)

Aye but in that case no one can tell you to ensure you have 
auto-expandable memory ;-) ;-) ;-) :p :p :p.

>> Yes email monitoring would be most important I think for most people.
> Put mail messaging into  plugin script then.
> Or use any monitoring software for messages in syslog - this worked
> pretty well 20 years back - and hopefully still works well :)

Yeah I guess but I do not have all this knowledge myself about all these 
different kinds of softwares and how they work, I hoped that thin LVM 
would work for me without excessive need for knowledge of many different 
kinds.

>> Aye but does design have to be complete failure when condition runs 
>> out?
> 
> YES

:(.

>> I am just asking whether or not there is a clear design limitation 
>> that would ever prevent safety in operation when 100% full (by 
>> accident).
> 
> Don't user over-provisioning in case you don't want to see failure.

That's no answer to that question.

> It's the same as you should not overcommit your RAM in case you do not
> want to see OOM....

But with RAM I'm sure you can typically see how much you have and can 
thus take account of that, filesystem will report wrong figure ;-).

>> I still think theoretically solution would be easy if you wanted it.
> 
> My best advice - please you should try to write it - so you would see
> more in depth how yours 'theoretical solution'  meets with reality....

Well more useful to ask people who know.


-- 
Highly Evolved Beings do not consider it “profitable” if they benefit at 
the expense of another.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 13:11           ` Zdenek Kabelac
  2017-09-11 13:46             ` Xen
  2017-09-11 14:00             ` Xen
@ 2017-09-11 15:31             ` Eric Ren
  2017-09-11 15:52               ` Zdenek Kabelac
  2017-09-11 17:41               ` David Teigland
  2017-09-11 16:55             ` David Teigland
  3 siblings, 2 replies; 91+ messages in thread
From: Eric Ren @ 2017-09-11 15:31 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: Xen, LVM general discussion and development

Hi Zdenek,

On 09/11/2017 09:11 PM, Zdenek Kabelac wrote:

[..snip...]

> So don't expect lvm2 team will be solving this - there are more prio 
> work....

Sorry for interrupting your discussion. But, I just cannot help to ask:

It's not the first time I see "there are more prio work". So I'm 
wondering: can upstream
consider to have these high priority works available on homepage [1] or 
trello tool [1]?

I really hope upstream can do so. Thus,

1. Users can expect what changes will likely happen for lvm.

2. It helps developer reach agreements on what problems/features should 
be on high
priority and avoid overlap efforts.

I know all core developer are working for RedHat. But, I guess you 
experts will also be happy
to see any real contributions from other engineers. For me, some big 
issues in LVM I can see by now:

- lvmetad slows down activation much if there are a lot of PVs on system 
(say 256 PVs, it takes >10s to pvscan
in my testing).
- pvmove is slow. I know it's not fault of LVM. The time is almost spent 
in DM (the IO dispatch/copy).
- snapshot cannot be used in cluster environment. There is a usecase: 
user has a central backup system
   running on a node. They want to make snapshot and backup some LUNs 
attached to other nodes, on this
   backup system node.

If our upstream have a place to put and discuss what the prio works are, 
I think it will encourage me to do
more contributions - because I'm not 100% sure if it's a real issue and 
if it's a work that upstream hopes
to see, every engineer wants their work to be accepted by upstream :)   
I can try to go forward to do meaningful
work (research, testing...) as far as I can, if you experts can confirm 
that "that's a real problem. Go ahead!".

[1] https://sourceware.org/lvm2/
[2] https://trello.com/

Regards,
Eric

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 15:31             ` Eric Ren
@ 2017-09-11 15:52               ` Zdenek Kabelac
  2017-09-11 21:35                 ` Eric Ren
  2017-09-11 17:41               ` David Teigland
  1 sibling, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-11 15:52 UTC (permalink / raw)
  To: Eric Ren; +Cc: LVM general discussion and development

Dne 11.9.2017 v 17:31 Eric Ren napsal(a):
> Hi Zdenek,
> 
> On 09/11/2017 09:11 PM, Zdenek Kabelac wrote:
> 
> [..snip...]
> 
>> So don't expect lvm2 team will be solving this - there are more prio work....
> 
> Sorry for interrupting your discussion. But, I just cannot help to ask:
> 
> It's not the first time I see "there are more prio work". So I'm wondering: 
> can upstream
> consider to have these high priority works available on homepage [1] or trello 
> tool [1]?
> 
> I really hope upstream can do so. Thus,
> 
> 1. Users can expect what changes will likely happen for lvm.
> 
> 2. It helps developer reach agreements on what problems/features should be on 
> high
> priority and avoid overlap efforts.
> 

lvm2 is using  upstream community BZ located here:

https://bugzilla.redhat.com/enter_bug.cgi?product=LVM%20and%20device-mapper

You can check RHBZ easily for all lvm2 bZ
(mixes  RHEL/Fedora/Upstream)

We usually want to have upstream BZ being linked with Community BZ,
but sometimes it's driven through other channel - not ideal - but still easily 
search-able.

> - lvmetad slows down activation much if there are a lot of PVs on system (say 
> 256 PVs, it takes >10s to pvscan
> in my testing).

It's should be opposite case - unless something regressed recently...
Easiest is to write out  lvm2 test suite some test.

And eventually bisect which commit broke it...

> - pvmove is slow. I know it's not fault of LVM. The time is almost spent in DM 
> (the IO dispatch/copy).

Yeah - this is more or less design issue inside kernel - there are
some workarounds - but since primary motivation was not to overload
system - it's been left a sleep a bit - since focus gained  'raid' target
and these pvmove fixes are working with old dm mirror target...
(i.e. try to use bigger  region_size for mirror in lvm.conf  (over 512K)
and evaluate performance  - there is something wrong - but core mirror 
developer is busy with raid features ATM....


> - snapshot cannot be used in cluster environment. There is a usecase: user has 
> a central backup system

Well, snapshot CANNOT work in cluster.
What you can do is to split snapshot and attach it different volume,
but exclusive assess is simply required - there is no synchronization of 
changes like with cmirrord for old mirror....

> 
> If our upstream have a place to put and discuss what the prio works are, I 
> think it will encourage me to do
> more contributions - because I'm not 100% sure if it's a real issue and if 

You are always welcome to open Community BZ  (instead of trello/github/.... )
Provide justification, present patches.

Of course I cannot hide :) RH has some sort of influence which bugs are more 
important then the others...

> it's a work that upstream hopes
> to see, every engineer wants their work to be accepted by upstream :) I can 
> try to go forward to do meaningful
> work (research, testing...) as far as I can, if you experts can confirm that 
> "that's a real problem. Go ahead!".

We do our best....

Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 13:11           ` Zdenek Kabelac
                               ` (2 preceding siblings ...)
  2017-09-11 15:31             ` Eric Ren
@ 2017-09-11 16:55             ` David Teigland
  2017-09-11 17:43               ` Zdenek Kabelac
  3 siblings, 1 reply; 91+ messages in thread
From: David Teigland @ 2017-09-11 16:55 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: list, linux-lvm

On Mon, Sep 11, 2017 at 03:11:06PM +0200, Zdenek Kabelac wrote:
> > Aye but does design have to be complete failure when condition runs out?
> 
> YES

I am not satisfied with the way thin pools fail when space is exhausted,
and we aim to do better.  Our goal should be that the system behaves at
least no worse than a file system reaching 100% usage on a normal LV.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 14:00             ` Xen
@ 2017-09-11 17:34               ` Zdenek Kabelac
  0 siblings, 0 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-11 17:34 UTC (permalink / raw)
  To: LVM general discussion and development, Xen

Dne 11.9.2017 v 16:00 Xen napsal(a):
> Just responding to second part of your email.
> 
>>> Only manual intervention this one... and last resort only to prevent crash 
>>> so not really useful in general situation?
>>
>> Let's simplify it for the case:
>>
>> You have  1G thin-pool
>> You use 10G of thinLV on top of 1G thin-pool
>>
>> And you ask for 'sane' behavior ??
> 
> Why not? Really.

Because all filesystems put on top of thinLV  do believe all blocks on the 
device actually exist....

>> Any idea of having 'reserved' space for 'prioritized' applications and
>> other crazy ideas leads to nowhere.
> 
> It already exists in Linux filesystems since long time (root user).

Did I say you can't compare filesystem problem with block level problem ?
If not ;) let's repeat - being out of space in a single filesystem
is completely different fairy-tail with out of space thin-pool.

> 
>> Actually there is very good link to read about:
>>
>> https://lwn.net/Articles/104185/
> 
> That was cute.
> 
> But we're not asking aeroplane to keep flying.
IMHO you just don't yet see the parallelism....


>> And we believe it's fine to solve exceptional case  by reboot.
> 
> Well it's hard to disagree with that but for me it might take weeks before I 
> discover the system is offline.

IMHO it's problem of proper monitoring.

Still the same song here - you should actively trying to avoid car-collision, 
since trying to resurrect often seriously injured or even dead passenger from 
a demolished car is usually very complex job with unpredictable result...

We do put number of 'car-protection' safety mechanism - so the newer tools,
newer kernel the better -  but still when you hit the wall in top-speed
you can't expect you just 'walk-out' easily... and it's way cheaper to solve 
the problem in way you will NOT crash at all..

> 
> Otherwise most services would probably continue.
> 
> So now I need to install remote monitoring that checks the system is still up 
> and running etc.

Of course you do.

thin-pool needs attention/care :)

> If all solutions require more and more and more and more monitoring, that's 
> not good.

It's the best we can provide....


>> So don't expect lvm2 team will be solving this - there are more prio work....
> 
> Sure, whatever.
> 
> Safety is never prio right ;-).

We are safe enough (IMHO) to NOT loose committed data,
We cannot guarantee stable system though - it's too complex.
lvm2/dm can't be fixing extX/btrfs/XFS and other kernel related issues...
Bold men can step in - and fix it....


>> If the system volume IS that important - don't use it with over-provisiong!
> 
> System-volume is not overprovisioned.

If you have  enough blocks in thin-pool to cover all needed block for all 
thinLV attached to it - you are not overprovisioning.


> Just something else running in the system....


Use different pools ;)
(i.e. 10G system + 3 snaps needs  40G of data size & appropriate metadata size 
to be safe from overprovisioning)

> That will crash the ENTIRE SYSTEM when it fills up.
> 
> Even if it was not used by ANY APPLICATION WHATSOEVER!!!

Full thin-pool on recent kernel is certainly NOT randomly crashing entire 
system :)

If you think it's that case - provide full trace of crashed kernel and open BZ 
- just be sure you use upstream Linux...

> My system LV is not even ON a thin pool.

Again - if you reproduce on kernel 4.13 - open BZ and provide reproducer.
If you use older kernel - take a recent one and reproduce.

If you can't reproduce - problem has been already fixed.
It's then for your kernel provider to either back-port fix
or give you fixed newer kernel - nothing really for lvm2...


> It's way more practical solution the trying to fix  OOM problem :)
> 
> Aye but in that case no one can tell you to ensure you have auto-expandable 
> memory ;-) ;-) ;-) :p :p :p.

I'd probably recommend reading some books about how is memory mapped on a 
block device and what are all the constrains and related problems..

>>> Yes email monitoring would be most important I think for most people.
>> Put mail messaging into  plugin script then.
>> Or use any monitoring software for messages in syslog - this worked
>> pretty well 20 years back - and hopefully still works well :)
> 
> Yeah I guess but I do not have all this knowledge myself about all these 
> different kinds of softwares and how they work, I hoped that thin LVM would 
> work for me without excessive need for knowledge of many different kinds.

We do provide some 'generic' script - unfortunately - every use-case is 
basically pretty different set of rules and constrains.

So the best we have is 'auto-extension'
We used to trying to umount - but this has possibly added more problems then 
it has actually solved...

>>> I am just asking whether or not there is a clear design limitation that 
>>> would ever prevent safety in operation when 100% full (by accident).
>>
>> Don't user over-provisioning in case you don't want to see failure.
> 
> That's no answer to that question.

There is a lot of technical complexity behind it.....

I'd say the main part is -  'fs'  would need to be able to know understand
it's living on provisioned device (something we actually do not want to,
as you can change 'state' in runtime - so 'fs' should be aware & unaware
at the same time ;) -   checking with every request that thin-provisioning
is in the place would impact performance, doing in mount-time make it
also bad.

Then you need to deal with fact, that writes to filesystem are 'process' 
aware, while writes to block-device are some anonymous page writes for your 
page cache.
Have I said the level of problems for a single filesystem is totally different 
story yet ?

So in a simple statement  - thin-p has it's limits - if you are unhappy with 
them, then you probably need to look for some other solution - or starting
sending patches and improve things around...

> 
>> It's the same as you should not overcommit your RAM in case you do not
>> want to see OOM....
> 
> But with RAM I'm sure you can typically see how much you have and can thus 
> take account of that, filesystem will report wrong figure ;-).

Unfortunately you cannot....

Number of your free RAM is very fictional number ;) and you run in much bigger 
problems if you start overcommiting memory in kernel....

You can't compare your user-space failing malloc and OOM crashing Firefox....

Block device runs in-kernel - and as root...
There are no reserves, all you know is you need to write block XY,
you have no idea what is the block about..
(That's where ZFS/Btrfs was supposed to excel - they KNOW.... :)

Regard

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 15:31             ` Eric Ren
  2017-09-11 15:52               ` Zdenek Kabelac
@ 2017-09-11 17:41               ` David Teigland
  2017-09-11 21:08                 ` Eric Ren
  1 sibling, 1 reply; 91+ messages in thread
From: David Teigland @ 2017-09-11 17:41 UTC (permalink / raw)
  To: Eric Ren; +Cc: linux-lvm

On Mon, Sep 11, 2017 at 11:31:16PM +0800, Eric Ren wrote:
> It's not the first time I see "there are more prio work". So I'm wondering:
> can upstream
> consider to have these high priority works available on homepage [1] or
> trello tool [1]?

Hi Eric, this is a good question.  The lvm project has done a poor job at
this sort of thing.  A new homepage has been in the works for a long time,
but I think stalled in the review/feedback stage.  It should be unblocked
soon.  I agree we should figure something out for communicating about
ongoing or future work (I don't think bugzilla is the answer.)

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 16:55             ` David Teigland
@ 2017-09-11 17:43               ` Zdenek Kabelac
  0 siblings, 0 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-11 17:43 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: list

Dne 11.9.2017 v 18:55 David Teigland napsal(a):
> On Mon, Sep 11, 2017 at 03:11:06PM +0200, Zdenek Kabelac wrote:
>>> Aye but does design have to be complete failure when condition runs out?
>>
>> YES
> 
> I am not satisfied with the way thin pools fail when space is exhausted,
> and we aim to do better.  Our goal should be that the system behaves at
> least no worse than a file system reaching 100% usage on a normal LV.

We can reach this goal anytime soon - unless we fix all those filesystem....

And there is other metrics - you can make it way more 'safe' for exhausted
space at the prices of massively slowing down a serializing all writes...

I doubt we would find many users that would easily accept massive slowdown of 
their system just because thin-pool can run out of space....

Global anonymous page-cache is really a hard thing for resolving...

But when you start to limit your usage of thin-pool with some constrains,
you can get much better behaving system.

i.e.     using 'ext4' for mounted  'data' LV should be relatively safe...

And again if you see actual kernel crash OOPS - this is of course a real
kernel bug for fixing...

Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 17:41               ` David Teigland
@ 2017-09-11 21:08                 ` Eric Ren
  0 siblings, 0 replies; 91+ messages in thread
From: Eric Ren @ 2017-09-11 21:08 UTC (permalink / raw)
  To: David Teigland; +Cc: Zdenek Kabelac, linux-lvm

Hi David and Zdenek,

On 09/12/2017 01:41 AM, David Teigland wrote:

[...snip...]
> Hi Eric, this is a good question.  The lvm project has done a poor job at
> this sort of thing.  A new homepage has been in the works for a long time,
> but I think stalled in the review/feedback stage.  It should be unblocked
> soon.  I agree we should figure something out for communicating about
> ongoing or future work (I don't think bugzilla is the answer.)
Good news! Thanks very much for you both giving such kind replies :)

Regards,
Eric

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 15:52               ` Zdenek Kabelac
@ 2017-09-11 21:35                 ` Eric Ren
  0 siblings, 0 replies; 91+ messages in thread
From: Eric Ren @ 2017-09-11 21:35 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: LVM general discussion and development

Hi Zdenek,

> lvm2 is using  upstream community BZ located here:
>
> https://bugzilla.redhat.com/enter_bug.cgi?product=LVM%20and%20device-mapper 
>
>
> You can check RHBZ easily for all lvm2 bZ
> (mixes  RHEL/Fedora/Upstream)
>
> We usually want to have upstream BZ being linked with Community BZ,
> but sometimes it's driven through other channel - not ideal - but 
> still easily search-able.

Yes, it's a place where problems are discussed. Thanks for your reminder :)

> [...snip...]
> It's should be opposite case - unless something regressed recently...
> Easiest is to write out  lvm2 test suite some test.
>
> And eventually bisect which commit broke it...

Good to know! I will find time to test different versions on both 
openSUSE and Fedora.

>
>> - pvmove is slow. I know it's not fault of LVM. The time is almost 
>> spent in DM (the IO dispatch/copy).
>
> Yeah - this is more or less design issue inside kernel - there are
> some workarounds - but since primary motivation was not to overload
> system - it's been left a sleep a bit - since focus gained  'raid' target

Aha, it's a good reason. Ideally, it would be good for pvmove having 
some option to control
the IO rate. I know it's not easy...

> and these pvmove fixes are working with old dm mirror target...
> (i.e. try to use bigger  region_size for mirror in lvm.conf  (over 512K)
> and evaluate performance  - there is something wrong - but core mirror 
> developer is busy with raid features ATM....

Thanks for the suggestion.

>
>
>> - snapshot cannot be used in cluster environment. There is a usecase: 
>> user has a central backup system
>
> Well, snapshot CANNOT work in cluster.
> What you can do is to split snapshot and attach it different volume,
> but exclusive assess is simply required - there is no synchronization 
> of changes like with cmirrord for old mirror....

Got it! Advanced features like snapshot/thinp/dmcache, have their own 
metadata. The payment for having those metadata
changes cluster-aware is painful.

> We do our best....

Like you guys have been always doing, thanks!

Regards,
Eric

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 10:35   ` Zdenek Kabelac
  2017-09-11 10:55     ` Xen
@ 2017-09-11 21:59     ` Gionatan Danti
  2017-09-12 11:01       ` Zdenek Kabelac
  1 sibling, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-11 21:59 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: LVM general discussion and development

Il 11-09-2017 12:35 Zdenek Kabelac ha scritto:
> The first question here is - why do you want to use thin-provisioning ?

Because classic LVM snapshot behavior (slow write speed and linear 
performance decrease as snapshot count increases) make them useful for 
nightly backups only.

On the other side, the very fast CoW thinp's behavior mean very usable 
and frequent snapshots (which are very useful to recover from user 
errors).

> As thin-provisioning is about 'promising the space you can deliver
> later when needed'  - it's not about hidden magic to make the space
> out-of-nowhere.

I fully agree. In fact, I was asking about how to reserve space to 
*protect* critical thin volumes from "liberal" resource use by less 
important volumes. Fully-allocated thin volumes sound very interesting - 
even if I think this is a performance optimization rather than a "safety 
measure".

> The idea of planning to operate thin-pool on 100% fullness boundary is
> simply not going to work well - it's  not been designed for that
> use-case - so if that's been your plan - you will need to seek for
> other solution.
> (Unless you seek for those 100% provisioned devices)

I do *not* want to run at 100% data usage. Actually, I want to avoid it 
entirely by setting a reserved space which cannot be used for things as 
snapshot. In other words, I would very like to see a snapshot to fail 
rather than its volume becoming unavailable *and* corrupted.

Let me de-tour by using ZFS as an example (don't bash me for doing 
that!)

In ZFS words, there are object called ZVOLs - ZFS volumes/block devices, 
which can either be "fully-preallocated" or "sparse".

By default, they are "fully-preallocated": their entire nominal space is 
reseved and subtracted from the ZPOOL total capacity. Please note that 
this does *not* means that space is really allocated on the ZPOOL, 
rather that nominal space is accounted against other ZFS dataset/volumes 
when creating new object. A filesystem sitting on top of such a ZVOL 
will never run out of space; rather, if the remaining capacity is not 
enough to guaranteed this constrain, new volume/snapshot creating is 
forbidden.

Example:
# 1 GB ZPOOL
[root@blackhole ~]# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  
ALTROOT
tank  1008M   456K  1008M         -     0%     0%  1.00x  ONLINE  -

# Creating a 600 MB ZVOL (note the different USED vs REFER values)
[root@blackhole ~]# zfs create -V 600M tank/vol1
[root@blackhole ~]# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
tank        621M   259M    96K  /tank
tank/vol1   621M   880M    56K  -

# Snapshot creating - please see that, as REFER is very low (I did write 
nothig on the volume), snapshot creating is allowed
[root@blackhole ~]# zfs snapshot tank/vol1@snap1
[root@blackhole ~]# zfs list -t all
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              621M   259M    96K  /tank
tank/vol1         621M   880M    56K  -
tank/vol1@snap1     0B      -    56K  -

# Let write something to the volume (note how REFER is higher than free, 
unreserved space)
[root@blackhole ~]# zfs destroy tank/vol1@snap1
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=1M 
count=500 oflag=direct
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 12.7038 s, 41.3 MB/s
[root@blackhole ~]# zfs list -t all
NAME        USED  AVAIL  REFER  MOUNTPOINT
tank        622M   258M    96K  /tank
tank/vol1   621M   378M   501M  -

# Snapshot creation now FAILS!
[root@blackhole ~]# zfs snapshot tank/vol1@snap1
cannot create snapshot 'tank/vol1@snap1': out of space
[root@blackhole ~]# zfs list -t all
NAME        USED  AVAIL  REFER  MOUNTPOINT
tank        622M   258M    96K  /tank
tank/vol1   621M   378M   501M  -

The above surely is safe behavior: when free, unused space is too low to 
guarantee the reserved space, snapshot creation is disallowed.

On the other side, using the "-s" option you can create a "sparse" ZVOL 
- a volume which nominal space is *not* accounted/subtracted from the 
total ZPOOL capacity. Such a volume have similar warnings that thin 
volumes. From the man page:

'Though not recommended, a "sparse volume" (also known as "thin 
provisioning") can be created by specifying the -s option to the zfs 
create -V command, or by changing the reservation after the volume has 
been created.  A "sparse volume" is a volume where the reservation is 
less then the volume size.  Consequently, writes to a sparse volume can 
fail with ENOSPC when the pool is low on space.  For a sparse volume, 
changes to volsize are not reflected in the reservation.'

The only real difference vs a fully preallocated volume is the property 
carrying the reserved space expectation. I can even switch at run-time 
between a fully preallocated vs sparse volume by simply changing the 
right property. Indeed, a very important thing to understand is that 
this property can be set to *any value* between 0 ("none") and max 
volume (nominal) size.

On a 600M fully preallocated volumes:
[root@blackhole ~]# zfs get refreservation tank/vol1
NAME       PROPERTY        VALUE      SOURCE
tank/vol1  refreservation  621M       local

On a 600M sparse volume:
[root@blackhole ~]# zfs get refreservation tank/vol1
NAME       PROPERTY        VALUE      SOURCE
tank/vol1  refreservation  none       local

Now, a sparse (refreservation=none) volume *can* be snapshotted even if 
very little free space if available in the ZPOOL:

# The very same command that previously failed, now completes 
successfully
[root@blackhole ~]# zfs snapshot tank/vol1@snap1
[root@blackhole ~]# zfs list -t all
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              502M   378M    96K  /tank
tank/vol1         501M   378M   501M  -
tank/vol1@snap1     0B      -   501M  -

# Using a non-zero, but lower-than-nominal threshold 
(refreservation=100M) allows the snapshot to be taken:
[root@blackhole ~]# zfs set refreservation=100M tank/vol1
[root@blackhole ~]# zfs snapshot tank/vol1@snap1
[root@blackhole ~]# zfs list -t all
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              602M   278M    96K  /tank
tank/vol1         601M   378M   501M  -
tank/vol1@snap1     0B      -   501M  -

# If free space drops under the lower-but-not-zero reservation 
(refreservation=100M), snapshot again fails:
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=1M 
count=300 oflag=direct
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 4.85282 s, 64.8 MB/s
[root@blackhole ~]# zfs list -t all
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              804M  76.3M    96K  /tank
tank/vol1         802M  76.3M   501M  -
tank/vol1@snap1   301M      -   501M  -
[root@blackhole ~]# zfs snapshot tank/vol1@snap2
cannot create snapshot 'tank/vol1@snap2': out of space

OK - now back to the original question: why reserved space can be 
useful? Consider the following two scenarios:

A) You want to efficiently use snapshots and *never* encounter 
unexpected full ZPOOL. Your main constrain it to use at most <50% of 
available space for your "critical" ZVOL. With such a setup, any 
"excessive" snapshot/volume creation will surely fail, but the main ZVOL 
will be unaffected;

B) You want to somewhat overprovision (taking account worst-case 
snapshot behavior), but with *large* operating margin. In this case, you 
can create a sparse volume with lower (but non-zero) reservation. Any 
snapshot/volume creation done when this margin is crossed will fail. You 
surely need to clean-up some space (eg: delete older snapshot), but you 
avoid the runaway effect of new snapshot being continuously created, 
consuming additional space.

Now leave ZWORLD, and back to thinp: it would be *really* cool to 
provide the same sort of functionality. Sure, you had to track space 
usage both@pool and a volume level - but the safety increase would be 
massive. There is an big difference between a corrupted main volume and 
a failed snapshot: while the latter can be resolved without too much 
concert, the former (volume corruption) really is a scary thing.

Don't misunderstand me, Zdenek: I *REALLY* appreciate you core 
developers from the outstanding work on LVM. This is especially true in 
the light of BTRFS's problems, and with stratis (which is heavily based 
on thinp) becoming the new next thing. I even more appreciate that you 
are on the mailing list, replying to your users.

Thin volumes are really cool (and fast!), but they can fail deadly. A 
fail-safe approach (ie: no new snapshot allowed) is much more desirable.

Thanks.

> 
> Regards
> 
> 
> Zdenek

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 21:59     ` Gionatan Danti
@ 2017-09-12 11:01       ` Zdenek Kabelac
  2017-09-12 11:34         ` Gionatan Danti
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-12 11:01 UTC (permalink / raw)
  To: LVM general discussion and development, Gionatan Danti

Dne 11.9.2017 v 23:59 Gionatan Danti napsal(a):
> Il 11-09-2017 12:35 Zdenek Kabelac ha scritto:
>> The first question here is - why do you want to use thin-provisioning ?
> 
> Because classic LVM snapshot behavior (slow write speed and linear performance 
> decrease as snapshot count increases) make them useful for nightly backups only.
> 
> On the other side, the very fast CoW thinp's behavior mean very usable and 
> frequent snapshots (which are very useful to recover from user errors).
> 

There is very good reason why thinLV is fast - when you work with thinLV -
you work only with data-set for single thin LV.

So you write to thinLV and either you modify existing exclusively owned chunk
or you duplicate and provision new one.   Single thinLV does not care about
other thin volume - this is very important to think about and it's important 
for reasonable performance and memory and cpu resources usage.

>> As thin-provisioning is about 'promising the space you can deliver
>> later when needed'� - it's not about hidden magic to make the space
>> out-of-nowhere.
> 
> I fully agree. In fact, I was asking about how to reserve space to *protect* 
> critical thin volumes from "liberal" resource use by less important volumes. 

I think you need to think 'wider'.

You do not need to use a single thin-pool - you can have numerous thin-pools,
and for each one you can maintain separate thresholds (for now in your own
scripting - but doable with today's  lvm2)

Why would you want to place 'critical' volume into the same pool
as some non-critical one ??

It's simply way easier to have critical volumes in different thin-pool
where you might not even use over-provisioning.


> I do *not* want to run at 100% data usage. Actually, I want to avoid it 
> entirely by setting a reserved space which cannot be used for things as 
> snapshot. In other words, I would very like to see a snapshot to fail rather 
> than its volume becoming unavailable *and* corrupted.

Seems to me - everyone here looks for a solution where thin-pool is used till 
the very last chunk in thin-pool is allocated - then some magical AI step in,
decides smartly which  'other already allocated chunk' can be trashed
(possibly the one with minimal impact  :)) - and whole think will continue
run in full speed ;)

Sad/bad news here - it's not going to work this way....

> In ZFS words, there are object called ZVOLs - ZFS volumes/block devices, which 
> can either be "fully-preallocated" or "sparse".
>
> By default, they are "fully-preallocated": their entire nominal space is 
> reseved and subtracted from the ZPOOL total capacity. Please note that this 

Fully-preallocated - sounds like thin-pool without overprovisioning to me...


> # Snapshot creating - please see that, as REFER is very low (I did write 
> nothig on the volume), snapshot creating is allowed

lvm2 also DOES protect you from creation of new thin-pool when the fullness
is about lvm.conf defined threshold - so nothing really new here...


> [root@blackhole ~]# zfs destroy tank/vol1@snap1
> [root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=1M count=500 
> oflag=direct
> 500+0 records in
> 500+0 records out
> 524288000 bytes (524 MB) copied, 12.7038 s, 41.3 MB/s
> [root@blackhole ~]# zfs list -t all
> NAME������� USED� AVAIL� REFER� MOUNTPOINT
> tank������� 622M�� 258M��� 96K� /tank
> tank/vol1�� 621M�� 378M�� 501M� -
> 
> # Snapshot creation now FAILS!

ZFS is filesystem.

So let's repeat again :) amount of problems inside a single filesystem is not 
comparable with block-device layer - it's entirely different world of problems.

You can't really expect filesystem 'smartness' on block-layer.

That's the reason why we can see all those developers boldly stepping into the 
'dark waters' of  mixed filesystem & block layers.

lvm2/dm trusts in different concept - it's possibly less efficient,
but possibly way more secure - where you have different layers,
and each layer could be replaced and is maintained separately.

> The above surely is safe behavior: when free, unused space is too low to 
> guarantee the reserved space, snapshot creation is disallowed.


ATM thin-pool cannot somehow auto-magically 'drop'  snapshots on its own.

And that's the reason why we have those monitoring features provided with 
dmeventd.   Where you monitor  occupancy of thin-pool and when the
fullness goes above defined threshold  - some 'action' needs to happen.

It's really up-to admin to decide if it's more important to make some
free space for existing user writing his 10th copy of 16GB movie :) or erase
some snapshot with some important company work ;)

Just don't expect it will be some magical AI built-in into thin-pool to do 
such decision :)

User already has ALL the power to do this work - the main condition here is - 
this happens much earlier then your thin-pool gets exhausted!

It's really pointless trying to solve this issue after you are already 
out-of-space...

> Now leave ZWORLD, and back to thinp: it would be *really* cool to provide the 
> same sort of functionality. Sure, you had to track space usage both at pool 
> and a volume level - but the safety increase would be massive. There is an big 
> difference between a corrupted main volume and a failed snapshot: while the 
> latter can be resolved without too much concert, the former (volume 
> corruption) really is a scary thing.

AFAIK current kernel (4.13) with thinp & ext4 used with remount-ro on error 
and lvm2 is safe to use in case of emergency - so surely you can lose some 
uncommited data but after reboot and some extra free space made in thin-pool 
you should have consistent filesystem without any damage after fsck.

There are not known simple bugs in this case - like system crashing on dm 
related OOPS (like Xen seems to suggest... - we need to see his bug report...)

However - when thin-pool gets full - the reboot and filesystem check is 
basically mandatory  -  there is no support  (and no plan to start support 
randomly dropping allocated chunks from other thin-volumes to make space for 
your running one)

> Thin volumes are really cool (and fast!), but they can fail deadly. A 

I'd like to still see what you think is  'deadly'

And also I'd like to be explained what better thin-pool can do in terms
of block device layer.

As said in past - if you would modify filesystem to start to reallocate its 
metadata and data to provisioned space - so FS would be AWARE which blocks
are provisioned or uniquely owned... and start working with 'provisioned' 
volume differently  - that would be a very different story - it essentially 
means you would need to write quite new filesystem, since  extX not xfs is not 
really perfect match....

So all I'm saying here is - 'thin-pool' on block layer is doing 'mostly' its 
best to avoid losing user's committed!  data - but of course  if 'admin' has 
failed to fulfill his promise  and add more space to overprovisioned 
thin-pool, something not-nice will happen to the system -  and there is no way 
thin-pool on its own may resolve it  - it should have been resolved much much 
sooner with monitoring via dmeventd   - that's the place you should focus on 
implementing smart way how to protect you system going ballistic....


Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 11:01       ` Zdenek Kabelac
@ 2017-09-12 11:34         ` Gionatan Danti
  2017-09-12 12:03           ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-12 11:34 UTC (permalink / raw)
  To: Zdenek Kabelac, LVM general discussion and development

On 12/09/2017 13:01, Zdenek Kabelac wrote:
> There is very good reason why thinLV is fast - when you work with thinLV -
> you work only with data-set for single thin LV.
> 
> So you write to thinLV and either you modify existing exclusively owned 
> chunk
> or you duplicate and provision new one.   Single thinLV does not care about
> other thin volume - this is very important to think about and it's 
> important for reasonable performance and memory and cpu resources usage.

Sure, I grasp that.

> I think you need to think 'wider'.
> 
> You do not need to use a single thin-pool - you can have numerous 
> thin-pools,
> and for each one you can maintain separate thresholds (for now in your own
> scripting - but doable with today's  lvm2)
> 
> Why would you want to place 'critical' volume into the same pool
> as some non-critical one ??
> 
> It's simply way easier to have critical volumes in different thin-pool
> where you might not even use over-provisioning.

I need to take a step back: my main use for thinp is virtual machine 
backing store. Due to some limitation in libvirt and virt-manager, which 
basically do not recognize thin pools, I can not use multiple thin pools 
or volumes.

Rather, I had to use a single, big thin volumes with XFS on top.

> Seems to me - everyone here looks for a solution where thin-pool is used 
> till the very last chunk in thin-pool is allocated - then some magical 
> AI step in,
> decides smartly which  'other already allocated chunk' can be trashed
> (possibly the one with minimal impact  :)) - and whole think will continue
> run in full speed ;)
> 
> Sad/bad news here - it's not going to work this way....

No, I absolutely *do not want* thinp to automatically dallocate/trash 
some provisioned blocks. Rather, I all for something as "if free space 
is lower than 30%, disable new snapshot *creation*"

> lvm2 also DOES protect you from creation of new thin-pool when the fullness
> is about lvm.conf defined threshold - so nothing really new here...

Maybe I am missing something: this threshold is about new thin pools or 
new snapshots within a single pool? I was really speaking about the latter.

>> [root@blackhole ~]# zfs destroy tank/vol1@snap1
>> [root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=1M 
>> count=500 oflag=direct
>> 500+0 records in
>> 500+0 records out
>> 524288000 bytes (524 MB) copied, 12.7038 s, 41.3 MB/s
>> [root@blackhole ~]# zfs list -t all
>> NAME        USED  AVAIL  REFER  MOUNTPOINT
>> tank        622M   258M    96K  /tank
>> tank/vol1   621M   378M   501M  -
>>
>> # Snapshot creation now FAILS!
> 
> ZFS is filesystem.
> 
> So let's repeat again :) amount of problems inside a single filesystem 
> is not comparable with block-device layer - it's entirely different 
> world of problems.
> 
> You can't really expect filesystem 'smartness' on block-layer.
> 
> That's the reason why we can see all those developers boldly stepping 
> into the 'dark waters' of  mixed filesystem & block layers.

In the examples above, I did not use any ZFS filesystem layer. I used 
ZFS as volume manager, with the intent to place an XFS filesystem on top 
of ZVOL block volumes.

The ZFS man page clearly warns about ENOSP with sparse volume. My point 
is that, by cleaver using of the refreservation property, I can engineer 
a setup where snapshot are generally allowed, unless free space is under 
a certain threshold. In this case, the are not allowed (but newer 
automatically deleted!).

> lvm2/dm trusts in different concept - it's possibly less efficient,
> but possibly way more secure - where you have different layers,
> and each layer could be replaced and is maintained separately.

And I really trust layer separation - it is for this very reason I am a 
big fan of thinp, but its fail behavior somewhat scares me.

> ATM thin-pool cannot somehow auto-magically 'drop'  snapshots on its own.

Let me repeat: I do *not* want thinp to automatically drop anything. I 
simply what it to disallow new snapshot/volume creation when unallocated 
space is too low

> And that's the reason why we have those monitoring features provided 
> with dmeventd.   Where you monitor  occupancy of thin-pool and when the
> fullness goes above defined threshold  - some 'action' needs to happen.

And I really thank you for that - this is a big step forward.
> AFAIK current kernel (4.13) with thinp & ext4 used with remount-ro on 
> error and lvm2 is safe to use in case of emergency - so surely you can 
> lose some uncommited data but after reboot and some extra free space 
> made in thin-pool you should have consistent filesystem without any 
> damage after fsck.
> 
> There are not known simple bugs in this case - like system crashing on 
> dm related OOPS (like Xen seems to suggest... - we need to see his bug 
> report...)
> 
> However - when thin-pool gets full - the reboot and filesystem check is 
> basically mandatory  -  there is no support  (and no plan to start 
> support randomly dropping allocated chunks from other thin-volumes to 
> make space for your running one)
> 
> 
> I'd like to still see what you think is  'deadly'

Committed (fsynced) writes are safe, and this is very good. However, 
*many* application do not properly issue fsync(); this is a fact of life.

I absolutely *do not expect* thinp to automatically cope well with this 
applications - I full understand & agree that application *must* issue 
proper fsyncs.

However, recognizing that real world is quite different from my ideals, 
I want to exclude how many problems are possible: for this reason, I 
really want to prevent full thin pools even in the face of failed 
monitoring (or somnolent sysadmins).

In the past, I testified that XFS take its relatively long time to 
recognize that a thin volume is unavailable - and many async writes can 
be lost in the process. Ext4 + data=journaled did a better job, but a) 
it is not the default filesystem in RH anymore and b) data=journaled is 
not the default option and has its share of problems.

Complex systems need to be monitored - true. And I do that; in fact, I 
have *two* monitor system in place (Zabbix and custom shell based one). 
However, being bitten from a failed Zabbix Agent in the past, I learn a 
good lesson: to design system where some types of problems can not 
simply happen.

So, if in the face of a near-full pool, thinp refuse me to create a new 
filesystem, I would be happy :)

> And also I'd like to be explained what better thin-pool can do in terms
> of block device layer.

Thinp is doing a great job, and nobody wants to deny that.

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 13:46             ` Xen
@ 2017-09-12 11:46               ` Zdenek Kabelac
  2017-09-12 12:37                 ` Xen
  2017-09-12 17:00                 ` Gionatan Danti
  0 siblings, 2 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-12 11:46 UTC (permalink / raw)
  To: LVM general discussion and development, Xen

Dne 11.9.2017 v 15:46 Xen napsal(a):
> Zdenek Kabelac schreef op 11-09-2017 15:11:
> 
>> Thin-provisioning is - about 'postponing'� available space to be
>> delivered in time
> 
> That is just one use case.
> 
> Many more people probably use it for other use case.
> 
> Which is fixed storage space and thin provisioning of available storage.
> 
>> You order some work which cost $100.
>> You have just $30, but you know, you will have $90 next week -
>> so the work can start....
> 
> I know the typical use case that you advocate yes.
> 
>> But it seems some users know it will cost $100, but they still think
>> the work could be done with $10 and it's will 'just' work the same....
> 
> No that's not what people want.
> 
> People want efficient usage of data without BTRFS, that's all.


What's wrong with BTRFS....

Either you want  fs & block layer tied together - that the btrfs/zfs approach

or you want

layered approach with separate 'fs' and block layer  (dm approach)

If you are advocating here to start mixing 'dm' with 'fs' layer, just
because you do not want to use 'btrfs' you'll probably not gain main traction 
here...

> 
>>> File system level failure can also not be critical because of using 
>>> non-critical volume because LVM might fail even though filesystem does not 
>>> fail or applications.
>>
>> So my Laptop machine has 32G RAM - so you can have 60% of dirty-pages
>> those may raise pretty major 'provisioning' storm....
> 
> Yes but still system does not need to crash, right.

We  need to see EXACTLY which kind of crash do you mean.

If you are using some older kernel - then please upgrade first and
provide proper BZ case with reproducer.

BTW you can imagine an out-of-space thin-pool with thin volume and filesystem 
as a FS, where some writes ends with 'write-error'.


If you think there is OS system which keeps running uninterrupted, while 
number of writes ends with 'error'  - show them :)  - maybe we should stop 
working on Linux and switch to that (supposedly much better) different OS....


>> But we are talking about generic case here no on some individual sub-cases
>> where some limitation might give you the chance to rescue better...
> 
> But no one in his right mind currently runs /rootvolume out of thin pool and 
> in pretty much all cases probably it is only used for data or for example of 
> hosting virtual hosts/containers/virtualized environments/guests.

You can have different pools and you can use rootfs  with thins to easily test 
i.e. system upgrades....

> So Data use for thin volume is pretty much intended/common/standard use case.
> 
> Now maybe amount of people that will be able to have running system after data 
> volumes overprovision/fill up/crash is limited.

Most thin-pool users are AWARE how to properly use it ;)  lvm2 tries to 
minimize (data-lost) impact for misused thin-pools - but we can't spend too 
much effort there....

So what is important:
'commited' data (i.e. transaction database) are never lost
fsck after reboot should work.

If any of these 2 condition do not work - that's serious bug.

But if you advocate for continuing system use of out-of-space thin-pool - that 
I'd probably recommend start sending patches...  as an lvm2 developer I'm not 
seeing this as best time investment but anyway...


> However, from both a theoretical and practical standpoint being able to just 
> shut down whatever services use those data volumes -- which is only possible 

Are you aware there is just one single page cache shared for all devices
in your system ?


> if base system is still running -- makes for far easier recovery than anything 
> else, because how are you going to boot system reliably without using any of 
> those data volumes? You need rescue mode etc.

Again do you have use-case where you see a crash of data mounted volume
on overfilled thin-pool ?

On my system - I could easily umount such volume after all 'write' requests
are timeouted (eventually use thin-pool with --errorwhenfull y   for instant 
error reaction.

So please can you stop repeating overfilled thin-pool with thin LV data volume 
kills/crashes machine - unless you open BZ and prove otherwise -  you will 
surely get 'fs' corruption  but nothing like crashing OS can be observed on my 
boxes....

We are here really interested in upstream issues - not about missing bug fixes 
  backports into every distribution  and its every released version....


> He might be able to recover his system if his system is still allowed to be 
> logged into.

There is no problem with that as long as  /rootfs has consistently working fs!

Regards

Zdene

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 11:34         ` Gionatan Danti
@ 2017-09-12 12:03           ` Zdenek Kabelac
  2017-09-12 12:47             ` Xen
  2017-09-12 16:57             ` Gionatan Danti
  0 siblings, 2 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-12 12:03 UTC (permalink / raw)
  To: LVM general discussion and development, Gionatan Danti

Dne 12.9.2017 v 13:34 Gionatan Danti napsal(a):
> On 12/09/2017 13:01, Zdenek Kabelac wrote:
>> There is very good reason why thinLV is fast - when you work with thinLV -
>> you work only with data-set for single thin LV.
>>
>>
>> Sad/bad news here - it's not going to work this way....
> 
> No, I absolutely *do not want* thinp to automatically dallocate/trash some 
> provisioned blocks. Rather, I all for something as "if free space is lower 
> than 30%, disable new snapshot *creation*"
> 



# lvs -a
   LV              VG Attr       LSize  Pool Origin Data%  Meta%  Move Log 
Cpy%Sync Convert
   [lvol0_pmspare] vg ewi-------  2,00m 

   lvol1           vg Vwi-a-tz-- 20,00m pool        40,00 

   pool            vg twi-aotz-- 10,00m             80,00  1,95 

   [pool_tdata]    vg Twi-ao---- 10,00m 

   [pool_tmeta]    vg ewi-ao----  2,00m 

[root@linux export]# lvcreate -V10 vg/pool
   Using default stripesize 64,00 KiB.
   Reducing requested stripe size 64,00 KiB to maximum, physical extent size 
32,00 KiB.
   Cannot create new thin volume, free space in thin pool vg/pool reached 
threshold.

# lvcreate -s vg/lvol1
   Using default stripesize 64,00 KiB.
   Reducing requested stripe size 64,00 KiB to maximum, physical extent size 
32,00 KiB.
   Cannot create new thin volume, free space in thin pool vg/pool reached 
threshold.

# grep thin_pool_autoextend_threshold /etc/lvm/lvm.conf
	# Configuration option activation/thin_pool_autoextend_threshold.
	# thin_pool_autoextend_threshold = 70
	thin_pool_autoextend_threshold = 70

So as you can see - lvm2 clearly prohibits you to create a new thinLV
when you are above defined threshold.


To keep things single for a user - we have a single threshold value.


So what else is missing ?


>> lvm2 also DOES protect you from creation of new thin-pool when the fullness
>> is about lvm.conf defined threshold - so nothing really new here...
> 
> Maybe I am missing something: this threshold is about new thin pools or new 
> snapshots within a single pool? I was really speaking about the latter.

Yes - threshold applies to 'extension' as well as to creation of new thinLV.
(and snapshot is just a new thinLV)

> Let me repeat: I do *not* want thinp to automatically drop anything. I simply 
> what it to disallow new snapshot/volume creation when unallocated space is too 
> low

as said - already implemented....

  Committed (fsynced) writes are safe, and this is very good. However, *many*
> application do not properly issue fsync(); this is a fact of life.
> 
> I absolutely *do not expect* thinp to automatically cope well with this 
> applications - I full understand & agree that application *must* issue proper 
> fsyncs.
> 

Unfortunatelly lvm2 nor dm  can be responsible for whole kernel logic and
all user-land apps...


Yes - anonymous pages cache is somewhat Achilles' heel - but it's not a 
problem of thin-pool - all other 'provisioning' systems has some troubles....

So we really cannot fix it here.

You would need to prove that different strategy is better and fix linux kernel 
for this.

Until this moment - you need use well written user-land apps :) properly 
syncing written data - or not use thin-provisioning (and others).

You can also minimize amount of 'dirty' pages to avoid loosing too much data
in case you hit full thin-pool unexpectedly.....

You can sync every second to minimize amount of dirty pages....

Lots of things....  all of them will in some other the other impact system 
performance....


> In the past, I testified that XFS take its relatively long time to recognize 
> that a thin volume is unavailable - and many async writes can be lost in the 
> process. Ext4 + data=journaled did a better job, but a) it is not the default 
> filesystem in RH anymore and b) data=journaled is not the default option and 
> has its share of problems.

journaled is very 'secure' - but also very slow....

So depends what you aim for.

But this really cannot be solved on DM side...

> So, if in the face of a near-full pool, thinp refuse me to create a new 
> filesystem, I would be happy :)

So you are already happy right  :) ?
Your wish is upstream already for quite some time ;)

Regards

Zdenbek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 11:46               ` Zdenek Kabelac
@ 2017-09-12 12:37                 ` Xen
  2017-09-12 14:37                   ` Zdenek Kabelac
  2017-09-12 17:00                 ` Gionatan Danti
  1 sibling, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-12 12:37 UTC (permalink / raw)
  To: linux-lvm

Zdenek Kabelac schreef op 12-09-2017 13:46:

> What's wrong with BTRFS....

I don't think you are a fan of it yourself.

> Either you want  fs & block layer tied together - that the btrfs/zfs 
> approach

Gionatan's responses used only Block layer mechanics.

> or you want
> 
> layered approach with separate 'fs' and block layer  (dm approach)

Of course that's what I want or I wouldn't be here.

> If you are advocating here to start mixing 'dm' with 'fs' layer, just
> because you do not want to use 'btrfs' you'll probably not gain main
> traction here...



You know Zdenek, it often appears to me your job here is to dissuade 
people from having any wishes or wanting anything new.

But if you look a little bit further, you will see that there is a lot 
more possible within the space that you define, than you think in a 
black & white vision.

"There are more things in Heaven and Earth, Horatio, than is dreamt of 
in your philosophy" ;-).

I am pretty sure many of the impossibilities you cite spring from a 
misunderstanding of what people want, you think they want something 
extreme, but it is often much more modest than that.

Although personally I would not mind communication between layers in 
which providing layer (DM) communicates some stuff to using layer (FS) 
but 90% of the time that is not even needed to implement what people 
would like.

Also we see ext4 being optimized around 4MB block sizes right? To create 
better allocation.

So that's example of "interoperation" without mixing layers.

I think Gionatan has demonstrated that pure block layer functionality, 
is possible to have more advanced protection ability that does not need 
any knowledge about filesystems.






> We  need to see EXACTLY which kind of crash do you mean.
> 
> If you are using some older kernel - then please upgrade first and
> provide proper BZ case with reproducer.

Yes apologies here, I responded to this thing earlier (perhaps a year 
ago) and the systems I was testing on was 4.4 kernel. So I cannot 
currently confirm and probably is already solved (could be right).

Back then the crash was kernel messages on TTY and then after some 20-30 
seconds total freeze. After I copied too much data to (test) thin pool.

Probably irrelevant now if already fixed.

> BTW you can imagine an out-of-space thin-pool with thin volume and
> filesystem as a FS, where some writes ends with 'write-error'.
> 
> 
> If you think there is OS system which keeps running uninterrupted,
> while number of writes ends with 'error'  - show them :)  - maybe we
> should stop working on Linux and switch to that (supposedly much
> better) different OS....

I don't see why you seem to think that devices cannot be logically 
separated from each other in terms of their error behaviour.

If I had a system crashing because I wrote to some USB device that was 
malfunctioning, that would not be a good thing either.

I have said repeatedly that the thin volumes are data volumes. Entire 
system should not come crashing down.

I am sorry if I was basing myself on older kernels in those messages, 
but my experience dates from a year ago ;-).

Linux kernel has had more issues with USB for example that are 
unacceptable, and even Linus Torvalds himself complained about it. 
Queues filling up because of pending writes to USB device and entire 
system grinds to a halt.

Unacceptable.

> You can have different pools and you can use rootfs  with thins to
> easily test i.e. system upgrades....

Sure but in the past GRUB2 would not work well with thin, I was basing 
myself on that...

I do not see real issue with using thin rootfs myself but grub-probe 
didn't work back then and OpenSUSE/GRUB guy attested to Grub not having 
thin support for that.

> Most thin-pool users are AWARE how to properly use it ;)  lvm2 tries
> to minimize (data-lost) impact for misused thin-pools - but we can't
> spend too much effort there....

Everyone would benefit from more effort being spent there, because it 
reduces the problem space and hence the burden on all those maintainers 
to provide all types of safety all the time.

EVERYONE would benefit.

> But if you advocate for continuing system use of out-of-space
> thin-pool - that I'd probably recommend start sending patches...  as
> an lvm2 developer I'm not seeing this as best time investment but
> anyway...

Not necessarily that the system continues in full operation, 
applications are allowed to crash or whatever. Just that system does not 
lock up.

But you say these are old problems and now fixed...

I am fine if filesystem is told "write error".

Then filesystem tells application "write error". That's fine.

But it might be helpful if "critical volumes" can reserve space in 
advance.

That is what Gionatan was saying...?

Filesystem can also do this itself but not knowing about thin layer it 
has to write random blocks to achieve this.

I.e. filesystem may guess about thin layout underneath and just write 1 
byte to each block it wants to allocate.

But feature could more easily be implemented by LVM -- no mixing of 
layers.

So number of (unallocated) blocks are reserved for critical volume.

When number drops below "needed" free blocks for those volumes, system 
starts returning errors for volumes not that critical volume.

I don't see why that would be such a disturbing feature.

You just cause allocator to error earlier for non-critical volumes, and 
allocator to proceed as long as possible for critical volumes.

Only think you need is runtime awareness of available free blocks.

You said before this is not efficiently possible.

Such awareness would be required, even if approximately, to implement 
any such feature.

But Gionatan was only talking about volume creation in latest messages.


>> However, from both a theoretical and practical standpoint being able 
>> to just shut down whatever services use those data volumes -- which is 
>> only possible
> 
> Are you aware there is just one single page cache shared for all 
> devices
> in your system ?

Well I know the kernel is badly designed in that area. I mean this was 
the source of the USB problems. Torvalds advocated lowering the size of 
the write buffer.

Which distributions then didn't do and his patch didn't even make it 
through :p.

He said "50 MB write cache should be enough for everyone" and not 10% of 
total memory ;-).

> Again do you have use-case where you see a crash of data mounted volume
> on overfilled thin-pool ?

Yes, again, old experiences.

> On my system - I could easily umount such volume after all 'write' 
> requests
> are timeouted (eventually use thin-pool with --errorwhenfull y   for
> instant error reaction.

That's good, I didn't have that back then (and still not).

It is Debian 8 / Kubuntu 16.04 systems.

> So please can you stop repeating overfilled thin-pool with thin LV
> data volume kills/crashes machine - unless you open BZ and prove
> otherwise -  you will surely get 'fs' corruption  but nothing like
> crashing OS can be observed on my boxes....

But when I talked about this a year ago you didn't seem to comprehend I 
was talking about an older system (back then not so old) or acknowledged 
that these problems had (once) existed, so I also didn't know they would 
now already be solved.

Sometimes if you just acknowledge problems were there before but not 
anymore, makes it a lot easier.

We spoke about this topic a year ago as well, and perhaps you didn't 
understand me because for you the problems were already fixed (in your 
LVM).

> We are here really interested in upstream issues - not about missing
> bug fixes  backports into every distribution  and its every released
> version....

I understand. But it's hard for me to know which is which.

These versions are in widespread use.

Compiling your own packages is also system maintenance burden etc.

So maybe our disagreement back then came from me experiencing something 
that was already solved upstream (or in later kernels).

>> He might be able to recover his system if his system is still allowed 
>> to be logged into.
> 
> There is no problem with that as long as  /rootfs has consistently 
> working fs!

Well I guess it was my Debian 8 / kernel 4.4 problem then...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 12:03           ` Zdenek Kabelac
@ 2017-09-12 12:47             ` Xen
  2017-09-12 13:51               ` pattonme
  2017-09-12 14:57               ` Zdenek Kabelac
  2017-09-12 16:57             ` Gionatan Danti
  1 sibling, 2 replies; 91+ messages in thread
From: Xen @ 2017-09-12 12:47 UTC (permalink / raw)
  To: linux-lvm

Zdenek Kabelac schreef op 12-09-2017 14:03:

> Unfortunatelly lvm2 nor dm  can be responsible for whole kernel logic 
> and
> all user-land apps...

What Gionatan also means, or at least what I mean here is,

If functioning is chain and every link can be the weakest link.

Then sometimes you can build in a little redundancy so that other weak 
links do not break so easily. Or that your part can cover it.

Linux has had a mindset of reducing redundancy lately. So every bug 
anywhere can break the entire thing.

Example was something I advocated for myself.

I installed GRUB2 inside PV reserved space.

That means 2nd sector had PV, 1st sector had MBR-like boot sector.

libblkid stopped at MBR and did not recognise PV.

Now because udev required libblkid to recognise PVs it did not recognise 
and did not activate it.

Problem.

Weakest link in this case libblkid.

Earlier vgchange -ay worked flawlessly (and had some redundancy) but was 
no longer used.

So you can see how small things can break entire system. Not good 
design.

Firmware RAID signature at end of drive also breaks system.

Not good design.


> You can also minimize amount of 'dirty' pages to avoid loosing too much 
> data
> in case you hit full thin-pool unexpectedly.....

Torvalds advocated this.

> You can sync every second to minimize amount of dirty pages....
> 
> Lots of things....  all of them will in some other the other impact
> system performance....

He said no people would be hurt by such a measure except people who 
wanted to unpack and compile kernel pure in page buffers ;-).

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 12:47             ` Xen
@ 2017-09-12 13:51               ` pattonme
  2017-09-12 14:57               ` Zdenek Kabelac
  1 sibling, 0 replies; 91+ messages in thread
From: pattonme @ 2017-09-12 13:51 UTC (permalink / raw)
  To: Xen

Is this the same xen? Because that was an actually intelligent and logical response. But like was the case a year ago there is no data or state sharing between the fs and LVM block so what you want is not possible and will never see the light of day. Use ZFS or btrfs or something else. Full stop.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 12:37                 ` Xen
@ 2017-09-12 14:37                   ` Zdenek Kabelac
  2017-09-12 16:44                     ` Xen
  2017-09-12 17:14                     ` Gionatan Danti
  0 siblings, 2 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-12 14:37 UTC (permalink / raw)
  To: LVM general discussion and development, Xen

Dne 12.9.2017 v 14:37 Xen napsal(a):
> Zdenek Kabelac schreef op 12-09-2017 13:46:
> 
> 
> You know Zdenek, it often appears to me your job here is to dissuade people 
> from having any wishes or wanting anything new.
> 
> But if you look a little bit further, you will see that there is a lot more 
> possible within the space that you define, than you think in a black & white 
> vision.

On block layer - there are many things  black & white....

If you don't know which process 'create' written page, nor if you write
i.e. filesystem data or metadata or any other sort of 'metadata' information,
you can hardly do any 'smartness' logic on thin block level side.

> Although personally I would not mind communication between layers in which 
> providing layer (DM) communicates some stuff to using layer (FS) but 90% of 
> the time that is not even needed to implement what people would like.

The philosophy with DM device is - you can replace then online with something 
else - i.e. you could have a linear LV  which is turned to 'RAID" and than it 
could be turned to   'Cache RAID'  and then even to thinLV -  all in one raw
on life running system.

So what filesystem should be doing in this case ?

Should be doing complex question of block-layer underneath - checking current 
device properties - and waiting till the IO operation is processed  - before 
next IO comes in the process - and repeat the  some  in very synchronous
slow logic ??    Can you imagine how slow this would become ?

The main problem here is - the user typically only see a one single localized 
problem - without putting it into a global context.

So of course - if you 'restrict' a device stack to some predefined fixed
state which holds 'forever'  you may get far more chances to get couple things 
running in some more optimal way - but that's not what lvm2 aims to support.

We are targeting 'generic' usage not a specialized case - which fits 1 user 
out of 1000000 - and every other user needs something 'slightly' different....


> Also we see ext4 being optimized around 4MB block sizes right? To create 
> better allocation

I don't think there is anything related...
Thin chunk-size ranges from 64KiB to 1GiB....

> So that's example of "interoperation" without mixing layers.

The only inter-operation is the main filesystem (like extX & XFS) are getting 
fixed for better reactions for ENOSPC...
and WAY better behavior when there are 'write-errors' - surprisingly there 
were numerous faulty logic and expectation encoded in them...

> I think Gionatan has demonstrated that pure block layer functionality, is 
> possible to have more advanced protection ability that does not need any 
> knowledge about filesystems.

thin-pool provides  same level of protection in terms of not letting you 
create a new thin-lv when thin-pool is above configure threshold...


And to compare apples with apples - you need to compare performance of

ZFS with zpolls with thin with thinpools running directly on top of device.

If zpools - are 'equally' fast as thins  - and gives you better protection,
and more sane logic the why is still anyone using thins???

I'd really love to see some benchmarks....

Of course if you slow down speed of thin-pool and add way more synchronization 
points and consume 10x more memory :) you can get better behavior in those 
exceptional cases which are only hit by unexperienced users who tends to 
intentionally use thin-pools in incorrect way.....

> 
> 
> Yes apologies here, I responded to this thing earlier (perhaps a year ago) and 
> the systems I was testing on was 4.4 kernel. So I cannot currently confirm and 
> probably is already solved (could be right).
> 
> Back then the crash was kernel messages on TTY and then after some 20-30 

there is by default 60sec freeze, before unresized thin-pool start to reject
all write to unprovisioned space as 'error' and switches to out-of-space 
state.  There is though a difference if you are out-of-space in data
or metadata -  the later one is more complex...

>> If you think there is OS system which keeps running uninterrupted,
>> while number of writes ends with 'error'� - show them :)� - maybe we
>> should stop working on Linux and switch to that (supposedly much
>> better) different OS....
> 
> I don't see why you seem to think that devices cannot be logically separated 
> from each other in terms of their error behaviour.


In page cache there are no thing logically separated - you have 'dirty' pages
you need to write somewhere - and if you writes leads to errors,
and system reads errors back instead of real-data - and your execution
code start to run on completely unpredictable data-set - well 'clean' reboot 
is still very nice outcome IMHO....

> If I had a system crashing because I wrote to some USB device that was 
> malfunctioning, that would not be a good thing either.

Well try to BOOT from USB :) and detach and then compare...
Mounting user data and running user-space tools out of USB is uncomparable...


> Linux kernel has had more issues with USB for example that are unacceptable, 
> and even Linus Torvalds himself complained about it. Queues filling up because 
> of pending writes to USB device and entire system grinds to a halt.
> 
> Unacceptable.

AFAIK - this is still not resolved issue...


> 
>> You can have different pools and you can use rootfs� with thins to
>> easily test i.e. system upgrades....
> 
> Sure but in the past GRUB2 would not work well with thin, I was basing myself 
> on that...

/boot   cannot be on thin

/rootfs  is not a problem - there will be even some great enhancement for Grub
to support this more easily and switching between various snapshots...



>> Most thin-pool users are AWARE how to properly use it ;)� lvm2 tries
>> to minimize (data-lost) impact for misused thin-pools - but we can't
>> spend too much effort there....
> 
> Everyone would benefit from more effort being spent there, because it reduces 
> the problem space and hence the burden on all those maintainers to provide all 
> types of safety all the time.
> 
> EVERYONE would benefit.

Fortunately most users NEVER need it ;)
Since they properly operate thin-pool and understand it's weak points....


> Not necessarily that the system continues in full operation, applications are 
> allowed to crash or whatever. Just that system does not lock up.

When you get bad data from your block device - your system's reaction is 
unpredictable -  if your /rootfs cannot store its metadata - the most sane 
behavior is to stop - all other solutions are so complex and complicated, that 
spending resources to avoid hitting this state are way better spent effort...

Lvm2 ensures block layer behavior is sane - but cannot be held responsible 
that all layers above  are 'sane' as well...

If you hit 'fs' bug - report the issue to fs maintainer.
If you experience user-space faulty app - solve the issue there.

> Then filesystem tells application "write error". That's fine.
> 
> But it might be helpful if "critical volumes" can reserve space in advance.

Once again -  USE different pool - solve problems at proper level....
Do not over-provision critical volumes...


> I.e. filesystem may guess about thin layout underneath and just write 1 byte 
> to each block it wants to allocate.

:) so how do you resolve error paths -  i.e. how do you restore space
you have not actually used....
There are so many problems with this you can't even imagine...
Yeah - we've spent quite some time in past analyzing those paths....

> So number of (unallocated) blocks are reserved for critical volume.

Please finally stop thinking about  some 'reserved' storage for critical 
volume. It leads to nowhere....


> When number drops below "needed" free blocks for those volumes, system starts 
> returning errors for volumes not that critical volume.

Do the right action at right place.

For critical volume  use  non-overprovisiong pools - there is nothing better 
you can do - seriously!

For other cases - resolve the issue at userspace when dmeventd calls you...


> 
> I don't see why that would be such a disturbing feature.
> 

Maybe start to understand how kernel works in practice ;)

Otherwise you spend you live boring developers with ideas which simply cannot 
work...

> You just cause allocator to error earlier for non-critical volumes, and 
> allocator to proceed as long as possible for critical volumes.


So use 2 different POOLS, problem solved....

You need to focus on simple solution for a problem instead of exponentially 
over-complicating 'bad' solution....


> We spoke about this topic a year ago as well, and perhaps you didn't 
> understand me because for you the problems were already fixed (in your LVM).

As said - if you see a problem/bug  - open BZ  case - so it'd be analyzed  - 
instead of spreading FUD in mailing list, where noone tells which version of 
lvm2 and which kernel version -  but we are just informed it's crashing and 
unusable...

> 
>> We are here really interested in upstream issues - not about missing
>> bug fixes� backports into every distribution� and its every released
>> version....
> 
> I understand. But it's hard for me to know which is which.
> 
> These versions are in widespread use.
> 
> Compiling your own packages is also system maintenance burden etc.

Well it's always about checking 'upstream' first and then bothering your 
upstream maintainer...

Eventually switching to distribution with better support in case your existing 
one has 'nearly' zero reaction....

> So maybe our disagreement back then came from me experiencing something that 
> was already solved upstream (or in later kernels).

Yes - we are always interested in upstream problem.

We really cannot be solving problems of every possible deployed combination of 
software.


Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 12:47             ` Xen
  2017-09-12 13:51               ` pattonme
@ 2017-09-12 14:57               ` Zdenek Kabelac
  2017-09-12 16:49                 ` Xen
  1 sibling, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-12 14:57 UTC (permalink / raw)
  To: LVM general discussion and development, Xen

Dne 12.9.2017 v 14:47 Xen napsal(a):
> Zdenek Kabelac schreef op 12-09-2017 14:03:
> 
>> Unfortunatelly lvm2 nor dm� can be responsible for whole kernel logic and
>> all user-land apps...
> 
> What Gionatan also means, or at least what I mean here is,
> 
> If functioning is chain and every link can be the weakest link.
> 
> Then sometimes you can build in a little redundancy so that other weak links 
> do not break so easily. Or that your part can cover it.
> 
> Linux has had a mindset of reducing redundancy lately. So every bug anywhere 
> can break the entire thing.
> 
> Example was something I advocated for myself.
> 
> I installed GRUB2 inside PV reserved space.
> 
> That means 2nd sector had PV, 1st sector had MBR-like boot sector.
> 
> libblkid stopped at MBR and did not recognise PV.
> 

This bug has been reported (by me even to libblkid maintainer) AND already 
fixed already in past....

Yes - surprise software has bugs...

But to defend a bit libblkid maintainer side :) - this feature was not really 
well documented from lvm2 side...

>> You can sync every second to minimize amount of dirty pages....
>>
>> Lots of things....� all of them will in some other the other impact
>> system performance....
> 
> He said no people would be hurt by such a measure except people who wanted to 
> unpack and compile kernel pure in page buffers ;-).


So clearly you need to spend resources effectively and support both groups...
Sometimes is better to use large RAM (common laptops have 32G of RAM nowadays)
Sometimes is better to have more 'data' securely and permanently stored...


Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 14:37                   ` Zdenek Kabelac
@ 2017-09-12 16:44                     ` Xen
  2017-09-12 17:14                     ` Gionatan Danti
  1 sibling, 0 replies; 91+ messages in thread
From: Xen @ 2017-09-12 16:44 UTC (permalink / raw)
  To: linux-lvm

Zdenek Kabelac schreef op 12-09-2017 16:37:

> On block layer - there are many things  black & white....
> 
> If you don't know which process 'create' written page, nor if you write
> i.e. filesystem data or metadata or any other sort of 'metadata' 
> information,
> you can hardly do any 'smartness' logic on thin block level side.

You can give any example to say that something is black and white 
somewhere, but I made a general point there, nothing specific.

> The philosophy with DM device is - you can replace then online with
> something else - i.e. you could have a linear LV  which is turned to
> 'RAID" and than it could be turned to   'Cache RAID'  and then even to
> thinLV -  all in one raw
> on life running system.

I know.

> So what filesystem should be doing in this case ?

I believe in most of these systems you cite the default extent size is 
still 4MB, or am I mistaken?

> Should be doing complex question of block-layer underneath - checking
> current device properties - and waiting till the IO operation is
> processed  - before next IO comes in the process - and repeat the
> some  in very synchronous
> slow logic ??    Can you imagine how slow this would become ?

You mean a synchronous way of checking available space in thin volume by 
thin pool manager?

> We are targeting 'generic' usage not a specialized case - which fits 1
> user out of 1000000 - and every other user needs something 'slightly'
> different....

That is completely exaggerative.

I think you will find this issue comes up often enough to think that it 
is not one out of 1000000 and besides unless performance considerations 
are at the heart of your ...reluctance ;-) no one stands to lose 
anything.

So only question is design limitations or architectural considerations 
(performance), not whether it is a wanted feature or not (it is).


> I don't think there is anything related...
> Thin chunk-size ranges from 64KiB to 1GiB....

Thin allocation is not by default in extent-sizes?

> The only inter-operation is the main filesystem (like extX & XFS) are
> getting fixed for better reactions for ENOSPC...
> and WAY better behavior when there are 'write-errors' - surprisingly
> there were numerous faulty logic and expectation encoded in them...

Well that's good right. But I did read here earlier about work between 
ExtFS team and LVM team to improve allocation characteristics to better 
align with underlying block boundaries.

> If zpools - are 'equally' fast as thins  - and gives you better 
> protection,
> and more sane logic the why is still anyone using thins???

I don't know. I don't like ZFS. Precisely because it is a 'monolith' 
system that aims to be everything. Makes it more complex and harder to 
understand, harder to get into, etc.

> Of course if you slow down speed of thin-pool and add way more
> synchronization points and consume 10x more memory :) you can get
> better behavior in those exceptional cases which are only hit by
> unexperienced users who tends to intentionally use thin-pools in
> incorrect way.....

I'm glad you like us ;-).

>> Yes apologies here, I responded to this thing earlier (perhaps a year 
>> ago) and the systems I was testing on was 4.4 kernel. So I cannot 
>> currently confirm and probably is already solved (could be right).
>> 
>> Back then the crash was kernel messages on TTY and then after some 
>> 20-30
> 
> there is by default 60sec freeze, before unresized thin-pool start to 
> reject
> all write to unprovisioned space as 'error' and switches to
> out-of-space state.  There is though a difference if you are
> out-of-space in data
> or metadata -  the later one is more complex...

I can't say whether it was that or not. I am pretty sure the entire 
system froze for longer than 60 seconds.

> In page cache there are no thing logically separated - you have 'dirty' 
> pages
> you need to write somewhere - and if you writes leads to errors,
> and system reads errors back instead of real-data - and your execution
> code start to run on completely unpredictable data-set - well 'clean'
> reboot is still very nice outcome IMHO....

Well even if that means some dirty pages are lost before the application 
discovers it, any read or write errors should at some point lead to the 
application to shut down right.

I think for most applications the most sane behaviour would simply be to 
shut down.

Unless there is more sophisticated error handling.

I am not sure what we are arguing about at this point.

Application needs to go anyway.


>> If I had a system crashing because I wrote to some USB device that was 
>> malfunctioning, that would not be a good thing either.
> 
> Well try to BOOT from USB :) and detach and then compare...
> Mounting user data and running user-space tools out of USB is 
> uncomparable...

Systems would also grind to a halt from user-data and not system files.

I know booting from USB can be 1000x slower than user data.

But shared page cache for all devices is bad design, period.

> AFAIK - this is still not resolved issue...

That's a shame.

>>> You can have different pools and you can use rootfs  with thins to
>>> easily test i.e. system upgrades....
>> 
>> Sure but in the past GRUB2 would not work well with thin, I was basing 
>> myself on that...
> 
> /boot   cannot be on thin
> 
> /rootfs  is not a problem - there will be even some great enhancement 
> for Grub
> to support this more easily and switching between various snapshots...

That's great, like with BTRFS I guess that this is possible?

But /rootfs was a problem. Grub-probe reported that it could not find 
the rootfs.

When I ran with custom grub config it worked fine. It was only 
grub-probe that failed, nothing else (Kubuntu 16.04).

>> EVERYONE would benefit.
> 
> Fortunately most users NEVER need it ;)

You're wrong. The assurance of a system not crashing (for instance) or 
some sane behaviour in case of fill-up, will put many minds at ease.

> Since they properly operate thin-pool and understand it's weak 
> points....

Yes they are all superhumans right.

I am sorry for being so inferior ;-).


>> Not necessarily that the system continues in full operation, 
>> applications are allowed to crash or whatever. Just that system does 
>> not lock up.
> 
> When you get bad data from your block device - your system's reaction
> is unpredictable -  if your /rootfs cannot store its metadata - the
> most sane behavior is to stop - all other solutions are so complex and
> complicated, that spending resources to avoid hitting this state are
> way better spent effort...

About rootfs, I agree.

But the nominal distinction was between thin-as-system and thin-as-data.

If you say that thin-as-data is specific use case that cannot be 
tailored for, that is a bit odd. It is still 90% of use.

> Once again -  USE different pool - solve problems at proper level....
> Do not over-provision critical volumes...

Again what we want is a valid use case and a valid request.

If the system is designed so badly (or designed in such a way) that it 
cannot be achieved, that does not immediately make it a bad wish.

For example if a problem is caused by the page-cache of the kernel being 
for all block devices at once, then anyone wanting something that is 
impossible because of that system...

...does not make that person bad for wanting it.

It makes the kernel bad for not achieving it.


I am sure your programmers are good enough to achieve asynchronous 
state-updating for a thin-pool that does not interfere with allocation 
to the extent that it will lazily update stats and which point 
allocation constraints might be basing themselves on older data (maybe 
seconds old) but that still doesn't mean it is useless.

It doesn't have to be perfect.

If my "critical volume" wants 1000 free extents, but it only has 988, 
that is not so great a problem.

Of course, I know, I hear you say "Use a different pool".

The whole idea for thin is resource efficiency.

There is no real reason that this "space reservation" can't happen.

Even if due to current design limitations, that might be there for a 
good reason, you are the arbiter on that.

It cannot be perfect or has to happen asynchronously.

It is better if non-critical volume starts failing than critical volume.

Failure is imminent, but we can choose which fails first.




I mean your argument is no different from.

"We need better man pages."

"REAL system administrators can use current man pages just fine."

"But any improvement would also benefit them, no need for them to do 
hard stuff when it can be easier."

"Since REAL system administrators can do their job as it is, our 
priorities lie elsewhere."

It's a stupid argument.

Any investment in user friendliness pays off for everyone.

Linux is often so impossible to use because no one makes that 
investment, even though it would have immeasurable benefits for 
everyone.

And then when someone does make the effort (e.g. makefile that displays 
help screen when run with no arguments) someone complains that it breaks 
the contract that "make" should start compiling instantly, thus using 
"status quo" as a way to never improve anything.

In this case, make "help screen" can save people litterally hours of 
time, multiplied by a 1000 people at least.


>> I.e. filesystem may guess about thin layout underneath and just write 
>> 1 byte to each block it wants to allocate.
> 
> :) so how do you resolve error paths -  i.e. how do you restore space
> you have not actually used....
> There are so many problems with this you can't even imagine...
> Yeah - we've spent quite some time in past analyzing those paths....

In this case it seems that if this is possible for regular files (and 
directories in that sense) it should also be possible for "magic" files 
and directories that only exist to allocate some space somewhere. In any 
case it is FS issue, not LVM.

Besides, you only strengthen my argument that it isn't FS that should be 
doing it.


> Please finally stop thinking about  some 'reserved' storage for
> critical volume. It leads to nowhere....

It leads to you trying to convince me it isn't possible.

But no matter how much you try to dissuade, it is still an acceptable 
use case and desire.



> Do the right action at right place.
> 
> For critical volume  use  non-overprovisiong pools - there is nothing
> better you can do - seriously!

For Gionatan's use case the problem was poor performance of 
non-overprovisioning system.



> Maybe start to understand how kernel works in practice ;)

Or how it doesn't work ;-).

Like,

I will give stupid example.

Suppose using a pen is illegal.

Now lots of people want to use pen, but they end up in jail.

Now you say "Wanting to use pen is bad desire, because of consequences".

But it's pretty clear the desire won't go away.

And the real solution needs to be had at changing the law.


In this case, people really want something and for good reasons. If 
there are structural reasons that it cannot be achieved, that is just 
that.

That doesn't mean the desires are bad.



You can forever keep saying "Do this instead" but that still doesn't 
ever make the prime desires bad.

"Don't use a pen, use a pencil. Problem solved."

Doesn't make wanting to use a pen a bad desire, nor does it make wanting 
some safe space in provisioning a bad desire ;-).


> Otherwise you spend you live boring developers with ideas which simply
> cannot work...

Or maybe changing their mind, who knows ;-).


> So use 2 different POOLS, problem solved....

Was not possible for Gionatan's use case.

Myself I do not use critical volume, but I can imagine still wanting 
some space efficiency even when "criticalness" from one volume to the 
next differs.




It is proper desire Zdenek. Even if LVM can't do it.


> Well it's always about checking 'upstream' first and then bothering
> your upstream maintainer...

If you knew about the pre-existing problems, you could have informed me.

In fact it has happened that you said something cannot be done, and then 
someone else said "Yes, this has been a problem, we have been working on 
it and problems should be resolved now in this version".



You spend most of your time denying that something is wrong.

And then someone else says "Yes, this has been an issue, it is resolved 
now".

If you communicate more clearly then you also have less people bugging 
you.

> We really cannot be solving problems of every possible deployed
> combination of software.

The issue is more that at some point this was the main released version.

Main released kernel and main released LVM, in a certain sense.


Some of your colleagues are a little more forthcoming with 
acknowledgements that something has been failing.

This would considerably cut down the amount of time you spend being 
"bored" because you try to fight people who are trying to tell you 
something.

If you say "Oh yes, I think you mean this and that, yes that's a problem 
and we are working on it" or "Yes, that was the case before, this 
version fixes that" then


these long discussions also do not need to happen.

But you almost never say "Yes it's a problem", Zdenek.

That's why we always have these debates ;-).

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 14:57               ` Zdenek Kabelac
@ 2017-09-12 16:49                 ` Xen
  0 siblings, 0 replies; 91+ messages in thread
From: Xen @ 2017-09-12 16:49 UTC (permalink / raw)
  To: linux-lvm

Zdenek Kabelac schreef op 12-09-2017 16:57:

> This bug has been reported (by me even to libblkid maintainer) AND
> already fixed already in past....

I was the one who reported it.

This was Karel Zak's message from 30 august 2016:

"On Fri, Aug 19, 2016 at 01:14:29PM +0200, Karel Zak wrote:
On Thu, Aug 18, 2016 at 10:39:30PM +0200, Xen wrote:
Would someone be will to fix the issue that a Physical Volume from LVM2 
(PV)
when placed directly on disk (no partitions or partition tables) will 
not be

This is very unusual setup, but according to feedback from LVM guys
it's supported, so I will improve blkid to support it too.

Fixed in the git tree (for the next v2.29). Thanks.

     Karel"

So yes, I knew what I was talking about.

At least slightly ;-).

:p.


> But to defend a bit libblkid maintainer side :) - this feature was not
> really well documented from lvm2 side...

That's fine.

>>> You can sync every second to minimize amount of dirty pages....
>>> 
>>> Lots of things....  all of them will in some other the other impact
>>> system performance....
>> 
>> He said no people would be hurt by such a measure except people who 
>> wanted to unpack and compile kernel pure in page buffers ;-).
> 
> 
> So clearly you need to spend resources effectively and support both 
> groups...
> Sometimes is better to use large RAM (common laptops have 32G of RAM 
> nowadays)

Yes and he said those people wanting to compile the kernel purely in 
memory (without using RAM disk for it) have issues anyway...

;-).

So no it is not that clear that you need to support both groups. 
Certainly not by default.

Or at least not in its default configuration for some dirty page file 
flag ;-).

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 12:03           ` Zdenek Kabelac
  2017-09-12 12:47             ` Xen
@ 2017-09-12 16:57             ` Gionatan Danti
  1 sibling, 0 replies; 91+ messages in thread
From: Gionatan Danti @ 2017-09-12 16:57 UTC (permalink / raw)
  To: Zdenek Kabelac, LVM general discussion and development

On 12/09/2017 14:03, Zdenek Kabelac wrote:> # lvs -a
>    LV              VG Attr       LSize  Pool Origin Data%  Meta%  Move 
> Log Cpy%Sync Convert
>    [lvol0_pmspare] vg ewi-------  2,00m
>    lvol1           vg Vwi-a-tz-- 20,00m pool        40,00
>    pool            vg twi-aotz-- 10,00m             80,00  1,95
>    [pool_tdata]    vg Twi-ao---- 10,00m
>    [pool_tmeta]    vg ewi-ao----  2,00m
> [root@linux export]# lvcreate -V10 vg/pool
>    Using default stripesize 64,00 KiB.
>    Reducing requested stripe size 64,00 KiB to maximum, physical extent 
> size 32,00 KiB.
>    Cannot create new thin volume, free space in thin pool vg/pool 
> reached threshold.
> 
> # lvcreate -s vg/lvol1
>    Using default stripesize 64,00 KiB.
>    Reducing requested stripe size 64,00 KiB to maximum, physical extent 
> size 32,00 KiB.
>    Cannot create new thin volume, free space in thin pool vg/pool 
> reached threshold.
> 
> # grep thin_pool_autoextend_threshold /etc/lvm/lvm.conf
>      # Configuration option activation/thin_pool_autoextend_threshold.
>      # thin_pool_autoextend_threshold = 70
>      thin_pool_autoextend_threshold = 70
> 
> So as you can see - lvm2 clearly prohibits you to create a new thinLV
> when you are above defined threshold.

Hi Zdenek,
this is very good news (for me at least). Thank you very much for 
pointing me that!

Anyway, I can not find the relative configuration variable in lvm.conf. 
I am on 2.02.166(2)-RHEL7, should I use a newer LVM version to set this 
threshold?

> To keep things single for a user - we have a single threshold value.
> 
> 
> So what else is missing ?

This is a very good step, indeed. However, multiple threshold (maybe 
attached/counted against the single thin volume, in a manner similar to 
how refreservation does for ZVOLs) would be even better (in my use case, 
at least).

> Unfortunatelly lvm2 nor dm  can be responsible for whole kernel logic and
> all user-land apps...

Again, I am *not* saying, nor asking, that.

I would simply like to use thinp without fearing that "forgotten" 
snapshot fill up the thin pool. I have shown how this can easily 
achieved with ZVOLs and careful use/setting of the refreservation value, 
without any upper layer knowledge and/or intra-layer communications.

> So you are already happy right  :) ?

Sure! :)
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 11:46               ` Zdenek Kabelac
  2017-09-12 12:37                 ` Xen
@ 2017-09-12 17:00                 ` Gionatan Danti
  2017-09-12 23:25                   ` Brassow Jonathan
  1 sibling, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-12 17:00 UTC (permalink / raw)
  To: LVM general discussion and development, Zdenek Kabelac, Xen

On 12/09/2017 13:46, Zdenek Kabelac wrote:
> 
> What's wrong with BTRFS....
> 
> Either you want  fs & block layer tied together - that the btrfs/zfs 
> approach

BTRFS really has a ton of performance problem - please, don't recommend 
it for anything IO intensive (as virtual machines and databases).

Moreover, RedHat now officially deprecated it...

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 14:37                   ` Zdenek Kabelac
  2017-09-12 16:44                     ` Xen
@ 2017-09-12 17:14                     ` Gionatan Danti
  2017-09-12 21:57                       ` Zdenek Kabelac
  1 sibling, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-12 17:14 UTC (permalink / raw)
  To: LVM general discussion and development, Zdenek Kabelac, Xen

On 12/09/2017 16:37, Zdenek Kabelac wrote:
> ZFS with zpolls with thin with thinpools running directly on top of device.
> 
> If zpools - are 'equally' fast as thins  - and gives you better protection,
> and more sane logic the why is still anyone using thins???
> 
> I'd really love to see some benchmarks....
> 
> Of course if you slow down speed of thin-pool and add way more 
> synchronization points and consume 10x more memory :) you can get better 
> behavior in those exceptional cases which are only hit by unexperienced 
> users who tends to intentionally use thin-pools in incorrect way.....

Having benchmarked them, I can reply :)

ZFS/ZVOLs surely are slower than thinp, full stop.
However, they are not *massively* slower.

To tell the truth, what somewhat slow me on ZFS adoption is its low 
integration in the Linux kernel subsystem. For example:
- cache duplication (ARC + pagecache)
- slow reclaim of memory used for caching
- SPL (sun porting layer)
- dependency on 3rd part module
...

Thinp is great tech - and I am already using it. I did not started this 
thread as ZFS vs Thinp, really. Rather, I would like to understand how 
to better use thinp, and I traced a parallelism with ZVOLs, nothing more.

Thanks.

-- 
-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 17:14                     ` Gionatan Danti
@ 2017-09-12 21:57                       ` Zdenek Kabelac
  2017-09-13 17:41                         ` Xen
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-12 21:57 UTC (permalink / raw)
  To: LVM general discussion and development, Gionatan Danti, Xen

Dne 12.9.2017 v 19:14 Gionatan Danti napsal(a):
> On 12/09/2017 16:37, Zdenek Kabelac wrote:
>> ZFS with zpolls with thin with thinpools running directly on top of device.
>>
>> If zpools - are 'equally' fast as thins� - and gives you better protection,
>> and more sane logic the why is still anyone using thins???
>>
>> I'd really love to see some benchmarks....
>>
>> Of course if you slow down speed of thin-pool and add way more 
>> synchronization points and consume 10x more memory :) you can get better 
>> behavior in those exceptional cases which are only hit by unexperienced 
>> users who tends to intentionally use thin-pools in incorrect way.....
> 
> Having benchmarked them, I can reply :)
> 
> ZFS/ZVOLs surely are slower than thinp, full stop.
> However, they are not *massively* slower.

Users interested in thin-provisioning are really mostly interested in 
performance - especially on multicore machines with lots of fast storage with 
high IOPS throughput  (some of them even expect it should be at least as good 
as linear....)

So ATM it's preferred to have more complex 'corner-case' which really mostly 
never happens when thin-pool is operated properly and the remaining use case 
you don't pay higher price for having all data always in sync and also you get 
way lower memory foot-print
(I think especially ZFS is well known for nontrivial memory resource consumption)

As has been pointed already few times in this thread - lots of those
'reserved space' ideas can be already easily handled by just more advanced 
scripting around notification from dmeventd - if you will keep thinking for a 
while you will at some point see the reasoning.

There is no difference if you start to solve problem around 70% fullness then
100%  - the main difference is - with some 'free-space' in thin-pool you can 
resolve problem way more easily and correctly.

Repeated again - whoever targets for 100% full thin-pool usage has 
misunderstood purpose of thin-provisioning.....

Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 17:00                 ` Gionatan Danti
@ 2017-09-12 23:25                   ` Brassow Jonathan
  2017-09-13  8:15                     ` Gionatan Danti
  2017-09-13 18:43                     ` Xen
  0 siblings, 2 replies; 91+ messages in thread
From: Brassow Jonathan @ 2017-09-12 23:25 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: Xen, Zdenek Kabelac

Hi,

I’m the manager of the LVM/DM team here at Red Hat.  Let me thank those of you who have taken the time to share how we might improve LVM thin-provisioning.  We really do appreciate it and you ideas are welcome.

I see merit in the ideas you’ve presented and if I’ve got it right, there are two main ones:
1) don’t allow creation of new thinLVs or snapshots in a pool that is beyond a certain threshold
2) allow users to reserve some space for critical volumes when a threshold is reached

I believe that #1 is already handled, are you looking for anything else?

#2 doesn’t seem crazy hard to implement - even in script form.  In RHEL7.4 (upstream = "Version 2.02.169 - 28th March 2017”), we introduced the lvm.conf:dmeventd/thin_command setting.  You can run anything you want through a script.  Right now, it is set to do lvextend in an attempt to add more space to a filling thin-pool.  However, you don’t need to be so limited.  I imagine the following:
- Add a “critical” tag to all thinLVs that are very important:
	# lvchange --addtag critical vg/thinLV
- Create script that is called by thin_command, it should:
- check if a threshold is reached (i.e. your reserved space) and if so,
- report all lvs associated with the thin-pool that are NOT critical:
	# lvs -o name --noheadings --select 'lv_tags!=critical && pool_lv=thin-pool’ vg
- run <command> on those non-critical volumes, where <command> could be:
	# fsfreeze <mnt_point>

The above should have the result you want - essentially locking out all non-critical file systems.  The admin can easily turn them back on via fsfreeze one-by-one as they resolve the critical lack of space.  If you find this too heavy-handed, perhaps try something else for <command> instead first.

If the above is sufficient, then great.  If you’d like to see something like this added to the LVM repo, then you can simply reply here with ‘yes’ and maybe provide a sentence of what the scenario is that it would solve.  (I know there are already some listed in this thread, but I’m wondering about those folks that think the script is insufficient and believe this should be more standard.)


Thanks,
 brassow

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 23:25                   ` Brassow Jonathan
@ 2017-09-13  8:15                     ` Gionatan Danti
  2017-09-13  8:33                       ` Zdenek Kabelac
  2017-09-13 18:43                     ` Xen
  1 sibling, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-13  8:15 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: Xen, Brassow, Zdenek Kabelac

Hi Jonathan,

Il 13-09-2017 01:25 Brassow Jonathan ha scritto:
> Hi,
> 
> I’m the manager of the LVM/DM team here at Red Hat.  Let me thank
> those of you who have taken the time to share how we might improve LVM
> thin-provisioning.  We really do appreciate it and you ideas are
> welcome.
> 
> I see merit in the ideas you’ve presented and if I’ve got it right,
> there are two main ones:
> 1) don’t allow creation of new thinLVs or snapshots in a pool that is
> beyond a certain threshold
> 2) allow users to reserve some space for critical volumes when a
> threshold is reached
> 
> I believe that #1 is already handled, are you looking for anything 
> else?

Yeah, this is coverd by the appropriate use of 
snapshot_autoextend_percent. I did not realized that, thanks to Zdenek 
for pointing me to the right direction.

> #2 doesn’t seem crazy hard to implement - even in script form.  In
> RHEL7.4 (upstream = "Version 2.02.169 - 28th March 2017”), we
> introduced the lvm.conf:dmeventd/thin_command setting.  You can run
> anything you want through a script.  Right now, it is set to do
> lvextend in an attempt to add more space to a filling thin-pool.
> However, you don’t need to be so limited.  I imagine the following:
> - Add a “critical” tag to all thinLVs that are very important:
> 	# lvchange --addtag critical vg/thinLV
> - Create script that is called by thin_command, it should:
> - check if a threshold is reached (i.e. your reserved space) and if so,
> - report all lvs associated with the thin-pool that are NOT critical:
> 	# lvs -o name --noheadings --select 'lv_tags!=critical && 
> pool_lv=thin-pool’ vg
> - run <command> on those non-critical volumes, where <command> could 
> be:
> 	# fsfreeze <mnt_point>
> 
> The above should have the result you want - essentially locking out
> all non-critical file systems.  The admin can easily turn them back on
> via fsfreeze one-by-one as they resolve the critical lack of space.
> If you find this too heavy-handed, perhaps try something else for
> <command> instead first.

Very good suggestion. Actually, fsfreeze should works without too much 
drama.

> If the above is sufficient, then great.  If you’d like to see
> something like this added to the LVM repo, then you can simply reply
> here with ‘yes’ and maybe provide a sentence of what the scenario is
> that it would solve.  (I know there are already some listed in this
> thread, but I’m wondering about those folks that think the script is
> insufficient and believe this should be more standard.)

Yes, surely.

The combination of #1 and #2 should give the desired outcome (I quickly 
tested it and I found no evident problems).

Jonathan, Zdeneck, thanks you very much.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-13  8:15                     ` Gionatan Danti
@ 2017-09-13  8:33                       ` Zdenek Kabelac
  0 siblings, 0 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-13  8:33 UTC (permalink / raw)
  To: Gionatan Danti, LVM general discussion and development; +Cc: Xen

D     # fsfreeze <mnt_point>
>>
>> The above should have the result you want - essentially locking out
>> all non-critical file systems.  The admin can easily turn them back on
>> via fsfreeze one-by-one as they resolve the critical lack of space.
>> If you find this too heavy-handed, perhaps try something else for
>> <command> instead first.
> 
> Very good suggestion. Actually, fsfreeze should works without too much drama.
>

Think about this case:

original volume with number of timely taken snapshots.

If you ONLY use 'read-only' snaps - there is not much thing to do - writing to 
the origin gives you quite 'precise' estimation how much data are in progress.
(seeing amount of dirty-pages....)

However when all other snapshots  (i.e. VM machines) are in-use and also do 
have writable data in progress -  invoking 'fsfreeze'  operation has 
unpredictable amount of provisioning in front of you (all your dirty pages 
needs to be first committed on your disk)...

So you can easily 'freeze' yourself in 'fsfreeze'.

lvm2 has got over last year much smarter - and avoids i.e. flushing in case 
it's queering of used 'data-space'  with 2 consequences:

a) it prevents 'dead-lock' in suspending with flushing (and holding lvm2 VG 
locking - which was really bad bad bad problem....  as you could not run 
'lvextend' for thin-pool in such case to rescue situation (i.e. you still have 
free space in VG - or even extend your VG...

b) gives you some 'historical/unprecise/async' runtime data of thin-pool fullness

So you can start to see that doing some 'perfect'  decision with historical 
data is not easy task...


Reagards


Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 21:57                       ` Zdenek Kabelac
@ 2017-09-13 17:41                         ` Xen
  2017-09-13 19:17                           ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-13 17:41 UTC (permalink / raw)
  To: linux-lvm

Zdenek Kabelac schreef op 12-09-2017 23:57:

> Users interested in thin-provisioning are really mostly interested in
> performance - especially on multicore machines with lots of fast
> storage with high IOPS throughput  (some of them even expect it should
> be at least as good as linear....)

Why don't you hold a survey?

And not phrase it in terms of "Would you like to sacrifice performance 
for more safety?"

But please.

Ask people:

1) What area does the LVM team needs to focus on for thin provisioning:

a) Performance and keeping performance intact
b) Safety and providing good safeguards against human and program error
c) User interface and command line tools
d) Monitoring and reporting software and systems
e) Graphical user interfaces
f) Integration into default distributions and support for booting/grub

And then allow people to score these things with a percentage or to 
distribute some 20 points across these 6 points.

Invent more points as needed.

Give people 20 points to distribute across some 8 areas of interest.

Then ask people what areas are most interesting to them.

So topics could be:
(a) Performance (b) Robustness (c) Command line user interface (d) 
Monitoring systems (e) Graphical user interface (f) Distribution support

So ask people. Don't assume.

(NetworkManager team did this pretty well by the way. They were really 
interested in user perception some time ago).

> if you will keep thinking for a while you will at some point see the 
> reasoning.

Only if your reasoning is correct. Not if your reasoning is wrong.

I could also say to you, we could also say to you "If you think longer 
on this you will see we are right". That would probably be more accurate 
even.

> Repeated again - whoever targets for 100% full thin-pool usage has
> misunderstood purpose of thin-provisioning.....

Again, no one "targets" for 100% full. It is just an eventuality we need 
to take care of.

You design for failure.

A nuclear plant who did not take account of operator drunkenness and had 
no safety measures in place to ensure that would not lead to 
catastrophe, would be a very bad nuclear plant.

Human error can be calculated into the design. In fact, it must.

DESIGN FOR HUMAN WEAKNESS.

NOT EVERYONE IS PERFECT and human faults happen.

If I was a customer and I was paying your bills, you would never respond 
like this.

We like some assurance that things do not go immediate mayhem the moment 
someone somewhere slacks off and falls asleep.

We like to design in advance so we do not have to keep a constant eye 
out.

We build "structure" so that the structure works for us, and not 
constant vigilance.

Constant vigilance can fail. Structure cannot.

Focus on "being" not "doing".

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 23:25                   ` Brassow Jonathan
  2017-09-13  8:15                     ` Gionatan Danti
@ 2017-09-13 18:43                     ` Xen
  2017-09-13 19:35                       ` Zdenek Kabelac
  1 sibling, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-13 18:43 UTC (permalink / raw)
  To: linux-lvm

Brassow Jonathan schreef op 13-09-2017 1:25:

> I’m the manager of the LVM/DM team here at Red Hat.

Thank you for responding.

> 2) allow users to reserve some space for critical volumes when a
> threshold is reached

> #2 doesn’t seem crazy hard to implement - even in script form.

> - Add a “critical” tag to all thinLVs that are very important:
> 	# lvchange --addtag critical vg/thinLV
> - Create script that is called by thin_command, it should:
> - check if a threshold is reached (i.e. your reserved space) and if so,
> - report all lvs associated with the thin-pool that are NOT critical:
> 	# lvs -o name --noheadings --select 'lv_tags!=critical && 
> pool_lv=thin-pool’ vg
> - run <command> on those non-critical volumes, where <command> could 
> be:
> 	# fsfreeze <mnt_point>

I think the above is exactly (or almost exactly) in agreement with the 
general idea yes.

It uses filesystem tool to achieve it instead of allocation blocking (so 
filesystem level, not DM level).

But if it does the same thing that is more important than having 
'perfect' solution.

The issue with scripts is that they feel rather vulnerable to 
corruption, not being there etc.

So in that case I suppose that you would want some default, shipped 
scripts that come with LVM as example for default behaviour and that are 
also activated by default?

So fixed location in FHS for those scripts and where user can find then 
and can install new ones.

Something similar to /etc/initramfs-tools/ (on Debian), so maybe 
/etc/lvm/scripts/ and /usr/share/lvm/scripts/ or similar.

Also easy to adjust by each distribution if they wanted to.

If no one uses critical tag -- nothing happens, but if they do use it, 
check unallocated space on critical volumes and sum it up to arrive at 
threshold value?

Then not even a threshold value needs to be configured.


> If the above is sufficient, then great.  If you’d like to see
> something like this added to the LVM repo, then you can simply reply
> here with ‘yes’ and maybe provide a sentence of what the scenario is
> that it would solve.

Yes. One obvious scenario is root on thin.

It's pretty mandatory for root on thin.


There is something else though.

You cannot set max size for thin snapshots?

This is part of the problem: you cannot calculate in advance what can 
happen, because by design, mayhem should not ensue, but what if your 
predictions are off?

Being able to set a maximum snapshot size before it gets dropped could 
be very nice.

This behaviour is very safe on non-thin.

It is inherently risky on thin.



> (I know there are already some listed in this
> thread, but I’m wondering about those folks that think the script is
> insufficient and believe this should be more standard.)

You really want to be able to set some minimum free space you want per 
volume.

Suppose I have three volumes of 10GB, 20GB and 3GB.

I may want the 20GB volume to be least important. The 3GB volume most 
important. The 10GB volume in between.

I want at least 100MB free on 3GB volume.

When free space on thin pool drops below ~120MB, I want the 20GB volume 
and the 10GB volumes to be frozen, no new extents for 30GB volume.

I want at least 500MB free on 10GB volume.

When free space on thin pool drops below ~520MB, I want the 20GB volume 
to be frozen, no new extents for 20GB volume.


So I would get 2 thresholds and actions:

- threshold for 3GB volume causing all others to be frozen
- threshold for 10GB volume causing 20GB volume to be frozen

This is easily scriptable and custom thing.

But it would be nice if you could set this threshold in LVM per volume?

So the script can read it out?

100MB of 3GB = 3.3%
500MB of 10GB = 5%

3-5% of mandatory free space could be a good default value.

So the default script could also provide a 'skeleton' for reading the 
'critical' tag and then calculating a default % of space that needs to 
be free.

In this case there is a hierarchy:

3GB > 10GB > 20GB.

Any 'critical volume' could cause all others 'beneath it' to be frozen.


But the most important thing is to freeze or drop snapshots I think.

And to ensure that this is default behaviour?

Or at least provide skeletons for responding to thin threshold values 
being reached so that the burden on the administrator is very minimal.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-13 17:41                         ` Xen
@ 2017-09-13 19:17                           ` Zdenek Kabelac
  2017-09-14  3:19                             ` Xen
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-13 19:17 UTC (permalink / raw)
  To: LVM general discussion and development, Xen

Dne 13.9.2017 v 19:41 Xen napsal(a):
> Zdenek Kabelac schreef op 12-09-2017 23:57:
> 
>> Users interested in thin-provisioning are really mostly interested in
>> performance - especially on multicore machines with lots of fast
>> storage with high IOPS throughput� (some of them even expect it should
>> be at least as good as linear....)
> 
> Why don't you hold a survey?
> 
> And not phrase it in terms of "Would you like to sacrifice performance for 
> more safety?"
> 
> But please.
> 
> Ask people:
>> Repeated again - whoever targets for 100% full thin-pool usage has
>> misunderstood purpose of thin-provisioning.....
> 
> Again, no one "targets" for 100% full. It is just an eventuality we need to 
> take care of.
> 
> You design for failure.

Thin-pool IS designed for failure - who said it isn't ?

It has very matured protection against data corruption.

It's just not getting  overcomplicated in-kernel - solution is left for the 
user-space - that's very clear design of 'dm' for decades...

> If I was a customer and I was paying your bills, you would never respond like 
> this.

We are very nice to customers which pays our bills....

> We like to design in advance so we do not have to keep a constant eye out.
> 

Please if you can show the case where the current upstream thinLV fails and 
you lose your data - we can finally start to fix something.

I'm still unsure what problem you want to get resolved from pretty small group 
of people around dm/lvm2 - do you want from us to rework kernel page-cache ?

I'm simply still confused what kind action you expect...

Be specific with real world example.


Regards


Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-13 18:43                     ` Xen
@ 2017-09-13 19:35                       ` Zdenek Kabelac
  2017-09-14  5:59                         ` Xen
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-13 19:35 UTC (permalink / raw)
  To: LVM general discussion and development, Xen

Dne 13.9.2017 v 20:43 Xen napsal(a):
> 
> There is something else though.
> 
> You cannot set max size for thin snapshots?
> 

We are moving here in right direction.

Yes - current thin-provisiong does not let you limit maximum number of blocks 
individual thinLV can address (and snapshot is ordinary thinLV)

Every thinLV can address  exactly   LVsize/ChunkSize  blocks at most.


> This is part of the problem: you cannot calculate in advance what can happen, 
> because by design, mayhem should not ensue, but what if your predictions are off?

Great - 'prediction' - we getting on the same page -  prediction is big 
problem....

> Being able to set a maximum snapshot size before it gets dropped could be very 
> nice.

You can't do that IN KERNEL.

The only tool which is able to calculate real occupancy - is user-space 
thin_ls tool.

So all you need to do is to use the tool in user-space for this task.

> This behaviour is very safe on non-thin.
> 
> It is inherently risky on thin.
> 
> 
> 
>> (I know there are already some listed in this
>> thread, but I’m wondering about those folks that think the script is
>> insufficient and believe this should be more standard.)
> 
> You really want to be able to set some minimum free space you want per volume.
> 
> Suppose I have three volumes of 10GB, 20GB and 3GB.
> 
> I may want the 20GB volume to be least important. The 3GB volume most 
> important. The 10GB volume in between.
> 
> I want at least 100MB free on 3GB volume.
> 
> When free space on thin pool drops below ~120MB, I want the 20GB volume and 
> the 10GB volumes to be frozen, no new extents for 30GB volume.
> 
> I want at least 500MB free on 10GB volume.
> 
> When free space on thin pool drops below ~520MB, I want the 20GB volume to be 
> frozen, no new extents for 20GB volume.
> 
> 
> So I would get 2 thresholds and actions:
> 
> - threshold for 3GB volume causing all others to be frozen
> - threshold for 10GB volume causing 20GB volume to be frozen
> 
> This is easily scriptable and custom thing.
> 
> But it would be nice if you could set this threshold in LVM per volume?

This is the main issue - these 'data' are pretty expensive to 'mine' out of 
data structures.

That's the reason why thin-pool is so fast and memory efficient inside the 
kernel - because it does not need to all those details about how much data 
thinLV eat from thin-pool - kernel target simply does not care - it only cares 
about referenced chunks

It's the user space utility which is able to 'parse' all the structure
and take a 'global' picture. But of course it takes CPU and TIME and it's not 
'byte accurate'  -  that's why you need to start act early on some threshold.


> But the most important thing is to freeze or drop snapshots I think.
> 
> And to ensure that this is default behaviour?

Why you think this should be default ?

Default is to auto-extend thin-data & thin-metadata when needed if you set 
threshold bellow 100%.

We can discuss if it's good idea to enable auto-extending by default - as we 
don't know if the free space in VG is meant to be used for thin-pool or there 
is some other plan admin might have...


Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-13 19:17                           ` Zdenek Kabelac
@ 2017-09-14  3:19                             ` Xen
  0 siblings, 0 replies; 91+ messages in thread
From: Xen @ 2017-09-14  3:19 UTC (permalink / raw)
  To: linux-lvm

Zdenek Kabelac schreef op 13-09-2017 21:17:

> Please if you can show the case where the current upstream thinLV
> fails and you lose your data - we can finally start to fix something.

Hum, I can only say "I owe you one" on this.

I mean to say it will have to wait, but I hope to get to this at some 
point.

> I'm still unsure what problem you want to get resolved from pretty
> small group of people around dm/lvm2 - do you want from us to rework
> kernel page-cache ?
> 
> I'm simply still confused what kind action you expect...
> 
> Be specific with real world example.

I think Brassow Jonathan's idea is very good to begin with (thank you 
sir ;-)).

I get that you say that kernel space solution is impossible to implement 
(apart from not crashing the system, and I get that you say that this is 
no longer the case) because checking several things would prolong 
execution paths considerably, is what you say.

And I realize that any such thing would need asynchronous checking and 
updating some values and then execution paths that need to check for 
such things which I guess could indeed by rather expensive to actually 
execute.

I mean the only real kernel experience I have was trying to dabble with 
filename_lookup and path_lookupat or whatever it was called. I mean 
inode path lookups, which is a bit of the same thing. And indeed even a 
single extra check would have incurred a performance overhead.

I mean the code to begin with differentiated between fast lookup and 
slow lookup and all of that.

And particularly the fast lookup was not something you'd want to mess 
with, etc.

But I absolutely have no issue to begin with I want to say with 
asynchronous 'intervention' even if it is not byte accurate, as you say 
in the other email.

And I get that you prefer user-space tools doing the thing...

And you say there that this information is hard to mine.

And that the "thin_ls" tool does that.

It's just that I don't want it to be 'random' and depending on your 
particular random sysadmin doing the right thing in isolation of all 
other random sysadmins having to do the right thing all in isolation of 
each other all writing the same code.

At the very least if you recognise your responsibility, which you are 
doing now, we can have a bit of a framework that is delivered by 
upstream LVM so the thing comes out more "fully fleshed" and sysadmins 
have less work to do, even if they still have to customize the scripts 
or anything.

Most ideal thing would definitely be something you "set up" and then the 
thing takes care of itself, ie. you only have to input some values and 
constraints.

But intervention in forms of "fsfreeze" or whatever is very personal, I 
get that.

And I get that previously auto-unmounting also did not really solve 
issues for everyone.

So a general interventionalist policy that is going to work for everyone 
is hard to get.

So the only thing that could work for everyone is if there is actually a 
block on new allocations. If that is not possible, then indeed I agree 
that a "one size fits all" approach is hardly possible.

Intervention is system-specific.

Regardless at least it should be easy to ensure that some constraints 
are enforced, that's all I'm asking.

Regards, (I'll respond further in the other email).

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-13 19:35                       ` Zdenek Kabelac
@ 2017-09-14  5:59                         ` Xen
  2017-09-14 19:05                           ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-14  5:59 UTC (permalink / raw)
  To: linux-lvm

Zdenek Kabelac schreef op 13-09-2017 21:35:

> We are moving here in right direction.
> 
> Yes - current thin-provisiong does not let you limit maximum number of
> blocks individual thinLV can address (and snapshot is ordinary thinLV)
> 
> Every thinLV can address  exactly   LVsize/ChunkSize  blocks at most.

So basically the only options are allocation check with asynchronously 
derived intel that might be a few seconds late, as a way to execute some 
standard and general "prioritizing" policy, and an interventionalist 
policy that will (fs)freeze certain volumes depending on admin knowledge 
about what needs to happen in his/her particular instance.

>> This is part of the problem: you cannot calculate in advance what can 
>> happen, because by design, mayhem should not ensue, but what if your 
>> predictions are off?
> 
> Great - 'prediction' - we getting on the same page -  prediction is
> big problem....

Yes I mean my own 'system' I generally of course know how much data is 
on it and there is no automatic data generation.

Matthew Patton referenced quotas in some email, I didn't know how to do 
it as quickly when I needed it so I created a loopback mount from a 
fixed sized container to 'solve' that issue when I did have an 
unpredictable data source... :p.

But if I do create snapshots (which I do every day) when the root and 
boot snapshots fill up (they are on regular lvm) they get dropped which 
is nice, but particularly the big data volume if I really were to move a 
lot of data around I might need to first get rid of the snapshots or 
else I don't know what will happen or when.

Also my system (yes I am an "outdated moron") does not have thin_ls tool 
yet so when I was last active here and you mentioned that tool (thank 
you for that, again) I created this little script that would give me 
also info:

$ sudo ./thin_size_report.sh
[sudo] password for xen:
Executing self on linux/thin
Individual invocation for linux/thin

     name               pct       size
     ---------------------------------
     data            54.34%     21.69g
     sites            4.60%      1.83g
     home             6.05%      2.41g
     --------------------------------- +
     volumes         64.99%     25.95g
     snapshots        0.09%     24.00m
     --------------------------------- +
     used            65.08%     25.97g
     available       34.92%     13.94g
     --------------------------------- +
     pool size      100.00%     39.91g

The above "sizes" are not volume sizes but usage amounts.

And the % are % of total pool size.

So you can see I have 1/3 available on this 'overprovisioned' thin pool 
;-).


But anyway.


>> Being able to set a maximum snapshot size before it gets dropped could 
>> be very nice.
> 
> You can't do that IN KERNEL.
> 
> The only tool which is able to calculate real occupancy - is
> user-space thin_ls tool.

Yes my tool just aggregated data from "lvs" invocations to calculate the 
numbers.

If you say that any additional allocation checks would be infeasible 
because it would take too much time per request (which still seems odd 
because the checks wouldn't be that computation intensive and even for 
100 gigabyte you'd only have 25.000 checks at default extent size) -- of 
course you asynchronously collect the data.

So I don't know if it would be *that* slow provided you collect the data 
in the background and not while allocating.

I am also pretty confident that if you did make a policy it would turn 
out pretty good.

I mean I generally like the designs of the LVM team.

I think they are some of the most pleasant command line tools anyway...

But anyway.

On the other hand if all you can do is intervene in userland, then all 
LVM team can do is provide basic skeleton for execution of some standard 
scripts.

> So all you need to do is to use the tool in user-space for this task.

So maybe we can have an assortment of some 5 interventionalist policies 
like:

a) Govern max snapshot size and drop snapshots when they exceed this
b) Freeze non-critical volumes when thin space drops below aggegrate 
values appropriate for the critical volumes
c) Drop snapshots when thin space <5% starting with the biggest one
d) Also freeze relevant snapshots in case (b)
e) Drop snapshots when exceeding max configured size in case of 
threshold reach.

So for example you configure max size for snapshot. When snapshots 
exceeds size gets flagged for removal. But removal only happens when 
other condition is met (threshold reach).

So you would have 5 different interventions you could use that could be 
considered somewhat standard and the admit can just pick and choose or 
customize.


> This is the main issue - these 'data' are pretty expensive to 'mine'
> out of data structures.

But how expensive is it to do it say every 5 seconds?


> It's the user space utility which is able to 'parse' all the structure
> and take a 'global' picture. But of course it takes CPU and TIME and
> it's not 'byte accurate'  -  that's why you need to start act early on
> some threshold.

I get that but I wonder how expensive it would be to do that 
automatically all the time in the background.

It seems to already happen?

Otherwise you wouldn't be reporting threshold messages.

In any case the only policy you could have in-kernel would be either 
what Gionatan proposed (fixed reserved space for certain volumes) (easy 
calculation right) or potentially allocation freeze at threshold for 
non-critical volumes,


I say you only implement per-volume space reservation, but anyway.

I just still don't see how one check per 4MB would be that expensive 
provided you do data collection in background.

You say size can be as low as 64kB... well.... in that case...

You might have issues.



But in any case,

a) For intervention, choice is between customization by code and 
customization by values.
b) Ready made scripts could take values but could also be easy to 
customize
c) Scripts could take values from LVM config or volume config but must 
be easy to know/change/know about.

d) Scripts could document where to set the values.

e) Personally I would do the following:

    a) Stop snapshots from working when a threshold is reached (95%) in a 
rapid fasion

    or

    a) Just let everything fill up as long as system doesn't crash

    b) Intervene to drop/freeze using scripts, where

       1) I would drop snapshots starting with the biggest one in case of 
threshold reach (general)

       2) I would freeze non-critical volumes ( I do not write to 
snapshots so that is no issue ) when critical volumes reached safety 
threshold in free space ( I would do this in-kernel if I could ) ( But 
Freezing In User-Space is almost the same ).

       3) I would shrink existing volumes to better align with this 
"critical" behaviour because now they are all large size to make moving 
data easier

       4) I would probably immediately implement these strategies if the 
scripts were already provided

       5) Currently I already have reporting in place (by email) so I 
have no urgent need myself apart from still having an LVM version that 
crashes

f) For a critical volume script, it is worth considering that small 
volumes are more likely to be critical than big ones, so this could also 
prompt people to organize their volumes in that way, and have a standard 
mechanism to first protect the free space of smaller volumes against all 
of the bigger ones, then the next up is only protected against ITS 
bigger ones, and so on.

Basically when you have Big, Medium and Small, Medium is protected 
against Big, and Small is protected against both others.

So the Medium protection is triggered sooner because it has a higher 
space need compared to the Small volume, so Big is frozen before Medium 
is frozen.

So when space then runs out, first Big is frozen, and when that doesn't 
help, in time Medium is also frozen.

Seems pretty legit I must say.

And this could be completely unconfigured, just a standard recipe using 
for configuration only the percentage you want to use.

Ie. you can say I want 5% free on all volumes from the top down, and 
only the biggest one isn't protected, but all the smaller ones are.

If several are the same size you lump them together.

Now you have a cascading system in which if you choose this script, you 
will have "Small ones protected against Big ones" protection in which 
you really don't have to set anything up yourself.

You don't even have to flag them as critical...

Sounds like fun to make in any case.


g) There is a little program called "pam_shield" that uses 
"shield_triggers" to select which kind of behaviour the user wants to 
use in blocking external IPs. It provides several alternatives such as 
IP routing block (blackhole) and iptables block.

You can choose which intervention you want. The scripts are already 
provided. You just have to select the one you want.


>> And to ensure that this is default behaviour?
> 
> Why you think this should be default ?
> 
> Default is to auto-extend thin-data & thin-metadata when needed if you
> set threshold bellow 100%.

Q: In a 100% filled up pool, are snapshots still going to be valid?

Could it be useful to have a default policy of dropping snapshots at 
high consumption? (ie. 99%). But it doesn't have to be default if you 
can easily configure it and the scripts are available.

So no, if the scripts are available and the system doesn't crash as you 
say it doesn't anymore, there does not need to be a default.

Just documented.

I've been condensing this email.

You could have a script like:

#!/bin/bash

# Assuming $1 is the thin pool I am getting executed on, that $2 is the 
threshold that
# has been reached, and $3 is the free space available in pool

MIN_FREE_SPACE_CRITICAL_VOLUMES_PCT=5

1. iterate critical volumes
2. calculate needed free space for those volumes based on above value
3. check against the free space in $3

4. perform action

Well I am not saying anything new here compared to Brassow Jonathan.

But it could be that simple to have a script you don't even need to 
configure.

More sophisticated then would be a big vs small script in which you 
don't even need to configure the critical volumes.

So to sum up my position is still:

a) Personally I would still prefer in-kernel protection based on quotas
b) Personally I would not want anything else from in-kernel protection
c) No other policies than that in the kernel
d) Just allocation block based on quotas based on lazy data collection

e) If people really use 64kB chunksizes and want max performance then 
it's not for them
f) The analogy of the aeroplane that runs out of fuel and you have to 
choose which passengers to eject does not apply if you use quotas.

g) I would want more advanced policy or protection mechanisms 
(intervention) in userland using above ideas.

h) I would want inclusion of those basic default scripts in LVM upstream

i) The model of "shield_trigger" of "pam_shield" is a choice between 
several default interventions


> We can discuss if it's good idea to enable auto-extending by default -
> as we don't know if the free space in VG is meant to be used for
> thin-pool or there is some other plan admin might have...

I don't think you should. Any admin that uses thin and that intends to 
auto-extend, will be able to configure so anyway.

When I said I wanted default, it is more like "available by default" 
than "configured by default".

Using thin is a pretty conscious choice.

As long as it is easy to activate protection measures, that is not an 
issue and does not need to be default imo.

Priorities for me:

1) Monitoring and reporting
2) System could block allocation for critical volumes
3) I can drop snapshots starting with the biggest one in case of <5% 
pool free
4) I can freeze volumes when space for critical volumes runs out

Okay sending this now. I tried to summarize.

See ya.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-14  5:59                         ` Xen
@ 2017-09-14 19:05                           ` Zdenek Kabelac
  2017-09-15  2:06                             ` Brassow Jonathan
  2017-09-15  7:34                             ` Xen
  0 siblings, 2 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-14 19:05 UTC (permalink / raw)
  To: LVM general discussion and development, Xen

Dne 14.9.2017 v 07:59 Xen napsal(a):
> Zdenek Kabelac schreef op 13-09-2017 21:35:
> 
>> We are moving here in right direction.
>>
>> Yes - current thin-provisiong does not let you limit maximum number of
>> blocks individual thinLV can address (and snapshot is ordinary thinLV)
>>
>> Every thinLV can address� exactly�� LVsize/ChunkSize� blocks at most.
> 
> So basically the only options are allocation check with asynchronously derived 
> intel that might be a few seconds late, as a way to execute some standard and 
> general "prioritizing" policy, and an interventionalist policy that will 
> (fs)freeze certain volumes depending on admin knowledge about what needs to 
> happen in his/her particular instance.


Basically user-land tool takes a runtime snapshot of kernel metadata
(so gets you information from some frozen point in time) then it processes the 
input data (up to 16GiB!) and outputs some number - like what is the
real unique blocks allocated in thinLV.  Typically snapshot may share some 
blocks - or could have already be provisioning all blocks  in case shared 
blocks were already modified.


>> Great - 'prediction' - we getting on the same page -� prediction is
>> big problem....
> 
> Yes I mean my own 'system' I generally of course know how much data is on it 
> and there is no automatic data generation.

However lvm2 is not 'Xen oriented' tool only.
We need to provide universal tool - everyone can adapt to their needs.

Since your needs are different from others needs.

> But if I do create snapshots (which I do every day) when the root and boot 
> snapshots fill up (they are on regular lvm) they get dropped which is nice, 

old snapshot are different technology for different purpose.

> 
> $ sudo ./thin_size_report.sh
> [sudo] password for xen:
> Executing self on linux/thin
> Individual invocation for linux/thin
> 
>  ��� name�������������� pct������ size
>  ��� ---------------------------------
>  ��� data����������� 54.34%���� 21.69g
>  ��� sites����������� 4.60%����� 1.83g
>  ��� home������������ 6.05%����� 2.41g
>  ��� --------------------------------- +
>  ��� volumes�������� 64.99%���� 25.95g
>  ��� snapshots������� 0.09%���� 24.00m
>  ��� --------------------------------- +
>  ��� used����������� 65.08%���� 25.97g
>  ��� available������ 34.92%���� 13.94g
>  ��� --------------------------------- +
>  ��� pool size����� 100.00%���� 39.91g
> 
> The above "sizes" are not volume sizes but usage amounts.

With 'plain'  lvs output is - it's just an orientational number.
Basically highest referenced chunk for a thin given volume.
This is great approximation of size for a single thinLV.
But somewhat 'misleading' for thin devices being created as snapshots...
(having shared blocks)

So you have no precise idea how many blocks are shared or uniquely owned by a 
device.

Removal of snapshot might mean you release  NOTHING from your thin-pool if all 
snapshot blocks where shared with some other thin volumes....


> If you say that any additional allocation checks would be infeasible because 
> it would take too much time per request (which still seems odd because the 
> checks wouldn't be that computation intensive and even for 100 gigabyte you'd 
> only have 25.000 checks at default extent size) -- of course you 
> asynchronously collect the data.

Processing of mapping of upto 16GiB of metadata will not happen in 
miliseconds.... and consumes memory and CPU...


> I mean I generally like the designs of the LVM team.
> 
> I think they are some of the most pleasant command line tools anyway...

We try really hard....


> On the other hand if all you can do is intervene in userland, then all LVM 
> team can do is provide basic skeleton for execution of some standard scripts.

Yes - we give all the power to suit thin-p for individual needs to the user.

> 
>> So all you need to do is to use the tool in user-space for this task.
> 
> So maybe we can have an assortment of some 5 interventionalist policies like:
> 
> a) Govern max snapshot size and drop snapshots when they exceed this
> b) Freeze non-critical volumes when thin space drops below aggegrate values 
> appropriate for the critical volumes
> c) Drop snapshots when thin space <5% starting with the biggest one
> d) Also freeze relevant snapshots in case (b)
> e) Drop snapshots when exceeding max configured size in case of threshold reach.

But you are aware you can run such task even with cronjob.

> So for example you configure max size for snapshot. When snapshots exceeds 
> size gets flagged for removal. But removal only happens when other condition 
> is met (threshold reach).

We are blamed already for having way too much configurable knobs....


> 
> So you would have 5 different interventions you could use that could be 
> considered somewhat standard and the admit can just pick and choose or customize.
> 

And we have way longer list of actions we want to do ;) We have not yet come 
to any single conclusion how to make such thing manageable for a user...

> 
> But how expensive is it to do it say every 5 seconds?

If you have big metadata - you would keep you Intel Core busy all the time ;)

That's why we have those thresholds.

Script is called at  50% fullness, then when it crosses 55%, 60%, ... 95%, 
100%. When it drops bellow threshold - you are called again once the boundary 
is crossed...

So you can do different action at different fullness level...

> 
> I get that but I wonder how expensive it would be to do that automatically all 
> the time in the background.

If you are proud sponsor of your electricity provider and you like the extra 
heating in your house - you can run this in loop of course...


> It seems to already happen?
> 
> Otherwise you wouldn't be reporting threshold messages.

Threshold are based on  mapped size for whole thin-pool.

Thin-pool surely knows all the time how many blocks are allocated and free for
its data and metadata devices.

(Thought 'lvs' presented numbers are not 'synchronized' - there could be up to 
1.second delay between reported & real number)

> In any case the only policy you could have in-kernel would be either what 
> Gionatan proposed (fixed reserved space for certain volumes) (easy calculation 
> right) or potentially allocation freeze at threshold for non-critical volumes,


In the single thin-pool  all thins ARE equal.

Low number of 'data' block may cause tremendous amount of provisioning.

With specifically written data pattern you can (in 1 second!) cause 
provisioning of large portion of your thin-pool (if not the whole one in case 
you have small one in range of gigabytes....)

And that's the main issue - what we solve in  lvm2/dm  - we want to be sure 
that when thin-pool is FULL  -  written & committed data are secure and safe.
Reboot is mostly unavoidable if you RUN from a device which is out-of-space -
we cannot continue to use such device - unless you add MORE space to it within 
60second window.


All other proposals solve only very localized solution and problems which are 
different for every user.

I.e. you could have a misbehaving daemon filling your system device very fast 
with logs...

In practice - you would need some system analysis and detect which application 
causes highest pressure on provisioning  - but that's well beyond range lvm2 
team ATM with the amount of developers can provide....


> I just still don't see how one check per 4MB would be that expensive provided 
> you do data collection in background.
> 
> You say size can be as low as 64kB... well.... in that case...

Default chunk size if 64k for the best 'snapshot' sharing - the bigger the 
pool chunk is the less like you could 'share' it between snapshots...

(As pointed in other thread - ideal chunk for best snapshot sharing would be 
4K - but that's not affordable for other reasons....)


>  ����� 2) I would freeze non-critical volumes ( I do not write to snapshots so 
> that is no issue ) when critical volumes reached safety threshold in free 
> space ( I would do this in-kernel if I could ) ( But Freezing In User-Space is 
> almost the same ).

There are lots of troubles when you have freezed filesystems present in your 
machine fs tree... -  if you know all connections and restrictions - it can be 
'possibly' useful - but I can't imagine this being useful in generic case...

And more for your thinking -

If you have pressure on provisioning caused by disk-load on one of your 
'critical' volumes this FS 'freezeing' scripting will 'buy' you only couple 
seconds (depends how fast drives you have and how big thresholds you will use) 
and you are in the 'exact' same situation - expect now you have  system in 
bigger troubles - and you already might have freezed other systems apps by 
having them accessing your 'low-prio' volumes....

And how you will be solving 'unfreezing' in cases thin-pool usage drops down 
is also pretty interesting topic on its own...

I need to wish good luck when you will be testing and developing all this 
machinery.

>> Default is to auto-extend thin-data & thin-metadata when needed if you
>> set threshold bellow 100%.
> 
> Q: In a 100% filled up pool, are snapshots still going to be valid?
> 
> Could it be useful to have a default policy of dropping snapshots at high 
> consumption? (ie. 99%). But it doesn't have to be default if you can easily 
> configure it and the scripts are available.

All snapshots/thins with 'fsynced' data are always secure.
Thin-pool is protecting all user-data on disk.

The only lost data are those flying in your memory (unwritten on disk).
And depends on you 'page-cache' setup how much that can be...


Regards


Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-14 19:05                           ` Zdenek Kabelac
@ 2017-09-15  2:06                             ` Brassow Jonathan
  2017-09-15  6:02                               ` Gionatan Danti
  2017-09-15  8:37                               ` Xen
  2017-09-15  7:34                             ` Xen
  1 sibling, 2 replies; 91+ messages in thread
From: Brassow Jonathan @ 2017-09-15  2:06 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: Xen


> On Sep 14, 2017, at 2:05 PM, Zdenek Kabelac <zkabelac@redhat.com> wrote:
> 
>> 
>>       2) I would freeze non-critical volumes ( I do not write to snapshots so that is no issue ) when critical volumes reached safety threshold in free space ( I would do this in-kernel if I could ) ( But Freezing In User-Space is almost the same ).
> 
> There are lots of troubles when you have freezed filesystems present in your machine fs tree... -  if you know all connections and restrictions - it can be 'possibly' useful - but I can't imagine this being useful in generic case...
> 
> And more for your thinking -
> 
> If you have pressure on provisioning caused by disk-load on one of your 'critical' volumes this FS 'freezeing' scripting will 'buy' you only couple seconds (depends how fast drives you have and how big thresholds you will use) and you are in the 'exact' same situation - expect now you have  system in bigger troubles - and you already might have freezed other systems apps by having them accessing your 'low-prio' volumes....
> 
> And how you will be solving 'unfreezing' in cases thin-pool usage drops down is also pretty interesting topic on its own...
> 
> I need to wish good luck when you will be testing and developing all this machinery.

Our general philosophy is, don’t do anything that will corrupt user data.  After that, the LVM team wants to put in place the best possible solutions for a generic user set.  When it comes to thin-provisioning, the best possible thing we can do that we are certain will not corrupt/loose data and is least likely to cause unintended consequences, is to try to grow the thin-pool.  If we are unable to grow and the thin-pool is filling up, it is really hard to “do the right thing”.

There are many solutions that could work - unique to every workload and different user.  It is really hard for us to advocate for one of these unique solutions that may work for a particular user, because it may work very badly for the next well-intentioned googler.

We’ve tried to strike a balance of doing the things that are knowably correct and getting 99% of the problems solved, and making the user aware of the remaining problems (like 100% full thin-provisioning) while providing them the tools (like the ‘thin_command’ setting) so they can solve the remaining case in the way that is best for them.

We probably won’t be able to provide any highly refined scripts that users can just plug in for the behavior they want, since they are often so highly specific to each customer.  However, I think it will be useful to try to create better tools so that users can more easily get the behavior they want.  We want to travel as much distance toward the user as possible and make things as usable as we can for them.  From this discussion, we have uncovered a handful of useful ideas (e.g. this bug that Zdenek filed: https://bugzilla.redhat.com/show_bug.cgi?id=1491609) that will make more robust scripts possible.  We are also enhancing our reporting tools so users can better sort through LVM information and take action.  Again, this is in direct response to the feedback we’ve gotten here.

Thanks,
 brassow

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-15  2:06                             ` Brassow Jonathan
@ 2017-09-15  6:02                               ` Gionatan Danti
  2017-09-15  8:37                               ` Xen
  1 sibling, 0 replies; 91+ messages in thread
From: Gionatan Danti @ 2017-09-15  6:02 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: Xen

Il 15-09-2017 04:06 Brassow Jonathan ha scritto:
> We probably won’t be able to provide any highly refined scripts that
> users can just plug in for the behavior they want, since they are
> often so highly specific to each customer.  However, I think it will
> be useful to try to create better tools so that users can more easily
> get the behavior they want.  We want to travel as much distance toward
> the user as possible and make things as usable as we can for them.
> From this discussion, we have uncovered a handful of useful ideas
> (e.g. this bug that Zdenek filed:
> https://bugzilla.redhat.com/show_bug.cgi?id=1491609) that will make
> more robust scripts possible.  We are also enhancing our reporting
> tools so users can better sort through LVM information and take
> action.  Again, this is in direct response to the feedback we’ve
> gotten here.

Excellent, thank you all very much.

 From the two proposed solutions (lvremove vs lverror), I think I would 
prefer the second one. Obviously with some warnings as "are you sure do 
error this active volume"?

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-14 19:05                           ` Zdenek Kabelac
  2017-09-15  2:06                             ` Brassow Jonathan
@ 2017-09-15  7:34                             ` Xen
  2017-09-15  9:22                               ` Zdenek Kabelac
  1 sibling, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-15  7:34 UTC (permalink / raw)
  To: linux-lvm

Zdenek Kabelac schreef op 14-09-2017 21:05:

> Basically user-land tool takes a runtime snapshot of kernel metadata
> (so gets you information from some frozen point in time) then it
> processes the input data (up to 16GiB!) and outputs some number - like
> what is the
> real unique blocks allocated in thinLV.

That is immensely expensive indeed.

> Typically snapshot may share
> some blocks - or could have already be provisioning all blocks  in
> case shared blocks were already modified.

I understand and it's good technology.

>> Yes I mean my own 'system' I generally of course know how much data is 
>> on it and there is no automatic data generation.
> 
> However lvm2 is not 'Xen oriented' tool only.
> We need to provide universal tool - everyone can adapt to their needs.

I said that to indicate that prediction problems are not current 
important for me as much but they definitely would be important in other 
scenarios or for other people.

You twist my words around to imply that I am trying to make myself 
special, while I was making myself unspecial: I was just being modest 
there.

> Since your needs are different from others needs.

Yes and we were talking about the problems of prediction, thank you.

>> But if I do create snapshots (which I do every day) when the root and 
>> boot snapshots fill up (they are on regular lvm) they get dropped 
>> which is nice,
> 
> old snapshot are different technology for different purpose.

Again, what I was saying was to support the notion that having snapshots 
that may grow a lot can be a problem.

I am not sure the purpose of non-thin vs. thin snapshots is all that 
different though.

They are both copy-on-write in a certain sense.

I think it is the same tool with different characteristics.

> With 'plain'  lvs output is - it's just an orientational number.
> Basically highest referenced chunk for a thin given volume.
> This is great approximation of size for a single thinLV.
> But somewhat 'misleading' for thin devices being created as 
> snapshots...
> (having shared blocks)

I understand. The above number for "snapshots" were just the missing 
numbers from this summing up the volumes.

So I had no way to know snapshot usage.

I just calculated all used extents per volume.

The missing extents I put in snapshots.

So I think it is a very good approximation.

> So you have no precise idea how many blocks are shared or uniquely
> owned by a device.

Okay. But all the numbers were attributed to the correct volume 
probably.

I did not count the usage of the snapshot volumes.

Whether they are shared or unique is irrelevant from the point of view 
of wanting to know the total consumption of the "base" volume.

In the above 6 extents were not accounted for (24 MB) so I just assumed 
that would be sitting in snapshots ;-).

> Removal of snapshot might mean you release  NOTHING from your
> thin-pool if all snapshot blocks where shared with some other thin
> volumes....

Yes, but that was not indicated in above figure either. It was just 24 
MB that would be freed ;-).

Snapshots can only become a culprit if you start overwriting a lot of 
data, I guess.

>> If you say that any additional allocation checks would be infeasible 
>> because it would take too much time per request (which still seems odd 
>> because the checks wouldn't be that computation intensive and even for 
>> 100 gigabyte you'd only have 25.000 checks at default extent size) -- 
>> of course you asynchronously collect the data.
> 
> Processing of mapping of upto 16GiB of metadata will not happen in
> miliseconds.... and consumes memory and CPU...

I get that. If that is the case.

That's just the sort of thing that in the past I have been keeping track 
of continuously (in unrelated stuff) such that every mutation also 
updated the metadata without having to recalculate it...

I am meaning to say that if indeed this is the case and indeed it is 
this expensive, then clearly what I want is not possible with that 
scheme.

I mean to say that I cannot argue about this design. You are the 
experts.

I would have to go in learning first to be able to say anything about it 
;-).

So I can only defer to your expertise. Of course.

But the purpose of what you're saying is that the number of uniquely 
owned blocks by any snapshot is not known at any one point in time.

And needs to be derived from the entire map. Okay.

Thus reducing allocation would hardly be possible, you say.

Because the information is not known anyway.


Well pardon me for digging this deeply. It just seemed so alien that 
this thing wouldn't be possible.

I mean it seems so alien that you cannot keep track of those numbers 
runtime without having to calculate them using aggregate measures.

It seems information you want the system to have at all times.

I am just still incredulous that this isn't being done...

But I am not well versed in kernel concurrency measures so I am hardly 
qualified to comment on any of that.

In any case, thank you for your time in explaining. Of course this is 
what you said in the beginning as well, I am just still flabbergasted 
that there is no accounting being done...

Regards.


>> I think they are some of the most pleasant command line tools 
>> anyway...
> 
> We try really hard....

You're welcome.

>> On the other hand if all you can do is intervene in userland, then all 
>> LVM team can do is provide basic skeleton for execution of some 
>> standard scripts.
> 
> Yes - we give all the power to suit thin-p for individual needs to the 
> user.

Which is of course pleasant.

>>> So all you need to do is to use the tool in user-space for this task.
>> 
>> So maybe we can have an assortment of some 5 interventionalist 
>> policies like:
>> 
>> a) Govern max snapshot size and drop snapshots when they exceed this
>> b) Freeze non-critical volumes when thin space drops below aggegrate 
>> values appropriate for the critical volumes
>> c) Drop snapshots when thin space <5% starting with the biggest one
>> d) Also freeze relevant snapshots in case (b)
>> e) Drop snapshots when exceeding max configured size in case of 
>> threshold reach.
> 
> But you are aware you can run such task even with cronjob.

Sure the point is not that it can't be done, but that it seems an unfair 
burden on the system maintainer to do this in isolation of all other 
system maintainers who might be doing the exact same thing.

There is some power in numbers and it is just rather facilitating if a 
common scenario is somewhat provided by a central party.

I understand that every professional outlet dealing in terabytes upon 
terabytes of data will have the manpower to do all of this and do it 
well.

But for everyone else, it is a landscape you cannot navigate because you 
first have to deploy that manpower before you can start using the 
system!!!

It becomes a rather big enterprise to install thinp for anyone!!!

Because to get it running takes no time at all!!! But to get it running 
well then implies huge investment.

I just wouldn't mind if this gap was smaller.

Many of the things you'd need to do are pretty standard. Running more 
and more cronjobs... well I am already doing that. But it is not just 
the maintenance of the cron job (installation etc.) but also the script 
itself that you have to first write.

That means for me and for others that may not be doing it professionally 
or in a larger organisation, the benefit of spending all that time may 
not weigh up to the cost it has and the result is then that you keep 
stuck with a deeply suboptimal situation in which there is little or no 
reporting or fixing, all because the initial investment is too high.

Commonly provided scripts just hugely reduce that initial investment.

For example the bigger vs. smaller system I imagined. Yes I am eager to 
make it. But I got other stuff to do as well :p.

And then, when I've made it, chances are high no one will ever use it 
for years to come.

No one else I mean.


>> So for example you configure max size for snapshot. When snapshots 
>> exceeds size gets flagged for removal. But removal only happens when 
>> other condition is met (threshold reach).
> 
> We are blamed already for having way too much configurable knobs....

Yes but I think it is better to script these things anyway.

Any official mechanism is only going to be inflexible when it goes that 
far.

Like I personally don't like SystemD services compared to cronjobs. 
Systemd services take longer to set up, have to agree to a descriptive 
language, and so on.

Then you need to find out exactly what are the extents of the 
possibilities of that descriptive language, maybe there is a feature you 
do not know about yet, but you can probably also code it using knowledge 
you already have and for which you do not need to read any man pages.

So I do create those services.... for the boot sequence... but anything 
I want to run regularly I still do with a cron job...

It's a bit archaic to install but... it's simple, clean, and you have 
everything in one screen.

>> So you would have 5 different interventions you could use that could 
>> be considered somewhat standard and the admit can just pick and choose 
>> or customize.
>> 
> 
> And we have way longer list of actions we want to do ;) We have not
> yet come to any single conclusion how to make such thing manageable
> for a user...

Hmm.. Well I cannot ... claim to have the superior idea here.

But Idk... I think you can focus on the model right.

Maintaining max snapshot consumption is one model.

Freezing bigger volumes to protect space for smaller volumes is another 
model.

Doing so based on a "critical" flag is another model... (not myself such 
a fan of that)... (more to configure).

Reserving max, set or configured space for a specific volume is another 
model.

(That would be actually equivalent to a 'critical' flag since only those 
volumes that have reserved space would become 'critical' and their space 
reservation is going to be the threshold to decide when to deny other 
volumes more space).

So you can simply call the 'critical flag' idea the same as the 'space 
reservation' idea.

The basic idea is that all space reservations get added together and 
become a threshold.

So that's just one model and I think it is the most important one.

"Reserve space for certain volumes" (but not all of them or it won't 
work). ;-).

This is what Gionatan refered to with the ZFS ehm... shit :p.

And the topic of this email thread.




So you might as well focus on that one alone as per mr. Jonathan's 
reply.

(Pardon for my language there).




While personally I also like the bigger versus smaller idea because you 
don't have to configure it.


The only configuration you need to do is to ensure that the more 
important volumes are a bit smaller.

Which I like.

Then there is automatic space reservation using fsfreezing.

Because the free space required for bigger volumes is always going to be 
bigger than that of smaller volumes.



>> But how expensive is it to do it say every 5 seconds?
> 
> If you have big metadata - you would keep you Intel Core busy all the 
> time ;)
> 
> That's why we have those thresholds.
> 
> Script is called at  50% fullness, then when it crosses 55%, 60%, ...
> 95%, 100%. When it drops bellow threshold - you are called again once
> the boundary is crossed...

How do you know when it is at 50% fullness?

> If you are proud sponsor of your electricity provider and you like the
> extra heating in your house - you can run this in loop of course...

> Threshold are based on  mapped size for whole thin-pool.
> 
> Thin-pool surely knows all the time how many blocks are allocated and 
> free for
> its data and metadata devices.

But didn't you just say you needed to process up to 16GiB to know this 
information?

I am confused?

This means the in-kernel policy can easily be implemented.

You may not know the size and attribution of each device but you do know 
the overall size and availability?

>> In any case the only policy you could have in-kernel would be either 
>> what Gionatan proposed (fixed reserved space for certain volumes) 
>> (easy calculation right) or potentially allocation freeze at threshold 
>> for non-critical volumes,
> 
> 
> In the single thin-pool  all thins ARE equal.

But you could make them unequal ;-).

> Low number of 'data' block may cause tremendous amount of provisioning.
> 
> With specifically written data pattern you can (in 1 second!) cause
> provisioning of large portion of your thin-pool (if not the whole one
> in case you have small one in range of gigabytes....)

Because you only have to write a byte to every extent, yes.

> And that's the main issue - what we solve in  lvm2/dm  - we want to be
> sure that when thin-pool is FULL  -  written & committed data are
> secure and safe.
> Reboot is mostly unavoidable if you RUN from a device which is 
> out-of-space -
> we cannot continue to use such device - unless you add MORE space to
> it within 60second window.

That last part is utterly acceptable.

> All other proposals solve only very localized solution and problems
> which are different for every user.
> 
> I.e. you could have a misbehaving daemon filling your system device
> very fast with logs...
> 
> In practice - you would need some system analysis and detect which
> application causes highest pressure on provisioning  - but that's well
> beyond range lvm2 team ATM with the amount of developers can
> provide....

And any space reservation would probably not do much; if it is not 
filled 100% now, it will be so in a few seconds, in that sense.

The goal was more to protect the other volumes, supposing that log 
writing happened on another one, for that other log volume not to impact 
the other main volumes.

So you have thin global reservation of say 10GB.

Your log volume is overprovisioned and starts eating up the 20GB you 
have available and then runs into the condition that only 10GB remains.

The 10GB is a reservation maybe for your root volume. The system 
(scripts) (or whatever) recognises that less than 10GB remains, that you 
have claimed it for the root volume, and that the log volume is 
intruding upon that.

It then decides to freeze the log volume.

But it is hard to decide what volume to freeze because it would need 
that run-time analysis of what's going on. So instead you just freeze 
all non-reserved volumes.

So all non-critical volumes in Gionatan and Brassow's parlance.


>> I just still don't see how one check per 4MB would be that expensive 
>> provided you do data collection in background.
>> 
>> You say size can be as low as 64kB... well.... in that case...
> 
> Default chunk size if 64k for the best 'snapshot' sharing - the bigger
> the pool chunk is the less like you could 'share' it between
> snapshots...

Okay.. I understand. I guess I was deluded a bit by non-thin snapshot 
behaviour (filled up really fast without me understanding why, and 
concluding that it was doing 4MB copies).

As well as of course that extents were calculated in whole numbers in 
overviews... apologies.

But attribution of an extent to a snapshot will still be done in 
extent-sizes right?

So I was just talking about allocation, nothing else.

BUT if allocator operates on 64kB requests, then yes...

> (As pointed in other thread - ideal chunk for best snapshot sharing
> would be 4K - but that's not affordable for other reasons....)

Okay.

>>        2) I would freeze non-critical volumes ( I do not write to 
>> snapshots so that is no issue ) when critical volumes reached safety 
>> threshold in free space ( I would do this in-kernel if I could ) ( But 
>> Freezing In User-Space is almost the same ).
> 
> There are lots of troubles when you have freezed filesystems present
> in your machine fs tree... -  if you know all connections and
> restrictions - it can be 'possibly' useful - but I can't imagine this
> being useful in generic case...

Well, yeah. Linux.

(I mean, just a single broken NFS or CIFS connection can break so 
much....).



> And more for your thinking -
> 
> If you have pressure on provisioning caused by disk-load on one of
> your 'critical' volumes this FS 'freezeing' scripting will 'buy' you
> only couple seconds

Oh yeah of course, this is correct.

> (depends how fast drives you have and how big
> thresholds you will use) and you are in the 'exact' same situation -
> expect now you have  system in bigger troubles - and you already might
> have freezed other systems apps by having them accessing your
> 'low-prio' volumes....

Well I guess you would reduce non-critical volumes to single-purpose 
things.

Ie. only used by one application.

> And how you will be solving 'unfreezing' in cases thin-pool usage
> drops down is also pretty interesting topic on its own...

I guess that would be manual?

> I need to wish good luck when you will be testing and developing all
> this machinery.

Well as you say it has to be an anomaly in the first place -- an error 
or problem situation.

It is not standard operation.

So I don't think the problems of freezing are bigger than the problems 
of rebooting.

The whole idea is that you attribute non-critical volumes to single apps 
or single purposes so that when they run amock, or in any case, that if 
anything runs amock on them...

Yes it won't protect the critical volumes from being written to.

But that's okay.

You don't need to automatically unfreeze.

You need to send an email and say stuff has happened ;-).

"System is still running but some applications may have crashed. You 
will need to unfreeze and restart in order to solve it, or reboot if 
necessary. But you can still log into SSH, so maybe you can do it 
remotely without a console ;-)".

I don't see any issues with this.

One could say: use filesystem quotas.

Then that involves setting up users etc.

Setting up a quota for a specific user on a specific volume...

All more configuration.

And you're talking mostly about services of course.

The benefit (and danger) of LVM is that it is so easy to create more 
volumes.

(The danger being that you now also need to back up all these volumes).

(Independently).

>>> Default is to auto-extend thin-data & thin-metadata when needed if 
>>> you
>>> set threshold bellow 100%.
>> 
>> Q: In a 100% filled up pool, are snapshots still going to be valid?
>> 
>> Could it be useful to have a default policy of dropping snapshots at 
>> high consumption? (ie. 99%). But it doesn't have to be default if you 
>> can easily configure it and the scripts are available.
> 
> All snapshots/thins with 'fsynced' data are always secure.
> Thin-pool is protecting all user-data on disk.
> 
> The only lost data are those flying in your memory (unwritten on disk).
> And depends on you 'page-cache' setup how much that can be...

That seemes pretty secure. Thank you.

So there is no issue with snapshots behaving differently. It's all the 
same and all committed data will be safe prior to the fillup and not 
change afterward.

I guess.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-15  2:06                             ` Brassow Jonathan
  2017-09-15  6:02                               ` Gionatan Danti
@ 2017-09-15  8:37                               ` Xen
  1 sibling, 0 replies; 91+ messages in thread
From: Xen @ 2017-09-15  8:37 UTC (permalink / raw)
  To: linux-lvm

Brassow Jonathan schreef op 15-09-2017 4:06:

> There are many solutions that could work - unique to every workload
> and different user.  It is really hard for us to advocate for one of
> these unique solutions that may work for a particular user, because it
> may work very badly for the next well-intentioned googler.

Well, thank you.

Of course in the split between saying "it is the administrator's job 
that everyone works well" and at the same time saying that those 
administrators can be "googlers".

There's a big gap between that. I think that many who do employ thinp 
will be at least a bit more serious about it, but perhaps not as serious 
that they can devote all the resources to developing all of the 
mitigating measures that anyone could want.

So I think the common truth lies more in the middle: they are not 
googlers who implement the first random article they find without 
thinking about it, and they are not professional people in full time 
employment doing this thing.


So because of that fact that most administrators interested in thin like 
myself will have read LVM manpages a great deal already on their own 
systems...

And any common default targets for "thin_command" could also be well 
documented and explained, and pros and cons layed out.

The only thing we are talking about today is reserving space due to some 
threshold.

And performing an action when that reservation is threatened.

So this is the common need here.

This need is going to be the same for everyone that uses any scheme that 
could be offered.

Then the question becomes: are interventions also as common?

Well there are really only a few available:

a) turning into error volume as per the bug
b) fsfreezing
c) merely reporting
d) (I am not sure if "lvremove" should really be seriously considered).

At this point you have basically exhausted any default options you may 
have that are "general". No one actually needs more than that.

What becomes interesting now is the logic underpinning these decisions.

This logic needs some time to write and this is the thing that 
administrators will put off.

So they will live with not having any intelligence in automatic response 
and will just live with the risk of a volume filling up without having 
written the logic that could activate the above measures.

That's the problem.

So what I am advocating for -- I am not disregarding Mr. Zdenek's bug 
;-), [1], In fact I think this "lverror" would be very welcome 
(paraphrasing here) even though personally I would want to employ a 
filesystem mechanic if I am doing this using a userland too anyway!!!

But sure, why not.

I think that is complementary to and orthogonal to the issue of where 
the logic is coming from, and that the logic also requires a lot of 
resources to write.

So even though you could probably hack it together in some 15 minutes, 
and then you need testing etc...

I think it would just be a lot more pleasant if this logic framework 
already existed, was tried and tested, did the job correctly, and can 
easily be employed by anyone else.

So I mean to say that currently we are only talking about space 
reservation.


You can only do this in a number of ways:

- % of total volume size.

- fixed amount configured per volume

And that's basically it.

The former merely requires each volume to be 'flagged' as 'critical' as 
suggested.
The latter requires some number to be defined and then flagging is 
unnecessary.

The script would ensure that:

- not ALL thin volumes are 'critical'.
- as long as a single volume is non-critical, the operation can continue
- all critical volumes are aggregated in required free space
- the check is done against currently available free space
- the action on the non-critical-volumes is performed if necessary.

That's it. Anyone could use this.



The "Big vs. Small" model is a little bit more involved and requires a 
little bit more logic, and I would not mind writing it, but it follows 
along the same lines.

*I* say that in this department, *only* these two things are needed.

+ potentially the lverror thing.

So I don't really see this wildgrowth of different ideas.


So personally I would like the "set manual size" more than the "use 
percentage" in the above. I would not want to flag volumes as critical, 
I would just want to set their reserved space.

I would prefer if I could set this in the LVM volumes themselves, rather 
than in the script.

If the script used a percentage, I would want to be able to configure 
the percentage outside the script as well.

I would want the script to do the heavy lifting of knowing how to 
extract these values from the LVM volumes, and some information on how 
to put them there.

(Using tags and all of that is not all that common knowledge I think).

Basically, I want the script to know how to set and retrieve properties 
from the LVM volumes.

Then I want it to be easy to see the reserved space (potentially) 
(although this can conflict with not being a really integrated feature) 
and perhaps to set and change it...

So I think that what is required is really only minimal...

But that doesn't mean it is unnecessary.

> We’ve tried to strike a balance of doing the things that are knowably
> correct and getting 99% of the problems solved, and making the user
> aware of the remaining problems (like 100% full thin-provisioning)
> while providing them the tools (like the ‘thin_command’ setting) so
> they can solve the remaining case in the way that is best for them.

I am really happy to learn about these considerations.

I hope that we can see as the result of this today the inclusion of the 
script you mentioned in the previous email.

Something that hopefully would use values tagged into volumes, and a 
script that would need no modification by the user.

Something that would e.g. be called with the name of the thin pool as 
first parameter (pardon my ignorance) and would take care of all the 
rest by retrieving values tagged onto volumes.


( I mean that's what I would write, but if I were to write it probably 
no one else would ever use it, so .... (He says with a small voice) ).

And personally I would prefer this script to use "fsfreeze" as you 
mentioned (I was even not all that aware of this command...) rather than 
changing to an error target.

But who knows.

I am not saying it's a bad thing.

Seems risky though.

So honestly I just completely second the script you proposed, mr. 
Jonathan.

;-).


While I still don't know why any in-kernel thing is impossible, seeing 
that Zdenek-san mentioned overall block availability to be known, and 
that you only need overall block availability + some configured values 
to impose any sanctions on non-critical volumes.....

I would hardly feel a need for such a measure if the script mentioned 
and perhaps the other idea that I like so much of "big vs small" would 
be readily available.

I really have no other wishes than that personally.

It's that simple.

Space reservation and big to small protection.

Those are the only things I want.

Now all that's left to do is upgrade my LVM version ;-).

(Hate messing with a Debian install ;-)).

And I feel almost like writing it myself after having talked about it 
for so long anyway...

(But that's what happens when you develop ideas).



> We probably won’t be able to provide any highly refined scripts that
> users can just plug in for the behavior they want, since they are
> often so highly specific to each customer.  However, I think it will
> be useful to try to create better tools so that users can more easily
> get the behavior they want.  We want to travel as much distance toward
> the user as possible and make things as usable as we can for them.
> From this discussion, we have uncovered a handful of useful ideas
> (e.g. this bug that Zdenek filed:
> https://bugzilla.redhat.com/show_bug.cgi?id=1491609) that will make
> more robust scripts possible.  We are also enhancing our reporting
> tools so users can better sort through LVM information and take
> action.  Again, this is in direct response to the feedback we’ve
> gotten here.

Well that's very good. It took me a long time to sort through the 
information without the thinls command.

Indeed, if such data is readily available it makes the burden of writing 
the script yourself much less as well.

Still I would still vouch for the inclusion of the 2 scripts mentioned:

- space reservation
- big vs. small

And I don't mind writing the second one myself, or at least an example 
of it.

Regards.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1491609

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-15  7:34                             ` Xen
@ 2017-09-15  9:22                               ` Zdenek Kabelac
  2017-09-16 22:33                                 ` Xen
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-15  9:22 UTC (permalink / raw)
  To: LVM general discussion and development, Xen

Dne 15.9.2017 v 09:34 Xen napsal(a):
> Zdenek Kabelac schreef op 14-09-2017 21:05:
> 

>>> But if I do create snapshots (which I do every day) when the root and boot 
>>> snapshots fill up (they are on regular lvm) they get dropped which is nice,
>>
>> old snapshot are different technology for different purpose.
> 
> Again, what I was saying was to support the notion that having snapshots that 
> may grow a lot can be a problem.


lvm2 makes them look the same - but underneath it's very different (and it's 
not just by age - but also for targeting different purpose).

- old-snaps are good for short-time small snapshots - when there is estimation 
for having low number of changes and it's not a big issue if snapshot is 'lost'.

- thin-snaps are ideal for long-time living objects with possibility to take 
snaps of snaps of snaps and you are guaranteed the snapshot will not 'just 
dissapear' while you modify your origin volume...

Both have very different resources requirements and performance...

> I am not sure the purpose of non-thin vs. thin snapshots is all that different 
> though.
> 
> They are both copy-on-write in a certain sense.
> 
> I think it is the same tool with different characteristics.

That are cases where it's quite valid option to take  old-snap of thinLV and 
it will payoff...

Even exactly in the case you use thin and you want to make sure your temporary 
snapshot will not 'eat' all your thin-pool space and you want to let snapshot die.

Thin-pool still does not support shrinking - so if the thin-pool auto-grows to 
big size - there is not a way for lvm2 to reduce the thin-pool size...



> That's just the sort of thing that in the past I have been keeping track of 
> continuously (in unrelated stuff) such that every mutation also updated the 
> metadata without having to recalculate it...

Would you prefer to spend all you RAM to keep all the mapping information for 
all the volumes and put very complex code into kernel to parse the information 
which is technically already out-of-data in the moment you get the result ??

In 99.9% of runtime you simply don't need this info.

> But the purpose of what you're saying is that the number of uniquely owned 
> blocks by any snapshot is not known at any one point in time.

As long as 'thinLV' (i.e. your snapshot thinLV) is NOT active - there is 
nothing in kernel maintaining its dataset.  You can have lots of thinLV active 
and lots of other inactive.


> Well pardon me for digging this deeply. It just seemed so alien that this 
> thing wouldn't be possible.

I'd say it's very smart ;)

You can use only very small subset of 'metadata' information for individual 
volumes.
> 
> It becomes a rather big enterprise to install thinp for anyone!!!

It's enterprise level software ;)

> Because to get it running takes no time at all!!! But to get it running well 
> then implies huge investment.

In most common scenarios - user knows when he runs out-of-space - it will not 
be 'pleasant' experience - but users data should be safe.

And then it depends how much energy/time/money user wants to put into 
monitoring effort to minimize downtime.

As has been said - disk-space is quite cheap.
So if you monitor and insert your new disk-space in-time  (enterprise...)  you 
have less set of problems - then if you try to fight constantly with 100% full 
thin-pool...

You have still problems even when you have 'enough' disk-space ;)
i.e. you select small chunk-size and you want extend thin-pool data volume 
beyond addressable capacity -  each chunk-size has its final maximum data size....

> That means for me and for others that may not be doing it professionally or in 
> a larger organisation, the benefit of spending all that time may not weigh up 
> to the cost it has and the result is then that you keep stuck with a deeply 
> suboptimal situation in which there is little or no reporting or fixing, all 
> because the initial investment is too high.

You can always use normal device - it's really about the choice and purpose...


> 
> While personally I also like the bigger versus smaller idea because you don't 
> have to configure it.

I'm still proposing to use different pools for different purposes...

Sometimes spreading the solution across existing logic is way easier,
then trying to achieve some super-inteligent universal one...

>> Script is called at  50% fullness, then when it crosses 55%, 60%, ...
>> 95%, 100%. When it drops bellow threshold - you are called again once
>> the boundary is crossed...
> 
> How do you know when it is at 50% fullness?
> 
>> If you are proud sponsor of your electricity provider and you like the
>> extra heating in your house - you can run this in loop of course...
> 
>> Threshold are based on  mapped size for whole thin-pool.
>>
>> Thin-pool surely knows all the time how many blocks are allocated and free for
>> its data and metadata devices.
> 
> But didn't you just say you needed to process up to 16GiB to know this 
> information?

Of course thin-pool has to be aware how much free space it has.
And this you can somehow imagine as 'hidden' volume with FREE space...

So to give you this 'info' about  free blocks in pool - you maintain very 
small metadata subset - you don't need to know about all other volumes...

If other volume is releasing or allocation chunks - your 'FREE space' gets 
updated....

It's complex underneath and locking is very performance sensitive - but for 
easy understanding you can possibly get the picture out of this...

> 
> You may not know the size and attribution of each device but you do know the 
> overall size and availability?

Kernel support 1 setting for threshold - where the user-space (dmeventd) is 
waked-up when usage has passed it.

The mapping of value is lvm.conf autoextend threshold.

As a 'secondary' source - dmeventd checks every 10 second pool fullness with 
single ioctl() call and compares how the fullness has changed and provides you 
with callbacks for those  50,55...  jumps
(as can be found in  'man dmeventd')

So for autoextend theshold passing you get instant call.
For all others there is up-to 10 second delay for discovery.

>> In the single thin-pool  all thins ARE equal.
> 
> But you could make them unequal ;-).

I cannot ;)  - I'm lvm2 coder -   dm thin-pool is Joe's/Mike's toy :)

In general - you can come with many different kernel modules which take 
different approach to the problem.

Worth to note -  RH has now Permabit  in its porfolio - so there can more then 
one type of thin-provisioning supported in lvm2...

Permabit solution has deduplication, compression, 4K blocks - but no snapshots....


> 
> The goal was more to protect the other volumes, supposing that log writing 
> happened on another one, for that other log volume not to impact the other 
> main volumes.

IMHO best protection is different pool for different thins...
You can more easily decide which pool can 'grow-up'
and which one should rather be taken offline.

So your 'less' important data volumes may simply hit the wall hard,
while your 'strategically important' one will avoid using overprovisioning as 
much as possible to keep it running.

Motto: keep it simple ;)

> So you have thin global reservation of say 10GB.
> 
> Your log volume is overprovisioned and starts eating up the 20GB you have 
> available and then runs into the condition that only 10GB remains.
> 
> The 10GB is a reservation maybe for your root volume. The system (scripts) (or 
> whatever) recognises that less than 10GB remains, that you have claimed it for 
> the root volume, and that the log volume is intruding upon that.
> 
> It then decides to freeze the log volume.

Of course you can play with 'fsfreeze' and other things - but all these things 
are very special to individual users with their individual preferences.

Effectively if you freeze your 'data' LV - as a reaction you may paralyze the 
rest of your system - unless you know the 'extra' information about the user 
use-pattern.

But do not take this as something to discourage you to try it - you may come 
with perfect solution for your particular system  - and some other user may 
find it useful in some similar pattern...

It's just something that lvm2 can't give support globally.

But lvm2 will give you enough bricks for writing 'smart' scripts...

> Okay.. I understand. I guess I was deluded a bit by non-thin snapshot 
> behaviour (filled up really fast without me understanding why, and concluding 
> that it was doing 4MB copies).

Fast disks are now easily able to write gigabytes in second... :)

> 
> But attribution of an extent to a snapshot will still be done in extent-sizes 
> right?

Allocation unit in VG  is 'extent'   - ranges from 1sector to 4GiB
and default is 4M - yes....

> 
> So I don't think the problems of freezing are bigger than the problems of 
> rebooting.

With 'reboot' you know where you are -  it's IMHO fair condition for this.

With frozen FS and paralyzed system and your 'fsfreeze' operation of 
unimportant volumes actually has even eaten the space from thin-pool which may 
possibly been used better to store data for important volumes....
and there is even big danger you will 'freeze' yourself already during call of 
fsfreeze  (unless you of course put BIG margins around)


> 
> "System is still running but some applications may have crashed. You will need 
> to unfreeze and restart in order to solve it, or reboot if necessary. But you 
> can still log into SSH, so maybe you can do it remotely without a console ;-)".

Compare with  email:

Your system has run out-of-space, all actions to gain some more space has 
failed  - going to reboot into some 'recovery' mode

> 
> So there is no issue with snapshots behaving differently. It's all the same 
> and all committed data will be safe prior to the fillup and not change afterward.

Yes - snapshot is 'user-land' language  -  in kernel - all thins  maps chunks...

If you can't map new chunk - things is going to stop - and start to error 
things out shortly...

Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-15  9:22                               ` Zdenek Kabelac
@ 2017-09-16 22:33                                 ` Xen
  2017-09-17  6:31                                   ` Xen
  2017-09-18  8:56                                   ` Zdenek Kabelac
  0 siblings, 2 replies; 91+ messages in thread
From: Xen @ 2017-09-16 22:33 UTC (permalink / raw)
  To: linux-lvm

Zdenek Kabelac schreef op 15-09-2017 11:22:

> lvm2 makes them look the same - but underneath it's very different
> (and it's not just by age - but also for targeting different purpose).
> 
> - old-snaps are good for short-time small snapshots - when there is
> estimation for having low number of changes and it's not a big issue
> if snapshot is 'lost'.
> 
> - thin-snaps are ideal for long-time living objects with possibility
> to take snaps of snaps of snaps and you are guaranteed the snapshot
> will not 'just dissapear' while you modify your origin volume...
> 
> Both have very different resources requirements and performance...

Point being that short-time small snapshots are also perfectly served by 
thin...

So I don't really think there are many instances where "old" trumps 
"thin".

Except, of course, if the added constraint is a plus (knowing in advance 
how much it is going to cost).

But that's the only thing: predictability.

I use my regular and thin snapshots for the same purpose. Of course you 
can do more with Thin.

> That are cases where it's quite valid option to take  old-snap of
> thinLV and it will payoff...
> 
> Even exactly in the case you use thin and you want to make sure your
> temporary snapshot will not 'eat' all your thin-pool space and you
> want to let snapshot die.

Right.

That sounds pretty sweet actually. But it will be a lot slower right.

I currently just make new snapshots each day. They live for an entire 
day. If the system wants to make a backup of the snapshot it has to do 
it within the day ;-).

My root volume is not on thin and thus has an "old-snap" snapshot. If 
the snapshot is dropped it is because of lots of upgrades but this is no 
biggy; next week the backup will succeed. Normally the root volume 
barely changes.

So it would be possible to reserve regular LVM space for thin volumes as 
well right, for snapshots, as you say below. But will this not slow down 
all writes considerably more than a thin snapshot?

So while my snapshots are short-lived, they are always there.

The current snapshot is always of 0:00.

> Thin-pool still does not support shrinking - so if the thin-pool
> auto-grows to big size - there is not a way for lvm2 to reduce the
> thin-pool size...

Ah ;-). A detriment of auto-extend :p.

>> That's just the sort of thing that in the past I have been keeping 
>> track of continuously (in unrelated stuff) such that every mutation 
>> also updated the metadata without having to recalculate it...
> 
> Would you prefer to spend all you RAM to keep all the mapping
> information for all the volumes and put very complex code into kernel
> to parse the information which is technically already out-of-data in
> the moment you get the result ??

No if you only kept some statistics that would not amount to all the 
mapping data but only to a summary of it.

Say if you write a bot that plays a board game. While searching for 
moves the bot has to constantly perform moves on the board. It can 
either create new board instances out of every move, or just mutate the 
existing board and be a lot faster.

In mutating the board it will each time want the same information as 
before: how many pieces does the white player have, how many pieces the 
black player, and so on.

A lot of this information is easier to update than to recalculate, that 
is, the moves themselves can modify this summary information, rather 
than derive it again from the board positions.

This is what I mean by "updating the metadata without having to 
recalculate it".

You wouldn't have to keep the mapping information in RAM, just the 
amount of blocks attributed and so on. A single number. A few single 
numbers for each volume and each pool.

No more than maybe 32 bytes, I don't know.

It would probably need to be concurrently updated, but that's what it 
is.

You just maintain summary information that you do not recalculate, but 
just modify each time an action is performed.

>> But the purpose of what you're saying is that the number of uniquely 
>> owned blocks by any snapshot is not known at any one point in time.
> 
> As long as 'thinLV' (i.e. your snapshot thinLV) is NOT active - there
> is nothing in kernel maintaining its dataset.  You can have lots of
> thinLV active and lots of other inactive.

But if it's not active, can it still 'trace' another volume? Ie. it has 
to get updated if it is really a snapshot of something right.

If it doesn't get updated (and not written to) then it also does not 
allocate new extents.

So then it never needs to play a role in any mechanism needed to prevent 
allocation.

However volumes that see new allocation happening for them, would then 
always reside in kernel memory right.

You said somewhere else that overall data (for pool) IS available. But 
not for volumes themselves?

Ie. you don't have a figure on uniquely owned vs. shared blocks.

I get that it is not unambiguous to interpret these numbers.

Regardless with one volume as "master" I think a non-ambiguous 
interpretation arises?

So is or is not the number of uniquely owned/shared blocks known for 
each volume at any one point in time?

>> Well pardon me for digging this deeply. It just seemed so alien that 
>> this thing wouldn't be possible.
> 
> I'd say it's very smart ;)

You mean not keeping everything in memory.

> You can use only very small subset of 'metadata' information for
> individual volumes.

But I'm still talking about only summary information...


>> It becomes a rather big enterprise to install thinp for anyone!!!
> 
> It's enterprise level software ;)

Well I get that you WANT that ;-).

However with the appropriate amount of user friendliness what was first 
only for experts can be simply for more ordinary people ;-).

I mean, kuch kuch, if I want some SSD caching in Microsoft Windows, kuch 
kuch, I right click on a volume in Windows Explorer, select properties, 
select ReadyBoost tab, click "Reserve complete volume for ReadyBoost", 
click okay, and I'm done.

It literally takes some 10 seconds to configure SSD caching on such a 
machine.

Would probably take me some 2 hours in Linux not just to enter the 
commands but also to think about how to do it.

Provided I don't end up with the SSD kernel issues with IO queue 
bottlenecking I had before...

Which, I can tell you, took a multitude of those 2 hours with the 
conclusion that the small mSata SSD I had was just not suitable, much 
like some USB device.


For example, OpenVPN clients on Linux are by default not configured to 
automatically reconnect when there is some authentication issue (which 
could be anything, including a dead link I guess) and will thus simply 
quit at the smallest issue. It then needs the "auth-retry nointeract" 
directive to keep automatically reconnecting.

But on any Linux machine the command line version of OpenVPN is going to 
be probably used as an unattended client.

So it made no sense to have to "figure this out" on your own. An 
enterprise will be able to do so yes.

But why not make it easier...

And even if I were an enterprise, I would still want:

- ease of mind
- sane defaults
- if I make a mistake the earth doesn't explode
- If I forget to configure something it will have a good default
- System is self-contained and doesn't need N amount of monitoring 
systems before it starts working

> In most common scenarios - user knows when he runs out-of-space - it
> will not be 'pleasant' experience - but users data should be safe.

Yes again, apologies, but I was basing myself on Kernel 4.4 in Debian 8 
with LVM 2.02.111 which, by now, is three years old hahaha.

Hehe, this is my self-made reporting tool:

Subject: Snapshot linux/root-snap has been umounted

Snapshot linux/root-snap has been unmounted from /srv/root because it 
filled up to a 100%.

Log message:

Sep 16 22:37:58 debian lvm[16194]: Unmounting invalid snapshot 
linux-root--snap from /srv/root.

Earlier messages:

Sep 16 22:37:52 debian lvm[16194]: Snapshot linux-root--snap is now 97% 
full.
Sep 16 22:37:42 debian lvm[16194]: Snapshot linux-root--snap is now 93% 
full.
Sep 16 22:37:32 debian lvm[16194]: Snapshot linux-root--snap is now 86% 
full.
Sep 16 22:37:22 debian lvm[16194]: Snapshot linux-root--snap is now 82% 
full.

Now do we or do we not upgrade to Debian Stretch lol.

> And then it depends how much energy/time/money user wants to put into
> monitoring effort to minimize downtime.

Well yes but this is exacerbated by say this example of OpenVPN having 
bad defaults. If you can't figure out why your connection is not 
maintained now you need monitoring script to automatically restart it.

If something is hard to recover from, now you need monitoring script to 
warn you plenty ahead of time so you can prevent it, etc.

If the monitoring script can fail, now you need a monitoring script to 
monitor the monitoring script ;-).

System admins keep busy ;-).

> As has been said - disk-space is quite cheap.
> So if you monitor and insert your new disk-space in-time
> (enterprise...)  you have less set of problems - then if you try to
> fight constantly with 100% full thin-pool...

In that case it's more of a safety measure. But a bit pointless if you 
don't intend to keep growing your data collection.

Ie. you could keep an extra disk in your system for this purpose, but 
then you can't shrink the thing as you said once it gets used ;-).

That makes it rather pointless to have it as a safety net for a system 
that is not meant to expand ;-).


> You can always use normal device - it's really about the choice and 
> purpose...

Well the point is that I never liked BTRFS.

BTRFS has its own set of complexities and people running around and 
tumbling over each other in figuring out how to use the darn thing. 
Particularly with regards to the how-to of using subvolumes, of which 
there seem to be many different strategies.

And then Red Hat officially deprecates it for the next release. Hmmmmm.

So ZFS has very linux-unlike command set.

Its own universe.

LVM in general is reasonably customer-friendly or user-friendly. 
Configuring cache volumes etc. is not that easy but also not that 
complicated. Configuring RAID is not very hard compared to mdadm 
although it remains a bit annoying to have to remember pretty explicit 
commands to manage it.

But rebuilding e.g. RAID 1 sets is pretty easy and automatic.

Sometimes there is annoying stuff like not being able to change a volume 
group (name) when a PV is missing, but if you remove the PV how do you 
put it back in? And maybe you don't want to... well whatever.

I guess certain things are difficult enough that you would really want a 
book about it, and having to figure it out is fun the first time but 
after that a chore.

So I am interested in developing "the future" of computing you could 
call it.

I believe that using multiple volumes is "more natural" than a single 
big partition.

But traditionally the "single big partition" is the only way to get a 
flexible arrangement of free space.

So when you move towards multiple (logical) volumes, you lose that 
flexibility that you had before.

The only way to solve that is by making those volumes somewhat virtual.

And to have them draw space from the same big pool.

So you end up with thin provisioning. That's all there is to it.

>> While personally I also like the bigger versus smaller idea because 
>> you don't have to configure it.
> 
> I'm still proposing to use different pools for different purposes...

You mean use a different pool for that one critical volume that can't 
run out of space.


This goes against the idea of thin in the first place. Now you have to 
give up the flexibility that you seek or sought in order to get some 
safety because you cannot define any constraints within the existing 
system without separating physically.

> Sometimes spreading the solution across existing logic is way easier,
> then trying to achieve some super-inteligent universal one...

I get that... building a wall between two houses is easier than having 
to learn to live together.

But in the end the walls may also kill you ;-).

Now you can't share washing machine, you can't share vacuum cleaner, you 
have to have your own copy of everything, including bath rooms, toilet, 
etc.

Even though 90% of the time these things go unused.

So resource sharing is severely limited by walls.

Total cost of services goes up.


>> But didn't you just say you needed to process up to 16GiB to know this 
>> information?
> 
> Of course thin-pool has to be aware how much free space it has.
> And this you can somehow imagine as 'hidden' volume with FREE space...
> 
> So to give you this 'info' about  free blocks in pool - you maintain
> very small metadata subset - you don't need to know about all other
> volumes...

Right, just a list of blocks that are free.

> If other volume is releasing or allocation chunks - your 'FREE space'
> gets updated....

That's what I meant by mutating the data (summary).

> It's complex underneath and locking is very performance sensitive -
> but for easy understanding you can possibly get the picture out of
> this...

I understand, but does this mean that the NUMBER of free blocks is also 
always known?

So isn't the NUMBER of used/shared blocks in each DATA volume also 
known?

>> You may not know the size and attribution of each device but you do 
>> know the overall size and availability?
> 
> Kernel support 1 setting for threshold - where the user-space
> (dmeventd) is waked-up when usage has passed it.
> 
> The mapping of value is lvm.conf autoextend threshold.
> 
> As a 'secondary' source - dmeventd checks every 10 second pool
> fullness with single ioctl() call and compares how the fullness has
> changed and provides you with callbacks for those  50,55...  jumps
> (as can be found in  'man dmeventd')
> 
> So for autoextend theshold passing you get instant call.
> For all others there is up-to 10 second delay for discovery.

But that's about the 'free space'.

What about the 'used space'. Could you, potentially, theoretically, set 
a threshold for that? Or poll for that?

I mean the used space of each volume.

>> But you could make them unequal ;-).
> 
> I cannot ;)  - I'm lvm2 coder -   dm thin-pool is Joe's/Mike's toy :)
> 
> In general - you can come with many different kernel modules which
> take different approach to the problem.
> 
> Worth to note -  RH has now Permabit  in its porfolio - so there can
> more then one type of thin-provisioning supported in lvm2...
> 
> Permabit solution has deduplication, compression, 4K blocks - but no
> snapshots....

Hmm, sounds too 'enterprise' for me ;-).

In principle it comes down to the same thing... one big pool of storage 
and many views onto it.

Deduplication is natural part of that...

Also for backup purposes mostly.

You can have 100 TB worth of backups only using 5 TB.

Without having to primitively hardlink everything.

And maintaining complete trees of every backup open on your 
filesystem.... no usage of archive formats...

If the system can hardlink blocks instead of files, that is very 
interesting.

Of course snapshots (thin) are also views onto the dataset.

That's the point of sharing.

But sometimes you live in the same house and you want a little room for 
yourself ;-).

But in any case...

Of course if you can only change lvm2, maybe nothing of what I said was 
ever possible.

But I thought you also spoke of possibilities including the possibility 
of changing the device mapper, saying it is impossible what I want :p.

IF you could change the device mapper, THEN could it be possible to 
reserve allocation space for a single volume???

All you have to do is lie to the other volumes when they want to know 
how much space is available ;-).

Or something of the kind.

Logically there are only two conditions:

- virtual free space for critical volume is smaller than its reserved 
space
- virtual free space for critical volume is bigger than its reserved 
space

If bigger, then all the reserved space is necessary to stay free
If smaller, then we don't need as much.

But it probably also doesn't hurt.

So 40GB virtual volume has 5GB free but reserved space is 10GB.

Now real reserved space also becomes 5GB.

So for this system to work you need only very limited data points:

- unallocated extents of virtual 'critical' volumes (1 number for each 
'critical' volume)
- total amount of free extents in pool

And you're done.

+ the reserved space for each 'critical volume'.

So say you have 2 critical volumes:

virtual size      reserved space
     10GB                500MB
     40GB                 10GB

Total reserved space is 10.5GB

If second one has allocated 35GB, only could possibly need 5GB more, so 
figure changes to

   5.5GB reserved space

Now other volumes can't touch that space, when the available free space 
in entire pool becomes <= 5.5GB, allocation fails for non-critical 
volumes.

It really requires very limited information.

- free extents for all critical volumes (unallocated as per the virtual 
size)
- total amount free extents in pool
- max space reservation for each critical volume

And you're done. You now have a working system. This is the only 
information the allocator needs to employ this strategy.

No full maps required.

If you have 2 critical volumes, this is a total of 5 numbers.

This is 40 bytes of data@most.


>> The goal was more to protect the other volumes, supposing that log 
>> writing happened on another one, for that other log volume not to 
>> impact the other main volumes.
> 
> IMHO best protection is different pool for different thins...
> You can more easily decide which pool can 'grow-up'
> and which one should rather be taken offline.

Yeah yeah.

But that is like avoiding the problem, so there doesn't need to be a 
solution.

> Motto: keep it simple ;)

The entire idea of thin provisioning is to not keep it simple ;-).

Same goes for LVM.

Otherwise we'd be still using physical partitions.


>> So you have thin global reservation of say 10GB.
>> 
>> Your log volume is overprovisioned and starts eating up the 20GB you 
>> have available and then runs into the condition that only 10GB 
>> remains.
>> 
>> The 10GB is a reservation maybe for your root volume. The system 
>> (scripts) (or whatever) recognises that less than 10GB remains, that 
>> you have claimed it for the root volume, and that the log volume is 
>> intruding upon that.
>> 
>> It then decides to freeze the log volume.
> 
> Of course you can play with 'fsfreeze' and other things - but all
> these things are very special to individual users with their
> individual preferences.
> 
> Effectively if you freeze your 'data' LV - as a reaction you may
> paralyze the rest of your system - unless you know the 'extra'
> information about the user use-pattern.

Many things only work if the user follows a certain model of behaviour.

The whole idea of having a "critical" versus a "non-critical" volume is 
that you are going to separate the dependencies such that a failure of 
the "non-critical" volume will not be "critical" ;-).

So the words themselves predict that anyone employing this strategy will 
ensure that the non-critical volumes are not critically depended upon 
;-).

> But do not take this as something to discourage you to try it - you
> may come with perfect solution for your particular system  - and some
> other user may find it useful in some similar pattern...
> 
> It's just something that lvm2 can't give support globally.

I think the model is clean enough that you can provide at least a 
skeleton script for it...

But that was already suggested you know, so...


If people want different intervention than "fsfreeze" that is perfectly 
fine.

Most of the work goes into not deciding the intervention (that is 
usually simple) but in writing the logic.

(Where to store the values, etc.).

(Do you use LVM tags, how to use that, do we read some config file 
somewhere else, etc.).

Only reason to provide skeleton script with LVM is to lessen the burden 
on all those that would like to follow that separation of critical vs. 
non-critical.

The big vs. small idea is extension of that.

Of course you don't have to support it in that sense personally.

But logical separation of more critical vs. less critical of course 
would require you to also organize your services that way.

If you have e.g. three levels of critical services (A B C) and three 
levels of critical volumes (X Y Z) then:

A (most critical)   B (intermediate)   C (least critical)
         |               ___/|     _______/  ___/|
         |           ___/   _|____/      ___/    |
         |       ___/  ____/ |       ___/        |
         |   ___/_____/      |   ___/            |
         |  /                |  /                |
X (most critical)   Y (intermediate)   Z (least critical)

Service A can only use volume X
Service B can use both X and Y
Service C can use X Y and Z.

This is the logical separation you must make if "critical" is going to 
have any value.

> But lvm2 will give you enough bricks for writing 'smart' scripts...

I hope so.

It is just convenient if certain models are more mainstream or more easy 
to implement.

Instead of each person having to reinvent the wheel...

But anyway.

I am just saying that the simple thing Sir Jonathan offered would 
basically implement the above.

It's not very difficult, just a bit of level-based separation of orders 
of importance.

Of course the user (admin) is responsible for ensuring that programs 
actually agree with it.

>> So I don't think the problems of freezing are bigger than the problems 
>> of rebooting.
> 
> With 'reboot' you know where you are -  it's IMHO fair condition for 
> this.
> 
> With frozen FS and paralyzed system and your 'fsfreeze' operation of
> unimportant volumes actually has even eaten the space from thin-pool
> which may possibly been used better to store data for important
> volumes....

Fsfreeze would not eat more space than was already eaten.

A reboot doesn't change anything about that either.

If you don't freeze it (and neither reboot) the whole idea is that more 
space would be eaten than was already.

So not doing anything is not a solution (and without any measures in 
place like this, the pool would be full).

So we know why we want reserved space; it was already rapidly being 
depleted.

> and there is even big danger you will 'freeze' yourself already during
> call of fsfreeze  (unless you of course put BIG margins around)

Well I didn't say fsfreeze was the best high level solution anyone could 
ever think of.

But I think freezing a less important volume should ... according to the 
design principles laid out above... not undermine the rest of the 
'critical' system.

That's the whole idea right.

Again not suggesting everyone has to follow that paradigm.

But if you're gonna talk about critical vs. non-critical, the admin has 
to pursue that idea throughout the entire system.

If I freeze a volume only used by a webserver... I will only freeze the 
webserver... not anything else?


>> "System is still running but some applications may have crashed. You 
>> will need to unfreeze and restart in order to solve it, or reboot if 
>> necessary. But you can still log into SSH, so maybe you can do it 
>> remotely without a console ;-)".
> 
> Compare with  email:
> 
> Your system has run out-of-space, all actions to gain some more space
> has failed  - going to reboot into some 'recovery' mode

Actions to gain more space in this case only amounts to dropping 
snapshots, otherwise we are talking much more aggressive policy.

So now your system has rebooted and is in a recovery mode. Your system 
ran 3 different services. SSH/shell/email/domain etc, webserver and 
providing NFS mounts.

Very simple example right.

Your webserver had dedicated 'less critical' volume.

Some web application overflowed, user submitted lots of data, etc.

Web application volume is frozen.

(Or web server has been shut down, same thing here).

- Now you can still SSH, system still receives and sends email
- You can still access filesystems using NFS

Compare to recovery console:

- SSH doesn't work, you need Console
- email isn't received nor sent
- NFS is unavailable
- pings to domain don't work
- other containers go offline too
- entire system is basically offline.

Now for whatever reason you don't have time to solve the problem.

System is offline for a week. Emails are thrown away, not received, you 
can't ssh and do other tasks, you may be able to clean the mess but you 
can't put the server online (webserver) in case it happens again.

You need time to deal with it but in the meantime entire system was 
offline. You have to manually reboot and shut down web application.

But in our proposed solution, the script already did that for you.

So same outcome. Less intervention from you required.

Better to keep the system running partially than not at all?

SSH access is absolute premium in many cases.

>> So there is no issue with snapshots behaving differently. It's all the 
>> same and all committed data will be safe prior to the fillup and not 
>> change afterward.
> 
> Yes - snapshot is 'user-land' language  -  in kernel - all thins  maps 
> chunks...
> 
> If you can't map new chunk - things is going to stop - and start to
> error things out shortly...

I get it.

We're going to prevent them from mapping new chunks ;-).

Well.

:p.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-16 22:33                                 ` Xen
@ 2017-09-17  6:31                                   ` Xen
  2017-09-17  7:10                                     ` Xen
  2017-09-18  8:56                                   ` Zdenek Kabelac
  1 sibling, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-17  6:31 UTC (permalink / raw)
  To: linux-lvm

Xen schreef op 17-09-2017 0:33:

> But if it's not active, can it still 'trace' another volume? Ie. it
> has to get updated if it is really a snapshot of something right.
> 
> If it doesn't get updated (and not written to) then it also does not
> allocate new extents.

Oh now I get what you mean.

If it's not active it can also in that sense not reserve any extents for 
itself.

So the calculations I proposed way below require at least 2 numbers for 
each 'critical' volume to be present in the kernel.

Which is the unallocated virtual size and the reserved space.

So even if they're not active they would need to provide this 
information somehow.

Of course the information also doesn't change if it's not active, so it 
would just be 2 static numbers.

But then what happens if you change the reserved space etc...

In any case that sorta thing would indeed be required...

(For an in-kernel thing...)

(Also any snapshot of a critical volume would not in itself become a 
critical volume...)

(But you're saying that "thin snapshot" is not a kernel concept...)

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-17  6:31                                   ` Xen
@ 2017-09-17  7:10                                     ` Xen
  2017-09-18 19:20                                       ` Gionatan Danti
  0 siblings, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-17  7:10 UTC (permalink / raw)
  To: linux-lvm

Xen schreef op 17-09-2017 8:31:

> If it's not active it can also in that sense not reserve any extents 
> for itself.

But if it's not active I don't see why it should be critical or why you 
should reserve space for it to be honest...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-16 22:33                                 ` Xen
  2017-09-17  6:31                                   ` Xen
@ 2017-09-18  8:56                                   ` Zdenek Kabelac
  1 sibling, 0 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-18  8:56 UTC (permalink / raw)
  To: LVM general discussion and development, Xen

Dne 17.9.2017 v 00:33 Xen napsal(a):
> Zdenek Kabelac schreef op 15-09-2017 11:22:
> 
>> lvm2 makes them look the same - but underneath it's very different
>> (and it's not just by age - but also for targeting different purpose).
>>
>> - old-snaps are good for short-time small snapshots - when there is
>> estimation for having low number of changes and it's not a big issue
>> if snapshot is 'lost'.
>>
>> - thin-snaps are ideal for long-time living objects with possibility
>> to take snaps of snaps of snaps and you are guaranteed the snapshot
>> will not 'just dissapear' while you modify your origin volume...
>>
>> Both have very different resources requirements and performance...
> 
> Point being that short-time small snapshots are also perfectly served by thin...

if you take into account other the constrain - like necessity of planning 
small chunk sizes for thin-pool to have reasonably efficient snapshots,
not so small memory footprint - there are cases where short lived
snapshot is simply better choice.

> My root volume is not on thin and thus has an "old-snap" snapshot. If the 
> snapshot is dropped it is because of lots of upgrades but this is no biggy; 
> next week the backup will succeed. Normally the root volume barely changes.

And you can really have VERY same behavior WITH thin-snaps.

All you need to do is - 'erase' your inactive thin volume snapshot before 
thin-pool switches to out-of-space mode.

You really have A LOT of time (60seconds) to do this - even when thin-pool 
hits 100% fullness.

All you need to do is to write your 'specific' maintenance mode that will 
'erase' volumes tagged/named with some specific name, so you can easily find 
those LVs and 'lvremove' them when thin-pool is getting out of the space.

That's the advantage of 'inactive' snapshot.

If you have snapshot 'active' - you need to kill 'holders' (backup software),
umount volume and remove it.

Again - quite reasonably simple task when you know all 'variables'.

Hardly doable at generic level....


> So it would be possible to reserve regular LVM space for thin volumes as well 

'reserve'  can't really be 'generic'.
Everyone has different view on what is 'safe' reserve.
And you loose a lot of space in unusable reserves...

I.e. think about  2000LV in single thin-pool - and design reserves....
Start to 'think big' instead of focusing on 3 thinLVs...


>> Thin-pool still does not support shrinking - so if the thin-pool
>> auto-grows to big size - there is not a way for lvm2 to reduce the
>> thin-pool size...
> 
> Ah ;-). A detriment of auto-extend :p.

Yep - that's why we have not enable 'autoresize' by default.

It's admin decision ATM whether the free space in VG should be used by 
thin-pool or something else.

It would be better is  there would be shrinking support - but it's not yet here...

> No if you only kept some statistics that would not amount to all the mapping 
> data but only to a summary of it.

Why should kernel be doing some complex statistic management ?

(Again 'think-big' - kernel is not supposed to be parsing ALL metadata ALL the 
time -  really  - in this case we could 'drop' all the user-space :) and shift 
everything to kernel - and we end with similar complexity of kernel code as 
the btrfs has....


> Say if you write a bot that plays a board game. While searching for moves the 
> bot has to constantly perform moves on the board. It can either create new 
> board instances out of every move, or just mutate the existing board and be a 
> lot faster.

Such bot  KNOW all the combination.. - you are constantly forgetting  thin 
volume target maps very small portion of the whole metadatata set.

> A lot of this information is easier to update than to recalculate, that is, 
> the moves themselves can modify this summary information, rather than derive 
> it again from the board positions.


Maybe you should try to write a chess-player then - AFAIK it's purely based on 
brutal CPU power and massive library of know 'starts' & 'finish'....

Your simplification proposal 'with summary' seems to be quite innovative here...


> This is what I mean by "updating the metadata without having to recalculate it".

When you propose is very different thin-pool architecture - so you should try 
to talk with it's authors -  I can only provide you with  'lvm2' abstraction 
level details.

I cannot change kernel level....

The ideal upstreaming mechanism for a new target is to provide some at least 
basic implementation proving the concept can work.

And you should also show how is this complicated kernel code giving any better 
result then current user-space solution we provide.


> You wouldn't have to keep the mapping information in RAM, just the amount of 
> blocks attributed and so on. A single number. A few single numbers for each 
> volume and each pool.

It really means  - kernel would need to read ALL data,
and do ALL validation in kernel   (which is currently work made in use-space)

Hopefully it's finally cleat at this point.

> But if it's not active, can it still 'trace' another volume? Ie. it has to get 
> updated if it is really a snapshot of something right.

Inactive volume CANNOT change - so it doesn't need to be traced.

> If it doesn't get updated (and not written to) then it also does not allocate 
> new extents.

Allocation of new chunks always happen for an active thin LV.

> However volumes that see new allocation happening for them, would then always 
> reside in kernel memory right.
> 
> You said somewhere else that overall data (for pool) IS available. But not for 
> volumes themselves?

Yes -  kernel knows how many 'free' chunks are in POOL.
Kernel does NOT know  how many individual chunks belongs to single thinLVs.

> Regardless with one volume as "master" I think a non-ambiguous interpretation 
> arises?

There is no 'master' volume.

All thinLVs   are equal - and present only set of mapped chunks.
Just some of them can be mapped by more then one thinLV...


> So is or is not the number of uniquely owned/shared blocks known for each 
> volume at any one point in time?


Unless you parse all metadata and create a big data structures for this info,
you do not have this information available.

>> You can use only very small subset of 'metadata' information for
>> individual volumes.
> 
> But I'm still talking about only summary information...

I'm wondering how would you be updating such summery information in case you
have just simple 'fstrim' information.

To update such info - you would need 'backtrace'  ALL the 'released' blocks 
for your fstrimed thin volume - figure out how many OTHER thinLV (snapst) were 
sharing same  blocks - and update all their summary information.

Effectively you again need pretty complex data processing (which is otherwise 
ATM happening at user-space level with current design) to be shifted into kernel.

I'm not saying it cannot be done - surely you can reach the goal (just like 
btrfs) - but it's simply different design requiring to write completely 
different kernel target and all user-land app.

It's not something we can reach with few months of codding...

> However with the appropriate amount of user friendliness what was first only 
> for experts can be simply for more ordinary people ;-).

I assume you overestimate how many people works on the project...
We do the best we can...

> I mean, kuch kuch, if I want some SSD caching in Microsoft Windows, kuch kuch, 
> I right click on a volume in Windows Explorer, select properties, select 
> ReadyBoost tab, click "Reserve complete volume for ReadyBoost", click okay, 
> and I'm done.

Do you think it's fair to compare us with  MS capacity  :)  ??


> It literally takes some 10 seconds to configure SSD caching on such a machine.
> 
> Would probably take me some 2 hours in Linux not just to enter the commands 
> but also to think about how to do it.

It's the open source world...



> So it made no sense to have to "figure this out" on your own. An enterprise 
> will be able to do so yes.
> 
> But why not make it easier...

All which needs to happen is -  someone sits and write the code :)
Nothing else is really needed ;)

Hopefully my time invested into this low-level explanation will motivate 
someone to write something for users....


> Yes again, apologies, but I was basing myself on Kernel 4.4 in Debian 8 with 
> LVM 2.02.111 which, by now, is three years old hahaha.

Well we are at 2.02.174 -  so I'm really mainly interested for complains 
against upstream version of lvm2.

There is not much point in discussing 3 years history...


> If the monitoring script can fail, now you need a monitoring script to monitor 
> the monitoring script ;-).

Maybe you start to see why  'reboot' is not such a bad option...

>> You can always use normal device - it's really about the choice and purpose...
> 
> Well the point is that I never liked BTRFS.

Do not take this as some  'advocating' for usage of btrfs.

But all you are proposing here is mostly 'btrfs' design.

lvm2/dm  is quite different solution with different goals.



> BTRFS has its own set of complexities and people running around and tumbling 
> over each other in figuring out how to use the darn thing. Particularly with 
> regards to the how-to of using subvolumes, of which there seem to be many 
> different strategies.

It's been BTRFS 'solution' how to overcome problems...

> And then Red Hat officially deprecates it for the next release. Hmmmmm.

Red Hat simply can't do everything for everyone...

> Sometimes there is annoying stuff like not being able to change a volume group 
> (name) when a PV is missing, but if you remove the PV how do you put it back 

You may possibly miss the complexity behind those operations.

But we try to keep them at 'reasonable' minimum.

Again please try to 'think' big when you have i.e. hundreds of PVs attached 
over network... used in clusters...

There are surely things, which do look over-complicated when you have just 2 
disks in your laptop.....

But as it has been said - we address issues on 'generic' level...

You have states - and transition between states is defined in some way and 
applies for systems states XYZ....


> I guess certain things are difficult enough that you would really want a book 
> about it, and having to figure it out is fun the first time but after that a 
> chore.

Would be nice if someone would have wrote a book about it ;)

> You mean use a different pool for that one critical volume that can't run out 
> of space.
> 
> This goes against the idea of thin in the first place. Now you have to give up 
> the flexibility that you seek or sought in order to get some safety because 
> you cannot define any constraints within the existing system without 
> separating physically.

Nope - it's still well within.

Imagine you have  a VG with  1TB space,
You create  0.2TB  'userdata'  thin-pool with some thins
and you create 0.2TB  'criticalsystem'  thin-pool with some thins.

Then you orchestrate growth of those 2 thin-pools according to your rules and 
needs -  i.e. always need  0.1TB of free space in VG to get some space for 
system thin-pool.   You may even start to remove 'userdata' thin-pool in case 
you would like to get some space for 'cricticalsystem'  thin-pool

There is NO solution to protect you again running out of system space when are 
overprovissiong.

It always end with having  1TB thin-pool with  2TB volume on it.

You can't fit 2TB into 1TB so at some point in time every overprovisioning is 
going to hit dead-end....


> I get that... building a wall between two houses is easier than having to 
> learn to live together.
> 
> But in the end the walls may also kill you ;-).
> 
> Now you can't share washing machine, you can't share vacuum cleaner, you have 
> to have your own copy of everything, including bath rooms, toilet, etc.
> 
> Even though 90% of the time these things go unused.

When you share - you need to HEAVILY plan for everything.

There is always some price paid.

In many cases it's better to leave your vacuum cleaner unused for 99% of its 
time, just to be sure you can take ANYTIME you need....

You may also drop usage of modern CPUs which are 99% left unused....

So of course it's cheaper to share  - but is it comfortable??
Does it payoff??

Your pick....


> I understand, but does this mean that the NUMBER of free blocks is also always 
> known?

Thin-pool knows how many blocks are 'free'.


> So isn't the NUMBER of used/shared blocks in each DATA volume also known?

It's not known per volume.

All you now  is  - thin-pool has size X and has free Y blocks.
Pool does not know how many thin-devices are there - unless you scan metadata.

All known info is visible with  'dmsetup status'

Status report exposes all known info for thin-pool and for thin volumes.

All is described in kernel documentation for these DM targets.


> What about the 'used space'. Could you, potentially, theoretically, set a 
> threshold for that? Or poll for that?

Clearly used_space is  'whole_space -  free_space'



> IF you could change the device mapper, THEN could it be possible to reserve 
> allocation space for a single volume???

You probably need to start then discussion at more kernel oriented DM list.

> Logically there are only two conditions:
> 
> - virtual free space for critical volume is smaller than its reserved space
> - virtual free space for critical volume is bigger than its reserved space
> 
> If bigger, then all the reserved space is necessary to stay free
> If smaller, then we don't need as much.

You can implement all this logic with existing lvm2 2.02.174.
Scripting gives you all the power to your hands.

> 
> But it probably also doesn't hurt.
> 
> So 40GB virtual volume has 5GB free but reserved space is 10GB.
> 
> Now real reserved space also becomes 5GB.

Please try to stop thinking within your  'margins' and your 'conditions' 
every user/customer has different view - sometimes you simply need to 
'think-big' in TiB or PiB ;)....


> Many things only work if the user follows a certain model of behaviour.
> 
> The whole idea of having a "critical" versus a "non-critical" volume is that 
> you are going to separate the dependencies such that a failure of the 
> "non-critical" volume will not be "critical" ;-).

Already explained few times...

>> With 'reboot' you know where you are -� it's IMHO fair condition for this.
>>
>> With frozen FS and paralyzed system and your 'fsfreeze' operation of
>> unimportant volumes actually has even eaten the space from thin-pool
>> which may possibly been used better to store data for important
>> volumes....
> 
> Fsfreeze would not eat more space than was already eaten.


If you 'fsfreeze' - the filesystem has to be put into consistent state -
so all unwritten 'data' & 'metadata' out of your page-cache has to pushed on 
your disk.

This will cause very hardly 'predictable' amount of provisioning on your 
thin-pool.   You can possibly estimate 'maximum' number....

> If I freeze a volume only used by a webserver... I will only freeze the 
> webserver... not anything else?

Number of system apps are doing  scans over entire system....
Apps are talking to each other and waiting for answers...
Of course lots it 'transiently' freezed apps would be, because other apps are 
not well written for parallel world...

Again - if you have set of constrains - like you have a 'special' volume for 
web server which is ONLY used by web server,  you can a better decision.

In this case it would be likely better to kill 'web server' and umount volume....



> We're going to prevent them from mapping new chunks ;-).
> 

You can't prevent kernel from mapping new chunks....

But you can do ALL in userspace - though ATM you need to possibly use 
'dmsetup' commands....

Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-17  7:10                                     ` Xen
@ 2017-09-18 19:20                                       ` Gionatan Danti
  2017-09-20 13:05                                         ` Xen
  0 siblings, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-18 19:20 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: Xen

Il 17-09-2017 09:10 Xen ha scritto:
> Xen schreef op 17-09-2017 8:31:
> 
> 
> But if it's not active I don't see why it should be critical or why
> you should reserve space for it to be honest...

Xen, I really think that the combination of hard-threshold obtained by 
setting thin_pool_autoextend_threshold and thin_command hook for 
user-defined script should be sufficient to prevent and/or react to full 
thin pools.

I'm all for the "keep it simple" on the kernel side. After all, thinp 
maintain very high performance in spite of its CoW behavior *even when 
approaching pool fullness*, a thing which can not be automatically said 
for advanced in-kernel filesystems as BTRFS (which very low 
random-rewrite performance) and ZFS (I just recently opened a ZoL issue 
for ZVOLs with *much* lower than expected write performance, albeit the 
workaround/correction was trivial in this case).

That said, I would like to see some pre-defined scripts to easily manage 
pool fullness. For example, a script to automatically delete all 
inactive snapshots with "deleteme" or "temporary" flag. Sure, writing 
such a script is trivial for any sysadmin - but I would really like the 
standardisation such predefined scripts imply.

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-18 19:20                                       ` Gionatan Danti
@ 2017-09-20 13:05                                         ` Xen
  2017-09-21  9:49                                           ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-20 13:05 UTC (permalink / raw)
  To: linux-lvm

Gionatan Danti schreef op 18-09-2017 21:20:

> Xen, I really think that the combination of hard-threshold obtained by
> setting thin_pool_autoextend_threshold and thin_command hook for
> user-defined script should be sufficient to prevent and/or react to
> full thin pools.

I will hopefully respond to Zdenek's message later (and the one before 
that that I haven't responded to),

> I'm all for the "keep it simple" on the kernel side.

But I don't mind if you focus on this,

> That said, I would like to see some pre-defined scripts to easily
> manage pool fullness. (...) but I would really
> like the standardisation such predefined scripts imply.

And only provide scripts instead of kernel features.

Again, the reason I am also focussing on the kernel is because:

a) I am not convinced it cannot be done in the kernel
b) A kernel feature would make space reservation very 'standardized'.

Now I'm not convinced I really do want a kernel feature but saying it 
isn't possible I think is false.

The point is that kernel features make it much easier to standardize and 
to put some space reservation metric in userland code (it becomes a 
default feature) and scripts remain a little bit off to the side.

However if we *can* standardize on some tag or way of _reserving_ this 
space, I'm all for it.

I think a 'critical' tag in combination with the standard 
autoextend_threshold (or something similar) is too loose and ill-defined 
and not very meaningful.

In other words you would be abusing one feature for another purpose.

So I do propose a way to tag volumes with a space reservation (turning 
them cricical) or alternatively to configure a percentage of reserved 
space and then merely tag some volumes as critical volumes.

I just want these scripts to be such that you don't really need to 
modify them.

In other words: values configured elsewhere.

If you think that should be the thin_pool_autoextend_threshold, fine, 
but I really think it should be configured elsewhere (because you are 
not using it for autoextending in this case).

thin_command is run every 5%:

https://www.mankier.com/8/dmeventd

You will need to configure a value to check against.

This is either going to be a single, manually configured, fixed value 
(in % or extents)

Or it can be calculated based on reserved space of individual volumes.

So if you are going to have a kind of "fsfreeze" script based on 
critical volumes vs. non-critical volumes I'm just saying it would be 
preferable to set the threshold at which to take action in another way 
than by using the autoextend_threshold for that.

And I would prefer to set individual space reservation for each volume 
even if it can only be compared to 5% threshold values.

So again: if you want to focus on scripts, fine.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-20 13:05                                         ` Xen
@ 2017-09-21  9:49                                           ` Zdenek Kabelac
  2017-09-21 10:22                                             ` Xen
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-21  9:49 UTC (permalink / raw)
  To: LVM general discussion and development, Xen

Dne 20.9.2017 v 15:05 Xen napsal(a):
> Gionatan Danti schreef op 18-09-2017 21:20:
> 
>> Xen, I really think that the combination of hard-threshold obtained by
>> setting thin_pool_autoextend_threshold and thin_command hook for
>> user-defined script should be sufficient to prevent and/or react to
>> full thin pools.
> 
> I will hopefully respond to Zdenek's message later (and the one before that 
> that I haven't responded to),
> 
>> I'm all for the "keep it simple" on the kernel side.
> 
> But I don't mind if you focus on this,
> 
>> That said, I would like to see some pre-defined scripts to easily
>> manage pool fullness. (...) but I would really
>> like the standardisation such predefined scripts imply.
> 
> And only provide scripts instead of kernel features.
> 
> Again, the reason I am also focussing on the kernel is because:
> 
> a) I am not convinced it cannot be done in the kernel
> b) A kernel feature would make space reservation very 'standardized'.


Hi

Some more 'light' into the existing state as this is really not about what can 
and what cannot be done in kernel - as clearly you can do 'everything' in 
kernel - if you have the code for it...

I'm here explaining position of lvm2 - which is user-space project (since we 
are on lvm2 list) - and lvm2 is using  'existing'  dm  kernel target which 
provides  thin-provisioning (and has it's configurables). So this is kernel 
piece and differs from user-space lvm2 counterpart.

Surely there is cooperation between these two - but anyone else can write some 
other 'dm'  target - and lvm2 can extend support for given  target/segment 
type if such target is used by users.

In practice your 'proposal' is quite different from the existing target - 
essentially major rework if not a whole new re-implementation  - as it's not 
'a few line' patch extension  which you might possibly believe/hope into.

I can (and effectively I've already spent a lot of time) explaining the 
existing logic and why it is really hardly doable with current design, but we 
cannot work on support for 'hypothetical' non-existing kernel target from lvm2 
side - so you need to start from 'ground-zero' level on dm target design....
or you need to 'reevaluate' your vision to be more in touch with existing 
kernel target output...

However we believe our exiting solution in 'user-space' can cover most common 
use-cases and we might just have 'big-holes' in providing better documentation 
to explain reasoning and guide users to use existing technology in more 
optimal way.

> 
> The point is that kernel features make it much easier to standardize and to 
> put some space reservation metric in userland code (it becomes a default 
> feature) and scripts remain a little bit off to the side.

Maintenance/devel/support of kernel code is more expensive - it's usually very 
easy to upgrade small 'user-space' encapsulated package - compared with major 
changes on kernel side.

So that's where dm/lvm2 design gets from - do the 'minimum necessary' inside 
kernel and  maximize usage of user-space.

Of course this decision makes some tasks harder (i.e. there are surely 
problems which would not even exist if it would be done in kernel)  - but lots 
of other things are way easier - you really can't compare those....

Yeah - standards are always problem :)  i.e. Xorg & Wayland....
but it's way better to play with user-space then playing with kernel....

> However if we *can* standardize on some tag or way of _reserving_ this space, 
> I'm all for it.

Problems of a desktop user with 0.5TB SSD are often different with servers 
using 10PB across multiple network-connected nodes.

I see you call for one standard - but it's very very difficult...


> I think a 'critical' tag in combination with the standard autoextend_threshold 
> (or something similar) is too loose and ill-defined and not very meaningful.

We look for delivering admins rock-solid bricks.

If you make small house or you build a Southfork out of it is then admins' choice.

We have spend really lot of time thinking if there is some sort of 
'one-ring-to-rule-them-all' solution - but we can't see it yet - possibly 
because we know wider range of use-cases compared with individual user-focused 
problem.

> And I would prefer to set individual space reservation for each volume even if 
> it can only be compared to 5% threshold values.

Which needs 'different' kernel target driver (and possibly some way to 
kill/split page-cache to work on 'per-device' basis....)

And just as an illustration of problems you need to start solving for this design:

You have origin and 2 snaps.
You set different 'thresholds' for these volumes  -
You then overwrite 'origin'  and you have to maintain 'data' for OTHER LVs.
So you get into the position - when 'WRITE' to origin will invalidate volume 
that is NOT even active (without lvm2 being even aware).
So suddenly rather simple individual thinLV targets  will have to maintain 
whole 'data set' and cooperate with all other active thins targets in case 
they share some data.... - so in effect WHOLE data tree needs to be 
permanently accessed -  this could be OK when you focus for use of 3 volumes
with at most couple hundreds GiB of addressable space - but does not 'fit' 
well for 1000LVs and PB of addressable data.


Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-21  9:49                                           ` Zdenek Kabelac
@ 2017-09-21 10:22                                             ` Xen
  2017-09-21 13:02                                               ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-21 10:22 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: LVM general discussion and development

Hi,

thank you for your response once more.

Zdenek Kabelac schreef op 21-09-2017 11:49:

> Hi
> 
> Some more 'light' into the existing state as this is really not about
> what can and what cannot be done in kernel - as clearly you can do
> 'everything' in kernel - if you have the code for it...

Well thank you for that ;-).

> In practice your 'proposal' is quite different from the existing
> target - essentially major rework if not a whole new re-implementation
>  - as it's not 'a few line' patch extension  which you might possibly
> believe/hope into.

Well I understand that the solution I would be after would require 
modification to the DM target. I was not arguing for LVM alone; I 
assumed that since DM and LVM are both hosted in the same space there 
would be at least the idea of cooperation between the two teams.

And that it would not be too 'radical' to talk about both at the same 
time.

> Of course this decision makes some tasks harder (i.e. there are surely
> problems which would not even exist if it would be done in kernel)  -
> but lots of other things are way easier - you really can't compare
> those....

I understand. But many times lack of integration of shared goal of 
multiple projects is also big problem in Linux.

>> However if we *can* standardize on some tag or way of _reserving_ this 
>> space, I'm all for it.
> 
> Problems of a desktop user with 0.5TB SSD are often different with
> servers using 10PB across multiple network-connected nodes.
> 
> I see you call for one standard - but it's very very difficult...

I am pretty sure that if you start out with something simple, it can 
extend into the complex.

That's of course why an elementary kernel feature would make sense.

A single number. It does not get simpler than that.

I am not saying you have to.

I was trying to find out if your statements that something was 
impossible, was actually true.

You said that you need a completely new DM target from the ground up. I 
doubt that. But hey, you're the expert, not me.

I like that you say that you could provide an alternative to the regular 
DM target and that LVM could work with that too.

Unfortunately I am incapable of doing any development myself at this 
time (sounds like fun right) and I also of course could not myself test 
20 PB.

>> I think a 'critical' tag in combination with the standard 
>> autoextend_threshold (or something similar) is too loose and 
>> ill-defined and not very meaningful.
> 
> We look for delivering admins rock-solid bricks.
> 
> If you make small house or you build a Southfork out of it is then
> admins' choice.
> 
> We have spend really lot of time thinking if there is some sort of
> 'one-ring-to-rule-them-all' solution - but we can't see it yet -
> possibly because we know wider range of use-cases compared with
> individual user-focused problem.

I think you have to start simple.

You can never come up with a solution if you start out with the complex.

The only thing I ever said was:
- give each volume a number of extents or a percentage of reserved space 
if needed
- for all the active volumes in the thin pool, add up these numbers
- when other volumes require allocation, check against free extents in 
the pool
- possibly deny allocation for these volumes

I am not saying here you MUST do anything like this.

But as you say, it requires features in the kernel that are not there.

I did not know or did not realize the upgrade paths of the DM module(s) 
and LVM2 itself would be so divergent.

So my apologies for that but obviously I was talking about a full-system 
solution (not partial).

>> And I would prefer to set individual space reservation for each volume 
>> even if it can only be compared to 5% threshold values.
> 
> Which needs 'different' kernel target driver (and possibly some way to
> kill/split page-cache to work on 'per-device' basis....)

No no, here I meant to set it by a script or to read it by a script or 
to use it by a script.

> And just as an illustration of problems you need to start solving for
> this design:
> 
> You have origin and 2 snaps.
> You set different 'thresholds' for these volumes  -

I would not allow setting threshold for snapshots.

I understand that for dm thin target they are all the same.

But for this model it does not make sense because LVM talks of "origin" 
and "snapshots".

> You then overwrite 'origin'  and you have to maintain 'data' for OTHER 
> LVs.

I don't understand. Other LVs == 2 snaps?

> So you get into the position - when 'WRITE' to origin will invalidate
> volume that is NOT even active (without lvm2 being even aware).

I would not allow space reservation for inactive volumes.

Any space reservation is meant for safeguarding the operation of a 
machine.

Thus it is meant for active volumes.

> So suddenly rather simple individual thinLV targets  will have to
> maintain whole 'data set' and cooperate with all other active thins
> targets in case they share some data

I don't know what data sharing has to do with it.

The entire system only works with unallocated extents.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-21 10:22                                             ` Xen
@ 2017-09-21 13:02                                               ` Zdenek Kabelac
  2017-09-21 14:34                                                 ` [linux-lvm] Clarification (and limitation) of the kernel feature I proposed Xen
  2017-09-21 14:49                                                 ` [linux-lvm] Reserve space for specific thin logical volumes Xen
  0 siblings, 2 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-21 13:02 UTC (permalink / raw)
  To: Xen; +Cc: LVM general discussion and development

Dne 21.9.2017 v 12:22 Xen napsal(a):
> Hi,
> 
> thank you for your response once more.
> 
> Zdenek Kabelac schreef op 21-09-2017 11:49:
> 
>> Hi
>>
>> Of course this decision makes some tasks harder (i.e. there are surely
>> problems which would not even exist if it would be done in kernel)� -
>> but lots of other things are way easier - you really can't compare
>> those....
> 
> I understand. But many times lack of integration of shared goal of multiple 
> projects is also big problem in Linux.

And you also have project that do try to integrate shared goals like btrfs.


>>> However if we *can* standardize on some tag or way of _reserving_ this 
>>> space, I'm all for it.
>>
>> Problems of a desktop user with 0.5TB SSD are often different with
>> servers using 10PB across multiple network-connected nodes.
>>
>> I see you call for one standard - but it's very very difficult...
> 
> I am pretty sure that if you start out with something simple, it can extend 
> into the complex.

We hope community will provide some individual scripts...
Not a big deal to integrate them into repo dir...

>> We have spend really lot of time thinking if there is some sort of
>> 'one-ring-to-rule-them-all' solution - but we can't see it yet -
>> possibly because we know wider range of use-cases compared with
>> individual user-focused problem.
> 
> I think you have to start simple.

It's mostly about what can be supported 'globally'
and what is rather 'individual' customization.


> You can never come up with a solution if you start out with the complex.
> 
> The only thing I ever said was:
> - give each volume a number of extents or a percentage of reserved space if 
> needed

Which can't be deliver with current thinp technology.
It's simply too computational invasive for our targeted performance.

The only deliverable we have is - you create a 'cron' job that does hard 
'computing' once in a while - and makes some 'action' when individual 
'volumes' goes out of their preconfigured boundaries.  (often such logic is 
implemented outside of lvm2 - in some DB engine - since   lvm2 itself is 
really NOT a high performing DB - the ascii format has it's age....)

You can't get this 'percentage' logic online in kernel (aka while you update 
individual volume).


> - for all the active volumes in the thin pool, add up these numbers
> - when other volumes require allocation, check against free extents in the pool

I assume you possibly missed this logic of thin-p:

When you update origin - you always allocate FOR origin, but allocated chunk
remains claimed by snapshots (if there are any).

So if snapshot shared all pages with the origin at the beginning (so basically 
consumed only some 'metadata' space and 0% real exclusive own space)  - after 
full rewrite of the origin your snapshot suddenly 'holds' all the old chunks 
(100% of its size)

So when you 'write' to ORIGIN - your snapshot which becomes bigger in terms of 
individual/exclusively owned chunks - so if you have i.e. configured snapshot 
to not  consume more then  XX% of your pool - you would simply need to recalc 
this with every update on shared chunks....

And as has been already said - this is currently unsupportable 'online'

Another aspect here is - thin-pool has  no idea about 'history' of volume 
creation - it doesn't not know  there is volume X being snapshot of volume Y - 
this all is only 'remembered' by lvm2 metadata  -  in kernel - it's always 
like  -  volume X  owns set of chunks  1...
That's all kernel needs to know for a single thin volume to work.

You can do it with 'reasonable' delay in user-space upon 'triggers' of global 
threshold  (thin-pool fullness).

> - possibly deny allocation for these volumes


Unsupportable in 'kernel' without rewrite and you can i.e. 'workaround' this 
by placing 'error' targets in place of less important thinLVs...

Imagine you would get pretty random 'denials' of your WRITE request depending 
on interaction with other snapshots....


Surely if use 'read-only' snapshot you may not see all related problems, but 
such a very minor subclass of whole provisioning solution is not worth a 
special handling of whole thin-p target.



> I did not know or did not realize the upgrade paths of the DM module(s) and 
> LVM2 itself would be so divergent.

lvm2 is  volume manager...

dm is implementation layer for different 'segtypes' (in lvm2 terminology).

So i.e. anyone can write his own 'volume manager'  and use 'dm'  - it's fully 
supported - dm is not tied to lvm2 and is openly designed  (and used by other 
projects)....


> So my apologies for that but obviously I was talking about a full-system 
> solution (not partial).

yep - 2 different worlds....

i.e. crypto, multipath,...



>> You have origin and 2 snaps.
>> You set different 'thresholds' for these volumes� -
> 
> I would not allow setting threshold for snapshots.
> 
> I understand that for dm thin target they are all the same.
> 
> But for this model it does not make sense because LVM talks of "origin" and 
> "snapshots".
> 
>> You then overwrite 'origin'� and you have to maintain 'data' for OTHER LVs.
> 
> I don't understand. Other LVs == 2 snaps?

yes - other LVs are snaps in this example...


> 
>> So you get into the position - when 'WRITE' to origin will invalidate
>> volume that is NOT even active (without lvm2 being even aware).
> 
> I would not allow space reservation for inactive volumes.

You are not 'reserving' any space as the space already IS assigned to those 
inactive volumes.

What you would have to implement is to TAKE the space FROM them to satisfy 
writing task to your 'active' volume and respect prioritization...

If you will not implement this 'active' chunk 'stealing' - you are really ONLY 
shifting 'hit-the-wall' time-frame....  (worth possibly couple seconds only of 
your system load)...

In other words - tuning 'thresholds' in userspace's 'bash' script will give 
you very same effect as if you are focusing here on very complex 'kernel' 
solution.


Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [linux-lvm] Clarification (and limitation) of the kernel feature I proposed
  2017-09-21 13:02                                               ` Zdenek Kabelac
@ 2017-09-21 14:34                                                 ` Xen
  2017-09-21 14:49                                                 ` [linux-lvm] Reserve space for specific thin logical volumes Xen
  1 sibling, 0 replies; 91+ messages in thread
From: Xen @ 2017-09-21 14:34 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: LVM general discussion and development

Zdenek, you will believe in below email that I am advocating for max 
snapshot size.

I was not.

Only kernel feature I was suggesting was making judgement about when or 
how to refuse allocation for new chunks. Nothing else. Not based on 
consumed space, unique space consumed by volumes or snapshots.

Based only on FREE SPACE metric, not USED SPACE metric (which can be 
more complex).

When you say that freezing allocation has same effect as error target, 
you could be correct.

I will not respond to individual remarks but rewrite here below as 
summary:

- call collection of all critical volumes C.

- call C1 and C2 members of C.

- each Ci ∈ C has a number FE(Ci) for the number of free (unallocated) 
extents of that volume
- each Ci ∈ C has a fixed number RE(Ci) for the number of reserved 
extents.

Observe that FE(Ci) may be smaller than RE(Ci). E.g. a volume may have 
1000 reserved extents (RE(Ci)) but 500 free extents (FE(Ci)) at which 
point it has more reserved extents than it can use.

Therefore in our calculations we use the smaller of those two numbers 
for the effective reserved extents (ERE(Ci)).

ERE(Ci) = min( FE(Ci), RE(Ci) )

Now the total number of effective reserved extents of the pool is the 
total effective number of reserved extents of collection C.

ERE(POOL) = ERE(C) = ∑ ERE(Ci)

This number is dependent on the live number of free extents of each 
critical volume Ci.

Now the critical equation that each time will be evaluated when a chunk 
is being requested for allocation, is:

ERE(POOL) < FE(POOL)

As long as the Effective Reserved Extents of the entire pool is smaller 
than the number of Free Extents in the entire pool, nothing is the 
matter.

However, when

ERE(POOL) >= FE(POOL) we enter a critical 'fullness' situation.

This may be likened to a 95% threshold.

At this point you will start 'randomly' denying not only write requests 
for regular volumes, or writeable snapshots, but also regular read-only 
snapshots can see their allocation requests (for CoW) denied.

This would of course immediately invalidate those snapshots, if the 
write request was caused by a critical volume (Ci) that is still being 
serviced.

If you say this is not much different from replacing the volumes by 
error targets, I would agree.

As long as pool fullness is maintained, this 'denial of service' is not 
really random but consistent. However if something was done to e.g. drop 
a snapshot, and space would be freed, then writes would continue after.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-21 13:02                                               ` Zdenek Kabelac
  2017-09-21 14:34                                                 ` [linux-lvm] Clarification (and limitation) of the kernel feature I proposed Xen
@ 2017-09-21 14:49                                                 ` Xen
  2017-09-21 20:32                                                   ` Zdenek Kabelac
  1 sibling, 1 reply; 91+ messages in thread
From: Xen @ 2017-09-21 14:49 UTC (permalink / raw)
  To: linux-lvm

Instead of responding here individually I just sought to clarify in my 
other email that I did not intend to mean by "kernel feature" any form 
of max snapshot constraint mechanism.

At least nothing that would depend on size of snapshots.


Zdenek Kabelac schreef op 21-09-2017 15:02:

> And you also have project that do try to integrate shared goals like 
> btrfs.

Without using disjunct components.

So they solve human problem (coordination) with technical solution (no 
more component-based design).

> We hope community will provide some individual scripts...
> Not a big deal to integrate them into repo dir...

We were trying to identify common cases so that LVM team can write those 
skeletons for us.

> It's mostly about what can be supported 'globally'
> and what is rather 'individual' customization.

There are people going to be interested in common solution even if it's 
not everyone all at the same time.

> Which can't be deliver with current thinp technology.
> It's simply too computational invasive for our targeted performance.

You misunderstood my intent.

> I assume you possibly missed this logic of thin-p:

> So when you 'write' to ORIGIN - your snapshot which becomes bigger in
> terms of individual/exclusively owned chunks - so if you have i.e.
> configured snapshot to not  consume more then  XX% of your pool - you
> would simply need to recalc this with every update on shared
> chunks....

I knew this. But we do not depend for the calculations on CONSUMED SPACE 
(and its character/distribution) but only on FREE SPACE.

> And as has been already said - this is currently unsupportable 'online'

And unnecessary for the idea I was proposing.

Look, I am just trying to get the idea across correctly.

> Another aspect here is - thin-pool has  no idea about 'history' of
> volume creation - it doesn't not know  there is volume X being
> snapshot of volume Y - this all is only 'remembered' by lvm2 metadata
> -  in kernel - it's always like  -  volume X  owns set of chunks  1...
> That's all kernel needs to know for a single thin volume to work.

I know this.

However you would need LVM2 to make sure that only origin volumes are 
marked as critical.

> Unsupportable in 'kernel' without rewrite and you can i.e.
> 'workaround' this by placing 'error' targets in place of less
> important thinLVs...

I actually think that if I knew how to do multithreading in the kernel, 
I could have the solution in place in a day...

If I were in the position to do any such work to begin with... :(.

But you are correct that error target is almost the same thing.

> Imagine you would get pretty random 'denials' of your WRITE request
> depending on interaction with other snapshots....

All non-critical volumes would get write requests denied, including 
snapshots (even read-only ones).

> Surely if use 'read-only' snapshot you may not see all related
> problems, but such a very minor subclass of whole provisioning
> solution is not worth a special handling of whole thin-p target.

Read-only snapshots would also die en masse ;-).


> You are not 'reserving' any space as the space already IS assigned to
> those inactive volumes.

Space consumed by inactive volumes is calculated into FREE EXTENTS for 
the ENTIRE POOL.

We need no other data for the above solution.

> What you would have to implement is to TAKE the space FROM them to
> satisfy writing task to your 'active' volume and respect
> prioritization...

Not necessary. Reserved space is a metric, not a real thing.

Reserved space by definition is a part of unallocated space.

> If you will not implement this 'active' chunk 'stealing' - you are
> really ONLY shifting 'hit-the-wall' time-frame....  (worth possibly
> couple seconds only of your system load)...

Irrelevant. Of course we are employing a measure at 95% full that will 
be like error targets replacing all non-critical volumes.

Of course if total mayhem ensues we will still be in trouble.

The idea is that if this total mayhem originates from non-critical 
volumes, the critical ones will be unaffected (apart from their 
snapshots).

You could flag snapshots of critical volumes also as critical and then 
not reserve any space for them so you would have a combined space 
reservation.

Then snapshots for critical volumes would live longer.

Again, no consumption metric required. Only free space metrics.

> In other words - tuning 'thresholds' in userspace's 'bash' script will
> give you very same effect as if you are focusing here on very complex
> 'kernel' solution.

It's just not very complex.

You thought I wanted space consumption metric for all volumes including 
snapshots and then invididual attribution of all consumed space.

Not necessary.

The only thing I proposed used negative space (free space).

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-21 14:49                                                 ` [linux-lvm] Reserve space for specific thin logical volumes Xen
@ 2017-09-21 20:32                                                   ` Zdenek Kabelac
  0 siblings, 0 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-21 20:32 UTC (permalink / raw)
  To: LVM general discussion and development, Xen

Dne 21.9.2017 v 16:49 Xen napsal(a):
> 
> However you would need LVM2 to make sure that only origin volumes are marked 
> as critical.

'dmeventd' executed binary - which can be a simple bash script called at 
threshold level can be tuned to various naming logic.

So far there is no plan to enforce 'naming' or 'tagging' since from user base 
feedback we can see numerous ways how to deal with large volume naming 
strategies often made by external tools/databases -  so enforcing i.e. 
specific tag would require changes in larger systems - so when it's compared 
with rather simple tuning script of bash script...


> I actually think that if I knew how to do multithreading in the kernel, I 
> could have the solution in place in a day...
> 
> If I were in the position to do any such work to begin with... :(.
> 
> But you are correct that error target is almost the same thing.

It's the most 'safest'  - avoids any sort of further possibly damaging of 
filesystem.

Note - typical 'fs' may remount 'ro' at reasonable threshold time - the 
precise points depends on workload. If you have 'PB' arrays - surely leaving 
5% of free space is rather huge, if you work with GB on fast operation SSD - 
taking action at 70% might be better.

If anytime 'during' write  users hits 'full pool' - there is currently no 
other way then to stop using FS  - there are numerous way -

You can replace device with 'error'
You can replace device with 'delay' that splits reads to thin and writes to error

There is just not any way-back - FS should be checked  (i.e. full FS could be 
'restored' by deleting some files, but in the full thin-pool case  'FS' needs 
to get consistent first - so focusing on solving full-pool is like preparing 
for missed battle - focus should go into ensuring you not hit full pool and on 
the 'sad' occasion of 100% full pool - the worst case scenario is not all the 
bad - surely way better then 4 year old experience with old kernel and old 
lvm2....

>> What you would have to implement is to TAKE the space FROM them to
>> satisfy writing task to your 'active' volume and respect
>> prioritization...
> 
> Not necessary. Reserved space is a metric, not a real thing.
> 
> Reserved space by definition is a part of unallocated space.

How is this different from having VG with 1TB where you allocate for your 
thin-pool only i.e. 90% for thin-pool and you have 10% of free space for 
'extension' of thin-pool for your 'critical' moment.

I'm still not seeing any difference - except you would need to invest lot of 
energy into handling of this 'reserved' space inside kernel.

With actual versions of lvm2 you can handle these tasks at user-space and 
quite early before you reach 'real' out-of-space condition.

>> In other words - tuning 'thresholds' in userspace's 'bash' script will
>> give you very same effect as if you are focusing here on very complex
>> 'kernel' solution.
> 
> It's just not very complex.
> 
> You thought I wanted space consumption metric for all volumes including 
> snapshots and then invididual attribution of all consumed space.


Maybe you can try existing proposed solutions first and showing 'weak' points 
which are not solvable by it ?

We all agree we could not store 10G thin volume into 1G thin-pool - so there 
will be always the case of having  'full pool'.

Either you could handle reserves of 'early' remount-ro  or you keep some 
'spare' LV/space in VG you attach to thin-pool 'when' needed...
Having such 'great' level of free choice here is IMHO big advantage as it's 
always 'admin' to decide how to use available space in the best way -  instead 
of keeping 'reserves' somewhere hidden in kernel....

Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-15  8:15 ` matthew patton
  2017-09-15 10:01   ` Zdenek Kabelac
@ 2017-09-15 18:35   ` Xen
  1 sibling, 0 replies; 91+ messages in thread
From: Xen @ 2017-09-15 18:35 UTC (permalink / raw)
  To: linux-lvm

matthew patton schreef op 15-09-2017 10:15:
>>  From the two proposed solutions (lvremove vs lverror), I think I 
>> would  prefer the second one.
> 
> I vote the other way. :)

Unless you were only talking about lvremoving snapshots this is hugely 
irresponsible.

You are throwing away a data collection?

> First because 'remove' maps directly to the DM equivalent action which
> brought this about.

That would imply you are only talking about snapshots, ?, even if 
dmsetup only creates a mapping, throwing away that mapping generally 
throws away data.

> With respect to freezing or otherwise stopping further I/O to LV being
> used by virtual machines, the only correct/sane solution is one of
> 'power off' or 'suspend'. Reaching into the VM to freeze
> individual/all filesystems but otherwise leave the VM running assumes
> significant knowledge of the VM's internals and the luxury of time.

That is simply user centric action that will in this case differ because 
a higher level function is better at cleanly killing something than a 
lower level function.

Same as that fsfreeze would probably be better in that sense than 
lverror.

So in that sense, yes of course, I agree, that would be logical.

If you can do it clean, you don't have to do it rough.

In the same way you could have user centric action shut down webservers 
and so on. By user I mean administrator.

But for LVM team you can only provide mechanic that decides when action 
should be taken, and some default actions like possibly lverror and 
fsfreeze. You cannot decide what higher level interventions would be 
needed everywhere.

Of course clean shutdown would be ideal. This depends on system 
administrator.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-15  8:15 ` matthew patton
@ 2017-09-15 10:01   ` Zdenek Kabelac
  2017-09-15 18:35   ` Xen
  1 sibling, 0 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-15 10:01 UTC (permalink / raw)
  To: matthew patton, LVM general discussion and development

Dne 15.9.2017 v 10:15 matthew patton napsal(a):
> 
>>   From the two proposed solutions (lvremove vs lverror), I think I would  prefer the second one.
> 
> I vote the other way. :)
> First because 'remove' maps directly to the DM equivalent action which brought this about. Second because you are in fact deleting the object - ie it's not coming back. That it returns a nice and timely error code up the stack instead of the kernel doing 'wierd things' is an implementation detail.
> 

It's not that easy.

lvm2  cannot just 'lose' the volume which is still mapped IN table (even if it 
will an error segment)

So the result of operation will be  some 'LV' in the lvm2 metadata.
which could be possibly flagged for 'automatic' removal later once it's no 
longer hold in use.

There could be seen 'some' similarity between snapshot marge -  where lvm2 
also maintains some 'fictional' volumes internally..

So 'lvm2' could possibly 'mask' device as 'removed'   - or it can keep it 
remapped to error target - which could be possibly usable for other things.


> Not to say 'lverror' might have a use of it's own as a "mark this device as in an error state and return EIO on every OP". Which implies you could later remove the flag and IO could resume subject to the higher levels not having already wigged out in some fashion. However why not change the behavior of 'lvchange -n' to do that on it's own on a previously activated entry that still has a ref count > 0? With '--force' of course

'lvrerror' can be also used for 'lvchange -an' - so not such 'lvremoval' and 
it could be used for other volumes (not just things) possibly -

so you get and lvm2 mapping of  'dmsetup wipe_table'

('lverror'  would be actually something like   'lvconvert --replacewitherror' 
-  likely we would not add a new 'extra' command for this conversion)

> 
> With respect to freezing or otherwise stopping further I/O to LV being used by virtual machines, the only correct/sane solution is one of 'power off' or 'suspend'. Reaching into the VM to freeze individual/all filesystems but otherwise leave the VM running assumes significant knowledge of the VM's internals and the luxury of time.


And 'suspend' can be dropped from this list ;) as so far  lvm2 is seeing a 
device left in suspend after command execution as a serious internal error,
and there is long list of good reasons for not leaking suspend devices.

Suspend is designed as short-living  'state' of device  - it's not meant to be 
held suspend for undefined amount of time - it cause lots of troubles to 
various /dev scanning softwares (lvm2 included....) - and as such it's has 
racy built-in :)


Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
       [not found] <914479528.2618507.1505463313888.ref@mail.yahoo.com>
@ 2017-09-15  8:15 ` matthew patton
  2017-09-15 10:01   ` Zdenek Kabelac
  2017-09-15 18:35   ` Xen
  0 siblings, 2 replies; 91+ messages in thread
From: matthew patton @ 2017-09-15  8:15 UTC (permalink / raw)
  To: LVM general discussion and development


>  From the two proposed solutions (lvremove vs lverror), I think I would  prefer the second one.

I vote the other way. :)
First because 'remove' maps directly to the DM equivalent action which brought this about. Second because you are in fact deleting the object - ie it's not coming back. That it returns a nice and timely error code up the stack instead of the kernel doing 'wierd things' is an implementation detail.

Not to say 'lverror' might have a use of it's own as a "mark this device as in an error state and return EIO on every OP". Which implies you could later remove the flag and IO could resume subject to the higher levels not having already wigged out in some fashion. However why not change the behavior of 'lvchange -n' to do that on it's own on a previously activated entry that still has a ref count > 0? With '--force' of course

With respect to freezing or otherwise stopping further I/O to LV being used by virtual machines, the only correct/sane solution is one of 'power off' or 'suspend'. Reaching into the VM to freeze individual/all filesystems but otherwise leave the VM running assumes significant knowledge of the VM's internals and the luxury of time.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
       [not found] <498090067.1559855.1505338608244.ref@mail.yahoo.com>
@ 2017-09-13 21:36 ` matthew patton
  0 siblings, 0 replies; 91+ messages in thread
From: matthew patton @ 2017-09-13 21:36 UTC (permalink / raw)
  To: LVM general discussion and development

Xen wrote:

> The issue with scripts is that they feel rather vulnerable to  corruption, not being there etc.

Only you are responsible for making sure scripts are available and correctly written.

>  So in that case I suppose that you would want some default, shipped  scripts that come with LVM as
> example for default behaviour and that are  also activated by default?
...
> /usr/share/lvm/scripts/

Heck no to activation. The only path that's correct is that last one. The previously supplied example code should have been more than enough for you to venture out on your own and write custom logic.
 
> Then not even a threshold value needs to be configured.

Nobody else wants your particular logic, run interval, or thresholds. Write your scripts to suck in /etc/sysconfig/lvm/<vgname> or whatever the distro of your choice puts such things.
 
> Yes. One obvious scenario is root on thin.
> It's pretty mandatory for root on thin.

Can you elaborate with an example? Because that's the most dangerous one not to have space fully reserved unless you've established other means to ensure nothing writes to that volume or the writes are very, very well defined. ie. 'yum update' is disabled, nobody has root, the filesystem is mounted RO, etc.
 
>  You cannot set max size for thin snapshots?

And you want to do that to 'root' volumes?!?!
 
> you cannot calculate in advance what can  happen,
> because by design, mayhem should not ensue, but what if your
> predictions are off?

Simple. You don't do stupid things like NOT reserving 100% of space in a thinLV for all root volumes. You buy as many hard drives as necessary so you don't get burned.

> Being able to set a maximum snapshot size before it gets dropped could  be very nice.

Write your own script that queries all volumes, and destroys those that are beyond your high-water mark unless optionally they are "special".

> When free space on thin pool drops below ~120MB

At best your user-space program runs each minute and writing 120MB takes a couple seconds and thus between runs of your 'monitor' you've blown past any chance of taking action.

8TB drives are $250 bucks. buy disk. and buy more disk already and quadruple your notion of reserves. If your workload isn't critical then nobody cares if you blow it sky high and you can do silly things like shaving too close. But I have to ask, to what possible purpose?

> I want the 20GB volume  and the 10GB volumes to be frozen

It takes time for a script log into each KVM domain and issue an fsfreeze or even to just suspend the VM from the hypervisor. Meanwhile writers are potentially writing at several hundred MB per second. You're looking at a massive torrent of write errors.

> much deleted

Sounds like you got all the bits figured out. Write the script and post it to GitHub or PasteBin.


> so that the burden on the administrator is very minimal.

No sysadmin worth a damn is going to not spend a LOT of time thinking whether this sort of thing is even rational, and if so, where they want to draw the line. This sort of behavior doesn't suffer fools gladly nor is it appropriate for people to attempt who don't first know what they are doing. Some parts of Linux/Unix are Experts-Only for a reason.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-13  8:44         ` Zdenek Kabelac
@ 2017-09-13 10:46           ` Gionatan Danti
  0 siblings, 0 replies; 91+ messages in thread
From: Gionatan Danti @ 2017-09-13 10:46 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: matthew, LVM general discussion and development

Il 13-09-2017 10:44 Zdenek Kabelac ha scritto:
> Forcible remove (with some reasonable locking - so i.e. 2 processes
> are not playing with same device :)   'dmsetup remove --force' - is
> replacing
> existing device with 'error' target (with built-in noflush)
> 
> Anyway - if you see a reproducible problem with forcible removal - it
> needs to be reported as this is a real bug then and BZ shall be
> opened...
> 
> Regards
> 
> Zdenek

Ok, I'll do some more test and in case problems arise I'll open the BZ 
:)
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-13  8:28       ` Gionatan Danti
@ 2017-09-13  8:44         ` Zdenek Kabelac
  2017-09-13 10:46           ` Gionatan Danti
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-13  8:44 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: LVM general discussion and development

Dne 13.9.2017 v 10:28 Gionatan Danti napsal(a):
> Il 13-09-2017 10:15 Zdenek Kabelac ha scritto:
>> Ohh this is pretty major constrain ;)
> 
>
>> I can well imagine� LVM will let you forcible� replace such LV with
>> error target� - so instead of� thinLV� - you will have� single 'error'
>> target snapshot - which could be possibly even� auto-cleaned once the
>> volume use-count drops bellow 0� (lvmpolld/dmeventd monitoring
>> whatever...)
>>
>> (Of course - we are not solving what happens to application
>> using/running out of such error target - hopefully something not
>> completely bad....)
>>
>> This way - you get very 'powerful' weapon to be used in those 'scriplets'
>> so you can drop uneeded volumes ANYTIME you need to and reclaim its 
>> resources...
>>
> 
> This would be *really* great. I played with dm-setup remove/error target and, 
> while working, it often freezed LVM.
> An integrated forced volume removal/swith to error target would be great.
> 


Forcible remove (with some reasonable locking - so i.e. 2 processes are not 
playing with same device :)   'dmsetup remove --force' - is replacing
existing device with 'error' target (with built-in noflush)

Anyway - if you see a reproducible problem with forcible removal - it needs to 
be reported as this is a real bug then and BZ shall be opened...

Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-13  8:15     ` Zdenek Kabelac
@ 2017-09-13  8:28       ` Gionatan Danti
  2017-09-13  8:44         ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-13  8:28 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: matthew, LVM general discussion and development

Il 13-09-2017 10:15 Zdenek Kabelac ha scritto:
> Ohh this is pretty major constrain ;)

Sure :p
Sorry for not explicitly stating that before.

> 
> But as pointed out multiple times - with scripting around various
> fullness moments of thin-pool - several different actions can be
> programmed around,
> starting from fstrim, ending with plain erase of unneeded snapshot.
> (Maybe erasing unneeded files....)
> 
> To get most secure application - such app should actually avoid using
> page-cache (using direct-io)  in such case you are always guaranteed
> to get exact error at the exact time (i.e. even without journaled
> mounting option for ext4....)
> 
> 

True, but pagecache exists for a reason. Anyway, this is not anything 
you can "fix" in device mapper/lvm, I 100% agree with that.

> Partially this might get solved in 'some' cases with fully provisioned
> thinLVs within thin-pool...
> 
> What comes to my mind as possible supporting solution is -
> adding possible enhancement on LVM2 side could be  'forcible' removal
> of running volumes  (aka lvm2 equivalent  of 'dmsetup remove --force')
> 
> ATM lvm2 prevents you to remove 'running/mounted' volumes.
> 
> I can well imagine  LVM will let you forcible  replace such LV with
> error target  - so instead of  thinLV  - you will have  single 'error'
> target snapshot - which could be possibly even  auto-cleaned once the
> volume use-count drops bellow 0  (lvmpolld/dmeventd monitoring
> whatever...)
> 
> (Of course - we are not solving what happens to application
> using/running out of such error target - hopefully something not
> completely bad....)
> 
> This way - you get very 'powerful' weapon to be used in those 
> 'scriplets'
> so you can drop uneeded volumes ANYTIME you need to and reclaim its 
> resources...
> 

This would be *really* great. I played with dm-setup remove/error target 
and, while working, it often freezed LVM.
An integrated forced volume removal/swith to error target would be 
great.

Thank.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 23:31             ` Zdenek Kabelac
@ 2017-09-13  8:21               ` Gionatan Danti
  0 siblings, 0 replies; 91+ messages in thread
From: Gionatan Danti @ 2017-09-13  8:21 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: LVM general discussion and development

Il 13-09-2017 01:31 Zdenek Kabelac ha scritto:
> It's not just about 'complexity' in frame work.
> You would lose all the speed as well.
> You would significantly raise-up  memory requirement.
> 
> There is very good reason complex tools like  'thin_ls' are kept in
> user-space  outside of kernel   - with  'dm'  we tend to have simpler
> kernel logic
> and complexity should stay in user-space.
> 
> And of course - as pointed out - the size of your 'reserve' is so
> vague  :) and could potentially present major portion of you whole
> thin-pool size without any extra benefit (as obviously any reserve
> could be too small unless you 'reach' fully provisioned state :)
> 
> i.e. example:
> 10G thinLV  with  1G chunks -  single byte write may require full 1G 
> chunk...
> so do you decide to keep 10 free chunks in reserves ??

What was missing (because I thought it was implicit) is that I expect 
snapshot to never change - ie: they are read-only.
Anyway, I was not writing about "resevers" - rather, to 
preassign/preallocate the required space to a specific volume.
A fallocate on a otherwise thinly provisioned volume, if you like.

> Supposedly:
> 
> lvmconfig --typeconfig full --withversion
> 
>         # Available since version 2.2.89.
>         thin_pool_autoextend_threshold=70
> 
> 
> However there were some bugs and fixes - and validation for not
> allowing to create new thins - so do not try anything below 169 and if
> you can
> go with 173....
> 

Ah! I was not thinking about thin_pool_autoextend_threshold! I tried 
with 166 (for now) and I don't see any major problems. However, I will 
surely upgrade at the first opportunity!

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-13  7:53   ` Gionatan Danti
@ 2017-09-13  8:15     ` Zdenek Kabelac
  2017-09-13  8:28       ` Gionatan Danti
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-13  8:15 UTC (permalink / raw)
  To: LVM general discussion and development, Gionatan Danti, matthew patton

Dne 13.9.2017 v 09:53 Gionatan Danti napsal(a):
> Il 13-09-2017 01:22 matthew patton ha scritto:
>>> Step-by-step example:
>>  > - create a 40 GB thin volume and subtract its size from the thin
>> pool (USED 40 GB, FREE 60 GB, REFER 0 GB);
>>  > - overwrite the entire volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
>>  > - snapshot the volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
>>
>> And 3 other threads also take snapshots against the same volume, or
>> frankly any other volume in the pool.
>> Since the next step (overwrite) hasn't happened yet or has written
>> less than 20GB, all succeed.
>>
>>  > - completely overwrite the original volume (USED 80 GB, FREE 20 GB,
>> REFER 40 GB);
>>
>> 4 threads all try to write their respective 40GB. Afterall, they got
>> the green-light since their snapshot was allowed to be taken.
>> Your thinLV blows up spectacularly.
>>
>>  > - a new snapshot creation will fails (REFER is higher then FREE).
>> nobody cares about new snapshot creation attempts at this point.
>>
>>
>>> When do you decide it ?  (you need to see this is total race-lend)
>>
>> exactly!
> 
> I all the examples I did, the snapshot are suppose to be read-only or at least 
> never written. I thought that it was implicitly clear due to ZFS (used as 
> example) being read-only by default. Sorry for not explicitly stating that.
> 

Ohh this is pretty major constrain ;)

But as pointed out multiple times - with scripting around various fullness 
moments of thin-pool - several different actions can be programmed around,
starting from fstrim, ending with plain erase of unneeded snapshot.
(Maybe erasing unneeded files....)

To get most secure application - such app should actually avoid using 
page-cache (using direct-io)  in such case you are always guaranteed
to get exact error at the exact time (i.e. even without journaled mounting 
option for ext4....)


> After the last write, the cloned cvol1 is clearly corrputed, but the original 
> volume has not problem at all.

Surely there is good reason we keep 'old snapshots' still with us - although 
everyone knows it's implementation has aged :)

There are cases where this copying into separate COW areas simply works better 
- especially for temporary living object with low number of 'small' changes.

We even support  old-snapshot for thin-volumes for this reason - so you can 
use 'bigger' thin-pool chunks - but for temporary snapshot for taking backups
you can take old snapshot of thin volume...


> 
> This was more or less the case with classical, fat LVM: a snapshot runnig out 
> of space *will* fail, but the original volume remains unaffected.

Partially this might get solved in 'some' cases with fully provisioned thinLVs 
within thin-pool...

What comes to my mind as possible supporting solution is -
adding possible enhancement on LVM2 side could be  'forcible' removal of 
running volumes  (aka lvm2 equivalent  of 'dmsetup remove --force')

ATM lvm2 prevents you to remove 'running/mounted' volumes.

I can well imagine  LVM will let you forcible  replace such LV with error 
target  - so instead of  thinLV  - you will have  single 'error' target 
snapshot - which could be possibly even  auto-cleaned once the volume 
use-count drops bellow 0  (lvmpolld/dmeventd monitoring whatever...)

(Of course - we are not solving what happens to application using/running out 
of such error target - hopefully something not completely bad....)

This way - you get very 'powerful' weapon to be used in those 'scriplets'
so you can drop uneeded volumes ANYTIME you need to and reclaim its resources...

Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 23:22 ` matthew patton
@ 2017-09-13  7:53   ` Gionatan Danti
  2017-09-13  8:15     ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-13  7:53 UTC (permalink / raw)
  To: matthew patton, LVM general discussion and development

Il 13-09-2017 01:22 matthew patton ha scritto:
>> Step-by-step example:
>  > - create a 40 GB thin volume and subtract its size from the thin
> pool (USED 40 GB, FREE 60 GB, REFER 0 GB);
>  > - overwrite the entire volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
>  > - snapshot the volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
> 
> And 3 other threads also take snapshots against the same volume, or
> frankly any other volume in the pool.
> Since the next step (overwrite) hasn't happened yet or has written
> less than 20GB, all succeed.
> 
>  > - completely overwrite the original volume (USED 80 GB, FREE 20 GB,
> REFER 40 GB);
> 
> 4 threads all try to write their respective 40GB. Afterall, they got
> the green-light since their snapshot was allowed to be taken.
> Your thinLV blows up spectacularly.
> 
>  > - a new snapshot creation will fails (REFER is higher then FREE).
> nobody cares about new snapshot creation attempts at this point.
> 
> 
>> When do you decide it ?  (you need to see this is total race-lend)
> 
> exactly!

I all the examples I did, the snapshot are suppose to be read-only or at 
least never written. I thought that it was implicitly clear due to ZFS 
(used as example) being read-only by default. Sorry for not explicitly 
stating that.

However, the refreservation mechanism can protect the original volume 
even when snapshots are writeable. Here we go:

# Create a 400M ZVOL and fill it
[root@localhost ~]# zfs create -V 400M tank/vol1
[root@localhost ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=1M 
oflag=direct
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
401+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 23.0573 s, 18.2 MB/s
[root@localhost ~]# zfs list -t all
NAME        USED  AVAIL  REFER  MOUNTPOINT
tank        416M   464M    24K  /tank
tank/vol1   414M   478M   401M  -

# Create some snapshots (note how the USED value increased due to the 
snapshot reserving space for all "live" data in the ZVOL)
[root@localhost ~]# zfs set snapdev=visible tank/vol1
[root@localhost ~]# zfs snapshot tank/vol1@snap1
[root@localhost ~]# zfs snapshot tank/vol1@snap2
[root@localhost ~]# zfs list -t all
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              816M  63.7M    24K  /tank
tank/vol1         815M   478M   401M  -
tank/vol1@snap1     0B      -   401M  -
tank/vol1@snap2     0B      -   401M  -

# Clone the snapshot (to be able to overwrite it)
[root@localhost ~]# zfs clone tank/vol1@snap1 tank/cvol1
[root@localhost ~]# zfs list -t all
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              815M  64.6M    24K  /tank
tank/cvol1          1K  64.6M   401M  -
tank/vol1         815M   479M   401M  -
tank/vol1@snap1     0B      -   401M  -
tank/vol1@snap2     0B      -   401M  -

# Writing to the cloned ZVOL fails (after only 66 MB written) *without* 
impacting the original volume
[root@localhost ~]# dd if=/dev/zero of=/dev/zvol/tank/cvol1 bs=1M 
oflag=direct
dd: error writing ‘/dev/zvol/tank/cvol1’: Input/output error
64+0 records in
63+0 records out
66060288 bytes (66 MB) copied, 25.9189 s, 2.5 MB/s

After the last write, the cloned cvol1 is clearly corrputed, but the 
original volume has not problem at all.

Now, I am *not* advocating switching thinp to a ZFS-like things (ie: 
note the write speed, which is low even for my super-slow notebook HDD). 
However, a mechanism with which we can tell LVM "hey, this volume should 
have all its space as reserved, don't worry about preventing snapshots 
and/or freezing them when free space runs out".

This was more or less the case with classical, fat LVM: a snapshot 
runnig out of space *will* fail, but the original volume remains 
unaffected.

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-13  2:23 ` matthew patton
@ 2017-09-13  7:25   ` Zdenek Kabelac
  0 siblings, 0 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-13  7:25 UTC (permalink / raw)
  To: matthew patton, LVM general discussion and development

Dne 13.9.2017 v 04:23 matthew patton napsal(a):
> I don't recall seeing an actual, practical, real-world example of why this issue got broached again. So here goes.
> 
> Create a thin LV on KVM dom0, put XFS/EXT4 on it, lay down (sparse) files as KVM virtual disk files.
> Create and launch VMs and configure to suit. For example a dedicated VM for each of web server, a Tomcat server, and database. Let's call it a 'Stack'.
> You're done configuring it.
> 
> You take a snapshot as a "restore point".
> Then you present to your developers (or customers) a "drive-by" clone (snapshot) of the LV in which changes are typically quite limited (but could go up to full capacity) worth of overwrites depending on how much they test/play with it. You could have 500 such copies resident. Thin LV clones are damn convenient and mostly "free" and attractive for that purpose.
> 


There is one point which IMHO would be way more worth to invest resource into 
ATM whenever you have snapshot -  there is unfortunately no page-cache sharing.

So i.e. you have 10 LVs being snapshots of the single origin you get 10 
different copies of pages in RAM of the same data.

But this is really hard problem to solve...


Regards


Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-13  0:04 ` matthew patton
@ 2017-09-13  7:10   ` Zdenek Kabelac
  0 siblings, 0 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-13  7:10 UTC (permalink / raw)
  To: matthew patton, LVM general discussion and development

Dne 13.9.2017 v 02:04 matthew patton napsal(a):
> 'yes'
> 
> The filesystem may not be resident on the hypervisor (dom0) so 'dmsetup suspend' is probably more apropos. How well that propagates upward to the unwary client VM remains to be seen. But if one were running a NFS server using thin+xfs/ext4 then the 'fsfreeze' makes sense.
> 


lvm2 is not 'expecting' someone will touch lvm2 controlled DM devices.

If you suspend  thinLV with dmsetup - you are in big 'danger' of freezing 
further lvm2 processing - i.e. command will try scan device list and will get 
blocked on suspend  (even if lvm2 checks for 'suspended' dm devices to skip 
then via lvm.conf setting - there is clear race)

So any solution which works outside lvm2 and changes  dm table outside of lvm2 
locking mechanism is hardly supportable - it can be used as 'last weapon' - 
but it should be clear to user - that next proper step is to reboot the 
machine....


Regards


Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
       [not found] <1771452279.913055.1505269434212.ref@mail.yahoo.com>
@ 2017-09-13  2:23 ` matthew patton
  2017-09-13  7:25   ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: matthew patton @ 2017-09-13  2:23 UTC (permalink / raw)
  To: LVM general discussion and development, matthew patton

I don't recall seeing an actual, practical, real-world example of why this issue got broached again. So here goes.

Create a thin LV on KVM dom0, put XFS/EXT4 on it, lay down (sparse) files as KVM virtual disk files.
Create and launch VMs and configure to suit. For example a dedicated VM for each of web server, a Tomcat server, and database. Let's call it a 'Stack'.
You're done configuring it.

You take a snapshot as a "restore point".
Then you present to your developers (or customers) a "drive-by" clone (snapshot) of the LV in which changes are typically quite limited (but could go up to full capacity) worth of overwrites depending on how much they test/play with it. You could have 500 such copies resident. Thin LV clones are damn convenient and mostly "free" and attractive for that purpose.

At some point one of those snapshots gets launched as, or converted into a production instance. Or if you rather, a customer purchases it and now you must be able to guarantee that it can do a full overwrite of it's space and that any interaction with the underlying thin pool trumps all the other ankle-biters (demo, dev, qa, trial) that might also be resident. Lesser snapshots will necessarily be evicted (destroyed) until the volume reaches some pre-defined level of reserved space that is now solely used for quick point-in-time restore points of the remaining instances. These snaps are retained for some amount of time and likely spooled off to a backup location. If thinPool pressure gets too high the oldest restore points (snapshots) get destroyed.

In any given ThinPool there may be multiple Stacks or flavors/versions of same.

I believe the pseudo-script provided earlier this afternoon suffices to implement the above.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
       [not found] <57374453.843393.1505261050977.ref@mail.yahoo.com>
@ 2017-09-13  0:04 ` matthew patton
  2017-09-13  7:10   ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: matthew patton @ 2017-09-13  0:04 UTC (permalink / raw)
  To: LVM general discussion and development

'yes'

The filesystem may not be resident on the hypervisor (dom0) so 'dmsetup suspend' is probably more apropos. How well that propagates upward to the unwary client VM remains to be seen. But if one were running a NFS server using thin+xfs/ext4 then the 'fsfreeze' makes sense.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 23:15           ` Gionatan Danti
@ 2017-09-12 23:31             ` Zdenek Kabelac
  2017-09-13  8:21               ` Gionatan Danti
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-12 23:31 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: LVM general discussion and development

Dne 13.9.2017 v 01:15 Gionatan Danti napsal(a):
>> There could be a simple answer and complex one :)
>>
>> I'd start with simple one - already presented here -
> There you can decide - if you want to extend thin-pool...
>> You may drop some snapshot...
>> You may fstrim mounted thinLVs...
>> You can kill volumes way before the situation becomes unmaintable....
> 
> Ok, this is an answer I totally accept: if enable per-lv used and reserved 
> space is so difficult in the current thinp framework, don't do it.

It's not just about 'complexity' in frame work.
You would lose all the speed as well.
You would significantly raise-up  memory requirement.

There is very good reason complex tools like  'thin_ls' are kept in user-space 
  outside of kernel   - with  'dm'  we tend to have simpler kernel logic
and complexity should stay in user-space.

And of course - as pointed out - the size of your 'reserve' is so vague  :) 
and could potentially present major portion of you whole thin-pool size 
without any extra benefit (as obviously any reserve could be too small unless 
you 'reach' fully provisioned state :)

i.e. example:
10G thinLV  with  1G chunks -  single byte write may require full 1G chunk...
so do you decide to keep 10 free chunks in reserves ??


>> All you need to accept is - you will kill them at 95% -
>> in your world with reserves it would be already reported as 100% full,
>> with totally unknown size of reserves :)
> 
> Minor nitpicking: I am not speaking about "reserves" to use when free space is 
> low, but about "reserved space" - ie: per-volume space which can not be used 
> by any other object.
> 
> One question: in a previous email you shown how a threshold can be set to deny 
> new volume/snapshot creation. How can I do that? What LVM version I need?

Supposedly:

lvmconfig --typeconfig full --withversion

         # Available since version 2.2.89.
         thin_pool_autoextend_threshold=70


However there were some bugs and fixes - and validation for not allowing to 
create new thins - so do not try anything below 169 and if you can
go with 173....


Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
       [not found] <691633892.829188.1505258696384.ref@mail.yahoo.com>
@ 2017-09-12 23:24 ` matthew patton
  0 siblings, 0 replies; 91+ messages in thread
From: matthew patton @ 2017-09-12 23:24 UTC (permalink / raw)
  To: LVM general discussion and development

> True, with a catch: with the default data=ordered option, even ext4 does 
> *not* remount read only when data writeout fails. You need to use
> both  "errors=remount-ro" and "data=journal" which basically nobody uses.
 
Then you need to hang out with people who actually do storage for a living.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
       [not found] <1575245610.821680.1505258554456.ref@mail.yahoo.com>
@ 2017-09-12 23:22 ` matthew patton
  2017-09-13  7:53   ` Gionatan Danti
  0 siblings, 1 reply; 91+ messages in thread
From: matthew patton @ 2017-09-12 23:22 UTC (permalink / raw)
  To: LVM general discussion and development


 > Step-by-step example:
 > - create a 40 GB thin volume and subtract its size from the thin pool (USED 40 GB, FREE 60 GB, REFER 0 GB);
 > - overwrite the entire volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
 > - snapshot the volume (USED 40 GB, FREE 60 GB, REFER 40 GB);

And 3 other threads also take snapshots against the same volume, or frankly any other volume in the pool.
Since the next step (overwrite) hasn't happened yet or has written less than 20GB, all succeed.

 > - completely overwrite the original volume (USED 80 GB, FREE 20 GB, REFER 40 GB);

4 threads all try to write their respective 40GB. Afterall, they got the green-light since their snapshot was allowed to be taken.
Your thinLV blows up spectacularly.

 > - a new snapshot creation will fails (REFER is higher then FREE).
nobody cares about new snapshot creation attempts at this point.


> When do you decide it ?  (you need to see this is total race-lend)

exactly!

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 23:02         ` Zdenek Kabelac
@ 2017-09-12 23:15           ` Gionatan Danti
  2017-09-12 23:31             ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-12 23:15 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: LVM general discussion and development

> There could be a simple answer and complex one :)
> 
> I'd start with simple one - already presented here -
> 
> when you write to INDIVIDUAL thin volume target - respective dn thin
> target DOES manipulate with single btree set - it does NOT care there
> are some other snapshot and never influnces them -
> 
> You ask here to heavily 'change' thin-pool logic - so writing to THIN
> volume A  can remove/influence volume B - this is very problematic for
> meny reasons.
> 
> We can go into details of BTree updates  (that should be really
> discussed with its authors on dm channel ;)) - but I think the key
> element is capturing the idea the usage of thinLV A does not change
> thinLV B.
> 
> 
> ----
> 
> 
> Now to your free 'reserved' space fiction :)
> There is NO way to decide WHO deserves to use the reserve :)
> 
> Every thin volume is equal - (the fact we call some thin LV snapshot
> is user-land fiction - in kernel all thinLV are just equal -  every
> thinLV reference set of thin-pool chunks)  -
> 
> (for late-night thinking -  what would be snapshot of snapshot which
> is fully overwritten ;))
> 
> So when you now see that all thinLVs  just maps set of chunks,
> and all thinLVs can be active and running concurrently - how do you
> want to use reserves in thin-pool :) ?
> When do you decide it ?  (you need to see this is total race-lend)
> How do you actually orchestrate locking around this single point of 
> failure ;) ?
> You will surely come with and idea of having reserve separate for every 
> thinLV ?
> How big it should actually be ?
> Are you going to 'refill' those reserves  when thin-pool gets emptier ?
> How you decide which thinLV deserves bigger reserves ;) ??
> 
> I assume you can start to SEE the whole point of this misery....
> 
> So instead -  you can start with normal thin-pool - keep it simple in 
> kernel,
> and solve complexity in user-space.
> 
> There you can decide - if you want to extend thin-pool...
> You may drop some snapshot...
> You may fstrim mounted thinLVs...
> You can kill volumes way before the situation becomes unmaintable....

Ok, this is an answer I totally accept: if enable per-lv used and 
reserved space is so difficult in the current thinp framework, don't do 
it.

Thanks to taking the time to explain (on late night ;))

> All you need to accept is - you will kill them at 95% -
> in your world with reserves it would be already reported as 100% full,
> with totally unknown size of reserves :)

Minor nitpicking: I am not speaking about "reserves" to use when free 
space is low, but about "reserved space" - ie: per-volume space which 
can not be used by any other object.

One question: in a previous email you shown how a threshold can be set 
to deny new volume/snapshot creation. How can I do that? What LVM 
version I need?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 22:55       ` Gionatan Danti
@ 2017-09-12 23:11         ` Zdenek Kabelac
  0 siblings, 0 replies; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-12 23:11 UTC (permalink / raw)
  To: Gionatan Danti, LVM general discussion and development

Dne 13.9.2017 v 00:55 Gionatan Danti napsal(a):
> Il 13-09-2017 00:41 Zdenek Kabelac ha scritto:
>> There are maybe few worthy comments - XFS is great on stanadar big
>> volumes, but there used to be some hidden details when used on thinly
>> provisioned volumes on older RHEL (7.0, 7.1)
>>
>> So now it depend how old distro you use (I'd probably highly recommend
>> upgrade to RH7.4 if you are on RHEL based distro)
> 
> Sure.
> 
>> Basically 'XFS' does not have similar 'remount-ro' on error behavior
>> which 'extX' provides - but now XFS knows how to shutdown itself when
>> meta/data updates starts to fail - although you may need to tune some
>> 'sysfs' params to get 'ideal' behavior.
> 
> True, with a catch: with the default data=ordered option, even ext4 does *not* 
> remount read only when data writeout fails. You need to use both 
> "errors=remount-ro" and "data=journal" which basically nobody uses.
> 
>> Personally for smaller sized thin volumes I'd prefer 'ext4' over XFS -
>> unless you demand some specific XFS feature...
> 
> Thanks for the input. So, do you run your ext4 filesystem with data=journal? 
> How they behave performane-wise?
> 

As said     data=journal   is big performance killer (especially on SSD)

Personally  I prefer early 'shutdown' in case the situation becomes critical
(i.e. 95% fullness because some process gets crazy)

But you can write any advanced scripting logic to suit best your needs -

i.e. replace all thins on thin-pool with 'error' target....
(which  is as simple as using  'dmsetup remove --force'.... - this will
make all future read/writes  giving you  i/o errors....)

Simply do all in user-space early enough before  thin-pool can ever get NEAR t 
being 100% full -  reaction is really quick - and you have at least 60seconds 
to solve the problem in worst case.....



Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 22:41       ` Gionatan Danti
@ 2017-09-12 23:02         ` Zdenek Kabelac
  2017-09-12 23:15           ` Gionatan Danti
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-12 23:02 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: LVM general discussion and development

Dne 13.9.2017 v 00:41 Gionatan Danti napsal(a):
> Il 13-09-2017 00:16 Zdenek Kabelac ha scritto:
>> Dne 12.9.2017 v 23:36 Gionatan Danti napsal(a):
>>> Il 12-09-2017 21:44 matthew patton ha scritto:
>>
>>> Again, please don't speak about things you don't know.
>>> I am *not* interested in thin provisioning itself at all; on the other 
>>> side, I find CoW and fast snapshots very useful.
>>
>>
>> Not going to comment KVM storage architecture - but with this statemnet -
>> you have VERY simple usage:
>>
>>
>> Just minimize chance for overprovisioning -
>>
>> let's go by example:
>>
>> you have� 10� 10GiB volumes� and you have 20 snapshots...
>>
>>
>> to not overprovision - you need 10 GiB * 30 LV� = 300GiB thin-pool.
>>
>> if that sounds too-much.
>>
>> you can go with 150 GiB - to always 100% cover all 'base' volumes.
>> and have some room for snapshots.
>>
>>
>> Now the fun begins - while monitoring is running -
>> you get callback for� 50%, 55%... 95% 100%
>> at each moment� you can do whatever action you need.
>>
>>
>> So assume 100GiB is bare minimum for base volumes - you ignore any
>> state with less then 66% occupancy of thin-pool and you start solving
>> problems with 85% (~128GiB)- you know some snapshot is better to be
>> dropped.
>> You may try 'harder' actions for higher percentage.
>> (you need to consider how many dirty pages you leave floating your system
>> and other variables)
>>
>> Also you pick with some logic the snapshot which you want to drop -
>> Maybe the oldest ?
>> (see airplane :) URL link)....
>>
>> Anyway - you have plenty of time to solve it still at this moment
>> without any danger of losing write operation...
>> All you can lose is some 'snapshot' which might have been present a
>> bit longer...� but that is supposedly fine with your model workflow...
>>
>> Of course you are getting in serious problem, if you try to keep all
>> these demo-volumes within 50GiB with massive overprovisioning ;)
>>
>> There you have much hard times what should happen what should be
>> removed and where is possibly better to STOP everything and let admin
>> decide what is the ideal next step....
>>
> 
> Hi Zdenek,
> I fully agree with what you said above, and I sincerely thank you for taking 
> the time to reply.
> However, I am not sure to understand *why* reserving space for a thin volume 
> seems a bad idea to you.
> 
> Lets have a 100 GB thin pool, and wanting to *never* run out of space in spite 
> of taking multiple snapshots.
> To achieve that, I need to a) carefully size the original volume, b) ask the 
> thin pool to reserve the needed space and c) counting the "live" data (REFER 
> in ZFS terms) allocated inside the thin volume.
> 
> Step-by-step example:
> - create a 40 GB thin volume and subtract its size from the thin pool (USED 40 
> GB, FREE 60 GB, REFER 0 GB);
> - overwrite the entire volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
> - snapshot the volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
> - completely overwrite the original volume (USED 80 GB, FREE 20 GB, REFER 40 GB);
> - a new snapshot creation will fails (REFER is higher then FREE).
> 
> Result: thin pool is *never allowed* to fill. You need to keep track of 
> per-volume USED and REFER space, but thinp performance should not be impacted 
> in any manner. This is not theoretical: it is already working in this manner 
> with ZVOLs and refreservation, *without* involing/requiring any advanced 
> coupling/integration between block and filesystem layers.
> 
> Don't get me wrong: I am sure that, if you choose to not implement this 
> scheme, you have a very good reason to do that. Moreover, I understand that 
> patches are welcome :)
> 
> But I would like to understand *why* this possibility is ruled out with such 
> firmness.
> 

There could be a simple answer and complex one :)

I'd start with simple one - already presented here -

when you write to INDIVIDUAL thin volume target - respective dn thin target 
DOES manipulate with single btree set - it does NOT care there are some other 
snapshot and never influnces them -

You ask here to heavily 'change' thin-pool logic - so writing to THIN volume A 
  can remove/influence volume B - this is very problematic for meny reasons.

We can go into details of BTree updates  (that should be really discussed with 
its authors on dm channel ;)) - but I think the key element is capturing the 
idea the usage of thinLV A does not change thinLV B.


----


Now to your free 'reserved' space fiction :)
There is NO way to decide WHO deserves to use the reserve :)

Every thin volume is equal - (the fact we call some thin LV snapshot is 
user-land fiction - in kernel all thinLV are just equal -  every thinLV 
reference set of thin-pool chunks)  -

(for late-night thinking -  what would be snapshot of snapshot which is fully 
overwritten ;))

So when you now see that all thinLVs  just maps set of chunks,
and all thinLVs can be active and running concurrently - how do you want to 
use reserves in thin-pool :) ?
When do you decide it ?  (you need to see this is total race-lend)
How do you actually orchestrate locking around this single point of failure ;) ?
You will surely come with and idea of having reserve separate for every thinLV ?
How big it should actually be ?
Are you going to 'refill' those reserves  when thin-pool gets emptier ?
How you decide which thinLV deserves bigger reserves ;) ??

I assume you can start to SEE the whole point of this misery....

So instead -  you can start with normal thin-pool - keep it simple in kernel,
and solve complexity in user-space.

There you can decide - if you want to extend thin-pool...
You may drop some snapshot...
You may fstrim mounted thinLVs...
You can kill volumes way before the situation becomes unmaintable....

All you need to accept is - you will kill them at 95% -
in your world with reserves it would be already reported as 100% full,
with totally unknown size of reserves :)

Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 22:41     ` Zdenek Kabelac
@ 2017-09-12 22:55       ` Gionatan Danti
  2017-09-12 23:11         ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-12 22:55 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: LVM general discussion and development

Il 13-09-2017 00:41 Zdenek Kabelac ha scritto:
> There are maybe few worthy comments - XFS is great on stanadar big
> volumes, but there used to be some hidden details when used on thinly
> provisioned volumes on older RHEL (7.0, 7.1)
> 
> So now it depend how old distro you use (I'd probably highly recommend
> upgrade to RH7.4 if you are on RHEL based distro)

Sure.

> Basically 'XFS' does not have similar 'remount-ro' on error behavior
> which 'extX' provides - but now XFS knows how to shutdown itself when
> meta/data updates starts to fail - although you may need to tune some
> 'sysfs' params to get 'ideal' behavior.

True, with a catch: with the default data=ordered option, even ext4 does 
*not* remount read only when data writeout fails. You need to use both 
"errors=remount-ro" and "data=journal" which basically nobody uses.

> Personally for smaller sized thin volumes I'd prefer 'ext4' over XFS -
> unless you demand some specific XFS feature...

Thanks for the input. So, do you run your ext4 filesystem with 
data=journal? How they behave performane-wise?

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 17:09   ` Gionatan Danti
@ 2017-09-12 22:41     ` Zdenek Kabelac
  2017-09-12 22:55       ` Gionatan Danti
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-12 22:41 UTC (permalink / raw)
  To: LVM general discussion and development, Gionatan Danti

Dne 12.9.2017 v 19:09 Gionatan Danti napsal(a):
> Hi,
> 

> The default combination is automatically the most tested one. This will really 
> pay off when you face some unexptected bug/behavior
> 
>> And yet you persist on using the dumbest combo available: thin + xfs. No 
>> offense to LVM Thin, it works great WHEN used correctly. To channel Apple, 
>> "you're holding it wrong".
> 
> This is what RedHat is heavily supporting. I see nothing wrong with thin + 
> XFS, and both thinp and XFS developers confirm that.
> 
> Again: maybe I am missing something?

There are maybe few worthy comments - XFS is great on stanadar big volumes, 
but there used to be some hidden details when used on thinly provisioned 
volumes on older RHEL (7.0, 7.1)

So now it depend how old distro you use (I'd probably highly recommend upgrade 
to RH7.4 if you are on RHEL based distro)

Basically 'XFS' does not have similar 'remount-ro' on error behavior which 
'extX' provides - but now XFS knows how to shutdown itself when meta/data 
updates starts to fail - although you may need to tune some 'sysfs' params to 
get 'ideal' behavior.

Personally for smaller sized thin volumes I'd prefer 'ext4' over XFS - unless 
you demand some specific XFS feature...

Regards


Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 22:16     ` Zdenek Kabelac
@ 2017-09-12 22:41       ` Gionatan Danti
  2017-09-12 23:02         ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-12 22:41 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: matthew, LVM general discussion and development

Il 13-09-2017 00:16 Zdenek Kabelac ha scritto:
> Dne 12.9.2017 v 23:36 Gionatan Danti napsal(a):
>> Il 12-09-2017 21:44 matthew patton ha scritto:
> 
>> Again, please don't speak about things you don't know.
>> I am *not* interested in thin provisioning itself at all; on the other 
>> side, I find CoW and fast snapshots very useful.
> 
> 
> Not going to comment KVM storage architecture - but with this statemnet 
> -
> you have VERY simple usage:
> 
> 
> Just minimize chance for overprovisioning -
> 
> let's go by example:
> 
> you have  10  10GiB volumes  and you have 20 snapshots...
> 
> 
> to not overprovision - you need 10 GiB * 30 LV  = 300GiB thin-pool.
> 
> if that sounds too-much.
> 
> you can go with 150 GiB - to always 100% cover all 'base' volumes.
> and have some room for snapshots.
> 
> 
> Now the fun begins - while monitoring is running -
> you get callback for  50%, 55%... 95% 100%
> at each moment  you can do whatever action you need.
> 
> 
> So assume 100GiB is bare minimum for base volumes - you ignore any
> state with less then 66% occupancy of thin-pool and you start solving
> problems with 85% (~128GiB)- you know some snapshot is better to be
> dropped.
> You may try 'harder' actions for higher percentage.
> (you need to consider how many dirty pages you leave floating your 
> system
> and other variables)
> 
> Also you pick with some logic the snapshot which you want to drop -
> Maybe the oldest ?
> (see airplane :) URL link)....
> 
> Anyway - you have plenty of time to solve it still at this moment
> without any danger of losing write operation...
> All you can lose is some 'snapshot' which might have been present a
> bit longer...  but that is supposedly fine with your model workflow...
> 
> Of course you are getting in serious problem, if you try to keep all
> these demo-volumes within 50GiB with massive overprovisioning ;)
> 
> There you have much hard times what should happen what should be
> removed and where is possibly better to STOP everything and let admin
> decide what is the ideal next step....
> 

Hi Zdenek,
I fully agree with what you said above, and I sincerely thank you for 
taking the time to reply.
However, I am not sure to understand *why* reserving space for a thin 
volume seems a bad idea to you.

Lets have a 100 GB thin pool, and wanting to *never* run out of space in 
spite of taking multiple snapshots.
To achieve that, I need to a) carefully size the original volume, b) ask 
the thin pool to reserve the needed space and c) counting the "live" 
data (REFER in ZFS terms) allocated inside the thin volume.

Step-by-step example:
- create a 40 GB thin volume and subtract its size from the thin pool 
(USED 40 GB, FREE 60 GB, REFER 0 GB);
- overwrite the entire volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
- snapshot the volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
- completely overwrite the original volume (USED 80 GB, FREE 20 GB, 
REFER 40 GB);
- a new snapshot creation will fails (REFER is higher then FREE).

Result: thin pool is *never allowed* to fill. You need to keep track of 
per-volume USED and REFER space, but thinp performance should not be 
impacted in any manner. This is not theoretical: it is already working 
in this manner with ZVOLs and refreservation, *without* 
involing/requiring any advanced coupling/integration between block and 
filesystem layers.

Don't get me wrong: I am sure that, if you choose to not implement this 
scheme, you have a very good reason to do that. Moreover, I understand 
that patches are welcome :)

But I would like to understand *why* this possibility is ruled out with 
such firmness.

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 21:36   ` Gionatan Danti
@ 2017-09-12 22:16     ` Zdenek Kabelac
  2017-09-12 22:41       ` Gionatan Danti
  0 siblings, 1 reply; 91+ messages in thread
From: Zdenek Kabelac @ 2017-09-12 22:16 UTC (permalink / raw)
  To: LVM general discussion and development, Gionatan Danti, matthew patton

Dne 12.9.2017 v 23:36 Gionatan Danti napsal(a):
> Il 12-09-2017 21:44 matthew patton ha scritto:

> Again, please don't speak about things you don't know.
> I am *not* interested in thin provisioning itself at all; on the other side, I 
> find CoW and fast snapshots very useful.


Not going to comment KVM storage architecture - but with this statemnet -
you have VERY simple usage:


Just minimize chance for overprovisioning -

let's go by example:

you have  10  10GiB volumes  and you have 20 snapshots...


to not overprovision - you need 10 GiB * 30 LV  = 300GiB thin-pool.

if that sounds too-much.

you can go with 150 GiB - to always 100% cover all 'base' volumes.
and have some room for snapshots.


Now the fun begins - while monitoring is running -
you get callback for  50%, 55%... 95% 100%
at each moment  you can do whatever action you need.


So assume 100GiB is bare minimum for base volumes - you ignore any state with 
less then 66% occupancy of thin-pool and you start solving problems with 85% 
(~128GiB)- you know some snapshot is better to be dropped.
You may try 'harder' actions for higher percentage.
(you need to consider how many dirty pages you leave floating your system
and other variables)

Also you pick with some logic the snapshot which you want to drop -
Maybe the oldest ?
(see airplane :) URL link)....

Anyway - you have plenty of time to solve it still at this moment
without any danger of losing write operation...
All you can lose is some 'snapshot' which might have been present a bit 
longer...  but that is supposedly fine with your model workflow...

Of course you are getting in serious problem, if you try to keep all these 
demo-volumes within 50GiB with massive overprovisioning ;)

There you have much hard times what should happen what should be removed and 
where is possibly better to STOP everything and let admin decide what is the 
ideal next step....

Regards

Zdenek

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
       [not found] ` <418002318.647314.1505245480415@mail.yahoo.com>
@ 2017-09-12 21:36   ` Gionatan Danti
  2017-09-12 22:16     ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-12 21:36 UTC (permalink / raw)
  To: matthew patton; +Cc: linux-lvm

Il 12-09-2017 21:44 matthew patton ha scritto:
>>  Wrong: Qemu/KVM *does* honors write barrier, unless you use  
>> "cache=unsafe".
> 
> seems the default is now 'unsafe'.
> http://libvirt.org/formatdomain.html#elementsDisks

The default is cache=none

> Only if the barrier frame gets passed along by KVM and only if you're
> running in "directsync" (or perhaps 'none?') mode there is no
> guarantee any of it hit the platter. Let's assume a hypervisor I/O is
> ahead of VM 'A's barrier frame, and blows up the thinLV. Then yes it's
> possible to propagate the ENOSPACE or other error back to the VM 'A'
> to realize that write didn't actually succeed. The rapid failure
> cascade across resident VMs is not going to end well either.

fsynched writes that hits a full pool returns EIO to the upper layer

> But if that's the only condition it works under they why bother with
> XFS on top of thin? You already mentioned that XFS is lousy about
> actually detecting underlying block layer problems until much too
> late. Just provision the Thin LV and map it directly to the VM. If
> your choice of hypervisor is broken, fix it, or choose another that
> doesn't have the problem. But really, using ThinLV for guest VM and
> asking the hypervisor to not blow up is nuts. Buy a friggin disk. Or
> put the onus on try/error where it belongs - inside the guest VM. Why
> is that so hard?

KVM is the most valid GPL hypervisor, and libvirt is the virtualization 
library of choice.
But I can not fix/implement thin pool/volumes management alone.

> You're a web-hosting company and you're trying to duck the laws of
> economics and the reality of running a business where other often
> clueless people trust you to keep their data intact?

Please, don't elaborate on things you don't know.
I asked a specific question on the linux-lvm list, and (as always) I 
learnt something. I don't see any problem in doing that.

> Given the screwball way you're going about handling your customer
> data, why are you trying to be 'creative'? Storage is your MOST
> important and vital capability and naturally your most expensive. Get
> real and spend the money. Storage is where you NEVER cut corners and
> you NEVER attempt naive optimization efforts.
> 
> I'm not quibbling over XFS, you could have picked EXT4 for all I care.
> The point is you're trying to get jiggy with customer data to save
> pennies. We have a saying, picking up pennies in front of a road
> paver. If you're selling capacity you don't have, it's not only
> dishonest, but there are proven and well understood ways to do so (eg.
> NFS or thin-alloc iSCSI) aimed at a proper storage head that is
> properly managed. But ultimately you HAVE to be able to satisfy the
> instant demands of all users even if that means they all suddenly wake
> up and want to use what you supposedly sold them and entered into a
> contract to supply by taking  their money.

Again, please don't speak about things you don't know.
I am *not* interested in thin provisioning itself at all; on the other 
side, I find CoW and fast snapshots very useful.

> Computing/Storage is not a ponzi scheme.

Thanks for remind me that.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-12 15:03 ` matthew patton
@ 2017-09-12 17:09   ` Gionatan Danti
  2017-09-12 22:41     ` Zdenek Kabelac
  0 siblings, 1 reply; 91+ messages in thread
From: Gionatan Danti @ 2017-09-12 17:09 UTC (permalink / raw)
  To: matthew patton, LVM general discussion and development

Hi,

On 12/09/2017 17:03, matthew patton wrote:
>> I need to take a step back: my main use for thinp is virtual machine
>> backing store
> ...
>> Rather, I had to use a single, big thin volumes with XFS on top.
> ...
>> I used  ZFS as volume manager, with the intent to place an XFS filesystem on top
> 
> Good grief, you had integration (ZFS) and then you broke it. The ZFS as block or as filesystem is just symantics. 

I did for a compelling reason - to use DRBD for realtime replication. 
Moreover, this is the *expected* use for ZVOLs.

While you're at it dig into libvirt and see if you can fix it's silliness.

This simply can not be done by a single person in reasonable time, so I 
had to find other solution for now...

> Say you allowed a snapshot to be created when it was 31%. And 100 milliseconds later you had 2 more all ask for a snapshot and they succeeded. But 2 seconds later just one of your snapshot writers decided to write till it ran off the end of available space. What have you gained?

With the refreservation property we can *avoid* such a situation. Please 
re-read my bash examples in the previous email.

> FSync'd where? Inside your client VM? The hell they're safe. Your hypervisor is under no obligation to honor a write request issued to XFS as if it's synchronous.

Wrong: Qemu/KVM *does* honors write barrier, unless you use 
"cache=unsafe". Other behaviors should be threat as bugs.

> Is XFS at the hypervisor being mounted 'sync'? That's not nearly enough though. You can also prove that there is a direct 1:1 map between the client VM's aggregate of FSync inspired blocks and general writes being de-staged at the same time it gets handed off to the hypervisor's XFS with the same atomicity? And furthermore when your client VM's kernel ACK's the FSYNC it is saying so without having any idea that the write actually made it. It *thought* it had done all it was supposed to do. Now the user software as well as the VM kernel are being actively misled!
> 
> You're going about this completely wrong.
> 
> You have to push the "did my write actually succeed or not and how do I recover" to inside the client VM. Your client VM either gets issued a block device that is iSCSI (can be same host) or 'bare metal' LVM on the hypervisor. That's the ONLY way to make sure the I/O's don't get jumbled and errors map exactly. Otherwise for application scribble, the client VM mounts an NFS share that can be thinLV+XFS at the fileserver. Or buy a proper enterprise storage array (they are dirt-cheap used, off maint) where people far smarter than you have solved this problem decades ago.

Again: this is not how Qemu/KVM threats write barriers on the guest 
side. Really. You can check the qemu/libvirt mailing list for that. 
Bottom line: guest fsynced writes *are absolutely safe.* I even tested 
this on my lab by pulling of the plug *tens of times* during heavy IO.

> And yet you have demonstrated no ability to do so. Or at least have a very naive notion of what happens when multiple, simultaneous actors are involved. It sounds like some of your preferred toolset is letting you down. Roll up your sleeves and fix it. Why you give a damn about what filesystem is 'default' in any particular distribution is beyond me. Use the combination that actually works - not "if only this or that were changed it could/might work."

The default combination is automatically the most tested one. This will 
really pay off when you face some unexptected bug/behavior

> And yet you persist on using the dumbest combo available: thin + xfs. No offense to LVM Thin, it works great WHEN used correctly. To channel Apple, "you're holding it wrong".

This is what RedHat is heavily supporting. I see nothing wrong with thin 
+ XFS, and both thinp and XFS developers confirm that.

Again: maybe I am missing something?
Thanks.

-- 
-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
       [not found] <1806055156.426333.1505228581063.ref@mail.yahoo.com>
@ 2017-09-12 15:03 ` matthew patton
  2017-09-12 17:09   ` Gionatan Danti
  0 siblings, 1 reply; 91+ messages in thread
From: matthew patton @ 2017-09-12 15:03 UTC (permalink / raw)
  To: LVM general discussion and development

> I need to take a step back: my main use for thinp is virtual machine 
> backing store
...
> Rather, I had to use a single, big thin volumes with XFS on top.
...
>I used  ZFS as volume manager, with the intent to place an XFS filesystem on top

Good grief, you had integration (ZFS) and then you broke it. The ZFS as block or as filesystem is just symantics. While you're at it dig into libvirt and see if you can fix it's silliness.
 
> provisioned blocks. Rather, I all for something as "if free space  is lower than 30%, disable new snapshot *creation*"

Say you allowed a snapshot to be created when it was 31%. And 100 milliseconds later you had 2 more all ask for a snapshot and they succeeded. But 2 seconds later just one of your snapshot writers decided to write till it ran off the end of available space. What have you gained?

> is that, by cleaver using of the refreservation property, I can engineer 

You're not being nearly clever enough. You're using the wrong set of tools and making unsupported assumptions about future writes.

> Committed (fsynced) writes are safe

FSync'd where? Inside your client VM? The hell they're safe. Your hypervisor is under no obligation to honor a write request issued to XFS as if it's synchronous. 
Is XFS at the hypervisor being mounted 'sync'? That's not nearly enough though. You can also prove that there is a direct 1:1 map between the client VM's aggregate of FSync inspired blocks and general writes being de-staged at the same time it gets handed off to the hypervisor's XFS with the same atomicity? And furthermore when your client VM's kernel ACK's the FSYNC it is saying so without having any idea that the write actually made it. It *thought* it had done all it was supposed to do. Now the user software as well as the VM kernel are being actively misled!

You're going about this completely wrong.

You have to push the "did my write actually succeed or not and how do I recover" to inside the client VM. Your client VM either gets issued a block device that is iSCSI (can be same host) or 'bare metal' LVM on the hypervisor. That's the ONLY way to make sure the I/O's don't get jumbled and errors map exactly. Otherwise for application scribble, the client VM mounts an NFS share that can be thinLV+XFS at the fileserver. Or buy a proper enterprise storage array (they are dirt-cheap used, off maint) where people far smarter than you have solved this problem decades ago.

> really want to prevent full thin pools even in the face of failed

And yet you have demonstrated no ability to do so. Or at least have a very naive notion of what happens when multiple, simultaneous actors are involved. It sounds like some of your preferred toolset is letting you down. Roll up your sleeves and fix it. Why you give a damn about what filesystem is 'default' in any particular distribution is beyond me. Use the combination that actually works - not "if only this or that were changed it could/might work."

> to design system where some types of problems can not  simply happen.

And yet you persist on using the dumbest combo available: thin + xfs. No offense to LVM Thin, it works great WHEN used correctly. To channel Apple, "you're holding it wrong".

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
  2017-09-11 22:56 ` matthew patton
@ 2017-09-12  5:28   ` Gionatan Danti
  0 siblings, 0 replies; 91+ messages in thread
From: Gionatan Danti @ 2017-09-12  5:28 UTC (permalink / raw)
  To: matthew patton, LVM general discussion and development

Il 12-09-2017 00:56 matthew patton ha scritto:
> with the obvious caveat that in ZFS the block layer and the file
> layers are VERY tightly coupled. LVM and the block layer see
> eye-to-eye but ext4 et. al. have absolutely (almost?) no clue what's
> going on beneath it and thus LVM is making (false) guarantees that the
> filesystem is relying upon to actually be true.

Sure, but in the previous examples, I did *not* use the ZFS filesystem 
part; rather, I used it as a logical volume manager to carve out block 
devices to be used by other, traditional filesystems.

The entire discussion stems from the idea to let thinp reserve some 
space to avoid a full pool, by denying new snapshot and volume creation 
when a free space threshold is crossed.

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [linux-lvm] Reserve space for specific thin logical volumes
       [not found] <832949972.1610294.1505170613541.ref@mail.yahoo.com>
@ 2017-09-11 22:56 ` matthew patton
  2017-09-12  5:28   ` Gionatan Danti
  0 siblings, 1 reply; 91+ messages in thread
From: matthew patton @ 2017-09-11 22:56 UTC (permalink / raw)
  To: LVM general discussion and development

> Let me de-tour by using ZFS as an example

with the obvious caveat that in ZFS the block layer and the file layers are VERY tightly coupled. LVM and the block layer see eye-to-eye but ext4 et. al. have absolutely (almost?) no clue what's going on beneath it and thus LVM is making (false) guarantees that the filesystem is relying upon to actually be true.

IMO Thin-Pool is like waving around a lit welding torch - it's incredibly useful to do certain tasks but you can easily burn yourself and the building down if you don't handle it properly.

^ permalink raw reply	[flat|nested] 91+ messages in thread

end of thread, other threads:[~2017-09-21 20:32 UTC | newest]

Thread overview: 91+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-08 10:35 [linux-lvm] Reserve space for specific thin logical volumes Gionatan Danti
2017-09-08 11:06 ` Xen
2017-09-09 22:04 ` Gionatan Danti
2017-09-11 10:35   ` Zdenek Kabelac
2017-09-11 10:55     ` Xen
2017-09-11 11:20       ` Zdenek Kabelac
2017-09-11 12:06         ` Xen
2017-09-11 12:45           ` Xen
2017-09-11 13:11           ` Zdenek Kabelac
2017-09-11 13:46             ` Xen
2017-09-12 11:46               ` Zdenek Kabelac
2017-09-12 12:37                 ` Xen
2017-09-12 14:37                   ` Zdenek Kabelac
2017-09-12 16:44                     ` Xen
2017-09-12 17:14                     ` Gionatan Danti
2017-09-12 21:57                       ` Zdenek Kabelac
2017-09-13 17:41                         ` Xen
2017-09-13 19:17                           ` Zdenek Kabelac
2017-09-14  3:19                             ` Xen
2017-09-12 17:00                 ` Gionatan Danti
2017-09-12 23:25                   ` Brassow Jonathan
2017-09-13  8:15                     ` Gionatan Danti
2017-09-13  8:33                       ` Zdenek Kabelac
2017-09-13 18:43                     ` Xen
2017-09-13 19:35                       ` Zdenek Kabelac
2017-09-14  5:59                         ` Xen
2017-09-14 19:05                           ` Zdenek Kabelac
2017-09-15  2:06                             ` Brassow Jonathan
2017-09-15  6:02                               ` Gionatan Danti
2017-09-15  8:37                               ` Xen
2017-09-15  7:34                             ` Xen
2017-09-15  9:22                               ` Zdenek Kabelac
2017-09-16 22:33                                 ` Xen
2017-09-17  6:31                                   ` Xen
2017-09-17  7:10                                     ` Xen
2017-09-18 19:20                                       ` Gionatan Danti
2017-09-20 13:05                                         ` Xen
2017-09-21  9:49                                           ` Zdenek Kabelac
2017-09-21 10:22                                             ` Xen
2017-09-21 13:02                                               ` Zdenek Kabelac
2017-09-21 14:34                                                 ` [linux-lvm] Clarification (and limitation) of the kernel feature I proposed Xen
2017-09-21 14:49                                                 ` [linux-lvm] Reserve space for specific thin logical volumes Xen
2017-09-21 20:32                                                   ` Zdenek Kabelac
2017-09-18  8:56                                   ` Zdenek Kabelac
2017-09-11 14:00             ` Xen
2017-09-11 17:34               ` Zdenek Kabelac
2017-09-11 15:31             ` Eric Ren
2017-09-11 15:52               ` Zdenek Kabelac
2017-09-11 21:35                 ` Eric Ren
2017-09-11 17:41               ` David Teigland
2017-09-11 21:08                 ` Eric Ren
2017-09-11 16:55             ` David Teigland
2017-09-11 17:43               ` Zdenek Kabelac
2017-09-11 21:59     ` Gionatan Danti
2017-09-12 11:01       ` Zdenek Kabelac
2017-09-12 11:34         ` Gionatan Danti
2017-09-12 12:03           ` Zdenek Kabelac
2017-09-12 12:47             ` Xen
2017-09-12 13:51               ` pattonme
2017-09-12 14:57               ` Zdenek Kabelac
2017-09-12 16:49                 ` Xen
2017-09-12 16:57             ` Gionatan Danti
     [not found] <832949972.1610294.1505170613541.ref@mail.yahoo.com>
2017-09-11 22:56 ` matthew patton
2017-09-12  5:28   ` Gionatan Danti
     [not found] <1806055156.426333.1505228581063.ref@mail.yahoo.com>
2017-09-12 15:03 ` matthew patton
2017-09-12 17:09   ` Gionatan Danti
2017-09-12 22:41     ` Zdenek Kabelac
2017-09-12 22:55       ` Gionatan Danti
2017-09-12 23:11         ` Zdenek Kabelac
     [not found] <418002318.647314.1505245480415.ref@mail.yahoo.com>
     [not found] ` <418002318.647314.1505245480415@mail.yahoo.com>
2017-09-12 21:36   ` Gionatan Danti
2017-09-12 22:16     ` Zdenek Kabelac
2017-09-12 22:41       ` Gionatan Danti
2017-09-12 23:02         ` Zdenek Kabelac
2017-09-12 23:15           ` Gionatan Danti
2017-09-12 23:31             ` Zdenek Kabelac
2017-09-13  8:21               ` Gionatan Danti
     [not found] <1575245610.821680.1505258554456.ref@mail.yahoo.com>
2017-09-12 23:22 ` matthew patton
2017-09-13  7:53   ` Gionatan Danti
2017-09-13  8:15     ` Zdenek Kabelac
2017-09-13  8:28       ` Gionatan Danti
2017-09-13  8:44         ` Zdenek Kabelac
2017-09-13 10:46           ` Gionatan Danti
     [not found] <691633892.829188.1505258696384.ref@mail.yahoo.com>
2017-09-12 23:24 ` matthew patton
     [not found] <57374453.843393.1505261050977.ref@mail.yahoo.com>
2017-09-13  0:04 ` matthew patton
2017-09-13  7:10   ` Zdenek Kabelac
     [not found] <1771452279.913055.1505269434212.ref@mail.yahoo.com>
2017-09-13  2:23 ` matthew patton
2017-09-13  7:25   ` Zdenek Kabelac
     [not found] <498090067.1559855.1505338608244.ref@mail.yahoo.com>
2017-09-13 21:36 ` matthew patton
     [not found] <914479528.2618507.1505463313888.ref@mail.yahoo.com>
2017-09-15  8:15 ` matthew patton
2017-09-15 10:01   ` Zdenek Kabelac
2017-09-15 18:35   ` Xen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.