* Re: [linux-lvm] thin handling of available space
       [not found] <1684768750.3193600.1461851163510.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-04-28 13:46 ` matthew patton
  0 siblings, 0 replies; 29+ messages in thread
From: matthew patton @ 2016-04-28 13:46 UTC (permalink / raw)
  To: LVM general discussion and development

> > The real question you should be asking is if it increases the monitoring
> > aspect (enhances it) if thin pool data is seen through the lens of the
> > filesystems as well.

Then you want an integrated block+fs implementation. See BTRFS and ZFS. WAFL and friends.

> kernel for communication from lower fs layers to higher layers -

Correct. Because doing so violates the fundamental precepts of OS design: higher layers trust lower layers. Thin pools are outright lying about the real world to anything that uses their services - that is their purpose. The FS doesn't give a damn that the block layer is lying to it; it can and does assume, rightly so, that what the block layer says it has, it indeed does have. The onus of keeping the block layer ahead of the FS falls on a third party - the system admin. The system admin decided it was a bright idea to use thin pools in the first place, so he necessarily signed up to be liable for the hazards and risks that choice entails. It is not the job of the FS to bail his ass out.

A responsible sysadmin who chose to use thin pools might configure the initial FS size to be some modest size well within the constraints of the actual block store, and then, as the FS hit say 85% utilization, run a script that investigated the state of the block layer and used resize2fs and friends to grow the FS, letting the thin pool likewise grow to fit as IO gets issued. But at some point, when the competing demands of other FSes on the thin pool were set to breach actual block availability, the script would refuse to grow the FS, and userland would get signaled by the FS layer that it's out of space when it hit 100% utilization.
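
Roughly, such a grow script could look like the following untested sketch (Python 3; the VG/LV/pool names, mountpoint, thresholds and growth step are all made up for illustration, and it assumes an ext2/3/4 filesystem since it calls resize2fs):

#!/usr/bin/env python3
"""Rough sketch of the 'responsible sysadmin' grow script described above.
Assumes an ext4 FS on a thin LV vg0/data backed by pool vg0/pool; every
name, threshold and size here is illustrative, not a tested tool."""
import os
import subprocess

VG, LV, POOL = "vg0", "data", "pool"
MOUNTPOINT = "/srv/data"
FS_GROW_TRIGGER = 0.85     # grow when the FS is 85% full
POOL_SAFETY_LIMIT = 0.80   # refuse to grow if the pool is already 80% allocated
GROW_STEP = "10G"

def fs_utilization(path):
    st = os.statvfs(path)
    return 1.0 - (st.f_bavail / st.f_blocks)

def pool_data_percent(vg, pool):
    out = subprocess.run(
        ["lvs", "--noheadings", "-o", "data_percent", f"{vg}/{pool}"],
        check=True, capture_output=True, text=True).stdout
    return float(out.strip()) / 100.0

if fs_utilization(MOUNTPOINT) >= FS_GROW_TRIGGER:
    if pool_data_percent(VG, POOL) < POOL_SAFETY_LIMIT:
        # Grow the thin LV first, then the filesystem sitting on it.
        subprocess.run(["lvextend", "-L", f"+{GROW_STEP}", f"{VG}/{LV}"], check=True)
        subprocess.run(["resize2fs", f"/dev/{VG}/{LV}"], check=True)
    else:
        # Refuse to grow; the FS will report ENOSPC to userland at 100% as usual.
        print("thin pool too full, not growing", flush=True)

Run it from cron; the interesting part is only the order of checks - look at the pool before growing the FS, and do nothing once the pool itself is the constraint.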

Another way (haven't tested) to 'signal' the FS as to the true state of the underlying storage is to have a sparse file that gets shrunk over time.
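
For what it's worth, one untested way to read that idea is the sketch below (file path and pool name invented for the example; whether preallocated-but-unwritten blocks really stay unprovisioned in the thin pool depends on the filesystem, so verify that on your own stack before trusting it):

#!/usr/bin/env python3
"""One possible reading of the placeholder-file idea above (untested in the
thread, untested here): keep a reservation file sized to the space the thin
pool cannot really back, and shrink it as real capacity is added."""
import os
import subprocess

BALLAST = "/srv/data/.capacity-reservation"
VG, POOL = "vg0", "pool"

def pool_free_bytes(vg, pool):
    out = subprocess.run(
        ["lvs", "--noheadings", "--nosuffix", "--units", "b",
         "-o", "lv_size,data_percent", f"{vg}/{pool}"],
        check=True, capture_output=True, text=True).stdout.split()
    size, used_pct = float(out[0]), float(out[1])
    return int(size * (1.0 - used_pct / 100.0))

def fs_free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def resize_ballast():
    current = os.path.getsize(BALLAST) if os.path.exists(BALLAST) else 0
    # Free space the FS would advertise without the reservation in place:
    advertised = fs_free_bytes(os.path.dirname(BALLAST)) + current
    target = max(advertised - pool_free_bytes(VG, POOL), 0)
    with open(BALLAST, "a+b") as f:
        if target > current:
            # fallocate() reserves FS blocks without writing data; on ext4 this
            # shrinks df's free space while usually not provisioning thin pool
            # extents -- an assumption, not something this thread establishes.
            os.posix_fallocate(f.fileno(), 0, target)
        elif target < current:
            os.ftruncate(f.fileno(), target)

resize_ballast()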

But either way, if you have a sudden burst of I/O from competing interests in the thin-pool, what appeared to be a safe growth allocation at one instant may no longer be true by the time the actual writes try to get fulfilled.

Mindless use of thin-pools is akin to crossing a heavily mined beach. Bring a long stick and say your prayers, because you're likely going to lose a limb.


* Re: [linux-lvm] thin handling of available space
  2016-05-04  1:25   ` Mark Mielke
@ 2016-05-04 18:16     ` Xen
  0 siblings, 0 replies; 29+ messages in thread
From: Xen @ 2016-05-04 18:16 UTC (permalink / raw)
  To: LVM general discussion and development

Mark Mielke wrote on 04-05-2016 3:25:

> Thanks for entertaining this discussion, Matthew and Zdenek. I realize
> this is an open source project, with passionate and smart people,
> whose time is precious. I don't feel I have the capability of really
> contributing code changes at this time, and I'm satisfied that the
> ideas are being considered even if they ultimately don't get adopted.
> Even the mandatory warning about snapshots exceeding the volume group
> size is something I can continue to deal with using scripting and
> filtering. I mostly want to make sure that my perspective is known and
> understood.

You know, you really don't need to be this apologetic even if I mess up 
my own replies ;-).

I think you have a right and a reason to say what you've said, and 
that's it.


* Re: [linux-lvm] thin handling of available space
       [not found] <799090122.6079306.1462373733693.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-05-04 14:55 ` matthew patton
  0 siblings, 0 replies; 29+ messages in thread
From: matthew patton @ 2016-05-04 14:55 UTC (permalink / raw)
  To: LVM general discussion and development

On Tue, 5/3/16, Mark Mielke <mark.mielke@gmail.com> wrote:

> I get a bit lost here in the push towards BTRFS and ZFS for people with these expectations as
> I see BTRFS and ZFS as having a similar problem. They can both still fill up.

Well of course everything fills up eventually. BTRFS and ZFS are integrated systems where the FS can see into the block layer and "do" block layer activities vs the clear demarcation between XFS/EXT and LVM/MD.

If you write too much to an FS on an overcommitted thin volume today, you get serious data loss. Oh sure, the metadata might have landed, but the file contents sure didn't. Somebody (you?) mentioned how you were seemingly able to write 4x90GB to a 300GB block device and the FS still fsck'd successfully. This doesn't happen in BTRFS/ZFS and friends. At 300.001GB you would have gotten a write error and the write operation would not have succeeded.


* Re: [linux-lvm] thin handling of available space
  2016-05-03 12:00 ` matthew patton
  2016-05-03 14:38   ` Xen
@ 2016-05-04  1:25   ` Mark Mielke
  2016-05-04 18:16     ` Xen
  1 sibling, 1 reply; 29+ messages in thread
From: Mark Mielke @ 2016-05-04  1:25 UTC (permalink / raw)
  To: matthew patton, LVM general discussion and development


On Tue, May 3, 2016 at 8:00 AM, matthew patton <pattonme@yahoo.com> wrote:

> > written as required. If the file system has particular areas
> > of importance that need to be writable to prevent file
> > system failure, perhaps the file system should have a way of
> > communicating this to the volume layer. The naive approach
> > here might be to preallocate these critical blocks before
> >  proceeding with any updates to these blocks, such that the
> > failure situations can all be "safe" situations,
> > where ENOSPC can be returned without a danger of the file
> > system locking up or going read-only.
>
> why all of a sudden does each and every FS have to have this added code to
> second guess the block layer? The quickest solution is to mount the FS in
> sync mode. Go ahead and pay the performance piper. It's still not likely to
> be bullet proof but it's a sure step closer.
>

Not all of a sudden. From an "at work" perspective, LVM thinp as a technology
is relatively recent, and only recently being deployed in more places as we
migrate our systems from RHEL 5 to RHEL 6 to RHEL 7. I didn't consider
thinp an option before RHEL 7, and I didn't consider it stable even in RHEL
7 without significant testing on our part.

From an "at home" perspective, I have been using LVM thinp from the day it
was available in a Fedora release. The previous snapshot model was
unusable, and I wished upon a star that a better technology would arrive. I
tried BTRFS and while it did work - it was still marked as experimental, it
did not have the exact same behaviour as EXT4 or XFS from an applications
perspective, and I did encounter some early issues with subvolumes.
Frankly... I was happy to have LVM thinp, and glad that you LVM developers
provided it when you did. It is excellent technology from my perspective.
But, "at home", I was willing to accept some loose edge case behaviour. I
know when I use storage on my server at home, and if it fails, I can accept
the consequences for myself.

"At work", the situation is different. These are critical systems that I am
betting LVM on. As we begin to use it more broadly (after over a year of
success in hosting our JIRA + Confluence instances on local flash using LVM
thinp for much of the application data including PostgreSQL databases). I
am very comfortable with it from a "< 80% capacity" perspective. However,
every so often it passes 80%, and I have to raise the alarm, because I know
that there are edge cases that LVM / DM thinp + XFS don't handle quite so
well. It's never happened in production yet, but I've seen it happen many
times on designer desktops when they are using LVM, and they lock up their
system and require a system reboot to recover from.

I know there are smart people working on Linux, and smart people working on
LVM. Given the opportunity, and the perspective, I think the worst of these
cases are problems that deserve to be addressed, and probably ones that
people have been working on with or without my contributions to the subject.



> What you're saying is that when mounting a block device the layer needs to
> expose a "thin-mode" attribute (or the sysdmin sets such a flag via
> tune2fs). Something analogous to mke2fs can "detect" LVM raid mode geometry
> (does that actually work reliably?).
>
> Then there has to be code in every FS block de-stage path:
> IF thin {
>   tickle block layer to allocate the block (aka write zeros to it? - what
> about pre-existing data, is there a "fake write" BIO call that does
> everything but actually write data to a block but would otherwise trigger
> LVM thin's extent allocation logic?)
>    IF success, destage dirty block to block layer ELSE
>    inform userland of ENOSPC
> }
>
> In a fully journal'd FS (metadata AND data) the journal could be 'pinned'
> and likewise the main metadata areas if for no other reason they are zero'd
> at onset and or constantly being written to. Once written to, LVM thin
> isn't going to go back and yank away an allocated extent.
>

Yes. This is exactly the type of solution I was thinking of, including
pinning the journal! You used the correct terminology. I can read the terms
but not write them. :-)

You also managed to summarize it in only a few lines of text. As concepts
go, I think that makes it not-too-complex.

But, the devil is often in the details, and you are right that this is a
per-file system cost.

Balancing this, however, I am perhaps presuming that *all* systems will
eventually be thin volume systems, and that correct behaviour and highly
available behaviour will eventually require that *all* systems invest in
technology such as this. My view of the future is that fixed-size thick
partitions are very often a solution that is compromised from the start.
Most systems of significance grow over time, and the pressure to reduce
cost is real. I think we are taking baby steps to start, but that the
systems of the future will be thin volume systems. I see this as a problem
that needs to be understood and solved, except in the most limited of use
cases. This is my opinion, which I don't expect anybody to share.



> This at least should maintain FS integrity albeit you may end up in a
> situation where the journal can never get properly de-staged, so you're
> stuck on any further writes and need to force RO.
>

Interesting to consider. I don't see this as necessarily a problem - or
that it necessitates "RO" as a persistent state. For example, it would be
most practical if sufficient room was reserved to allow for content to be
removed, allowing for the file system to become unwedged and become "RW"
again. Perhaps there is always an edge case that would necessitate a
persistent "RO" state that requires the volume be extended to recover from,
but I think the edge case could be refined to something that will tend to
never happen?



> > just want a sanely behaving LVM + XFS...)
> IMO if the system admin made a conscious decision to use thin AND
> overprovision (thin by itself is not dangerous), it's up to HIM to actively
> manage his block layer. Even on million dollar SANs the expectation is that
> the engineer will do his job and not drop the mic and walk away. Maybe the
> "easiest" implementation would be a MD layer job that the admin can tailor
> to fail all allocation requests once extent count drops below a number and
> thus forcing all FS mounted on the thinpool to go into RO mode.
>

Another interesting idea. I like the idea of automatically shutting down
our applications or PostgreSQL database if the thin pool reaches an unsafe
allocation, such as 90% or 95%. This would ensure the integrity of the
data, at the expense of an outage. This is something we could implement
today. Thanks.
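
Something like the following rough sketch would do it (the pool name, the
threshold and the systemd unit name are placeholders for whatever the real
deployment uses, not a recommendation for any particular stack):

#!/usr/bin/env python3
"""Sketch of the 'shut things down before the pool fills' idea above: poll
the thin pool's allocation and stop the database once it crosses a
threshold."""
import subprocess
import sys

VG, POOL = "vg0", "pool"
THRESHOLD = 90.0           # percent of pool data space considered unsafe
SERVICES = ["postgresql"]  # hypothetical systemd unit name(s) to stop

def pool_data_percent(vg, pool):
    out = subprocess.run(
        ["lvs", "--noheadings", "-o", "data_percent", f"{vg}/{pool}"],
        check=True, capture_output=True, text=True).stdout
    return float(out.strip())

used = pool_data_percent(VG, POOL)
if used >= THRESHOLD:
    for unit in SERVICES:
        # Stop the service while writes can still be completed safely.
        subprocess.run(["systemctl", "stop", unit], check=True)
    sys.exit(f"thin pool {VG}/{POOL} at {used:.1f}%, services stopped")

Run from cron or a systemd timer, it trades an application outage for data
integrity, which is exactly the trade described above.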



> But in any event it won't prevent irate users from demanding why the space
> they appear to have isn't actually there.


Users will always be irate. :-) I mostly don't consider that as a real
factor in my technical decisions... :-)

Thanks for entertaining this discussion, Matthew and Zdenek. I realize this
is an open source project, with passionate and smart people, whose time is
precious. I don't feel I have the capability of really contributing code
changes at this time, and I'm satisfied that the ideas are being considered
even if they ultimately don't get adopted. Even the mandatory warning about
snapshots exceeding the volume group size is something I can continue to
deal with using scripting and filtering. I mostly want to make sure that my
perspective is known and understood.


-- 
Mark Mielke <mark.mielke@gmail.com>



* Re: [linux-lvm] thin handling of available space
  2016-05-03 13:01 ` matthew patton
  2016-05-03 15:47   ` Xen
@ 2016-05-04  0:56   ` Mark Mielke
  1 sibling, 0 replies; 29+ messages in thread
From: Mark Mielke @ 2016-05-04  0:56 UTC (permalink / raw)
  To: matthew patton, LVM general discussion and development


On Tue, May 3, 2016 at 9:01 AM, matthew patton <pattonme@yahoo.com> wrote:

> On Mon, 5/2/16, Mark Mielke <mark.mielke@gmail.com> wrote:
> <quote>
>  very small use case in reality. I think large service
>  providers would use Ceph or EMC or NetApp, or some such
>  technology to provision large amounts of storage per
>  customer, and LVM would be used more at the level of a
>  single customer, or a single machine.
> </quote>
>
> Ceph?!? yeah I don't think so.
>

I don't use Ceph myself. I only listed it as it may be more familiar to
others, and because I was responding to a Red Hat engineer. We use NetApp
and EMC for the most part.


> If you thin-provision an EMC/Netapp volume and the block device runs out
> of blocks (aka Raid Group is full) all volumes on it will drop OFFLINE.
> They don't even go RO. Poof, they disappear. Why? Because there is no
> guarantee that every NFS client, every iSCSI client, every FC client is
> going to do the right thing. The only reliable means of telling everyone
> "shit just broke" is for the asset to disappear.
>

I think you are correct. Based upon experience, I don't recall this ever
happening, but upon reflection, it may just be that our IT team always
caught the situation before it became too bad, and either extended the
storage, or asked permission to delete snapshots.



> All in-flight writes to the volume that the array ACK'd are still good
> even if they haven't been de-staged to the intended device thanks to NVRAM
> and the array's journal device.
>

Right. A good feature. An outage occurs, but the data that was properly
written stays written.


<quote>
>  In these cases, I
>  would expect that LVM thin volumes should not be used across
>  multiple customers without understanding the exact type of
>  churn expected, to understand what the maximum allocation
>  that would be required.
> </quote>
>
> sure, but that spells responsible sysadmin. Xen's post implied he didn't
> want to be bothered to manage his block layer  that magically the FS' job
> was to work closely with the block layer to suss out when it was safe to
> keep accepting writes. There's an answer to "works closely with block
> layer" - it's spelled BTRFS and ZFS.
>

I get a bit lost here in the push towards BTRFS and ZFS for people with
these expectations as I see BTRFS and ZFS as having a similar problem. They
can both still fill up. They just might get closer to 100% utilization
before they start to fail.

My use case isn't about reaching closer to 100% utilization. For example,
when I first proposed our LVM thinp model for dealing with host-side
snapshots, there were people in my team that felt that "fstrim" should be
run very frequently (even every 15 minutes!), so as to make maximum use of
the available free space across multiple volumes and reduce churn captured
in snapshots. I think anybody with this perspective really should be
looking at BTRFS or ZFS. Myself, I believe fstrim should run once a week or
less, and not really to save space, but more to hint to the flash device
which blocks are definitely not in use over time, to make the best use of
the flash storage over time. If we start to pass 80%, I raise the alarm
that we need to consider increasing the local storage, or moving more
content out of the thin volumes. Usually we find out that more-than-normal
churn occurred, and we just need to prune a few snapshots to drop below 50%
again. I still made them move the content that doesn't need to be
snapshotted out of the thin volume and onto a stand-alone LVM thick volume,
so as to entirely eliminate this churn from being trapped in snapshots and
accumulating.


> LVM has no obligation to protect careless sysadmins doing dangerous things
> from themselves. There is nothing wrong with using THIN every which way you
> want just as long as you understand and handle the eventuality of extent
> exhaustion. Even thin snaps go invalid if it needs to track a change and
> can't allocate space for the 'copy'.
>

Right.


> Amazon would make sure to have enough storage to meet my requirement if I
> need them.
>
> Yes, because Amazon is a RESPONSIBLE sysadmin and has put in place tools
> to manage the fact they are thin-provisoning and to make damn sure they can
> cash the checks they are writing.
>

Right.



> > the nature of the block device, such as "how much space
> > do you *really* have left?"
>
> So you're going to write and then backport "second guess the block layer"
> code to all filesystems in common use and god knows how many versions back?
> Of course not. Just try to get on the EXT developer mailing list and ask
> them to write "block layer second-guessing code (aka branch on device
> flag=thin)" because THINP will cause problems for the FS when it runs out
> of extents. To which the obvious and correct response will be "Don't use
> THINP if you're not prepared to handle it's pre-requisites."
>

Bad things happen. Sometimes they happen very quickly. I don't intend to
dare fate, but if fate comes knocking, I prefer to be prepared. For
example, we had two monitoring systems in place for one particularly
critical piece of storage, where the application is particularly poor at
dealing with "out of space". No thin volumes in use here. Thick volumes all
the way. The system on the storage appliance stopped sending notifications
a few weeks prior as a result of some mistake during a reconfiguration or
upgrade. The separate monitoring system, using entirely different software
and configuration on a different host, also failed for a different reason
that I no longer recall. The volume became full, and the application data
was corrupted in a bad way that required recovery. My immediate reaction,
after addressing the corruption as best we could, was to demand three
monitoring systems instead of two. :-)




> > you and the other people. You think the block storage should
> > be as transparent as possible, as if the storage was not
> > thin. Others, including me, think that this theory is
> > impractical
> Then by all means go ahead and retrofit all known filesystems with the
> extra logic. ALL of the filesystems were written with the understanding
> that the block layer is telling the truth and that any "white lie" was
> benign in so much that it would be made good and thus could be assumed to
> be "truth" for practical purpose.


I think this relates more closely to your other response, that I will
respond to separately...


-- 
Mark Mielke <mark.mielke@gmail.com>



* Re: [linux-lvm] thin handling of available space
@ 2016-05-03 18:19 Xen
  0 siblings, 0 replies; 29+ messages in thread
From: Xen @ 2016-05-03 18:19 UTC (permalink / raw)
  To: LVM general discussion and development

Zdenek Kabelac wrote on 03-05-2016 17:45:

> It's not  'continued' suggestion.
> 
> It's just the example of solution where   'filesystem & block layer'
> are tied together.  Every solution has some advantages and
> disadvantages.

So what if more systems were tied together in that way? What would be 
the result?

Tying together does not have to do away with layers.

It is not either/or, it is both/and.

You can have separate layers and you can have integration.

In practice all it would require is for the LVM, ext and XFS people to 
agree.

You could develop extensions to the existing protocols that are only 
used if both parties understand it.

Then pretty much btrfs has no raison d'être anymore. You would have an 
integrated system but people can retain their own identities as much as 
they want.

From what you say, LVM+ext4/XFS is already a partner system anyway.

It is CLEAR LVM+BTRFS or LVM+ZFS is NOT a popular system.

You can and you could but it does not synergize. OpenSUSE uses btrfs by 
default and I guess they use LVM just as well. For LVM you want a 
simpler filesystem that does its own work.

(At the same time I am not so happy with the RAID capability of LVM, nor 
do I care much at this point).

LVM raid seems to me like the third solution after firmware raid, regular
dmraid and... that.

I prefer to use LVM on top of raid really. But maybe that's not very 
helpful.


> So far I'm convinced layered design gives user more freedom - for the 
> price
> of bigger space usage.

Well let's stop directing people to btrfs then.

Linux people have a tendency and habit to send people from pillar to 
post.

You know what that means.

It means 50% of answers you get are redirects.

They think it's efficient to spend their time redirecting you or wasting 
your time in other ways, rather than using the same time and energy 
answering your question.

If the social Linux system was a filesystem, people would run benchmarks 
and complain that its organisation is that of a lunatic.

Where 50% of read requests get directed to another sector, of which 50% 
again get redirected, and all for no purpose really.

Write requests get 90% deflected. The average number of write requests 
before you hit your target is about ... it converges exactly to 10.

If I had been better at math I would have known that :p.

You say:

"Please don't compare software to real life".

No, let's compare the social world to technology. We have very bad 
technology if you look at it like that. Which in turn doesn't make the 
"real" technology much better.



SUM over i = 1..inf of ( i * p * (1-p)^(i-1) ) = 1/p,

with p the chance of success at each attempt.

The sum of that formula, with i running from 1 to infinity, is 1/p.

With a hit chance of only 10% per attempt (90% deflected), the average
number of attempts before success is 1/0.1 = 10.

I'm not very brilliant today.


* Re: [linux-lvm] thin handling of available space
  2016-05-03 13:01 ` matthew patton
@ 2016-05-03 15:47   ` Xen
  2016-05-04  0:56   ` Mark Mielke
  1 sibling, 0 replies; 29+ messages in thread
From: Xen @ 2016-05-03 15:47 UTC (permalink / raw)
  To: LVM general discussion and development

matthew patton wrote on 03-05-2016 15:01:

> Ceph?!? yeah I don't think so.

Mark's argument was nothing about comparing feature sets or something at 
this point. So I don't know what you are responding to. You respond like 
a bitten bee.

Read again. Mark Mielke described actual present-day positions. He 
described what he thinks is how LVM is positioning itself in conjunction 
with and with regards to other solutions in industry. He described that 
to his mind the bigger remote storage solutions do not or would not 
easily or readily start using LVM for those purposes, while the smaller 
scale or more localized systems would.

He described a layering solution, that you seem to be allergic to. He 
described a modularized system where thin is being used both at the 
remote backend (using a different technology) and at the local end 
(using LVM) for different purposes but achieving much of the same 
results.

He described how he considered the availability of the remote pool a
responsibility of that remote supplier (and paying good money for it),
while having different use cases for LVM thin himself or themselves.

And I did think he made a very good case for this. I absolutely believe 
his use case is the most dominant and important one for LVM. LVM is for 
local systems.

In this case it is a local system running storage on a remote backend. 
Yet the local system has different requirements and uses LVM thin for a 
different purpose.

And this purpose falls along the lines of having cheap and freely 
available snapshots.

And he still feels and believes, apparently, that using the LVM admin 
tools for ensuring the stability of his systems might not be the most 
attractive and functional thing to do.

You may not agree with that but it is what he believes and feels. It is 
a real life data point, if you care about that.

Sometimes people's opinions actually simply just inform you of the 
world. It is information. It is not something to fight or disagree with, 
it is something to take note of.

The better you are able to respond to these data points, the better you 
are aware of the system you are dealing with. That could be real people 
paying or not paying you money.

However if you are going to fight every opinion that disagrees with you, 
you will never get to the point of actually realizing that they are just 
opinions and they are a wealth of information if you'd make use of it.

And that is not a devious thing to do if you're thinking that. It is 
being aware. Nothing more, nothing less.

And we are talking about awareness here. Not surprising then that the 
people most vehemently opposing this also seem to be the people least 
aware of the fact that real people with real use cases might find the
current situation impractical.

Mr. Zdenek can say all he wants that the current situation is very 
practical.

If that is not a data point but an opinion (not of someone experiencing 
it, but someone who wants certain people to experience certain things) 
then we must listen to actual data points and not what he wants.

Mr. Zdenek (I haven't responded to him here now) also responds like a 
bitten bee to simple allusions that Red Hat might be thinking this or 
that.

Not just stung by a bee. A bee getting stung ;-).

I mean come on people. You have nothing to lose. Either it is a good 
idea or it isn't. If it gets support, maybe someone will implement it 
and deliver proof of concept. But if you go about shooting it down the 
moment it rears its ugly (or beautiful) head you also ensure that that 
developer time is not going to be spent on it, even if it were an asset
to you.

Someone discussing a need is not necessarily someone who, in the end, will
do nothing about it himself.

You are trying to avoid work but in doing so you avoid work being done 
for you as well.

It's give and take; it's plus-plus.

Don't kill other people's ideas and maybe they start doing work for you 
too.

Oh yeah. Sorry if I'm being judgmental or belligerent (or pedantic):

The great irony and tragedy of the Linux world is this:




Someone comes with a great idea that he/she believes in and wants to 
work on.

They shoot it down.

Next they complain why there are so very few volunteers.



They can ban someone on a mailing list one instant and out loud wonder 
how they can attract more interest to their system, the next.




Not unrelated.




> sure, but that spells responsible sysadmin. Xen's post implied he
> didn't want to be bothered to manage his block layer  that magically
> the FS' job was to work closely with the block layer to suss out when
> it was safe to keep accepting writes. There's an answer to "works
> closely with block layer" - it's spelled BTRFS and ZFS.

It is not my block layer. I'm not the fucking system admin.

I can only talk to the FS. Or that might very well be the case for my 
purposes here.

It is pretty amazing that any attempt to separate responsibilities in 
actuality is met with a rebuttal that insists one use a solution that 
mingles everything.

In your ideal world then, everyone is forced to use BTRFS/ZFS because at 
least these take the worries away from the software/application 
designer.

And you ensure a beautiful world without LVM because it has no purpose.

As a software developer I cannot depend on your magical solution and
assertion that every admin out there is going to be this amazing person 
that never makes a mistake.



> Responsible usage has nothing to do with single vs multiple customers.
> Though Xen broached the 'hosting' example and in the cut-rate hosting
> business over-provisioning is rampant. It's not a problem unless the
> syadmin drops the ball.

What if I want him to be able to drop the ball and still survive?

What about designing systems that are actually failsafe and resilient?

What about resilience?

What about goodness?

What about quality?

What about good stuff?

Why do you feed your admins bad stuff just so that they can shine and 
consider themselves important?





> So you're going to write and then backport "second guess the block
> layer" code to all filesystems in common use and god knows how many
> versions back? Of course not. Just try to get on the EXT developer
> mailing list and ask them to write "block layer second-guessing code
> (aka branch on device flag=thin)" because THINP will cause problems
> for the FS when it runs out of extents. To which the obvious and
> correct response will be "Don't use THINP if you're not prepared to
> handle it's pre-requisites."

So you are basically suggesting a solution that you know will fail, but 
still you recommend it.

That spells out "I don't know how to achieve my goals" like no other 
thing.

But you still think people should follow your recommendations.

What you say is completely anathema to how the open source world works.

You do not ask people to do your work for you.

Why do you even insist on recommending that? And then when you (in your
imagination here) do ask those people to do it for you, they refuse. Small
wonder.

Still you consider that a good way to approach things. To depend on 
someone else to do your work for you.

Really.

"Of course not. Just try to get on the EXT developer mailing list and 
ask them to..."

Yes I am ridiculing you.

You were sincere in saying those words. You ridicule yourself.

Of course you would start designing patches and creating a workable 
solution with yourself as the main leader or catalyst of that project. 
There is no other way to do things in life. You should know that.



> Then by all means go ahead and retrofit all known filesystems with the
> extra logic. ALL of the filesystems were written with the
> understanding that the block layer is telling the truth and that any
> "white lie" was benign in so much that it would be made good and thus
> could be assumed to be "truth" for practical purpose.

Maybe we should also retrofit all unknown filesystems and those that 
might be designed on different planets. Yeah, that would be a good way 
to approach things.

I really want to follow your recommendations here. If I do, I will have 
good chances of achieving success.


* Re: [linux-lvm] thin handling of available space
  2016-05-03 13:15             ` Gionatan Danti
@ 2016-05-03 15:45               ` Zdenek Kabelac
  0 siblings, 0 replies; 29+ messages in thread
From: Zdenek Kabelac @ 2016-05-03 15:45 UTC (permalink / raw)
  To: LVM general discussion and development

On 3.5.2016 15:15, Gionatan Danti wrote:
> On 03/05/2016 13:42, Zdenek Kabelac wrote:
>>
>> What's wrong with  'lvs'?
>> This will give you the available space in thin-pool.
>>
>
> Oh, absolutely nothing wrong with lvs. I used "lsblk" only as an example of
> the block device/layer exposing some (lack of) features to upper layer.
>
> One note about the continued "suggestion" to use BTRFS. While for relatively

It's not a 'continued' suggestion.

It's just an example of a solution where the filesystem and block layer are
tied together. Every solution has some advantages and disadvantages.

> simple use case it can be ok, for more demanding (rewrite-heavy) scenarios
> (eg: hypervisor, database, ecc) it performs *really* bad, even when "nocow" is
> enabled.

So far I'm convinced the layered design gives the user more freedom - at the
price of bigger space usage.


>
> Anyway, ThinLVM + XFS is an extremely good combo in my opinion.
>

Yes, though ext4 is quite good as well...

Zdenek


* Re: [linux-lvm] thin handling of available space
  2016-05-03 12:00 ` matthew patton
@ 2016-05-03 14:38   ` Xen
  2016-05-04  1:25   ` Mark Mielke
  1 sibling, 0 replies; 29+ messages in thread
From: Xen @ 2016-05-03 14:38 UTC (permalink / raw)
  To: LVM general discussion and development

Just want to respond to this just to make things clear.

matthew patton wrote on 03-05-2016 14:00:

> why all of a sudden does each and every FS have to have this added
> code to second guess the block layer? The quickest solution is to
> mount the FS in sync mode. Go ahead and pay the performance piper.
> It's still not likely to be bullet proof but it's a sure step closer.

Why would anyone do what you yourself don't want to do? Don't suggest
solutions you don't even want yourself. That goes for all of you (Zdenek
mostly).

And it is not second-guessing. Second-guessing is what the FS is doing
currently. If you have actual information from the block layer, you
don't NEED to second-guess.

Isn't that obvious?

> What you're saying is that when mounting a block device the layer
> needs to expose a "thin-mode" attribute (or the sysdmin sets such a
> flag via tune2fs). Something analogous to mke2fs can "detect" LVM raid
> mode geometry (does that actually work reliably?).

Not necessarily. It could be transparent if these were actual available 
features as part of a feature set. The features would individually be 
able to be turned on and off, not necessarily calling it "thin".

> Then there has to be code in every FS block de-stage path:
> IF thin {
>   tickle block layer to allocate the block (aka write zeros to it? -
> what about pre-existing data, is there a "fake write" BIO call that
> does everything but actually write data to a block but would otherwise
> trigger LVM thin's extent allocation logic?)
>    IF success, destage dirty block to block layer ELSE
>    inform userland of ENOSPC
> }

What Mark suggested is not actually so bad. Preallocating means you have
to communicate in some way to the user that space is going to run out.
My suggestion would have been, and still is, to simply do this by having
the filesystem update the amount of free space it reports.
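
As a crude user-space approximation of that, the "effective" free space
could be reported as the smaller of what the filesystem advertises and what
the pool can still back. A rough sketch (the mountpoint and the VG/pool
names below are only placeholders):

#!/usr/bin/env python3
"""Report 'effective' free space: min(FS free space, thin pool free space)."""
import os
import subprocess

def fs_free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def pool_free_bytes(vg, pool):
    out = subprocess.run(
        ["lvs", "--noheadings", "--nosuffix", "--units", "b",
         "-o", "lv_size,data_percent", f"{vg}/{pool}"],
        check=True, capture_output=True, text=True).stdout.split()
    size, used_pct = float(out[0]), float(out[1])
    return int(size * (1.0 - used_pct / 100.0))

effective_free = min(fs_free_bytes("/srv/data"), pool_free_bytes("vg0", "pool"))
print(f"effective free space: {effective_free // (1024**2)} MiB")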

> This at least should maintain FS integrity albeit you may end up in a
> situation where the journal can never get properly de-staged, so
> you're stuck on any further writes and need to force RO.

I'm glad you think of solutions.


> IMO if the system admin made a conscious decision to use thin AND
> overprovision (thin by itself is not dangerous)

Again, that is just nonsense. There is not a person alive who wants to 
use thin for something that is not overprovisioning, whether it be 
snapshots or client sharing.

You are trying to get away with "hey, you chose it! now sucks if we 
don't actually listen to you! hahaha."

SUCKER!!!!.

No, the primary use case for thin is overprovisioning.

> , it's up to HIM to
> actively manage his block layer.


Block layer doesn't come into play with it.

You are separating "main admin task" and "local admin task".

What I mean is that there are different roles. Even if they are the same 
person, they are different tasks.

Someone writing software, his task is to ensure his software keeps 
working given failure conditions.

This software writer, even if it is the same person, cannot be expected 
to at that point be thinking of LVM block allocation. These are 
different things.

You communicate with the layers you communicate with. You don't go 
around that.

When you write a system that is supposed to be portable, for instance, 
you do not start depending on other features, tools or layers that are 
out of reach the moment your system or software is deployed somewhere 
else.

Filesystem communication is available to all applications. So any
application designed to be installed anywhere is going to want to depend on
filesystem tools, not block layer tools.

You people apparently don't understand layering very well OR you would 
never recommend avoiding an intermediate layer (the filesystem) to go 
directly to the lower level (the block layer) for ITS admin tools.

I mean, are you insane? You (Zdenek mostly) are so much about not mixing
layers, but then it is alright to go around them?

A software tool that is meant to be redeployable should be able to depend
on a minimal set of existing features in the direct layer it is interfacing
with, while still using whatever extras are available when circumstances
allow it without harming its redeployability. Such a tool would never choose
to acquire and use the more remote and more uncertain set (such as LVM) when
it could also be using directly available measures (such as free disk space,
as a crude measure) that are available on ANY system, provided that, yes
indeed, there is some level of sanity to it.

If you ARE deployed on thin and the filesystem cannot know about actual 
space then you are left in the dark, you are left blind, and there is 
nothing you can do as a systems programmer.

> Even on million dollar SANs the
> expectation is that the engineer will do his job and not drop the mic
> and walk away.

You constantly focus on the admin.

With all of this hotshot and idealist behaviour about layers you are 
espousing, you actually advocate going around them completely and using 
whatever deepest-layer or most-impact solution that is available (LVM) 
in order to troubleshoot issues that should be handled by interfacing 
with the actual layer you always have access to.

It is not just about admins. You make this about admins as if they are 
solely responsible for the entire system.

> Maybe the "easiest" implementation would be a MD layer job that the
> admin can tailor to fail all allocation requests once
> extent count drops below a number and thus forcing all FS mounted on
> the thinpool to go into RO mode.

A real software engineer doesn't go for the easiest solution or 
implementation. I am not approaching this from the perspective of an 
admin exclusively. I am also, and more importantly, a software
programmer who wants to use systems that are going to work regardless
of the peculiarities of the implementation or system I have to work on,
and I don't leave it to the admin of said system to do all my tasks.

As a programmer I cannot assume that the admin is going to be the perfect
human being you would like to believe in, because that's what you think you
are: that amazing admin who never fails to keep track of available disk
space.

But that's a moron position.

If I am to write my software, I cannot depend on bigger-scale or 
outer-level solutions to always be in place. I cannot offload my 
responsibilities to the admin.

You are insisting here that layers (administration layers and tasks) are 
mixed and completely destroyed, all in the sense of not doing that to 
the software itself?

Really?

Most importantly if I write any system that cannot depend on LVM being 
present, then NO THOSE TOOLS ARE NOT AVAILABLE TO ME.

"Why don't you just use LVM?" well fuck off.

I am not that admin. I write his system. I don't do his work.

Yet I still have the responsibility that MY component is going to work 
and not give HIM headaches. That's real life for you.

Even if in actuality I might be imprisoned with broken feet and arms. I
still care about this and I still do this work in a certain sense.

And yes I utterly care about modularity in software design. I understand 
layers much better than you do if you are able or even capable of
suggesting such solutions.

Communication between layers does not necessarily integrate the layers 
if those interfaces are well defined and allow for modular "changing" of 
the chosen solution.

I recognise full well that there is integration and that you do get a 
working together. But that is the entire purpose of it. To get the two 
things to work together more. But that is the whole gist of having 
interfaces and APIs in the first place.

It is for allowing stuff to work together to achieve a higher goal than 
they could achieve if they were just on their own.

While recognising where each responsibility lies.

BLOCK LAYER <----> BLOCK LAYER ADMIN
FILESYSTEM LAYER <----> FILESYSTEM LAYER ADMIN
APPLICATION LAYER <----> APPLICATION WRITER

I, the application writer, cannot be expected to deal with number one,
the block layer.

At the same time I need tools to do my work. I also cannot go to any 
random block layer admin my system might get deployed on (who's to say I 
will be there?) and beg him to spend an ample amount of time designing
his systems from scratch so that even if my software fails, it won't 
hurt anyone.

But without information on available space I might not be able to do 
anything.

Then what happens is that I have to design for this uncertainty.

Then what happens is that I (with capital IIIII) start allocating space 
in advance as a software developer making applications for systems that 
might, I don't know, run at banks or whatever. Just saying something.

Yes now this task is left to the software designer making the 
application.

Now I have to start allocating buffers to ensure graceful shutdown or 
termination, for instance.

I might for instance allocate a block file, and if writes to the 
filesystem start to fail or the filesystem becomes read-only, I might 
still be in trouble not being able to write to it ;-). So I might start 
thinking about kernel modules that I can redeploy with my system that 
ensure graceful shutdown or even continued operation. I might decide 
that files mounted as loopback are going to stay writable even if the 
filesystem they reside on is now readonly. I am going to ensure these 
are not sparse blocks and that the entire file is written to and grown 
in advance, so that my writes start to look like real block device 
writes. Then I'm just going to just patch the filesystem or the VFS to 
allow writes to these files even if it comes with a performance hit of 
additional checks.

And that hopefully not the entire volume gets frozen by LVM.

But that the kernel or security scripts just remount it ro.

That is then the best way solution for my needs in that circumstance. 
Just saying you know.

It's not all exclusively about admins working with LVM directly.


> But in any event it won't prevent irate users from demanding why the
> space they appear to have isn't actually there.

If that is your life I feel sorry for you.

I just do.


* Re: [linux-lvm] thin handling of available space
  2016-05-03 11:42           ` Zdenek Kabelac
@ 2016-05-03 13:15             ` Gionatan Danti
  2016-05-03 15:45               ` Zdenek Kabelac
  0 siblings, 1 reply; 29+ messages in thread
From: Gionatan Danti @ 2016-05-03 13:15 UTC (permalink / raw)
  To: LVM general discussion and development

On 03/05/2016 13:42, Zdenek Kabelac wrote:
>
> The danger with having 'disable' options like this is that many distros decide
> on the best defaults for their users themselves, but Ubuntu with their
> issue_discards=1 showed us we have to be more careful, as then it's not Ubuntu but
> lvm2 which is blamed for data loss.
>
> Options are evaluated...
>

Very true. "Sane defaults" is one of the reason why I (happily) use 
RHEL/CentOS as hypervisors and other critical tasks.

>
>
> What's wrong with  'lvs'?
> This will give you the available space in thin-pool.
>

Oh, absolutely nothing wrong with lvs. I used "lsblk" only as an example 
of the block device/layer exposing some (lack of) features to the upper layer.

One note about the continued "suggestion" to use BTRFS. While for 
relatively simple use cases it can be ok, for more demanding 
(rewrite-heavy) scenarios (e.g. hypervisors, databases, etc.) it performs 
*really* badly, even when "nocow" is enabled.

I had much better luck, performance-wise, with ZFS. Too bad ZoL is an 
out-of-tree component (albeit very easy to install and, in my 
experience, quite stable too).

Anyway, ThinLVM + XFS is an extremely good combo in my opinion.


-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: [linux-lvm] thin handling of available space
       [not found] <1614984310.1700582.1462280490763.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-05-03 13:01 ` matthew patton
  2016-05-03 15:47   ` Xen
  2016-05-04  0:56   ` Mark Mielke
  0 siblings, 2 replies; 29+ messages in thread
From: matthew patton @ 2016-05-03 13:01 UTC (permalink / raw)
  To: LVM general discussion and development

On Mon, 5/2/16, Mark Mielke <mark.mielke@gmail.com> wrote:

<quote>
 very small use case in reality. I think large service
 providers would use Ceph or EMC or NetApp, or some such
 technology to provision large amounts of storage per
 customer, and LVM would be used more at the level of a
 single customer, or a single machine.
</quote>

Ceph?!? yeah I don't think so.

If you thin-provision an EMC/Netapp volume and the block device runs out of blocks (aka Raid Group is full) all volumes on it will drop OFFLINE. They don't even go RO. Poof, they disappear. Why? Because there is no guarantee that every NFS client, every iSCSI client, every FC client is going to do the right thing. The only reliable means of telling everyone "shit just broke" is for the asset to disappear.

All in-flight writes to the volume that the array ACK'd are still good even if they haven't been de-staged to the intended device thanks to NVRAM and the array's journal device.

<quote>
 In these cases, I
 would expect that LVM thin volumes should not be used across
 multiple customers without understanding the exact type of
 churn expected, to understand what the maximum allocation
 that would be required.
</quote>

sure, but that spells responsible sysadmin. Xen's post implied he didn't want to be bothered to manage his block layer - that magically the FS's job was to work closely with the block layer to suss out when it was safe to keep accepting writes. There's an answer to "works closely with block layer" - it's spelled BTRFS and ZFS.

LVM has no obligation to protect careless sysadmins doing dangerous things from themselves. There is nothing wrong with using THIN every which way you want just as long as you understand and handle the eventuality of extent exhaustion. Even thin snaps go invalid if they need to track a change and can't allocate space for the 'copy'.

Responsible usage has nothing to do with single vs multiple customers. Though Xen broached the 'hosting' example, and in the cut-rate hosting business over-provisioning is rampant. It's not a problem unless the sysadmin drops the ball.


> Amazon would make sure to have enough storage to meet my requirement if I need them.

Yes, because Amazon is a RESPONSIBLE sysadmin and has put in place tools to manage the fact they are thin-provisioning and to make damn sure they can cash the checks they are writing.

  
> the nature of the block device, such as "how much space
> do you *really* have left?"

So you're going to write and then backport "second guess the block layer" code to all filesystems in common use and god knows how many versions back? Of course not. Just try to get on the EXT developer mailing list and ask them to write "block layer second-guessing code (aka branch on device flag=thin)" because THINP will cause problems for the FS when it runs out of extents. To which the obvious and correct response will be "Don't use THINP if you're not prepared to handle its prerequisites."

> you and the other people. You think the block storage should
> be as transparent as possible, as if the storage was not
> thin. Others, including me, think that this theory is
> impractical

Then by all means go ahead and retrofit all known filesystems with the extra logic. ALL of the filesystems were written with the understanding that the block layer is telling the truth and that any "white lie" was benign insofar as it would be made good, and thus could be assumed to be "truth" for practical purposes.


* Re: [linux-lvm] thin handling of available space
  2016-04-29 11:23     ` Zdenek Kabelac
  2016-05-02 14:32       ` Mark Mielke
@ 2016-05-03 12:42       ` Xen
  1 sibling, 0 replies; 29+ messages in thread
From: Xen @ 2016-05-03 12:42 UTC (permalink / raw)
  To: LVM general discussion and development

Zdenek Kabelac wrote on 29-04-2016 13:23:

> I'm not going to add much to this thread - since there is nothing
> really useful for devel.  But let me strike out few important moments:

If you like to keep things short, I will now give short replies. Also, 
other people have responded and I haven't read everything yet.

> it's still the admin who creates thin-volume and gets WARNING if VG is 
> not big enough when
> all thin volumes would be fully provisioned.

That is just what we could call insincere or that beautiful strange word 
that I cannot remember.

The opposite of innocuous. Disingenuous (thank you dictionary).

You know perfectly well that this warning doesn't do much of anything 
when all people approach thin from the viewpoint of wanting to 
overprovision.

That is like saying "Don't enter this pet store here, because you might 
buy pets, and pets might scratch your arm. Now what can we serve you 
with?".

It's those insincere warnings many business or ideas give to people to 
supposedly warn them in advance of what they want to do anyway. "I told 
you it was a bad idea, now what can we do for you? :) :) :) :)". It's a 
way of being politically correct mostly.

You want to do it anyway. But now someone tells you it might be a bad 
idea even if both of you want it.


> So you try to design  'another btrfs' on top of thin provisioning?

Maybe I am. At least you recognise that I am trying to design something; 
many people would just throw it in the wastebasket as "empty 
complaints".

That in itself.... ;-)

speaks some volumes.

But let's talk about real volumes now :p.

There's nothing bad about btrfs except that it usurps everything, 
doesn't separate any layers, and just overall means the end and death of 
a healthy filesystem ecosystem. It wants to be the monopoly.

> With 'thinp' you  want simplest  filesystem with robust metadata -  so
> in theory  -  'ext4' or  XFS without all 'improvements for rotational
> hdd that has accumulated over decades of their evolution.

I agree. I don't even use ext4, I use ext3. I feel ext4 may have some 
benefits but they are not really worth anything.

> You miss the 'key' details.
> 
> Thin pool is not constructing  'free-maps'  for each LV all the time -
> that's why tools like  'thin_ls'  are meant to be used from the
> user-space.
> It IS very EXPENSIVE operation.
> 
> So before you start to present your visions here, please spend some
> time with reading doc and understanding all the technology behind it.

Sure I could do that. I could also allow myself to die without ever 
having contributed to anything.

>> Even with a perfect LVM monitoring tool, I would experience a 
>> consistent
>> lack of feedback.
> 
> Mistake of your expectations

It has nothing to do with expectations. Things and feelings that keep 
creeping up on you and keep annoying you have nothing to do with 
expectations.

That is like saying that being thoroughly annoyed about something for years 
and expecting it to go away by itself is the epitome of sanity.

For example: monitor makes buzzing noise when turned off. Deeply 
frustrating, annoying, downright bad. Gives me nightmares even. You say 
"You have bad expectations of hardware, hardware just does that thing, 
you have to live with it." I go to shop, shop says "Yeah all hardware 
does that (so we don't need to pay you anything back)".

That has nothing to do with bad expectations.

> If you are trying to operate  thin-pool near 100% fullness - you will
> need to write and design completely different piece of software -
> sorry thinp
> is not for you and never will...

I am not trying to operate near 100% fullness.

Although it wouldn't be bad if I could manage that.

That would not be such a bad thing at all, if the tools and the mechanisms 
were there to actually do it. Wouldn't you agree? Regardless of 
what is possible or even what is to be considered "wise" here, wouldn't 
it be beneficial in some way?

> 
>> 
>> Just a simple example: I can adjust "df" to do different stuff. But 
>> any
>> program reporting free diskspace is going to "lie" to me in that 
>> sense. So
>> yes I've chosen to use thin LVM because it is the best solution for me
>> right now.
> 
> 'df'  has nothing in common with  'block' layer.

A clothing retailer has nothing in common with a clothing manufacturer 
either, yet they are both in the same business.

> But if you've never planned to buy 10TB - you should have never allow
> to create such big volume in the first place!

So you are basically saying the only use case for thin is a growth scenario 
that can be met.

> So don't do it - and don't plan to use it - it's really that simple.

What I was saying was that it would be possible to maintain the contract 
that any individual volume at any one time would be able to grow to max 
size as long as other volumes don't start acting aberrantly. If you manage 
all those volumes, of course, you would be able to choose this.

The purpose of the thin system is to maintain the situation that all 
volumes can reach their full potential without (auto)extending, in that 
sense.

If you did actually make a 1TB-backed volume for a single client with a 10TB 
V-size, you would be a very bad contractor. Who says the growth is not going 
to happen overnight? How would you be able to respond?

The situation where you have a 10TB volume and you have 20 clients with 
1TB each, is very different.

I feel the contract should be that the available real space should 
always be equal to or greater than the available on any one filesystem 
(volume).

So: R >= max(A(1), A(2), A(3), ..., A(n))
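
As an untested sketch, that contract could be checked from a script roughly 
like this (the volume group, pool and mountpoints are placeholders):

#!/usr/bin/env python3
"""Check the contract above, R >= max(A(1)..A(n)): the real free space in
the pool should cover the free space advertised by any single thin
volume's filesystem."""
import os
import subprocess

VG, POOL = "vg0", "pool"
THIN_MOUNTS = ["/srv/client1", "/srv/client2"]  # one mount per thin volume

def fs_free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def pool_free_bytes(vg, pool):
    out = subprocess.run(
        ["lvs", "--noheadings", "--nosuffix", "--units", "b",
         "-o", "lv_size,data_percent", f"{vg}/{pool}"],
        check=True, capture_output=True, text=True).stdout.split()
    size, used_pct = float(out[0]), float(out[1])
    return int(size * (1.0 - used_pct / 100.0))

R = pool_free_bytes(VG, POOL)
worst = max(fs_free_bytes(m) for m in THIN_MOUNTS)
print("contract holds" if R >= worst
      else "contract broken: pool cannot back the emptiest volume")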

Of course it is pleasant not having to resize the filesystem but would 
you really do that for yourself? Make a 10TB filesystem on a 1TB disk 
because you expect to buy more disks in the future?

I mean you could. But in this sense resizing the filesystem (growing it) 
is not a very expensive operation, usually.

I would only want to do that if I could limit the actual usage of the 
filesystem in a real way.

Any runaway process causing my volume to drop...... NOT a good thing.

> Actually it's the core principle!
> It lies (or better say uses admin's promises) that there is going to
> be a disk space. And it's admin responsibility to fulfill it.

The admin never comes into it. What the admin does or doesn't do, what 
the admin thinks or doesn't think. These are all interpretations of 
intents.

Thinp should function regardless of what the admin is thinking or not. 
Regardless of what his political views are.

You are bringing morality into the technical system.

You are saying /thinp should work/ because /the admin should be a good 
person/.

When the admin creates the system, no "promise" is ever communicated to 
the hardware layer, OR the software layer. You are turning the correct 
operation of the machine into a human problem in the way of saying 
"/Linux is a great system and everyone can use it, but some people are 
just too stupid to spend a few hours reading a manual on a daily basis, 
and we can't help that/".

These promises are not there in the system. Someone might be using the 
system for reasons you have not envisioned. But the system is there and 
it can be used for that. Now if things go wrong you say "You had the 
wrong use case", but a use case is just a use case; it has no morality to 
it.

If you build a waterway system that only functions as long as it doesn't 
rain (overflowing the channels) then you can say "Well my system is 
perfect, it is just God who is a bitch and messes things up".

No, you have to take account of real-life human beings, not those ideal 
pictures of admins that you have.

Stop the idealism, you know. Admins are humans and they can be expected 
to behave like humans.

It is you who have wrong expectations of people.

If people mess up they mess up but it is part of the human agenda and 
you design for that.


> If you know in front you will need quickly all the disk space - then
> using thinp and expecting miracle is not going to work.

Nobody ever said anything of that kind.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
       [not found] <1870050920.5354287.1462276845385.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-05-03 12:00 ` matthew patton
  2016-05-03 14:38   ` Xen
  2016-05-04  1:25   ` Mark Mielke
  0 siblings, 2 replies; 29+ messages in thread
From: matthew patton @ 2016-05-03 12:00 UTC (permalink / raw)
  To: LVM general discussion and development

> written as required. If the file system has particular areas
> of importance that need to be writable to prevent file
> system failure, perhaps the file system should have a way of
> communicating this to the volume layer. The naive approach
> here might be to preallocate these critical blocks before
>  proceeding with any updates to these blocks, such that the
> failure situations can all be "safe" situations,
> where ENOSPC can be returned without a danger of the file
> system locking up or going read-only.

Why, all of a sudden, does each and every FS have to have this added code to second-guess the block layer? The quickest solution is to mount the FS in sync mode. Go ahead and pay the performance piper. It's still not likely to be bulletproof, but it's a sure step closer.

What you're saying is that when mounting a block device the layer needs to expose a "thin-mode" attribute (or the sysadmin sets such a flag via tune2fs), something analogous to how mke2fs can "detect" LVM RAID-mode geometry (does that actually work reliably?).

Then there has to be code in every FS block de-stage path:

IF thin {
    tickle the block layer to allocate the block (i.e. write zeros to it?
    And what about pre-existing data: is there a "fake write" BIO call that
    does everything except actually write data to the block, but would still
    trigger LVM thin's extent allocation logic?)
    IF success: destage the dirty block to the block layer
    ELSE: inform userland of ENOSPC
}

In a fully journal'd FS (metadata AND data) the journal could be 'pinned', and likewise the main metadata areas, if for no other reason than that they are zeroed at the outset and/or constantly being written to. Once written to, LVM thin isn't going to go back and yank away an allocated extent.

This at least should maintain FS integrity, although you may end up in a situation where the journal can never get properly de-staged, so you're stuck on any further writes and need to force RO.

> just want a sanely behaving LVM + XFS...)
 

IMO if the system admin made a conscious decision to use thin AND overprovision (thin by itself is not dangerous), it's up to HIM to actively manage his block layer. Even on million-dollar SANs the expectation is that the engineer will do his job and not drop the mic and walk away. Maybe the "easiest" implementation would be an MD-layer job that the admin can tailor to fail all allocation requests once the free extent count drops below a threshold, thus forcing all FS mounted on the thin-pool to go into RO mode.
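
A minimal sketch of such a monitoring job in Python, assuming a hypothetical 
pool 'vg/pool', a hand-picked usage threshold and a fixed list of mount 
points; the exact lvs field name should be verified against 'lvs -o help':

import subprocess
import time

POOL = "vg/pool"                    # hypothetical VG/thin-pool name
MOUNTPOINTS = ["/srv/a", "/srv/b"]  # hypothetical FS on thin LVs in that pool
THRESHOLD = 95.0                    # force RO once pool data usage passes this

def pool_data_percent():
    # 'data_percent' is the lvs field behind the Data% column of a thin pool.
    out = subprocess.check_output(
        ["lvs", "--noheadings", "-o", "data_percent", POOL], text=True)
    return float(out.strip())

def force_read_only():
    for mnt in MOUNTPOINTS:
        subprocess.run(["mount", "-o", "remount,ro", mnt], check=False)

while True:
    if pool_data_percent() >= THRESHOLD:
        force_read_only()
        break
    time.sleep(60)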

But in any event it won't prevent irate users from demanding to know why the space they appear to have isn't actually there.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-05-03 10:15         ` Gionatan Danti
@ 2016-05-03 11:42           ` Zdenek Kabelac
  2016-05-03 13:15             ` Gionatan Danti
  0 siblings, 1 reply; 29+ messages in thread
From: Zdenek Kabelac @ 2016-05-03 11:42 UTC (permalink / raw)
  To: LVM general discussion and development

On 3.5.2016 12:15, Gionatan Danti wrote:
>
> On 02/05/2016 16:32, Mark Mielke wrote:
>> The WARNING is a cover-your-ass type warning that is showing up
>> inappropriately for us. It is warning me something that I should already
>> know, and it is training me to ignore warnings. Thinp doesn't have to be
>> the answer to everything. It does, however, need to provide a block
>> device visible to the file system layer, and it isn't invalid for the
>> file system layer to be able to query about the nature of the block
>> device, such as "how much space do you *really* have left?"
>
> As this warning appears on snapshots, it is quite annoying in fact. On the
> other hand, I fully understand that the developers want to avoid "blind"
> overprovisioning. A commmand-line (or a lvm.conf) option to override the
> warning would be welcomed, though.

Since the number of reports from people who used a thin-pool without realizing 
what could go wrong was too high, a rather 'dramatic' WARNING approach is 
used. The advised usage is with dmeventd & monitoring.

The danger with having 'disable' options like this is that many distros decide 
for themselves about the best defaults for their users; Ubuntu with its 
issue_discards=1 has shown us that we need to be more careful, as then it's not 
Ubuntu but lvm2 which gets blamed for the data loss.

Options are being evaluated...


>
>> This seems to be a crux of this debate between you and the other people.
>> You think the block storage should be as transparent as possible, as if
>> the storage was not thin. Others, including me, think that this theory
>> is impractical, as it leads to edge cases where the file system could
>> choose to fail in a cleaner way, but it gets too far today leading to a
>> more dangerous failure when it allocates some block, but not some other
>> block.
>>
>> ...
>>
>>
>> It is your opinion that extending thin volumes to allow the file system
>> to have more information is breaking some fundamental law. But, in
>> practice, this sort of thing is done all of the time. "Size", "Read
>> only", "Discard/Trim Support", "Physical vs Logical Sector Size", ...
>> are all information queried from the device, and used by the file
>> system. If it is a general concept that applies to many different device
>> targets, and it will help the file system make better and smarter
>> choices, why *shouldn't* it be communicated? Who decides which ones are
>> valid and which ones are not?
>
> This seems reasonable. After all, a simple "lsblk" already reports plenty of
> information to the upper layer, so adding a "REAL_AVAILABLE_SPACE" info should
> not be infeasible.


What's wrong with  'lvs'?
This will give you the available space in the thin-pool.

However, combining this number with the amount of free space in the filesystem - 
that needs magic.

When you create a file with a hole in it on your filesystem - how much free 
space do you have?

If you have 2 filesystems in a single thin-pool - does each take 1/2?
It's all about lying....


Regards

Zdenek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-05-03 10:41           ` Mark Mielke
@ 2016-05-03 11:18             ` Zdenek Kabelac
  0 siblings, 0 replies; 29+ messages in thread
From: Zdenek Kabelac @ 2016-05-03 11:18 UTC (permalink / raw)
  To: Mark Mielke; +Cc: LVM general discussion and development

On 3.5.2016 12:41, Mark Mielke wrote:
> On Tue, May 3, 2016 at 5:45 AM, Zdenek Kabelac <zkabelac@redhat.com
> <mailto:zkabelac@redhat.com>> wrote:
>
>     On 2.5.2016 16:32, Mark Mielke wrote:
>
>              If you seek for a filesystem with over-provisioning - look at
>         btrfs, zfs
>              and other variants...
>
>         I have to say that I am disappointed with this view, particularly if
>         this is a
>         view held by Red Hat. To me this represents a misunderstanding of the
>         purpose
>
>
>     So first - this is  AMAZING deduction you've just shown.
>
>     You've cut sentence out of the middle of a thread and used as kind of evidence
>     that Red Hat is suggesting usage of ZFS, Btrfs  - sorry man - read this
>     thread again...
>
>
> My intent wasn't to cut a sentence in the middle. I responded to the each
> sentence in its place. I think it really comes down to this:
>
>         This seems to be a crux of this debate between you and the other
>         people. You
>         think the block storage should be as transparent as possible, as if the
>         storage was not thin. Others, including me, think that this theory is
>         impractical, as it leads to edge cases where the file system could
>         choose to
>
>
>     It's purely practical and it's the 'crucial' difference between
>
>     i.e. thin+XFS/ext4     and   BTRFS.
>
>
>
> I think I captured the crux of this pretty well. If anybody suggests that
> there could be value to exposing any information related to the nature of the
> "thinly provisioned block devices", you suggest that the only route forwards
> here is BTRFS and ZFS. You are saying directly and indirectly, that anybody
> who disagrees with you should switch to what you feel are the only solutions
> that are in this space, and that LVM should never be in this space.
>
> I think I understand your perspective. However, I don't agree with it. I don't

The perspective of the lvm2 team is pretty simple: as a small team there is 
absolutely no time to venture down this road.

Also, technically you are barking up the wrong tree.

Try pushing your visions to some filesystem developers.

> agree that the best solution is one that fails at the last instant with ENOSPC
> and/or for the file system to become read-only. I think there is a whole lot
> of grey possibilities between the polar extremes of "BTRFS/ZFS" vs
> "thin+XFS/ext4 with last instant failure".

The other point is that the technical difficulties are very high and you are 
really asking for Btrfs logic; you just fail to admit this to yourself.

It's been the 'core' idea of Btrfs to combine volume management and filesystem 
together for a better future...


> What started me on this list was the CYA mandatory warning about over
> provisioning that I think is inappropriate, and causing us tooling problems.
> But seeing the debate unfold, and having seen some related failures in the
> Docker LVM thin pool case where the system may completely lock up, I have a
> conclusion that this type of failure represents a fundamental difference in
> opinion around what thin volumes are for, and what place they have. As I see
> them as highly valuable for various reasons including Docker image layers
> (something Red Hat appears to agree with, having targeted LVM thinp instead of


As you mention Docker - again, I've no idea why you think there is a 'one-way' 
path?

Red Hat is not a political party with a single leading direction.

Many variants are being implemented in parallel (yes, even in Red Hat) and the 
best one will win over time - but there is no single 'directive' decision. 
It really is the open source way.


> the union file systems), and the snapshot use cases I presented prior, I think
> there must be a way to avoid the worst scenarios, if the right people consider
> all the options, and don't write off options prematurely due to preconceived
> notions about what is and what is not appropriate in terms of communication of
> information between system layers.
>
> There are many types of information that *are* passed from the block device
> layer to the file system layer. I don't see why awareness of thin volumes,
> should not be one of them.
>


Find a use-case, build a patch, show results and add info on what the filesystem 
shall be doing when its underlying device changes its characteristics.

There is an API between the block layer and the fs layer - so propose an 
extension, with a patch for a filesystem, with a clearly defined benefit.

That's my best advice.


> communicating this to the volume layer. The naive approach here might be to
> preallocate these critical blocks before proceeding with any updates to these
> blocks, such that the failure situations can all be "safe" situations, where
> ENOSPC can be returned without a danger of the file system locking up or going
> read-only.
>
> Or, maybe I am out of my depth, and this is crazy talk... :-)


Basically you are not realizing how much work is behind all those simple 
sentences.   At this moment 'fallocate' is being discussed...
But it's more or less a 'nuclear weapon' for thin provisioning.


>
> (Personally, I'm not really needing a "df" to approximate available storage...
> I just don't want the system to fail badly in the "out of disk space"
> scenario... I can't speak for others, though... I do *not* want BTRFS/ZFS... I
> just want a sanely behaving LVM + XFS...)

Yes - that's what we try to improve  daily.


Regards

Zdenek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-05-03  9:45         ` Zdenek Kabelac
@ 2016-05-03 10:41           ` Mark Mielke
  2016-05-03 11:18             ` Zdenek Kabelac
  0 siblings, 1 reply; 29+ messages in thread
From: Mark Mielke @ 2016-05-03 10:41 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: LVM general discussion and development

[-- Attachment #1: Type: text/plain, Size: 4497 bytes --]

On Tue, May 3, 2016 at 5:45 AM, Zdenek Kabelac <zkabelac@redhat.com> wrote:

> On 2.5.2016 16:32, Mark Mielke wrote:
>>
>>     If you seek for a filesystem with over-provisioning - look at btrfs,
>> zfs
>>     and other variants...
>>
>> I have to say that I am disappointed with this view, particularly if this
>> is a
>> view held by Red Hat. To me this represents a misunderstanding of the
>> purpose
>>
>
> So first - this is  AMAZING deduction you've just shown.
>
> You've cut sentence out of the middle of a thread and used as kind of
> evidence
> that Red Hat is suggesting usage of ZFS, Btrfs  - sorry man - read this
> thread again...
>

My intent wasn't to cut a sentence in the middle. I responded to each
sentence in its place. I think it really comes down to this:

This seems to be a crux of this debate between you and the other people. You
>> think the block storage should be as transparent as possible, as if the
>> storage was not thin. Others, including me, think that this theory is
>> impractical, as it leads to edge cases where the file system could choose
>> to
>>
>
> It's purely practical and it's the 'crucial' difference between
>
> i.e. thin+XFS/ext4     and   BTRFS.
>


I think I captured the crux of this pretty well. If anybody suggests that
there could be value to exposing any information related to the nature of
the "thinly provisioned block devices", you suggest that the only route
forwards here is BTRFS and ZFS. You are saying directly and indirectly,
that anybody who disagrees with you should switch to what you feel are the
only solutions that are in this space, and that LVM should never be in this
space.

I think I understand your perspective. However, I don't agree with it. I
don't agree that the best solution is one that fails at the last instant
with ENOSPC and/or for the file system to become read-only. I think there
is a whole lot of grey possibilities between the polar extremes of
"BTRFS/ZFS" vs "thin+XFS/ext4 with last instant failure".

What started me on this list was the CYA mandatory warning about over
provisioning that I think is inappropriate, and causing us tooling
problems. But seeing the debate unfold, and having seen some related
failures in the Docker LVM thin pool case where the system may completely
lock up, I have a conclusion that this type of failure represents a
fundamental difference in opinion around what thin volumes are for, and
what place they have. As I see them as highly valuable for various reasons
including Docker image layers (something Red Hat appears to agree with,
having targeted LVM thinp instead of the union file systems), and the
snapshot use cases I presented prior, I think there must be a way to avoid
the worst scenarios, if the right people consider all the options, and
don't write off options prematurely due to preconceived notions about what
is and what is not appropriate in terms of communication of information
between system layers.

There are many types of information that *are* passed from the block device
layer to the file system layer. I don't see why awareness of thin volumes,
should not be one of them.

For example, and I'm not pretending this is the best idea that should be
implemented, but just to see where the discussion might lead:

The Linux kernel needs to deal with problems such as memory being swapped
out due to memory pressures. In various cases, it is dangerous to swap
memory out. The memory can be protected from being swapped out where
required using various technique such as pinning pages. This takes up extra
RAM, but ensures that the memory can be safely accessed and written as
required. If the file system has particular areas of importance that need
to be writable to prevent file system failure, perhaps the file system
should have a way of communicating this to the volume layer. The naive
approach here might be to preallocate these critical blocks before
proceeding with any updates to these blocks, such that the failure
situations can all be "safe" situations, where ENOSPC can be returned
without a danger of the file system locking up or going read-only.

Or, maybe I am out of my depth, and this is crazy talk... :-)

(Personally, I'm not really needing a "df" to approximate available
storage... I just don't want the system to fail badly in the "out of disk
space" scenario... I can't speak for others, though... I do *not* want
BTRFS/ZFS... I just want a sanely behaving LVM + XFS...)


-- 
Mark Mielke <mark.mielke@gmail.com>

[-- Attachment #2: Type: text/html, Size: 6095 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-05-02 14:32       ` Mark Mielke
  2016-05-03  9:45         ` Zdenek Kabelac
@ 2016-05-03 10:15         ` Gionatan Danti
  2016-05-03 11:42           ` Zdenek Kabelac
  1 sibling, 1 reply; 29+ messages in thread
From: Gionatan Danti @ 2016-05-03 10:15 UTC (permalink / raw)
  To: LVM general discussion and development


On 02/05/2016 16:32, Mark Mielke wrote:
>
> 2) Frequent snapshots. In many of our use cases, we may take snapshots
> every 15 minutes, every hour, and every day, keeping 3 or more of each.
> If this storage had to be allocated in full, this amounts to at least
> 10X the storage cost. Using snapshots, and understanding the rate of
> churn, we can use closer to 1X or 2X the storage overhead, instead of
> 10X the storage overhead.
>
> 3) Snapshot as a means of achieving a consistent backup at low cost of
> outage or storage overhead. If we "quiesce" the application (flush
> buffers, put new requests on hold, etc.) take the snapshot, and then
> "resume" the application, this can be achieved in a matter of seconds or
> less. Then, we can mount the snapshot at a separate mount point and
> proceed with a more intensive backup process against a particular
> consistent point-in-time. This can be fast and require closer to 1X the
> storage overhead, instead of 2X the storage overhead.
>

This is exactly my main use case.

>
>
> The WARNING is a cover-your-ass type warning that is showing up
> inappropriately for us. It is warning me something that I should already
> know, and it is training me to ignore warnings. Thinp doesn't have to be
> the answer to everything. It does, however, need to provide a block
> device visible to the file system layer, and it isn't invalid for the
> file system layer to be able to query about the nature of the block
> device, such as "how much space do you *really* have left?"

As this warning appears on snapshots, it is quite annoying in fact. On 
the other hand, I fully understand that the developers want to avoid 
"blind" overprovisioning. A command-line (or an lvm.conf) option to 
override the warning would be welcomed, though.

> This seems to be a crux of this debate between you and the other people.
> You think the block storage should be as transparent as possible, as if
> the storage was not thin. Others, including me, think that this theory
> is impractical, as it leads to edge cases where the file system could
> choose to fail in a cleaner way, but it gets too far today leading to a
> more dangerous failure when it allocates some block, but not some other
> block.
>
> ...
>
>
> It is your opinion that extending thin volumes to allow the file system
> to have more information is breaking some fundamental law. But, in
> practice, this sort of thing is done all of the time. "Size", "Read
> only", "Discard/Trim Support", "Physical vs Logical Sector Size", ...
> are all information queried from the device, and used by the file
> system. If it is a general concept that applies to many different device
> targets, and it will help the file system make better and smarter
> choices, why *shouldn't* it be communicated? Who decides which ones are
> valid and which ones are not?

This seems reasonable. After all, a simple "lsblk" already reports 
plenty of information to the upper layer, so adding 
"REAL_AVAILABLE_SPACE" info should not be infeasible.

>
> I didn't disagree with all of your points. But, enough of them seemed to
> be directly contradicting my perspective on the matter that I felt it
> important to respond to them.
>

Thinp really is a wonderful piece of technology, and I really thank the 
developers for it.

>
>
> --
> Mark Mielke <mark.mielke@gmail.com <mailto:mark.mielke@gmail.com>>
>
>
>
> _______________________________________________
> linux-lvm mailing list
> linux-lvm@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
>

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-05-02 14:32       ` Mark Mielke
@ 2016-05-03  9:45         ` Zdenek Kabelac
  2016-05-03 10:41           ` Mark Mielke
  2016-05-03 10:15         ` Gionatan Danti
  1 sibling, 1 reply; 29+ messages in thread
From: Zdenek Kabelac @ 2016-05-03  9:45 UTC (permalink / raw)
  To: LVM general discussion and development, mark.mielke

On 2.5.2016 16:32, Mark Mielke wrote:
>
> On Fri, Apr 29, 2016 at 7:23 AM, Zdenek Kabelac <zkabelac@redhat.com
> <mailto:zkabelac@redhat.com>> wrote:
>
>     Thin-provisioning is NOT about providing device to the upper
>     system levels and inform THEM about this lie in-progress.
>     That's complete misunderstanding of the purpose.
>
>
> I think this line of thought is a bit of a strawman.
>
> Thin provisioning is entirely about presenting the upper layer with a logical
> view which does not match the physical view, including the possibility for
> such things as over provisioning. How much of this detail is presented to the
> higher layer is an implementation detail and has nothing to do with "purpose".
> The purpose or objective is to allow volumes that are not fully allocated in
> advance. This is what "thin" means, as compared to "thick".
>
>     If you seek for a filesystem with over-provisioning - look at btrfs, zfs
>     and other variants...
>
>
> I have to say that I am disappointed with this view, particularly if this is a
> view held by Red Hat. To me this represents a misunderstanding of the purpose


Hi

So first - this is an AMAZING deduction you've just shown.

You've cut a sentence out of the middle of a thread and used it as a kind of 
evidence that Red Hat is suggesting the usage of ZFS or Btrfs - sorry man - 
read this thread again...

Personally I'd never use those 2 filesystems, as they are too complex for 
recovery. But I've no problem advising users to try them if that's what fits 
their needs best and they believe in the 'all-in-one' logic.
('Hitting the wall' is the best learning exercise in Xen's case anyway...)


> When a storage provider providers a block device (EMC, NetApp, ...) and a
> snapshot capability, I expect to be able to take snapshots with low overhead.
> The previous LVM model for snapshots was really bad, in that it was not low
> overhead. We use this capability for many purposes including:


This usage is perfectly fine. It's been designed this way from day 1.


> 1) Instantiating test environments or dev environments from a snapshot of
> production, with copy-on-write to allow for very large full-scale environments
> to be constructed quickly and with low overhead. In one of our examples, this
> includes an example where we have about 1 TByte of JIRA and Confluence
> attachments collected over several years. It is exposed over NFS by the NetApp
> device, but in the backend it is a volume. This volume is snapshot and then
> exposed as a different volume with copy-on-write characteristics. The storage
> allocation is monitored, and if it is exceeded, it is known that there will be
> particular behaviour. I believe in our case, the behaviour is that the
> snapshot becomes unusable.


A thin pool does not make a difference between snapshot and origin.
All thin volumes share the same pool space.

It's up to the monitoring application to decide whether some snapshots could be 
erased to reclaim some space in the thin-pool.

The recent tool thin_ls shows how much data is exclusively held by 
individual thin volumes.

It's a major difference compared with old snapshots and their 'invalidation' logic.


>
> 2) Frequent snapshots. In many of our use cases, we may take snapshots every
> 15 minutes, every hour, and every day, keeping 3 or more of each. If this
> storage had to be allocated in full, this amounts to at least 10X the storage
> cost. Using snapshots, and understanding the rate of churn, we can use closer
> to 1X or 2X the storage overhead, instead of 10X the storage overhead.


Sure - snapper... whatever you name it.
It's just up to the admin to maintain space availability in the thin-pool.


> 3) Snapshot as a means of achieving a consistent backup at low cost of outage
> or storage overhead. If we "quiesce" the application (flush buffers, put new
> requests on hold, etc.) take the snapshot, and then "resume" the application,
> this can be achieved in a matter of seconds or less. Then, we can mount the
> snapshot at a separate mount point and proceed with a more intensive backup
> process against a particular consistent point-in-time. This can be fast and
> require closer to 1X the storage overhead, instead of 2X the storage overhead.
>
> In all of these cases - we'll buy more storage if we need more storage. But,
> we're not going to use BTRFS or ZFS to provide the above capabilities, just


And where exactly did I advise you specifically to switch to those filesystems?

My advice is clearly given to a user who seeks a filesystem COMBINED with a 
block layer.


> because this is your opinion on the matter. Storage vendors of reputation and
> market presence sell these capabilities as features, and we pay a lot of money
> to have access to these features.
>
> In the case of LVM... which is really the point of this discussion... LVM is
> not necessarily going to be used or available on a storage appliance. The LVM
> use case, at least for us, is for storage which is thinly provisioned by the
> compute host instead of the backend storage appliance. This includes:
>
> 1) Local disks, particularly included local flash drives that are local to
> achieve higher levels of performance than can normally be achieved with a
> remote storage appliance.
>
> 2) Local file systems, on remote storage appliances, using a protocol such as
> iSCSI to access the backend block device. This might be the case where we need
> better control of the snapshot process, or to abstract the management of the
> snapshots from the backend block device. In our case, we previously use an EMC
> over iSCSI for one of these use cases, and we are switching to NetApp.
> However, instead of embedding NetApp-specific logic into our code, we want to
> use LVM on top of iSCSI, and re-use the LVM thin pool capabilities from the
> host, such that we don't care what storage is used on the backend. The
> management scripts will work the same whether the storage is local (the first
> case above) or not (the case we are looking into now).
>
> In both of these cases, we have a need to take snapshots and manage them
> locally on the host, instead of managing them on a storage appliance. In both
> cases, we want to take many light weight snapshots of the block device. You
> could argue that we should use BTRFS or ZFS, but you should full well know
> that both of these have caveats as well. We want to use XFS or EXT4 as our
> needs require, and still have the ability to take light-weight snapshots.


Which is exactly the actual Red Hat strategy. XFS is strongly pushed forward.


> Generally, I've seen the people who argue that thin provisioning is a "lie",
> tend to not be talking about snapshots. I have a sense that you are talking
> more as storage providers for customers, and talking more about thinly
> provisioning content for your customers. In this case - I think I would agree
> that it is a "lie" if you don't make sure to have the storage by the time it


Thin-provisioning simply requires RESPONSIBLE admins - if you are not willing 
to take care of your thin-pools, don't use them - lots of kittens may die - 
and that's all this thread was about. It had absolutely nothing to do with 
Red Hat or any of your conspiracy theories about it pushing you to switch to 
a filesystem you don't like...


>     Device target is definitely not here to solve  filesystem troubles.
>     Thinp is about 'promising' - you as admin promised you will provide
>     space -  we could here discuss maybe that LVM may possibly maintain
>     max growth size we can promise to user - meanwhile - it's still the admin
>     who creates thin-volume and gets WARNING if VG is not big enough when all
>     thin volumes would be fully provisioned.
>     And  THAT'S IT - nothing more.
>     So please avoid making thinp target to be answer to ultimate question of
>     life, the universe, and everything - as we all know  it's 42...
>
>
> The WARNING is a cover-your-ass type warning that is showing up
> inappropriately for us. It is warning me something that I should already know,
> and it is training me to ignore warnings. Thinp doesn't have to be the answer
> to everything. It does, however, need to provide a block device visible to the
> file system layer, and it isn't invalid for the file system layer to be able
> to query about the nature of the block device, such as "how much space do you
> *really* have left?"


This is not such useful information - as this state is dynamic.
The only 'valid' query is - are we out of space...
And that's what you get from the block layer now - ENOSPC.
Filesystems may then react differently to that than to a plain EIO.


I'd be really curious what the use case for this information would even be.

If you care about e.g. 'df' - then let's fix 'df' - it may check whether the fs 
is on a thinly provisioned volume, ask the provisioner about the free space in 
the pool and combine the results in some way...
Just DO NOT mix this with the filesystem layer...
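
A userspace 'df' wrapper along those lines could be a fairly small script. A 
rough Python sketch, assuming a hypothetical pool name and that the mapping 
from mountpoint to pool is known out of band:

import os
import subprocess

def pool_free_bytes(pool="vg/pool"):
    # Hypothetical pool name; lv_size and data_percent are standard lvs fields.
    out = subprocess.check_output(
        ["lvs", "--noheadings", "--units", "b", "--nosuffix",
         "-o", "lv_size,data_percent", pool], text=True)
    size, used_pct = out.split()
    return float(size) * (100.0 - float(used_pct)) / 100.0

def thin_aware_free(mountpoint, pool="vg/pool"):
    st = os.statvfs(mountpoint)
    fs_free = st.f_bavail * st.f_frsize
    # Report the smaller of what the filesystem thinks it has free and what
    # the thin-pool can actually still back.
    return min(fs_free, pool_free_bytes(pool))

print(thin_aware_free("/srv/data"))   # hypothetical mountpoint on a thin LV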

What would the filesystem do with this info?

Should it randomly decide to drop files according to the thin-pool workload?

Would you change every filesystem in the kernel to implement such policies?

It's really the thin-pool monitoring which tries to add some space when the 
pool is getting low, and which may implement further policies, e.g. dropping 
some snapshots.

However, what is being implemented is better 'allocation' logic for pool chunk 
provisioning (for XFS ATM) - as the rather 'dated' methods for deciding where to 
store incoming data do not work efficiently with provisioned chunks.


> This seems to be a crux of this debate between you and the other people. You
> think the block storage should be as transparent as possible, as if the
> storage was not thin. Others, including me, think that this theory is
> impractical, as it leads to edge cases where the file system could choose to


It's purely practical and it's the 'crucial' difference between

i.e. thin+XFS/ext4     and   BTRFS.


> fail in a cleaner way, but it gets too far today leading to a more dangerous
> failure when it allocates some block, but not some other block.


The best thing to do is to stop immediately on error and switch the fs to 
'read-only' - which is exactly 'ext4 + remount-ro'.

Your proposal to make XFS a different kind of BTRFS monster is simply not 
going to work - that's exactly what BTRFS is doing, and it's a waste of time to 
do it again.

BTRFS has a built-in volume manager and combines the fs layer with the block 
layer (making many layers in the kernel quite ugly - i.e. device major:minor 
handling).

This is different from the approach lvm2 takes - where layers are separated 
with clearly defined logic.

So again - if you don't like a separate thin block layer + XFS fs layer and you 
want to see 'merged' technology - there is BTRFS/ZFS/.... which try to 
combine raid/caching/encryption/snapshots... - but there are no plans to 
'reinvent' the same from the other side with lvm2/dm....


> Exaggerating this to say that thinp would become everything, and the answer to
> the ultimate question of life, weakens your point to me, as it means that you
> are seeing things in far too black + white, whereas real life is often not
> black + white.


Yes, we prefer clearly defined borders and responsibilities which can be well 
tested and verified.

Don't compare life with software :)


>
> It is your opinion that extending thin volumes to allow the file system to
> have more information is breaking some fundamental law. But, in practice, this
> sort of thing is done all of the time. "Size", "Read only", "Discard/Trim
> Support", "Physical vs Logical Sector Size", ... are all information queried
> from the device, and used by the file system. If it is a general concept that
> applies to many different device targets, and it will help the file system
> make better and smarter choices, why *shouldn't* it be communicated? Who
> decides which ones are valid and which ones are not?


lvm2 is a logical volume manager. Just think about it.

In the future your thinLV might be turned into a plain 'linear' LV, just as your 
linear LV might become a member of a thin-pool (planned features).

Your LV could be pvmove(d) to a completely different drive with a different 
geometry...

These are topics for lvm2/dm.

We are not designing a filesystem - and we plan to stay transparent to them.

And it's up to you to understand the reasoning.


> I didn't disagree with all of your points. But, enough of them seemed to be
> directly contradicting my perspective on the matter that I felt it important
> to respond to them.


It is an Open Source World - "so send a patch" and implement your visions - 
again, it is that easy - we do it every day in Red Hat...


> Mostly, I think everybody has a set of opinions and use cases in mind when
> they come to their conclusions. Please don't ignore mine. If there is
> something unreasonable above, please let me know.


It's not about ignoring - it's about having a certain amount of man-hours for 
the work, and you have to choose how to 'spend' them.

And in this case, for your ideas, you will need to spend/invest your own time....
(Just like Xen).


Regards

Zdenek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-04-29 11:23     ` Zdenek Kabelac
@ 2016-05-02 14:32       ` Mark Mielke
  2016-05-03  9:45         ` Zdenek Kabelac
  2016-05-03 10:15         ` Gionatan Danti
  2016-05-03 12:42       ` Xen
  1 sibling, 2 replies; 29+ messages in thread
From: Mark Mielke @ 2016-05-02 14:32 UTC (permalink / raw)
  To: LVM general discussion and development

[-- Attachment #1: Type: text/plain, Size: 9519 bytes --]

On Fri, Apr 29, 2016 at 7:23 AM, Zdenek Kabelac <zkabelac@redhat.com> wrote:

> Thin-provisioning is NOT about providing device to the upper
> system levels and inform THEM about this lie in-progress.
> That's complete misunderstanding of the purpose.
>

I think this line of thought is a bit of a strawman.

Thin provisioning is entirely about presenting the upper layer with a
logical view which does not match the physical view, including the
possibility for such things as over provisioning. How much of this detail
is presented to the higher layer is an implementation detail and has
nothing to do with "purpose". The purpose or objective is to allow volumes
that are not fully allocated in advance. This is what "thin" means, as
compared to "thick".



> If you seek for a filesystem with over-provisioning - look at btrfs, zfs
> and other variants...
>

I have to say that I am disappointed with this view, particularly if this
is a view held by Red Hat. To me this represents a misunderstanding of the
purpose for over-provisioning, and a misunderstanding of why thin volumes
are required. It seems there is a focus on "filesystem" in the above
statement, and that this may be the point of debate.

When a storage provider providers a block device (EMC, NetApp, ...) and a
snapshot capability, I expect to be able to take snapshots with low
overhead. The previous LVM model for snapshots was really bad, in that it
was not low overhead. We use this capability for many purposes including:

1) Instantiating test environments or dev environments from a snapshot of
production, with copy-on-write to allow for very large full-scale
environments to be constructed quickly and with low overhead. In one of our
examples, this includes an example where we have about 1 TByte of JIRA and
Confluence attachments collected over several years. It is exposed over NFS
by the NetApp device, but in the backend it is a volume. This volume is
snapshot and then exposed as a different volume with copy-on-write
characteristics. The storage allocation is monitored, and if it is
exceeded, it is known that there will be particular behaviour. I believe in
our case, the behaviour is that the snapshot becomes unusable.

2) Frequent snapshots. In many of our use cases, we may take snapshots
every 15 minutes, every hour, and every day, keeping 3 or more of each. If
this storage had to be allocated in full, this amounts to at least 10X the
storage cost. Using snapshots, and understanding the rate of churn, we can
use closer to 1X or 2X the storage overhead, instead of 10X the storage
overhead.

3) Snapshot as a means of achieving a consistent backup at low cost of
outage or storage overhead. If we "quiesce" the application (flush buffers,
put new requests on hold, etc.) take the snapshot, and then "resume" the
application, this can be achieved in a matter of seconds or less. Then, we
can mount the snapshot at a separate mount point and proceed with a more
intensive backup process against a particular consistent point-in-time.
This can be fast and require closer to 1X the storage overhead, instead of
2X the storage overhead.

In all of these cases - we'll buy more storage if we need more storage.
But, we're not going to use BTRFS or ZFS to provide the above capabilities,
just because this is your opinion on the matter. Storage vendors of
reputation and market presence sell these capabilities as features, and we
pay a lot of money to have access to these features.

In the case of LVM... which is really the point of this discussion... LVM
is not necessarily going to be used or available on a storage appliance.
The LVM use case, at least for us, is for storage which is thinly
provisioned by the compute host instead of the backend storage appliance.
This includes:

1) Local disks, particularly included local flash drives that are local to
achieve higher levels of performance than can normally be achieved with a
remote storage appliance.

2) Local file systems, on remote storage appliances, using a protocol such
as iSCSI to access the backend block device. This might be the case where
we need better control of the snapshot process, or to abstract the
management of the snapshots from the backend block device. In our case, we
previously used an EMC over iSCSI for one of these use cases, and we are
switching to NetApp. However, instead of embedding NetApp-specific logic
into our code, we want to use LVM on top of iSCSI, and re-use the LVM thin
pool capabilities from the host, such that we don't care what storage is
used on the backend. The management scripts will work the same whether the
storage is local (the first case above) or not (the case we are looking
into now).

In both of these cases, we have a need to take snapshots and manage them
locally on the host, instead of managing them on a storage appliance. In
both cases, we want to take many light weight snapshots of the block
device. You could argue that we should use BTRFS or ZFS, but you should
full well know that both of these have caveats as well. We want to use XFS
or EXT4 as our needs require, and still have the ability to take
light-weight snapshots.

Generally, I've seen the people who argue that thin provisioning is a
"lie", tend to not be talking about snapshots. I have a sense that you are
talking more as storage providers for customers, and talking more about
thinly provisioning content for your customers. In this case - I think I
would agree that it is a "lie" if you don't make sure to have the storage
by the time it is required. But, I think this is a very small use case in
reality. I think large service providers would use Ceph or EMC or NetApp,
or some such technology to provision large amounts of storage per customer,
and LVM would be used more at the level of a single customer, or a single
machine. In these cases, I would expect that LVM thin volumes should not be
used across multiple customers without understanding the exact type of
churn expected, to understand what the maximum allocation that would be
required. In the case of our IT team and EMC or NetApp, they mostly avoid
the use of thin volumes for "cross customer" purposes, and instead use thin
volumes for a specific customer, for a specific need. In the case of Amazon
EC2, for example... I would use EBS for storage, and expect that even if it
is "thin", Amazon would make sure to have enough storage to meet my
requirement if I need them. But, I would use LVM on my Amazon EC2 instance,
and I would expect to be able to use LVM thin pool snapshots to over
provision my own per-machine storage requirements by creating multiple
snapshots of the underlying storage, with a full understanding of the
amount of churn that I expect to occur, and a full understanding of the
need to monitor.



> Device target is definitely not here to solve  filesystem troubles.
> Thinp is about 'promising' - you as admin promised you will provide
> space -  we could here discuss maybe that LVM may possibly maintain
> max growth size we can promise to user - meanwhile - it's still the admin
> who creates thin-volume and gets WARNING if VG is not big enough when all
> thin volumes would be fully provisioned.
> And  THAT'S IT - nothing more.
> So please avoid making thinp target to be answer to ultimate question of
> life, the universe, and everything - as we all know  it's 42...


The WARNING is a cover-your-ass type warning that is showing up
inappropriately for us. It is warning me something that I should already
know, and it is training me to ignore warnings. Thinp doesn't have to be
the answer to everything. It does, however, need to provide a block device
visible to the file system layer, and it isn't invalid for the file system
layer to be able to query about the nature of the block device, such as
"how much space do you *really* have left?"

This seems to be a crux of this debate between you and the other people.
You think the block storage should be as transparent as possible, as if the
storage was not thin. Others, including me, think that this theory is
impractical, as it leads to edge cases where the file system could choose
to fail in a cleaner way, but it gets too far today leading to a more
dangerous failure when it allocates some block, but not some other block.

Exaggerating this to say that thinp would become everything, and the answer
to the ultimate question of life, weakens your point to me, as it means
that you are seeing things in far too black + white, whereas real life is
often not black + white.

It is your opinion that extending thin volumes to allow the file system to
have more information is breaking some fundamental law. But, in practice,
this sort of thing is done all of the time. "Size", "Read only",
"Discard/Trim Support", "Physical vs Logical Sector Size", ... are all
information queried from the device, and used by the file system. If it is
a general concept that applies to many different device targets, and it
will help the file system make better and smarter choices, why *shouldn't*
it be communicated? Who decides which ones are valid and which ones are not?

I didn't disagree with all of your points. But, enough of them seemed to be
directly contradicting my perspective on the matter that I felt it
important to respond to them.

Mostly, I think everybody has a set of opinions and use cases in mind when
they come to their conclusions. Please don't ignore mine. If there is
something unreasonable above, please let me know.


-- 
Mark Mielke <mark.mielke@gmail.com>

[-- Attachment #2: Type: text/html, Size: 11013 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-04-28 18:25   ` Xen
@ 2016-04-29 11:23     ` Zdenek Kabelac
  2016-05-02 14:32       ` Mark Mielke
  2016-05-03 12:42       ` Xen
  0 siblings, 2 replies; 29+ messages in thread
From: Zdenek Kabelac @ 2016-04-29 11:23 UTC (permalink / raw)
  To: linux-lvm

On 28.4.2016 20:25, Xen wrote:
> Continuing from previous mail I guess. But I realized something.
>
>> A responsible sysadmin who chose to use thin pools might configure the
>> initial FS size to be some modest size well within the constraints of
>> the actual block store, and then as the FS hit say 85% utilization to
>> run a script that investigated the state of the block layer and use
>> resize2fs and friends to grow the FS and let the thin-pool likewise grow
>> to fit as IO gets issued. But at some point when the competing demands
>> of other FS on thin-pool were set to breach actual block availability
>> the FS growth would be denied and thus userland would get signaled by
>> the FS layer that it's out of space when it hit 100% util.
>
> Well of course what you describe here are increasingly complex strategies
> that require development and should not be put on invidual administrators
> (or even organisations) to devise and come up with.
>
> Growing filesystems? If you have a platform where continous thin pool
> growth is possible (and we are talking of well developed, complex setups
> here) then maybe you have in-house tools to take care of all of that.
>
> So you suggest a strategy here that involves both intelligent automatic
> administration of the FS layer as well as the block layer.
>
> A concerted strategy where for example you do have a defined thin volume
> size but you constrain your FS artificially AND depend its intelligence on
> knowledge of your thin pool size. And then you have created an
> intelligence where the "filesystem agent" can request growth, and perhaps
> the "block level agent" may grant or deny it such that FS growth is staged
> and given hard limits at every point. And then you have the same
> functionality as what I described other than that it is more sanely
> constructed at intervals.
>
> No continuous updating, but staged growth intervals or moments.

I'm not going to add much to this thread - since there is nothing really 
useful for devel here.  But let me point out a few important moments:


Thin-provisioning is NOT about providing a device to the upper
system levels and informing THEM about this lie in progress.

That's a complete misunderstanding of the purpose.

If you seek a filesystem with over-provisioning - look at btrfs, zfs and 
other variants...

A device target is definitely not here to solve filesystem troubles.
Thinp is about 'promising' - you as admin promised you will provide the
space - we could perhaps discuss here whether LVM could maintain the
max growth size we can promise to the user - meanwhile - it's still the admin
who creates the thin-volume and gets a WARNING if the VG is not big enough for 
all thin volumes to be fully provisioned.

And  THAT'S IT - nothing more.

So please avoid making the thinp target the answer to the ultimate question of 
life, the universe, and everything - as we all know it's 42...

>
>> But either way if you have a sudden burst of I/O from competing
>> interests in the thin-pool, what appeared to be a safe growth allocation
>> at one instant of time is not likely to be true when actual writes try
>> to get fulfilled.
>
> So in the end monitoring is important but because you use a thin pool
> there are like 3 classes of situations that change:
>
> * Filesystems will generally have more leeway because you are /able/ to
> provide them with more (virtual) space to begin with, in the assumption
> that you won't readily need it, but it's normally going to be there when
> it does.

So you are trying to design 'another btrfs' on top of thin provisioning?

> * Thin volumes do allow you to make better use of the available space (as
> per btrfs, I guess) and give many advantages in moving data around.


With 'thinp' you want the simplest filesystem with robust metadata - so in 
theory - 'ext4' or XFS without all the 'improvements for rotational hdd' that 
have accumulated over decades of their evolution.

> 1. Unless you monitor it directly in some way, the lack of information is
> going to make you feel rather annoyed and insecure
>
> 2. Normally user tools do inform you of system status (a user-run "ls" or
> "df" is enough) but you cannot have lvs information unless run as root.

You miss the 'key' details.

The thin pool is not constructing 'free-maps' for each LV all the time - that's 
why tools like 'thin_ls' are meant to be used from user-space.
It IS a very EXPENSIVE operation.

So before you start presenting your visions here, please spend some time 
reading the docs and understanding all the technology behind it.


> Even with a perfect LVM monitoring tool, I would experience a consistent
> lack of feedback.

That is a mistake in your expectations.

If you are trying to operate a thin-pool near 100% fullness - you will need to 
write and design a completely different piece of software - sorry, thinp
is not for you and never will be...

Simply use fully provisioned volumes - aka already existing standard volumes.

>
> Just a simple example: I can adjust "df" to do different stuff. But any
> program reporting free diskspace is going to "lie" to me in that sense. So
> yes I've chosen to use thin LVM because it is the best solution for me
> right now.

'df'  has nothing in common with  'block' layer.


> Technically I consider autoextend not that great of a solution either.
>
> It begs the question: why did you not start out with a larger volume in
> the first place? You going to keep adding disks as the thing grows?

Very simple answer, and it relates to the misunderstanding of the purpose.

Take it as motivation: you want to reduce the amount of active devices in your 
i.e. 'datacenter'.

So you start with a 1TB volume - while the user may immediately create, 
format and use an i.e. 10TB volume.   As the volume fills over time - you add 
more devices to your VG (buy/pay for more disk space/energy).
But the user doesn't have to resize his filesystem or bear the other costs of 
maintaining a slowly growing filesystem.
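
For concreteness, a tiny sketch (Python driving the usual LVM tools) of that 
admin-side step, with a hypothetical VG 'vg', thin-pool 'vg/pool' and a newly 
added disk /dev/sdX:

import subprocess

NEW_PV = "/dev/sdX"   # hypothetical newly purchased disk
VG = "vg"             # hypothetical volume group
POOL = "vg/pool"      # hypothetical thin-pool to grow

# Initialize the new disk, add it to the VG, then grow the pool into it.
subprocess.run(["pvcreate", NEW_PV], check=True)
subprocess.run(["vgextend", VG, NEW_PV], check=True)
subprocess.run(["lvextend", "-L", "+1T", POOL], check=True)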

Of course, if the first thing the user does is to i.e. 'dd' the full 10TB 
volume, there are not going to be any savings!

But if you never planned to buy 10TB - you should never have allowed such a 
big volume to be created in the first place!

With thinp you basically postpone or skip some operations (fs resize).

> An overprovisioned system with individual volumes that individually cannot
> reach their max size is a bad system.

Yes - it is bad system.

So don't do it - and don't plan to use it - it's really that simple.

ThinP is NOT virtual disk-space for free...

> Thin pools lie. Yes. But it's not a lie of the space is available. It's
> only a lie if the space is no longer available!!!!!!!.
>
> It is not designed to lie.

Actually, it's the core principle!
It lies (or, better said, relies on the admin's promise) that there is going to 
be disk space. And it's the admin's responsibility to fulfill it.

If you know up front that you will quickly need all the disk space - then using 
thinp and expecting a miracle is not going to work.


Regards

Zdenek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-04-28 10:43 ` matthew patton
  2016-04-28 18:20   ` Xen
@ 2016-04-28 18:25   ` Xen
  2016-04-29 11:23     ` Zdenek Kabelac
  1 sibling, 1 reply; 29+ messages in thread
From: Xen @ 2016-04-28 18:25 UTC (permalink / raw)
  To: LVM general discussion and development

Continuing from previous mail I guess. But I realized something.

> A responsible sysadmin who chose to use thin pools might configure the
> initial FS size to be some modest size well within the constraints of
> the actual block store, and then as the FS hit say 85% utilization to
> run a script that investigated the state of the block layer and use
> resize2fs and friends to grow the FS and let the thin-pool likewise 
> grow
> to fit as IO gets issued. But at some point when the competing demands
> of other FS on thin-pool were set to breach actual block availability
> the FS growth would be denied and thus userland would get signaled by
> the FS layer that it's out of space when it hit 100% util.

Well of course what you describe here are increasingly complex strategies
that require development and should not be left to individual administrators
(or even organisations) to devise and come up with.

Growing filesystems? If you have a platform where continuous thin pool
growth is possible (and we are talking of well developed, complex setups
here) then maybe you have in-house tools to take care of all of that.

So you suggest a strategy here that involves both intelligent automatic
administration of the FS layer as well as the block layer.

A concerted strategy where for example you do have a defined thin volume
size but you constrain your FS artificially AND make its intelligence depend
on knowledge of your thin pool size. And then you have created an
intelligence where the "filesystem agent" can request growth, and perhaps
the "block level agent" may grant or deny it, such that FS growth is staged
and given hard limits at every point. And then you have the same
functionality as what I described, other than that it is more sanely
constructed at intervals.

No continuous updating, but staged growth intervals or moments.
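
A rough Python sketch of that kind of staged "filesystem agent", assuming a 
hypothetical mountpoint, thin LV and pool, an 85% trigger, and 'lvextend -r' 
to grow the filesystem together with the LV:

import os
import subprocess

MOUNTPOINT = "/srv/data"   # hypothetical FS sitting on the thin LV below
THIN_LV = "vg/thinvol"     # hypothetical thin LV
POOL = "vg/pool"           # hypothetical thin-pool backing it
GROW_STEP_GIB = 50         # grow the FS in fixed stages
UTIL_TRIGGER = 0.85        # ask for growth once the FS is 85% full
SAFETY_GIB = 20            # deny growth if the pool would get this tight

def fs_utilization(mnt):
    st = os.statvfs(mnt)
    return 1.0 - st.f_bavail / st.f_blocks

def pool_free_bytes(pool):
    out = subprocess.check_output(
        ["lvs", "--noheadings", "--units", "b", "--nosuffix",
         "-o", "lv_size,data_percent", pool], text=True)
    size, used_pct = out.split()
    return float(size) * (100.0 - float(used_pct)) / 100.0

def grow_stage():
    if fs_utilization(MOUNTPOINT) < UTIL_TRIGGER:
        return                 # nothing to do yet
    needed = (GROW_STEP_GIB + SAFETY_GIB) * 1024**3
    if pool_free_bytes(POOL) < needed:
        return                 # growth denied: the FS will fill up and hand
                               # userland a clean ENOSPC at 100% utilization
    # 'lvextend -r' resizes the LV and the filesystem on it (via fsadm).
    subprocess.run(
        ["lvextend", "-r", "-L", "+%dG" % GROW_STEP_GIB, THIN_LV], check=True)

grow_stage()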

> But either way if you have a sudden burst of I/O from competing
> interests in the thin-pool, what appeared to be a safe growth 
> allocation
> at one instant of time is not likely to be true when actual writes try
> to get fulfilled.

So in the end monitoring is important but because you use a thin pool
there are like 3 classes of situations that change:

* Filesystems will generally have more leeway because you are /able/ to
provide them with more (virtual) space to begin with, on the assumption
that you won't readily need it, but it's normally going to be there when
you do.

* Hard limits in the filesystem itself are still a use case that has no
good solution; most applications will start crashing or behaving weirdly
when out of disk space. Freezing a filesystem (when it is not a system
disk) might be just as good a mitigation strategy as anything that
involves "oh no, I am out of disk space and now I am going to ensure
endless trouble as processes keep trying to write to that empty space -
that nonexistent space". If anything, I don't think most systems
gracefully recover from that.

Creating temporary filesystems for important parts is not all that bad.

* Thin volumes do allow you to make better use of the available space 
(as
per btrfs, I guess) and give many advantages in moving data around.

The only real detriments to thin for a desktop power user, so to speak,
are:

1. Unless you monitor it directly in some way, the lack of information 
is
going to make you feel rather annoyed and insecure

2. Normally user tools do inform you of system status (a user-run "ls" 
or "df" is enough) but you cannot get lvs information unless it is run 
as root.

The system-config-lvm tool just runs as setuid. I can add volumes 
without
authenticating as root.

Regular command line tools are not accessible to the user.
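
As an aside, one possible stopgap today is to whitelist a read-only 
"lvs" call for a user via sudo. This is only a sketch, and the user 
name "xen" and the /sbin/lvs path are assumptions:

  # /etc/sudoers.d/lvs-readonly (hypothetical file)
  xen ALL=(root) NOPASSWD: /sbin/lvs

  # the user can then check thin pool usage without a full root shell:
  sudo lvs -o lv_name,pool_lv,data_percent,metadata_percent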


So what I have been suggesting obviously seeks to address point 2. I am
more than willing to address point 1 by developing something, but I'm not
sure I will ever be able to develop again given the bleak sense of decay 
I am currently experiencing in life ;-).

Anyhow, it would never fully satisfy me.

Even with a perfect LVM monitoring tool, I would experience a consistent
lack of feedback.

Just a simple example: I can adjust "df" to do different stuff. But any
program reporting free diskspace is going to "lie" to me in that sense. 
So
yes I've chosen to use thin LVM because it is the best solution for me
right now.

At the same time, indeed, I lack information, and this information cannot 
be sourced directly from the block layer, because that's not how computer
software works. Computer software doesn't interface with the block layer.
It interfaces with filesystems and reports information from there.

Technically I consider autoextend not that great of a solution either.

It raises the question: why did you not start out with a larger volume in
the first place? Are you going to keep adding disks as the thing grows?

I mean, I don't know. Suppose I'm some VPS user running on a
thinly provisioned host. Maybe it's nice to be oblivious. But unless my
host has a perfect failsafe setup, the only time I am going to be 
notified of a failure is when my volume (that I don't know about) drops 
or freezes.

Would I personally like having a tool that would show at some point
something going wrong at the lower level? I think I would.

An overprovisioned system with individual volumes that individually 
cannot
reach their max size is a bad system.

That they can't do it all at the same time is not that much of a 
problem.
That is not very important.

Yet consider a different situation -- suppose this is a host with few
clients but high data requirements. Suppose there are only 4 thin 
volumes. And suppose every thin volume is something like 2TB, or make it 
as large as you want.

(I just have 50GB on my VPS). Suppose you had a 6TB disk and you
provisioned it for 4 clients x 2TB. Economies of scale only start to
really show their benefit with a much higher number of clients. With 200
clients the "averaging" starts to work in your favour, giving you a
dependable system that is not going to suddenly do something weird.

But with smaller numbers you do run into the risk of something going
amiss.

The only reason a lack of feedback would not be important for your clients
is if you had a large enough pool, where individual volumes are just a
small part of that pool, say 50-100 volumes per pool.

So I guess I'm suggesting there may be a use case for thin LVM in which
you do not have more than ten or so volumes sitting in any pool.

And at that point personally even if I'm the client of that system, I do
want to be informed.

And I would prefer to be informed *through* the pipe that already 
exists.

Thin pools lie. Yes. But it's not a lie as long as the space is available. 
It's only a lie once the space is no longer available!

It is not designed to lie.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-04-28 10:43 ` matthew patton
@ 2016-04-28 18:20   ` Xen
  2016-04-28 18:25   ` Xen
  1 sibling, 0 replies; 29+ messages in thread
From: Xen @ 2016-04-28 18:20 UTC (permalink / raw)
  To: LVM general discussion and development

Let me just write down some thoughts here.

First of all you say that fundamental OS design is about higher layers 
trusting lower layers and that certain types of communications should then 
always be one way.

In this case it is about block layer vs. file system layer.

But you make certain assumptions about the nature of a block device to 
begin with.

A block device is defined by its access method (i.e. data organized in 
blocks) rather than its contiguousness or its having an unchanging, "single 
block" address or access space. I know this goes pretty far, but it is the 
truth.

In theory there is nothing against a hypothetical block device offering 
ranges of blocks to a higher level (ranges that might never change) or 
dynamically notifying that level of changes to that address pool.

To a process, virtual memory is a space that is transparent to it, whether 
that space is backed by paged memory (a swap file) or not. At the same 
time it is not impossible to imagine that an I/O scheduler for swap would 
take heed of values given by applications, such as nice or ionice 
values. That would be one-way communication, though.

In general a higher level should be oblivious to what kind of lower-level 
layer it is running on, you are right. Yet if all lower levels exhibit the 
same kind of features, this point becomes moot, because the higher level 
will, once more, not be able to know precisely what kind of layer it is 
running on, although it would have more information.

So just theoretically speaking the only thing that is required to be 
consistent is the API or whatever interface you design for it.

I think there are many cases where some software can run on some libraries 
but not on others because those other libraries do not offer the full 
feature set of whatever standard is being defined there. An example is 
DLNA/UPnP: these are not layers, but the standard is ill-defined and the 
device you are communicating with might not support the full set.

Perhaps these are detrimental issues, but there are plenty of cases where 
one type of "lower level" will suffice but another won't; think of 
graphics drivers. Across the layer boundary, communication is two-way 
anyway. The block device *does* supply endless streams of data to the 
higher layer. The only thing that would change is that you would no longer 
have this "always one contiguous block of blocks" but something that is 
slightly more volatile.

When you run "mkfs" the tool reads the size of the block device. Perhaps 
subsequently the filesystem is unaware of it and depends on fixed values.

The feature I described (use case) would allow the set of blocks that is 
available to change dynamically. You are right that this would apparently 
be a big departure from the current model.

So I'm not saying it is easy, perfect, or well understood. I'm just saying 
I like the idea.

I don't know what other applications it might have but it depends entirely 
on correct "discard" behaviour from the filesystem.

The filesystem should be unaware of its underlying device but discard is 
never required for rotating disks as far as I can tell. This is an option 
that assumes knowledge of the underlying device. From discard we can 
basically infer that either we are dealing with a flash device or 
something that has some smartness about what blocks it retains and what 
not (think cache).

So in general this is already a change that reflects changing conditions 
of block devices in general, or their availability, and their 
characteristic behaviour or demands on filesystems.

These are block devices that want more information to operate (well).

Coincidentally, discard also favours or enhances (possibly) lvmcache.

So it's not about doing something wildly strange here, it's about offering 
a feature set that a filesystem may or may not use, or a block device may 
or may not offer.

Contrary to what you say, there is nothing inherently bad about the idea. 
The OS design principle violation you speak of is principle, not practical 
reality. It's not that it can't be done. It's that you don't want it to 
happen because it violates your principles. It's not that it wouldn't 
work. It's that you don't like it to work because it violates your 
principles.

At the same time I object to the notion of the system administrator being 
this theoretical, vastly different role/person from the user/client.

We have no in-betweens on Linux. For fun you should do a search of your 
filesystem with find -xdev based on the contents of /etc/passwd or 
/etc/group. You will find that 99% of files are owned by root and the only 
ones that aren't are usually user files in the home directory or specific 
services in /var/lib.

Here is a script that would do it for groups:

cat /etc/group | cut -d: -f1 | while read g; do
  printf "%-15s %6d\n" "$g" "$(find / -xdev -type f -group "$g" | wc -l)"
done

Probably. I can't run it here; it might crash my system (live DVD).

Of about 170k files on an OpenSUSE system, 15 were group-writable, mostly 
due to my own interference probably. Of 170197 files (without -xdev), 
168161 were owned by root.

Excluding man and my user, 69 files did not have "root" as the group. Part 
of that was again due to my own changes.

At the same time, in some debates you are presented with the ludicrous 
notion that there is some ideal desktop user who doesn't ever need to see 
anything of the internal system. She never opens a shell and certainly 
does not come across Ethernet device names (for example). The "desktop 
user" does not care about the renaming of devices from /dev/eth0 to 
/sys/class/net/enp3s0.

The desktop user never uses anything other than DHCP, etc. etc. etc.

The desktop user can never configure anything even slightly advanced 
without the help of the admin.

It's that user vs. admin dichotomy that is never true on any desktop 
system and I will venture it is not even true on the systems I am a client 
of, because you often need to debate stuff with the vendor or ask for 
features, offer solutions, etc.

In a store you are a client. There are employees and clients, nothing 
else. At the same time I treat these girls as my neighbours because they 
work in the block I live in.

You get the idea. Roles can be shifty. A person can hold multiple roles at 
the same time. He/she can be admin and user simultaneously.

Perhaps you are correct to state that the roles themselves should not be 
watered down, that clear delimitations are required.

In your other email you allude to my never having taken an OS design 
course.

Off-list, a friendly member strongly suggested I not use personal attacks in 
my communications here. But of course this is precisely what you are doing 
here, because as a matter of fact I did take such a course.

I don't remember the book we used, because apparently between my housemate 
and me we only had one copy, and he ended up keeping it because I was 
usually the one borrowing stuff from him.

At the same time university is way beyond my current reach (in living 
conditions) so it is just an unwarranted allusion that does not have 
anything to do with anything really.

Yes I think it was the dinosaur book:

Operating System Concepts by Silberschatz, Galvin and Gagne

Anyway, irrelevant here.

> Another way (haven't tested) to 'signal' the FS as to the true state of 
> the underlying storage is to have a sparse file that gets shrunk over 
> time.

You do realize you are trying to find ways around the limitation you just 
imposed on yourself, right?

> The system admin decided it was a bright idea to use thin pools in the 
> first place so he necessarily signed up to be liable for the hazards and 
> risks that choice entails. It is not the job of the FS to bail his ass 
> out.

I don't think thin pools are that risky or should be that risky. They do 
incur a management overhead compared to static filesystems because of 
adding that second layer you need to monitor. At the same time the burden 
of that can be lessened with tools.

As it stands I consider thin LVM the only reasonable way to snapshot a 
running system without dedicating specific space to it in advance. I could 
accept snapshotting requiring things to be in the same volume group. 
Without LVM thin, snapshotting requires making at least some prior 
investment in having a snapshot device ready for you in the same VG, 
right?

Do not think btrfs and ZFS are without costs. You wrote:

> Then you want an integrated block+fs implementation. See BTRFS and ZFS.
WAFL and friends.

But btrfs is not without complexity. It uses subvolumes that differ from 
distribution to distribution as each makes its own choice. It requires 
knowledge of more complicated tools and mechanics to do the simplest (or 
most meaningful) of tasks. Working with LVM is easier. I'm not saying LVM 
is perfect and....

Using snapshotting as a backup measure is something that seems risky to me 
in the first place, because it is a "partition table" operation which 
you really shouldn't be doing on a constant basis. So in order to 
use it effectively in the first place you require tools that handle the 
safeguards for you. Tools that make sure you are not making some command 
line mistake. Tools that simply guard against misuse.

Regular users are not fit for being btrfs admins either.

It is going to confuse the hell out of people if that is what their 
systems run on and they are introduced to some of its complexity.

You say swallow your pride. It has not much to do with pride.

It has to do with ending up in a situation I don't like. That is then 
going to "hurt" me for the remainder of my days until I switch back or get 
rid of it.

I have seen NOTHING NOTHING NOTHING inspiring about btrfs.

Not having partition tables and sending volumes across space and time to 
other systems, is not really my cup of tea.

It is a vendor lock-in system and would result in other technologies being 
less developed.

I am not alone in this opinion either.

Btrfs feels like a form of illness to me. It is living in a forest with 
all deformed trees, instead of something lush and inspiring. If you've 
ever played World of Warcraft, the only thing that comes a bit close is 
the Felwood area ;-).

But I don't consider it beyond Plaguelands either.

Anyway.

I have felt like btrfs in my life. They have not been the happiest moments 
of my life ;-).

I will respond more in another mail, this is getting too long.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
       [not found] <929635034.3140318.1461840230292.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-04-28 10:43 ` matthew patton
  2016-04-28 18:20   ` Xen
  2016-04-28 18:25   ` Xen
  0 siblings, 2 replies; 29+ messages in thread
From: matthew patton @ 2016-04-28 10:43 UTC (permalink / raw)
  To: Xen, Marek Podmaka, LVM general discussion and development

> > The real question you should be asking is if it increases the monitoring
> > aspect (enhances it) if thin pool data is seen through the lens of the
> > filesystems as well.

Then you want an integrated block+fs implementation. See BTRFS and ZFS. WAFL and friends.

> kernel for communication from lower fs layers to higher layers -

Correct. Because doing so violates the fundamental precepts of OS design. Higher layers trust lower layers. Thin pools are outright lying about the real world to anything that uses their services. That is their purpose. The FS doesn't give a damn that the block layer is lying to it; it can and does assume, rightly so, that what the block layer says it has, it indeed does have. The onus of keeping the block layer ahead of the FS falls on a third party - the system admin. The system admin decided it was a bright idea to use thin pools in the first place, so he necessarily signed up to be liable for the hazards and risks that choice entails. It is not the job of the FS to bail his ass out.

A responsible sysadmin who chose to use thin pools might configure the initial FS size to be some modest size well within the constraints of the actual block store, and then, as the FS hits say 85% utilization, run a script that investigates the state of the block layer and uses resize2fs and friends to grow the FS, letting the thin-pool likewise grow to fit as IO gets issued. But at some point, when the competing demands of other FSes on the thin-pool were set to breach actual block availability, the FS growth would be denied, and thus userland would get signaled by the FS layer that it's out of space when it hit 100% util.
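
A minimal sketch of such a watcher, with made-up names (an ext4 LV vg0/data mounted on /srv/data, backed by thin pool vg0/pool) and illustrative thresholds, might look like:

  #!/bin/sh
  # Grow the FS when it passes 85% full, but only while the thin pool
  # still has real headroom.
  LV=/dev/vg0/data       # thin LV (assumed name)
  POOL=vg0/pool          # thin pool (assumed name)
  MNT=/srv/data          # mount point of $LV (assumed)

  FS_USE=$(df --output=pcent "$MNT" | tail -1 | tr -dc '0-9')
  POOL_USE=$(lvs --noheadings -o data_percent "$POOL" | cut -d. -f1 | tr -dc '0-9')

  if [ "$FS_USE" -ge 85 ] && [ "$POOL_USE" -lt 70 ]; then
      # grow the LV by 10G, then grow the filesystem to match
      lvextend -L +10G "$LV" && resize2fs "$LV"
  fi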

Another way (haven't tested) to 'signal' the FS as to the true state of the underlying storage is to have a sparse file that gets shrunk over time.

But either way, if you have a sudden burst of I/O from competing interests in the thin-pool, what appeared to be a safe growth allocation at one instant of time is not likely to be true when actual writes try to get fulfilled. 

Think of mindless use of thin-pools as trying to cross a heavily mined beach. Bring a long stick and say your prayers, because you're likely going to lose a limb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-04-28  6:46     ` Marek Podmaka
@ 2016-04-28 10:33       ` Xen
  0 siblings, 0 replies; 29+ messages in thread
From: Xen @ 2016-04-28 10:33 UTC (permalink / raw)
  To: LVM general discussion and development

On Thu, 28 Apr 2016, Marek Podmaka wrote:

> Hello Xen,
>
> Wednesday, April 27, 2016, 23:28:31, you wrote:
>
>> The real question you should be asking is if it increases the monitoring
>> aspect (enhances it) if thin pool data is seen through the lens of the
>> filesystems as well.
>
>> Beside the point here perhaps. But. Let's drop the "real sysadmin"
>> ideology. We are humans. We like things to work for us. "Too easy" is
>> not a valid criticism for not having something.
>
> As far as I know (someone correct me) there is no mechanism at all in
> kernel for communication from lower fs layers to higher layers -
> besides exporting static properties like physical block size. The
> other way (from higher layer like fs to lower layers works fine - for
> example discard support).

I suspected so.

> So even if what you are asking might be valid, it isn't as simple as adding
> some parameter somewhere and it would magically work. It is about
> inventing and standardizing new communication system, which would of
> course work only with new versions of all the tools involved.

Right.

> Anyway, I have no idea what would filesystem itself do with information
> that no more space is available. Also this would work only for lvm
> thin pools, not for thin provision directly from storage, so it would
> be a non-consistent mess. Or you would need another protocol for
> exporting thin-pool related dynamic data from storage (via NAS, SAN,
> iSCSI and all other protocols) to the target system. And in some
> organizations it is not desirable at all to make this kind of
> information visible to all target systems / departments.

Yes I don't know how "thin provision directly from storage" works.

I take it you mean that these protocols you mention are or would be the 
channel through which the communication would need to happen that I now 
just proposed for LVM.

I take it you mean that these systems offer regular-looking devices over 
any kind of link, while "secretly" behind the scenes using thin 
provisioning for that, and that as such we are or would be dealing with 
pretty "hard coded" standards that would require a lot of momentum to 
change. In the sense that the clients of these storage systems 
themselves do not know about the thin provisioning and it is up to the 
admin of those systems... yadda yadda yadda.

I feel really stupid now :p.

And to make it worse, it means that in these "hardware" systems the user 
and admin are separated, but the same is true if you virtualize and you 
offer the same model to your clients. I apologize for my noviceness here 
and the way I come across.

But I agree that for any client it is not helpful to know about hard limits 
that they should be oblivious to, provided that the provisioning is done 
right.

It would be quite disconcerting to see your total available space suddenly 
shrink without being aware of any autoextend mechanism (for instance) and 
as such there seems to be a real divide between the "user" and the 
"supplier" of any thin volume.

Maybe I have misinterpreted the real use case for thin pools then. But my 
feeling is that I am just a bit confused at this point.

> What you are asking can be done for example directly in "df" (or you
> can make a wrapper script), which would not only check the filesystems
> themselves, but also the thin part and display the result in whatever
> format you want.

That is true of course. I have to think about it.

> Also displaying real thin free space for each fs won't be "correct".
> If I see 1 TB free in each filesystem and starting writing, by the
> time I finish, those 1 TB might be taken by the other fs. So
> information about current free space in thinp is useless for me, as in
> 1 minute it could be totally different number.

But the calamity is that if that was really true, and the thing didn't 
autoextend, then you'd end up with a frozen system.

So basically it seems at this point a conflict of interests:

- you don't want your clients to know your systems are failing
- they might not even be failing if they autoextend
- you don't want to scare them with data that is, in that sense, inaccurate

- on a desktop system, the user and sysadmin would be the same
- there is not really any provision for graphical tools.

(maybe I should develop one. I so badly want to start coding again).

- a tool that notifies the user about the thin pool would do the job of 
informing the user/admin just as well as a filesystem data point would.

- that implies that the two roles would stay separate.
- desktops seem to be using btrfs now in some distros

I'm concerned with the use case of a desktop user that could employ this 
technique. I now understand a bit more perhaps why grub doesn't support 
LVM thin.

The management tools for a desktop user also do not exist (except the 
command line tools we have).

Well, wrong again: there is a GUI, it is just not very helpful.

It is not helpful at all for monitoring.

It can
* create logical volumes (regular, stripe, mirror)
* move volumes to another PV
* extend volume groups to another PV

And that's about all it can do I guess. Not sure it even needs to do much 
more, but it is no monitoring tool of any sophistication.

Let me think some more on this and I apologize for the "out loud" 
thinking.

Regards.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-04-27 21:28   ` Xen
@ 2016-04-28  6:46     ` Marek Podmaka
  2016-04-28 10:33       ` Xen
  0 siblings, 1 reply; 29+ messages in thread
From: Marek Podmaka @ 2016-04-28  6:46 UTC (permalink / raw)
  To: Xen; +Cc: LVM general discussion and development

Hello Xen,

Wednesday, April 27, 2016, 23:28:31, you wrote:

> The real question you should be asking is if it increases the monitoring
> aspect (enhances it) if thin pool data is seen through the lens of the
> filesystems as well.

> Beside the point here perhaps. But. Let's drop the "real sysadmin" 
> ideology. We are humans. We like things to work for us. "Too easy" is 
> not a valid criticism for not having something.

As far as I know (someone correct me) there is no mechanism at all in
the kernel for communication from lower layers up to the higher fs layers -
besides exporting static properties like physical block size. The
other way (from a higher layer like the fs to lower layers) works fine -
for example discard support.

So even if what you are asking might be valid, it isn't as simple as adding
some parameter somewhere and having it magically work. It is about
inventing and standardizing a new communication system, which would of
course work only with new versions of all the tools involved.

Anyway, I have no idea what the filesystem itself would do with the 
information that no more space is available. Also, this would work only 
for lvm thin pools, not for thin provisioning directly from storage, so 
it would be a non-consistent mess. Or you would need another protocol for
exporting thin-pool related dynamic data from storage (via NAS, SAN,
iSCSI and all other protocols) to the target system. And in some
organizations it is not desirable at all to make this kind of
information visible to all target systems / departments.

What you are asking can be done for example directly in "df" (or you
can make a wrapper script), which would not only check the filesystems
themselves, but also the thin part and display the result in whatever
format you want.
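
For example, a minimal sketch of such a wrapper (the pool name vg0/pool 
is only an assumption):

  #!/bin/sh
  # df-thin: normal df output, plus the real usage of the thin pool
  df -h "$@"
  echo
  echo "thin pool usage:"
  lvs --noheadings -o lv_name,lv_size,data_percent,metadata_percent vg0/pool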

Also, displaying real thin free space for each fs won't be "correct".
If I see 1 TB free in each filesystem and start writing, by the
time I finish, that 1 TB might be taken by the other fs. So
information about current free space in thinp is useless for me, as in
1 minute it could be a totally different number.


-- 
  bYE, Marki

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-04-27 12:26 ` matthew patton
@ 2016-04-27 21:28   ` Xen
  2016-04-28  6:46     ` Marek Podmaka
  0 siblings, 1 reply; 29+ messages in thread
From: Xen @ 2016-04-27 21:28 UTC (permalink / raw)
  To: LVM general discussion and development

matthew patton schreef op 27-04-2016 12:26:
> It is not the OS' responsibility to coddle stupid sysadmins. If you're
> not watching for high-water marks in FS growth vis a vis the
> underlying, you're not doing your job. If there was anything more than
> the remotest chance that the FS would grow to full size it should not
> have been thin in the first place.

Who says the only ones who would ever use or consider using thin would 
be sysadmins?

Monitoring Linux is troublesome enough for most people and it really is 
a "job".

You seem to be intent on making the job harder rather than easier, so you 
can be the type of person that has this expert knowledge while others 
don't?

I remember a reason to crack down on sysadmins was that they didn't know 
how to use "vi" - if you can't use fucking vi, you're not a sysadmin. 
This actually is a bloated version of what a system administrator is or 
could at all times be expected to do, because you are ensuring that 
problems are going to surface one way or another when this sysadmin is 
suddenly no longer capable of being this perfect guy 100% of the time.

You are basically ensuring disaster by having that attitude.

That guy that can battle against all odds and still prevail ;-).

More to the point.

No one is getting cuddled because Linux is hard enough and it is usually 
the users who are getting cuddled; strangely enough the attitude exists 
that the average desktop user never needs to look under the hood. If 
something is ugly, who cares, the "average user" doesn't go there.

The average user is oblivious to all system internals.

The system administrator knows everything and can launch a space rocket 
with nothing more than matches and some gallons of rocket fuel.

;-).


The autoextend mechanism is designed to prevent calamity when the 
filesystem(s) grow to full size. By your reasoning, it should not exist 
because it cuddles admins.

A real admin would extend manually.

A real admin would specify the right size in advance.

A real admin would use thin pools of thin pools that expand beyond your 
wildest dreams :p.

But on a more serious note, if there is no chance a file system will 
grow to full size, then it doesn't need to be that big.

But there are more use cases for thin than hosting VMs for clients.

Also I believe thin pools have a use for desktop systems as well, when 
you see that the only alternative really is btrfs and some distros are 
going with it full-time. Btrfs also has thin provisioning in a sense but 
on a different layer, which is why I don't like it.

Thin pools from my perspective are the only valid snapshotting mechanism 
if you don't use btrfs or zfs or something of the kind.

Even a simple desktop monitor, some applet with configured thin pool 
data, would of course alleviate a lot of the problems for a "casual 
desktop user". If you remotely administer your system with VNC or the 
like, that's the same. So I am saying there is no single use case for 
thin.
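
For instance, a crude stand-in for such an applet could be a polling 
loop like the sketch below (the pool name is assumed, it needs enough 
privileges to run lvs, and notify-send needs a running notification 
daemon):

  #!/bin/sh
  POOL=vg0/pool
  while sleep 60; do
      PCT=$(lvs --noheadings -o data_percent "$POOL" | cut -d. -f1 | tr -d ' ')
      [ "${PCT:-0}" -ge 80 ] && \
          notify-send "Thin pool warning" "$POOL is at ${PCT}% of its real space"
  done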

Your response, Mr. Patton, falls along the lines of "I only want this to 
be used by my kind of people".

"Don't turn it into something everyone or anyone can use".

"Please let it be something special and nichie".

You can read coddle in place of cuddle.



It seems to me pretty clear to me that a system that *requires* manual 
intervention and monitoring at all times is not a good system, 
particularly if the feedback on its current state cannot be retrieved 
from, or is usable by, other existing systems that guard against more or 
less the same type of things.

Besides, if your arguments here were valid, then 
https://bugzilla.redhat.com/show_bug.cgi?id=1189215 would never have 
existed.



> The FS already has a notion of 'reserved'. man(1) tune2fs -r

Alright thanks. But those blocks are manually reserved for a specific 
user.

That's what they are for. It is for -u. These blocks are still available 
to the filesystem.

You could call it calamity prevention as well. There will always be a 
certain amount of space for say the root user.

And by the same measure you can also say the tmpfs overflow mechanism 
for /tmp is not required either, because a real admin would never see his 
rootfs run out of disk space.

Stuff happens. You ensure you are prepared when it does. Not stick your 
head in the sand and claim that real gurus never encounter those 
situations.

The real question you should be asking is if it increases the monitoring 
aspect (enhances it) if thin pool data is seen through the lens of the 
filesystems as well.

Or whether that is going to be a detriment.

Regards.



Erratum:

https://utcc.utoronto.ca/~cks/space/blog/tech/SocialProblemsMatter

There is a widespread attitude among computer people that it is a great 
pity that their beautiful solutions to difficult technical challenges 
are being prevented from working merely by some pesky social issues 
[read: human flaws], and that the problem is solved once the technical 
work is done. This attitude misses the point, especially in system 
administration: broadly speaking, the technical challenges are the easy 
problems.

No technical system is good if people can't use it or if it makes 
people's lives harder (my words). One good example of course is Git. The 
typical attitude you get is that a real programmer has all the skills of 
a git guru. Yet git is a git. Git is an asshole system.

Beside the point here perhaps. But. Let's drop the "real sysadmin" 
ideology. We are humans. We like things to work for us. "Too easy" is 
not a valid criticism for not having something.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
       [not found] <518072682.2617983.1461760017772.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-04-27 12:26 ` matthew patton
  2016-04-27 21:28   ` Xen
  0 siblings, 1 reply; 29+ messages in thread
From: matthew patton @ 2016-04-27 12:26 UTC (permalink / raw)
  To: LVM general discussion and development

It is not the OS' responsibility to coddle stupid sysadmins. If you're not watching for high-water marks in FS growth vis a vis the underlying, you're not doing your job. If there was anything more than the remotest chance that the FS would grow to full size it should not have been thin in the first place.

The FS already has a notion of 'reserved'. man(1) tune2fs -r
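
For example (ext4; the device name and numbers are purely illustrative), you can reserve a percentage or an absolute number of blocks so ordinary users hit ENOSPC a bit early:

  tune2fs -m 5 /dev/vg0/somelv        # reserve 5% of the blocks for root
  tune2fs -r 262144 /dev/vg0/somelv   # or set an absolute reserved-block count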

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-lvm] thin handling of available space
  2016-04-23 17:53 Xen
@ 2016-04-27 12:01 ` Xen
  0 siblings, 0 replies; 29+ messages in thread
From: Xen @ 2016-04-27 12:01 UTC (permalink / raw)
  To: LVM general discussion and development

I was talking about the idea of communicating to a filesystem the number 
of available blocks.

I noticed https://bugzilla.redhat.com/show_bug.cgi?id=1189215 named "LVM 
Thin: Handle out of space conditions better" which was resolved by 
Zdenek Kabelac (hey Zdenek) and which gave rise to (apparently) the new 
warning you get when you overprovision.



But this warning when overprovisioning does not solve any problems in a 
running system.

You /still/ want to overprovision AND you want a better way to handle 
out of space conditions.

A number of items were suggested in that bug:

1) change the default "resize thin-p at 100%" setting in lvm.conf (see 
the sketch after this list)
2) warn users that they have insufficient space in a pool to cover a 
fully used thinLV
3) change default wait time from 60sec after an out-of-space condition 
to something longer
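
For reference, the lvm.conf knobs that item 1 refers to look roughly 
like this (the values shown are only an example, not the shipped 
defaults):

  activation {
      # autoextend the pool once it is 80% full, by 20% of its size
      thin_pool_autoextend_threshold = 80
      thin_pool_autoextend_percent = 20
  }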

Corey Marthaler suggested that only #2 was implemented, and this bug (as 
mentioned) was linked in an errata mentioned at the end of the bug.


So since I have already talked about it here with my lengthy rambling 
post ;-) I would like to at least "formally" suggest a #4 here, and ask 
whether I should comment on that bug or submit a new one about it.


So my #4 would be:

4) communicate and dynamically update a list of free blocks being sent 
to the filesystem layer on top of a logical volume (LV) such that the 
filesystem itself is aware of shrinking free space.

Logic implies:
- any thin LV seeing more blocks being used causes the other filesystems 
in that thin pool to be updated with new available blocks (or numbers) 
if this amount becomes less than the filesystem normally would think it 
had

- any thin LV that sees blocks being discarded by the filesystem causes 
the other filesystems in that thin pool to be updated with newly 
available blocks (or numbers) up to the moment that the real available 
space agrees once more with the virtual available space (real free >= 
virtual free)

Meaning that this feedback would start happening for any thin LV when 
the real available space in the thin pool or volume group (depending on 
how that works at that point, in that place, in that configuration) 
becomes less than the virtual available space for the thin volume (LV).

This would mean that the virtual available space would in effect 
dynamically shrink and grow with the real available space as an 
envelope.

The filesystem may know this as an adjusted total available space 
(number of blocks) or as an adjusted number of unavailable blocks. It 
would need to integrate this in its free space calculation. For a user 
tool such as "df" there are 3 ways to update this changing information:

1. dynamically adjust the total available blocks
2. dynamically adjust the amount of free blocks
3. introduce a new field of "unavailable"

Traditional "df" is "total = used + free", the new one would be "total = 
used + free + unavailable".
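
As a toy illustration of where that "unavailable" figure would come 
from (numbers made up; shell arithmetic only to show the rule):

  VIRT_FREE=600   # free blocks the filesystem believes it has
  REAL_FREE=250   # free blocks actually left in the thin pool
  UNAVAIL=$(( VIRT_FREE - REAL_FREE ))
  [ "$UNAVAIL" -lt 0 ] && UNAVAIL=0
  echo "free=$REAL_FREE unavailable=$UNAVAIL"  # total = used + free + unavailable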

For any user tool working not in blocks but simply in available space 
(bytes), likely only the amount of free space being reported would 
change.

One may choose to hide the information in "df" and introduce a new flag 
that shows unavailable as well?

Then only the amount of free blocks reported would change, and the 
numbers just wouldn't add up visibly.

It falls along the line of the "discard" family of communications that 
were introduced in 2008 (https://lwn.net/Articles/293658/).

I DO NOT KNOW if this already exists, but I suppose it doesn't. I do not 
know a lot about the filesystem side of things. I just took the liberty 
of asking Jonathan Corwell erm Corbet whether this is possible :p.

Anyway, hopefully I am not being too much of a pain here. Regards.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [linux-lvm] thin handling of available space
@ 2016-04-23 17:53 Xen
  2016-04-27 12:01 ` Xen
  0 siblings, 1 reply; 29+ messages in thread
From: Xen @ 2016-04-23 17:53 UTC (permalink / raw)
  To: Linux lvm

Hi,

So here is my question. I was talking about it with someone, who also 
didn't know.



There seems to be a reason against creating a combined V-size that 
exceeds the total L-size of the thin-pool. I mean that's amazing if you 
want extra space to create more volumes at will, but at the same time 
having a larger sum V-size is also an important use case.

Is there any way that user tools could ever be allowed to know about the 
real effective free space on these volumes?

My thinking goes like this:

- if LVM knows about allocated blocks then it should also be aware of 
blocks that have been freed.
- so it needs to receive some communication from the filesystem
- that means the filesystem really maintains a "claim" on used blocks, 
or at least notifies the underlying layer of its mutations.

- in that case a reverse communication could also exist where the block 
device communicates to the file system about the availability of 
individual blocks (such as might happen with bad sectors) or even the 
total amount of free blocks. That means the disk/volume manager (driver) 
could or would maintain a mapping or table of its own blocks. Something 
that needs to be persistent.

That means the question becomes this:

- is it either possible (theoretically) that LVM communicates to the 
filesystem about the real number of free blocks that could be used by 
the filesystem to make "educated decisions" about the real availability 
of data/space?

- or, is it possible (theoretically) that LVM communicates a "crafted" 
map of available blocks in which a certain (algorithmically determined) 
group of blocks would be considered "unavailable" due to actual real 
space restrictions in the thin pool? This would seem very suboptimal but 
would have the same effect.

See, if the filesystem thinks it has 6GB available but really there is 
only 3GB because data is filling up, does it currently get notified of 
this?

What happens if it does fill up?

Funny that we are using GB in this example. I was reminded today of using 
Stacker on an MS-DOS disk where I had 20MB available and was able to 
increase it to 30MB ;-).

Someone else might use terabytes, but anyway.

If the filesystem normally has a fixed size and this size doesn't change 
after creation (without modifying the filesystem) then it is going to 
calculate its free space based on its knowledge of available blocks.

So there are three figures:

- total available space
- real available space
- data taken up by files.

total - data is not always real, because there may still be open handles 
on deleted files, etc. Visible, countable files and their "du" + 
blocks still in use + available blocks should be ~ total blocks.

So we are only talking about blocks here, nothing else.

And if LVM can communicate about availability of blocks, a fourth figure 
comes into play:

total = used blocks + unused blocks + unavailable blocks.

If LVM were able to dynamically adjust this last figure, we might have a 
filesystem that truthfully reports actual available space. In a thin 
setting.

I do not even know whether this is not already the case, but I read 
something that indicated an importance of "monitoring available space" 
which would make the whole situation unusable for an ordinary user.

Then you would need GUI applets that said "The space on your thin volume 
is running out (but the filesystem might not report it)".

So question is:

* is this currently 'provisioned' for?
* is this theoretically possible, if not?

If you take it to a tool such as "df"

There are only three figures and they add up.

They are:

total = used + available

but we want

total = used + available + unavailable

either that or the total must be dynamically be adjusted, but I think 
this is not a good solution.


So another question:

*SHOULDN'T THIS simply be a feature of any filesystem?*

The provision of being able to know about the *real* number of blocks in 
case an underlying block device might not be "fixed, stable, and 
unchanging"?

The way it is you can "tell" Linux filesystems with fsck which blocks 
are bad blocks and thus unavailable, probably reducing the number of 
"total" blocks.

From a user interface perspective, perhaps this would be an ideal 
solution, if you needed any solution at all. Personally I would probably 
prefer either the total space to be "hard limited" by the underlying 
(LVM) system, or for df to show a different output, but df output is 
often parsed by scripts.

In the former case supposing a volume was filling up.

udev             1974288       0   1974288   0% /dev
tmpfs             404384   41920    362464  11% /run
/dev/sr2         1485120 1485120         0 100% /cdrom

(Just taking 3 random filesystems)

One filesystem would see "used" space go up. The other two would see 
"total" size going down, and the first one would also see that figure 
go down. That would be counterintuitive and you cannot really do 
this.

It's impossible to give this information to the user in a way that the 
numbers still add up.

Supposing:

real size 2000

1000  500  500
1000  500  500
1000  500  500

combined virtual size 3000. Total usage 1500. Real free 500. Now the 
first volume uses another 250.

1000  750  250
1000  500  250
1000  500  250

The numbers no longer add up for the 2nd and 3rd system.

You *can* adjust total in a way that it still makes sense (a bit)

1000  750  250
  750  500  250
  750  500  250

You can also just ignore the discrepancy, or add another figure:

total used unav avail
1000  750    0  250
1000  500  250  250
1000  500  250  250

Whatever you do, you would have to simply calculate this adjusted number 
from the real number of available blocks.

Now the third volume takes another 100

First style:

1000  750  150
1000  500  150
1000  600  150

Second style:

  900  750  150
  650  500  150
  750  600  150

Third style:

total used unav avail
1000  750  100  150
1000  500  350  150
1000  600  250  150

There's nothing technically inconsistent about it, it is just rather 
difficult to grasp at first glance.

df uses filesystem data, but we are really talking about 
block-layer-level-data now.

You would either need to communicate the number of available blocks (but 
which ones?) and let the filesystem calculate unavailable --- or 
communicate the number of unavailable blocks at which point you just do 
this calculation yourself. For each volume you reach a different number 
of "blocks" you need to withhold.

If you needed to make those blocks unavailable, you would now randomly 
(or at the end of the volume, or any other method) need to "unavail" 
those to the filesystem layer beneath (or above).

Every write that filled up more blocks would be communicated to you, 
(since you receive the write or the allocation) and would result in an 
immediate return of "spurious" mutations or an updated number of 
unavailable blocks -- and you can also communicate both.

On every new allocation, the filesystem would be returned blocks that 
you have "fakely" marked as unavailable. All of this only happens if 
available real space becomes less than that of the individual volumes 
(virtual size). The virtual "available" minus the "real available" is 
the number of blocks (extents) you are going to communicate as being 
"not there".

At every mutation from the filesystem, you respond with a like mutation: 
not to the filesystem that did the mutation, but to every other 
filesystem on every other volume.

Space being freed (deallocated) then means a reverse communication to 
all those other filesystems/volumes.

But it would work, if this was possible. This is the entire algorithm.


I'm sorry if this sounds like a lot of "talk" and very little "doing" 
and I am annoyed by that as well. Sorry about that. I wish I could 
actually be active with any of these things.

I am reminded of my father. He was in school to become a car mechanic, 
but he had a scooter accident days before his exam. They 
did the exam with him in a (hospital) bed. He only needed to give 
directions on what needed to be done and someone else did it for him :p.

That's how he passed his exam. It feels the same way for me.

Regards.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2016-05-04 18:16 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1684768750.3193600.1461851163510.JavaMail.yahoo.ref@mail.yahoo.com>
2016-04-28 13:46 ` [linux-lvm] thin handling of available space matthew patton
     [not found] <799090122.6079306.1462373733693.JavaMail.yahoo.ref@mail.yahoo.com>
2016-05-04 14:55 ` matthew patton
2016-05-03 18:19 Xen
     [not found] <1614984310.1700582.1462280490763.JavaMail.yahoo.ref@mail.yahoo.com>
2016-05-03 13:01 ` matthew patton
2016-05-03 15:47   ` Xen
2016-05-04  0:56   ` Mark Mielke
     [not found] <1870050920.5354287.1462276845385.JavaMail.yahoo.ref@mail.yahoo.com>
2016-05-03 12:00 ` matthew patton
2016-05-03 14:38   ` Xen
2016-05-04  1:25   ` Mark Mielke
2016-05-04 18:16     ` Xen
     [not found] <929635034.3140318.1461840230292.JavaMail.yahoo.ref@mail.yahoo.com>
2016-04-28 10:43 ` matthew patton
2016-04-28 18:20   ` Xen
2016-04-28 18:25   ` Xen
2016-04-29 11:23     ` Zdenek Kabelac
2016-05-02 14:32       ` Mark Mielke
2016-05-03  9:45         ` Zdenek Kabelac
2016-05-03 10:41           ` Mark Mielke
2016-05-03 11:18             ` Zdenek Kabelac
2016-05-03 10:15         ` Gionatan Danti
2016-05-03 11:42           ` Zdenek Kabelac
2016-05-03 13:15             ` Gionatan Danti
2016-05-03 15:45               ` Zdenek Kabelac
2016-05-03 12:42       ` Xen
     [not found] <518072682.2617983.1461760017772.JavaMail.yahoo.ref@mail.yahoo.com>
2016-04-27 12:26 ` matthew patton
2016-04-27 21:28   ` Xen
2016-04-28  6:46     ` Marek Podmaka
2016-04-28 10:33       ` Xen
  -- strict thread matches above, loose matches on Subject: below --
2016-04-23 17:53 Xen
2016-04-27 12:01 ` Xen
